I don't think that's the problem. Here's why... When I was importing my data, it eventually slowed down to a crawl (though it was pretty fast at first). Someone pointed out that since I was trying to do it all in one transaction, it was filling the Java heap too much. They suggested I commit after every 40,000 node/edge creations (that's empirically when the slowdown happened). I did, and then the import zipped along just fine.
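[Editorial note: the periodic-commit pattern described above can be sketched as below. This is a simplified illustration, not the actual import code from the thread - the Neo4j write and transaction calls are elided behind comments and a `Runnable`, and the batch size of 40,000 comes from the anecdote above.]

```java
// Sketch of the periodic-commit pattern: instead of one huge transaction,
// commit every BATCH_SIZE writes so the pending transaction state never
// outgrows the heap. A Runnable stands in for "commit and begin a new
// transaction"; the graph writes themselves are elided.
public class BatchCommit {
    static final int BATCH_SIZE = 40_000; // empirically where the slowdown hit

    /** Runs totalWrites writes, committing in batches; returns the commit count. */
    static int runImport(int totalWrites, Runnable commitAndRestart) {
        int pending = 0, commits = 0;
        for (int i = 0; i < totalWrites; i++) {
            // ... createNode() / createRelationship() / setProperty() here ...
            pending++;
            if (pending >= BATCH_SIZE) {
                commitAndRestart.run(); // e.g. close the tx and open a fresh one
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) { commitAndRestart.run(); commits++; } // final partial batch
        return commits;
    }

    public static void main(String[] args) {
        // 100,000 writes at a 40,000 batch size -> 3 commits (2 full + 1 partial)
        System.out.println("commits=" + runImport(100_000, () -> {}));
    }
}
```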
I'm only committing after the outer pass through an item, which is again only after tens of thousands of writes/property updates. Hmm, that makes me wonder if it's possible I'm not committing often enough. Well, when I have more memory we'll see how it does.

- Jeff Klann

p.s. The simple counter in my last post isn't using transactions at all.

On Wed, Jul 28, 2010 at 5:48 PM, Rick Bullotta <
rick.bullo...@burningskysoftware.com> wrote:

> Hi, Jeff.
>
> If you are committing after each item, it definitely will slow down
> performance. Start a single transaction, commit when you're done with the
> entire traversal, and report back the results. You will still "see" the
> changes you've made prior to committing the transaction, as long as you're
> on the same execution thread.
>
> Rick
>
> -----Original Message-----
> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
> On Behalf Of Jeff Klann
> Sent: Wednesday, July 28, 2010 5:43 PM
> To: Neo4j user discussions
> Subject: Re: [Neo4j] Stumped by performance issue in traversal - would
> take a month to run!
>
> Thank you both for your responses.
>
> - I will get some more RAM tomorrow and give Neo4j another shot.
> Hopefully that's a huge factor.
> - Tim, I like your algorithm trick! It would save a lot of
> reading/writing, but it would definitely require more memory due to the
> massive increase in the number of edges.
> - Transactions are not the issue, unless reading AFTER committing a
> transaction is somehow slower? I'm only committing after each of 7,000
> items, and like I said, it took 8 hours to run through 90-some items...
> committing is not where the time is being spent.
>
> To gauge the performance problem, I wanted to see how many customers are
> purchasing each item, and I'm concerned that even this query is taking a
> really long time. It's simple:
>
> > For each item A
> > Count the number of relationships to a customer
>
> It took 15 minutes to do 200 items.
> That's almost 5 seconds an item just to count the number of customers who
> purchased it! (Looks like on average about 5,000 customers each, ranging
> from 300 to 200,000.) That's a NINE-HOUR query! Considering that Neo4j
> advertises it can traverse 1M relationships/sec on "commodity hardware",
> I would expect this to be much faster. (Even if it were 50k customers per
> item, that'd be 7,000 items * 50,000 customers / 1M traversals = 350
> seconds. Six minutes would be much more reasonable.)
>
> My "commodity hardware" will have a lot more memory tomorrow; hopefully
> that'll solve these problems!
>
> Thanks,
> Jeff Klann
>
> p.s. My propertystore is big because I was naive on import and stored
> everything as string properties (this will change). How does that affect
> performance?
>
> On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <bogol...@ymail.com> wrote:
>
> > I can't give too much help on this unfortunately, but as far as
> > possibility 1) goes, my database contains around 8 million nodes, and I
> > traverse them in about 15 seconds for retrievals. It's 2.8GB on disk,
> > and the machine has 4GB of RAM. I allocate a 1GB heap to the JDK.
> >
> > Inserts take a little longer because of the approach I use - inserting
> > 200K nodes now takes a few minutes. I then have a separate step to
> > remove duplicates that takes about 10-15 minutes.
> >
> > It seems to me that you might be better off doing something similar:
> > creating a new relationship PURCHASED_BOTH with an attribute
> > 'count: 1', and always adding this relationship between products in
> > catalogues A and B.
> >
> > Then run a post-processing job that retrieves all PURCHASED_BOTH
> > relationships for each product in catalogue A, building an in-memory
> > map so you only keep one of these relationships and updating that
> > relationship's 'count' attribute in memory. Delete the duplicates and
> > commit.
> > This way you get your desired result in two passes instead of doing
> > everything in one go.
> >
> > It seems a bit of a fiddle and I can't guarantee it'll improve
> > performance (just to stress - I may be waaay off the mark here, but it
> > works for me). I think it will, though, because it'll mean that your
> > loop only has to create relationships instead of performing updates.
> > Oh, and make sure that you aren't performing one operation per
> > transaction - you could group together several tens of thousands
> > before committing (I do 50,000 inserts before committing when I'm
> > running this post-processing operation, and it's fine).
> >
> > Tim
> >
> > ----- Original Message ----
> > > From: Jeff Klann <jkl...@iupui.edu>
> > > To: Neo4j user discussions <user@lists.neo4j.org>
> > > Sent: Wed, July 28, 2010 4:20:28 PM
> > > Subject: [Neo4j] Stumped by performance issue in traversal - would
> > > take a month to run!
> > >
> > > Hi, I have an algorithm running on my little server that is very,
> > > very slow. It's a recommendation traversal (for all A and B in the
> > > catalog of items: for each item A, how many customers also purchased
> > > another item B in the catalog). It's processed 90 items in about 8
> > > hours so far! Before I dive deeper into trying to figure out the
> > > performance problem, I thought I'd email the list to see if more
> > > experienced people have ideas.
> > >
> > > Some characteristics of my datastore: its size is pretty moderate
> > > for a database application. 7,500 items; not sure how many customers
> > > and purchases (how can I find the size of an index?), but probably
> > > ~1 million customers. The relationshipstore + nodestore < 500MB.
> > > (The propertystore is huge, but I don't access it much in
> > > traversals.)
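[Editorial note: Tim's two-pass suggestion quoted above - create cheap duplicate PURCHASED_BOTH relationships with count=1 first, then fold them into one counted relationship per pair in memory - can be sketched roughly like this. The relationship model here is illustrative (plain target-id strings); the actual Neo4j retrieval, delete, and commit calls are elided.]

```java
import java.util.*;

// Sketch of the post-processing fold: given many duplicate PURCHASED_BOTH
// relationships out of one product (each with count=1), sum them into one
// count per target product in memory. One relationship per pair would then
// be kept with the summed count and the duplicates deleted.
public class FoldDuplicates {
    /** targetIds: one entry per duplicate PURCHASED_BOTH relationship. */
    static Map<String, Integer> foldCounts(List<String> targetIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (String target : targetIds) {
            counts.merge(target, 1, Integer::sum); // sum the count=1 duplicates
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three duplicates pointing at B1 collapse into a single count of 3.
        List<String> rels = List.of("B1", "B2", "B1", "B3", "B1");
        System.out.println(foldCounts(rels));
    }
}
```

The point of the design, per Tim's email, is that the hot loop only ever creates relationships (cheap, append-only) and all the read-modify-write work moves into one batch pass.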
> > >
> > > The possibilities I see are:
> > >
> > > 1) *Neo4j is just slow.* Probably not slower than Postgres, which I
> > > was using previously, but maybe I need to switch to a distributed
> > > map-reduce db in the cloud and give up the very nice graph modeling
> > > approach? I didn't think this would be a problem, because my data
> > > size is pretty moderate and Neo4j is supposed to be fast.
> > >
> > > 2) *I just need more RAM.* I definitely need more RAM - I have a
> > > measly 1GB currently. But would this get my 20-day traversal down to
> > > a few hours? Doesn't seem like it'd have THAT much impact. I'm
> > > running Linux and not much else besides Neo4j, so I've got 650MB of
> > > physical RAM, using a 300MB heap and about 300MB of memory-mapping.
> > >
> > > 3) *There's some secret about Neo4j performance I don't know.* Is
> > > there something I'm unaware of that Neo4j is doing? When I access a
> > > property, does it load a chunk of properties I don't care about? For
> > > the current node/edge, or others? I turned off log rotation, and I
> > > commit after each item A. Are there other performance tips I might
> > > have missed?
> > >
> > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm, and
> > > maybe there are some optimizations I can do. It looks like:
> > >
> > > > For each item A in the catalog:
> > > >   For each customer C that has purchased that item:
> > > >     For each item B that customer purchased:
> > > >       Update the co-occurrence edge between A & B.
> > > >       (If the edge exists, add one to its weight. If it doesn't
> > > >       exist, create it with weight one.)
> > >
> > > This is O(n^2) worst case, but practically it'll be much better due
> > > to the sparseness of purchases. The large number of customers slows
> > > it down, though. The slowest part, I suspect, is the last line: it's
> > > a lot of finding and re-finding edges between As and Bs and updating
> > > the edge properties.
> > I > > > don't see much way around it, though. I wrote another version that > > avoids > > > this but is always O(n^2), and it takes about 15 minutes per A to > check > > > against all B (which would also take a month). The version above seems > > to be > > > averaging 3 customers/sec, which doesn't seem that slow until you > > realize > > > that some of these items were purchased by thousands of customers. > > > > > > I'd hate to give up on Neo4J. I really like the graph database > concept. > > But > > > can it handle data? I hope someone sees something I'm doing wrong. > > > > > > Thanks, > > > Jeff Klann > > > _______________________________________________ > > > Neo4j mailing list > > > User@lists.neo4j.org > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > > > > > _______________________________________________ > > Neo4j mailing list > > User@lists.neo4j.org > > https://lists.neo4j.org/mailman/listinfo/user > > > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user