I can't give too much help on this unfortunately, but as far as possibility 1) goes, my database contains around 8 million nodes, and I traverse them in about 15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of RAM. I allocate a 1GB heap to the JDK.
Inserts take a little longer because of the approach I use - inserting 200K nodes now takes a few minutes. I then have a separate step to remove duplicates that takes about 10-15 minutes. It seems to me that you might be better off doing something similar: creating a new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add this relationship between products in catalogues A and B. Then run a post-processing job that retrieves all PURCHASED_BOTH relationships for each product in catalogue A, and build an in-memory map so you only keep one of these relationships, and update the 'count' attribute in memory for that relationship. Delete the duplicates and commit. This way to get your desired result in 2 passes instead of doing everything in one go. It seems a bit of a fiddle and I can't guarantee it'll improve performance (just to stress - I may be waaay off the mark here, but it works for me). I think it will though because it'll mean that your loop only has to create relationships instead of performing updates. Oh, and make sure that you aren't performing one operation per transaction - you could group together several tens of thousands before committing (I do 50,000 inserts before committing when I'm running this post-processing operation, and it's fine). Tim ----- Original Message ---- > From: Jeff Klann <jkl...@iupui.edu> > To: Neo4j user discussions <user@lists.neo4j.org> > Sent: Wed, July 28, 2010 4:20:28 PM > Subject: [Neo4j] Stumped by performance issue in traversal - would take a > month >to run! > > Hi, I have an algorithm running on my little server that is very very slow. > It's a recommendation traversal (for all A and B in the catalog of items: > for each item A, how many customers also purchased another item in the > catalog B). It's processed 90 items in about 8 hours so far! Before I dive > deeper into trying to figure out the performance problem, I thought I'd > email the list to see if more experienced people have ideas. > > Some characteristics of my datastore: it's size is pretty moderate for a > database application. 7500 items, not sure how many customers and purchases > (how can I find the size of an index?) but probably ~1 million customers. > The relationshipstore + nodestore < 500mb. (Propertystore is huge but I > don't access it much in traversals.) > > The possibilities I see are: > > 1) *Neo4J is just slow.* Probably not slower than Postgres which I was using > previously, but maybe I need to switch to a distributed map-reduce db in the > cloud and give up the very nice graph modeling approach? I didn't think this > would be a problem, because my data size is pretty moderate and Neo4J is > supposed to be fast. > > 2) *I just need more RAM.* I definitely need more RAM - I have a measly 1GB > currently. But would this get my 20day traversal down to a few hours? > Doesn't seem like it'd have THAT much impact. I'm running Linux and nothing > much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap, > about 300m memory-map. > > 3) *There's some secret about Neo4J performance I don't know.* Is there > something I'm unaware that Neo4J is doing? When I access a property, does it > load a chunk of properties I don't care about? For the current node/edge or > others? I turned off log rotation and I commit after each item A. Are there > other performance tips I might have missed? > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe > there's some optimizations I can do. It looks like: > > > For each item A in the catalog: > > For each customer C that has purchased that item: > > For each item B that customer purchased: > > Update the co-occurrence edge between A&B. > > > (If the edge exists, add one to its weight. If it doesn't exist, > > create it with weight one.) > > > This is O(n^2) worst case, but practically it'll be much better due to the > sparseness of purchases. The large number of customers slows it down, > though. The slowest part, I suspect, is the last line. It's a lot of finding > and re-finding edges between As and Bs and updating the edge properties. I > don't see much way around it, though. I wrote another version that avoids > this but is always O(n^2), and it takes about 15 minutes per A to check > against all B (which would also take a month). The version above seems to be > averaging 3 customers/sec, which doesn't seem that slow until you realize > that some of these items were purchased by thousands of customers. > > I'd hate to give up on Neo4J. I really like the graph database concept. But > can it handle data? I hope someone sees something I'm doing wrong. > > Thanks, > Jeff Klann > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user