Oh, and you DEFINITELY need more RAM!

-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 11:20 AM
To: Neo4j user discussions
Subject: [Neo4j] Stumped by performance issue in traversal - would take a month to run!
Hi, I have an algorithm running on my little server that is very, very slow. It's a recommendation traversal (for all items A and B in the catalog: for each item A, how many customers also purchased another item B?). It's processed 90 items in about 8 hours so far! Before I dive deeper into trying to figure out the performance problem, I thought I'd email the list to see if more experienced people have ideas.

Some characteristics of my datastore: its size is pretty moderate for a database application. 7,500 items; I'm not sure how many customers and purchases (how can I find the size of an index?), but probably ~1 million customers. The relationship store + node store is < 500 MB. (The property store is huge, but I don't access it much in traversals.)

The possibilities I see are:

1) *Neo4j is just slow.* Probably not slower than Postgres, which I was using previously, but maybe I need to switch to a distributed map-reduce DB in the cloud and give up the very nice graph modeling approach? I didn't think this would be a problem, because my data size is pretty moderate and Neo4j is supposed to be fast.

2) *I just need more RAM.* I definitely need more RAM - I have a measly 1 GB currently. But would this get my 20-day traversal down to a few hours? It doesn't seem like it'd have THAT much impact. I'm running Linux with not much besides Neo4j, so I've got 650 MB of physical RAM available: about 300 MB of heap and about 300 MB memory-mapped.

3) *There's some secret about Neo4j performance I don't know.* Is Neo4j doing something I'm unaware of? When I access a property, does it load a chunk of properties I don't care about? For the current node/edge, or for others too? I turned off log rotation, and I commit after each item A. Are there other performance tips I might have missed?

4) *My algorithm is inefficient.* It's a fairly naive algorithm, and maybe there are some optimizations I can do.
It looks like:

    For each item A in the catalog:
        For each customer C that has purchased that item:
            For each item B that customer purchased:
                Update the co-occurrence edge between A & B.
                (If the edge exists, add one to its weight.
                 If it doesn't exist, create it with weight one.)

This is O(n^2) worst case, but in practice it'll be much better due to the sparseness of purchases. The large number of customers slows it down, though.

The slowest part, I suspect, is the last line: it's a lot of finding and re-finding edges between As and Bs and updating the edge properties. I don't see much way around it, though. I wrote another version that avoids this but is always O(n^2), and it takes about 15 minutes per A to check against all B (which would also take a month). The version above seems to be averaging 3 customers/sec, which doesn't seem that slow until you realize that some of these items were purchased by thousands of customers.

I'd hate to give up on Neo4j. I really like the graph database concept. But can it handle data? I hope someone sees something I'm doing wrong.

Thanks,
Jeff Klann
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
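[A sketch of the idea, not Jeff's actual Neo4j code: the pseudocode above can be restructured so that the per-customer "find and re-find the A-B edge" step disappears. Accumulate all the B-counts for a given A in an in-memory map, then write each co-occurrence edge exactly once per (A, B) pair. The Python below uses plain dicts in place of the graph store; `purchases` and the function name are illustrative assumptions.]

```python
from collections import Counter

def cooccurrence_counts(purchases):
    """For each item A, count how many customers who bought A also
    bought each other item B.

    `purchases` maps customer -> set of purchased items. Returns a
    dict mapping item A -> Counter of {item B: co-purchase weight}.
    """
    # Invert the purchase data: item -> set of customers who bought it.
    buyers = {}
    for customer, items in purchases.items():
        for item in items:
            buyers.setdefault(item, set()).add(customer)

    cooc = {}
    for a, customers in buyers.items():
        # Accumulate every B-count for this A in memory first, so each
        # co-occurrence "edge" is written once per (A, B) pair instead
        # of being looked up and updated once per customer.
        counts = Counter()
        for c in customers:
            for b in purchases[c]:
                if b != a:
                    counts[b] += 1
        cooc[a] = counts
    return cooc

# Toy data: three customers, three items.
purchases = {
    "alice": {"book", "lamp"},
    "bob":   {"book", "lamp", "mug"},
    "carol": {"book", "mug"},
}
result = cooccurrence_counts(purchases)
print(result["book"])  # lamp and mug each co-purchased with book by 2 customers
```

[Mapped back to the graph version, this would mean one edge lookup/creation per distinct B for each A, rather than one per customer-item pair - which is where the 3 customers/sec seems to be going.]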