Oh, and you DEFINITELY need more RAM!

-----Original Message-----
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 11:20 AM
To: Neo4j user discussions
Subject: [Neo4j] Stumped by performance issue in traversal - would take a
month to run!

Hi, I have an algorithm running on my little server that is very very slow.
It's a recommendation traversal (for all A and B in the catalog of items:
for each item A, how many customers also purchased another item in the
catalog B). It's processed 90 items in about 8 hours so far! Before I dive
deeper into trying to figure out the performance problem, I thought I'd
email the list to see if more experienced people have ideas.

Some characteristics of my datastore: its size is pretty moderate for a
database application. 7500 items; I'm not sure how many customers and
purchases (how can I find the size of an index?), but probably ~1 million
customers. The relationshipstore + nodestore are < 500mb. (The propertystore
is huge, but I don't access it much in traversals.)

The possibilities I see are:

1) *Neo4J is just slow.* Probably not slower than Postgres which I was using
previously, but maybe I need to switch to a distributed map-reduce db in the
cloud and give up the very nice graph modeling approach? I didn't think this
would be a problem, because my data size is pretty moderate and Neo4J is
supposed to be fast.

2) *I just need more RAM.* I definitely need more RAM - I have a measly 1GB
currently. But would this get my 20day traversal down to a few hours?
Doesn't seem like it'd have THAT much impact. I'm running Linux and nothing
much else besides Neo4j, so I've got 650m physical RAM. Using 300m heap,
about 300m memory-map.
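
(For reference, the memory-map split is set per store file in
neo4j.properties - these are the 1.x-era setting names, and the sizes below
are just illustrative of how I'd divide ~300m, not tuned values:)

```
# neo4j.properties (Neo4j 1.x-era settings; sizes illustrative only)
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=200M
neostore.propertystore.db.mapped_memory=50M
```

Since my traversals mostly touch nodes and relationships, weighting the
relationshipstore seems like the obvious move, but maybe someone knows better.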

3) *There's some secret about Neo4J performance I don't know.* Is there
something I'm unaware that Neo4J is doing? When I access a property, does it
load a chunk of properties I don't care about? For the current node/edge or
others? I turned off log rotation and I commit after each item A. Are there
other performance tips I might have missed?

4) *My algorithm is inefficient.* It's a fairly naive algorithm and maybe
there's some optimizations I can do. It looks like:

> For each item A in the catalog:
>   For each customer C that has purchased that item:
>     For each item B that customer purchased:
>       Update the co-occurrence edge between A&B.
>       (If the edge exists, add one to its weight. If it doesn't exist,
>       create it with weight one.)
>
This is O(n^2) worst case, but practically it'll be much better due to the
sparseness of purchases. The large number of customers slows it down,
though. The slowest part, I suspect, is the last line. It's a lot of finding
and re-finding edges between As and Bs and updating the edge properties. I
don't see much way around it, though. I wrote another version that avoids
this but is always O(n^2), and it takes about 15 minutes per A to check
against all B (which would also take a month). The version above seems to be
averaging 3 customers/sec, which doesn't seem that slow until you realize
that some of these items were purchased by thousands of customers.
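
To make the "finding and re-finding edges" cost concrete, here's a sketch of
the loop in Python, with plain dicts standing in for the graph (the toy data
and names are made up). The one change from my naive version: accumulate the
counts for item A in an in-memory map first, so each A-B edge only gets
looked up and written once per A, instead of once per shared customer:

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): item -> set of customers who bought it.
purchases_by_item = {
    "A": {"c1", "c2"},
    "B": {"c1"},
    "C": {"c2"},
}

# Reverse index: customer -> set of items they bought.
items_by_customer = defaultdict(set)
for item, customers in purchases_by_item.items():
    for c in customers:
        items_by_customer[c].add(item)

def cooccurrence(item_a):
    """For each other item B, count customers who bought both A and B."""
    counts = Counter()
    for customer in purchases_by_item[item_a]:
        for item_b in items_by_customer[customer]:
            if item_b != item_a:
                counts[item_b] += 1
    # Only now touch the graph: create/update each A-B edge once,
    # setting its weight from counts[item_b].
    return counts

print(cooccurrence("A"))  # Counter({'B': 1, 'C': 1})
```

In the graph version the Counter would be keyed by B's node id, and the final
write-back would do the edge lookup once per distinct B rather than per
customer - but maybe that's exactly the kind of batching I should be doing
and am not?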

I'd hate to give up on Neo4J. I really like the graph database concept. But
can it handle data at this scale? I hope someone sees something I'm doing
wrong.

Thanks,
Jeff Klann
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

