Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Tim Jones Wed, 28 Jul 2010 08:54:22 -0700

I can't give too much help on this unfortunately, but as far as possibility 1) 
goes, my database contains around 8 million nodes, and I traverse them in about 
15 seconds for retrievals. It's 2.8GB on disk, and the machine has 4GB of RAM. 
I 
allocate a 1GB heap to the JDK.


Inserts take a little longer because of the approach I use - inserting 200K 
nodes now takes a few minutes. I then have a separate step to remove duplicates 
that takes about 10-15 minutes.

It seems to me that you might be better off doing something similar: creating a 
new Relationship PURCHASED_BOTH with an attribute 'count: 1' and always add 
this 
relationship between products in catalogues A and B.

Then run a post-processing job that retrieves all PURCHASED_BOTH relationships 
for each product in catalogue A, and build an in-memory map so you only keep 
one 
of these relationships, and update the 'count' attribute in memory for that 
relationship. Delete the duplicates and commit. This way to get your desired 
result in 2 passes instead of doing everything in one go.

It seems a bit of a fiddle and I can't guarantee it'll improve performance 
(just 
to stress - I may be waaay off the mark here, but it works for me). I think it 
will though because it'll mean that your loop only has to create relationships 
instead of performing updates. Oh, and make sure that you aren't performing one 
operation per transaction - you could group together several tens of thousands 
before committing (I do 50,000 inserts before committing when I'm running this 
post-processing operation, and it's fine).

Tim



----- Original Message ----
> From: Jeff Klann <jkl...@iupui.edu>
> To: Neo4j user discussions <user@lists.neo4j.org>
> Sent: Wed, July 28, 2010 4:20:28 PM
> Subject: [Neo4j] Stumped by performance issue in traversal - would take a 
> month 
>to run!
> 
> Hi, I have an algorithm running on my little server that is very very  slow.
> It's a recommendation traversal (for all A and B in the catalog of  items:
> for each item A, how many customers also purchased another item in  the
> catalog B). It's processed 90 items in about 8 hours so far! Before I  dive
> deeper into trying to figure out the performance problem, I thought  I'd
> email the list to see if more experienced people have ideas.
> 
> Some  characteristics of my datastore: it's size is pretty moderate for a
> database  application. 7500 items, not sure how many customers and purchases
> (how can I  find the size of an index?) but probably ~1 million customers.
> The  relationshipstore + nodestore < 500mb. (Propertystore is huge but I
> don't  access it much in traversals.)
> 
> The possibilities I see are:
> 
> 1)  *Neo4J is just slow.* Probably not slower than Postgres which I was  using
> previously, but maybe I need to switch to a distributed map-reduce db  in the
> cloud and give up the very nice graph modeling approach? I didn't  think this
> would be a problem, because my data size is pretty moderate and  Neo4J is
> supposed to be fast.
> 
> 2) *I just need more RAM.* I definitely  need more RAM - I have a measly 1GB
> currently. But would this get my 20day  traversal down to a few hours?
> Doesn't seem like it'd have THAT much impact.  I'm running Linux and nothing
> much else besides Neo4j, so I've got 650m  physical RAM. Using 300m heap,
> about 300m memory-map.
> 
> 3) *There's some  secret about Neo4J performance I don't know.* Is there
> something I'm unaware  that Neo4J is doing? When I access a property, does it
> load a chunk of  properties I don't care about? For the current node/edge or
> others? I turned  off log rotation and I commit after each item A. Are there
> other performance  tips I might have missed?
> 
> 4) *My algorithm is inefficient.* It's a fairly  naive algorithm and maybe
> there's some optimizations I can do. It looks  like:
> 
> > For each item A in the catalog:
> >   For each  customer C that has purchased that item:
> >    For each item B  that customer purchased:
> >       Update the co-occurrence  edge between A&B.
> >
>       (If the edge exists, add  one to its weight. If it doesn't exist,
> > create it with weight  one.)
> >
> This is O(n^2) worst case, but practically it'll be much better  due to the
> sparseness of purchases. The large number of customers slows it  down,
> though. The slowest part, I suspect, is the last line. It's a lot of  finding
> and re-finding edges between As and Bs and updating the edge  properties. I
> don't see much way around it, though. I wrote another version  that avoids
> this but is always O(n^2), and it takes about 15 minutes per A to  check
> against all B (which would also take a month). The version above seems  to be
> averaging 3 customers/sec, which doesn't seem that slow until you  realize
> that some of these items were purchased by thousands of  customers.
> 
> I'd hate to give up on Neo4J. I really like the graph database  concept. But
> can it handle data? I hope someone sees something I'm doing  wrong.
> 
> Thanks,
> Jeff  Klann
> _______________________________________________
> Neo4j mailing  list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
> 


      

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

Reply via email to