I don't think that's the problem. Here's why... When I was importing my data, it eventually slowed down to a crawl (though it was pretty fast at first). Someone pointed out that since I was trying to do it all in one transaction, it was filling the Java heap too much. They suggested I commit after every 40,000 node/edge creations (that's empirically when the slowdown happened). I did, and then the import zipped along just fine.
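[Editorial note: the periodic-commit pattern described above can be sketched as below. This is a simplified illustration, not the actual import code from the thread - the Neo4j write and transaction calls are elided behind comments and a `Runnable`, and the batch size of 40,000 comes from the anecdote above.]

```java
// Sketch of the periodic-commit pattern: instead of one huge transaction,
// commit every BATCH_SIZE writes so the pending transaction state never
// outgrows the heap. A Runnable stands in for "commit and begin a new
// transaction"; the graph writes themselves are elided.
public class BatchCommit {
    static final int BATCH_SIZE = 40_000; // empirically where the slowdown hit

    /** Runs totalWrites writes, committing in batches; returns the commit count. */
    static int runImport(int totalWrites, Runnable commitAndRestart) {
        int pending = 0, commits = 0;
        for (int i = 0; i < totalWrites; i++) {
            // ... createNode() / createRelationship() / setProperty() here ...
            pending++;
            if (pending >= BATCH_SIZE) {
                commitAndRestart.run(); // e.g. close the tx and open a fresh one
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) { commitAndRestart.run(); commits++; } // final partial batch
        return commits;
    }

    public static void main(String[] args) {
        // 100,000 writes at a 40,000 batch size -> 3 commits (2 full + 1 partial)
        System.out.println("commits=" + runImport(100_000, () -> {}));
    }
}
```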
I'm only committing after the outer pass through an item, which is again only after tens of thousands of writes/property updates. Hmm, that makes me wonder if it's possible I'm not committing often enough. Well, when I have more memory we'll see how it does.

- Jeff Klann

p.s. The simple counter in my last post isn't using transactions at all.

On Wed, Jul 28, 2010 at 5:48 PM, Rick Bullotta <
rick.bullo...@burningskysoftware.com> wrote:

> Hi, Jeff.
>
> If you are committing after each item, it definitely will slow down
> performance. Start a single transaction, commit when you're done with the
> entire traversal, and report back the results. You will still "see" the
> changes you've made prior to committing the transaction, as long as you're
> on the same execution thread.
>
> Rick
>
> -----Original Message-----
> From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org]
> On Behalf Of Jeff Klann
> Sent: Wednesday, July 28, 2010 5:43 PM
> To: Neo4j user discussions
> Subject: Re: [Neo4j] Stumped by performance issue in traversal - would
> take a month to run!
>
> Thank you both for your responses.
>
> - I will get some more RAM tomorrow and give Neo4j another shot.
> Hopefully that's a huge factor.
> - Tim, I like your algorithm trick! It would save a lot of
> reading/writing, but it would definitely require more memory due to the
> massive increase in the number of edges.
> - Transactions are not the issue, unless reading AFTER committing a
> transaction is somehow slower? I'm only committing after each of 7,000
> items, and like I said, it took 8 hours to run through 90-some items...
> committing is not where the time is being spent.
>
> To gauge the performance problem, I wanted to see how many customers are
> purchasing each item, and I'm concerned that even this query is taking a
> really long time. It's simple:
>
> > For each item A
> > Count the number of relationships to a customer
>
> It took 15 minutes to do 200 items.
> That's almost 5 seconds an item just to count the number of customers who
> purchased it! (Looks like on average about 5,000 customers each, ranging
> from 300 to 200,000.) That's a NINE-HOUR query! Considering that Neo4j
> advertises it can traverse 1M relationships/sec on "commodity hardware",
> I would expect this to be much faster. (Even if it were 50k customers per
> item, that'd be 7,000 items * 50,000 customers / 1M traversals = 350
> seconds. Six minutes would be much more reasonable.)
>
> My "commodity hardware" will have a lot more memory tomorrow; hopefully
> that'll solve these problems!
>
> Thanks,
> Jeff Klann
>
> p.s. My propertystore is big because I was naive on import and stored
> everything as string properties (this will change). How does that affect
> performance?
>
> On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones <bogol...@ymail.com> wrote:
>
> > I can't give too much help on this unfortunately, but as far as
> > possibility 1) goes, my database contains around 8 million nodes, and I
> > traverse them in about 15 seconds for retrievals. It's 2.8GB on disk,
> > and the machine has 4GB of RAM. I allocate a 1GB heap to the JDK.
> >
> > Inserts take a little longer because of the approach I use - inserting
> > 200K nodes now takes a few minutes. I then have a separate step to
> > remove duplicates that takes about 10-15 minutes.
> >
> > It seems to me that you might be better off doing something similar:
> > creating a new relationship PURCHASED_BOTH with an attribute
> > 'count: 1', and always adding this relationship between products in
> > catalogues A and B.
> >
> > Then run a post-processing job that retrieves all PURCHASED_BOTH
> > relationships for each product in catalogue A, building an in-memory
> > map so you only keep one of these relationships and updating that
> > relationship's 'count' attribute in memory. Delete the duplicates and
> > commit.
> > This way you get your desired result in two passes instead of doing
> > everything in one go.
> >
> > It seems a bit of a fiddle and I can't guarantee it'll improve
> > performance (just to stress - I may be waaay off the mark here, but it
> > works for me). I think it will, though, because it'll mean that your
> > loop only has to create relationships instead of performing updates.
> > Oh, and make sure that you aren't performing one operation per
> > transaction - you could group together several tens of thousands
> > before committing (I do 50,000 inserts before committing when I'm
> > running this post-processing operation, and it's fine).
> >
> > Tim
> >
> > ----- Original Message ----
> > > From: Jeff Klann <jkl...@iupui.edu>
> > > To: Neo4j user discussions <user@lists.neo4j.org>
> > > Sent: Wed, July 28, 2010 4:20:28 PM
> > > Subject: [Neo4j] Stumped by performance issue in traversal - would
> > > take a month to run!
> > >
> > > Hi, I have an algorithm running on my little server that is very,
> > > very slow. It's a recommendation traversal (for all A and B in the
> > > catalog of items: for each item A, how many customers also purchased
> > > another item B in the catalog). It's processed 90 items in about 8
> > > hours so far! Before I dive deeper into trying to figure out the
> > > performance problem, I thought I'd email the list to see if more
> > > experienced people have ideas.
> > >
> > > Some characteristics of my datastore: its size is pretty moderate
> > > for a database application. 7,500 items; not sure how many customers
> > > and purchases (how can I find the size of an index?), but probably
> > > ~1 million customers. The relationshipstore + nodestore < 500MB.
> > > (The propertystore is huge, but I don't access it much in
> > > traversals.)
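[Editorial note: Tim's two-pass suggestion quoted above - create cheap duplicate PURCHASED_BOTH relationships with count=1 first, then fold them into one counted relationship per pair in memory - can be sketched roughly like this. The relationship model here is illustrative (plain target-id strings); the actual Neo4j retrieval, delete, and commit calls are elided.]

```java
import java.util.*;

// Sketch of the post-processing fold: given many duplicate PURCHASED_BOTH
// relationships out of one product (each with count=1), sum them into one
// count per target product in memory. One relationship per pair would then
// be kept with the summed count and the duplicates deleted.
public class FoldDuplicates {
    /** targetIds: one entry per duplicate PURCHASED_BOTH relationship. */
    static Map<String, Integer> foldCounts(List<String> targetIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (String target : targetIds) {
            counts.merge(target, 1, Integer::sum); // sum the count=1 duplicates
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three duplicates pointing at B1 collapse into a single count of 3.
        List<String> rels = List.of("B1", "B2", "B1", "B3", "B1");
        System.out.println(foldCounts(rels));
    }
}
```

The point of the design, per Tim's email, is that the hot loop only ever creates relationships (cheap, append-only) and all the read-modify-write work moves into one batch pass.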
> > >
> > > The possibilities I see are:
> > >
> > > 1) *Neo4j is just slow.* Probably not slower than Postgres, which I
> > > was using previously, but maybe I need to switch to a distributed
> > > map-reduce db in the cloud and give up the very nice graph modeling
> > > approach? I didn't think this would be a problem, because my data
> > > size is pretty moderate and Neo4j is supposed to be fast.
> > >
> > > 2) *I just need more RAM.* I definitely need more RAM - I have a
> > > measly 1GB currently. But would this get my 20-day traversal down to
> > > a few hours? Doesn't seem like it'd have THAT much impact. I'm
> > > running Linux and not much else besides Neo4j, so I've got 650MB of
> > > physical RAM, using a 300MB heap and about 300MB of memory-mapping.
> > >
> > > 3) *There's some secret about Neo4j performance I don't know.* Is
> > > there something I'm unaware of that Neo4j is doing? When I access a
> > > property, does it load a chunk of properties I don't care about? For
> > > the current node/edge, or others? I turned off log rotation, and I
> > > commit after each item A. Are there other performance tips I might
> > > have missed?
> > >
> > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm, and
> > > maybe there are some optimizations I can do. It looks like:
> > >
> > > > For each item A in the catalog:
> > > >   For each customer C that has purchased that item:
> > > >     For each item B that customer purchased:
> > > >       Update the co-occurrence edge between A & B.
> > > >       (If the edge exists, add one to its weight. If it doesn't
> > > >       exist, create it with weight one.)
> > >
> > > This is O(n^2) worst case, but practically it'll be much better due
> > > to the sparseness of purchases. The large number of customers slows
> > > it down, though. The slowest part, I suspect, is the last line: it's
> > > a lot of finding and re-finding edges between As and Bs and updating
> > > the edge properties.
> > I > > > don't see much way around it, though. I wrote another version that > > avoids > > > this but is always O(n^2), and it takes about 15 minutes per A to > check > > > against all B (which would also take a month). The version above seems > > to be > > > averaging 3 customers/sec, which doesn't seem that slow until you > > realize > > > that some of these items were purchased by thousands of customers. > > > > > > I'd hate to give up on Neo4J. I really like the graph database > concept. > > But > > > can it handle data? I hope someone sees something I'm doing wrong. > > > > > > Thanks, > > > Jeff Klann > > > _______________________________________________ > > > Neo4j mailing list > > > User@lists.neo4j.org > > > https://lists.neo4j.org/mailman/listinfo/user > > > > > > > > > > > > > _______________________________________________ > > Neo4j mailing list > > User@lists.neo4j.org > > https://lists.neo4j.org/mailman/listinfo/user > > > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > > _______________________________________________ > Neo4j mailing list > User@lists.neo4j.org > https://lists.neo4j.org/mailman/listinfo/user > _______________________________________________ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user