Thank you all for your continued interest in helping me. I tweaked the code further to minimize writes to the database, and it now looks like:

For each item A
    For each customer C that purchased A
        For each item B (with id > A) that C purchased
            Increment (in memory) the weight of (A-B)
    Write out the edges [(A-B): weight] to disk and clear the in-memory map
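For concreteness, here is a minimal sketch of that counting pass in plain Python. The dict-based purchase data, integer item ids, and yielding of edges are all hypothetical stand-ins for the Neo4j nodes, traversals, and batched edge writes; only the loop structure is taken from the pseudocode above.

```python
from collections import defaultdict

# Hypothetical stand-in data: item id -> customers who bought it,
# and customer id -> items that customer bought.
buyers = {1: {"c1", "c2"}, 2: {"c1"}, 3: {"c2", "c3"}}
purchases = {"c1": {1, 2}, "c2": {1, 3}, "c3": {3}}

def co_occurrence_edges(buyers, purchases):
    """Yield ((A, B), weight) pairs, flushing the map after each item A."""
    for a in sorted(buyers):
        weights = defaultdict(int)   # in-memory map, scoped to one item A
        for c in buyers[a]:
            for b in purchases[c]:
                if b > a:            # id > A: count each unordered pair once
                    weights[(a, b)] += 1
        # "write out the edges and clear the map" -- here we just yield them,
        # where the real code would write relationships to disk
        yield from weights.items()

edges = dict(co_occurrence_edges(buyers, purchases))
```

The `id > A` guard is what removes the edge-existence check: each unordered pair (A, B) is produced by exactly one outer iteration, so every edge is written exactly once rather than looked up and incremented repeatedly.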
This actually (if I'm not mistaken) covers all relationships and does 7500 items in about 45 minutes! Not too bad, especially since (I think) it avoids edge-checking, and I think it's usable for my application, though it's still ~200k traversals/sec on average, which is a few times slower than I hoped. I doubt that's much faster than a two-table join in SQL, though deeper traversals should show benefits.

- David, thank you for your answers on traversers vs. getRelationships and on property loading. I imported some properties I don't really need; perhaps deleting them will speed things up? Also, I'm using the old Node.traverse(). How is the new framework better? I expect it has nicer syntax, which I'd like to try, but does it improve performance too?

- David, on checking relationships: I said that checking 15 nodes for relationships to n other nodes (where n might be large; I'm not sure how large, but << 7500) takes 71s. The nodes form a highly connected graph, with edges also going out to customers. So the maximum number of edges for a node could be very high, up to around 7500 items and 300,000 customers.

- Martin, I'm a bit confused about SSDs. I read up on them after your post. You said flash drives are best, but I read that even the highest-performing flash drives read at about 30MB/s, whereas modern hard drives are at least 50MB/s. True SSDs claim 50MB/s too, but they're quite expensive. So why is a flash drive best? I could definitely spring for one big enough to hold my db if it would help a lot, but it has that slower read speed. Does the faster seek time really make that much of a difference? Any brands you'd recommend?

I will post some code snippets. It looks like there are a lot of sites for sharing code snippets; any recommendations?
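On the SSD question, the usual intuition is that graph traversal is dominated by small random reads, not sequential throughput, so seek latency matters far more than MB/s. A rough back-of-envelope comparison (the latency figures and the cache-miss count below are illustrative assumptions, not measurements of any particular drive or workload):

```python
# Illustrative assumptions, not measurements: a 7200rpm disk needs roughly
# 8 ms per random seek, while a flash device serves a small random read in
# roughly 0.1 ms. Assume each cache-missing traversal step costs one read.
HDD_SEEK_S = 0.008
FLASH_READ_S = 0.0001
steps = 1_000_000  # hypothetical traversal steps that miss the page cache

hdd_time = steps * HDD_SEEK_S      # ~8000 s: sequential MB/s never enters in
flash_time = steps * FLASH_READ_S  # ~100 s for the same workload

speedup = hdd_time / flash_time    # ~80x, from latency alone
```

Under those assumptions the flash device wins by nearly two orders of magnitude despite its lower sequential read rate, which is why seek-bound workloads like random graph traversal are usually steered toward flash.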
Thanks all,
Jeff Klann

On Mon, Aug 2, 2010 at 8:44 AM, David Montag <david.mon...@neotechnology.com> wrote:

> Hi Jeff,
>
> If I'm not mistaken, Neo4j loads all properties for a node or relationship
> when you invoke any operation that touches a property. As for the
> performance of traversals, it is highly dependent on how deep you traverse,
> and what you do during the traversal, so ymmv.
>
> Using a traverser is slower than doing getRelationships, as the traverser
> does extra processing to keep state around. Are you using Node#traverse()
> or the new traversal framework? Is your code available somewhere?
>
> Are you saying that checking whether there's a relationship between A and B
> takes over 20s? How many relationships do A and B have? What does your neo
> config look like (params)? Edge indexing might be a solution; you can look
> at the new indexing component for that.
> (https://svn.neo4j.org/laboratory/components/lucene-index/)
>
> As for the incrementing of a property - while you're within a transaction,
> couldn't you increment a variable and then write that variable at the end
> of the transaction?
>
> David
>
> On Fri, Jul 30, 2010 at 8:10 PM, Jeff Klann <jkl...@iupui.edu> wrote:
>
> > Hi, so I got 2GB more RAM and noticed that after adding some more memory
> > map and increasing the heap space, my small query went from 6hrs to
> > 3min. Quite reasonable!
> >
> > But the larger one that would take a month would still take a month. So
> > I've been performance testing parts of it:
> >
> > The algorithm as in my first post showed *no* performance improvement on
> > more RAM. But individual parts...
> > - Traversing only (first three lines) was much speedier, but still seems
> > slow. 1.5 million traversals (15 out of 7000 items) took 23sec. It
> > shaves off a few seconds if I run this twice and time it the second
> > time, or if I don't print any node properties as I traverse.
> > (Does Neo4J load ALL the properties for a node if one is accessed?)
> > Even with a double run and not reading node properties, it still takes
> > 16sec, which would make traversal take two hours. I thought Neo4J was
> > supposed to do ~1m traversals/sec; this is doing about 100k. Why? (And
> > in fact on the other query it was getting about 800,000 traversals/sec.)
> > Is one of Traversers vs. getRelationship iterators faster when getting
> > all relationships of a type at depth 1?
> > - Searching for relationships between A & B (but not writing to them)
> > takes it from 20s to 91s. Yuck. Maybe edge indexing is the way to avoid
> > that?
> > - Incrementing a property on the root node for every A & B takes it
> > from 20s to 61s (57s if it's all in one transaction). THAT seems weird.
> > I imagine it has something to do with logging changes? Any way that can
> > be turned off for a particular property (like it could be marked
> > 'volatile' during a transaction or something)?
> >
> > I'm much more hopeful with the extra RAM but it's still kind of slow.
> > Suggestions?
> >
> > Thanks,
> > Jeff Klann
> >
> > On Wed, Jul 28, 2010 at 11:20 AM, Jeff Klann <jkl...@iupui.edu> wrote:
> >
> > > Hi, I have an algorithm running on my little server that is very very
> > > slow. It's a recommendation traversal (for all A and B in the catalog
> > > of items: for each item A, how many customers also purchased another
> > > item in the catalog B). It's processed 90 items in about 8 hours so
> > > far! Before I dive deeper into trying to figure out the performance
> > > problem, I thought I'd email the list to see if more experienced
> > > people have ideas.
> > >
> > > Some characteristics of my datastore: its size is pretty moderate for
> > > a database application. 7500 items; not sure how many customers and
> > > purchases (how can I find the size of an index?) but probably ~1
> > > million customers. The relationshipstore + nodestore < 500mb.
> > > (Propertystore is huge, but I don't access it much in traversals.)
> > >
> > > The possibilities I see are:
> > >
> > > 1) *Neo4J is just slow.* Probably not slower than Postgres, which I
> > > was using previously, but maybe I need to switch to a distributed
> > > map-reduce db in the cloud and give up the very nice graph modeling
> > > approach? I didn't think this would be a problem, because my data size
> > > is pretty moderate and Neo4J is supposed to be fast.
> > >
> > > 2) *I just need more RAM.* I definitely need more RAM - I have a
> > > measly 1GB currently. But would this get my 20-day traversal down to a
> > > few hours? Doesn't seem like it'd have THAT much impact. I'm running
> > > Linux and not much else besides Neo4j, so I've got 650m physical RAM.
> > > Using 300m heap, about 300m memory-map.
> > >
> > > 3) *There's some secret about Neo4J performance I don't know.* Is
> > > there something I'm unaware that Neo4J is doing? When I access a
> > > property, does it load a chunk of properties I don't care about? For
> > > the current node/edge or others? I turned off log rotation and I
> > > commit after each item A. Are there other performance tips I might
> > > have missed?
> > >
> > > 4) *My algorithm is inefficient.* It's a fairly naive algorithm and
> > > maybe there are some optimizations I can do. It looks like:
> > >
> > >> For each item A in the catalog:
> > >>     For each customer C that has purchased that item:
> > >>         For each item B that customer purchased:
> > >>             Update the co-occurrence edge between A&B.
> > >>             (If the edge exists, add one to its weight. If it
> > >>             doesn't exist, create it with weight one.)
> > >>
> > > This is O(n^2) worst case, but practically it'll be much better due to
> > > the sparseness of purchases. The large number of customers slows it
> > > down, though. The slowest part, I suspect, is the last line.
> > > It's a lot of finding and re-finding edges between As and Bs and
> > > updating the edge properties. I don't see much way around it, though.
> > > I wrote another version that avoids this but is always O(n^2), and it
> > > takes about 15 minutes per A to check against all B (which would also
> > > take a month). The version above seems to be averaging 3
> > > customers/sec, which doesn't seem that slow until you realize that
> > > some of these items were purchased by thousands of customers.
> > >
> > > I'd hate to give up on Neo4J. I really like the graph database
> > > concept. But can it handle data? I hope someone sees something I'm
> > > doing wrong.
> > >
> > > Thanks,
> > > Jeff Klann
> > >
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user