Re: [Neo4j] Batch inserter shutdown taking forever

Tim Jones Wed, 28 Jul 2010 01:39:27 -0700


----- Original Message ----
> From: Mattias Persson <matt...@neotechnology.com>
> To: Neo4j user discussions <user@lists.neo4j.org>
> Sent: Tue, July 27, 2010 8:27:24 PM
> Subject: Re: [Neo4j] Batch inserter shutdown taking forever
> 
> Since you're doing a depth 1 "traversal", please use something like this
> instead:
> 
>     for ( Relationship rel : graphDb.getReferenceNode().getRelationships(
>             Relationships.ROUTE, Direction.OUTGOING ) )
>     {
>         Node node = rel.getEndNode();
>         // Do stuff
>     }
> 
> A traverser uses more memory than a simple call to getRelationships.
> Another thing: are you doing any write operations in that for-loop of yours?
> Also, do you shut down the batch inserter and start a new
> EmbeddedGraphDatabase to traverse on, or how do you get hold of the
> graphDb?

Yes, I shut down the batch inserter and instantiate a new EmbeddedGraphDatabase
to run these operations on. The only thing I do in the loop is update an
attribute on the nodes.

I've changed my approach a little now. All of the Route nodes were related
to the reference node, but also to Page nodes. Now I use a lookup service to
retrieve all Page nodes, and then traverse to a depth of 1 from each of the
returned nodes. Performance is better: it now takes about an hour to update 3m
nodes this way. I think I'll stick with this because it'll scale better than
the first method I was using (I'm basically removing duplicate nodes based on
an attribute, so I need to build an in-memory look-up table to recognise
whether I've seen a particular node before). I'll change the traverser as you
suggest and see if that improves things.
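
The look-up table idea above can be sketched roughly as follows. This is a
minimal illustration rather than the code actually used here: the class name
DuplicateFinder, the method findDuplicates and the key function are all
hypothetical, it is written for Java 8 or later, and it assumes the set of
keys fits in the heap.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: not part of Neo4j, names chosen for illustration only.
public class DuplicateFinder {

    /**
     * Walks the items once and returns every item whose key was already
     * produced by an earlier item. The first item seen for each key is kept.
     */
    public static <T, K> List<T> findDuplicates(Iterable<T> items, Function<T, K> keyOf) {
        Set<K> seen = new HashSet<>();          // the in-memory look-up table
        List<T> duplicates = new ArrayList<>();
        for (T item : items) {
            K key = keyOf.apply(item);
            // Set.add returns false when the key is already in the set
            if (!seen.add(key)) {
                duplicates.add(item);
            }
        }
        return duplicates;
    }
}
```

With Neo4j nodes, the key function would read the attribute in question, for
example node -> node.getProperty( "url" ), where "url" is a made-up property
name. Only the keys are held in the set, so memory use grows with the number
of distinct attribute values rather than with the size of the nodes.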

Thanks

> 
> 2010/7/26 Tim Jones <bogol...@ymail.com>
> 
> > OK, I found out what's taking the time. It's iterating over the result set
> > of a traverser:
> >
> >     // visit each Route node, and add it to the array
> >     Traverser routes = graphDb.getReferenceNode().traverse(
> >             Traverser.Order.BREADTH_FIRST,
> >             StopEvaluator.DEPTH_ONE,
> >             ReturnableEvaluator.ALL_BUT_START_NODE,
> >             Relationships.ROUTE, Direction.OUTGOING );
> >
> >     for ( Node node : routes )
> >     {
> >         // do stuff
> >     }
> >
> >
> > The 'for' loop takes ages. There are probably 2m nodes being returned by
> > that traverser at the moment, and that's only a very small subset of the
> > data I want to add to the database.
> >
> > Is there any way to tinker with the neo4j properties or anything to improve
> > performance here?
> >
> > Thanks
> >
> >
> > ----- Original Message ----
> > > From: Mattias Persson <matt...@neotechnology.com>
> > > To: Neo4j user discussions <user@lists.neo4j.org>
> > > Sent: Sat, July 24, 2010 10:23:02 PM
> > > Subject: Re: [Neo4j] Batch inserter shutdown taking forever
> > >
> > > 2010/7/21 Tim Jones <bogol...@ymail.com>
> > >
> > > > Hi,
> > > >
> > > > I'm using a BatchInserter and a LuceneIndexBatchInserter to insert >5m
> > > > nodes and >5m relationships into a graph in one go. The insertion seems
> > > > to work, but shutting down takes forever - it's been 2 hours now.
> > > >
> > > > At first, the JVM gave me a garbage collection exception, so I've set
> > > > the heap to 2gb.
> > > >
> > > > 'top' tells me that the application is still running:
> > > >
> > > >  PID  USER       PR  NI  VIRT  RES  SHR S %CPU  %MEM     TIME+  COMMAND
> > > >  9994 tim          17   0 2620m 2.3g 238m S 99.5 39.1 115:48.84 java
> > > >
> > > > but checking the filesystem by running 'ls -l' a few times doesn't
> > > > indicate that files are being updated.
> > > >
> > > > Is this normal? Is there a way to improve performance?
> > > >
> > >
> > > No, it sounds quite weird. Any chance to have a look at your code?
> > >
> > >
> > > >
> > > > I'm loading all my data in one go to ease creating the db - it's simpler
> > > > to create it from scratch each time instead of updating an existing
> > > > database - so ideally I don't want to break this job down into multiple
> > > > smaller jobs (actually, this would be OK if performance was good, but I
> > > > ran into problems inserting data and retrieving existing nodes).
> > > >
> > >
> > > What kind of problems? Could you supply code and a description of your
> > > problems?
> >
> > Problems doing something similar in relational dbs. Also, the API
> > recommends optimising the batch search index before using it for lookups.
> > I just decided not to take this approach.
> >
> > > >
> > > > Thanks,
> > > > Tim
> > > >
> > > > _______________________________________________
> > > > Neo4j mailing list
> > > > User@lists.neo4j.org
> > > > https://lists.neo4j.org/mailman/listinfo/user
> > > >
> > >
> > > --
> > > Mattias Persson, [matt...@neotechnology.com]
> > > Hacker, Neo Technology
> > > www.neotechnology.com
> 
> 
> 
> -- 
> Mattias Persson, [matt...@neotechnology.com]
> Hacker, Neo Technology
> www.neotechnology.com


      

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user