Re: [Neo4j] Sampling a Neo4j instance?

Anders Lindström Fri, 18 Nov 2011 00:46:26 -0800

Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see 
that the 'getConfig' is there -- but does the cast to NeoStoreXaDataSource work 
as well?
Thanks.


> Date: Wed, 16 Nov 2011 21:40:32 +0200
> From: chris.gio...@neotechnology.com
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> 
> No, GraphDatabaseService wisely hides those things away. I would
> suggest using instanceof and casting to EmbeddedGraphDatabase.
> 
> cheers,
> CG
> 
> 2011/11/16 Anders Lindström <andli...@hotmail.com>:
> >
> > Chris, thanks again for your replies.
> > I realize now that I don't have the 'getConfig' method -- I'm writing a 
> > server plugin and I only get the GraphDatabaseService interface passed to 
> > my method, not an EmbeddedGraphDatabase. Is there an equivalent way of 
> > getting the highest node index through the interface?
> > Thanks.
> >
> >> Date: Thu, 10 Nov 2011 12:01:31 +0200
> >> From: chris.gio...@neotechnology.com
> >> To: user@lists.neo4j.org
> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >>
> >> Answers inline.
> >>
> >> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
> >> >
> >> > Thanks to the both of you. I am very grateful that you took your time to 
> >> > put this into code -- how's that for community!
> >> > I presume this way of getting 'highId' is constant in time? It looks 
> >> > rather messy though -- is it really the most straightforward way to do 
> >> > it?
> >>
> >> This is the safest way to do it, since it takes crashes and HA
> >> cluster membership into account.
> >>
> >> Another way to do it is
> >>
> >> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE
> >> ).getHighId();
> >>
> >> which can return the same value as the first way, if some conditions
> >> are met. It is shorter and cast-free, but I'd still use the first way.
> >>
> >> getHighId() is a constant time operation for both ways described - it
> >> is just a field access, with an additional long comparison for the
> >> first case.
> >>
> >> > I am thinking about how efficient this will be. As I understand it, the 
> >> > "sampling misses" come from deleted nodes that once were there. But if I 
> >> > remember correctly, Neo4j tries to reuse these unused node indices when 
> >> > new nodes are added. But is an unused node index _guaranteed_ to be used 
> >> > given that there is one, or could inserting another node result in 
> >> > increasing 'highId' even though some indices below it are not used?
> >>
> >> During the lifetime of a Neo4j instance there is no id reuse for Nodes
> >> and Relationships - deleted ids are saved however and will be reused
> >> the next time Neo4j starts. This means that if during run A you
> >> deleted nodes 3 and 5, the first two nodes returned by createNode() on
> >> the next run will have ids 3 and 5 - so highId will not change.
> >> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> >> would have the id 3 or 5. A crash (or improper shutdown) of the
> >> database will break this however, since the ids-to-recycle will
> >> probably not make it to disk.
> >>
> >> So, in short, it is guaranteed that ids *won't* be reused in the same
> >> run, but it is not guaranteed that they will be reused between runs.
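
To make the reuse rule above concrete, here is a toy model of it in plain Java. It is only a sketch of the behaviour described in this thread, not Neo4j's actual id generator: the class `IdReuseModel` and all of its methods are made up for illustration, and a crash is modelled simply as losing the list of freed ids.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of the id reuse rule described above; not Neo4j code.
public class IdReuseModel {
    private long highId = 0; // next id that has never been handed out
    private final Deque<Long> freedThisRun = new ArrayDeque<Long>();
    private final Deque<Long> reusable = new ArrayDeque<Long>();

    public long createNode() {
        // Ids freed in an earlier run are handed out first; otherwise highId grows.
        return reusable.isEmpty() ? highId++ : reusable.pollFirst();
    }

    public void deleteNode(long id) {
        // Freed ids are only remembered; they are not reused during this run.
        freedThisRun.addLast(id);
    }

    public void restart(boolean cleanShutdown) {
        // On a clean shutdown the freed ids survive; after a crash they are lost.
        if (cleanShutdown) {
            reusable.addAll(freedThisRun);
        }
        freedThisRun.clear();
    }

    public long getHighId() {
        return highId;
    }

    public static void main(String[] args) {
        IdReuseModel ids = new IdReuseModel();
        for (int i = 0; i < 6; i++) {
            ids.createNode(); // ids 0..5
        }
        ids.deleteNode(3);
        ids.deleteNode(5);
        System.out.println(ids.createNode()); // same run: a fresh id
        ids.restart(true);
        System.out.println(ids.createNode()); // next run: recycled id
        System.out.println(ids.createNode()); // next run: recycled id
        System.out.println(ids.getHighId());
    }
}
```

Run as-is, it prints 6, 3, 5 and 7: the ids freed during the first run are not handed out again until after the restart, and highId only grows when no recycled id is available. Calling `restart(false)` instead models the crash case, where the freed ids are never reused.
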
> >>
> >> > My conclusion is that the "sampling misses" will increase with index 
> >> > usage sparseness and that we will have a high rate of "sampling misses" 
> >> > when we had many deletes and few insertions recently. Would you agree?
> >>
> >> Yes, that is true, especially given the cost of the "wasted" I/O and
> >> of handling the exception. However, this cost can go down
> >> significantly if you keep a hash set for the ids of nodes you have
> >> deleted and check that before asking for the node by id, instead of
> >> catching an exception. Persisting that between runs would move you
> >> away from encapsulated Neo4j constructs and would also be more
> >> efficient.
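
The hash set idea above could look roughly like the following. This is a minimal sketch under stated assumptions: `NodeLookup` is a hypothetical stand-in for whatever answers "is this id in use?" (in the real case, a `getNodeById` call wrapped in a try/catch), and `DeletedIdFilter` and `sample` are names invented here. It also assumes at least one id below `highId` is still in use, otherwise the loop never finishes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DeletedIdFilter {
    /** Hypothetical stand-in for the node store: true if the id is in use. */
    public interface NodeLookup {
        boolean exists(long id);
    }

    /** Samples ids (with replacement), skipping ids known to be deleted. */
    public static List<Long> sample(NodeLookup store, Set<Long> deletedIds,
                                    long highId, int sampleSize, Random random) {
        List<Long> sampled = new ArrayList<Long>();
        while (sampled.size() < sampleSize) {
            // Drop the sign bit so the result is in [0, highId)
            long id = (random.nextLong() >>> 1) % highId;
            if (deletedIds.contains(id)) {
                continue; // known miss: skip without touching the store
            }
            if (store.exists(id)) {
                sampled.add(id);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        final Set<Long> deleted = new HashSet<Long>();
        deleted.add(3L);
        deleted.add(5L);
        final long[] lookups = { 0 };
        // Toy store: ids 0..9 exist, except the deleted ones
        NodeLookup store = new NodeLookup() {
            public boolean exists(long id) {
                lookups[0]++;
                return !deleted.contains(id);
            }
        };
        List<Long> sampled = sample(store, deleted, 10, 1000, new Random());
        System.out.println(sampled.size() + " nodes sampled, "
                + lookups[0] + " store lookups");
    }
}
```

In this toy setup every deleted id is in the set, so each store lookup is a hit and no lookup is wasted. In practice the set only knows about deletes you recorded yourself, so some misses would still reach the store.
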
> >>
> >> > Thanks again.
> >> > Regards,Anders
> >> >
> >> >> Date: Wed, 9 Nov 2011 19:30:36 +0200
> >> >> From: chris.gio...@neotechnology.com
> >> >> To: user@lists.neo4j.org
> >> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >> >>
> >> >> Hi,
> >> >>
> >> >> Backing Jim's algorithm with some code:
> >> >>
> >> >>     public static void main( String[] args )
> >> >>     {
> >> >>         long SAMPLE_SIZE = 10000;
> >> >>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
> >> >>                 "path/to/db/" );
> >> >>         // Determine the highest possible id for the node store
> >> >>         NeoStoreXaDataSource dataSource = (NeoStoreXaDataSource)
> >> >>                 db.getConfig().getTxModule().getXaDataSourceManager()
> >> >>                         .getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME );
> >> >>         long highId = dataSource.getNeoStore().getNodeStore().getHighId();
> >> >>         System.out.println( highId + " is the highest id" );
> >> >>         long i = 0;
> >> >>         long nextId;
> >> >>
> >> >>         // Do the sampling
> >> >>         Random random = new Random();
> >> >>         while ( i < SAMPLE_SIZE )
> >> >>         {
> >> >>             // Shift out the sign bit rather than calling Math.abs(), which
> >> >>             // returns a negative value for Long.MIN_VALUE
> >> >>             nextId = ( random.nextLong() >>> 1 ) % highId;
> >> >>             try
> >> >>             {
> >> >>                 db.getNodeById( nextId );
> >> >>                 i++;
> >> >>                 System.out.println( "id " + nextId + " is there" );
> >> >>             }
> >> >>             catch ( NotFoundException e )
> >> >>             {
> >> >>                 // NotFoundException is thrown when the requested node is
> >> >>                 // not in use
> >> >>                 System.out.println( "id " + nextId + " not in use" );
> >> >>             }
> >> >>         }
> >> >>         db.shutdown();
> >> >>     }
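
For drawing the random id in a loop like the one above, Java 7 and later offer `ThreadLocalRandom.nextLong(bound)`, which returns a value uniformly distributed in [0, bound) without any sign handling or modulo step. A small sketch, where the class name `RandomNodeId` and the example `highId` value are made up for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomNodeId {
    /** Returns a uniformly distributed id in [0, highId); highId must be positive. */
    public static long next(long highId) {
        return ThreadLocalRandom.current().nextLong(highId);
    }

    public static void main(String[] args) {
        long highId = 1000; // example value; in practice taken from the node store
        for (int i = 0; i < 5; i++) {
            System.out.println(next(highId));
        }
    }
}
```

Note that `nextLong` throws an IllegalArgumentException for a non-positive bound, so an empty store (highId of 0) needs to be handled before sampling.
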
> >> >>
> >> >> As already mentioned, this will be slow. Random jumps around the
> >> >> graph are not something caches can keep up with - unless your whole db
> >> >> fits in memory. But accessing random pieces of an on-disk file cannot
> >> >> be done much faster.
> >> >>
> >> >> cheers,
> >> >> CG
> >> >>
> >> >> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> 
> >> >> wrote:
> >> >> > Hi Anders,
> >> >> >
> >> >> > When you do getAllNodes, you're getting back an iterable so as you 
> >> >> > point out the sample isn't random (unless it was written randomly to 
> >> >> > disk). If you're prepared to take a scattergun approach and tolerate 
> >> >> > being disk-bound, then you can ask for getNodeById using a made-up ID 
> >> >> > and deal with the times when your IDs don't resolve.
> >> >> >
> >> >> > It'll be slow (since the chances of having the nodes in cache are 
> >> >> > low) but as random as your random ID generator.
> >> >> >
> >> >> > Jim
> >> >> > _______________________________________________
> >> >> > Neo4j mailing list
> >> >> > User@lists.neo4j.org
> >> >> > https://lists.neo4j.org/mailman/listinfo/user