Re: [Neo4j] Sampling a Neo4j instance?

Mattias Persson Fri, 18 Nov 2011 06:44:12 -0800

They have a common abstract class AbstractGraphDatabase.
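
Because both classes share that base class, a single instanceof check against it should cover the embedded and the HA case alike. Below is a minimal sketch of the idea. All types in it are stand-ins invented for illustration (GraphService, AbstractGraphBase, EmbeddedStandIn, HaStandIn, and the highNodeId() accessor); they are not the Neo4j classes, and the real lookup goes through getConfig() as shown further down the thread.

```java
// Stand-in types only: they mimic the shape of the hierarchy discussed in
// this thread and are NOT the real Neo4j classes.
interface GraphService {}                  // plays the role of GraphDatabaseService

abstract class AbstractGraphBase implements GraphService {
    abstract long highNodeId();            // hypothetical accessor, for illustration
}

final class EmbeddedStandIn extends AbstractGraphBase {
    long highNodeId() { return 42L; }
}

final class HaStandIn extends AbstractGraphBase {
    long highNodeId() { return 99L; }
}

final class HighIdLookup {
    /** Returns the high id if the service exposes it, or -1 if it does not. */
    static long highIdOf(GraphService service) {
        // One check against the shared base class handles both subclasses.
        if (service instanceof AbstractGraphBase) {
            return ((AbstractGraphBase) service).highNodeId();
        }
        return -1L;
    }
}
```

A plugin that only receives the interface would call HighIdLookup.highIdOf( service ) and fall back to another strategy when it gets -1.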

On 18 November 2011 09:46, Anders Lindström <andli...@hotmail.com> wrote:


>
> Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can
> see that the 'getConfig' is there -- but does the cast to
> NeoStoreXaDataSource work as well?
> Thanks.
>
> > Date: Wed, 16 Nov 2011 21:40:32 +0200
> > From: chris.gio...@neotechnology.com
> > To: user@lists.neo4j.org
> > Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >
> > No, GraphDatabaseService wisely hides those things away. I would
> > suggest using instanceof and casting to EmbeddedGraphDatabase.
> >
> > cheers,
> > CG
> >
> > 2011/11/16 Anders Lindström <andli...@hotmail.com>:
> > >
> > > Chris, thanks again for your replies.
> > > > I realize now that I don't have the 'getConfig' method -- I'm writing
> > > > a server plugin and I only get the GraphDatabaseService interface passed to
> > > > my method, not an EmbeddedGraphDatabase. Is there an equivalent way of
> > > > getting the highest node index through the interface?
> > > Thanks.
> > >
> > >> Date: Thu, 10 Nov 2011 12:01:31 +0200
> > >> From: chris.gio...@neotechnology.com
> > >> To: user@lists.neo4j.org
> > >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> > >>
> > >> Answers inline.
> > >>
> > >> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
> > >> >
> > >> > Thanks to both of you. I am very grateful that you took your
> > >> > time to put this into code -- how's that for community!
> > >> > I presume this way of getting 'highId' is constant in time? It
> > >> > looks rather messy though -- is it really the most straightforward way to
> > >> > do it?
> > >>
> > >> This is the safest way to do it, that takes into consideration crashes
> > >> and HA cluster membership.
> > >>
> > >> Another way to do it is
> > >>
> > >> long highId = db.getConfig().getIdGeneratorFactory()
> > >>         .get( IdType.NODE ).getHighId();
> > >>
> > >> which can return the same value as the first, if some conditions are
> > >> met. It is shorter and cast-free but I'd still use the first way.
> > >>
> > >> getHighId() is a constant time operation for both ways described - it
> > >> is just a field access, with an additional long comparison for the
> > >> first case.
> > >>
> > >> > I am thinking about how efficient this will be. As I understand it,
> > >> > the "sampling misses" come from deleted nodes that once were there. But if I
> > >> > remember correctly, Neo4j tries to reuse these unused node indices when new
> > >> > nodes are added. But is an unused node index _guaranteed_ to be used given
> > >> > that there is one, or could inserting another node result in increasing
> > >> > 'highId' even though some indices below it are not used?
> > >>
> > >> During the lifetime of a Neo4j instance there is no id reuse for Nodes
> > >> and Relationships - deleted ids are saved however and will be reused
> > >> the next time Neo4j starts. This means that if during run A you
> > >> deleted nodes 3 and 5, the first two nodes returned by createNode() on
> > >> the next run will have ids 3 and 5 - so highId will not change.
> > >> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> > >> would have the id 3 or 5. A crash (or improper shutdown) of the
> > >> database will break this however, since the ids-to-recycle will
> > >> probably not make it to disk.
> > >>
> > >> So, in short, ids are guaranteed *not* to be reused within the same
> > >> run, but reuse between runs is not guaranteed.
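
The reuse rules described above can be modelled with a small toy allocator. This is only a sketch of the described behaviour, not Neo4j's id generator; the class ToyIdAllocator and its methods are invented for illustration, and a crash is modelled simply as losing the list of ids waiting to be recycled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the id behaviour described in this thread; not Neo4j code. */
final class ToyIdAllocator {
    private long highId = 0;                                          // next never-used id
    private final Deque<Long> reusable = new ArrayDeque<Long>();      // recycled at startup
    private final Deque<Long> freedThisRun = new ArrayDeque<Long>();  // deleted in this run

    long allocate() {
        // Ids recycled at startup are handed out first; otherwise highId grows.
        Long recycled = reusable.pollFirst();
        return recycled != null ? recycled.longValue() : highId++;
    }

    void free(long id) {
        // Not reusable until the next clean start.
        freedThisRun.addLast(id);
    }

    /** Clean shutdown followed by a start: freed ids become reusable. */
    void cleanRestart() {
        reusable.addAll(freedThisRun);
        freedThisRun.clear();
    }

    /** Crash followed by a start: the ids-to-recycle are modelled as lost. */
    void crashRestart() {
        reusable.clear();
        freedThisRun.clear();
    }

    long highId() {
        return highId;
    }
}
```

With ids 0..5 allocated and 3 and 5 freed, the next allocate() in the same run returns 6; after cleanRestart() the next two calls return 3 and 5 and highId() stays unchanged; after crashRestart() freed ids are never handed out again.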
> > >>
> > >> > My conclusion is that the "sampling misses" will increase with
> > >> > index usage sparseness and that we will have a high rate of "sampling
> > >> > misses" when we have had many deletes and few insertions recently. Would you
> > >> > agree?
> > >>
> > >> Yes, that is true, especially given the cost of the "wasted" I/O and
> > >> of handling the exception. However, this cost can go down
> > >> significantly if you keep a hash set for the ids of nodes you have
> > >> deleted and check that before asking for the node by id, instead of
> > >> catching an exception. Persisting that between runs would move you
> > >> away from encapsulated Neo4j constructs and would also be more
> > >> efficient.
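
The hash-set idea could look roughly like the sketch below. It only draws candidate ids and filters them in memory; the actual getNodeById call is left to the caller. The class DeletedAwareSampler, the method sampleIds, and the knownDeleted set are assumptions made for illustration: the set is state the application would have to maintain itself, not something Neo4j provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class DeletedAwareSampler {
    /**
     * Draws sampleSize ids from [0, highId), skipping ids the application
     * knows it has deleted, so no store access is wasted on them.
     */
    static List<Long> sampleIds(long highId, Set<Long> knownDeleted,
                                int sampleSize, Random random) {
        // Rough guard against looping forever when nothing can be sampled.
        if (highId <= 0 || knownDeleted.size() >= highId) {
            throw new IllegalArgumentException("no ids available to sample");
        }
        List<Long> sample = new ArrayList<Long>(sampleSize);
        while (sample.size() < sampleSize) {
            // Mask the sign bit so the candidate is never negative.
            long candidate = (random.nextLong() & Long.MAX_VALUE) % highId;
            if (knownDeleted.contains(candidate)) {
                continue; // in-memory miss: no I/O and no exception handling
            }
            sample.add(candidate); // the caller would now ask for this node by id
        }
        return sample;
    }
}
```

The lookup should probably still catch NotFoundException, since ids deleted before the set was started (or lost in a crash) would not be in it.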
> > >>
> > >> > Thanks again.
> > >> > Regards, Anders
> > >> >
> > >> >> Date: Wed, 9 Nov 2011 19:30:36 +0200
> > >> >> From: chris.gio...@neotechnology.com
> > >> >> To: user@lists.neo4j.org
> > >> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> Backing Jim's algorithm with some code:
> > >> >>
> > >> >>     // Package names as of the Neo4j 1.x releases discussed here
> > >> >>     import java.util.Random;
> > >> >>     import org.neo4j.graphdb.NotFoundException;
> > >> >>     import org.neo4j.kernel.Config;
> > >> >>     import org.neo4j.kernel.EmbeddedGraphDatabase;
> > >> >>     import org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource;
> > >> >>
> > >> >>     public static void main( String[] args )
> > >> >>     {
> > >> >>         long SAMPLE_SIZE = 10000;
> > >> >>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
> > >> >>                 "path/to/db/" );
> > >> >>         // Determine the highest possible id for the node store
> > >> >>         NeoStoreXaDataSource dataSource = (NeoStoreXaDataSource)
> > >> >>                 db.getConfig().getTxModule().getXaDataSourceManager()
> > >> >>                         .getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME );
> > >> >>         long highId = dataSource.getNeoStore().getNodeStore().getHighId();
> > >> >>         System.out.println( highId + " is the highest id" );
> > >> >>         long i = 0;
> > >> >>         long nextId;
> > >> >>
> > >> >>         // Do the sampling
> > >> >>         Random random = new Random();
> > >> >>         while ( i < SAMPLE_SIZE )
> > >> >>         {
> > >> >>             // Mask the sign bit: Math.abs( Long.MIN_VALUE ) is negative
> > >> >>             nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
> > >> >>             try
> > >> >>             {
> > >> >>                 db.getNodeById( nextId );
> > >> >>                 i++;
> > >> >>                 System.out.println( "id " + nextId + " is there" );
> > >> >>             }
> > >> >>             catch ( NotFoundException e )
> > >> >>             {
> > >> >>                 // NotFoundException is thrown when the requested node
> > >> >>                 // is not in use
> > >> >>                 System.out.println( "id " + nextId + " not in use" );
> > >> >>             }
> > >> >>         }
> > >> >>         db.shutdown();
> > >> >>     }
> > >> >>
> > >> >> As already mentioned, this will be slow. Random jumps around the
> > >> >> graph are not something caches can keep up with - unless your whole db
> > >> >> fits in memory. But accessing random pieces of an on-disk file cannot
> > >> >> be done much faster.
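
One corner case is worth knowing when drawing random ids from a long: Math.abs( Long.MIN_VALUE ) is still negative, so taking Math.abs of random.nextLong() can, very rarely, produce a negative id. The sketch below shows one way around it by clearing the sign bit instead. The helper class RandomIds and its method nextId are made up for illustration. The modulo step carries a tiny bias towards low ids, which should be negligible for any realistic highId.

```java
import java.util.Random;

final class RandomIds {
    /** Returns a non-negative pseudo-random id in [0, highId). */
    static long nextId(Random random, long highId) {
        if (highId <= 0) {
            throw new IllegalArgumentException("highId must be positive");
        }
        // Clearing the sign bit keeps the value in [0, Long.MAX_VALUE],
        // which avoids the Math.abs( Long.MIN_VALUE ) corner case.
        return (random.nextLong() & Long.MAX_VALUE) % highId;
    }
}
```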
> > >> >>
> > >> >> cheers,
> > >> >> CG
> > >> >>
> > >> >> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
> > >> >> > Hi Anders,
> > >> >> >
> > >> >> > When you do getAllNodes, you're getting back an iterable so as
> > >> >> > you point out the sample isn't random (unless it was written randomly to
> > >> >> > disk). If you're prepared to take a scattergun approach and tolerate being
> > >> >> > disk-bound, then you can ask for getNodeById using a made-up ID and deal
> > >> >> > with the times when your IDs don't resolve.
> > >> >> >
> > >> >> > It'll be slow (since the chances of having the nodes in cache
> > >> >> > are low) but as random as your random ID generator.
> > >> >> >
> > >> >> > Jim
> > >> >> > _______________________________________________
> > >> >> > Neo4j mailing list
> > >> >> > User@lists.neo4j.org
> > >> >> > https://lists.neo4j.org/mailman/listinfo/user



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com