Thanks to both of you. I am very grateful that you took the time to put 
this into code -- how's that for community!
I presume this way of getting 'highId' runs in constant time? It looks rather 
messy, though -- is it really the most straightforward way to do it?
I am thinking about how efficient this will be. As I understand it, the 
"sampling misses" come from deleted nodes that were once there. If I 
remember correctly, though, Neo4j tries to reuse these unused node ids when 
new nodes are added. Is an unused node id _guaranteed_ to be reused whenever 
there is one, or could inserting another node increase 'highId' even though 
some ids below it are not in use?
My conclusion is that the rate of "sampling misses" will grow with the 
sparseness of id usage, and that it will be high after many deletes followed 
by few insertions. Would you agree?
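To put a rough number on that: if ids are drawn uniformly from [0, highId) and a fraction p of them is in use, each attempt hits with probability p, so a sample of n nodes should need about n/p attempts. Below is a minimal sketch of that reasoning. It is a plain simulation, not Neo4j code: a BitSet stands in for the node store, and the class and method names are made up for illustration.

```java
import java.util.BitSet;
import java.util.Random;

public class SamplingMissSketch
{
    /** Expected number of lookups per successful hit when ids are drawn uniformly. */
    static double expectedAttemptsPerHit( long highId, long idsInUse )
    {
        return (double) highId / idsInUse;
    }

    /**
     * Draws ids uniformly from [0, highId) until sampleSize of them are in use,
     * and returns how many draws landed on an unused id.
     */
    static long countMisses( BitSet inUse, int highId, int sampleSize, Random random )
    {
        if ( inUse.get( 0, highId ).cardinality() == 0 )
        {
            // Without this check the loop below would never terminate
            throw new IllegalArgumentException( "no ids in use" );
        }
        long misses = 0;
        int hits = 0;
        while ( hits < sampleSize )
        {
            int id = random.nextInt( highId );
            if ( inUse.get( id ) )
            {
                hits++;
            }
            else
            {
                misses++;
            }
        }
        return misses;
    }

    public static void main( String[] args )
    {
        int highId = 100000;
        int sampleSize = 10000;
        Random random = new Random( 42 );

        // Hypothetical store in which roughly one id in four is still in use
        BitSet inUse = new BitSet( highId );
        for ( int id = 0; id < highId; id++ )
        {
            if ( random.nextInt( 4 ) == 0 )
            {
                inUse.set( id );
            }
        }

        long misses = countMisses( inUse, highId, sampleSize, random );
        System.out.println( "ids in use: " + inUse.cardinality() + " of " + highId );
        System.out.println( "expected attempts per hit: "
                + expectedAttemptsPerHit( highId, inUse.cardinality() ) );
        System.out.println( "misses while collecting " + sampleSize + " hits: " + misses );
    }
}
```

The sketch says nothing about how Neo4j actually reuses ids; it only shows how the miss count scales once the fraction of ids in use is known.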
Thanks again.
Regards,
Anders

> Date: Wed, 9 Nov 2011 19:30:36 +0200
> From: chris.gio...@neotechnology.com
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> 
> Hi,
> 
> Backing Jim's algorithm with some code:
> 
>     public static void main( String[] args )
>     {
>         long SAMPLE_SIZE = 10000;
>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
>                 "path/to/db/" );
>         // Determine the highest possible id for the node store
>         long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>                 .getXaDataSourceManager().getXaDataSource(
>                         Config.DEFAULT_DATA_SOURCE_NAME ) )
>                 .getNeoStore().getNodeStore().getHighId();
>         System.out.println( highId + " is the highest id" );
>         long i = 0;
>         long nextId;
> 
>         // Do the sampling
>         Random random = new Random();
>         while ( i < SAMPLE_SIZE )
>         {
>             // Clear the sign bit rather than using Math.abs(), which
>             // stays negative for Long.MIN_VALUE
>             nextId = ( random.nextLong() >>> 1 ) % highId;
>             try
>             {
>                 db.getNodeById( nextId );
>                 i++;
>                 System.out.println( "id " + nextId + " is there" );
>             }
>             catch ( NotFoundException e )
>             {
>                 // NotFoundException is thrown when the requested node is
>                 // not in use
>                 System.out.println( "id " + nextId + " not in use" );
>             }
>         }
>         db.shutdown();
>     }
> 
> As already mentioned, this will be slow. Random jumps around the
> graph are not something caches can keep up with - unless your whole db
> fits in memory. But accessing random pieces of an on-disk file cannot
> be done much faster.
> 
> cheers,
> CG
> 
> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
> > Hi Anders,
> >
> > When you do getAllNodes, you're getting back an iterable so as you point 
> > out the sample isn't random (unless it was written randomly to disk). If 
> > you're prepared to take a scattergun approach and tolerate being 
> > disk-bound, then you can ask for getNodeById using a made-up ID and deal 
> > with the times when your IDs don't resolve.
> >
> > It'll be slow (since the chances of having the nodes in cache are low) but 
> > as random as your random ID generator.
> >
> > Jim
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
                                          
