Thanks to both of you. I am very grateful that you took the time to put this into code -- how's that for community!

I presume this way of getting 'highId' is constant in time? It looks rather messy though -- is it really the most straightforward way to do it?

I am thinking about how efficient this will be. As I understand it, the "sampling misses" come from deleted nodes that were once there. But if I remember correctly, Neo4j tries to reuse these unused node indices when new nodes are added. Is an unused node index _guaranteed_ to be reused whenever one exists, or could inserting another node increase 'highId' even though some indices below it are unused?

My conclusion is that the "sampling misses" will increase with the sparseness of index usage, and that we will see a high rate of "sampling misses" after many deletes followed by few insertions. Would you agree?

Thanks again.

Regards,
Anders
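
P.S. To make the point about sparseness concrete, here is a minimal sketch in plain Java. It makes no Neo4j calls: a BitSet stands in for the node store, and the class and method names (SamplingMissSketch, expectedAttemptsPerHit, simulateMisses) are made up for illustration. It assumes ids are drawn uniformly from [0, highId), in which case the expected number of draws per hit is highId divided by the number of ids in use.

```java
import java.util.BitSet;
import java.util.Random;

public class SamplingMissSketch {

    /**
     * Expected number of uniform random draws from [0, highId) needed
     * to hit one id that is in use (hypothetical helper, not a Neo4j API).
     */
    public static double expectedAttemptsPerHit(long highId, long idsInUse) {
        if (idsInUse <= 0 || idsInUse > highId) {
            throw new IllegalArgumentException("need 0 < idsInUse <= highId");
        }
        return (double) highId / idsInUse;
    }

    /**
     * Draws ids uniformly from [0, highId) until sampleSize hits are found
     * and returns how many draws landed on an unused id.
     */
    public static long simulateMisses(BitSet inUse, int highId, int sampleSize, Random random) {
        if (inUse.isEmpty()) {
            throw new IllegalArgumentException("no ids in use, sampling would never finish");
        }
        long misses = 0;
        int hits = 0;
        while (hits < sampleSize) {
            int id = random.nextInt(highId);
            if (inUse.get(id)) {
                hits++;
            } else {
                misses++;
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        int highId = 100_000;
        int sampleSize = 10_000;

        // Assumption for the demo: 3 out of every 4 ids were deleted and never reused.
        BitSet inUse = new BitSet(highId);
        for (int id = 0; id < highId; id += 4) {
            inUse.set(id);
        }

        long misses = simulateMisses(inUse, highId, sampleSize, new Random(42));
        System.out.println("ids in use:              " + inUse.cardinality());
        System.out.println("expected draws per hit:  "
                + expectedAttemptsPerHit(highId, inUse.cardinality()));
        System.out.println("observed misses:         " + misses);
        System.out.println("observed draws per hit:  "
                + (double) (misses + sampleSize) / sampleSize);
    }
}
```

If this reasoning holds, a store where only a quarter of the ids below 'highId' are in use would cost roughly four lookups per sampled node, and the cost keeps growing as the store gets sparser.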

> Date: Wed, 9 Nov 2011 19:30:36 +0200
> From: chris.gio...@neotechnology.com
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>
> Hi,
>
> Backing Jim's algorithm with some code:
>
> public static void main( String[] args )
> {
>     long SAMPLE_SIZE = 10000;
>     EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
>     // Determine the highest possible id for the node store
>     long highId = ( (NeoStoreXaDataSource)
>         db.getConfig().getTxModule().getXaDataSourceManager().getXaDataSource(
>             Config.DEFAULT_DATA_SOURCE_NAME )
>         ).getNeoStore().getNodeStore().getHighId();
>     System.out.println( highId + " is the highest id" );
>     long i = 0;
>     long nextId;
>
>     // Do the sampling
>     Random random = new Random();
>     while ( i < SAMPLE_SIZE )
>     {
>         nextId = Math.abs( random.nextLong() ) % highId;
>         try
>         {
>             db.getNodeById( nextId );
>             i++;
>             System.out.println( "id " + nextId + " is there" );
>         }
>         catch ( NotFoundException e )
>         {
>             // NotFoundException is thrown when the node asked is not in use
>             System.out.println( "id " + nextId + " not in use" );
>         }
>     }
>     db.shutdown();
> }
>
> Like already mentioned, this will be slow. Random jumps around the
> graph are not something caches can keep up with - unless your whole db
> fits in memory. But accessing random pieces of an on-disk file cannot
> be done much faster.
>
> cheers,
> CG
>
> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
> > Hi Anders,
> >
> > When you do getAllNodes, you're getting back an iterable so as you point
> > out the sample isn't random (unless it was written randomly to disk). If
> > you're prepared to take a scattergun approach and tolerate being
> > disk-bound, then you can ask for getNodeById using a made-up ID and deal
> > with the times when your ID's don't resolve.
> >
> > It'll be slow (since the chances of having the nodes in cache are low) but
> > as random as your random ID generator.
> >
> > Jim
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user