Hi,

Backing Jim's algorithm with some code:

import java.util.Random;

import org.neo4j.graphdb.NotFoundException;
import org.neo4j.kernel.Config;
import org.neo4j.kernel.EmbeddedGraphDatabase;
import org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource;

public static void main( String[] args )
{
    long SAMPLE_SIZE = 10000;
    EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );

    // Determine the highest possible id for the node store
    long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
            .getXaDataSourceManager()
            .getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME ) )
            .getNeoStore().getNodeStore().getHighId();
    System.out.println( highId + " is the highest id" );

    long i = 0;
    long nextId;

    // Do the sampling
    Random random = new Random();
    while ( i < SAMPLE_SIZE )
    {
        // Clear the sign bit rather than using Math.abs(), which still
        // returns a negative value for Long.MIN_VALUE
        nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
        try
        {
            db.getNodeById( nextId );
            i++;
            System.out.println( "id " + nextId + " is there" );
        }
        catch ( NotFoundException e )
        {
            // NotFoundException is thrown when the node asked for is not in use
            System.out.println( "id " + nextId + " not in use" );
        }
    }
    db.shutdown();
}

As already mentioned, this will be slow. Caches cannot keep up with random
jumps around the graph unless your whole db fits in memory, and random reads
from an on-disk file cannot be made much faster.

cheers,
CG

On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
> Hi Anders,
>
> When you do getAllNodes, you're getting back an iterable so as you point out
> the sample isn't random (unless it was written randomly to disk). If you're
> prepared to take a scattergun approach and tolerate being disk-bound, then
> you can ask for getNodeById using a made-up ID and deal with the times when
> your IDs don't resolve.
>
> It'll be slow (since the chances of having the nodes in cache are low) but as
> random as your random ID generator.
>
> Jim
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
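
P.S. Two things to keep in mind about the sampling loop: it counts a node every time its id comes up, so it samples with replacement, and it never terminates if the store has no nodes in use. Below is a rough sketch of one way around both, not something taken from the Neo4j API. The names RandomIdSampler, sampleDistinct, maxAttempts and inUse are made up for illustration, and the inUse predicate only stands in for the getNodeById lookup so the example runs without a database.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.LongPredicate;

class RandomIdSampler {

    /**
     * Draws up to sampleSize distinct ids from [0, highId) for which inUse
     * returns true. Gives up after maxAttempts draws, so a sparse or empty
     * store cannot keep the loop running forever.
     */
    static Set<Long> sampleDistinct(long highId, int sampleSize, long maxAttempts,
                                    LongPredicate inUse, Random random) {
        Set<Long> sample = new LinkedHashSet<>();
        if (highId <= 0) {
            return sample; // nothing to sample from
        }
        for (long attempt = 0;
             attempt < maxAttempts && sample.size() < sampleSize;
             attempt++) {
            // Clear the sign bit instead of calling Math.abs, which stays
            // negative for Long.MIN_VALUE
            long candidate = (random.nextLong() & Long.MAX_VALUE) % highId;
            if (inUse.test(candidate)) {
                sample.add(candidate); // the set ignores ids already sampled
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the node store: only even ids are in use
        LongPredicate inUse = id -> id % 2 == 0;
        Set<Long> sample = sampleDistinct(1000, 10, 100_000, inUse, new Random());
        System.out.println(sample);
    }
}
```

Against a real database, inUse would presumably wrap db.getNodeById( id ) and return false when NotFoundException is thrown. The returned set can be smaller than sampleSize if maxAttempts runs out first, so the caller should check its size.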