They have a common abstract class, AbstractGraphDatabase.

On 18 November 2011 09:46, Anders Lindström <andli...@hotmail.com> wrote:
>
> Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can
> see that the 'getConfig' is there -- but does the cast to
> NeoStoreXaDataSource work as well?
> Thanks.
>
> > Date: Wed, 16 Nov 2011 21:40:32 +0200
> > From: chris.gio...@neotechnology.com
> > To: user@lists.neo4j.org
> > Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >
> > No, GraphDatabaseService wisely hides those things away. I would
> > suggest using instanceof and casting to EmbeddedGraphDatabase.
> >
> > cheers,
> > CG
> >
> > 2011/11/16 Anders Lindström <andli...@hotmail.com>:
> > >
> > > Chris, thanks again for your replies.
> > > I realize now that I don't have the 'getConfig' method -- I'm writing
> > > a server plugin and I only get the GraphDatabaseService interface passed to
> > > my method, not an EmbeddedGraphDatabase. Is there an equivalent way of
> > > getting the highest node index through the interface?
> > > Thanks.
> > >
> > >> Date: Thu, 10 Nov 2011 12:01:31 +0200
> > >> From: chris.gio...@neotechnology.com
> > >> To: user@lists.neo4j.org
> > >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> > >>
> > >> Answers inline.
> > >>
> > >> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
> > >> >
> > >> > Thanks to both of you. I am very grateful that you took your
> > >> > time to put this into code -- how's that for community!
> > >> > I presume this way of getting 'highId' is constant in time? It
> > >> > looks rather messy though -- is it really the most straightforward way to
> > >> > do it?
> > >>
> > >> This is the safest way to do it, one that takes crashes and HA
> > >> cluster membership into consideration.
> > >>
> > >> Another way to do it is
> > >>
> > >> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();
> > >>
> > >> which can return the same value as the first, if some conditions are
> > >> met. It is shorter and cast-free but I'd still use the first way.
> > >>
> > >> getHighId() is a constant time operation for both ways described - it
> > >> is just a field access, with an additional long comparison for the
> > >> first case.
> > >>
> > >> > I am thinking about how efficient this will be. As I understand it,
> > >> > the "sampling misses" come from deleted nodes that once were there. But if I
> > >> > remember correctly, Neo4j tries to reuse these unused node indices when new
> > >> > nodes are added. But is an unused node index _guaranteed_ to be used given
> > >> > that there is one, or could inserting another node result in increasing
> > >> > 'highId' even though some indices below it are not used?
> > >>
> > >> During the lifetime of a Neo4j instance there is no id reuse for Nodes
> > >> and Relationships - deleted ids are saved however and will be reused
> > >> the next time Neo4j starts. This means that if during run A you
> > >> deleted nodes 3 and 5, the first two nodes returned by createNode() on
> > >> the next run will have ids 3 and 5 - so highId will not change.
> > >> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> > >> would have the id 3 or 5. A crash (or improper shutdown) of the
> > >> database will break this however, since the ids-to-recycle will
> > >> probably not make it to disk.
> > >>
> > >> So, in short, it is guaranteed that ids *won't* be reused in the same
> > >> run, but it is not guaranteed that they will be reused between runs.
> > >>
> > >> > My conclusion is that the "sampling misses" will increase with
> > >> > index usage sparseness and that we will have a high rate of "sampling
> > >> > misses" when we had many deletes and few insertions recently. Would you
> > >> > agree?
> > >>
> > >> Yes, that is true, especially given the cost of the "wasted" I/O and
> > >> of handling the exception. However, this cost can go down
> > >> significantly if you keep a hash set for the ids of nodes you have
> > >> deleted and check that before asking for the node by id, instead of
> > >> catching an exception.
> > >> Persisting that between runs would move you
> > >> away from encapsulated Neo4j constructs and would also be more
> > >> efficient.
> > >>
> > >> > Thanks again.
> > >> > Regards, Anders
> > >> >
> > >> >> Date: Wed, 9 Nov 2011 19:30:36 +0200
> > >> >> From: chris.gio...@neotechnology.com
> > >> >> To: user@lists.neo4j.org
> > >> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> > >> >>
> > >> >> Hi,
> > >> >>
> > >> >> Backing Jim's algorithm with some code:
> > >> >>
> > >> >> public static void main( String[] args )
> > >> >> {
> > >> >>     long SAMPLE_SIZE = 10000;
> > >> >>     EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
> > >> >>     // Determine the highest possible id for the node store
> > >> >>     long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
> > >> >>             .getXaDataSourceManager().getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME ) )
> > >> >>             .getNeoStore().getNodeStore().getHighId();
> > >> >>     System.out.println( highId + " is the highest id" );
> > >> >>     long i = 0;
> > >> >>     long nextId;
> > >> >>
> > >> >>     // Do the sampling
> > >> >>     Random random = new Random();
> > >> >>     while ( i < SAMPLE_SIZE )
> > >> >>     {
> > >> >>         // Clear the sign bit instead of calling Math.abs(), which
> > >> >>         // stays negative for Long.MIN_VALUE
> > >> >>         nextId = ( random.nextLong() >>> 1 ) % highId;
> > >> >>         try
> > >> >>         {
> > >> >>             db.getNodeById( nextId );
> > >> >>             i++;
> > >> >>             System.out.println( "id " + nextId + " is there" );
> > >> >>         }
> > >> >>         catch ( NotFoundException e )
> > >> >>         {
> > >> >>             // NotFoundException is thrown when the node asked is not in use
> > >> >>             System.out.println( "id " + nextId + " not in use" );
> > >> >>         }
> > >> >>     }
> > >> >>     db.shutdown();
> > >> >> }
> > >> >>
> > >> >> As already mentioned, this will be slow. Random jumps around the
> > >> >> graph are not something caches can keep up with - unless your whole db
> > >> >> fits in memory. But accessing random pieces of an on-disk file cannot
> > >> >> be done much faster.
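
The hash-set suggestion made earlier in the thread can be sketched without a running database. This is only an illustration under stated assumptions, not code from the thread: the class DeletedIdAwareSampler, the method sample, the inUse predicate (a stand-in for the getNodeById lookup) and the maxAttempts cap are all hypothetical names. Ids are drawn with Math.floorMod, which stays non-negative even when nextLong() returns Long.MIN_VALUE; the tiny modulo bias is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.LongPredicate;

class DeletedIdAwareSampler {

    /**
     * Draws up to sampleSize ids from [0, highId) that are in use, with replacement.
     * Ids in knownDeleted are skipped before the (expensive) lookup is attempted.
     * inUse is a hypothetical stand-in for the store lookup, e.g. a wrapper that
     * calls getNodeById and returns false on NotFoundException.
     * maxAttempts bounds the loop so a very sparse store cannot spin forever.
     */
    static List<Long> sample(long highId, int sampleSize, Set<Long> knownDeleted,
                             LongPredicate inUse, Random random, long maxAttempts) {
        if (highId <= 0) {
            throw new IllegalArgumentException("highId must be positive");
        }
        List<Long> sampled = new ArrayList<>();
        for (long attempt = 0; attempt < maxAttempts && sampled.size() < sampleSize; attempt++) {
            // floorMod never returns a negative value for a positive modulus
            long candidate = Math.floorMod(random.nextLong(), highId);
            if (knownDeleted.contains(candidate)) {
                continue; // cheap in-memory check, no I/O and no exception
            }
            if (inUse.test(candidate)) {
                sampled.add(candidate);
            }
        }
        return sampled;
    }

    public static void main(String[] args) {
        // Assumed scenario: ids 0..9 were allocated, nodes 3 and 5 were deleted in this run.
        Set<Long> deleted = new HashSet<>(Arrays.asList(3L, 5L));
        LongPredicate store = id -> !deleted.contains(id);
        List<Long> ids = sample(10, 5, deleted, store, new Random(), 1_000);
        System.out.println("sampled ids: " + ids);
    }
}
```

The set only helps for deletions the application itself knows about; ids that are unused for other reasons still cost a lookup, which is why the predicate is consulted after the set check.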
> > >> >>
> > >> >> cheers,
> > >> >> CG
> > >> >>
> > >> >> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
> > >> >> > Hi Anders,
> > >> >> >
> > >> >> > When you do getAllNodes, you're getting back an iterable so as
> > >> >> > you point out the sample isn't random (unless it was written randomly to
> > >> >> > disk). If you're prepared to take a scattergun approach and tolerate being
> > >> >> > disk-bound, then you can ask for getNodeById using a made-up ID and deal
> > >> >> > with the times when your IDs don't resolve.
> > >> >> >
> > >> >> > It'll be slow (since the chances of having the nodes in cache
> > >> >> > are low) but as random as your random ID generator.
> > >> >> >
> > >> >> > Jim
> > >> >> > _______________________________________________
> > >> >> > Neo4j mailing list
> > >> >> > User@lists.neo4j.org
> > >> >> > https://lists.neo4j.org/mailman/listinfo/user

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com