Re: [Neo4j] Sampling a Neo4j instance?
Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see that 'getConfig' is there, but does the cast to NeoStoreXaDataSource work as well?

Thanks.

> Date: Wed, 16 Nov 2011 21:40:32 +0200
> From: chris.gio...@neotechnology.com
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>
> No, GraphDatabaseService wisely hides those things away. I would suggest using instanceof and casting to EmbeddedGraphDatabase.
>
> cheers,
> CG
Re: [Neo4j] Sampling a Neo4j instance?
They have a common abstract class, AbstractGraphDatabase.

On 18 November 2011 at 09:46, Anders Lindström andli...@hotmail.com wrote:
> Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see that 'getConfig' is there, but does the cast to NeoStoreXaDataSource work as well?
>
> Thanks.
Re: [Neo4j] Sampling a Neo4j instance?
Thanks, Michael, for this creative idea. But is it possible to query for _all_ objects in a Lucene index? As I understand it, I need at least the name of an index key field, e.g. 'title', right? What I would like to do is basically query for * (without knowing _anything_ but the index name, i.e. not even the names of the index keys) and then have the results randomly sorted.

Also, when and on what collection is the actual sorting performed? It seems to me that an approach like this would sort all entries in the IndexHits first, and only then could we start going through them. For a large index this doesn't scale, as sorting is O(n log n). The StackOverflow link says "This doesn't consume any I/O when shuffling the results", but I cannot see how that can be. If the resulting IndexHits does not fit into memory, don't we need to go to disk for the shuffling too?

Lastly, thanks CG. I've implemented your suggestion and it seems to be working fine!

> From: michael.hun...@neotechnology.com
> Date: Thu, 10 Nov 2011 11:14:32 +0100
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>
> Probably by using an index for your nodes (it could be an auto-index) and then randomly shuffling the results? You can pass a Lucene query object or query string to index.query(queryOrQueryObject).
>
> Something like this:
> http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order
>
> Perhaps there is also some string-based Lucene query/sort syntax for it.
>
> Michael
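Editor's sketch on the sorting-cost question above: if the candidate ids can be held in an in-memory array, picking k of them at random does not require ordering the whole collection. A partial Fisher-Yates shuffle does k swaps and then stops. This is a generic technique and makes no claim about how Lucene or IndexHits behaves internally; the class name PartialShuffle and the plain long[] of ids are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Neo4j or Lucene.
class PartialShuffle {
    // Moves k uniformly chosen elements to the front of ids using k swaps,
    // then returns a copy of those k elements. The input array is reordered.
    static long[] pick(long[] ids, int k, Random random) {
        if (k < 0 || k > ids.length) {
            throw new IllegalArgumentException("k must be between 0 and ids.length");
        }
        for (int i = 0; i < k; i++) {
            // Choose a position in the not-yet-fixed tail [i, ids.length).
            int j = i + random.nextInt(ids.length - i);
            long tmp = ids[i];
            ids[i] = ids[j];
            ids[j] = tmp;
        }
        return Arrays.copyOf(ids, k);
    }
}
```

The shuffle itself costs O(k) once the array exists; collecting the ids into the array is still a full pass over the index hits, so this only removes the sorting cost, not the read cost.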
Re: [Neo4j] Sampling a Neo4j instance?
Chris, thanks again for your replies. I realize now that I don't have the 'getConfig' method: I'm writing a server plugin and I only get the GraphDatabaseService interface passed to my method, not an EmbeddedGraphDatabase. Is there an equivalent way of getting the highest node index through the interface?

Thanks.

> Date: Thu, 10 Nov 2011 12:01:31 +0200
> From: chris.gio...@neotechnology.com
> To: user@lists.neo4j.org
> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>
> Answers inline.
>
> [...]
>
> Another way to do it is
>
>     long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();
>
> [...]
Re: [Neo4j] Sampling a Neo4j instance?
No, GraphDatabaseService wisely hides those things away. I would suggest using instanceof and casting to EmbeddedGraphDatabase.

cheers,
CG

2011/11/16 Anders Lindström andli...@hotmail.com:
> Chris, thanks again for your replies. I realize now that I don't have the 'getConfig' method: I'm writing a server plugin and I only get the GraphDatabaseService interface passed to my method, not an EmbeddedGraphDatabase. Is there an equivalent way of getting the highest node index through the interface?
>
> Thanks.
Re: [Neo4j] Sampling a Neo4j instance?
Answers inline.

2011/11/9 Anders Lindström andli...@hotmail.com:
> Thanks to the both of you. I am very grateful that you took your time to put this into code -- how's that for community!
>
> I presume this way of getting 'highId' is constant in time? It looks rather messy though -- is it really the most straightforward way to do it?

This is the safest way to do it, as it takes crashes and HA cluster membership into consideration. Another way to do it is

    long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();

which can return the same value as the first if certain conditions are met. It is shorter and cast-free, but I'd still use the first way. getHighId() is a constant-time operation in both cases: it is just a field access, with an additional long comparison in the first case.

> I am thinking about how efficient this will be. As I understand it, the sampling misses come from deleted nodes that once were there. But if I remember correctly, Neo4j tries to reuse these unused node indices when new nodes are added. But is an unused node index _guaranteed_ to be used given that there is one, or could inserting another node result in increasing 'highId' even though some indices below it are not used?

During the lifetime of a Neo4j instance there is no id reuse for Nodes and Relationships. Deleted ids are saved, however, and will be reused the next time Neo4j starts. This means that if you deleted nodes 3 and 5 during run A, the first two nodes returned by createNode() on the next run will have ids 3 and 5, so highId will not change. Additionally, during run A, after deleting nodes 3 and 5, no new nodes would get the id 3 or 5. A crash (or improper shutdown) of the database will break this, however, since the ids-to-recycle will probably not make it to disk. So, in short, it is guaranteed that ids *won't* be reused in the same run, but it is not guaranteed that they will be reused between runs.

> My conclusion is that the sampling misses will increase with index usage sparseness and that we will have a high rate of sampling misses when we had many deletes and few insertions recently. Would you agree?

Yes, that is true, especially given the cost of the wasted I/O and of handling the exception. However, this cost can go down significantly if you keep a hash set of the ids of the nodes you have deleted and check that before asking for the node by id, instead of catching an exception. Persisting that set between runs would move you away from encapsulated Neo4j constructs, but it would also be more efficient.

> Thanks again.
>
> Regards,
> Anders
>
> > Date: Wed, 9 Nov 2011 19:30:36 +0200
> > From: chris.gio...@neotechnology.com
> > To: user@lists.neo4j.org
> > Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >
> > Hi, Backing Jim's algorithm with some code:
> > [...]

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
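Editor's sketch of the hash-set suggestion above: remember the ids the application has deleted, and reject those candidates before going to the store, so that a known miss costs a set lookup rather than wasted I/O and an exception. The class DeletedIdFilter and its method names are hypothetical, not Neo4j API, and the sketch assumes every recorded id lies below highId.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of Neo4j.
class DeletedIdFilter {
    private final Set<Long> deleted = new HashSet<Long>();

    // Call this wherever the application deletes a node.
    void recordDelete(long id) {
        deleted.add(id);
    }

    boolean isKnownDeleted(long id) {
        return deleted.contains(id);
    }

    // Returns a random id in [0, highId) that is not known to be deleted.
    // A lookup on the returned id can still fail, because deletes made
    // before this filter existed are not recorded here.
    long nextCandidate(long highId, Random random) {
        if (highId <= 0 || deleted.size() >= highId) {
            throw new IllegalStateException("no candidate ids left");
        }
        while (true) {
            // Clearing the sign bit keeps the value non-negative.
            long candidate = (random.nextLong() & Long.MAX_VALUE) % highId;
            if (!deleted.contains(candidate)) {
                return candidate;
            }
        }
    }
}
```

The set only lives as long as the process. Persisting it between runs, as the reply notes, is a separate step that this sketch leaves out.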
Re: [Neo4j] Sampling a Neo4j instance?
Probably by using an index for your nodes (it could be an auto-index) and then randomly shuffling the results? You can pass a Lucene query object or query string to index.query(queryOrQueryObject).

Something like this:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order

Perhaps there is also some string-based Lucene query/sort syntax for it.

Michael

On 10.11.2011 at 11:01, Chris Gioran wrote:
> Answers inline.
> [...]
[Neo4j] Sampling a Neo4j instance?
Hi,

I have looked through the archives and tried to find more information about how to sample the nodes of a Neo4j instance. As it seems, one way to go is to iterate using 'getAllNodes' and keep on sampling until you are happy with the sample size. However, this approach has the restriction that it is not random: you just get the first N nodes of the 'getAllNodes' iterator.

Is there an efficient way to do a random sampling of N nodes? (I believe one way is to iterate through _all_ results from 'getAllNodes' and pick among these randomly, but this is not efficient and scales pretty badly.)

If relevant, the sample will be used as input to a sort of clustering algorithm, which will then try to cluster similar semantic node types into different clusters (e.g., in the IMDb case, it can distinguish which nodes are movies and which are actors).

I intend to write my own server plugin to do this and then get the results from another application over the REST API. I feel that this can be kind of slow, though. Are there any alternatives for sending the data faster?

Thanks!

Regards,
Anders Lindström
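Editor's sketch of the "iterate through all results and pick among these randomly" option mentioned above, written as reservoir sampling (Algorithm R). This is a standard technique named here by the editor, not something proposed in the thread. It needs only one pass and O(k) memory, but it still reads every element, so it does not remove the full-scan cost the question is worried about. The class name ReservoirSampler is hypothetical, and any Iterable (for example the result of getAllNodes) could stand in for items.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Neo4j.
class ReservoirSampler {
    // One pass over items; each element ends up in the result with
    // probability k/n, where n is the total number of elements seen.
    static <T> List<T> sample(Iterable<T> items, int k, Random random) {
        List<T> reservoir = new ArrayList<T>(Math.max(k, 0));
        long seen = 0;
        for (T item : items) {
            seen++;
            if (reservoir.size() < k) {
                // Fill phase: the first k elements are kept unconditionally.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); replace only if it falls inside the reservoir.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

If the input holds fewer than k elements, the result simply contains all of them.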
Re: [Neo4j] Sampling a Neo4j instance?
Hi Anders,

When you do getAllNodes, you're getting back an iterable, so as you point out the sample isn't random (unless it was written randomly to disk).

If you're prepared to take a scattergun approach and tolerate being disk-bound, then you can ask for getNodeById using a made-up ID and deal with the times when your IDs don't resolve. It'll be slow (since the chances of having the nodes in cache are low) but as random as your random ID generator.

Jim
Re: [Neo4j] Sampling a Neo4j instance?
Hi,

Backing Jim's algorithm with some code:

    public static void main( String[] args )
    {
        long SAMPLE_SIZE = 1;
        EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
        // Determine the highest possible id for the node store
        long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
                .getXaDataSourceManager().getXaDataSource(
                        Config.DEFAULT_DATA_SOURCE_NAME ) )
                .getNeoStore().getNodeStore().getHighId();
        System.out.println( highId + " is the highest id" );
        long i = 0;
        long nextId;
        // Do the sampling
        Random random = new Random();
        while ( i < SAMPLE_SIZE )
        {
            nextId = Math.abs( random.nextLong() ) % highId;
            try
            {
                db.getNodeById( nextId );
                i++;
                System.out.println( "id " + nextId + " is there" );
            }
            catch ( NotFoundException e )
            {
                // NotFoundException is thrown when the node asked for is not in use
                System.out.println( "id " + nextId + " not in use" );
            }
        }
        db.shutdown();
    }

As already mentioned, this will be slow. Random jumps around the graph are not something caches can keep up with - unless your whole db fits in memory. But accessing random pieces of an on-disk file cannot be done much faster.

cheers,
CG
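A side note on the id-drawing step in the loop above: `Math.abs( random.nextLong() )` is negative in the one case where `nextLong()` returns `Long.MIN_VALUE`, and taking `% highId` is very slightly biased towards small ids. The following is a minimal sketch of the same rejection-sampling idea using `ThreadLocalRandom.nextLong(bound)` instead. The class `IdRejectionSampler`, the `maxAttempts` cap and the `inUse` predicate are all hypothetical; `inUse` stands in for the try/catch around `getNodeById` so that the sketch runs without a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongPredicate;

/** Hypothetical helper: draw random ids below highId and keep the ones in use. */
final class IdRejectionSampler {
    private IdRejectionSampler() {}

    /**
     * Draws ids uniformly from [0, highId) and collects those accepted by inUse.
     * Stops after sampleSize hits or after maxAttempts draws, whichever comes first,
     * so a very sparse id space cannot make the loop run forever.
     * The same id may be returned more than once.
     */
    static List<Long> sample(long highId, int sampleSize, long maxAttempts, LongPredicate inUse) {
        if (highId <= 0) {
            throw new IllegalArgumentException("highId must be positive");
        }
        List<Long> hits = new ArrayList<>();
        for (long attempt = 0; attempt < maxAttempts && hits.size() < sampleSize; attempt++) {
            // nextLong(bound) returns a value in [0, bound), never negative.
            long candidate = ThreadLocalRandom.current().nextLong(highId);
            if (inUse.test(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }
}
```

In a real run the predicate would wrap the lookup, returning false when `NotFoundException` is caught. Capping the number of attempts is a design choice added here, not something the posted code does.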
Re: [Neo4j] Sampling a Neo4j instance?
Thanks to both of you. I am very grateful that you took the time to put this into code -- how's that for community!

I presume this way of getting 'highId' is constant in time? It looks rather messy though -- is it really the most straightforward way to do it?

I am thinking about how efficient this will be. As I understand it, the sampling misses come from deleted nodes that were once there. But if I remember correctly, Neo4j tries to reuse these unused node indices when new nodes are added. But is an unused node index _guaranteed_ to be used given that there is one, or could inserting another node result in increasing 'highId' even though some indices below it are not used?

My conclusion is that the sampling misses will increase with the sparseness of index usage, and that we will have a high rate of sampling misses when there have been many deletes and few insertions recently. Would you agree?

Thanks again.

Regards,
Anders
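To put rough numbers on the miss rate discussed in this message, here is a back-of-the-envelope sketch. The symbols are introduced only for illustration: h is 'highId', m is the number of node ids below h that are currently in use, and N is the desired sample size. It assumes ids are drawn uniformly and independently from [0, h), so the same node may be drawn more than once.

```latex
p = \frac{m}{h}, \qquad
\Pr[\text{miss}] = 1 - \frac{m}{h}, \qquad
\mathbb{E}[\text{lookups needed for } N \text{ hits}] = \frac{N}{p} = \frac{N\,h}{m}
```

For example, with h = 1,000,000 and m = 250,000, three out of four lookups miss on average, and a sample of 1,000 nodes takes about 4,000 lookups.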