Re: [Neo4j] Sampling a Neo4j instance?

2011-11-18 Thread Anders Lindström

Would this work in HA mode too (i.e. HighlyAvailableGraphDatabase)? I can see 
that the 'getConfig' is there -- but does the cast to NeoStoreXaDataSource work 
as well?
Thanks.

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-18 Thread Mattias Persson
They have a common abstract superclass, AbstractGraphDatabase.

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-17 Thread Anders Lindström

Thanks Michael for this creative idea.

But is it possible to query for _all_ objects in a Lucene index? As I 
understand it, I need at least the name of an index key field, e.g. 'title', 
right? What I would like to do is basically query for * (without knowing 
_anything_ but the index name, i.e. not even names of index keys) and then have 
the results randomly sorted.

Also, when and on what collection is the actual sorting performed? It seems to 
me an approach like this would sort all entries in the IndexHits first, and 
only then can we start going through them. For a large index, this doesn't 
scale, as sorting is O(n log n). The StackOverflow link says "This doesn't 
consume any I/O when shuffling the results", but I cannot understand how that 
can be. What if the resulting IndexHits does not fit into memory? Wouldn't we 
then need to go to disk for shuffling too?
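
One way around sorting the whole result is reservoir sampling (Algorithm R), which is a different technique from the shuffle in the linked answer. The sketch below uses only plain Java collections; the class and method names are made up for illustration, and it assumes the hits can be consumed through an ordinary Iterator. It still reads every hit once, so it does not avoid the I/O of iterating, but it needs only O(k) memory and O(n) time instead of a full O(n log n) sort.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Neo4j or Lucene.
final class ReservoirSample {
    /**
     * Returns up to k items chosen uniformly at random from one pass
     * over the source. k must not be negative.
     */
    static <T> List<T> sample(Iterator<T> source, int k, Random random) {
        List<T> reservoir = new ArrayList<T>(k);
        long seen = 0;
        while (source.hasNext()) {
            T item = source.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot in [0, seen); the item survives with probability k / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, item);
                }
            }
        }
        return reservoir;
    }
}
```

Only the k sampled items are held in memory at any time, so the size of the full result set does not matter for the memory footprint.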


Lastly, thanks CG. I've implemented your suggestion and it seems to be working 
fine!

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-16 Thread Anders Lindström

Chris, thanks again for your replies.
I realize now that I don't have the 'getConfig' method -- I'm writing a server 
plugin and I only get the GraphDatabaseService interface passed to my method, 
not an EmbeddedGraphDatabase. Is there an equivalent way of getting the highest 
node index through the interface?
Thanks.

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-16 Thread Chris Gioran
No, GraphDatabaseService wisely hides those things away. I would
suggest using instanceof and casting to EmbeddedGraphDatabase.
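
A minimal sketch of the instanceof-and-cast pattern suggested above. The types below are hypothetical stand-ins that only borrow names from this thread so that the example compiles on its own; `highNodeId()` is an invented placeholder for the real `getConfig()` call chain, not a Neo4j method.

```java
// Stand-in types for illustration only; these are NOT the real Neo4j classes.
final class HighIdAccess {
    interface GraphDatabaseService { }

    static final class EmbeddedGraphDatabase implements GraphDatabaseService {
        private final long highId;
        EmbeddedGraphDatabase(long highId) { this.highId = highId; }
        // Placeholder for the getConfig()... chain shown elsewhere in the thread.
        long highNodeId() { return highId; }
    }

    // Some other implementation that does not expose the internals.
    static final class OtherDatabase implements GraphDatabaseService { }

    /** Returns the high id, or -1 if the implementation does not expose it. */
    static long highIdOrMinusOne(GraphDatabaseService db) {
        if (db instanceof EmbeddedGraphDatabase) {
            // Safe: the instanceof check guarantees the cast succeeds.
            return ((EmbeddedGraphDatabase) db).highNodeId();
        }
        return -1L;
    }
}
```

Checking before casting lets a server plugin fall back gracefully instead of failing with a ClassCastException when it is handed a different implementation.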

cheers,
CG

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-10 Thread Chris Gioran
Answers inline.

2011/11/9 Anders Lindström andli...@hotmail.com:

 Thanks to the both of you. I am very grateful that you took your time to put 
 this into code -- how's that for community!
 I presume this way of getting 'highId' is constant in time? It looks rather 
 messy though -- is it really the most straightforward way to do it?

This is the safest way to do it, since it takes crashes and HA cluster
membership into account.

Another way to do it is

long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE
).getHighId();

which can return the same value as the first approach, if some conditions
are met. It is shorter and cast-free, but I'd still use the first way.

getHighId() is a constant time operation for both ways described - it
is just a field access, with an additional long comparison for the
first case.

 I am thinking about how efficient this will be. As I understand it, the 
 sampling misses come from deleted nodes that once were there. But if I 
 remember correctly, Neo4j tries to reuse these unused node indices when new 
 nodes are added. But is an unused node index _guaranteed_ to be used given 
 that there is one, or could inserting another node result in increasing 
 'highId' even though some indices below it are not used?

During the lifetime of a Neo4j instance there is no id reuse for Nodes
and Relationships - deleted ids are saved however and will be reused
the next time Neo4j starts. This means that if during run A you
deleted nodes 3 and 5, the first two nodes returned by createNode() on
the next run will have ids 3 and 5 - so highId will not change.
Additionally, during run A, after deleting nodes 3 and 5, no new nodes
would have the id 3 or 5. A crash (or improper shutdown) of the
database will break this however, since the ids-to-recycle will
probably not make it to disk.

So, in short, it is guaranteed that ids *won't* be reused in the same
run but not guaranteed to be reused between runs.
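
The behaviour described above can be illustrated with a toy model. This is only a sketch of the rules as stated in this message, written with plain Java collections; it is not Neo4j's id generator, and every name in it is made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the id rules described above; not Neo4j code.
final class ToyIdAllocator {
    private long highId = 0;                                          // next never-used id
    private final Deque<Long> reusable = new ArrayDeque<Long>();      // ids recycled from an earlier run
    private final Deque<Long> freedThisRun = new ArrayDeque<Long>();  // ids deleted during this run

    /** Hands out a recycled id if one is available, otherwise a fresh one. */
    long allocate() {
        Long recycled = reusable.pollFirst();
        return recycled != null ? recycled : highId++;
    }

    /** Records a deleted id; it is NOT reused during the current run. */
    void free(long id) { freedThisRun.addLast(id); }

    /** Clean shutdown and start: freed ids become reusable. */
    void cleanRestart() {
        reusable.addAll(freedThisRun);
        freedThisRun.clear();
    }

    /** Crash and recovery: the freed-id list is assumed to be lost. */
    void crashRestart() { freedThisRun.clear(); }

    long highId() { return highId; }
}
```

In this model, deleting ids 3 and 5 and then allocating within the same run yields a fresh id, while after `cleanRestart()` the next two allocations return 3 and 5 and `highId()` stays unchanged.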

 My conclusion is that the sampling misses will increase with index usage 
 sparseness and that we will have a high rate of sampling misses when we had 
 many deletes and few insertions recently. Would you agree?

Yes, that is true, especially given the cost of the wasted I/O and
of handling the exception. However, this cost can go down
significantly if you keep a hash set for the ids of nodes you have
deleted and check that before asking for the node by id, instead of
catching an exception. Persisting that between runs would move you
away from encapsulated Neo4j constructs and would also be more
efficient.
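
A minimal sketch of the hash-set idea, using plain Java only. The class and method names are hypothetical, the set is assumed to contain only ids below highId, and the place where a real caller would ask Neo4j for the node is marked with a comment. The second method gives the expected number of probes when no such set is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative helper; not part of Neo4j.
final class DeletedAwareSampler {
    /** Draws sampleSize ids (with replacement) from [0, highId) that are not in deletedIds. */
    static List<Long> sample(long highId, Set<Long> deletedIds, int sampleSize, Random random) {
        if (highId <= 0 || deletedIds.size() >= highId) {
            throw new IllegalArgumentException("no live ids to sample");
        }
        List<Long> picked = new ArrayList<Long>(sampleSize);
        while (picked.size() < sampleSize) {
            // Mask the sign bit so the candidate is never negative.
            long candidate = (random.nextLong() & Long.MAX_VALUE) % highId;
            if (deletedIds.contains(candidate)) {
                continue; // known miss: skipped without touching the store
            }
            picked.add(candidate); // a real caller would look the node up by id here
        }
        return picked;
    }

    /** Expected number of random probes without the set: sampleSize * highId / liveIds. */
    static double expectedProbes(int sampleSize, long highId, long liveIds) {
        return sampleSize * ((double) highId / liveIds);
    }
}
```

The set only knows about deletions it was told about, so a lookup can still miss for ids freed before the set was started; the exception handler in the sampling loop remains necessary as a fallback.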

 Thanks again.
 Regards,Anders

 Date: Wed, 9 Nov 2011 19:30:36 +0200
 From: chris.gio...@neotechnology.com
 To: user@lists.neo4j.org
 Subject: Re: [Neo4j] Sampling a Neo4j instance?

 Hi,

 Backing Jim's algorithm with some code:

     public static void main( String[] args )
     {
         long SAMPLE_SIZE = 1;
         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
                 "path/to/db/" );
         // Determine the highest possible id for the node store
         long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
                 .getXaDataSourceManager().getXaDataSource(
                 Config.DEFAULT_DATA_SOURCE_NAME )
                 ).getNeoStore().getNodeStore().getHighId();
         System.out.println( highId + " is the highest id" );
         long i = 0;
         long nextId;

         // Do the sampling
         Random random = new Random();
         while ( i < SAMPLE_SIZE )
         {
             // Mask the sign bit: Math.abs( Long.MIN_VALUE ) is still negative
             nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
             try
             {
                 db.getNodeById( nextId );
                 i++;
                 System.out.println( "id " + nextId + " is there" );
             }
             catch ( NotFoundException e )
             {
                 // NotFoundException is thrown when the requested node is not in use
                 System.out.println( "id " + nextId + " not in use" );
             }
         }
         db.shutdown();
     }

 As already mentioned, this will be slow. Random jumps around the
 graph are not something caches can keep up with - unless your whole db
 fits in memory. But accessing random pieces of an on-disk file cannot
 be done much faster.

 cheers,
 CG

 On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber j...@neotechnology.com wrote:
  Hi Anders,
 
  When you do getAllNodes, you're getting back an iterable, so as you point 
  out the sample isn't random (unless it was written randomly to disk). If 
  you're prepared to take a scattergun approach and tolerate being 
  disk-bound, then you can ask for getNodeById using a made-up ID and deal 
  with the times when your IDs don't resolve.
 
  It'll be slow (since the chances of having the nodes in cache are low) but 
  as random as your random ID generator.
 
  Jim
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Sampling a Neo4j instance?

2011-11-10 Thread Michael Hunger
Probably by using an index for your nodes (could be an auto-index).

And then randomly shuffling the results? You can pass a Lucene 
query object or query string to index.query(queryOrQueryObject).

Something like this:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order

Perhaps there is also some string-based Lucene query/sort syntax for it.

Michael

Am 10.11.2011 um 11:01 schrieb Chris Gioran:

 Answers inline.
 
 2011/11/9 Anders Lindström andli...@hotmail.com:
 
 Thanks to the both of you. I am very grateful that you took your time to put 
 this into code -- how's that for community!
 I presume this way of getting 'highId' is constant in time? It looks rather 
 messy though -- is it really the most straightforward way to do it?
 
 This is the safest way to do it, that takes into consideration crashes
 and HA cluster membership.
 
 Another way to do it is
 
 long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE
 ).getHighId();
 
 which can return the same value with the first, if some conditions are
 met. It is shorter and cast-free but i'd still use the first way.
 
 getHighId() is a constant time operation for both ways described - it
 is just a field access, with an additional long comparison for the
 first case.
 
 I am thinking about how efficient this will be. As I understand it, the 
 sampling misses come from deleted nodes that once was there. But if I 
 remember correctly, Neo4j tries to reuse these unused node indices when new 
 nodes are added. But is an unused node index _guaranteed_ to be used given 
 that there is one, or could inserting another node result in increasing 
 'highId' even though some indices below it are not used?
 
 During the lifetime of a Neo4j instance there is no id reuse for Nodes
 and Relationships - deleted ids are saved however and will be reused
 the next time Neo4j starts. This means that if during run A you
 deleted nodes 3 and 5, the first two nodes returned by createNode() on
 the next run will have ids 3 and 5 - so highId will not change.
 Additionally, during run A, after deleting nodes 3 and 5, no new nodes
 would have the id 3 or 5. A crash (or improper shutdown) of the
 database will break this however, since the ids-to-recycle will
 probably not make it to disk.
 
 So, in short, it is guaranteed that ids *won't* be reused in the same
 run but not guaranteed to be reused between runs.
 
 My conclusion is that the sampling misses will increase with index usage 
 sparseness and that we will have a high rate of sampling misses when we 
 had many deletes and few insertions recently. Would you agree?
 
 Yes, that is true, especially given the cost of the wasted I/O and
 of handling the exception. However, this cost can go down
 significantly if you keep a hash set for the ids of nodes you have
 deleted and check that before asking for the node by id, instead of
 catching an exception. Persisting that between runs would move you
 away from encapsulated Neo4j constructs and would also be more
 efficient.
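
 A minimal sketch of the hash set idea, kept independent of Neo4j so it runs on its own. The class DeletedIdSkipSampler, the maxAttempts guard and the nodeExists callback are assumptions made for the illustration; in a real plugin nodeExists would wrap getNodeById() in a try/catch for NotFoundException. Math.floorMod requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.LongPredicate;

// Hypothetical helper: skips ids known to be deleted before touching the store.
final class DeletedIdSkipSampler
{
    /**
     * Draws ids from [0, highId) until sampleSize hits are collected or
     * maxAttempts candidates have been tried. Sampling is with replacement,
     * so the same id can appear more than once.
     */
    static List<Long> sample( long highId, int sampleSize, long maxAttempts,
            Set<Long> knownDeleted, LongPredicate nodeExists, Random random )
    {
        List<Long> hits = new ArrayList<>();
        for ( long attempt = 0; attempt < maxAttempts && hits.size() < sampleSize; attempt++ )
        {
            // floorMod keeps the candidate non-negative for any long input
            long candidate = Math.floorMod( random.nextLong(), highId );
            if ( knownDeleted.contains( candidate ) )
            {
                continue; // cheap in-memory rejection, no store access
            }
            if ( nodeExists.test( candidate ) )
            {
                hits.add( candidate );
            }
        }
        return hits;
    }
}
```

 The set only helps for deletes it knows about, so it has to be updated wherever nodes are deleted; ids that were never recorded still cost a store lookup.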
 
 Thanks again.
 Regards,Anders
 

[Neo4j] Sampling a Neo4j instance?

2011-11-09 Thread Anders Lindström

Hi,
I have looked through the archives and tried to find more information about how 
to sample the nodes of a Neo4j instance.
As it seems, one way to go is to iterate using 'getAllNodes' and keep on 
sampling until you are happy with the sample size. However, there is a 
restriction with this approach in that it is not random -- you just get the 
first N nodes of the 'getAllNodes' iterator. Is there an efficient way to do a 
random sampling of N nodes? (I believe one way is to iterate through _all_ 
results from 'getAllNodes' and pick among these randomly -- but this is not 
efficient and scales pretty badly.)
If relevant, the sample will be used as input to a sort of clustering algorithm 
which will then try to cluster similar semantic node types into different 
clusters (e.g., in the IMDb case, it can distinguish which nodes are movies and 
which are actors).
I intend to write my own server plugin to do this and then get the results from 
another application over the REST API. I feel that this can be kind of slow 
though. Are there any alternatives to send data faster?
Thanks!
Regards,Anders Lindström
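
For reference, the "iterate through all results and pick randomly" approach mentioned above is usually done with reservoir sampling (Algorithm R), a technique not named in this thread. The sketch below assumes a plain java.util.Iterator as a stand-in for the iterator behind 'getAllNodes'; the class name ReservoirSampler is made up for the illustration. It keeps memory at N items but still reads every element once, so it does not remove the full-scan cost.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper: uniform sample of up to n items from a single pass.
final class ReservoirSampler
{
    static <T> List<T> sample( Iterator<T> items, int n, Random random )
    {
        List<T> reservoir = new ArrayList<>( n );
        long seen = 0;
        while ( items.hasNext() )
        {
            T item = items.next();
            seen++;
            if ( reservoir.size() < n )
            {
                // Fill the reservoir with the first n items
                reservoir.add( item );
            }
            else
            {
                // Keep the new item with probability n / seen,
                // replacing a uniformly chosen slot
                long slot = (long) ( random.nextDouble() * seen );
                if ( slot < n )
                {
                    reservoir.set( (int) slot, item );
                }
            }
        }
        return reservoir;
    }
}
```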
  
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Sampling a Neo4j instance?

2011-11-09 Thread Jim Webber
Hi Anders,

When you do getAllNodes, you're getting back an iterable so as you point out 
the sample isn't random (unless it was written randomly to disk). If you're 
prepared to take a scattergun approach and tolerate being disk-bound, then you 
can ask for getNodeById using a made-up ID and deal with the times when your 
IDs don't resolve.

It'll be slow (since the chances of having the nodes in cache are low) but as 
random as your random ID generator.

Jim


Re: [Neo4j] Sampling a Neo4j instance?

2011-11-09 Thread Chris Gioran
Hi,

Backing Jim's algorithm with some code:

// Needs java.util.Random plus the Neo4j classes EmbeddedGraphDatabase,
// NeoStoreXaDataSource, Config and NotFoundException.
public static void main( String[] args )
{
    long SAMPLE_SIZE = 1;
    EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
    // Determine the highest possible id for the node store
    long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
            .getXaDataSourceManager().getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME ) )
            .getNeoStore().getNodeStore().getHighId();
    System.out.println( highId + " is the highest id" );
    long i = 0;
    long nextId;

    // Do the sampling
    Random random = new Random();
    while ( i < SAMPLE_SIZE )
    {
        // Unsigned shift keeps the value non-negative;
        // Math.abs( Long.MIN_VALUE ) would still be negative
        nextId = ( random.nextLong() >>> 1 ) % highId;
        try
        {
            db.getNodeById( nextId );
            i++;
            System.out.println( "id " + nextId + " is there" );
        }
        catch ( NotFoundException e )
        {
            // NotFoundException is thrown when the requested node is not in use
            System.out.println( "id " + nextId + " not in use" );
        }
    }
    db.shutdown();
}

As already mentioned, this will be slow. Random jumps around the
graph are not something caches can keep up with - unless your whole db
fits in memory. But accessing random pieces of an on-disk file cannot
be done much faster.

cheers,
CG



Re: [Neo4j] Sampling a Neo4j instance?

2011-11-09 Thread Anders Lindström

Thanks to the both of you. I am very grateful that you took your time to put 
this into code -- how's that for community!
I presume this way of getting 'highId' is constant in time? It looks rather 
messy though -- is it really the most straightforward way to do it?
I am thinking about how efficient this will be. As I understand it, the 
sampling misses come from deleted nodes that were once there. But if I 
remember correctly, Neo4j tries to reuse these unused node indices when new 
nodes are added. But is an unused node index _guaranteed_ to be reused, given 
that there is one, or could inserting another node result in increasing 
'highId' even though some indices below it are not used?
My conclusion is that the sampling misses will increase with index usage 
sparseness and that we will have a high rate of sampling misses when we had 
many deletes and few insertions recently. Would you agree?
Thanks again.
Regards,Anders
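
The link between sparseness and misses can be made concrete with a back-of-the-envelope derivation. The symbols are introduced here for illustration: $H$ is the value of highId, $U$ is the number of ids currently in use, and $N$ is the desired sample size. The derivation assumes candidate ids are drawn uniformly and independently from $[0, H)$, so the number of attempts per hit follows a geometric distribution.

```latex
p = \frac{U}{H}, \qquad
\mathbb{E}[\text{attempts per hit}] = \frac{1}{p} = \frac{H}{U}, \qquad
\mathbb{E}[\text{misses for } N \text{ hits}] = N\left(\frac{H}{U} - 1\right)
```

As a worked example under those assumptions, with $H = 1{,}000{,}000$ and $U = 250{,}000$ the hit probability is $p = 0.25$, so each sampled node costs 4 lookups on average, 3 of which are misses.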
