Re: [Neo4j] Sampling a Neo4j instance?

Mattias Persson Thu, 17 Nov 2011 05:55:13 -0800

I don't think Lucene supports that. You could instead make sure that
everything you index is also indexed under some kind of "all" field, which
would duplicate your index but make this possible.
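
To make the "all" field idea concrete, here is a minimal sketch. It is a
plain-Java, in-memory stand-in rather than the Neo4j or Lucene API: the
class AllFieldIndexSketch, its methods and the key name "all" are made up
for illustration. Each entry is stored twice, once under its real key and
once under the catch-all key, so everything can be listed without knowing
any real key name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical in-memory stand-in for a key/value index; NOT the Neo4j or Lucene API.
class AllFieldIndexSketch
{
    static final String ALL_KEY = "all"; // assumed name for the catch-all field

    // key -> value -> ids of the entities indexed under that key/value pair
    private final Map<String, Map<String, Set<Long>>> entries = new HashMap<>();

    // Index the entity under the requested key and again under the catch-all key.
    void add( long id, String key, String value )
    {
        put( id, key, value );
        put( id, ALL_KEY, value );
    }

    private void put( long id, String key, String value )
    {
        entries.computeIfAbsent( key, k -> new HashMap<>() )
               .computeIfAbsent( value, v -> new LinkedHashSet<>() )
               .add( id );
    }

    // Every id stored under a key, regardless of value.
    Set<Long> allUnder( String key )
    {
        Set<Long> ids = new LinkedHashSet<>();
        for ( Set<Long> hits : entries.getOrDefault( key, Collections.emptyMap() ).values() )
        {
            ids.addAll( hits );
        }
        return ids;
    }

    // Everything in the index, in random order, without knowing any real key name.
    List<Long> everythingShuffled( Random random )
    {
        List<Long> ids = new ArrayList<>( allUnder( ALL_KEY ) );
        Collections.shuffle( ids, random );
        return ids;
    }
}
```

The cost is the one mentioned above: every entry is written twice, so the
index roughly doubles in size.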

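On the sorting concern quoted below: if only a fixed-size sample is needed,
one alternative to shuffling the whole result is reservoir sampling. This is
a different technique from the shuffle discussed in the thread, sketched
here in plain Java under the assumption that the hits can be consumed as an
ordinary Iterator; ReservoirSampleSketch and sample() are hypothetical
names. It makes a single pass, keeps only k items in memory and does no
sorting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Neo4j or Lucene.
class ReservoirSampleSketch
{
    // Returns up to k items chosen uniformly at random from an iterator of unknown length.
    static <T> List<T> sample( Iterator<T> hits, int k, Random random )
    {
        List<T> reservoir = new ArrayList<>( k );
        long seen = 0;
        while ( hits.hasNext() )
        {
            T hit = hits.next();
            seen++;
            if ( reservoir.size() < k )
            {
                // The first k items fill the reservoir directly
                reservoir.add( hit );
            }
            else
            {
                // Pick a slot in [0, seen); the new item replaces an old one
                // only if the slot falls inside the reservoir (probability k / seen)
                long slot = (long) ( random.nextDouble() * seen );
                if ( slot < k )
                {
                    reservoir.set( (int) slot, hit );
                }
            }
        }
        return reservoir;
    }
}
```

Every hit still has to be read once, so this avoids the sort and the memory
for the full result but not the I/O of iterating the index.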

On 17 November 2011 at 11:00, Anders Lindström <[email protected]> wrote:

>
> Thanks Michael for this creative idea.
>
> But is it possible to query for _all_ objects in a Lucene index? As I
> understand it, I need at least the name of an index key field, e.g.
> 'title', right? What I would like to do is basically query for * (without
> knowing _anything_ but the index name, i.e. not even names of index keys)
> and then have the results randomly sorted.
>
> Also, when and on what collection is the actual sorting performed? It
> seems to me an approach like this would sort all entries in the IndexHits
> first, and then we can start going through them. For a large index, this
> doesn't scale as sorting is O(n log n). On the StackOverflow link it says
> "This doesn't consume any I/O when shuffling the results.", but I cannot
> understand how this is. What if the resulting IndexHits does not fit into
> memory, then we need to go to disk for shuffling too?
>
>
> Lastly, thanks CG. I've implemented your suggestion and it seems to be
> working fine!
>
> > From: [email protected]
> > Date: Thu, 10 Nov 2011 11:14:32 +0100
> > To: [email protected]
> > Subject: Re: [Neo4j] Sampling a Neo4j instance?
> >
> > Probably using an index for your nodes (could be an auto-index).
> >
> > And then using a random shuffling of the results? You can pass in a
> > Lucene query object or query string to index.query(queryOrQueryObject).
> >
> > Something like this:
> > http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order
> >
> > Perhaps there is also some string-based Lucene query/sort syntax for it.
> >
> > Michael
> >
> > On 10.11.2011 at 11:01, Chris Gioran wrote:
> >
> > > Answers inline.
> > >
> > > 2011/11/9 Anders Lindström <[email protected]>:
> > >>
> > >> Thanks to both of you. I am very grateful that you took your time
> > >> to put this into code -- how's that for community!
> > >> I presume this way of getting 'highId' is constant in time? It looks
> > >> rather messy though -- is it really the most straightforward way to do it?
> > >
> > > This is the safest way to do it, one that takes crashes and HA
> > > cluster membership into consideration.
> > >
> > > Another way to do it is
> > >
> > >     long highId = db.getConfig().getIdGeneratorFactory()
> > >             .get( IdType.NODE ).getHighId();
> > >
> > > which can return the same value as the first, if some conditions are
> > > met. It is shorter and cast-free but I'd still use the first way.
> > >
> > > getHighId() is a constant time operation for both ways described - it
> > > is just a field access, with an additional long comparison for the
> > > first case.
> > >
> > >> I am thinking about how efficient this will be. As I understand it,
> > >> the "sampling misses" come from deleted nodes that were once there. But
> > >> if I remember correctly, Neo4j tries to reuse these unused node indices
> > >> when new nodes are added. But is an unused node index _guaranteed_ to be
> > >> used given that there is one, or could inserting another node result in
> > >> increasing 'highId' even though some indices below it are not used?
> > >
> > > During the lifetime of a Neo4j instance there is no id reuse for Nodes
> > > and Relationships - deleted ids are saved however and will be reused
> > > the next time Neo4j starts. This means that if during run A you
> > > deleted nodes 3 and 5, the first two nodes returned by createNode() on
> > > the next run will have ids 3 and 5 - so highId will not change.
> > > Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> > > would have the id 3 or 5. A crash (or improper shutdown) of the
> > > database will break this however, since the ids-to-recycle will
> > > probably not make it to disk.
> > >
> > > So, in short, it is guaranteed that ids *won't* be reused in the same
> > > run but not guaranteed to be reused between runs.
> > >
> > >> My conclusion is that the "sampling misses" will increase with index
> > >> usage sparseness and that we will have a high rate of "sampling misses"
> > >> when we have recently had many deletes and few insertions. Would you agree?
> > >
> > > Yes, that is true, especially given the cost of the "wasted" I/O and
> > > of handling the exception. However, this cost can go down
> > > significantly if you keep a hash set for the ids of nodes you have
> > > deleted and check that before asking for the node by id, instead of
> > > catching an exception. Persisting that between runs would move you
> > > away from encapsulated Neo4j constructs and would also be more
> > > efficient.
> > >
> > >> Thanks again.
> > >> Regards, Anders
> > >>
> > >>> Date: Wed, 9 Nov 2011 19:30:36 +0200
> > >>> From: [email protected]
> > >>> To: [email protected]
> > >>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
> > >>>
> > >>> Hi,
> > >>>
> > >>> Backing Jim's algorithm with some code:
> > >>>
> > >>>     public static void main( String[] args )
> > >>>     {
> > >>>         long SAMPLE_SIZE = 10000;
> > >>>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
> > >>>                 "path/to/db/" );
> > >>>         // Determine the highest possible id for the node store
> > >>>         long highId = ( (NeoStoreXaDataSource)
> > >>>
> db.getConfig().getTxModule().getXaDataSourceManager().getXaDataSource(
> > >>>                 Config.DEFAULT_DATA_SOURCE_NAME )
> > >>> ).getNeoStore().getNodeStore().getHighId();
> > >>>         System.out.println( highId + " is the highest id" );
> > >>>         long i = 0;
> > >>>         long nextId;
> > >>>
> > >>>         // Do the sampling
> > >>>         Random random = new Random();
> > >>>         while ( i < SAMPLE_SIZE )
> > >>>         {
> > >>>             nextId = Math.abs( random.nextLong() ) % highId;
> > >>>             try
> > >>>             {
> > >>>                 db.getNodeById( nextId );
> > >>>                 i++;
> > >>>                 System.out.println( "id " + nextId + " is there" );
> > >>>             }
> > >>>             catch ( NotFoundException e )
> > >>>             {
> > >>>                 // NotFoundException is thrown when the node asked
> is not in use
> > >>>                 System.out.println( "id " + nextId + " not in use" );
> > >>>             }
> > >>>         }
> > >>>         db.shutdown();
> > >>>     }
> > >>>
> > > >>> As already mentioned, this will be slow. Random jumps around the
> > > >>> graph are not something caches can keep up with - unless your whole
> > > >>> db fits in memory. But accessing random pieces of an on-disk file
> > > >>> cannot be done much faster.
> > >>>
> > >>> cheers,
> > >>> CG
> > >>>
> > > >>> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <[email protected]> wrote:
> > > >>>> Hi Anders,
> > > >>>>
> > > >>>> When you do getAllNodes, you're getting back an iterable so as you
> > > >>>> point out the sample isn't random (unless it was written randomly to
> > > >>>> disk). If you're prepared to take a scattergun approach and tolerate
> > > >>>> being disk-bound, then you can ask for getNodeById using a made-up ID
> > > >>>> and deal with the times when your IDs don't resolve.
> > > >>>>
> > > >>>> It'll be slow (since the chances of having the nodes in cache are
> > > >>>> low) but as random as your random ID generator.
> > >>>>
> > >>>> Jim
> > >>>> _______________________________________________
> > >>>> Neo4j mailing list
> > >>>> [email protected]
> > >>>> https://lists.neo4j.org/mailman/listinfo/user
> > >>>>



-- 
Mattias Persson, [[email protected]]
Hacker, Neo Technology
www.neotechnology.com