OK I have some code and benchmarks for this solution up on a Google Code 
project here: http://code.google.com/p/graphdb-load-tester/

The project exists to address the performance challenges I have encountered 
when dealing with large graphs. It  uses all of the Wikipedia links as a test 
dataset and a choice of graph databases (most of which use Lucene BTW).
The test data is essentially 130 million edges representing links between pages 
e.g.  Communism->Russia.
To load the data all of the graph databases have to translate user-defined keys 
like "Russia" into an internally-generated node ID using a service that looks 
like this: 
        interface KeyService
        { 
                //Returns existing nodeid or -1 if is not already in store
                public long getGraphNodeId(String udk);

                //Adds a new record - assumption is client has checked user 
defined key (udk) is not stored already using getGraphNodeId
                public void put(String udk, long graphNodeId);
        }

This is a challenge on a dataset of this size. I tried using a Lucene-based 
implementation for this service with the following optimisations:
1) a Bloomfilter to quickly "know what we don't know"
2) an LRUCache to hold on to commonly referenced vertices e.g the Wikipdedia 
article for "United States"
3) a hashmap representing the unflushed state of Lucene's IndexWriter to avoid 
the need for excessive flushing with NRT reader etc

The search/write performance showed the familiar saw-toothing as the Lucene 
index grew in size and merge operations kicked in.

The KVStore implementation I wrote attempts to tackle this problem using a 
fundamentally different form of index. The results from the KVStore runs show 
it was twice as fast as this  Lucene solution and maintains constant 
performance without the saw toothing effect.

Benchmark figures are here: http://goo.gl/VQ027
The KVStore source code is here: http://goo.gl/ovkop and the Lucene 
implementation I compare against is also in the project.

Cheers
Mark





---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to