Hi
TDB has a couple of command line utilities to do the initial load of RDF data
(i.e. build the TDB indexes) into an empty TDB store. These commands are
tdbloader [1] and tdbloader2 [2].

tdbloader is a Java program; it uses the classes in the
com.hp.hpl.jena.tdb.store.bulkloader package. It builds the node table and the
primary indexes first, then builds the secondary indexes sequentially
(although different policies exist; see the BuilderSecondaryIndexes
implementations). tdbloader can also be used incrementally (with a little care
in relation to your stats.opt file, if you use one).
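
As an aside, here is a minimal sketch of incremental loading via the plain
Jena/TDB API rather than the bulk loader code path (the store location and
file name are made up for the example):

  import com.hp.hpl.jena.query.Dataset;
  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.tdb.TDB;
  import com.hp.hpl.jena.tdb.TDBFactory;

  public class IncrementalLoad {
      public static void main(String[] args) {
          // Hypothetical location of an existing TDB store and an extra file.
          Dataset dataset = TDBFactory.createDataset("/data/tdb");
          Model model = dataset.getDefaultModel();

          // Read additional triples into the default graph; TDB updates its
          // node table and indexes as the triples are added.
          model.read("file:update.nt", "N-TRIPLES");

          // Flush changes to disk and release the dataset.
          TDB.sync(dataset);
          dataset.close();
      }
  }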

tdbloader2 [3,4] is a bash script with some Java code (where the magic
happens). tdbloader2 builds just the node table and text files for the input
data using node ids. It then uses UNIX sort to sort the text files and produce
text files ready to be streamed into BPlusTreeRewriter to generate B+Tree
indexes [5]. tdbloader2 cannot be used to incrementally load new data into an
existing TDB database.
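
To make the sort-then-stream idea concrete, here is a small, self-contained
sketch (not tdbloader2's actual code) of turning triples of node ids into
fixed-width text records and sorting them in the three triple index orders
(SPO, POS, OSP); tdbloader2 does the equivalent on disk, delegating the sort
to the UNIX sort command:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public class IndexOrderSort {
      // A triple as three node ids (as produced by the node table).
      // Fixed-width hex so a plain lexicographic sort gives numeric order.
      static String record(long a, long b, long c) {
          return String.format("%016x %016x %016x", a, b, c);
      }

      public static void main(String[] args) {
          long[][] triples = { {3, 1, 7}, {2, 1, 7}, {3, 5, 2} };

          List<String> spo = new ArrayList<String>();
          List<String> pos = new ArrayList<String>();
          List<String> osp = new ArrayList<String>();
          for (long[] t : triples) {
              spo.add(record(t[0], t[1], t[2]));   // S P O
              pos.add(record(t[1], t[2], t[0]));   // P O S
              osp.add(record(t[2], t[0], t[1]));   // O S P
          }

          // tdbloader2 sorts the corresponding text files with UNIX sort;
          // the sorted runs are then streamed into the B+Tree writer.
          Collections.sort(spo);
          Collections.sort(pos);
          Collections.sort(osp);

          for (String r : spo) {
              System.out.println(r);
          }
      }
  }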

The more RAM you have the better. RAM is becoming cheaper and cheaper (unless
you use VMs in the cloud :-)). tdbloader serves us well for datasets from a
few million triples|quads up to about 100M triples|quads, depending on how
much RAM you have. tdbloader2 serves us well for datasets of a few hundred
million triples|quads but, depending on how much RAM you have, its performance
can drop as your dataset approaches 1B or more triples|quads.

What could we do to scale TDB loading further?

We (@ Talis) use MapReduce in our ingestion pipeline to validate|clean the
data or to build the indexes we need. However, we are still using
tdbloader|tdbloader2, and that is becoming the bottleneck of our ingestion
pipeline (and an increasing cost) for datasets on the order of billions of
triples|quads.
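
As an illustration of the validate|clean step (not our actual pipeline code),
a Hadoop mapper along these lines keeps only the input lines that look like
well-formed N-Triples statements; the class name and the crude syntactic check
are placeholders, and a real job would use a proper RDF parser:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Emits only lines that pass a rough N-Triples sanity check.
  public class ValidateTriplesMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
          String line = value.toString().trim();
          if (line.isEmpty() || line.startsWith("#")) {
              return; // skip blanks and comments
          }
          // Crude placeholder check: statement ends with '.' and has at
          // least subject, predicate, object and the final dot.
          if (line.endsWith(".") && line.split("\\s+").length >= 4) {
              context.write(value, NullWritable.get());
          }
      }
  }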

Is it possible to use MapReduce to generate 'standard' TDB indexes?

The answer to this question is still open. We have shared a prototype (a.k.a.
tdbloader3 [6,7]) which seems to provide a positive answer to that question.
However, there is still a big problem with the current implementation: the
first MapReduce job must use a single reducer, which is a very bad practice
and quickly becomes your new bottleneck. In other words, it is a MapReduce
antipattern against scalability. So, in practice, we still don't have a way to
build our TDB indexes using MapReduce.
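
The single reducer is there because node ids must form one global sequence (in
TDB they are offsets into the file holding the RDF node values), so every
distinct node has to pass through one task. A stripped-down sketch of such a
reducer (not the actual tdbloader3 code; a plain counter stands in for the
real byte offsets) shows why it cannot be parallelised as it stands:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Assigns a sequential id to each distinct RDF node. Because the counter
  // is a single global sequence, the job is forced to run with exactly one
  // reducer: every node goes through this one task, which is the bottleneck.
  public class AssignNodeIdsReducer
          extends Reducer<Text, NullWritable, Text, LongWritable> {

      private long nextId = 0;

      @Override
      protected void reduce(Text node, Iterable<NullWritable> values,
              Context context) throws IOException, InterruptedException {
          context.write(node, new LongWritable(nextId));
          nextId++;
      }
  }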

However, it might be possible to split the first MapReduce job into two
MapReduce jobs (with multiple reducers each). This would remove the single
reducer. The fact that the last MapReduce job uses 9 reducers (one for each
TDB index) would probably become the next scalability bottleneck, but we hope
this will, in practice, not be a problem for us, at least for a while. Another
alternative would be to (optionally) have a different node table
implementation where node ids are not offsets into a file where the RDF node
values are stored. In this scenario, we would use hash codes as node ids, so
the node-to-node-id mapping is direct and does not involve any index. But we
would need indexes to map a node id to a hash, and the hash to a file offset
from which to retrieve the node value. The first alternative seems more
promising to me.
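
For the first alternative, the key observation is that a partition only needs
to know how many ids (or, with real TDB node ids, how many bytes) the
partitions before it will consume. The first job counts nodes per partition, a
prefix sum over the counts gives each partition its starting offset, and the
second job assigns final ids as local position plus that offset, with multiple
reducers in both jobs. A small illustrative sketch of the offset computation
(not tdbloader3 code, and using counts rather than byte sizes):

  // Given per-partition node counts from the first job, compute the starting
  // id (offset) of each partition. Partition i then assigns ids
  // offset[i], offset[i] + 1, ... independently of the other partitions.
  public class PartitionOffsets {
      static long[] offsets(long[] countsPerPartition) {
          long[] offsets = new long[countsPerPartition.length];
          long running = 0;
          for (int i = 0; i < countsPerPartition.length; i++) {
              offsets[i] = running;   // first id handed out by partition i
              running += countsPerPartition[i];
          }
          return offsets;
      }

      public static void main(String[] args) {
          // Example: three partitions saw 5, 3 and 7 distinct nodes.
          long[] offsets = offsets(new long[] { 5, 3, 7 });
          // offsets == { 0, 5, 8 }
          for (long o : offsets) {
              System.out.println(o);
          }
      }
  }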

TODO:

 - Add MiniMRCluster so that it is easy for developers to run tests with 
multiple reducers on a laptop.
 - Split the first MapReduce job into two: one to produce offset values for 
each partition, the other to generate data files with correct ids for 
subsequent jobs.
 - Build the node table concatenating output files from the MapReduce jobs 
above.
 - Test on a cluster with a large (> 1B triples|quads) dataset.

Help welcome. :-)

Paolo

 [1] http://openjena.org/wiki/TDB/Commands#tdbloader
 [2] http://openjena.org/wiki/TDB/Commands#tdbloader2
 [3] http://seaborne.blogspot.com/2010/12/tdb-bulk-loader-version-2.html
 [4] http://seaborne.blogspot.com/2010/12/performance-benchmarks-for-tdb-loader.html
 [5] http://seaborne.blogspot.com/2010/12/repacking-btrees.html
 [6] https://github.com/castagna/tdbloader3
 [7] http://markmail.org/thread/573frdoq2rqgm2gg

