On 20/01/12 16:45, Benson Margulies wrote:
We're trying to write rather an embarrassment of riches of RDF to TDB
by just making ordinary model API calls on the default graph's model.
It bogs down. We're about to retry with a really big -Xmx, but I
wonder if we'd be better advised to fill up a memory model and tip
that into TDB instead?
Benson,
Is this loading an empty DB or one with existing data?
There are two modes of loading: bulk loading from empty and incremental.
Even if you point the bulk loader at a store with existing data, it
will use the incremental approach. The bulk loader trickery assumes a
total rewrite of all the tables. It would be possible to do a bit
better when loading so much data that recalculating the indexes from
scratch beats incremental adds.
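For reference, a minimal sketch of driving the bulk loader from Java
against an empty location (assuming the TDBLoader.loadModel entry point;
"DB" and "data.nt" are placeholders, and the tdbloader script does the
same job from the command line):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.tdb.TDBLoader;

    public class BulkLoad {
        public static void main(String[] args) {
            // Empty location: the loader can build each index in turn
            // instead of falling back to incremental adds.
            Dataset dataset = TDBFactory.createDataset("DB");
            Model model = dataset.getDefaultModel();
            // Assumed entry point: bulk load a file into the default graph.
            TDBLoader.loadModel(model, "data.nt");
            dataset.close();
        }
    }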
Incremental means adding each triple, with all the associated indexing,
one by one. That means the storage access jumps all over the place and
all the indexes are active. B+Trees end up doing less-than-ideal
copy-splitting.
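That incremental path is what plain model API calls give you; staging
the data in a memory model and tipping it in is only a batched form of
the same thing. A rough sketch of the two, with "DB" and the file names
as placeholders:

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class IncrementalLoad {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("DB");
            Model tdbModel = dataset.getDefaultModel();

            // Direct incremental: each statement parsed goes straight into
            // the TDB-backed model, touching every index as it arrives.
            tdbModel.read("file:data.nt", "N-TRIPLE");

            // Memory model first, then tip it in: the adds are batched on
            // the Jena side, but each triple still goes through the same
            // incremental indexing path underneath.
            Model mem = ModelFactory.createDefaultModel();
            mem.read("file:more-data.nt", "N-TRIPLE");
            tdbModel.add(mem);

            dataset.close();
        }
    }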
The bulk loader works on one index at a time: that means more of the RAM
is devoted to caching it (better cache intensity). It also rebuilds
B+Trees to reduce tree thrashing, an unfortunate feature of B+Trees.
A large -Xmx is bad on 64-bit machines. The main caching is outside the
heap, so more heap means less memory for the OS to cache the
memory-mapped files.
If TDB used hash-based ids (it uses incremental ids), better parallel
processing and better database-merge effects would be possible.
Paolo has been looking at this - both hash and incremental ids. Paolo -
is there anything in your MapReduce suite to do bulk incremental loading?
Andy