On 20/01/12 16:45, Benson Margulies wrote:
We're trying to write rather an embarrassment of riches of RDF to TDB
by just making ordinary model API calls on the default graph's model.
It bogs down. We're about to retry with a really big -Xmx, but I
wonder if we'd be better advised to fill up a memory model and tip
that into TDB instead?
Benson,
Is this loading an empty DB or one with existing data?
There are two modes of loading: bulk loading from empty and incremental.
Even if you point the bulk loader at a store with existing data, it
will use the incremental approach. The bulk loader trickery assumes a
total rewrite of all the tables. It would be possible to do a bit
better when loading so much data that recalculating the indexes from
scratch beats incremental adds.
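For reference, a minimal sketch of driving the bulk loader from Java
against an empty location (assuming the TDBLoader.loadModel entry point;
"DB" and "data.nt" are placeholders, and the tdbloader script does the
same job from the command line):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.tdb.TDBFactory;
    import com.hp.hpl.jena.tdb.TDBLoader;

    public class BulkLoad {
        public static void main(String[] args) {
            // Empty location: the loader can build each index in turn
            // instead of falling back to incremental adds.
            Dataset dataset = TDBFactory.createDataset("DB");
            Model model = dataset.getDefaultModel();
            // Assumed entry point: bulk load a file into the default graph.
            TDBLoader.loadModel(model, "data.nt");
            dataset.close();
        }
    }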
Incremental means adding each triple, with all the associated indexing,
one by one. That means the storage access jumps all over the place and
all the indexes are active. B+Trees end up doing less-than-ideal
copy-splitting.
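That incremental path is what plain model API calls give you; staging
the data in a memory model and tipping it in is only a batched form of
the same thing. A rough sketch of the two, with "DB" and the file names
as placeholders:

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class IncrementalLoad {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("DB");
            Model tdbModel = dataset.getDefaultModel();

            // Direct incremental: each statement parsed goes straight into
            // the TDB-backed model, touching every index as it arrives.
            tdbModel.read("file:data.nt", "N-TRIPLE");

            // Memory model first, then tip it in: the adds are batched on
            // the Jena side, but each triple still goes through the same
            // incremental indexing path underneath.
            Model mem = ModelFactory.createDefaultModel();
            mem.read("file:more-data.nt", "N-TRIPLE");
            tdbModel.add(mem);

            dataset.close();
        }
    }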
The bulk loader works on one index at a time: that means more of the RAM
is devoted to caching it (better cache intensity). It also rebuilds
B+Trees to reduce tree thrashing, an unfortunate feature of B+Trees.
A large -Xmx is bad on 64-bit machines. The main caching is outside the
heap, so more heap means less memory for the OS to cache the
memory-mapped files.
If TDB used hash-based ids (it uses incremental ids), better parallel
processing and better database-merge effects would be possible.
Paolo has been looking at this - both hash and incremental ids. Paolo -
is there anything in your MapReduce suite to do bulk incremental loading?
Andy