Re: Jena TDB: Limitations of organizing large collections via named graphs

Andy Seaborne Tue, 27 Oct 2020 04:31:02 -0700

Hi Fidan,

I guess that you are loading all the files inside a single transaction?


How much heap size are you using? (Don't allocate the whole of free RAM).

TDB1 uses heap space for uncommitted transactions and it also buffers a few committed transactions because, usually, it is better to do the final work on them a few at a time.


There is a control for the buffering:
TransactionActionManager.QueueBatchSize

TDB2 does not use heap space in this way and does not have limitations on the size of transactions. A heap of 2-4G is fine - the main work at scale happens in the indexes which are not in the heap.


>>   dataset.getNamedModel(namedGraph).add(model);

So you seem to have the data in memory in "model" as well, so both the TDB(1) space and the model are taking up heap.

You can stream the data in by having a transaction and calling Model.add (or DatasetGraph.add(Triple) if you end up working in triples, not models+statements. Your choice - it isn't a factor here).


A different approach might be:

Convert your resources to RDF and write these to disk, possibly adding the named graph (so TriG or N-Quads format), then use a bulk loader: tdbloader for TDB1 (the TDB1 tdbloader2 is only useful for very large datasets) or tdb2.tdbloader for TDB2.

They are faster than loading into a "live" dataset - they work by manipulating the internal structures directly.


For TDB1, they have to start with an empty database.

For TDB2, it (there is one bulk loader, with options) works on partially loaded databases.

As to which option to use for the "--loader" argument to tdb2.tdbloader, it depends. The default is good; if you have several hundred million and up, try --loader=parallel if it is a big server.
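
To make the "write to disk, then bulk load" route concrete, here is a minimal sketch in plain Java (no Jena dependency) that writes one N-Quads line per record, so no Model is ever held in memory. The class name NQuadsSketch, the record layout {subject, predicate, literal} and the example.org IRIs are made up for illustration. It assumes subjects, predicates and graph names are already valid absolute IRIs and that every object is a plain string literal; real data would likely need IRI objects, datatypes and language tags as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;

/** Hypothetical helper: streams records out as N-Quads, one line per quad. */
final class NQuadsSketch {

    /** Escape backslash, double quote, LF and CR for use inside an N-Quads literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Format one quad. Assumption: s, p and g are valid absolute IRIs. */
    static String quad(String s, String p, String literal, String g) {
        return "<" + s + "> <" + p + "> \"" + escapeLiteral(literal) + "\" <" + g + "> .";
    }

    /**
     * Write every record as a quad in the given named graph and return the count.
     * Each record is {subject IRI, predicate IRI, literal value}.
     * Records are pulled from the iterator one at a time, so memory use stays flat.
     */
    static long write(Writer out, String graph, Iterator<String[]> records) {
        long n = 0;
        try {
            while (records.hasNext()) {
                String[] r = records.next();
                out.write(quad(r[0], r[1], r[2], graph));
                out.write('\n');
                n++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }
}
```

For real use, the Writer would be a UTF-8 file writer (for example from java.nio.file.Files.newBufferedWriter), one output file per source file with that file's graph name. The resulting .nq files can then be handed to the bulk loader, along the lines of "tdb2.tdbloader --loc <database-dir> file1.nq file2.nq", where the database directory and file names are placeholders.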


    Andy




On 27/10/2020 08:26, Fidan Limani wrote:

Recently, I am dealing with a large collection of resources that need to be 
converted to RDF. The original collection contains a set of files, each containing 
> 4 M resources on average. In order to keep the provenance, I thought having 
named graphs with the same name to organize the RDF collection would be nice.

However, after half of the collection is stored, even on a powerful server, the 
memory does not seem to be enough for the store operation in the TDB. Consider 
the following statement:

  dataset.getNamedModel(namedGraph).add(model);

In it, we retrieve the current RDF Model of triples and add another collection of triples 
to it. After a while, once the storage reaches a certain point, the operation 
"hangs" due to a heap space exception.

(Finally) The question, then, is: is there a (more streaming-like) way to 
store larger collections via named graphs? My current workaround consists in 
splitting the original collection into smaller, more manageable collections 
that the server can handle and store in named graphs.
