Hi Fidan,
I guess that you are loading all the files inside one transaction?
How much heap size are you using? (Don't allocate the whole of free RAM).
TDB1 uses heap space for uncommitted transactions and it also buffers a
few committed transactions because, usually, it is better to do the
final work on them a few at a time.
There is a control for the buffering:
TransactionManager.QueueBatchSize
TDB2 does not use heap space in this way and does not have limitations
on the size of transactions. A heap of 2-4G is fine - the main work at
scale happens in the indexes which are not in the heap.
>> dataset.getNamedModel(namedGraph).add(model);
So you seem to have the data in memory in "model" as well so both the
TDB(1) space and model are taking up heap.
You can stream the data in by having a transaction and calling Model.add
(or DatasetGraph.add(Triple) if you end up working in triples rather than
models+statements; your choice, it isn't a factor here).
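As a rough sketch of that streaming pattern (untested against your setup): it assumes TDB2, one input file per named graph, and made-up names for the database directory ("DB2") and the graph URI prefix. It needs the Jena TDB2 and RIOT jars on the classpath. The parser writes straight into the stored named graph, so no separate in-memory "model" is built.

```java
import java.nio.file.Paths;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.tdb2.TDB2Factory;

class StreamIntoNamedGraphs {
    public static void main(String[] args) {
        // "DB2" is a placeholder database directory, not a name from this thread.
        Dataset dataset = TDB2Factory.connectDataset("DB2");

        for (String file : args) {
            // Hypothetical naming scheme: one named graph per input file.
            String graphName = "http://example.org/graph/"
                    + Paths.get(file).getFileName();

            // One write transaction per file keeps each transaction bounded.
            dataset.begin(ReadWrite.WRITE);
            try {
                Model target = dataset.getNamedModel(graphName);
                // Parse the file directly into the stored graph.
                RDFDataMgr.read(target, file);
                dataset.commit();
            } finally {
                // After commit() this just closes the transaction;
                // if commit() was never reached, the changes are abandoned.
                dataset.end();
            }
        }
    }
}
```

Committing once per file is only one possible choice; with TDB1 a single very large file would still sit in the heap until its transaction commits.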
A different approach might be:
Convert your resources to RDF and write these to disk, possibly adding
the named graph (so TriG or N-Quads format), then use a bulk loader:
tdbloader for TDB1 (tdbloader2 is only useful for very large datasets)
or tdb2.tdbloader for TDB2.
They are faster than loading into a "live" dataset - they work by
manipulating the internal structures directly.
For TDB1, they have to start with an empty database.
For TDB2, the bulk loader (there is one, with options) works on
partially loaded databases.
As to which option to give the "--loader" argument of tdb2.tdbloader, it
depends. The default is good; if you have several hundred million triples
and up, try --loader=parallel if it's a big server.
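The "write to disk first" step could look roughly like the sketch below. It uses only the Java standard library, emits one N-Quads line per statement so the named graph travels with the data, and handles only plain string literals. The class name, file name and example.org IRIs are invented for illustration, and IRIs are assumed to be valid and absolute already.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class NQuadsExport {
    /** Escape the characters that may not appear raw inside an N-Quads literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Build one N-Quads line: subject, predicate, plain literal, graph. */
    static String quad(String s, String p, String literal, String g) {
        return "<" + s + "> <" + p + "> \"" + escapeLiteral(literal)
                + "\" <" + g + "> .";
    }

    public static void main(String[] args) throws IOException {
        Path out = Paths.get("collection.nq"); // placeholder output file
        String graph = "http://example.org/graph/file1"; // hypothetical graph name
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // Write each statement as soon as it is produced; nothing accumulates in memory.
            w.write(quad("http://example.org/r/1", "http://example.org/title",
                    "First resource", graph));
            w.write('\n');
            w.write(quad("http://example.org/r/2", "http://example.org/title",
                    "Second \"quoted\" resource", graph));
            w.write('\n');
        }
    }
}
```

The resulting file could then be handed to the bulk loader, for example:

    tdb2.tdbloader --loc=DB2 --loader=parallel collection.nq

where DB2 and collection.nq are again placeholder names.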
Andy
On 27/10/2020 08:26, Fidan Limani wrote:
Recently, I have been dealing with a large collection of resources that need to be
converted to RDF. The original collection contains a set of files, each containing
more than 4 M resources on average. In order to keep the provenance, I thought
having named graphs with the same name to organize the RDF collection would be nice.
However, after half of the collection is stored, even on a powerful server, the
memory does not seem to be enough for the store operation in the TDB. Consider
the following statement:
dataset.getNamedModel(namedGraph).add(model);
In it, we retrieve the current RDF Model of triples and add another collection of triples
to it. After a while, once the storage reaches a certain point, the operation
"hangs" due to a heap space exception.
(Finally) The question, then, is: is there a more streaming-like way to
store larger collections via named graphs? My current workaround consists in
splitting the original collection into smaller, more manageable collections
that the server can handle and store in named graphs.