Re: Jena TDB: Limitations of organizing large collections via named graphs

Fidan Limani Tue, 27 Oct 2020 08:02:02 -0700

One more note:

Ideally, given an input collection, the implementation should convert it to RDF 
and generate data dumps, which the implementing parties then could use for 
their use cases.
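
A minimal sketch of such a dump step, assuming the converted data is available as plain IRI and literal strings. The class and method names (NQuadsDump, literalQuad, iriQuad, write) are hypothetical, IRIs are assumed to be valid already, and only plain string literals are handled. In a Jena-based converter one would more likely rely on Jena's own RIOT writers (for example RDFDataMgr) than on hand-formatted lines; this only illustrates the one-quad-per-line shape of an N-Quads dump that carries the named graph.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical helper: formats statements as N-Quads lines, one quad per line. */
class NQuadsDump {

    /** Escapes backslash, double quote, LF and CR, which may not appear raw in a literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Formats a quad whose object is a plain string literal. */
    static String literalQuad(String s, String p, String literal, String graph) {
        return "<" + s + "> <" + p + "> \"" + escape(literal) + "\" <" + graph + "> .";
    }

    /** Formats a quad whose object is an IRI. */
    static String iriQuad(String s, String p, String o, String graph) {
        return "<" + s + "> <" + p + "> <" + o + "> <" + graph + "> .";
    }

    /** Appends one quad line to the dump; nothing is accumulated in memory. */
    static void write(Writer out, String quadLine) throws IOException {
        out.write(quadLine);
        out.write('\n');
    }
}
```

A dump written this way (say, dump.nq) could then be handed to the bulk loader, e.g. tdb2.tdbloader --loc <database directory> dump.nq.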


On 2020/10/27 12:27:35, Fidan Limani <[email protected]> wrote: 
> Thanks for the prompt reply, Andy.
> 
> I am doing batch-type storage: after a certain number of resources have been 
> converted and stored, I issue a transaction. Based on your comment, the heap 
> size is more than enough (80 GB), but I guess the issue remains with 
> (programmatically) using TDB2.
> 
> The relevant classes for storage are org.apache.jena.system.Txn, 
> org.apache.jena.tdb2.DatabaseMgr, and org.apache.jena.tdb2.TDB2Factory, 
> yet some TDB1 behavior still seems to show. Could that be the case?
> 
> Just as additional information, the following method is invoked to store 
> Model instances to TDB2:
> 
> public void storeLinkInstance(String namedGraph, Model model) {
>     Txn.executeWrite(dataset, () -> {
>         // Add to existing named graph
>         if (dataset.containsNamedModel(namedGraph)) {
>             /* Model tempModel = dataset.getNamedModel(namedGraph); // .add(model)
>                dataset.addNamedModel(namedGraph, tempModel); */
>             dataset.getNamedModel(namedGraph).add(model);
>         } else {
>             // Add the named graph for the first time
>             dataset.addNamedModel(namedGraph, model);
>         }
>     });
> }
> 
> 
> Finally, when I use the TDB2 loader (from the command line) to load all these 
> smaller parts, that works just fine, and I am also able to run Jena Fuseki on 
> top of the resulting TDB database; I only face the issue when programmatically 
> converting and storing the resources.
> 
> Thanks
> 
> On 2020/10/27 11:30:22, Andy Seaborne <[email protected]> wrote: 
> > Hi Fidan,
> > 
> > I guess that you are loading all the files inside a single transaction?
> > 
> > How much heap size are you using? (Don't allocate the whole of free RAM).
> > 
> > TDB1 uses heap space for uncommitted transactions and it also buffers a 
> > few committed transactions because, usually, it is better to do the 
> > final work on them a few at a time.
> > 
> > There is a control for the buffering:
> > TransactionActionManager.QueueBatchSize
> > 
> > TDB2 does not use heap space in this way and does not have limitations 
> > on the size of transactions. A heap of 2-4G is fine - the main work at 
> > scale happens in the indexes which are not in the heap.
> > 
> >  >>   dataset.getNamedModel(namedGraph).add(model);
> > 
> > So you seem to have the data in memory in "model" as well so both the 
> > TDB(1) space and model are taking up heap.
> > 
> > You can stream the data in by having a transaction and calling Model.add 
> > (or DatasetGraph.add(Triple) if you end up working in triples rather than 
> > models+statements; your choice, it isn't a factor here).
> > 
> > A different approach might be:
> > 
> > Convert your resources to RDF and write these to disk, possibly adding 
> > the named graph (so TriG or N-Quads format), then use a bulk loader: 
> > tdbloader for TDB1 (TDB1's tdbloader2 is only useful for very large 
> > datasets) or tdb2.tdbloader for TDB2.
> > 
> > They are faster than loading into a "live" dataset - they work by 
> > manipulating the internal structures directly.
> > 
> > For TDB1, they have to start with an empty database.
> > 
> > For TDB2, the bulk loader (there is one, with options) also works on 
> > partially loaded databases.
> > 
> > As to which option to pick for the "--loader" argument to tdb2.tdbloader, 
> > it depends. The default is good; if you have several hundred million 
> > triples and up, try --loader=parallel if it is a big server.
> > 
> >      Andy
> > 
> > 
> > 
> > 
> > On 27/10/2020 08:26, Fidan Limani wrote:
> > > I have recently been dealing with a large collection of resources that 
> > > need to be converted to RDF. The original collection contains a set of 
> > > files, each containing > 4 M resources on average. In order to keep the 
> > > provenance, I thought it would be nice to organize the RDF collection 
> > > into named graphs with the same names as the files.
> > > 
> > > However, after half of the collection is stored, even on a powerful 
> > > server, the memory does not seem to be enough for the store operation in 
> > > the TDB. Consider the following statement:
> > >       dataset.getNamedModel(namedGraph).add(model);
> > > 
> > > In it, we retrieve the current RDF Model of triples and add another 
> > > collection of triples to it. After a while, once the storage reaches a 
> > > certain point, the operation "hangs" due to a heap space exception.
> > > 
> > > (Finally) The question, then, is: is there a more streaming-like way 
> > > to store larger collections via named graphs? My current workaround 
> > > consists of splitting the original collection into smaller, more 
> > > manageable collections that the server can handle and store in named 
> > > graphs.
> > > 
> > 
>
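
The batch-per-transaction pattern discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: BatchedStore, storeInBatches and commitBatch are hypothetical names, and the commitBatch callback stands in for whatever a caller does inside one write transaction (for instance a Txn.executeWrite call like the one in storeLinkInstance above). The point is that items are pulled lazily from an iterator, so only one batch is held on the heap at a time.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: feeds a lazily produced stream of items to a store in fixed-size batches. */
class BatchedStore {

    /**
     * Drains the iterator, handing at most batchSize items at a time to commitBatch.
     * Each call to commitBatch is meant to correspond to one write transaction
     * (an assumption of this sketch; the callback decides what actually happens).
     * Returns the number of batches committed.
     */
    static <T> int storeInBatches(Iterator<T> items, int batchSize, Consumer<List<T>> commitBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        while (items.hasNext()) {
            batch.add(items.next());
            if (batch.size() == batchSize) {
                commitBatch.accept(batch);
                // Start a fresh list so the committed batch can be garbage collected.
                batch = new ArrayList<>(batchSize);
                batches++;
            }
        }
        // Commit the final, possibly smaller, batch.
        if (!batch.isEmpty()) {
            commitBatch.accept(batch);
            batches++;
        }
        return batches;
    }
}
```

With Jena on the classpath, the callback could wrap the calls already shown in this thread, for example Txn.executeWrite(dataset, () -> batch.forEach(m -> dataset.getNamedModel(namedGraph).add(m))), where the batch holds the converted Model instances.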
