Thanks, Andy!

A few notes on your comments:

- With the “or” in the sentence I was wondering whether TDB1 was (somehow) 
still being used, thus causing the exception. That’s why I mentioned the Jena 
packages used for storage.
The dataset descriptions contain common metadata you would typically find for 
publications (title, authors, keywords, etc.), except for the dataset 
“description” string literals, which at times can be long.

- Yes, the Model instance is in-memory: I add 50 K dataset descriptions, 
converted to RDF, to it, and then I issue a TDB write transaction. After that, 
I empty the in-memory model, and continue with the rest of the collection.

- Reading directly to TDB: If I understand you correctly, what prevents me 
from doing this is that I must first convert every record from the input 
collection to RDF, and then store the resulting model in TDB. In other words, 
the source collection is not in RDF. Once the collection is converted to RDF, 
loading it into TDB2 is quite smooth (and efficient).
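
The batching described in the notes above (fill an in-memory buffer, write it out, empty it, continue) can be sketched as follows. This is only an illustration under stated assumptions: the class name BatchBuffer and the "sink" callback are hypothetical and stand in for "store this batch inside one TDB write transaction"; nothing here is Jena API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper: collects items and hands them to a sink in
 * fixed-size batches, so that only one batch is held in memory at a time.
 */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes automatically when the batch is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the buffered items to the sink and empties the buffer. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so the sink never sees later changes
        pending.clear();
    }
}
```

A final flush() call after the last record is needed, because the last batch is usually smaller than the batch size.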

What I changed, however, was how I store the Model instance in TDB2. For some 
reason, I had this "map" approach, whereby I would first retrieve the contents 
of a named graph, add the latest Model "batch", and then store it back to the 
named graph. After you pointed out the heap cost of doing it this way, I now 
write directly to the named graph, without first retrieving its content, as 
follows:
    dataset.addNamedModel(namedGraph, model);
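
For context, a minimal sketch of how one batch might be stored inside a single write transaction. The class and method names (NamedGraphStore, storeBatch) are hypothetical. The sketch appends through the view returned by getNamedModel(...), as in the earlier version of the method quoted below, rather than calling addNamedModel; whether addNamedModel appends to or replaces an existing graph is something I would check against the Jena version in use.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.system.Txn;

/** Hypothetical wrapper around a (TDB2-backed) Dataset. */
final class NamedGraphStore {
    private final Dataset dataset;

    NamedGraphStore(Dataset dataset) {
        this.dataset = dataset;
    }

    /**
     * Appends the in-memory batch to the named graph in one write
     * transaction, then empties the batch so its heap can be reclaimed.
     */
    void storeBatch(String namedGraph, Model batch) {
        Txn.executeWrite(dataset, () ->
            // getNamedModel returns a view of the database;
            // add() writes the statements through that view.
            dataset.getNamedModel(namedGraph).add(batch));
        batch.removeAll(); // clear the in-memory statements before the next batch
    }
}
```

Assuming a database directory such as "DB2" (a made-up location), the dataset could be obtained with TDB2Factory.connectDataset("DB2") and passed to the constructor.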

As per your comment (2-4 GB of heap typically required for TDB2 operations), 
I lowered the heap allocation on the server to 10 GB and now the application 
completes without any issues. It still takes quite some time to complete, but 
the collection is over 15 GB (compressed), so it takes a while.

I will report back if there is anything else I find useful for the process.

Kind regards and thanks for the helpful feedback,
Fidan




On 2020/10/28 11:35:08, Andy Seaborne <[email protected]> wrote: 
> 
> 
> On 27/10/2020 12:27, Fidan Limani wrote:
> > Thanks for the prompt reply, Andy.
> > 
> > I am doing batch-type storage: after a certain number of resources has been 
> > converted and stored, I issue a transaction. Based on your comment, the heap 
> > size is quite enough - 80 GB
> 
> Do leave space for the OS file system cache. Otherwise the indexes have 
> no cache space (that is not in the heap).
> 
> > , but I guess the issue remains with (programmatically) using the TDB 2.
> 
> > 
> > The relevant packages for storage are org.apache.jena.system.Txn; 
> > org.apache.jena.tdb2.DatabaseMgr; and org.apache.jena.tdb2.TDB2Factory, but 
> > yet some TDB 1 behavior seems to show, or?
> 
> "or?" - don't understand.
> 
> If you get OOME or heap CPU death, then maybe the issue isn't in TDB1 or 2.
> 
> The code seems to read everything into memory, then add it to TDB.
> 
> 
> Does your data contain, for example, many large literals?
> 
> > 
> > Just as additional information, the following method is invoked to store 
> > Model instances to TDB2:
> > 
> 
> I'm guessing here but is "model" in-memory and you read your data into it?
> 
> Can you read data straight into TDB instead?
> 
> > public void storeLinkInstance(String namedGraph, Model model) {
> >          Txn.executeWrite(dataset, ()->{
> >              // Add to existing named graph
> >              if (dataset.containsNamedModel(namedGraph)){
> >                  /* Model tempModel = dataset.getNamedModel(namedGraph); // 
> > .add(model)
> 
>                     // This model "m" is a view of the database
>                     // it does not store anything itself.
>                     Model m = dataset.getNamedModel(namedGraph);
>                     RDFDataMgr.read(m, "filename");
> 
> 
> >                  dataset.addNamedModel(namedGraph, tempModel); */
> >                  dataset.getNamedModel(namedGraph).add(model);
> >              } else {
> >                  // Add the named graph for the first time
> >                  dataset.addNamedModel(namedGraph, model);
> >              }
> >          });
> >      }
> > 
> > 
> > Finally, when I use TDB2 loader (from the command line) to load all these 
> > smaller parts, that works just fine, and I am also able to use Jena Fuseki 
> > on top of the resulting TDB, but I face the issue when programmatically 
> > converting and storing the resources.
> > 
> > Thanks
> 
> 
> > One more note:
> > 
> > Ideally, given an input collection, the implementation should convert it to 
> > RDF and generate data dumps, which the implementing parties then could use 
> > for their use cases.
> 
> The idea of converting to files and reading those files into TDB2 fits well 
> with that requirement.
> 
> 
> 
> > 
> > On 2020/10/27 11:30:22, Andy Seaborne <[email protected]> wrote:
> >> Hi Fidan,
> >>
> >> I guess that you are loading all the files inside a transaction?
> >>
> >> How much heap size are you using? (Don't allocate the whole of free RAM).
> >>
> >> TDB1 uses heap space for uncommitted transactions and it also buffers a
> >> few committed transactions because, usually, it is better to do the
> >> final work on them a few at a time.
> >>
> >> There is a control for the buffering:
> >> TransactionActionManager.QueueBatchSize
> >>
> >> TDB2 does not use heap space in this way and does not have limitations
> >> on the size of transactions. A heap of 2-4G is fine - the main work at
> >> scale happens in the indexes which are not in the heap.
> >>
> >>   >>   dataset.getNamedModel(namedGraph).add(model);
> >>
> >> So you seem to have the data in memory in "model" as well so both the
> >> TDB(1) space and model are taking up heap.
> >>
> >> You can stream the data in by having a transaction and calling Model.add
> >> (or DatasetGraph.add(Triple) if you end up working in triples not
> >> models+statements. Your choice - it isn't a factor here.).
> >>
> >> A different approach might be:
> >>
> >> Convert your resources to RDF and write these to disk, possibly
> >> adding the named graph (so TriG or N-Quads format), then use a bulk
> >> loader: tdbloader for TDB1 (TDB1 tdbloader2 is only useful for very large
> >> datasets) or tdb2.tdbloader for TDB2.
> >>
> >> They are faster than loading into a "live" dataset - they work by
> >> manipulating the internal structures directly.
> >>
> >> For TDB1, they have to start with an empty database.
> >>
> >> For TDB2, it (there is one bulkloader, with options) works on partially
> >> loaded databases.
> >>
> >> As to which option to use for the "--loader" argument to tdb2.tdbloader, it
> >> depends. The default is good; if you have several 100's of millions and
> >> up, try --loader=parallel if it's a big server.
> >>
> >>       Andy
> >>
> >>
> >>
> >>
> >> On 27/10/2020 08:26, Fidan Limani wrote:
> >>> Recently, I am dealing with a large collection of resources that need to 
> >>> be converted to RDF. The original collection contains a set of files, 
> >>> each containing > 4 M resources on average. In order to keep the 
> >>> provenance, I thought having named graphs with the same name to organize 
> >>> the RDF collection would be nice.
> >>>
> >>> However, after half of the collection is stored, even on a powerful 
> >>> server, the memory does not seem to be enough for the store operation in 
> >>> the TDB. Consider the following statement:
> >>>        dataset.getNamedModel(namedGraph).add(model);
> >>
> >>
> >>
> >>>
> >>> In it, we retrieve the current RDF Model of triples and add another 
> >>> collection of triples to it. After a while, once the storage reaches a 
> >>> certain point, the operation "hangs" due to heap space exception.
> >>>
> >>> (Finally) The question, then, is: is there a way (a more streaming-like) 
> >>> to store larger collections via named graphs? My current workaround 
> >>> consists in splitting the original collection into smaller, more 
> >>> manageable collections that the server can handle and store in named 
> >>> graphs.
> >>>
> >>
> 
