One more note: Ideally, given an input collection, the implementation should convert it to RDF and generate data dumps, which the implementing parties could then use for their own use cases.
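
To make the "data dump" idea a bit more concrete, here is a minimal, dependency-free sketch of the dump step. It assumes the converted resources arrive as IRI-only triples and that each input file maps to one named graph. The names NQuadsDump, toNQuad and writeDump, as well as the example.org IRIs, are made up for illustration; literals and blank nodes are deliberately left out, and in a real setup Jena's own RIOT writers would probably be the better choice for serialization.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: streams IRI-only triples into an N-Quads dump. */
final class NQuadsDump {

    /** Formats one quad; all four terms are assumed to be absolute IRIs. */
    static String toNQuad(String s, String p, String o, String graph) {
        return iri(s) + " " + iri(p) + " " + iri(o) + " " + iri(graph) + " .";
    }

    /** Wraps an IRI in angle brackets, rejecting characters N-Quads does not allow unescaped. */
    private static String iri(String value) {
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c <= ' ' || "<>\"{}|^`\\".indexOf(c) >= 0) {
                throw new IllegalArgumentException("Not a plain IRI: " + value);
            }
        }
        return "<" + value + ">";
    }

    /**
     * Writes one quad per line as the iterator is consumed, so only the
     * current triple is held in memory. Returns the number of quads written.
     */
    static long writeDump(Iterator<String[]> triples, String graph, Writer out) {
        long count = 0;
        try {
            while (triples.hasNext()) {
                String[] t = triples.next(); // t = {subject, predicate, object}
                out.write(toNQuad(t[0], t[1], t[2], graph));
                out.write('\n');
                count++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Two sample triples standing in for the converted resources of one input file.
        List<String[]> sample = List.of(
            new String[] {"http://example.org/r1", "http://example.org/linksTo", "http://example.org/r2"},
            new String[] {"http://example.org/r2", "http://example.org/linksTo", "http://example.org/r3"});
        writeDump(sample.iterator(), "http://example.org/graph/file1", new PrintWriter(System.out));
    }
}
```

A dump written this way could then be handed to tdb2.tdbloader, as Andy suggested below, instead of being added to a live dataset model by model.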

On 2020/10/27 12:27:35, Fidan Limani <[email protected]> wrote:
> Thanks for the prompt reply, Andy.
>
> I am doing batch-type storage: after a certain number of resources has been
> converted and stored, I issue a transaction. Based on your comment, the heap
> size is quite enough - 80 GB, but I guess the issue remains with
> (programmatically) using the TDB 2.
>
> The relevant packages for storage are org.apache.jena.system.Txn;
> org.apache.jena.tdb2.DatabaseMgr; and org.apache.jena.tdb2.TDB2Factory, but
> yet some TDB 1 behavior seems to show, or?
>
> Just as additional information, the following method is invoked to store
> Model instances to TDB2:
>
> public void storeLinkInstance(String namedGraph, Model model) {
>     Txn.executeWrite(dataset, ()->{
>         // Add to existing named graph
>         if (dataset.containsNamedModel(namedGraph)){
>             /* Model tempModel = dataset.getNamedModel(namedGraph); // .add(model)
>             dataset.addNamedModel(namedGraph, tempModel); */
>             dataset.getNamedModel(namedGraph).add(model);
>         } else {
>             // Add the named graph for the first time
>             dataset.addNamedModel(namedGraph, model);
>         }
>     });
> }
>
> Finally, when I use TDB2 loader (from the command line) to load all these
> smaller parts, that works just fine, and I am also able to use Jena Fuseki on
> top of the resulting TDB, but I face the issue when programmatically
> converting and storing the resources.
>
> Thanks
>
> On 2020/10/27 11:30:22, Andy Seaborne <[email protected]> wrote:
> > Hi Fidan,
> >
> > I guess either that you are loading all the files inside a transaction?
> >
> > How much heap size are you using? (Don't allocate the whole of free RAM).
> >
> > TDB1 uses heap space for uncommitted transactions and it also buffers a
> > few committed transactions because, usually, it is better to do the
> > final work on them a few at a time.
> >
> > There is a control for the buffering:
> > TransactionActionManager.QueueBatchSize
> >
> > TDB2 does not use heap space in this way and does not have limitations
> > on the size of transactions. A heap of 2-4G is fine - the main work at
> > scale happens in the indexes which are not in the heap.
> >
> > >> dataset.getNamedModel(namedGraph).add(model);
> >
> > So you seem to have the data in memory in "model" as well so both the
> > TDB(1) space and model are taking up heap.
> >
> > You can stream the data in by having a transaction and calling Model.add
> > (or DatasetGraph.add(Triple) if you end up working in triples not
> > models+statements. Your choice - it isn't a factor here.).
> >
> > A different approach might be:
> >
> > Convert your resources to RDF and write these to disk, possibly with
> > adding the named graph (so TriG or N-Quads format) then using a bulk
> > loader (TDB1: tdbloader (TDB1 tdbloader2 is only useful for very large
> > datasets)) or tdb2.tdbloader.
> >
> > They are faster than loading into a "live" dataset - they work by
> > manipulating the internal structures directly.
> >
> > For TDB1, they have to start with an empty database.
> >
> > For TDB2, it (there is one bulkloader, with options) works on partially
> > loaded databases.
> >
> > As to which options for the "--loader" argument to tdb2.tdbloader, it
> > depends. The default is good; if you have several 100's of millions and
> > up, try --loader=parallel if it's a big server.
> >
> > Andy
> >
> >
> > On 27/10/2020 08:26, Fidan Limani wrote:
> > > Recently, I am dealing with a large collection of resources that need to
> > > be converted to RDF. The original collection contains a set of files,
> > > each containing > 4 M resources on average. In order to keep the
> > > provenance, I thought having named graphs with the same name to organize
> > > the RDF collection would be nice.
> > >
> > > However, after half of the collection is stored, even on a powerful
> > > server, the memory does not seem to be enough for the store operation in
> > > the TDB. Consider the following statement:
> > >
> > >     dataset.getNamedModel(namedGraph).add(model);
> > >
> > > In it, we retrieve the current RDF Model of triples and add another
> > > collection of triples to it. After a while, once the storage reaches a
> > > certain point, the operation "hangs" due to a heap space exception.
> > >
> > > (Finally) The question, then, is: is there a (more streaming-like) way
> > > to store larger collections via named graphs? My current workaround
> > > consists in splitting the original collection into smaller, more
> > > manageable collections that the server can handle and store in named
> > > graphs.
