Re: [TDB2 & Fuseki 4.4.0] Huge tdb2 database disk size when performing incremental SPARQL update to endpoint.

Dave Reynolds Thu, 10 Feb 2022 09:18:40 -0800

While I can't help with the substance of this question ...

> Since, as far as I know, the latest Fuseki (4.4.0) no longer supports TDB1

I don't think that's correct. While there are new features of TDB2 in the new release (the faster loader), I don't believe TDB1 has been deprecated, let alone dropped.


Dave

On 10/02/2022 16:58, Cédric Viaccoz wrote:

Hello everyone,
I deploy a data processing pipeline at the University of Geneva in which a linked data platform, a Fedora Commons Repository (https://duraspace.org/fedora/) database, is loaded with researchers’ data, and its RDF metadata is then synchronized to a Fuseki triplestore. The synchronization tool I use is the fcrepo-indexing-triplestore messaging application from the fcrepo-camel-toolbox (https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an Apache Camel application designed to synchronize Fedora with an external triplestore.
Since, as far as I know, the latest Fuseki (4.4.0) no longer supports TDB1, I opted to migrate all the projects’ data to TDB2, which meant resynchronizing all of the data from Fedora to Fuseki, this time pointing the Camel app at TDB2-based endpoints.
However, I noticed that the volume of data stored by Fuseki in the “<FUSEKI_BASE>/databases” folder increased drastically in TDB2 compared to TDB1. For instance, a dataset which used to occupy 74 MB on TDB1 now takes up more than 11 GB! After some investigation, I hypothesized that incremental insertion of triples into a TDB2 endpoint creates a larger disk footprint than a single batch load (whereas in TDB1 both loading strategies lead to the same disk footprint).
It is quite tiresome to replicate my precise use case, because it requires deploying a Fedora repository and a Camel application, so instead I attached to this mail a zip containing a small sample of our data as a Turtle file and a Python script that “emulates” the behavior of the data synchronization between Fedora and Fuseki. If you create a persistent TDB2 dataset on your local Fuseki listening on localhost port 3030, and name this dataset “gypso”, then running the Python script “triplestore_incremental_update.py” will, for each single triple from the “gypso.ttl” file, send an INSERT DATA {} SPARQL update to the Fuseki gypso/update endpoint. Please note that the Python script uses the package rdflib, so installing it beforehand through “pip install rdflib” might be necessary. On my Debian server, the resulting size of the database (which can be checked with the Linux command “du -h <FUSEKI_BASE>/databases/gypso/Data-001”) was 50 MB, whereas directly uploading the “gypso.ttl” file to the endpoint results in a size of only 538 KB, even though the data and query performance are identical after either loading strategy.
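For readers without the attachment, the per-triple loading described above can be approximated as follows. This is only a minimal sketch under stated assumptions, not the attached “triplestore_incremental_update.py”: the function names are hypothetical, the endpoint URL is the one implied by the description (local Fuseki on port 3030, dataset “gypso”), and rdflib is assumed to be installed.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: local Fuseki on port 3030 with a dataset named "gypso".
UPDATE_URL = "http://localhost:3030/gypso/update"


def build_update(s, p, o):
    """Wrap one triple (terms already in N-Triples syntax) in an INSERT DATA update."""
    return "INSERT DATA { %s %s %s . }" % (s, p, o)


def iter_triples(path):
    """Yield every triple of a Turtle file as three N-Triples-formatted strings."""
    from rdflib import Graph  # third-party package: pip install rdflib

    graph = Graph()
    graph.parse(path, format="turtle")
    for s, p, o in graph:
        yield s.n3(), p.n3(), o.n3()


def send_update(update, url=UPDATE_URL):
    """POST one SPARQL update as a form-encoded 'update' parameter."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    request = urllib.request.Request(url, data=body)
    with urllib.request.urlopen(request) as response:
        return response.status


def main(path="gypso.ttl"):
    """Send one HTTP update request per triple (hypothetical driver function)."""
    for s, p, o in iter_triples(path):
        send_update(build_update(s, p, o))
```

Calling main() with Fuseki running issues one HTTP request per triple, which is the incremental pattern being compared against a single file upload. Note that blank nodes sent this way become fresh blank nodes in each request, so the sketch is only faithful for data without shared blank nodes.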
I know that as a workaround I could serialize all the data from our infrastructure into compact Turtle files and then upload them directly to TDB2 endpoints, but the data on the Fedora side gets updated regularly, so having the Camel application take care of automatic synchronization is necessary; besides, this was not an issue at all on TDB1. Would anyone have an idea what might be the culprit behind this behavior?
If you need additional details: by looking at the individual file sizes under “Data-001”, I noticed that only the following files differ in size between the two loading strategies: “SPO.idn”, “nodes.idn”, “nodes.dat”, “OSP.dat”, “POS.idn”, “OSP.idn”, “POS.dat” and “SPO.dat”. I have also attached to this mail a screenshot displaying a side-by-side comparison of the sizes of the database files, with gypso.ttl loaded incrementally on the left and as a single file upload on the right. I hope this gives a more low-level view of the issue.
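The same per-file comparison can be reproduced without a screenshot. The following is a rough sketch with hypothetical helper names; it reports apparent file sizes, which may differ from the disk usage printed by “du”.

```python
from pathlib import Path


def file_sizes(directory):
    """Map each regular file directly under `directory` to its apparent size in bytes."""
    return {p.name: p.stat().st_size for p in Path(directory).iterdir() if p.is_file()}


def grown_files(before, after):
    """Return {name: (old_size, new_size)} for files that are larger in `after`."""
    return {
        name: (before.get(name, 0), size)
        for name, size in sorted(after.items())
        if size > before.get(name, 0)
    }
```

For example, grown_files(file_sizes(batch_dir), file_sizes(incremental_dir)), where both arguments are paths to the “Data-001” folders of two datasets loaded with the two strategies, lists the files that are larger after the incremental load.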
Best regards,

Cédric Viaccoz
*Concepteur-Développeur au sein du domaine fonctionnel “Recherche et Information Scientifique (RISe)”*
Division du système et des technologies de l'information et de la communication / IT Services (DISTIC)
Université de Genève | 24 rue Général-Dufour | Bureau 338

Tél : +41 22 379 71 10
