Hi Elton,

Do you have lots of, possibly large, literals in your data?

Also, did you try a compaction on the database? If not, can you try it and post the new file sizes afterwards? Note that they will be located in a new ./Data-XXXX directory, e.g. Data-0001 before and Data-0002 afterwards.
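
In case it helps, here is a minimal Python sketch of both steps: building the request for Fuseki's admin compaction endpoint (POST /$/compact/{name}) and listing the file sizes in the newest Data-NNNN directory afterwards. The server address http://localhost:3030 and the helper names are assumptions for illustration only, and the admin endpoint may require authentication depending on your setup. The database path is the one from your message.

```python
from __future__ import annotations

import re
import urllib.request
from pathlib import Path


def compact_request(base_url: str, dataset: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Fuseki's admin compaction endpoint."""
    url = f"{base_url.rstrip('/')}/$/compact/{dataset}"
    return urllib.request.Request(url, data=b"", method="POST")


def latest_data_dir(db_dir: Path) -> Path:
    """Return the highest-numbered Data-NNNN directory under a TDB2 location."""
    candidates = [p for p in db_dir.iterdir()
                  if p.is_dir() and re.fullmatch(r"Data-\d+", p.name)]
    if not candidates:
        raise FileNotFoundError(f"no Data-NNNN directory in {db_dir}")
    return max(candidates, key=lambda p: int(p.name.split("-")[1]))


def file_sizes(data_dir: Path) -> list[tuple[str, int]]:
    """List (file name, size in bytes) pairs, largest first."""
    sizes = [(p.name, p.stat().st_size) for p in data_dir.iterdir() if p.is_file()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed server address; adjust to your deployment.
    req = compact_request("http://localhost:3030", "my-dataset")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually trigger the compaction

    db = Path("/usr/app/run/databases/my-dataset")
    if db.is_dir():
        for name, size in file_sizes(latest_data_dir(db)):
            print(f"{size / 1e9:8.2f} GB  {name}")
```

Running the listing once before and once after the compaction gives the two sets of numbers to compare. If Fuseki is not running on the database, the tdb2.tdbcompact command-line tool is an alternative way to compact it.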

By the way, we're now at Jena 4.7.0. You might have a look at the release notes of the last three versions; maybe they cover things you have noticed while running your current Fuseki. If not, just keep it running if you're happy with it, of course.


Cheers,
Lorenz

On 28.01.23 03:10, Elton Soares wrote:
Dear Jena Community,

I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift Cluster.
OS Version Info (cat /etc/os-release):
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.5"
...

Hardware Info (from Jena Fuseki initialization log):
[2023-01-27 20:08:59] Server     INFO    Memory: 32.0 GiB
[2023-01-27 20:08:59] Server     INFO    Java:   11.0.14.1
[2023-01-27 20:08:59] Server     INFO    OS:     Linux 3.10.0-1160.76.1.el7.x86_64 amd64
[2023-01-27 20:08:59] Server     INFO    PID:    1


Disk Info (df -h):
Filesystem               Size  Used  Avail  Use%  Mounted on
overlay                   99G   76G    18G   82%  /
tmpfs                     64M     0    64M    0%  /dev
tmpfs                     63G     0    63G    0%  /sys/fs/cgroup
shm                       64M     0    64M    0%  /dev/shm
/dev/mapper/docker_data   99G   76G    18G   82%  /config
/data                    1.0T  677G   348G   67%  /usr/app/run
tmpfs                     40G   24K    40G    1%


My dataset is built using TDB2, and currently has the following RDF Stats:
·         Triples: ~65KK (approximately 65 million)
·         Subjects: ~20KK (approximately 20 million)
·         Objects: ~8KK (approximately 8 million)
·         Graphs: ~213K (approximately 213 thousand)
·         Predicates: 153


The files corresponding to this dataset alone on disk sum up to approximately 
671GB (measured with du -h). From these, the largest files are:
·         /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
·         /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
·         /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
·         /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
·         /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
·         /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB
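
To put these numbers in perspective, the short Python sketch below computes each file's share of the 671GB total and the average number of bytes per triple. The sizes and the triple count are the ones listed above. The figure of 32 bytes per quad index entry (four 8-byte node ids) is an assumption about the TDB2 index layout, not a measured value, and the sizes are treated as decimal gigabytes for simplicity.

```python
GB = 10**9  # assumption: treat the reported sizes as decimal gigabytes

# File sizes in GB, as reported in the list above.
sizes_gb = {
    "OSPG.dat": 243,
    "nodes.dat": 76,
    "POSG.dat": 35,
    "nodes.idn": 33,
    "POSG.idn": 29,
    "OSPG.idn": 27,
}
TOTAL_GB = 671          # total dataset size on disk (du -h)
TRIPLES = 65_000_000    # approximate triple count
ASSUMED_ENTRY_BYTES = 32  # assumption: 4 node ids x 8 bytes per quad entry


def share_of_total(size_gb: float, total_gb: float = TOTAL_GB) -> float:
    """Percentage of the total on-disk size taken by one file."""
    return 100 * size_gb / total_gb


def bytes_per_triple(size_gb: float, triples: int = TRIPLES) -> float:
    """Average number of bytes of this file per stored triple."""
    return size_gb * GB / triples


if __name__ == "__main__":
    for name, size in sizes_gb.items():
        per_triple = bytes_per_triple(size)
        print(f"{name:10s} {share_of_total(size):5.1f}% of total, "
              f"{per_triple:8.0f} bytes/triple, "
              f"{per_triple / ASSUMED_ENTRY_BYTES:6.1f}x the assumed entry size")
```

If an index file works out to far more bytes per triple than a live entry plausibly needs, most of that file is probably not live index data. That would be consistent with growth from many small write transactions rather than with the content of the dataset itself, though this is only a rough indicator.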


I've looked through several documentation pages, source code, and forums, but 
nowhere was I able to find an explanation of why OSPG.dat is so much larger 
than all other files.
I've been using Jena for quite some time now and I'm well aware that its 
indexes grow significantly during usage, especially when triples are added 
across multiple requests (transactional workloads).
Even so, the size of this particular file (OSPG.dat) surprised me, as in my 
prior experience the indexes would never get larger than the nodes.dat file.
Is there a reasonable explanation for this based on the content of the dataset 
or the way it was generated? Could this be an indexing bug within TDB2?
Thank you for your support!
For completeness, here is the assembler configuration for my dataset:
@prefix :       <http://base/#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix root:   <http://dev-test-jena-fuseki/$/datasets> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .

tdb2:GraphTDB  rdfs:subClassOf  ja:Model .

ja:ModelRDFS  rdfs:subClassOf  ja:Model .

ja:RDFDatasetSink  rdfs:subClassOf  ja:RDFDataset .

<http://jena.hpl.hp.com/2008/tdb#DatasetTDB>
        rdfs:subClassOf  ja:RDFDataset .

tdb2:GraphTDB2  rdfs:subClassOf  ja:Model .

<http://jena.apache.org/text#TextDataset>
        rdfs:subClassOf  ja:RDFDataset .

ja:RDFDatasetZero  rdfs:subClassOf  ja:RDFDataset .

:service_tdb_my-dataset
        rdf:type                          fuseki:Service ;
        rdfs:label                        "TDB my-dataset" ;
        fuseki:dataset                    :ds_my-dataset ;
        fuseki:name                       "my-dataset" ;
        fuseki:serviceQuery               "sparql" , "query" ;
        fuseki:serviceReadGraphStore      "get" ;
        fuseki:serviceReadWriteGraphStore "data" ;
        fuseki:serviceUpdate              "update" ;
        fuseki:serviceUpload              "upload" .

ja:ViewGraph  rdfs:subClassOf  ja:Model .

ja:GraphRDFS  rdfs:subClassOf  ja:Model .

tdb2:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .

<http://jena.hpl.hp.com/2008/tdb#GraphTDB>
        rdfs:subClassOf  ja:Model .

ja:DatasetTxnMem  rdfs:subClassOf  ja:RDFDataset .

tdb2:DatasetTDB2  rdfs:subClassOf  ja:RDFDataset .

ja:RDFDatasetOne  rdfs:subClassOf  ja:RDFDataset .

ja:MemoryDataset  rdfs:subClassOf  ja:RDFDataset .

ja:DatasetRDFS  rdfs:subClassOf  ja:RDFDataset .

:ds_my-dataset  rdf:type        tdb2:DatasetTDB2 ;
        tdb2:location           "run/databases/my-dataset" ;
        tdb2:unionDefaultGraph  true ;
        ja:context              [ ja:cxtName   "arq:optFilterPlacement" ;
                                  ja:cxtValue  "false"
                                ] .

This issue has also been posted at 
https://stackoverflow.com/questions/75264889/why-does-the-ospg-dat-file-grows-so-much-more-than-all-other-files

