I don't know how OSPG can be a considerably different size. Small variations happen, but this does not look small.

Lorenz's advice to run a compaction and see what the index sizes are is a good idea. A backup would also be a good idea, because something unexpected is going on (backup uses GSPO).
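
If the server has to stay up, both can be done through the Fuseki administration protocol. A minimal sketch, assuming the admin endpoints are enabled and using the service name "my-dataset" from the configuration below (host and port are placeholders):

    # ask Fuseki to compact the TDB2 database behind the service
    curl -XPOST 'http://localhost:3030/$/compact/my-dataset'

    # write a backup (gzipped N-Quads) into the server's backup area
    curl -XPOST 'http://localhost:3030/$/backup/my-dataset'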

There have been some fixes in compaction since 4.4.0, related to compacting while the database is also active in Fuseki.

This index does not store the literals' string representations - they are referenced via the 8-byte entries. In OSPG, the index entries are 4 slots of 8 bytes.
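
A back-of-the-envelope check of what that implies for raw entry volume (ignoring B+tree fill factor and block overhead):

    65,000,000 quads x 32 bytes/entry ≈ 2.1 GB

so even a generous allowance for tree overhead comes nowhere near the 243GB reported for OSPG.dat.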

    Andy

(Unrelated comment below)

On 28/01/2023 07:47, Lorenz Buehmann wrote:
Hi Elton,

Do you have lots of large literals in your data?
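
One rough way to check is a query like the following (an untested sketch: the 1000-character threshold is arbitrary, and it scans every object, so expect it to be slow on 65 million triples):

    SELECT (COUNT(*) AS ?bigLiterals)
    WHERE {
      GRAPH ?g { ?s ?p ?o }
      FILTER(isLiteral(?o) && STRLEN(STR(?o)) > 1000)
    }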

Also, did you try a compaction on the database? If not, can you try it and post the new file sizes afterwards? Note that they will be located in a new ./Data-XXXX directory, e.g. Data-0001 before and Data-0002 afterwards.
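
If stopping the server is an option, the same compaction can be run offline with the command-line tool shipped with Jena. A sketch, assuming the database location from your assembler configuration:

    # run only while Fuseki is stopped; the compacted copy lands in a new Data-NNNN directory
    tdb2.tdbcompact --loc=/usr/app/run/databases/my-dataset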

By the way, we're now at Jena 4.7.0 - you might have a look at the release notes of the last three versions; maybe they cover things you have noticed while running your current Fuseki. If not, just keep it running if you're happy with it, of course.


Cheers,
Lorenz

On 28.01.23 03:10, Elton Soares wrote:
Dear Jena Community,

I'm running Jena Fuseki Version 4.4.0 as a container on an OpenShift Cluster.
OS Version Info (cat /etc/os-release):
NAME="Red Hat Enterprise Linux"
VERSION="8.5 (Ootpa)"
ID="rhel"
ID_LIKE="fedora" ="8.5"
...

Hardware Info (from Jena Fuseki initialization log):
[2023-01-27 20:08:59] Server     INFO    Memory: 32.0 GiB
[2023-01-27 20:08:59] Server     INFO    Java:   11.0.14.1
[2023-01-27 20:08:59] Server     INFO    OS:     Linux 3.10.0-1160.76.1.el7.x86_64 amd64
[2023-01-27 20:08:59] Server     INFO    PID:    1


Disk Info (df -h):
Filesystem               Size  Used Avail Use% Mounted on
overlay                   99G   76G   18G  82% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                     63G     0   63G   0% /sys/fs/cgroup
shm                       64M     0   64M   0% /dev/shm
/dev/mapper/docker_data   99G   76G   18G  82% /config
/data                    1.0T  677G  348G  67% /usr/app/run
tmpfs                     40G   24K   40G   1%


My dataset is built using TDB2 and currently has the following RDF stats:
·         Triples: ~65KK (approximately 65 million)
·         Subjects: ~20KK (approximately 20 million)
·         Objects: ~8KK (approximately 8 million)
·         Graphs: ~213K (approximately 213 thousand)
·         Predicates: 153


The files for this dataset alone sum to approximately 671GB on disk (measured with du -h). Of these, the largest files are:
·         /usr/app/run/databases/my-dataset/Data-0001/OSPG.dat: 243GB
·         /usr/app/run/databases/my-dataset/Data-0001/nodes.dat: 76GB
·         /usr/app/run/databases/my-dataset/Data-0001/POSG.dat: 35GB
·         /usr/app/run/databases/my-dataset/Data-0001/nodes.idn: 33GB
·         /usr/app/run/databases/my-dataset/Data-0001/POSG.idn: 29GB
·         /usr/app/run/databases/my-dataset/Data-0001/OSPG.idn: 27GB


I've looked into several documentation pages, source code, forums, ... but nowhere was I able to find an explanation for why OSPG.dat is so much larger than all the other files. I've been using Jena for quite some time now and I'm well aware that its indexes grow significantly during usage, especially when triples are added across multiple requests (transactional workloads). Even so, the size of this particular file (OSPG.dat) surprised me, as in my prior experience the indexes never grew larger than the nodes.dat file. Is there a reasonable explanation for this based on the content of the dataset or the way it was generated? Could this be an indexing bug within TDB2?
Thank you for your support!
For completeness, here is the assembler configuration for my dataset:
@prefix :       <http://base/#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix ja:     <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix root:   <http://dev-test-jena-fuseki/$/datasets> .
@prefix tdb2:   <http://jena.apache.org/2016/tdb#> .

It only needs:


:service_tdb_my-dataset
        rdf:type                      fuseki:Service ;
        rdfs:label                    "TDB my-dataset" ;
        fuseki:dataset                :ds_my-dataset ;
        fuseki:name                   "my-dataset" ;
        fuseki:serviceQuery           "sparql" , "query" ;
        fuseki:serviceReadGraphStore  "get" ;
        fuseki:serviceReadWriteGraphStore  "data" ;
        fuseki:serviceUpdate          "update" ;
        fuseki:serviceUpload          "upload" .

:ds_my-dataset  rdf:type     tdb2:DatasetTDB2 ;
        tdb2:location           "run/databases/my-dataset" ;
        tdb2:unionDefaultGraph  true ;
        ja:context              [ ja:cxtName   "arq:optFilterPlacement" ;
                                  ja:cxtValue  "false"
                                ] .

The rest can go.


This issue has also been published at https://stackoverflow.com/questions/75264889/why-does-the-ospg-dat-file-grows-so-much-more-than-all-other-files

