Thank you for the explanations:

To Laura: I guess HDT would reduce the size of my files considerably. Where could I find information on how to use Fuseki with HDT? It might be worth trying, to see how the response time changes.

To Andy: Am I correct in understanding that a triple (uri p literal) is translated into two triples, (uri p uriX) and a second one (uriX s literal), for some properties p and s? Is there any reuse of existing literals? That would give approx. 60 bytes for each literal triple?

I still do not understand why a triple needs roughly 400 bytes of storage (13 GB / 33 M triples), or how a gzipped N-Triples file of 219 MB gives a TDB database of 13 GB.

The size of the database is of concern to me, and I think it influences performance through I/O time.

Thank you all very much for the clarifications!

andrew



On 11/26/2017 07:30 AM, Andy Seaborne wrote:
Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.

There is a big cache, NodeId->RDFTerm.

In TDB1 and TDB2, a NodeId is stored as 8 bytes. The TDB2 design is an int and a long (96 bits); the current implementation uses 64 bits.

It is a very common design to dictionary-encode (intern) terms, because joins can then be done by comparing integers rather than testing whether two strings are the same, which is much more expensive.
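As an illustration only (not TDB's actual code, and with all names invented for the sketch), a toy dictionary encoder in Java might look like this; triples are then kept as three longs, so a join only compares numbers:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy dictionary (intern) encoding: each RDF term string gets a numeric id once;
    // triples are stored and joined as id tuples. Illustrative only -- TDB's real
    // NodeId handling is more involved.
    class TermDictionary {
        private final Map<String, Long> termToId = new HashMap<>();
        private final List<String> idToTerm = new ArrayList<>();

        long intern(String term) {
            return termToId.computeIfAbsent(term, t -> {
                idToTerm.add(t);
                return (long) (idToTerm.size() - 1);
            });
        }

        String lookup(long id) {
            return idToTerm.get((int) id);
        }
    }

    // A triple becomes three 8-byte ids; joining on ?s compares two longs,
    // never the (possibly long) lexical forms.
    class EncodedTriple {
        final long s, p, o;
        EncodedTriple(long s, long p, long o) { this.s = s; this.p = p; this.o = o; }
    }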

In addition, TDBx inlines numbers (integers), date/times, and some other datatypes into the NodeId itself.

https://jena.apache.org/documentation/tdb/architecture.html
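To illustrate the inlining idea (again only a sketch; the bit layout and constants here are invented, the real encoding is described on the architecture page above), a small value can be packed straight into the 64-bit id behind a tag bit, so it never needs an entry in the node table:

    // Illustrative only: pack a small non-negative integer directly into a
    // 64-bit NodeId-like value behind a tag bit instead of storing it in the
    // node table. The layout is made up for this sketch.
    final class InlineIds {
        private static final long INLINE_FLAG = 1L << 63;   // hypothetical "inlined value" marker

        static long encodeInt(long value) {
            if (value < 0 || value >= (1L << 56))
                throw new IllegalArgumentException("too large to inline in this sketch");
            return INLINE_FLAG | value;                      // the value lives inside the id itself
        }

        static boolean isInline(long id) { return (id & INLINE_FLAG) != 0; }

        static long decodeInt(long id)   { return id & ~INLINE_FLAG; }
    }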

TDBx could, but doesn't, store compressed data on disk. There are pros and cons of this.

    Andy

On 26/11/17 08:30, Laura Morales wrote:
Perhaps a bit tangential, but this is somehow related to how HDT stores its data (I've run some tests with Fuseki + an HDT store instead of TDB). Basically, HDT assigns each subject, predicate, and object an integer value, keeps an index to map the integers to the corresponding strings (the original values), and then stores every triple using integers instead of strings (something like "1 2 9 . 8 2 1 ." and so forth). The drawback, I think, is that it has to translate between indices and strings on every query; nonetheless the response time is still impressive (milliseconds), and it compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full file, but one with about 2.3 billion triples) the HDT is more or less 40 GB, and gz-compressed about 10 GB. The problem is that their rdf2hdt tool is so inefficient that it does everything in RAM, so to convert something like Wikidata you'd need a machine with at least 512 GB of RAM (or swap, if you have fast enough swap :D). Also, the tool looks like it can't handle files with more than 2^32 triples, although HDT (the format) does handle them. So as long as you can handle the conversion, if you want to save space you could benefit from using an HDT store rather than TDB.
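For example, a minimal sketch of querying an HDT file through Jena (assuming the hdt-java and hdt-jena libraries are on the classpath; "dataset.hdt" is a placeholder, and exact class names may differ between versions):

    import org.apache.jena.query.*;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.rdfhdt.hdt.hdt.HDT;
    import org.rdfhdt.hdt.hdt.HDTManager;
    import org.rdfhdt.hdtjena.HDTGraph;

    // Sketch: expose an HDT file as a read-only Jena Model and run a SPARQL query.
    public class HdtQueryExample {
        public static void main(String[] args) throws Exception {
            HDT hdt = HDTManager.mapIndexedHDT("dataset.hdt", null);   // memory-mapped, loads/builds the index
            Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

            String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
            try (QueryExecution qe = QueryExecutionFactory.create(q, model)) {
                ResultSetFormatter.out(qe.execSelect());
            }
            hdt.close();
        }
    }

As far as I know, hdt-jena also ships an assembler, so the same kind of HDT-backed graph can be referenced from a Fuseki configuration file instead of being wired up in code.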



Sent: Sunday, November 26, 2017 at 5:30 AM
From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size
I have specific questions in relation to what ajs6f said:

I have a TDB store where about 1/3 of the triples have very small literals (3-5 characters),
and the same character sequence is often repeated. Would I get a smaller store and
better performance if these were URIs built from the character sequence (stored
once for each repeated case)? Any guess how much I could improve?
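(For illustration only, a hedged sketch of what such a rewrite could look like with the Jena API; the namespace is made up, and real code would also need to percent-encode the literal values before putting them into a URI:)

    import org.apache.jena.rdf.model.*;

    // Sketch: replace short literal objects by URIs minted from their lexical form,
    // so repeated values end up as a single shared node.
    public class LiteralsToUris {
        static final String NS = "http://example.org/word/";   // hypothetical namespace

        static Model rewrite(Model in) {
            Model out = ModelFactory.createDefaultModel();
            StmtIterator it = in.listStatements();
            while (it.hasNext()) {
                Statement st = it.next();
                RDFNode o = st.getObject();
                if (o.isLiteral() && o.asLiteral().getLexicalForm().length() <= 5) {
                    Resource r = out.createResource(NS + o.asLiteral().getLexicalForm());
                    out.add(st.getSubject(), st.getPredicate(), r);
                } else {
                    out.add(st);
                }
            }
            return out;
        }
    }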

Does the size of the URIs play a role in the amount of storage used? I
observe that for 33 M triples I have a TDB size (files) of 13 GB, which
means roughly 400 bytes per triple. The literals are all short (very seldom
more than 10 characters, mostly 5 - words from English text). It is a named
graph, if this makes a difference.

Thank you!

andrew


--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                 +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobile
