Hello
I would like to ask how TDB2 and Fuseki manage large amounts of string data
(especially repeated data) and what the best practices are. Do they
optimize it somehow, or is it up to us to make improvements?

For example, we have a TDB2 store which we access via Fuseki, and an
example named graph like this:
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Region, "New York"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Other, "long long string"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#NAME, "JOHN SMITH"]

So, we have a person, JohnSmith, with two properties besides the name:
"Region" and "Other". One of them is a short string, "New York"; the other
is a long string.
Assume we have 100 000 more people and many of them have the same "Region"
and "Other" values. What would be the best approach to storing such data?

I created 10 000 more named graphs of people with different names but the
same other properties, and tested the performance.
First I measured 10 000 reads of graphs like this, and the average time
was around 4.4 ms (no matter how long the strings were).
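
For reference, the averaging can be sketched roughly like this. It is only
an illustration in plain Python: fake_fetch is a hypothetical stand-in for
whatever call actually reads one named graph from Fuseki, and the
http://people/Person... names are made up.

```python
import time

def average_read_ms(fetch_graph, graph_names):
    """Return the mean wall-clock time per fetch_graph call, in milliseconds."""
    start = time.perf_counter()
    for name in graph_names:
        fetch_graph(name)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(graph_names)

# Hypothetical stand-in for the real read of one named graph.
def fake_fetch(name):
    return [(name, "http://www.w3.org/2001/vcard-rdf/3.0#Region", "New York")]

# Made-up graph names, one per person.
names = [f"http://people/Person{i}" for i in range(10_000)]
avg = average_read_ms(fake_fetch, names)
print(f"average read time: {avg:.3f} ms")
```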

The other option I considered is making "New York" a resource stored in a
"cities" named graph, and doing the same thing with "long long string", so
that each actual string is stored only once. I tested reading the graphs
again on 10 000 cases and didn't notice any change in performance. The
average load time was still 4.4 ms when we had resource URIs instead of
"New York" and "long long string".
However, to get the full data, we need to add the actual resources to our
original JohnSmith graph, which adds overhead since we have to fetch 2 more
named graphs. So, it causes a quite predictable drop in performance.
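
To make the two layouts concrete, here is a minimal sketch using plain
Python dictionaries in place of the real store. The graph names
http://cities and http://texts, the resource URIs and the use of rdfs:label
to hold the string are all assumptions made for illustration; they are not
what TDB2 does internally.

```python
VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#"
JOHN = "http://people/JohnSmith"

# Layout 1: literals stored inline in the person's named graph.
inline_store = {
    JOHN: [
        (JOHN, VCARD + "Region", "New York"),
        (JOHN, VCARD + "Other", "long long string"),
        (JOHN, VCARD + "NAME", "JOHN SMITH"),
    ],
}

# Layout 2: shared strings become resources in their own named graphs.
# All URIs below are hypothetical.
NEW_YORK = "http://cities/NewYork"
LONG_TEXT = "http://texts/LongString1"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

shared_store = {
    JOHN: [
        (JOHN, VCARD + "Region", NEW_YORK),
        (JOHN, VCARD + "Other", LONG_TEXT),
        (JOHN, VCARD + "NAME", "JOHN SMITH"),
    ],
    "http://cities": [(NEW_YORK, LABEL, "New York")],
    "http://texts": [(LONG_TEXT, LABEL, "long long string")],
}

def read_person(store, person, extra_graphs=()):
    """Collect the person's triples plus any extra graphs; count graph reads."""
    triples = list(store[person])
    reads = 1
    for graph in extra_graphs:
        triples.extend(store[graph])
        reads += 1
    return triples, reads

inline_triples, inline_reads = read_person(inline_store, JOHN)
shared_triples, shared_reads = read_person(
    shared_store, JOHN, ("http://cities", "http://texts")
)
print(inline_reads, shared_reads)
```

In this sketch the second layout needs three graph reads instead of one to
get the same information, which is the overhead described above.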

So, according to my tests, the first case (the one described in the graph
example) performed the best, but it feels like we are storing too much
redundant information. So, I still wanted to ask for your opinions on this
approach and learn whether the TDB store optimizes such data internally.
