Hello, I would like to ask how TDB2 and Fuseki handle large amounts of string data (especially repeated data) and what the best practices are. Do they optimize it somehow, or is it up to us to make improvements?
For example, we have a TDB2 store which we access via Fuseki, and an example named graph like this:

[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Region, "New York"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#Other, "long long string"]
[http://people/JohnSmith, http://www.w3.org/2001/vcard-rdf/3.0#NAME, "JOHN SMITH"]

So we have a JohnSmith person with a name and two other properties, "Region" and "Other". One of them is the short string "New York", the other is a long string. Assume we have 100 000 more people, and many of them have the same "Region" and "Other" values. What would be the best approach to storing such data?

I created 10 000 more named graphs of people with different names but the same other properties and tested the performance. First, I read graphs like the one above in 10 000 cases; the average time was around 4.4 ms, no matter how long the strings were.

The other option I considered is making "New York" a resource, storing it in a "cities" named graph, and doing the same with "long long string". The idea is to store the actual string only once. I tested reading the graphs again in 10 000 cases and did not notice any change in performance: the average load time was still 4.4 ms when the graphs held resource URIs instead of "New York" and "long long string". However, to get the full data, we need to add the actual resources to our original JohnSmith graph, which adds overhead since we have to fetch 2 more named graphs. That causes a quite predictable drop in performance.

So, according to my tests, the first option (the one described in the graph example) performed best, but it feels like we are storing too much redundant information. I would still like to ask for your opinions on this approach and to learn whether the TDB store performs some internal optimization of the data.
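
To make the two layouts concrete, here is a minimal Python sketch of what I compared. It is only a toy in-memory model (a dict from graph name to a list of triples), not Jena or Fuseki API, and it says nothing about how TDB2 stores the data internally. The URIs http://cities, http://cities/NewYork, http://notes and http://notes/note1, and the use of rdfs:label to carry the shared text, are assumptions made up for the illustration.

```python
VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#"
# Assumed predicate that attaches the shared text to its resource.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LONG_TEXT = "long long string"


def build_inline(people):
    """Layout 1: every person graph carries the literals directly."""
    dataset = {}
    for local_name, display_name in people.items():
        s = "http://people/" + local_name
        dataset[s] = [
            (s, VCARD + "Region", "New York"),
            (s, VCARD + "Other", LONG_TEXT),
            (s, VCARD + "NAME", display_name),
        ]
    return dataset


def build_shared(people):
    """Layout 2: person graphs point at resources kept in shared graphs."""
    city = "http://cities/NewYork"  # hypothetical resource URI
    note = "http://notes/note1"     # hypothetical resource URI
    dataset = {
        "http://cities": [(city, RDFS_LABEL, "New York")],
        "http://notes": [(note, RDFS_LABEL, LONG_TEXT)],
    }
    for local_name, display_name in people.items():
        s = "http://people/" + local_name
        dataset[s] = [
            (s, VCARD + "Region", city),
            (s, VCARD + "Other", note),
            (s, VCARD + "NAME", display_name),
        ]
    return dataset


def read_inline(dataset, person):
    """One graph read returns the full data."""
    return list(dataset[person]), 1


def read_shared(dataset, person):
    """The person graph plus two shared graphs are read to get the full data."""
    triples = list(dataset[person])
    reads = 1
    referenced = {o for (_, _, o) in triples}
    for graph in ("http://cities", "http://notes"):
        triples += [t for t in dataset[graph] if t[0] in referenced]
        reads += 1
    return triples, reads


people = {"JohnSmith": "JOHN SMITH", "JaneDoe": "JANE DOE"}
inline_ds = build_inline(people)
shared_ds = build_shared(people)

inline_triples, inline_reads = read_inline(inline_ds, "http://people/JohnSmith")
shared_triples, shared_reads = read_shared(shared_ds, "http://people/JohnSmith")
print("inline layout:", inline_reads, "graph read(s),", len(inline_triples), "triples")
print("shared layout:", shared_reads, "graph read(s),", len(shared_triples), "triples")
```

In this model the second layout keeps each long string in one place, but every full read of a person touches three graphs instead of one, which matches the overhead I saw in my test.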