On 22/02/12 22:01, Tim Harsch wrote:
I am wondering how TDB deals with UTF strings in general. How are
strings stored internally and processed during joins? What I'm most
interested in is how UTF normalization is handled. I would think
that, in theory, you must store the normalized version of a string
so that later, when a join is performed, normalized strings are
compared against normalized strings... otherwise TDB must perform
normalization on each string at join time, which seems like it would
be very expensive. But if you store normalized strings, then you are
unable to return the original un-normalized string that was loaded,
correct?
Thanks, Tim
The secret ... TDB does not join on strings.
When RDF terms are read in, they are assigned an id in the node table.
An id is 64 bits currently. Some values are stored inline (integers,
decimals, dates, dateTimes).
There is only ever one copy of a string. Triples are 3 ids, and joins
are done on ids - fixed-length byte sequences. It assumes the
full-length MD5 hash is unique for each literal in the DB.
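The idea can be sketched roughly in Java. This is only an illustration of joining on fixed-length keys, not TDB's actual NodeId layout (the real node table also inlines small values directly in the 64-bit id, as noted above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class NodeIdSketch {
    // Hypothetical helper: map a lexical form to a fixed-length key.
    // TDB uses the MD5 hash for node-table lookup; joins then compare
    // fixed-length byte sequences instead of variable-length strings.
    static byte[] key(String lexicalForm) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(lexicalForm.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        byte[] a = key("http://example/alice");
        byte[] b = key("http://example/alice");
        byte[] c = key("http://example/bob");
        // Same term -> same key; joining is byte-array comparison.
        System.out.println(Arrays.equals(a, b)); // true
        System.out.println(Arrays.equals(a, c)); // false
    }
}
```

Note the crucial point for the original question: two literals that differ only in normalization form are different byte sequences, so they hash to different keys and are treated as different terms.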
Incoming RDF data to TDB is expected to be correct; no alterations are
made except on inlined values.
Data should be checked first for all sorts of things, because a bad
triple halfway through a load is hard to deal with neatly.
In fact, the RIOT parsers don't normalize, as it's expensive (although
RDF/XML might - don't know). But you could pre-process the data to
normalize it if you believe it's going to be a problem.
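Such a pre-processing step might look like this in plain Java, using the standard `java.text.Normalizer` to push every literal's lexical form to NFC before loading (a sketch of the approach, not a Jena API):

```java
import java.text.Normalizer;

public class NormalizeBeforeLoad {
    // Normalize a lexical form to NFC before the data reaches TDB,
    // so every stored string is in one canonical form.
    static String toNFC(String lexicalForm) {
        return Normalizer.isNormalized(lexicalForm, Normalizer.Form.NFC)
                ? lexicalForm
                : Normalizer.normalize(lexicalForm, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "Caf\u0065\u0301"; // "Cafe" + combining acute accent
        String composed = toNFC(decomposed);   // becomes "Caf\u00E9"
        System.out.println(composed.equals("Caf\u00E9")); // true
    }
}
```

The `isNormalized` check is cheap for strings that are already NFC (the common case), so the cost is mostly paid only for the literals that actually need rewriting.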
RDF tends towards NFC ("recommends" = SHOULD) for literals. There are
four kinds of Unicode normalization (NFC, NFD, NFKC, NFKD).
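The four forms really do produce different strings; a quick illustration with `java.text.Normalizer` on a string containing a combining accent and the "fi" ligature (U+FB01):

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class FourForms {
    public static void main(String[] args) {
        // "e" + combining acute accent, followed by the "fi" ligature.
        String s = "e\u0301\uFB01";
        System.out.println(Normalizer.normalize(s, Form.NFC).length());  // 2: é + ﬁ
        System.out.println(Normalizer.normalize(s, Form.NFD).length());  // 3: e + accent + ﬁ
        System.out.println(Normalizer.normalize(s, Form.NFKC).length()); // 3: é + f + i
        System.out.println(Normalizer.normalize(s, Form.NFKD).length()); // 4: e + accent + f + i
    }
}
```

The C forms compose combining sequences, the D forms decompose them, and the K forms additionally replace compatibility characters like ligatures - which is why data normalized under different forms will not compare equal byte-for-byte.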
I have a background project-ette for canonicalization and more
rigorous validation as a parser pipeline stage - that could do
Unicode NFC.
Andy