On 22/02/12 22:01, Tim Harsch wrote:
I am wondering how TDB deals with UTF strings in general. How are
strings stored internally and processed during joins? What I'm most
interested in is how UTF normalization is handled. I would think
that, in theory, you must store the normalized version of a string
so that later, when a join is performed, normalized strings are
compared against normalized strings... otherwise TDB must perform
normalization on each string at join time, which seems like it would
be very expensive. But if you store normalized strings, then you are
unable to return the original un-normalized string that was loaded,
correct?
Thanks, Tim
The secret ... TDB does not join on strings.
When RDF terms are read in, they are assigned an id in the node table.
An id is 64 bits currently. Some values are stored inline (integers,
decimals, dates, dateTimes).
There is only ever one copy of a string. Triples are 3 ids, and joins
are done on ids - fixed-length byte sequences. It assumes the
full-length MD5 hash is unique for each literal in the DB.
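The idea can be sketched roughly in Java. This is only an illustration of joining on fixed-length keys, not TDB's actual NodeId layout (the real node table also inlines small values directly in the 64-bit id, as noted above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class NodeIdSketch {
    // Hypothetical helper: map a lexical form to a fixed-length key.
    // TDB uses the MD5 hash for node-table lookup; joins then compare
    // fixed-length byte sequences instead of variable-length strings.
    static byte[] key(String lexicalForm) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(lexicalForm.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        byte[] a = key("http://example/alice");
        byte[] b = key("http://example/alice");
        byte[] c = key("http://example/bob");
        // Same term -> same key; joining is byte-array comparison.
        System.out.println(Arrays.equals(a, b)); // true
        System.out.println(Arrays.equals(a, c)); // false
    }
}
```

Note the crucial point for the original question: two literals that differ only in normalization form are different byte sequences, so they hash to different keys and are treated as different terms.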
Incoming RDF data to TDB is expected to be correct; no alterations are
made except on inlined values.
Data should be checked first for all sorts of things, because a bad
triple halfway through a load is hard to deal with neatly.
In fact, the RIOT parsers don't normalize, as it's expensive (although
RDF/XML might - don't know). But you could pre-process the data to
normalize it if you believe it's going to be a problem.
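Such a pre-processing step might look like this in plain Java, using the standard `java.text.Normalizer` to push every literal's lexical form to NFC before loading (a sketch of the approach, not a Jena API):

```java
import java.text.Normalizer;

public class NormalizeBeforeLoad {
    // Normalize a lexical form to NFC before the data reaches TDB,
    // so every stored string is in one canonical form.
    static String toNFC(String lexicalForm) {
        return Normalizer.isNormalized(lexicalForm, Normalizer.Form.NFC)
                ? lexicalForm
                : Normalizer.normalize(lexicalForm, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "Caf\u0065\u0301"; // "Cafe" + combining acute accent
        String composed = toNFC(decomposed);   // becomes "Caf\u00E9"
        System.out.println(composed.equals("Caf\u00E9")); // true
    }
}
```

The `isNormalized` check is cheap for strings that are already NFC (the common case), so the cost is mostly paid only for the literals that actually need rewriting.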
RDF tends towards NFC ("recommends" = SHOULD) for literals. There are
four kinds of Unicode normalization (NFC, NFD, NFKC, NFKD).
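The four forms really do produce different strings; a quick illustration with `java.text.Normalizer` on a string containing a combining accent and the "fi" ligature (U+FB01):

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class FourForms {
    public static void main(String[] args) {
        // "e" + combining acute accent, followed by the "fi" ligature.
        String s = "e\u0301\uFB01";
        System.out.println(Normalizer.normalize(s, Form.NFC).length());  // 2: é + ﬁ
        System.out.println(Normalizer.normalize(s, Form.NFD).length());  // 3: e + accent + ﬁ
        System.out.println(Normalizer.normalize(s, Form.NFKC).length()); // 3: é + f + i
        System.out.println(Normalizer.normalize(s, Form.NFKD).length()); // 4: e + accent + f + i
    }
}
```

The C forms compose combining sequences, the D forms decompose them, and the K forms additionally replace compatibility characters like ligatures - which is why data normalized under different forms will not compare equal byte-for-byte.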
I have a background project-ette for canonicalization and more
rigorous validation as a parser pipeline stage - that could do
Unicode NFC.
Andy