So I knew that TDB used an id in place of a string, except in the case of inlined values. Are you saying that non-inlined values use an MD5 digest? I did not know that.
So, if no normalization is done on literals, how does Fuseki/TDB pass the normalization tests of the SPARQL DAWG suite? My understanding here is still limited, but I assume the normalization tests would fail for two non-normalized literals (ones that are unequal without normalization but equal after it) unless both literals in a comparison were first normalized, either as a pre-processing step, at string table load time, or at query time.

Thanks,
Tim

>________________________________
> From: Andy Seaborne <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 22, 2012 2:23 PM
> Subject: Re: How is UTF-8 handled in TDB
>
> On 22/02/12 22:01, Tim Harsch wrote:
>> I am wondering how TDB deals with UTF strings in general. How are
>> strings stored internally and processed during joins? What I'm most
>> interested in is how the case of UTF normalization is handled? So I
>> think in theory you must store the UTF normalized version of a string
>> so that later, when a join is performed, normalized strings are
>> compared against normalized strings... otherwise TDB must perform
>> normalization on each string at join time, which seems like it would be very
>> expensive. But, if you store normalized strings then you are unable
>> to return the original un-normalized string that was loaded,
>> correct?
>>
>> Thanks, Tim
>>
>
> The secret ... TDB does not join on strings.
>
> When RDF terms are read in, they are assigned an id in the node table. An id
> is 64 bits currently. Some values are stored inline (integers, decimals,
> dates, dateTimes).
>
> There is only ever one copy of a string. Triples are 3 ids, and joins are done
> on ids - fixed-length byte sequences. It assumes the full-length MD5 hash is
> unique for each literal in the DB.
>
> Incoming RDF data to TDB is expected to be correct; no alterations are made
> except on inlined values.
>
> Data should be checked first for all sorts of things, because a bad triple
> halfway through a load is hard to deal with neatly.
>
> In fact, the RIOT parsers don't normalize, as it's expensive (although RDF/XML
> might - I don't know). But you could pre-process the data to normalize it if you
> believe it's going to be a problem.
>
> RDF tends towards NFC ("recommends" = SHOULD) for literals. There are 4 kinds
> of normalization.
>
> I have a background project-ette for canonicalization and more rigorous
> validation as a parser pipeline stage - that could do Unicode NFC.
>
>     Andy
>
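[Editor's note: the point about joining on fixed-length byte sequences rather than strings can be illustrated with a small sketch. This is not TDB's actual node-table code, just a demonstration that an MD5 digest gives every literal, whatever its length, the same fixed-size identifier.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixedLengthIdDemo {
    // Digest a literal's UTF-8 bytes with MD5; the result is always
    // 16 bytes (128 bits), regardless of how long the literal is.
    static byte[] literalId(String literal) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(literal.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] shortId = literalId("a");
        byte[] longId  = literalId("a much longer literal value stored in the database");

        // Both digests have the same fixed length, so comparisons
        // (and hence joins) cost the same whatever the string length.
        System.out.println(shortId.length); // 16
        System.out.println(longId.length);  // 16
    }
}
```

The uniqueness assumption Andy mentions is exactly the usual one for cryptographic hashes: collisions are possible in principle but vanishingly unlikely in practice.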
