So I knew that TDB used an id in place of a string, except in the case of inlined values. Are you saying that non-inlined values use an MD5 digest? I did not know that.
So, if no normalization is done on literals, how does Fuseki/TDB pass the normalization tests of the SPARQL DAWG suite? My understanding here is still limited, but I assume the normalization tests would fail for two non-normalized literals (ones that are unequal without normalization but equal after it) unless both literals in a comparison were first normalized, either as a pre-processing step, at string table load time, or at query time.

Thanks,
Tim

>________________________________
> From: Andy Seaborne <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 22, 2012 2:23 PM
> Subject: Re: How is UTF-8 handled in TDB
>
> On 22/02/12 22:01, Tim Harsch wrote:
>> I am wondering how TDB deals with UTF strings in general. How are
>> strings stored internally and processed during joins? What I'm most
>> interested in is how the case of UTF normalization is handled? So I
>> think in theory you must store the UTF normalized version of a string
>> so that later, when a join is performed, normalized strings are
>> compared against normalized strings... otherwise TDB must perform
>> normalization on each string at join time, which seems like it would be very
>> expensive. But, if you store normalized strings then you are unable
>> to return the original un-normalized string that was loaded,
>> correct?
>>
>> Thanks, Tim
>>
>
> The secret ... TDB does not join on strings.
>
> When RDF terms are read in, they are assigned an id in the node table. An id
> is 64 bits currently. Some values are stored inline (integers, decimals,
> dates, dateTimes).
>
> There is only ever one copy of a string. Triples are 3 ids, and joins are done
> on ids - fixed-length byte sequences. It assumes the full-length MD5 hash is
> unique for each literal in the DB.
>
> Incoming RDF data to TDB is expected to be correct; no alterations are made
> except on inlined values.
>
> Data should be checked first for all sorts of things, because a bad triple
> halfway through a load is hard to deal with neatly.
>
> In fact, the RIOT parsers don't normalize, as it's expensive (although RDF/XML
> might - I don't know). But you could pre-process the data to normalize it if you
> believe it's going to be a problem.
>
> RDF tends towards NFC ("recommends" = SHOULD) for literals. There are 4 kinds
> of normalization.
>
> I have a background project-ette for canonicalization and more rigorous
> validation as a parser pipeline stage - that could do Unicode NFC.
>
>     Andy
>
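[Editor's note: the point about joining on fixed-length byte sequences rather than strings can be illustrated with a small sketch. This is not TDB's actual node-table code, just a demonstration that an MD5 digest gives every literal, whatever its length, the same fixed-size identifier.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FixedLengthIdDemo {
    // Digest a literal's UTF-8 bytes with MD5; the result is always
    // 16 bytes (128 bits), regardless of how long the literal is.
    static byte[] literalId(String literal) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(literal.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] shortId = literalId("a");
        byte[] longId  = literalId("a much longer literal value stored in the database");

        // Both digests have the same fixed length, so comparisons
        // (and hence joins) cost the same whatever the string length.
        System.out.println(shortId.length); // 16
        System.out.println(longId.length);  // 16
    }
}
```

The uniqueness assumption Andy mentions is exactly the usual one for cryptographic hashes: collisions are possible in principle but vanishingly unlikely in practice.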
