Re: TDB Literal Canonicalization

Andy Seaborne Fri, 12 Aug 2011 03:14:03 -0700

The reply to Ian is the current state.

It could be changed - take a more value-oriented approach throughout.


(longer term thinking out loud, not plans, nor likely next steps).

1/ RIOT parsers could canonicalize data.

This is a possible approach to simple literals/xsd:strings for RDF 1.1 anyway.

We could canonicalize to xsd:decimal, or canonicalize integer-valued decimals to integer.


org.openjena.riot.pipeline.normalize

XSD 1.0 -> XSD 1.1 changes the canonical lexical form of integer-valued decimals from 78.0 to 78.
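To make the 78.0 versus 78 difference concrete, here is a minimal sketch of the two canonical lexical forms, using only java.math.BigDecimal. The class and method names are made up for illustration; this is not the code in org.openjena.riot.pipeline.normalize, and it does not validate that the input is a legal xsd:decimal lexical form.

```java
import java.math.BigDecimal;

// Hypothetical helper, for illustration only (not part of RIOT).
final class DecimalCanonicalizer {

    // XSD 1.1 style: no trailing zeros; integer-valued decimals have no decimal point.
    static String canonicalXsd11(String lexical) {
        BigDecimal value = new BigDecimal(lexical.trim());
        if (value.signum() == 0) {
            return "0"; // handle zero explicitly rather than rely on stripTrailingZeros
        }
        // toPlainString avoids exponent notation such as 1E+2 for 100.
        return value.stripTrailingZeros().toPlainString();
    }

    // XSD 1.0 style: always a decimal point with at least one fraction digit.
    static String canonicalXsd10(String lexical) {
        String plain = canonicalXsd11(lexical);
        return plain.contains(".") ? plain : plain + ".0";
    }
}
```

With this sketch, canonicalXsd10("78") gives "78.0" while canonicalXsd11("78.0") gives "78". Whether the resulting literal is then typed as xsd:integer or kept as xsd:decimal is the policy question; the sketch only covers the lexical form.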


Potential parsing costs [*]

2/ ARQ/TDB query execution could specially handle XSD values to look for both.


So

{ ?x :p 123 . }   => { ?x :p 123 . } union { ?x :p 123.0 . }
{ ?x :p 123.0 . } => { ?x :p 123 . } union { ?x :p 123.0 . }

It's rather easier for constants.
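A rough sketch of the constant case as a textual rewrite, in plain Java. The class and method names are hypothetical and this is not how ARQ represents or rewrites patterns (it works on algebra objects, not strings); it only enumerates the canonical integer and decimal spellings, so a stored "123.00" would still need value-based matching.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of expanding a numeric constant into a union of patterns.
final class NumericPatternExpander {

    // The spellings to look for: "123" and "123.0" for integer values,
    // or just the canonical decimal for values with a fraction part.
    static List<String> equivalentForms(String lexical) {
        BigDecimal value = new BigDecimal(lexical);
        String plain = value.signum() == 0
                ? "0"
                : value.stripTrailingZeros().toPlainString();
        List<String> forms = new ArrayList<>();
        forms.add(plain);
        if (!plain.contains(".")) {
            forms.add(plain + ".0");
        }
        return forms;
    }

    // Builds one triple pattern per spelling and joins them with UNION.
    static String expand(String subject, String predicate, String lexical) {
        List<String> branches = new ArrayList<>();
        for (String form : equivalentForms(lexical)) {
            branches.add("{ " + subject + " " + predicate + " " + form + " . }");
        }
        return String.join(" UNION ", branches);
    }
}
```

Both expand("?x", ":p", "123") and expand("?x", ":p", "123.0") produce the same two-branch union, which is the point: the query no longer depends on which spelling the user wrote.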

{ ?x :p1 ?v ; :p2 ?v . } and doing value equality is doable, quite easily with an index join, but I'd need to think more about merge joins (not currently used anyway).

Any and all random thoughts and comments welcome - I guess the real issue is to decide a policy for Jena.

How much to work in terms of "value" and how much to work preserving the representational differences, e.g. this can change COUNT() results.
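A toy illustration of the COUNT() point, in plain Java rather than SPARQL. The names are made up; the list stands in for the lexical forms bound to one variable, and counting distinct entries by term versus by value gives different answers.

```java
import java.math.BigDecimal;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo: the same bindings counted under two policies.
final class CountPolicyDemo {

    // Term-oriented: "47" and "47.0" are different terms.
    static int countDistinctTerms(List<String> lexicalForms) {
        return new HashSet<>(lexicalForms).size();
    }

    // Value-oriented: TreeSet orders by BigDecimal.compareTo, which ignores
    // scale, so 47 and 47.0 collapse into one entry.
    static int countDistinctValues(List<String> lexicalForms) {
        Set<BigDecimal> values = new TreeSet<>();
        for (String lexical : lexicalForms) {
            values.add(new BigDecimal(lexical));
        }
        return values.size();
    }
}
```

For the forms "47", "47.0" and "48", the term count is 3 and the value count is 2.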


        Andy

[*] On N-triples loading:

When loading at scale, this is a possibly appreciable cost. The N-triples load path is already fairly streamlined and an extra step of check-copy may be a visible cost. N-triples parsing is not strongly I/O bound - it reads large chunks in a streaming fashion and files tend to be generated all at once, causing the disk blocks to be laid out nicely.

Costs may be offset by some concurrent processing - I did do one simple experiment and found that concurrent was faster, so concurrency costs were not bigger than the gains from using more threads.




On 12/08/11 10:03, Andy Seaborne wrote:



On 11/08/11 22:41, Ian Emmons wrote:

TDB experts,

At [1], the TDB documentation indicates that TDB will regard
"47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
them in a query. However, when I store the former and query for the
latter, TDB does not return the expected result.


TDB stores the values of integer and decimal, but it does still keep
those two types apart. The rules of XSD arithmetic try not to over-promote
datatypes, e.g. integer + integer is integer.

I guess "by query" you are putting the decimal directly in a graph
pattern. They are the same value in FILTERs.


I've attached a small sample program and the .ttl file that it reads
so that you can reproduce the problem. My question is, what am I
doing wrong, here?


The attachments are empty - and indeed the [1] link is in the second
attachment. I can send you the raw source of the message I received if
that helps.

Andy


Thanks,

Ian











[1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization
