The reply to Ian is the current state.
It could be changed - take a more value-oriented approach throughout.
(Longer-term thinking out loud; not plans, nor likely next steps.)
1/ RIOT parsers could canonicalize data.
This is a possible approach to simple literals/xsd:strings for RDF 1.1
anyway.
We could canonicalize to xsd:decimal, or canonicalize integer valued
decimals to integer.
org.openjena.riot.pipeline.normalize
XSD 1.0 -> XSD 1.1 changes the canonical lexical form of integer-valued
decimals from 78.0 to 78.
Potential parsing costs [*]
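As a rough illustration of what parse-time canonicalization could look like, here is a minimal sketch in Python (not Jena code, and not what org.openjena.riot.pipeline.normalize actually does). The function name canonicalize and the policy parameter are hypothetical. The "integer" policy turns integer-valued decimals into xsd:integer; the "decimal" policy keeps everything as xsd:decimal using the XSD 1.1 style lexical form (78, not 78.0).

```python
import re
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
XSD_INTEGER = XSD + "integer"
XSD_DECIMAL = XSD + "decimal"

# Lexical spaces: no exponents, no NaN/INF.
_INTEGER = re.compile(r"[+-]?\d+")
_DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def canonicalize(lexical, datatype, policy="integer"):
    """Return a (lexical, datatype) pair in canonical form.

    policy="integer": integer-valued decimals become xsd:integer.
    policy="decimal": integers and decimals all become xsd:decimal.
    Anything else (other datatypes, ill-formed literals) is left unchanged.
    """
    if datatype == XSD_INTEGER:
        pattern = _INTEGER
    elif datatype == XSD_DECIMAL:
        pattern = _DECIMAL
    else:
        return lexical, datatype
    if not pattern.fullmatch(lexical):
        return lexical, datatype  # ill-formed: preserve as written

    value = Decimal(lexical)
    if value == value.to_integral_value():
        target = XSD_INTEGER if policy == "integer" else XSD_DECIMAL
        return str(int(value)), target

    # Non-integer: plain notation, trailing zeros of the fraction removed.
    return format(value, "f").rstrip("0"), XSD_DECIMAL


print(canonicalize("78.0", XSD_DECIMAL))
print(canonicalize("+01.50", XSD_DECIMAL))
```

The extra validate-and-rewrite step per literal is exactly the kind of per-triple work that may show up as a load-time cost.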
2/ ARQ/TDB query execution could specially handle XSD values to look for
both.
So
{ ?x :p 123 . }   => { ?x :p 123 . } union { ?x :p 123.0 . }
{ ?x :p 123.0 . } => { ?x :p 123 . } union { ?x :p 123.0 . }
It's rather easier for constants.
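The rewrite for constants can be sketched as a plain string transformation. This is a Python illustration only, not ARQ's algebra or optimizer API; value_variants and expand_pattern are hypothetical names. It assumes the constant is written as a SPARQL integer or decimal, and that the store holds one normalized spelling per value and datatype (so 123.00 need not be tried separately).

```python
import re
from decimal import Decimal

# SPARQL integer or decimal shorthand (no exponent, so no doubles).
_NUMERIC = re.compile(r"[+-]?(\d+|\d*\.\d+)")


def value_variants(lexical):
    """Spellings to look up for one numeric constant."""
    if not _NUMERIC.fullmatch(lexical):
        raise ValueError("not an integer or decimal constant: " + lexical)
    value = Decimal(lexical)
    if value != value.to_integral_value():
        return [lexical]  # e.g. 1.5 has no integer spelling
    n = int(value)
    return [str(n), f"{n}.0"]  # integer form, then decimal form


def expand_pattern(subject, predicate, lexical):
    """Rewrite one triple pattern into a UNION over the value's spellings."""
    blocks = [
        f"{{ {subject} {predicate} {v} . }}" for v in value_variants(lexical)
    ]
    return " UNION ".join(blocks)


print(expand_pattern("?x", ":p", "123"))
print(expand_pattern("?x", ":p", "123.0"))
```

Both calls produce the same two-branch union, matching the two rewrites above.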
{ ?x :p1 ?v ; :p2 ?v . } and doing value equality is doable, quite
easily with an index join, but I'd need to think more about merge joins
(not currently used anyway).
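For the shared-variable case, the idea of joining on value rather than on term can be sketched as follows. This is a Python illustration under stated assumptions: terms are (lexical, datatype) pairs, the in-memory dictionary stands in for the store's index, and value_key and index_join are hypothetical names, not ARQ or TDB code.

```python
from collections import defaultdict
from decimal import Decimal

NUMERIC = {"xsd:integer", "xsd:decimal"}


def value_key(term):
    """Join key: numeric terms compare by value, all others by term."""
    lexical, datatype = term
    if datatype in NUMERIC:
        # Decimal("123") and Decimal("123.0") are equal and hash alike.
        return ("numeric", Decimal(lexical))
    return ("term", lexical, datatype)


def index_join(left, right):
    """Join (x, v) rows for :p1 with (x, v) rows for :p2 on ?x and ?v,
    using value equality for ?v."""
    index = defaultdict(list)
    for x, v in right:
        index[(x, value_key(v))].append(v)
    results = []
    for x, v in left:
        for v2 in index.get((x, value_key(v)), []):
            results.append((x, v, v2))
    return results


p1_rows = [("s1", ("123", "xsd:integer")), ("s2", ("5", "xsd:integer"))]
p2_rows = [("s1", ("123.0", "xsd:decimal")), ("s2", ("6", "xsd:integer"))]
print(index_join(p1_rows, p2_rows))
```

Note that each result row carries both terms: once the join is by value, a policy is still needed for which representation ?v is bound to.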
Any and all random thoughts and comments welcome - I guess the real
issue is to decide a policy for Jena.
How much to work in terms of "value" and how much to work preserving the
representational differences? e.g. this can change COUNT() results.
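A small Python illustration of the COUNT() point, with made-up data and a hypothetical helper by_value: a graph is a set of triples, so two triples that differ only in how 47 is written collapse into one when compared by value.

```python
from decimal import Decimal


def by_value(term):
    """Compare numeric terms by value, everything else by term."""
    lexical, datatype = term
    if datatype in ("xsd:integer", "xsd:decimal"):
        return ("numeric", Decimal(lexical))
    return ("term", lexical, datatype)


# Two distinct triples when the representation is preserved.
triples = {
    (":s", ":p", ("47", "xsd:integer")),
    (":s", ":p", ("47.0", "xsd:decimal")),
}

count_by_term = len(triples)
count_by_value = len({(s, p, by_value(o)) for s, p, o in triples})
print(count_by_term, count_by_value)  # prints: 2 1
```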
Andy
[*] On N-triples loading:
When loading at scale, this is a possibly appreciable cost. The
N-triples load path is already fairly streamlined and an extra step of
check-copy may be a visible cost. N-triples parsing is not strongly I/O
bound - it reads large chunks in a streaming fashion and files tend to be
generated all at once, causing the disk blocks to be laid out nicely.
Costs may be offset by some concurrent processing - I did do one simple
experiment and found that concurrent loading was faster, so the
concurrency costs were not bigger than the gains from using more threads.
On 12/08/11 10:03, Andy Seaborne wrote:
On 11/08/11 22:41, Ian Emmons wrote:
TDB experts,
At [1], the TDB documentation indicates that TDB will regard
"47"^^xsd:integer and "47.0"^^xsd:decimal as the same value and match
them in a query. However, when I store the former and query for the
latter, TDB does not return the expected result.
TDB stores the values of integer and decimal, but it does still keep
those two types apart. The rules of XSD arithmetic try not to
over-promote datatypes, e.g. integer + integer is integer.
I guess by "query" you mean you are putting the decimal directly in a
graph pattern. They are the same value in FILTERs.
I've attached a small sample program and the .ttl file that it reads
so that you can reproduce the problem. My question is, what am I
doing wrong, here?
The attachments are empty - and indeed the [1] link is in the second
attachment. I can send you the raw source of the message I received if
that helps.
Andy
Thanks,
Ian
[1] http://jenawiki.hpl.hp.com/wiki/TDB/ValueCanonicalization