== Summary
This email covers the changes in RDF 1.1 around plain literals. In RDF
1.1, all literals have a datatype.
* simple literals have datatype xsd:string.
* literals with a language tag have datatype rdf:langString.
This change may have some impact on databases.
== RDF 1.1
The current situation for RDF (known as RDF-2004) is that "plain
literals" are literals which have no datatype. They are either "simple
literals" (no datatype, no language tag) or have a language tag. A
literal does not have both a language tag and a datatype in RDF-2004.
In RDF 1.1, all literals always have a datatype.
* simple literals have datatype xsd:string.
  Simple literals and xsd:strings are the same RDF term.
* literals with a language tag have datatype rdf:langString.
This is a change, but the working group believes it is a small one.
Mixed data, with both plain literals and xsd:string, is assumed to be
rare. Of the two changes, the first one, simple literal/xsd:string, is
the more significant.
== Example
Previously:
:s :p "foo" .
:s :p "foo"^^xsd:string .
were two triples. In RDF 1.1 this is a graph of one triple, because
a graph is a set of triples; "foo" and "foo"^^xsd:string are different
ways of writing the same term, much as the following shows two ways to
write the same triple:
---------
@prefix : <http://example/> .
:x :p 123 .
<http://example/x> :p 123 .
---------
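The set behaviour can be sketched in plain Java. This is an
illustration only, not Jena code: SketchLiteral and GraphAsSetSketch are
hypothetical names, language tags are left out, and the RDF 1.1 rule is
modelled simply by giving a simple literal the datatype xsd:string when
it is created.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical minimal literal for illustration; this is not the Jena API.
final class SketchLiteral {
    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    final String lexical;
    final String datatype; // never null under RDF 1.1

    private SketchLiteral(String lexical, String datatype) {
        this.lexical = lexical;
        this.datatype = datatype;
    }

    // "foo" : a simple literal is given datatype xsd:string.
    static SketchLiteral simple(String lexical) {
        return new SketchLiteral(lexical, XSD_STRING);
    }

    // "foo"^^<datatype>
    static SketchLiteral typed(String lexical, String datatype) {
        return new SketchLiteral(lexical, datatype);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchLiteral)) return false;
        SketchLiteral other = (SketchLiteral) o;
        return lexical.equals(other.lexical) && datatype.equals(other.datatype);
    }

    @Override
    public int hashCode() {
        return Objects.hash(lexical, datatype);
    }
}

final class GraphAsSetSketch {
    // Adds the two triples from the example and returns the graph size.
    static int tripleCount() {
        Set<List<Object>> graph = new HashSet<>();
        graph.add(Arrays.<Object>asList(":s", ":p", SketchLiteral.simple("foo")));
        graph.add(Arrays.<Object>asList(":s", ":p",
                SketchLiteral.typed("foo", SketchLiteral.XSD_STRING)));
        return graph.size();
    }
}
```

Because both ways of writing the literal produce equal terms, the second
add is a no-op and the set holds a single triple.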
== Syntax
The change shows up in how syntax is treated on input and on output:
On input, a simple literal and an xsd:string create the same RDF term,
with datatype xsd:string. A language tag causes a literal with datatype
rdf:langString, and that language tag, to be created.
On output, the plain literal forms are used. xsd:string and
rdf:langString do not appear in the output.
(Aside: rdf:PlainLiteral should never appear in RDF data, but we could
apply the same transformation to the canonical value form.)
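The output rule can be sketched as a small formatting function. This is
an assumption-laden illustration, not a Jena writer: PlainFormWriter is
a hypothetical name, datatypes are written as full IRIs rather than
prefixed names, and only backslash and double quote are escaped.

```java
// Hypothetical writer sketch; a real Turtle writer handles more escapes
// and prefixed names.
final class PlainFormWriter {
    static final String XSD_STRING =
            "http://www.w3.org/2001/XMLSchema#string";
    static final String RDF_LANGSTRING =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";

    static String write(String lexical, String lang, String datatype) {
        String quoted = "\""
                + lexical.replace("\\", "\\\\").replace("\"", "\\\"")
                + "\"";
        if (RDF_LANGSTRING.equals(datatype)) {
            // Language-tagged literal: the tag is written, the datatype is not.
            return quoted + "@" + lang;
        }
        if (XSD_STRING.equals(datatype)) {
            // xsd:string: written in the simple literal form.
            return quoted;
        }
        // Any other datatype is written explicitly.
        return quoted + "^^<" + datatype + ">";
    }
}
```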
== Effects
(due to xsd:string)
Systems using xsd:string, and sensitive to an explicit type, are
affected. At a guess, OWL systems, maybe Protégé (but I have no
evidence one way or the other). They seem to have xsd:strings in the
data and, until converted, may see data without an explicit xsd:string
and get confused.
The number of triples changes IF the same subject/predicate is used
with both a simple literal and an xsd:string of the same lexical form.
Generally, I see data that either uses xsd:string or uses simple
literals. Mixing seems quite rare.
== Jena
(xsd:string)
Jena in-memory already equates simple literals and xsd:strings for
searching (i.e. Graph.find) so while the number of results can change,
it should not be a case of not finding data.
The worst case is producing data for other systems that are not RDF 1.1
and do expect an explicit xsd:string datatype on literals.
== RDF API users
(rdf:langString)
The key is "test language before datatype" - if tested that way round
the appearance of rdf:langString will not matter. If the test is
"datatype first, null meaning plain literal", it will matter.
I doubt much code outside Jena does this sort of thing. It is something
writers do, so that needs checking completely, but it is just a case of
finding all the calls of getLiteralLanguage().
This is the most significant rdf:langString related change as far as I
can see.
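The two test orders can be sketched as follows. This is a hedged
illustration using plain strings rather than Jena's Node: the class
name, method names and the "lang"/"simple"/"typed" labels are all
hypothetical, and a missing language tag is taken to be either null or
the empty string.

```java
// Hypothetical sketch of the two test orders; not Jena code.
final class LiteralKindCheck {
    static final String RDF_LANGSTRING =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";

    // Robust: test the language tag first, then the datatype.
    static String languageFirst(String lang, String datatype) {
        if (lang != null && !lang.isEmpty()) return "lang";
        if (datatype == null) return "simple";
        return "typed";
    }

    // Fragile: test the datatype first, null meaning plain literal.
    static String datatypeFirst(String lang, String datatype) {
        if (datatype != null) return "typed";
        if (lang != null && !lang.isEmpty()) return "lang";
        return "simple";
    }
}
```

With an RDF-2004 style literal (language "en", datatype null) both
methods answer "lang". Once the datatype is rdf:langString, the
language-first test still answers "lang" but the datatype-first test
answers "typed".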
== SPARQL
(xsd:string)
SPARQL already has some adaptation:
datatype("x") = xsd:string (SPARQL 1.0)
datatype("x"@en) = rdf:langString (SPARQL 1.1)
Due to the xsd:string change, matching basic graph patterns may produce
a result it didn't before:
{ ?x :p "foo"^^xsd:string } will match data :x :p "foo"
{ ?x :p "foo" } will match data :x :p "foo"^^xsd:string
The change also makes it easier to optimize FILTER(?x = "foo").
== Databases
(xsd:string)
Anything that relies on a hash of the literal, in a system that uses
xsd:string, will need to reload. If keeping simple literals and
xsd:strings apart currently includes hashing them differently, then
this change is significant.
This does affect TDB and SDB.
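Why a reload is needed can be sketched with a toy hash. This is not how
TDB or SDB compute node hashes: the key layout (lexical form, language
tag, datatype IRI joined with "|"), the use of SHA-256 and all names
below are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical literal hashing; the key layout is an assumption.
final class LiteralHashSketch {
    static final String XSD_STRING =
            "http://www.w3.org/2001/XMLSchema#string";

    static byte[] hash(String lexical, String lang, String datatype) {
        String key = lexical + "|" + lang + "|" + datatype;
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // RDF-2004 style: a simple literal is hashed with no datatype.
    static byte[] simple2004(String lexical) {
        return hash(lexical, "", "");
    }

    // RDF 1.1 style: a simple literal is hashed as an xsd:string.
    static byte[] simple11(String lexical) {
        return hash(lexical, "", XSD_STRING);
    }

    // True if an old stored hash would be found by an RDF 1.1 lookup.
    static boolean oldHashStillFound(String lexical) {
        return Arrays.equals(simple2004(lexical), simple11(lexical));
    }
}
```

Under these assumptions a hash stored for the simple literal "foo"
before the change is not the hash computed for "foo" afterwards, so
stored data has to be rehashed, that is, reloaded.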
== Compatibility
We could provide some compatibility:
1/ The ability to write data with explicit xsd:string
2/ Hide rdf:langString from Node.getLiteralDatatype()
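Both options can be sketched in a few lines. This is a guess at the
shape of such a shim, not an existing Jena feature: CompatSketch,
writeExplicit and legacyDatatype are hypothetical names, datatypes are
written as full IRIs, and only backslash and double quote are escaped.

```java
// Hypothetical compatibility helpers; not part of Jena.
final class CompatSketch {
    static final String RDF_LANGSTRING =
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString";

    // Option 1: write every non-language literal with an explicit
    // datatype, so xsd:string appears in the output.
    static String writeExplicit(String lexical, String lang, String datatype) {
        String quoted = "\""
                + lexical.replace("\\", "\\\\").replace("\"", "\\\"")
                + "\"";
        if (RDF_LANGSTRING.equals(datatype)) {
            return quoted + "@" + lang;
        }
        return quoted + "^^<" + datatype + ">";
    }

    // Option 2: report no datatype (null) for language-tagged literals,
    // as RDF-2004 code expects; other datatypes pass through unchanged.
    static String legacyDatatype(String datatype) {
        return RDF_LANGSTRING.equals(datatype) ? null : datatype;
    }
}
```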
What does not work is recording whether an RDF term was originally
written as xsd:string or as a simple literal. That could end up with
two different terms (Nodes) that represent the same term, or
non-determinism depending on which term is seen first.
Andy