== Summary

This email covers the changes in RDF 1.1 around plain literals. In RDF 1.1, all literals have a datatype.

* simple literals have datatype xsd:string.
* literals with a language tag have datatype rdf:langString.

This change may have some impact on databases.

== RDF 1.1

The current situation for RDF (known as RDF-2004) is that "plain literals" are literals which have no datatype. They are either "simple literals" (no datatype, no language tag) or have a language tag. A literal does not have both a language tag and a datatype in RDF-2004.

In RDF 1.1, every literal has a datatype.

* simple literals have datatype xsd:string.
  simple literals and xsd:strings are the same RDF term.

* literals with a language tag have datatype rdf:langString.

This is a change, but the working group believes it is a small one. Mixed data, with both simple literals and xsd:strings, is assumed to be rare.

The first one, simple literal/xsd:string, is the more significant change.

== Example

Previously:

:s :p "foo" .
:s :p "foo"^^xsd:string .

was two triples. In RDF 1.1 it is a graph of one triple, because a graph is a set of triples and "foo" and "foo"^^xsd:string are different ways of writing the same term. This is much like the following, which shows two ways to write the same triple:

---------
@prefix : <http://example/> .

:x :p 123 .
<http://example/x> :p 123 .
---------
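To make the triple count concrete, here is a minimal Python sketch. It is not Jena code: the tuple representation and the function names are assumptions made for illustration only. It builds the same two parsed triples into a set under each reading of the spec.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# The two triples from the example, as (subject, predicate, (lexical form, datatype)).
# None stands for "no datatype was written".
parsed = [
    (":s", ":p", ("foo", None)),
    (":s", ":p", ("foo", XSD_STRING)),
]

def as_rdf_2004(triple):
    # RDF-2004: a simple literal and an xsd:string literal stay distinct terms.
    return triple

def as_rdf_11(triple):
    # RDF 1.1: a literal written without a datatype is an xsd:string.
    s, p, (lexical, datatype) = triple
    return (s, p, (lexical, datatype or XSD_STRING))

# A graph is a set of triples, so equal triples collapse into one.
graph_2004 = {as_rdf_2004(t) for t in parsed}
graph_11 = {as_rdf_11(t) for t in parsed}

print(len(graph_2004), len(graph_11))
```

Running it prints `2 1`: two triples under RDF-2004, one under RDF 1.1.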

== Syntax

The change takes effect in the treatment of syntax, on input and on output:

On input, a simple literal and an xsd:string create the same RDF term, with datatype xsd:string. A language tag creates a literal with datatype rdf:langString and that language tag.

On output, the plain literal forms are used. xsd:string and rdf:langString do not appear in the output.

(Aside: rdf:PlainLiteral should never appear in RDF data, but we could apply the same transforms to the canonical value form.)
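The input and output rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Jena parser or writer: `read_literal`, `write_literal` and the `(lexical, datatype, lang)` tuple are hypothetical names, and string escaping is left out.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

def read_literal(lexical, lang=None, datatype=None):
    """Build the RDF 1.1 term (lexical, datatype, lang) for a parsed literal."""
    if lang:
        # A language tag always means datatype rdf:langString.
        return (lexical, RDF_LANGSTRING, lang)
    # No datatype written: the term is an xsd:string.
    return (lexical, datatype or XSD_STRING, None)

def write_literal(term):
    """Write a term in Turtle-like syntax, preferring the plain literal forms."""
    lexical, datatype, lang = term
    quoted = '"' + lexical + '"'  # escaping omitted to keep the sketch short
    if lang:
        return quoted + "@" + lang       # rdf:langString is never written out
    if datatype == XSD_STRING:
        return quoted                    # xsd:string is never written out
    return quoted + "^^<" + datatype + ">"
```

With these definitions, `read_literal("foo")` and `read_literal("foo", datatype=XSD_STRING)` are the same term, and both are written back as `"foo"`.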

== Effects
(due to xsd:string)

Systems using xsd:string, and sensitive to an explicit type, are affected. At a guess, OWL systems, maybe Protégé (but I have no evidence one way or the other). They seem to have xsd:strings in the data and, until converted, may see data without an explicit xsd:string and get confused.

The number of triples changes IF the same subject and predicate are used with both a simple literal and an xsd:string of the same lexical form.

Generally, I see data that either uses xsd:string or uses simple literals. Mixing seems quite rare.

== Jena
(xsd:string)

Jena in-memory already equates simple literals and xsd:strings for searching (i.e. Graph.find) so, while the number of results can change, it should not be a case of not finding data.

The worst case is producing data for other systems that are not RDF 1.1 and do expect an explicit xsd:string datatype on literals.

== RDF API users
(rdf:langString)

The key is "test language before datatype" - if tested that way round the appearance of rdf:langString will not matter. If the test is "datatype first, null meaning plain literal", it will matter.

I doubt much code outside Jena does this sort of thing. It is something writers do, so that code needs checking thoroughly, but it is just a case of finding all the calls of getLiteralLanguage().

This is the most significant rdf:langString-related change as far as I can see.
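The "test language before datatype" rule can be shown with a small Python model. It is a sketch only: the `Literal` class is hypothetical and merely stands in for a node with a language and a datatype accessor. The conventions that an empty string means "no language tag" and that `None` means "no datatype" are assumptions of the sketch.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

@dataclass(frozen=True)
class Literal:
    lexical: str
    language: str = ""              # "" means no language tag (assumption)
    datatype: Optional[str] = None  # None means no datatype (RDF-2004 style)

def kind_language_first(lit):
    # Robust: the language tag decides, whatever the datatype says.
    if lit.language:
        return "language-tagged"
    if lit.datatype is None:
        return "simple"
    return "typed"

def kind_datatype_first(lit):
    # Fragile: assumes a language-tagged literal never has a datatype.
    if lit.datatype is not None:
        return "typed"
    if lit.language:
        return "language-tagged"
    return "simple"

old_style = Literal("chat", language="fr")                           # RDF-2004
new_style = Literal("chat", language="fr", datatype=RDF_LANGSTRING)  # RDF 1.1
```

`kind_language_first` gives "language-tagged" for both literals. `kind_datatype_first` gives "language-tagged" for `old_style` but "typed" for `new_style`, which is the breakage to look for.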

== SPARQL
(xsd:string)

SPARQL already has some adaptation:
   datatype("x") = xsd:string           (SPARQL 1.0)
   datatype("x"@en) = rdf:langString    (SPARQL 1.1)

Due to the xsd:string change, matching basic graph patterns may produce a result it didn't before:

{ ?x :p "foo"^^xsd:string }  will match data  :x :p "foo"
{ ?x :p "foo" }              will match data  :x :p "foo"^^xsd:string

It makes it easier to optimize FILTER(?x = "foo")

== Databases
(xsd:string)

Anything that relies on a hash of a literal, in a system that uses xsd:string, will need to reload. If keeping simple literals and xsd:strings apart currently includes hashing them differently, then this change is significant.

This does affect TDB and SDB.
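Why a reload is needed can be seen with a toy hash key in Python. The key layout here (datatype IRI, a separator, the lexical form, hashed with SHA-256) is purely hypothetical and is not how TDB or SDB build their keys. It only shows that a key computed before the change no longer matches the key computed after it.

```python
import hashlib

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

def _digest(lexical, datatype):
    # Hypothetical key layout: datatype IRI (or empty), a separator, the lexical form.
    data = (datatype or "") + "\x00" + lexical
    return hashlib.sha256(data.encode("utf-8")).hexdigest()

def key_rdf_2004(lexical, datatype=None):
    # Simple literals hash with no datatype, so they stay apart from xsd:string.
    return _digest(lexical, datatype)

def key_rdf_11(lexical, datatype=None):
    # Literals without a datatype hash as xsd:string.
    return _digest(lexical, datatype or XSD_STRING)

stored = key_rdf_2004("foo")  # key written when "foo" was loaded as a simple literal
lookup = key_rdf_11("foo")    # key computed for the same literal under RDF 1.1
```

Here `stored != lookup`, so data indexed under the old key would not be found by the new one without reloading.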

== Compatibility

We could provide some compatibility:

1/ The ability to write data with explicit xsd:string
2/ Hide rdf:langString from Node.getLiteralDatatype()

What does not work is recording whether an RDF term was originally written as xsd:string or as a simple literal. That could end up with two different terms (Nodes) that represent the same term, or non-determinism depending on which term is seen first.
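As a rough illustration of the two compatibility options listed above, here is a Python sketch. The function names and the `explicit_xsd_string` flag are hypothetical, not existing Jena settings, and string escaping is omitted.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"

def write_literal(lexical, datatype, lang=None, explicit_xsd_string=False):
    """Option 1: optionally spell out ^^xsd:string for pre-RDF 1.1 consumers."""
    quoted = '"' + lexical + '"'  # escaping omitted
    if lang:
        return quoted + "@" + lang
    if datatype == XSD_STRING and not explicit_xsd_string:
        return quoted
    return quoted + "^^<" + datatype + ">"

def compat_datatype(datatype):
    """Option 2: report no datatype for language-tagged literals, as before."""
    return None if datatype == RDF_LANGSTRING else datatype
```

Both options act only at the boundary (writing, or reading the datatype back), so there is still one term per literal inside the system.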

        Andy
