Re: Question about basic vs extended language ranges

Andy Seaborne Sat, 12 Sep 2020 11:52:10 -0700



On 12/09/2020 17:58, Håvard Ottestad wrote:

Hi Andy,

Thanks for answering.

Do I understand correctly that a range like “en-*” is not a basic range and 
that Jena supporting it is not fully in line with what the SPARQL spec 
requires?


It is not a basic range.
(Your choices are exactly "*", or a prefix of subtags)
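
A minimal sketch of basic filtering as described in RFC 4647 section 3.3.1, to make the "exactly "*", or a prefix of subtags" rule concrete. The function name basic_lang_matches is made up for illustration and is not Jena's or RDF4J's API; the treatment of "*" as matching any non-empty tag follows the SPARQL langMatches definition.

```python
def basic_lang_matches(tag: str, language_range: str) -> bool:
    """Basic filtering (RFC 4647, section 3.3.1), compared case-insensitively."""
    if language_range == "*":
        # SPARQL langMatches: "*" matches any non-empty language tag.
        return tag != ""
    tag, language_range = tag.lower(), language_range.lower()
    # Either the range equals the tag, or it is a prefix of the tag
    # that ends exactly at a subtag boundary ("-").
    return tag == language_range or tag.startswith(language_range + "-")


print(basic_lang_matches("en-GB", "en"))          # True
print(basic_lang_matches("eng", "en"))            # False: "en" does not end at a boundary
print(basic_lang_matches("de-Latn-DE", "de-DE"))  # False under basic filtering
```

A range such as "en-*" gets no special treatment here: it is compared literally, so it only matches a tag that itself contains a "*" subtag.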

The SPARQL spec does not require specific behaviour for language tag patterns outside basic.

"requires" is a tricky thing in SPARQL because there is an extensibility mechanism, e.g. xsd:date support in any function call.


    Andy


Cheers,
Håvard

On 12 Sep 2020, at 18:31, Andy Seaborne wrote:

This is from a discussion this last week:

    https://github.com/TopQuadrant/shacl/issues/100

On 12/09/2020 11:55, Håvard Ottestad wrote:

Hi,
I’ve been trying to get basic language ranges working for the SHACL engine in 
RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena 
implement basic language ranges.
The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
Specifically sections
  - 2.1.  Basic Language Range
  - 3.3.1.  Basic Filtering
Looking at the ABNF in 2.1.
    language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
    alphanum         = ALPHA / DIGIT
It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even 
“a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
It seems like the range “en” would match a tag “en-gb” and a tag “en”.
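
The reading of the ABNF above can be checked mechanically. This is a small sketch that transcribes the section 2.1 grammar into a regular expression; the names BASIC_RANGE and is_basic_range are made up for illustration.

```python
import re

# language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
# alphanum       = ALPHA / DIGIT
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(candidate: str) -> bool:
    """True if the whole string is a basic language range per RFC 4647, 2.1."""
    return BASIC_RANGE.fullmatch(candidate) is not None


for candidate in ["*", "en", "en-gb", "a-ab-abc-12345678-a", "*-gb", "en-*"]:
    print(candidate, is_basic_range(candidate))
# * True
# en True
# en-gb True
# a-ab-abc-12345678-a True
# *-gb False
# en-* False
```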
I had a deep dive into the langMatch code in Jena and it seems to support “*” 
at any position in the range.
Is Jena supporting part of the extended range specification,


Jena LangMatches supports basic matching as required by SPARQL and SHACL, and does match 
some cases of "-*" but not properly by full RFC 4647. More by accident than 
design, I suspect.

Calling it "part of extended" is generous. It fails to match "-*" against 
multiple subtags.

Basic is not completely compatible with extended.

Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.

Extended is sensitive to the fact that the second subtag, 'script', is 4ALPHA, and 'region' is 2ALPHA or 3DIGIT, so 
"de-DE" matches like "de-*-DE" on language and region, skipping script. Each part of a 
language tag has a slightly different syntax, and extended filtering seems to depend on this to do its jump ahead 
for "-*".
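
For comparison, here is a rough sketch of extended filtering following the numbered steps in RFC 4647 section 3.3.2, assuming well-formed input. The function name extended_lang_matches is made up for illustration, and this is not what Jena's code does. In this reading of the RFC, the jump ahead is driven by skipping non-matching subtags until a singleton (a one-character subtag) is reached, rather than by recognising script or region shapes directly.

```python
def extended_lang_matches(tag: str, language_range: str) -> bool:
    """Extended filtering (RFC 4647, section 3.3.2), case-insensitive sketch."""
    if not tag or not language_range:
        return False
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # First subtags must be equal, unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    position = 1  # index of the next tag subtag to look at
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue  # a wildcard inside the range is simply skipped
        while True:
            if position >= len(tag_subtags):
                return False  # ran out of tag subtags
            current = tag_subtags[position]
            position += 1
            if current == wanted:
                break  # matched, move on to the next range subtag
            if len(current) == 1:
                return False  # never skip past a singleton such as "x"
            # otherwise skip this tag subtag and keep looking
    return True


print(extended_lang_matches("de-Latn-DE", "de-DE"))    # True
print(extended_lang_matches("de-DE", "de-*-DE"))       # True
print(extended_lang_matches("de-x-DE", "de-DE"))       # False
print(extended_lang_matches("en-Latn-GB-x-a", "en-*")) # True
```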

I haven't got my head around the full impact of extended matching. It assumes 
valid language tags, but invalid (by RFC 5646) language tags exist. In the real 
world, bad tags are common.

But SPARQL and Turtle have a catch-all parse syntax based on the earlier RFC 
3066 and HTTP at the time. And in the real world, bad tags are common.

"a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in several 
ways; it is legal by 3066.

To add to the language tag fun, RDF and RFC 4646 disagree on the canonical form 
of language tags.

or am I missing something? (I have been missing a lot of things lately :P so 
I wouldn’t be surprised).


This? :-)
https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566

"""
The NodeFunctions.langMatches code does look like it gets basic matching right (as SPARQL 
requires), test cases to the contrary welcome, but the handling of extended matching 
looks wrong for "-*" with multiple occurrences of subtags.

Extended matching is complicated and relies on (1) valid language tag input (2) 
the different parts of a language tag having different syntax.

"de-DE" does not match "de-Latn-DE" by basic but does by extended.
"""

    Andy

Cheers,
Håvard
PS: From 2.2.  Extended Language Range
    extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
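
As a quick check of the section 2.2 grammar, here is a sketch that transcribes it into a regular expression. The names EXTENDED_RANGE and is_extended_range are made up for illustration. It shows that ranges with "*" in any position are well-formed extended ranges.

```python
import re

# extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*")


def is_extended_range(candidate: str) -> bool:
    """True if the whole string is an extended language range per RFC 4647, 2.2."""
    return EXTENDED_RANGE.fullmatch(candidate) is not None


for candidate in ["*", "en", "en-*", "*-gb", "de-*-DE", "en--gb"]:
    print(candidate, is_extended_range(candidate))
# * True
# en True
# en-* True
# *-gb True
# de-*-DE True
# en--gb False
```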
