Re: Question about basic vs extended language ranges

Håvard Ottestad Sat, 12 Sep 2020 09:59:10 -0700

Hi Andy,

Thanks for answering.


Do I understand correctly that a range like “en-*” is not a basic range, and 
that Jena supporting it is not fully in line with what the SPARQL spec 
requires?
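
To make the question concrete, here is a minimal Python sketch that checks a 
range against the two ABNF productions from RFC 4647 (2.1 and 2.2). The 
function names are made up for illustration and are not part of Jena or RDF4J; 
this only tests the syntax of a range, not how an engine matches it.

```python
import re

# RFC 4647 section 2.1:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")

# RFC 4647 section 2.2:
#   extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
EXTENDED_RANGE = re.compile(r"(?:[A-Za-z]{1,8}|\*)(?:-(?:[A-Za-z0-9]{1,8}|\*))*")


def is_basic_range(language_range: str) -> bool:
    """True if the whole string is a syntactically valid basic range."""
    return BASIC_RANGE.fullmatch(language_range) is not None


def is_extended_range(language_range: str) -> bool:
    """True if the whole string is a syntactically valid extended range."""
    return EXTENDED_RANGE.fullmatch(language_range) is not None


if __name__ == "__main__":
    for r in ["*", "en", "en-gb", "a-ab-abc-12345678-a", "en-*", "*-gb"]:
        print(f"{r!r}: basic={is_basic_range(r)} extended={is_extended_range(r)}")
```

With these patterns, “en-*” and “*-gb” are rejected as basic ranges but 
accepted as extended ranges, which is the distinction I am asking about.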

Cheers,
Håvard



> On 12 Sep 2020, at 18:31, Andy Seaborne <[email protected]> wrote:
> 
> This is from a discussion this last week:
> 
>    https://github.com/TopQuadrant/shacl/issues/100
> 
> On 12/09/2020 11:55, Håvard Ottestad wrote:
>> Hi,
>> I’ve been trying to get basic language ranges working for the SHACL engine 
>> in RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena 
>> implement basic language ranges.
>> The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
>> Specifically sections
>>  -  2.1.  Basic Language Range
>>  - 3.3.1.  Basic Filtering
>> Looking at the ABNF in 2.1.
>>    language-range   = (1*8ALPHA *("-" 1*8alphanum)) / "*"
>>    alphanum         = ALPHA / DIGIT
>> It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even 
>> “a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
>> It seems like the range “en” would match a tag “en-gb” and a tag “en”.
>> I had a deep dive into the langMatch code in Jena and it seems to support 
>> “*” at any position in the range.
>> Is Jena supporting part of the extended range specification, 
> 
> Jena LangMatches supports basic matching as required by SPARQL and SHACL, and 
> does match some cases of "-*" but not properly by full RFC 4647. More by 
> accident than design, I suspect.
> 
> Calling it "part of extended" is generous. It fails to match "-*" against 
> multiple subtags.
> 
> Basic is not completely compatible with extended.
> 
> Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.
> 
> Extended is sensitive to the fact the second subtag, 'script', is 4ALPHA, and 
> 'region' is 2ALPHA or 3DIGIT so "de-DE" matches like "de-*-DE" on language 
> and region, skipping script. Each part of a language tag has a slightly 
> different syntax and extended filtering seems to depend on this to do its 
> jump ahead for "-*".
> 
> I haven't got my head around the full impact of extended matching. It assumes 
> valid language tags, yet invalid (by RFC 5646) language tags exist. In the 
> real world, bad tags are common.
> 
> But SPARQL and Turtle have a catch-all parse syntax based on the earlier RFC 
> 3066 and HTTP at the time.
> 
> "a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in several 
> ways; it is legal by 3066.
> 
> To add to the language tag fun, RDF and RFC 4646 disagree on the canonical 
> form of language tags.
> 
>> or am I missing something? (I have been missing a lot of things lately :P 
>> so I wouldn’t be surprised).
> 
> This? :-)
> https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566
> 
> """
> The NodeFunctions.langMatches code does look like it gets basic matching 
> right (as SPARQL requires), test cases to the contrary welcome, but the 
> handling of extended matching looks wrong for "-*" with multiple occurrences 
> of subtags.
> 
> Extended matching is complicated and relies on (1) valid language tag input 
> (2) the different parts of a language tag having different syntax.
> 
> "de-DE" does not match "de-Latn-DE" by basic but does by extended.
> """
> 
>    Andy
> 
>> Cheers,
>> Håvard
>> PS: From 2.2.  Extended Language Range
>>    extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
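
For comparison, the two matching schemes discussed in this thread can be 
sketched in Python, following my reading of RFC 4647 sections 3.3.1 (basic 
filtering) and 3.3.2 (extended filtering). The names basic_filter and 
extended_filter are hypothetical; this is not the Jena or RDF4J code, and it 
assumes the range and tag are already syntactically valid.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """RFC 4647 3.3.1: the range equals the tag, or is a prefix of the tag
    that ends right before a "-". The range "*" matches any tag."""
    r, t = language_range.lower(), tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


def extended_filter(language_range: str, tag: str) -> bool:
    """RFC 4647 3.3.2: subtag-by-subtag comparison, case-insensitive."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # Step 2: the first subtags must match (a leading "*" matches anything).
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    ri, ti = 1, 1
    # Step 3: walk the remaining range subtags.
    while ri < len(range_subtags):
        if range_subtags[ri] == "*":
            ri += 1                      # 3A: skip the wildcard in the range
        elif ti >= len(tag_subtags):
            return False                 # 3B: tag exhausted, range is not
        elif range_subtags[ri] == tag_subtags[ti]:
            ri += 1                      # 3C: match, advance both
            ti += 1
        elif len(tag_subtags[ti]) == 1:
            return False                 # 3D: never skip over a singleton
        else:
            ti += 1                      # 3E: skip this tag subtag
    # Step 4: every range subtag was consumed.
    return True


if __name__ == "__main__":
    print(basic_filter("en", "en-gb"))             # True
    print(basic_filter("de-DE", "de-Latn-DE"))     # False
    print(extended_filter("de-DE", "de-Latn-DE"))  # True
```

Under this sketch "de-DE" matches "de-Latn-DE" only by extended filtering, 
which agrees with the example given earlier in the thread.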
