Hi Andy,

Thanks for answering.
Do I understand correctly that a range like “en-*” is not a basic range, and
that Jena supporting it is not fully in line with what the SPARQL spec requires?

Cheers,
Håvard

> On 12 Sep 2020, at 18:31, Andy Seaborne <[email protected]> wrote:
>
> This is from a discussion this last week:
>
> https://github.com/TopQuadrant/shacl/issues/100
>
> On 12/09/2020 11:55, Håvard Ottestad wrote:
>> Hi,
>> I’ve been trying to get basic language ranges working for the SHACL engine
>> in RDF4J and I’ve stumbled upon some differences between how RDF4J and Jena
>> implement basic language ranges.
>> The SPARQL spec points to: https://www.ietf.org/rfc/rfc4647.txt
>> Specifically sections
>> - 2.1. Basic Language Range
>> - 3.3.1. Basic Filtering
>> Looking at the ABNF in 2.1.
>> language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
>> alphanum = ALPHA / DIGIT
>> It looks like “*” is legal, “en” is legal and “en-gb” is legal (and even
>> “a-ab-abc-12345678-a”). But “*-gb” is not legal and neither is “en-*”.
>> It seems like the range “en” would match a tag “en-gb” and a tag “en”.
>> I had a deep dive into the langMatch code in Jena and it seems to support
>> “*” at any position in the range.
>> Is Jena supporting part of the extended range specification,
>
> Jena LangMatches supports basic matching as required by SPARQL and SHACL, and
> does match some cases of "-*" but not properly by full RFC 4647. More by
> accident than design, I suspect.
>
> Calling it "part of extended" is generous. It fails to match "-*" to
> multiple subtag ranges.
>
> Basic is not completely compatible with extended.
>
> Pattern "de-DE" matches "de-Latn-DE" by extended, but not basic.
>
> Extended is sensitive to the fact the second subtag, 'script' is 4ALPHA, and
> 'region' is 2ALPHA or 3DIGIT so "de-DE" matches like "de-*-DE" on language
> and region, skipping script.
> Each part of a language tag has a slightly different
> syntax and extended filtering seems to depend on this to do its jump ahead for
> "-*".
>
> I haven't got my head around the full impact of extended matching. It assumes
> valid language tags, and invalid (by RFC 5646) language tags exist. In the real
> world, bad tags are common.
>
> But SPARQL and Turtle have a catch all parse syntax based on the earlier RFC
> 3066 and HTTP at the time. And in the real world, bad tags are common.
>
> "a-ab-abc-12345678-a" is not a legal language tag by 5646 or 4646 in several
> ways; it is legal by 3066.
>
> To add to the language tag fun, RDF and RFC 4646 disagree on the canonical
> form of language tags.
>
>> or am I missing something? (I have been missing a lot of things lately :P
>> so I wouldn’t be surprised).
>
> This? :-)
> https://github.com/TopQuadrant/shacl/issues/100#issuecomment-690100566
>
> """
> The NodeFunctions.langMatches code does look like it gets basic matching
> right (as SPARQL requires), test cases to the contrary welcome, but the
> handling of extended matching looks wrong for "-*" with multiple occurrences
> of subtags.
>
> Extended matching is complicated and relies on (1) valid language tag input
> (2) the different parts of a language tag having different syntax.
>
> "de-DE" does not match "de-Latn-DE" by basic but does by extended.
> """
>
> Andy
>
>> Cheers,
>> Håvard
>> PS: From 2.2. Extended Language Range
>> extended-language-range = (1*8ALPHA / "*") *("-" (1*8alphanum / "*"))
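
To make the basic/extended difference in this thread concrete, here is a small Python sketch of the three pieces under discussion: the basic range grammar (RFC 4647 section 2.1), basic filtering (section 3.3.1), and the extended filtering steps (section 3.3.2) as I read them. It is an illustration only, not the Jena or RDF4J code; the names is_basic_range, basic_filter and extended_filter are made up for this example, and it does no validation of the language tag itself.

```python
import re

# ABNF from RFC 4647 section 2.1:
#   language-range = (1*8ALPHA *("-" 1*8alphanum)) / "*"
#   alphanum       = ALPHA / DIGIT
BASIC_RANGE = re.compile(r"(?:[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*|\*)")


def is_basic_range(language_range: str) -> bool:
    """True if the string is syntactically a basic language range."""
    return BASIC_RANGE.fullmatch(language_range) is not None


def basic_filter(language_range: str, tag: str) -> bool:
    """Basic filtering (RFC 4647 section 3.3.1), case-insensitive."""
    if language_range == "*":
        return True
    r, t = language_range.lower(), tag.lower()
    # Match on equality, or on a prefix that is followed by "-" in the tag.
    return t == r or t.startswith(r + "-")


def extended_filter(language_range: str, tag: str) -> bool:
    """Extended filtering, following the steps of RFC 4647 section 3.3.2."""
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")
    # The first subtags must match, unless the range starts with "*".
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False
    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # wildcard: advance in the range only
        elif t >= len(tag_parts):
            return False                # range subtags left, tag exhausted
        elif range_parts[r] == tag_parts[t]:
            r += 1                      # subtags agree: advance in both
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # never skip over a singleton like "x"
        else:
            t += 1                      # skip a non-matching tag subtag
    return True


if __name__ == "__main__":
    for rng in ["*", "en", "en-gb", "a-ab-abc-12345678-a", "*-gb", "en-*"]:
        print(rng, is_basic_range(rng))
    print(basic_filter("de-DE", "de-Latn-DE"))     # False
    print(extended_filter("de-DE", "de-Latn-DE"))  # True
```

With this sketch, “*-gb” and “en-*” are rejected as basic ranges, and "de-DE" matches "de-Latn-DE" only under extended filtering, which is the incompatibility Andy describes.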
