Osma Suominen created JENA-1313:
-----------------------------------

             Summary: Language-specific collation in ARQ
                 Key: JENA-1313
                 URL: https://issues.apache.org/jira/browse/JENA-1313
             Project: Apache Jena
          Issue Type: Improvement
          Components: ARQ
    Affects Versions: Jena 3.2.0
            Reporter: Osma Suominen


As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
mailing list in October 2016, I would like to change ARQ collation of literal 
values to be language-aware and respect language-specific collation rules.

This would probably involve changing at least the 
[NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
 method.

It currently sorts by lexical value first, then by language tag. Since the 
collation order needs to be stable across all possible literal values, I think 
the safest way would be to sort by language tag first, then by lexical value 
according to the collation rules for that language.
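
A minimal sketch of the proposed ordering, using only the JDK's {{java.text.Collator}}. The {{LangLiteral}} class and the comparator name are hypothetical stand-ins for illustration and are not part of ARQ; the real change would go into the literal comparison code in NodeUtils.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a language-tagged literal; this is not a Jena class.
final class LangLiteral {
    final String lexical;
    final String lang;

    LangLiteral(String lexical, String lang) {
        this.lexical = lexical;
        this.lang = lang;
    }

    @Override
    public String toString() {
        return "\"" + lexical + "\"@" + lang;
    }
}

final class LangCollationSketch {
    // Compare by language tag first, then by the collation rules of that language.
    static final Comparator<LangLiteral> BY_LANG_THEN_COLLATION = (a, b) -> {
        // Language tags are case-insensitive, so compare them ignoring case.
        int byTag = a.lang.compareToIgnoreCase(b.lang);
        if (byTag != 0) {
            return byTag;
        }
        // Same tag: use the JDK collator for that locale (falls back to
        // root/default rules when the JDK has no data for the tag).
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byCollation = collator.compare(a.lexical, b.lexical);
        if (byCollation != 0) {
            return byCollation;
        }
        // Tie-break with plain String order so that distinct lexical forms
        // never compare as equal and the overall order stays stable.
        return a.lexical.compareTo(b.lexical);
    };

    public static void main(String[] args) {
        List<LangLiteral> literals = new ArrayList<>();
        literals.add(new LangLiteral("zebra", "en"));
        literals.add(new LangLiteral("Banana", "en"));
        literals.add(new LangLiteral("apple", "en"));
        literals.add(new LangLiteral("ääni", "fi"));
        literals.add(new LangLiteral("zeppeliini", "fi"));
        literals.sort(BY_LANG_THEN_COLLATION);
        literals.forEach(System.out::println);
    }
}
```

Because the language tag is compared before any collator is consulted, two literals are only ever collated with the rules of the one language they share, which is what keeps the order consistent across all literal values.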

But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
collation rules than the main language? It would be a bit strange if all 
{{@en-US}} literals sorted after {{@en}} literals...
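
One possible way around this, sketched below purely as an assumption and not as a settled design, would be to group by the primary language subtag only (so {{@en}} and {{@en-US}} interleave) and use the full tag just as a final tie-breaker. The class and method names are hypothetical. Whether a regional subtag should select different collation rules is left open here; the sketch always uses the collator of the primary language.

```java
import java.text.Collator;
import java.util.Locale;

final class SubtagCollationSketch {
    // Extract the primary language subtag, e.g. "en" from "en-US".
    static String primaryLanguage(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    // Compare two tagged literals: primary language first, then collated
    // lexical value, then the full tag and plain String order as tie-breakers.
    static int compare(String lex1, String tag1, String lex2, String tag2) {
        String primary1 = primaryLanguage(tag1);
        String primary2 = primaryLanguage(tag2);
        int byPrimary = primary1.compareTo(primary2);
        if (byPrimary != 0) {
            return byPrimary;
        }
        // Assumption: the primary language's rules apply to all its subtags.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(primary1));
        int byCollation = collator.compare(lex1, lex2);
        if (byCollation != 0) {
            return byCollation;
        }
        int byFullTag = tag1.compareToIgnoreCase(tag2);
        if (byFullTag != 0) {
            return byFullTag;
        }
        return lex1.compareTo(lex2);
    }

    public static void main(String[] args) {
        // Both literals share the primary language "en", so they are ordered
        // by lexical value rather than by their full tags.
        System.out.println(compare("apple", "en-US", "banana", "en"));
    }
}
```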

It would be good to check how Dydra does this and possibly take the same 
approach. See the message linked above for further background.

I've been talking with [~kinow] about this and he may be interested in 
implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
