[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Andy Seaborne (JIRA) Mon, 03 Apr 2017 04:25:57 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953314#comment-15953314
 ]


Andy Seaborne commented on JENA-1313:
-------------------------------------

I know :) but the details of a collation function are complex as well.

It needs to map strings to numbers so that the numbers define the sort order.  
One value per string, so very large numbers to reflect the fact there are very 
many strings.

Its return value is not the result of a comparator. Each item in the sort must 
have a unique value and those values define the sort order. And 
{{collate:collate(?label, 'fi')}} only has one argument. 
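
To make the "one value per string" idea concrete, here is a minimal sketch 
using the JDK's java.text.Collator and CollationKey. It is not ARQ code: the 
class name CollationKeySketch and the methods collate and sortByKey are 
hypothetical, chosen only to mirror the shape of the function discussed above. 
The point is that each label is mapped to a key on its own, and the sort then 
orders the keys without ever calling a two-argument string comparator.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in Jena/ARQ.
class CollationKeySketch {

    // Map a single string to its sort key. Once the collator (locale) is
    // fixed, this is a function of one string, not a comparator of two.
    static CollationKey collate(Collator collator, String label) {
        return collator.getCollationKey(label);
    }

    // Sort labels by their collation keys for the given language tag.
    static List<String> sortByKey(List<String> labels, String langTag) {
        // One Collator instance for the whole sort: the JDK documents that
        // only keys produced by the same Collator should be compared.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langTag));
        List<String> out = new ArrayList<>(labels);
        out.sort(Comparator.comparing(label -> collate(collator, label)));
        return out;
    }

    public static void main(String[] args) {
        // Prints the labels in the order the "fi" collator assigns them.
        System.out.println(sortByKey(List.of("zeta", "äiti", "alfa"), "fi"));
    }
}
```

A CollationKey is a byte sequence rather than a literal number, but it plays 
the same role: comparing two keys gives the same order as comparing the 
original strings with the collator.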

A two-variable {{collate:compare}} to pass through a comparator would be a bigger 
change - that's not how the sorting works at the moment.  Doing VALUE_LANG 
looks easier unless we want a fully general (not just locale collation) 
collation.

And I'm comfortable with a syntax extension (in lang:ARQ) if needed/helpful.


> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
