[jira] [Commented] (JENA-1313) Language-specific collation in ARQ

Osma Suominen (JIRA) Fri, 31 Mar 2017 04:34:12 -0700

    [ 
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950717#comment-15950717
 ]


Osma Suominen commented on JENA-1313:
-------------------------------------

> return values in es, then pt, then fi
Shouldn't this be es, fi, pt?

Anyway, people seem to be agreeing on making an extension function rather than 
changing the default sorting order of literals.

I re-read the [description of the Dydra implementation of 
collation|http://blog.dydra.com/2015/05/06/collation], but they don't say what 
the ordering is between literals with differing language tags - it remains 
explicitly undefined, as in SPARQL. I don't know what ORDER BY actually does in 
Dydra when the language tags differ.

Dydra also has a [test 
suite|https://github.com/dydra/http-api-tests/tree/master/extensions/sparql-protocol/collation]
 for collation that is published under the Unlicense, i.e. placed in the public 
domain. There I found [an 
example|https://github.com/dydra/http-api-tests/blob/master/extensions/sparql-protocol/collation/ordered-locations-da.txt]
 of a proper Danish language collation sequence that could perhaps also be used 
as a test case. There are other test cases in that directory that may also be 
relevant.

For the extension function, I suggest defining it just as a single function 
e.g. {{collate:collate}} that takes up to two parameters: the literal value and 
the language/locale (which may be omitted, and then the language is extracted 
from the language tag). So you can say e.g. {{ORDER BY collate:collate(?label, 
'fi')}} to get the labels sorted according to Finnish collation rules, 
regardless of their language tags. I think this is better than the 
{{collate:fi}} example given by [~kinow] above, because it requires just a 
single extension function and makes the locale/language a parameter that may 
come from a SPARQL variable instead of being hardwired into the query. A 
special parameter value such as {{unicode}} or the empty string could be used 
to force Unicode collation rules. (Even a one-parameter version of 
{{collate:collate}} could be used in this way by passing 
{{STRLANG(STR(?label), 'fi')}} as the parameter, since {{STRLANG}} only accepts 
a plain string, but that seems unnecessarily complicated to me.)
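
To make the two-parameter idea concrete, here is a rough sketch of the comparison logic using the JDK's {{java.text.Collator}}. It is not wired into ARQ; the class and method names ({{CollateSketch}}, {{collatorFor}}, {{sortLabels}}) are made up for illustration, and treating {{unicode}} or the empty string as the root locale is only my assumption about what "Unicode collation" should mean here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollateSketch {

    // Hypothetical helper: pick a Collator from the optional locale argument.
    // Assumption: "unicode" or the empty string selects the locale-neutral
    // root collator.
    static Collator collatorFor(String localeArg) {
        if (localeArg == null || localeArg.isEmpty()
                || localeArg.equalsIgnoreCase("unicode")) {
            return Collator.getInstance(Locale.ROOT);
        }
        return Collator.getInstance(Locale.forLanguageTag(localeArg));
    }

    // Sort labels by the rules of the requested locale, regardless of the
    // language tag each label originally carried.
    static List<String> sortLabels(List<String> labels, String localeArg) {
        List<String> sorted = new ArrayList<>(labels);
        sorted.sort(collatorFor(localeArg)); // Collator is a Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        // \u00C4 is A with diaeresis, a separate letter in Finnish.
        List<String> labels = Arrays.asList("\u00C4iti", "Zorro", "Aalto");
        System.out.println(sortLabels(labels, "fi"));
        System.out.println(sortLabels(labels, "unicode"));
    }
}
```

Because the locale arrives as an ordinary string argument, it could just as well come from a SPARQL variable, which is the main reason I prefer this over one function per language.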

> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users 
> mailing list in October 2016, I would like to change ARQ collation of literal 
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the 
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
>  method.
> It currently sorts by lexical value first, then by language tag. Since the 
> collation order needs to be stable across all possible literal values, I 
> think the safest way would be to sort by language tag first, then by lexical 
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different 
> collation rules than the main language? It would be a bit strange if all 
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same 
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in 
> implementing it.
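
For reference, the "language tag first, then language-specific collation" ordering proposed in the description could be sketched roughly as below. This is only an illustration and not a patch to {{NodeUtils}}: {{Lit}} is a hypothetical stand-in for a language-tagged literal, and the final tie-break on the raw lexical form is my own addition to keep the order stable.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class LangFirstOrder {

    // Hypothetical stand-in for a language-tagged literal (not Jena's Node).
    static final class Lit {
        final String lex;
        final String lang;
        Lit(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // Compare by language tag first (tags are case-insensitive), then by the
    // collation rules of that language, then by raw lexical form as tie-break.
    static final Comparator<Lit> LANG_THEN_COLLATION = (a, b) -> {
        int byLang = a.lang.compareToIgnoreCase(b.lang);
        if (byLang != 0) {
            return byLang;
        }
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byCollation = collator.compare(a.lex, b.lex);
        return byCollation != 0 ? byCollation : a.lex.compareTo(b.lex);
    };

    // Return the lexical forms in sorted order.
    static List<String> sortedLex(List<Lit> literals) {
        List<Lit> copy = new ArrayList<>(literals);
        copy.sort(LANG_THEN_COLLATION);
        List<String> out = new ArrayList<>();
        for (Lit l : copy) {
            out.add(l.lex);
        }
        return out;
    }
}
```

Note that with this ordering every {{@en-US}} literal still sorts after every {{@en}} literal, which is exactly the oddity raised in the description.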



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
