[ https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953527#comment-15953527 ]
Osma Suominen edited comment on JENA-1313 at 4/3/17 2:04 PM:
-------------------------------------------------------------

So... what about code like this, which builds a sort key that orders first by locale, then by the byte value (encoded as hex) of the collation key:

{noformat}
import java.text.Collator;
import java.util.Locale;

public String collate(String lang, String value) {
    // Collator is abstract, so obtain an instance through its factory method
    Collator collator = Collator.getInstance(Locale.forLanguageTag(lang));
    byte[] key = collator.getCollationKey(value).toByteArray();
    return lang + "|" + org.apache.commons.codec.binary.Hex.encodeHexString(key);
}
{noformat}

> Language-specific collation in ARQ
> ----------------------------------
>
>                 Key: JENA-1313
>                 URL: https://issues.apache.org/jira/browse/JENA-1313
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ
>    Affects Versions: Jena 3.2.0
>            Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag.
> Since the collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.
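
For illustration only, the ordering proposed in the description (language tag first, then lexical value under that language's collation rules) could be sketched in plain Java roughly as below. This is not ARQ code: the class {{LangCollationSketch}}, the holder {{LangLiteral}} and the method {{compare}} are hypothetical names invented for the example, and it assumes that the JDK's {{java.text.Collator}} is an acceptable source of collation rules. It compares strings directly rather than building a hex sort key.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not part of Jena: orders language-tagged literals
// by language tag first, then by the collation rules for that language.
final class LangCollationSketch {

    /** Hypothetical holder for a literal's lexical form and its language tag. */
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return "\"" + lexical + "\"@" + lang;
        }
    }

    static int compare(LangLiteral a, LangLiteral b) {
        // 1. Language tag first, compared case-insensitively.
        String langA = a.lang.toLowerCase(Locale.ROOT);
        String langB = b.lang.toLowerCase(Locale.ROOT);
        int byLang = langA.compareTo(langB);
        if (byLang != 0) {
            return byLang;
        }
        // 2. Same tag: compare lexical forms with that language's collator.
        //    (A real implementation would probably cache collators per tag.)
        Collator collator = Collator.getInstance(Locale.forLanguageTag(langA));
        int byRules = collator.compare(a.lexical, b.lexical);
        if (byRules != 0) {
            return byRules;
        }
        // 3. Tie-break on code units so strings the collator treats as equal
        //    still get a deterministic order.
        return a.lexical.compareTo(b.lexical);
    }

    public static void main(String[] args) {
        List<LangLiteral> literals = new ArrayList<>();
        literals.add(new LangLiteral("zebra", "en"));
        literals.add(new LangLiteral("Banana", "en"));
        literals.add(new LangLiteral("apple", "fi"));
        literals.add(new LangLiteral("apple", "en"));
        literals.sort(LangCollationSketch::compare);
        System.out.println(literals);
    }
}
```

Note that this sketch treats every distinct tag as its own group, so all {{@en-US}} literals would still sort after all {{@en}} literals; it does not answer the subtag question raised in the description.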