[ https://issues.apache.org/jira/browse/JENA-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262293#comment-16262293 ]
ASF GitHub Bot commented on JENA-1384:
--------------------------------------

Github user afs commented on a diff in the pull request:

    https://github.com/apache/jena/pull/308#discussion_r152531097

    --- Diff: jena-arq/src/main/java/org/apache/jena/riot/process/normalize/CanonicalizeLiteral.java ---
    @@ -73,6 +76,36 @@ public Node apply(Node node) {
             return n2 ;
         }

    +    /** Convert the lexical form to a canonical form if one of the known datatypes,
    +     * otherwise return the node argument. (same object :: {@code ==})
    +     */
    +    public static Node canonicalValue(Node node) {
    +        if ( ! node.isLiteral() )
    +            return node ;
    +        // Fast-track
    +        if ( NodeUtils.isLangString(node) )
    +            return node;
    +        if ( NodeUtils.isSimpleString(node) )
    +            return node;
    +
    +        if ( ! node.getLiteralDatatype().isValid(node.getLiteralLexicalForm()) )
    +            // Invalid lexical form for the datatype - do nothing.
    +            return node;
    +
    +        RDFDatatype dt = node.getLiteralDatatype() ;
    +        // Datatype, not rdf:langString (RDF 1.1).
    +        DatatypeHandler handler = dispatch.get(dt) ;
    +        if ( handler == null )
    +            return node ;
    +        Node n2 = handler.handle(node, node.getLiteralLexicalForm(), dt) ;
    +        if ( n2 == null )
    +            return node ;
    +        return n2 ;
    +    }
    +
    +    /** Convert the language tag of a lexical form to a canonical form if one of the known datatypes,
    +     * otherwise return the node argument. (same object; compare by {@code ==})
    +     */
         private static Node canonicalLangtag(String lexicalForm, String langTag) {
             String langTag2 = LangTag.canonical(langTag);
             if ( langTag2.equals(langTag) )
    --- End diff --

    Here, node isn't passed in so it can't be returned.

    Style thing. Node is already known to have a language tag, so I don't like passing in a Node which can be wrong, e.g. through a mis-call from somewhere else. Passing lex+lang forces it to be the information for a language-tagged literal.
    It's tested at line 74
    ```
            if ( n2 == null )
                return node ;
    ```
    and elsewhere conversion also sometimes returns `null` for "no conversion", which means no new node is needed. That is more efficient (measurably).


> Make canonical literals lowercase language tags.
> ------------------------------------------------
>
>                 Key: JENA-1384
>                 URL: https://issues.apache.org/jira/browse/JENA-1384
>             Project: Apache Jena
>          Issue Type: Improvement
>    Affects Versions: Jena 3.4.0
>            Reporter: Elie Roux
>            Assignee: Andy Seaborne
>            Priority: Minor
>             Fix For: Jena 3.6.0
>
>
> Please make an option so that canonicalLiterals follows the RDF 1.1
> definition of a canonical literal instead of the BCP-47 one. Right now for my
> dataset I have:
> - lower-cased value for JSON-LD output (as mandated by the JSON-LD spec
> following a RDF 1.1 option)
> - BCP-47 canonical value for TTL output if I make Jena canonicalize literals
> (which I want to, I want them to be uniform)
> - lower-cased value for TTL output if I choose not to canonicalize them
> So please allow for users just to use lower-case uniformly, so that there can
> be a homogeneous canonicalization among different outputs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
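
For readers of the archive, the convention discussed in the comment (a helper takes lex+lang, returns `null` for "no conversion", and the caller then hands back the same object) can be sketched roughly as below. This is only an illustration under stated assumptions: `LangLiteral`, `lowerCaseLangTag` and `canonical` are hypothetical names, `LangLiteral` is a stand-in for a language-tagged literal rather than Jena's `Node`, and plain lower-casing is used in place of `LangTag.canonical`, following the request in the issue.

```java
import java.util.Locale;

public class LangTagSketch {
    /** Hypothetical stand-in for a language-tagged literal; this is not Jena's Node. */
    public static final class LangLiteral {
        public final String lexicalForm;
        public final String langTag;

        public LangLiteral(String lexicalForm, String langTag) {
            this.lexicalForm = lexicalForm;
            this.langTag = langTag;
        }
    }

    /**
     * Takes lex+lang rather than a whole literal, so it can only be called
     * with the information of a language-tagged literal.
     * Returns null to signal "no conversion": no new object is allocated.
     */
    public static LangLiteral lowerCaseLangTag(String lexicalForm, String langTag) {
        String langTag2 = langTag.toLowerCase(Locale.ROOT);
        if (langTag2.equals(langTag))
            return null;
        return new LangLiteral(lexicalForm, langTag2);
    }

    /** Caller side: a null result means the original object is returned (same object, ==). */
    public static LangLiteral canonical(LangLiteral lit) {
        LangLiteral lit2 = lowerCaseLangTag(lit.lexicalForm, lit.langTag);
        if (lit2 == null)
            return lit;
        return lit2;
    }

    public static void main(String[] args) {
        LangLiteral a = new LangLiteral("chat", "fr");
        LangLiteral b = new LangLiteral("colour", "en-GB");
        System.out.println(canonical(a) == a);      // true: already lower case, same object
        System.out.println(canonical(b).langTag);   // en-gb
    }
}
```

Because the helper never sees the original literal, it cannot return it; the `null` check in the caller is what preserves object identity when nothing changes.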