[ 
https://issues.apache.org/jira/browse/JENA-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659742#comment-13659742
 ] 

Andy Seaborne commented on JENA-457:
------------------------------------

Jena 2.10.1 has a completely rewritten N-triples writer - it follows RDF 1.1 
(the note referenced).  It writes UTF-8.  Fuseki uses application/n-triples.

http://jena.apache.org/documentation/io/rdf-output.html

The N-Triples writer should write what it's given as faithfully and accurately as 
possible.  If there is going to be conversion it ought to be in the data, not 
at the point of writing, otherwise what you read back in is not the data 
written.

The core issue is encoding vs escaping.  Encoding records the characters but 
does not put the original character into the IRI or string; escaping is a way 
to write data which really does include the original character.

x5A is character Z.

"%5A" is a string of 3 characters %-5-A; this is not a 'Z' in the string.  When 
seeing the characters %-5-A the machine can't tell if the original data was a 
'Z', and the application encoded it as %-5-A or whether the original data had 
%-5-A in it.

"\u005A" is a string of 1 character; there really is a Z in the string.  This 
is escaping.
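
The difference can be checked directly. This is a minimal Java sketch, not 
Jena code: the class and method names are made up for illustration, and the 
URLDecoder.decode(String, Charset) overload assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class EncodingVsEscaping {
    // Percent-encoded form: the string holds the three characters '%', '5', 'A'.
    static String encoded() {
        return "%5A";
    }

    // Escaped form: the compiler resolves the escape, so the string holds one 'Z'.
    static String escaped() {
        return "\u005A";
    }

    // Reversing the encoding is a separate, explicit step taken by the application.
    static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encoded().length());            // 3
        System.out.println(escaped().length());            // 1
        System.out.println(escaped().equals("Z"));         // true
        System.out.println(encoded().equals("Z"));         // false
        System.out.println(decode(encoded()).equals("Z")); // true
    }
}
```

Nothing in the encoded string says whether decoding it is the right thing to 
do; that decision needs knowledge the string itself does not carry.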

RFC 3986 section 2.4 discusses when to reverse the encoding.  RDF syntax is not 
producing URIs nor is it dereferencing them, it's merely carrying them from one 
place to another.  DBpedia is producing the URIs.  The data URIs really do have 
%-x-x in them - it was put there when the URI was created.

In the real world, N-triples files with characters outside ASCII are already 
reasonably common, without people realising that N-triples is strictly ASCII.

By the way - the encoding rules for the host name part would be different 
(punycode).
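
As an illustration of the host-name case, the JDK's java.net.IDN class applies 
the punycode-based ToASCII conversion. The class name HostNameEncoding and the 
host "bücher.example" below are made up for the example.

```java
import java.net.IDN;

final class HostNameEncoding {
    // Convert an internationalised host name to its ASCII ("xn--") form.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // The escape in the literal is the character u-umlaut (U+00FC).
        System.out.println(toAsciiHost("b\u00FCcher.example"));
    }
}
```

The result has no %-x-x in it at all, which is why a single rule applied to 
the whole IRI string would not be correct for the host part.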

It is possible to add writers to the RIOT architecture (there is a link to a 
complete code example of this in the documentation).  These could do 
specialised transformations but the main N-triples writer really should be 
simply writing out data unchanged.
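
A conversion done "in the data" might look like the sketch below: the 
application rewrites the IRI string itself before creating the resource, so 
whatever is written is also what is read back. This is plain Java, not part of 
Jena or RIOT; IriPreprocessor and percentEncodeNonAscii are hypothetical 
names, and the sketch assumes the non-ASCII characters are to be encoded as 
UTF-8 octets (the RFC 3987 IRI-to-URI mapping) and that they do not occur in 
the host part.

```java
import java.nio.charset.StandardCharsets;

final class IriPreprocessor {
    // Percent-encode every non-ASCII character as its UTF-8 octets.
    // ASCII characters, including any existing '%', are left untouched.
    static String percentEncodeNonAscii(String iri) {
        StringBuilder out = new StringBuilder();
        for (byte b : iri.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80) {
                out.append((char) octet);
            } else {
                out.append('%').append(String.format("%02X", octet));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape in the literal is the character u-umlaut (U+00FC).
        System.out.println(percentEncodeNonAscii("http://de.dbpedia.org/resource/T\u00FCr"));
    }
}
```

With UTF-8 octets the example IRI ends in T%C3%BCr rather than the T%fcr form 
quoted in the issue, which is one more reason the choice belongs to the 
application producing the data and not to the writer.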
                
> ntriples: Object-URIs should be %-encoded
> -----------------------------------------
>
>                 Key: JENA-457
>                 URL: https://issues.apache.org/jira/browse/JENA-457
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: ARQ, Jena, RDF API
>    Affects Versions: ARQ 2.9.3
>         Environment: everywhere
>            Reporter: Pascal Christoph
>            Priority: Minor
>              Labels: patch
>
> Ntriple serialization is in pure ASCII for now[1] , so IRIs are not possible 
> as UTF8 is not allowed (see rfc3987). Serializing a Model to ntriples escapes 
> non-ASCII characters with '\u' escaping. These URIs don't resolve in most 
> cases per se, e.g. in dbpedia. These are the three different notations 
> possible:
> 1. http://de.dbpedia.org/resource/T\u00FCr
> 2. http://de.dbpedia.org/resource/T%fcr
> 3. http://de.dbpedia.org/resource/Tür
> While the 1. doesn't resolve and the 3. is not ASCII, the 2. (the 
> percent-octet encoding) fulfills both requirements. So I would like to see 
> the use of the 2. to encode object URIs in ASCII ntriple serialization. See 
> also 
> https://answers.semanticweb.com/questions/18508/best-way-to-encode-uri-refsiris-for-n-triples
>  .
> One could use jena to serialize as turtle and transform this turtle file to 
> ntriples with rapper. But rapper encodes all literals having 
> unicode-escape-sequences to utf8 ignoring the transformation of URIs (wisely, 
> since they are identifiers). So this does not help.
> Some concrete code which is responsible for this serialization:
>  RDFWriter fasterWriter = model.getWriter("N-TRIPLE");
> Should be safe to apply a patch like this in NTripleWriter.java:
> private static void writeURIString(String s, PrintWriter writer) {
>     writer.print(org.apache.commons.httpclient.util.URIUtil.encodeQuery(s) ) ;
> }
> (not tested)
> What do you think?
> -o
> [1]see a month old note from W3C where it is proposed to use utf-8 instead of 
> ASCII : http://www.w3.org/TR/2013/NOTE-n-triples-20130409/#n-triple-changes
