Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Daniel Bünzli Sat, 09 May 2015 05:20:08 -0700

On Saturday, 9 May 2015 at 06:24, Philippe Verdy wrote:
> You are not stuck! You can still regenerate a valid JSON output encoded in 
> UTF-8: it will once again use escape sequences (which are also needed if your 
> text contains quotation marks used to delimit the JSON strings in its syntax.


That's a possible resolution, but a very bad one: my program can then no longer 
distinguish between the JSON strings "\uDEAD" and "\\uDEAD". This leads exactly 
to the interoperability problems mentioned in section 8.2 of RFC 7159.
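
To make the collision concrete, here is a minimal sketch in Python. The helper 
keep_lone_surrogates_escaped is hypothetical: it only stands in for the proposed 
"regenerate the escape sequence" resolution and is not part of any library. The 
sketch also assumes CPython's json module, which accepts a lone surrogate 
escape and returns it as a one-character string.

```python
import json

def keep_lone_surrogates_escaped(s: str) -> str:
    """Hypothetical resolution: replace every surrogate code point left in a
    decoded string by the six characters of its \\uXXXX escape."""
    return "".join(
        "\\u%04X" % ord(c) if 0xD800 <= ord(c) <= 0xDFFF else c
        for c in s
    )

# Two different JSON texts (raw strings, so the backslashes are literal).
a = json.loads(r'"\uDEAD"')    # one code point: the lone surrogate U+DEAD
b = json.loads(r'"\\uDEAD"')   # six characters: backslash, u, D, E, A, D

# Straight after decoding, the two values are still distinct ...
distinct_before = a != b

# ... but once lone surrogates are turned back into escape text,
# the program can no longer tell them apart.
collide_after = keep_lone_surrogates_escaped(a) == keep_lone_surrogates_escaped(b)
```

After the helper has run, both values are the same six-character string, so the 
information about which JSON text was received is lost.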

You say that passing escapes to the programmer is needed if the text contains 
quotation marks; this is nonsense. A good and sane JSON codec will never let 
the programmer deal with escapes directly. It is its responsibility to let 
the programmer deal only with the JSON *data*, not with the details of how that 
data is encoded. As such it will automatically unescape on decoding to 
give you the data represented by the encoding, and automatically escape (if 
needed) the data you give it on encoding.
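
As an illustration of that division of labour, here is a small round trip 
through Python's standard json module (chosen only as a convenient example of a 
codec; the sample string is made up). The program hands over plain data and 
gets plain data back; escapes exist only in the encoded text.

```python
import json

# Plain data containing a quotation mark, a backslash and a newline.
data = 'She said "hi", then typed C:\\temp\nand left.'

# Encoding: the codec adds whatever escapes the JSON syntax requires.
encoded = json.dumps(data)

# Decoding: the codec removes them again and returns the original data.
decoded = json.loads(encoded)
```

At no point does the calling code have to write or interpret an escape 
sequence itself.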

> Unlike UTF-8, JSON has never been designed to restrict its strings to have 
> its represented values to be only plain-text, it is a only a serialization of 
> "strings" to valid plain-text using a custom syntax.

You say a lot of things about what JSON is supposed to be/has been designed 
for. It would be nice to substantiate your claims by pointing at relevant 
standards. If JSON as in RFC 4627 really wanted to transmit sequences of bytes 
I think it would have been *much more* explicit.

The introductions of both RFC 4627 (remember, written by the *inventor* of JSON) 
and RFC 7159 (which obsoletes 4627) say "A string is a sequence of zero or more 
Unicode characters". As we already mentioned, and as we both agree, this is very 
imprecise. There are two interpretations:

* This is a sequence of Unicode scalar values, i.e. text (mine)
* This is a sequence of Unicode code points, i.e. a JavaScript string (yours)
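
The difference between the two interpretations can be checked mechanically. 
Below is a minimal sketch in Python; the function names are made up for the 
illustration, and it assumes CPython's json module, which combines a valid 
escaped surrogate pair into one code point and passes a lone surrogate through 
unchanged.

```python
import json

def is_surrogate(cp: int) -> bool:
    # Surrogate code points U+D800..U+DFFF are code points but not scalar values.
    return 0xD800 <= cp <= 0xDFFF

def is_scalar_value_sequence(s: str) -> bool:
    # True when the decoded string is text under the first interpretation.
    return not any(is_surrogate(ord(c)) for c in s)

paired = json.loads(r'"\uD83D\uDE00"')  # surrogate pair, decodes to U+1F600
lone = json.loads(r'"\uDEAD"')          # lone surrogate

paired_ok = is_scalar_value_sequence(paired)
lone_ok = is_scalar_value_sequence(lone)
```

A string that fails this check is acceptable under the second interpretation 
only; in Python it also cannot be encoded to UTF-8 with the default strict 
error handler.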

Now given this imprecision the fact is that you cannot ignore that some stupid 
people that are very wrong like me will take the first interpretation. Since 
this interpretation is less liberal you will have to cope with it and 
acknowledge the fact that lone escaped surrogates may not be interpreted 
correctly in the wild.  

This leads to the clarification and the interoperability warnings of section 
8.2 in RFC 7159. If you read these two paragraphs carefully you may infer that 
their "Unicode character" is more likely to mean "Unicode scalar value". These 
paragraphs were not present in RFC 4627, so the latter was really ambiguous. I 
would however say RFC 7159 is not. If you don't agree with that, we are still 
left with the two possible interpretations above, and if you care about 
interoperability you should know which interpretation to take.

Best,

Daniel
