Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

Daniel Bünzli Thu, 07 May 2015 12:39:04 -0700

On Thursday, 7 May 2015 at 14:46, Costello, Roger L. wrote:
> The JSON specification says that a character may be escaped using this 
> notation: \uXXXX (XXXX are four hex digits)
>  
> However, not every four hex digits corresponds to a Unicode character.


The wording of RFC 7159 uses imprecise terminology here. By "character" it 
means "any code point in U+0000 to U+FFFF", surrogates included (since you 
need escaped surrogate pairs to escape scalar values outside the BMP). You 
can see their definition of a "character that may be escaped" in this 
sentence of section 7 [1]:

  "Any character may be escaped. If the character is in the Basic Multilingual 
Plane (U+0000 through U+FFFF) then it may be represented as a six-character 
sequence: a reverse solidus, followed by the lowercase letter u, followed by 
four hexadecimal digits that encode the character's code point."
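
To make the surrogate-pair remark concrete, here is a minimal Python sketch of 
how a scalar value is turned into \uXXXX escapes. The function name 
json_escape is my own, purely for illustration; it is not taken from the RFC 
or from any library.

```python
def json_escape(cp: int) -> str:
    # Hypothetical helper: return the JSON escape(s) for one Unicode scalar value.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    if cp <= 0xFFFF:
        # BMP: a single six-character escape.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the 20-bit offset into a high and a low surrogate.
    v = cp - 0x10000
    hi = 0xD800 + (v >> 10)
    lo = 0xDC00 + (v & 0x3FF)
    return f"\\u{hi:04X}\\u{lo:04X}"


print(json_escape(0x0041))   # \u0041
print(json_escape(0x1D11E))  # \uD834\uDD1E
```

The second call is the G clef example that section 7 of the RFC itself gives.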

However, if you are concerned about malformed surrogate sequences or lone 
surrogates (about which the standard sadly has nothing to say [2]), I have 
written a best-effort JSON parser [3] that reports them and lets you 
continue by replacing the offending escape sequences with U+FFFD. The 
distribution includes a command-line test tool named jsontrip that can, among 
other things, report these errors. For example:

> echo '["\uDEAD"]' | jsontrip
-:1.2-1.8: illegal escape, U+DEAD lone low surrogate
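
If you do not have jsontrip at hand, the same kind of check can be roughly 
sketched in Python. This is not how jsonm works internally; it is a simplified 
scan that assumes the text is otherwise well-formed JSON (so backslashes only 
occur inside strings) and only looks at \uXXXX escapes, not at raw surrogates 
in the input encoding. The names ESCAPE and lone_surrogates are hypothetical.

```python
import re

# Matches any backslash escape; group 1 is set only for \u followed by 4 hex digits.
ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|.)", re.DOTALL)


def lone_surrogates(json_text):
    # Return (offset, code_point, kind) for each unpaired surrogate escape.
    found = []
    pending = None  # (start, end, code_point) of a high surrogate awaiting its low half
    for m in ESCAPE.finditer(json_text):
        if m.group(1) is None:
            continue  # some other escape, e.g. a newline or an escaped backslash
        cp = int(m.group(1), 16)
        if pending is not None:
            start, end, high = pending
            pending = None
            if 0xDC00 <= cp <= 0xDFFF and end == m.start():
                continue  # well-formed pair: high immediately followed by low
            found.append((start, high, "lone high surrogate"))
        if 0xD800 <= cp <= 0xDBFF:
            pending = (m.start(), m.end(), cp)
        elif 0xDC00 <= cp <= 0xDFFF:
            found.append((m.start(), cp, "lone low surrogate"))
    if pending is not None:
        found.append((pending[0], pending[2], "lone high surrogate"))
    return found


print(lone_surrogates(r'["\uDEAD"]'))        # [(2, 57005, 'lone low surrogate')]
print(lone_surrogates(r'["\uD834\uDD1E"]'))  # []
```

Offsets are 0-based character positions of the backslash. Replacing the 
reported escapes with \uFFFD before handing the text to a regular parser would 
mimic the recovery strategy described above.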



Best,

Daniel

[1] https://tools.ietf.org/html/rfc7159#section-7
[2] https://tools.ietf.org/html/rfc7159#section-8.2
[3] http://erratique.ch/software/jsonm
