It is not necessary; in fact that same section also states that any "code point" from U+0000 to U+FFFF is representable with the escape sequence, without restriction! This just confirms that JSON does not really encode Unicode strings but streams of arbitrary 16-bit code units (possibly reencoded into an internal encoding scheme used by the JSON parser, that internal encoding being bound to the programming environment and the binary API of the variables or properties it exposes).
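You can see this with any conforming parser. In JavaScript, for instance (whose strings are themselves sequences of 16-bit code units), JSON.parse happily returns an unpaired surrogate:

    // "\uD800" escapes an unpaired high surrogate: it designates no
    // Unicode character, yet a conforming parser must accept it.
    var s = JSON.parse('"\\uD800"');
    s.length;         // 1  (one 16-bit code unit)
    s.charCodeAt(0);  // 55296 = 0xD800
    // s is a perfectly valid JavaScript string, but an ill-formed
    // UTF-16 sequence, i.e. not a Unicode string in the strict sense.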
The fact that it is also bound to the plain-text encoding is just because the plain-text characters used in its syntax that are not encoded with those escape sequences, and that are not assigned a special role for delimiting string literals, will be decoded from the input syntax and then reencoded into their equivalent in the internal encoding (in the parser, or exposed by the parser in the variables or properties it returns):

- if the transport format is UTF-8, the syntactic file will be read using a UTF-8 scanner returning code points, or small strings containing the full sequence representing a single code point (over MIME-compatible transports this uses the charset settings of that transport). These code points are then converted to one or two 16-bit code units (see the small sketch after the quoted message below). The JSON syntax is then recognized by the parser, which identifies string delimiters and escape sequences, the latter being parsed and also converted to 16-bit code units. Finally, this internal stream of 16-bit code units is exposed to the output using the encoding expected by the JSON client or programming environment.

In summary, the reference to Unicode in the RFCs for JSON is not really necessary; all they need to say is that a JSON parser must accept a file containing any plain text valid in its transport encoding scheme, and that it will decode from it a stream of 16-bit code units and generate valid output in the encoding expected by the client (when the client is JavaScript or Java, the internal encoding will be the same as the exposed encoding; this won't be true in Lua, or PHP, or many C/C++ programs that often prefer 8-bit strings; some languages are hybrids and support two kinds of strings: 8-bit and 16-bit, rarely 32-bit).

2015-05-09 8:26 GMT+02:00 Norbert Lindenberg <[email protected]>:

> RFC 7158 section 7 [1] provides not only the \uXXXX notation for Unicode
> code points in the Basic Multilingual Plane, but also a 12-character
> sequence encoding the UTF-16 surrogate pair (i.e. \uYYYY\uZZZZ with 0xD800
> ≤ YYYY < 0xDC00 ≤ ZZZZ ≤ 0xDFFF) for supplementary Unicode code points. A
> tool checking for escape sequences that don’t correspond to any Unicode
> character must be aware of this, because neither \uYYYY nor \uZZZZ by
> itself would correspond to any Unicode character, but their combination may
> well do so.
>
> Norbert
>
> [1] https://tools.ietf.org/html/rfc7158#section-7
>
> > On May 7, 2015, at 5:46 , Costello, Roger L. <[email protected]> wrote:
> >
> > Hi Folks,
> >
> > The JSON specification says that a character may be escaped using this
> > notation: \uXXXX (XXXX are four hex digits)
> >
> > However, not every four hex digits corresponds to a Unicode character.
> >
> > Are there tools to scan a JSON document to detect the presence of
> > \uXXXX, where XXXX does not correspond to any Unicode character?
> >
> > /Roger
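To make the "one or two 16-bit code units" step above concrete, here is a minimal sketch in JavaScript (codePointToCodeUnits is just a name I chose for this message):

    // Convert one Unicode code point into its UTF-16 code unit(s):
    // BMP code points map to a single unit, supplementary code points
    // to a high/low surrogate pair.
    function codePointToCodeUnits(cp) {
      if (cp <= 0xFFFF) return [cp];      // BMP: one code unit
      cp -= 0x10000;                      // 20 bits remain
      return [0xD800 + (cp >> 10),        // high surrogate
              0xDC00 + (cp & 0x3FF)];     // low surrogate
    }
    codePointToCodeUnits(0x41);    // [0x0041]
    codePointToCodeUnits(0x1F600); // [0xD83D, 0xDE00]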

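And to Roger's original question: such a tool cannot test each \uXXXX escape in isolation, it has to apply the pairing rule Norbert describes. A rough sketch of that logic (plain JavaScript; checkEscapes is a name I made up, and the sketch deliberately ignores unassigned code points and escaped backslashes such as "\\u0041", which a real tool would have to handle):

    // Scan raw JSON text for \uXXXX escapes that designate no Unicode
    // character: a high surrogate escape is valid only when immediately
    // followed by a low surrogate escape (together they designate one
    // supplementary code point); a stray low surrogate is never valid.
    function checkEscapes(jsonText) {
      var problems = [];
      var re = /\\u([0-9A-Fa-f]{4})/g, m;
      while ((m = re.exec(jsonText)) !== null) {
        var unit = parseInt(m[1], 16);
        if (unit >= 0xD800 && unit <= 0xDBFF) {
          if (/^\\u[Dd][C-Fc-f][0-9A-Fa-f]{2}/.test(jsonText.slice(re.lastIndex))) {
            re.lastIndex += 6;  // consume the low surrogate of the pair
          } else {
            problems.push(m.index + ": lone high surrogate " + m[0]);
          }
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
          problems.push(m.index + ": lone low surrogate " + m[0]);
        }
      }
      return problems;
    }
    checkEscapes('"\\uD83D\\uDE00"'); // []
    checkEscapes('"x \\uD800 y"');    // ["3: lone high surrogate \uD800"]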
