Re: [HACKERS] JSON and unicode surrogate pairs

Andrew Dunstan Sun, 09 Jun 2013 23:17:40 -0700


On 06/09/2013 07:47 PM, Tom Lane wrote:

> Andrew Dunstan <[email protected]> writes:
>
>> I did that, but it's evident from the buildfarm that there's more work
>> to do. The problem is that we do the de-escaping as we lex the json to
>> construct the look ahead token, and at that stage we don't know whether
>> or not it's really going to be needed. That means we can cause errors to
>> be raised in far too many places. It's failing on this line:
>>      converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);
>> even though the operator in use ("->") doesn't even use the de-escaped
>> value.
>> The real solution is going to be to delay the de-escaping of the string
>> until it is known to be wanted. That's unfortunately going to be a bit
>> invasive, but I can't see a better solution. I'll work on it ASAP.
>
> Not sure that this idea isn't a dead end.  IIUC, you're proposing to
> jump through hoops in order to avoid complaining about illegal JSON
> data, essentially just for backwards compatibility with 9.2's failure to
> complain about it.  If we switch over to a pre-parsed (binary) storage
> format for JSON values, won't we be forced to throw these errors anyway?
> If so, maybe we should just take the compatibility hit now while there's
> still a relatively small amount of stored JSON data in the wild.

No, I probably haven't explained it very well. Here is the regression diff from jacana:


      ERROR:  cannot call json_populate_recordset on a nested object
      -- handling of unicode surrogate pairs
      select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
   !           correct
   ! ----------------------------
   !  "\ud83d\ude04\ud83d\udc36"
   ! (1 row)
   !
      select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
      ERROR:  invalid input syntax for type json
      DETAIL:  high order surrogate must not follow a high order surrogate.
   --- 922,928 ----
      ERROR:  cannot call json_populate_recordset on a nested object
      -- handling of unicode surrogate pairs
      select json '{ "a":  "\ud83d\ude04\ud83d\udc36" }' -> 'a' as correct;
   ! ERROR:  character with byte sequence 0xf0 0x9f 0x98 0x84 in encoding "UTF8" has no equivalent in encoding "WIN1252"
      select json '{ "a":  "\ud83d\ud83d" }' -> 'a'; -- 2 high surrogates in a row
      ERROR:  invalid input syntax for type json
      DETAIL:  high order surrogate must not follow a high order surrogate.


The sequence in question is two perfectly valid surrogate pairs.
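
To make the arithmetic concrete, here is a minimal C sketch of how a high/low surrogate pair combines into one code point and how that code point is encoded as UTF-8. The function names are hypothetical, chosen for illustration only, and are not the ones in the PostgreSQL source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not PostgreSQL's own functions. */
int is_high_surrogate(uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
int is_low_surrogate(uint32_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Combine a valid high/low surrogate pair into one code point. */
uint32_t combine_surrogates(uint32_t hi, uint32_t lo)
{
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

/*
 * Encode a code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 for a lone surrogate
 * or a value beyond U+10FFFF.
 */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000)
    {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For \ud83d\ude04 this gives U+1F604, whose UTF-8 form is the byte sequence 0xf0 0x9f 0x98 0x84 named in the error message above.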

...

After thinking about this some more I have come to the conclusion that we should only do any de-escaping of \uxxxx sequences, whether or not they are for BMP characters, when the server encoding is utf8. For any other encoding, which is already a violation of the JSON standard anyway, and should be avoided if you're dealing with JSON, we should just pass them through even in text output. This will be a simple and very localized fix.
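
A rough sketch of that rule, as a standalone C function rather than the actual lexer change: the names parse_hex4 and deescape_unicode and the server_is_utf8 flag are hypothetical, and a real patch would consult the server encoding and report errors through the backend's usual mechanisms. It handles only \uxxxx escapes and copies every other escape through untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Parse exactly four hex digits at s; returns -1 if any is not a hex digit. */
int parse_hex4(const char *s)
{
    int v = 0;

    for (int i = 0; i < 4; i++)
    {
        char c = s[i];

        v <<= 4;
        if (c >= '0' && c <= '9')      v |= c - '0';
        else if (c >= 'a' && c <= 'f') v |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') v |= c - 'A' + 10;
        else return -1;
    }
    return v;
}

/*
 * Copy the body of a JSON string (without the surrounding quotes) to out.
 * When server_is_utf8 is nonzero, \uXXXX escapes are decoded to UTF-8;
 * otherwise they are passed through verbatim. Returns 0 on success, -1 on
 * a malformed escape, a lone surrogate, or \u0000 (this sketch rejects it
 * because out is a C string). out needs room for strlen(in) + 1 bytes.
 */
int deescape_unicode(const char *in, int server_is_utf8, char *out)
{
    while (*in)
    {
        if (in[0] == '\\' && in[1] == 'u' && server_is_utf8)
        {
            int      hi = parse_hex4(in + 2);
            uint32_t cp;

            if (hi <= 0)
                return -1;
            cp = (uint32_t) hi;
            in += 6;
            if (cp >= 0xD800 && cp <= 0xDBFF)
            {
                /* A high surrogate must be followed by a low surrogate. */
                int lo;

                if (in[0] != '\\' || in[1] != 'u')
                    return -1;
                lo = parse_hex4(in + 2);
                if (lo < 0xDC00 || lo > 0xDFFF)
                    return -1;
                cp = 0x10000 + ((cp - 0xD800) << 10) + ((uint32_t) lo - 0xDC00);
                in += 6;
            }
            else if (cp >= 0xDC00 && cp <= 0xDFFF)
                return -1;      /* low surrogate with no high surrogate */

            if (cp < 0x80)
                *out++ = (char) cp;
            else if (cp < 0x800)
            {
                *out++ = (char) (0xC0 | (cp >> 6));
                *out++ = (char) (0x80 | (cp & 0x3F));
            }
            else if (cp < 0x10000)
            {
                *out++ = (char) (0xE0 | (cp >> 12));
                *out++ = (char) (0x80 | ((cp >> 6) & 0x3F));
                *out++ = (char) (0x80 | (cp & 0x3F));
            }
            else
            {
                *out++ = (char) (0xF0 | (cp >> 18));
                *out++ = (char) (0x80 | ((cp >> 12) & 0x3F));
                *out++ = (char) (0x80 | ((cp >> 6) & 0x3F));
                *out++ = (char) (0x80 | (cp & 0x3F));
            }
        }
        else if (in[0] == '\\' && in[1] != '\0')
        {
            /* Any other escape, or \u in a non-UTF8 server: copy both bytes. */
            *out++ = *in++;
            *out++ = *in++;
        }
        else
            *out++ = *in++;
    }
    *out = '\0';
    return 0;
}
```

With server_is_utf8 set to 0, the string from the regression test comes back unchanged, so no conversion to an encoding such as WIN1252 is ever attempted.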

We'll still have to deal with this issue when we get to binary storage of JSON, but that's not something we need to confront today.


cheers

andrew



