The following case has just been brought to my attention (look at the differing number of backslashes):

   andrew=# select jsonb '"\\u0000"';
      jsonb
   ----------
     "\u0000"
   (1 row)

   andrew=# select jsonb '"\u0000"';
      jsonb
   ----------
     "\u0000"
   (1 row)

   andrew=# select json '"\u0000"';
       json
   ----------
     "\u0000"
   (1 row)

   andrew=# select json '"\\u0000"';
       json
   -----------
     "\\u0000"
   (1 row)

The problem is that jsonb stores the parsed, unescaped value of the string, while json stores the input text as given. When the string parser sees the input with the two backslashes, it outputs a single backslash, and then it encounters the remaining characters and emits them as is, resulting in a token of '\u0000'. When it encounters the input with one backslash, it recognizes a unicode escape, and because it's for U+0000 it emits the literal text '\u0000' rather than a NUL byte. All other unicode escapes are resolved to their characters, so this is the only case where the input is ambiguous.
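
To make the collapse concrete, here's a sketch (output reconstructed rather than captured from a live server): any other unicode escape resolves to its character, while the two spellings of U+0000 become indistinguishable:

   andrew=# select jsonb '"\u0041"';
    jsonb
   -------
    "A"
   (1 row)

   andrew=# select jsonb '"\\u0000"' = jsonb '"\u0000"';
    ?column?
   ----------
    t
   (1 row)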

Things get worse, though. On output, '\uabcd' (or any backslash-u followed by four hex digits) is recognized as a unicode escape, and thus the backslash is not escaped again, so that we get:

   andrew=# select jsonb '"\\uabcd"';
      jsonb
   ----------
     "\uabcd"
   (1 row)
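
This breaks round-tripping: feeding the printed form back in re-parses the bare '\uabcd' as a real unicode escape. A sketch, assuming a UTF-8 database (the text-to-jsonb cast just goes through the normal jsonb input function):

   andrew=# select (jsonb '"\\uabcd"')::text::jsonb = jsonb '"\\uabcd"';
    ?column?
   ----------
    f
   (1 row)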


We could probably fix this fairly easily for non-U+0000 cases by having jsonb_to_cstring use a different escape_json routine.

But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case would look like. Maybe we should detect such input and emit a warning about the ambiguity? It's likely to be rare enough, but clearly not as rare as we'd like, since this is a report from the field.
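
In the meantime, an application-side stopgap might be to flag any value whose printed form contains the six characters '\u0000', since after parsing the two spellings can no longer be told apart. A sketch only:

   andrew=# select strpos((jsonb '"\\u0000"')::text, '\u0000') > 0 as ambiguous;
    ambiguous
   -----------
    t
   (1 row)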

cheers

andrew

