The following case has just been brought to my attention (look at the differing number of backslashes):

   andrew=# select jsonb '"\\u0000"';
      jsonb
   ----------
     "\u0000"
   (1 row)

   andrew=# select jsonb '"\u0000"';
      jsonb
   ----------
     "\u0000"
   (1 row)

   andrew=# select json '"\u0000"';
       json
   ----------
     "\u0000"
   (1 row)

   andrew=# select json '"\\u0000"';
       json
   -----------
     "\\u0000"
   (1 row)

The problem is that jsonb stores the parsed, unescaped value of the string, while json stores the input text as given. When the string parser sees the input with the two backslashes, it outputs a single backslash, and then it encounters the remaining characters and emits them as is, resulting in a token of '\u0000'. When it encounters the input with one backslash, it recognizes a unicode escape, and because it's for U+0000 it emits the literal text '\u0000' rather than a NUL byte. All other unicode escapes are resolved to their characters, so this is the only case where the input is ambiguous.
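
To make the collapse concrete, here's a sketch (output reconstructed rather than captured from a live server): any other unicode escape resolves to its character, while the two spellings of U+0000 become indistinguishable:

   andrew=# select jsonb '"\u0041"';
    jsonb
   -------
    "A"
   (1 row)

   andrew=# select jsonb '"\\u0000"' = jsonb '"\u0000"';
    ?column?
   ----------
    t
   (1 row)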

Things get worse, though. On output, '\uabcd' (or any backslash-u followed by four hex digits) is recognized as a unicode escape, and thus the backslash is not escaped again, so that we get:

   andrew=# select jsonb '"\\uabcd"';
      jsonb
   ----------
     "\uabcd"
   (1 row)
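
This breaks round-tripping: feeding the printed form back in re-parses the bare '\uabcd' as a real unicode escape. A sketch, assuming a UTF-8 database (the text-to-jsonb cast just goes through the normal jsonb input function):

   andrew=# select (jsonb '"\\uabcd"')::text::jsonb = jsonb '"\\uabcd"';
    ?column?
   ----------
    f
   (1 row)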


We could probably fix this fairly easily for non-U+0000 cases by having jsonb_to_cstring use a different escape_json routine.

But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case would look like. Maybe we should detect such input and emit a warning about the ambiguity? It's likely to be rare enough, but clearly not as rare as we'd like, since this is a report from the field.
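
In the meantime, an application-side stopgap might be to flag any value whose printed form contains the six characters '\u0000', since after parsing the two spellings can no longer be told apart. A sketch only:

   andrew=# select strpos((jsonb '"\\u0000"')::text, '\u0000') > 0 as ambiguous;
    ambiguous
   -----------
    t
   (1 row)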

cheers

andrew

