On 2013-06-11 12:53 CEST, Hannu Krosing wrote:
On 06/11/2013 10:47 AM, Andres Freund wrote:
On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
It's legal, is it not, to just write the equivalent Unicode character in
the JSON string and not use the escapes?  If so I would think that that
would be the most common usage.  If someone's writing an escape, they
probably had a reason for doing it that way, and might not appreciate
our overriding their decision.
We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead of
JSON, and in those cases, when the value returned is a JSON string, we do
the following to it:
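(Sketch of the json- vs text-returning variants in question, e.g. the -> and ->> operators; untested, output approximate:)

=# SELECT '{"a":"\u00e9"}'::json -> 'a';   -- json-returning variant: escape kept as-is
 ?column?
----------
 "\u00e9"

=# SELECT '{"a":"\u00e9"}'::json ->> 'a';  -- text-returning variant: escape converted
 ?column?
----------
 é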

I have just realized that the problem is actually quite a lot bigger than
that. We also use this value for field name comparison. So, let us suppose
that we have a LATIN1 database and a piece of JSON with a field name
containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
Making that processable so it doesn't blow up would be mighty tricky and
error prone. The non-orthogonality I suggested as a solution upthread is, by
contrast, very small and easy to manage, and not terribly hard to explain -
see attached.
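For concreteness, a sketch of the field-name case (untested; it assumes the lookup key is compared against the de-escaped field name, as described above):

-- in a UTF-8 database the lookup can be written directly:
=# SELECT '{"\u20ac":1}'::json ->> '€';
 ?column?
----------
 1

-- in a LATIN1 database the '€' key cannot even be typed, so matching a
-- client-supplied key against the stored "\u20ac" would mean converting
-- the escape to a character the server encoding cannot represent.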
I think this all shows pretty clearly that it was a mistake allowing
json data in the database that we cannot entirely display with the
database's encoding. All the proposed ugly workarounds are only
necessary because we don't throw an error when originally validating the
json.
Even in a utf-8 database you can get errors due to \u unescaping (at
attribute access time, *NOT* at json_in() time) due to invalid
surrogate pairs.
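Something along these lines, I think (untested sketch of where the failure shows up):

=# SELECT '{"a":"\ud83d"}'::json;          -- lone high surrogate: json_in() accepts it
=# SELECT '{"a":"\ud83d"}'::json ->> 'a';  -- only here, at unescaping time, does it error out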

I think this goes contrary to the normal postgres approach of validating
data as strictly as necessary. And I think we are going to regret not
fixing this while there are still relatively few users out there.
Exactly -

  * allow in only valid JSON.
  * Validate all utf8 strings for valid unicode.
  * have one canonical way of outputting unicode - utf8 for utf8 databases,
escaped for all other encodings (rough sketch below)
  * If you need to store anything else, use text.
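Roughly, under those rules I would expect something like this (sketch of the proposed behaviour, not of what 9.3 does today):

=# SELECT '{"a":"\u20ac"}'::json;
-- canonical output in a utf8 database:
 {"a":"€"}
-- canonical (escaped) output under any other server encoding:
 {"a":"\u20ac"}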

Requiring that the "original text" be preserved in a json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
as '{"a":b"}'

ahem, do you mean instead that it should give this (with none -> null, and the missing '"' inserted in the answer):
=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json;
                 json
--------------------------------------
 {"a":"b"}

or only when "stored" in database and subsequently retrieved? The "original text" in this case was perfectly valid JSON text.

(I know that currently this is not true and will happen only once I
read the json value in the client)

Isn't this a good situation, and doesn't this also depend on the storage representation in the client?

What about this:
=# SELECT '{"measure":"seconds", "measure":42}'::json;
                 json
--------------------------------------
 {"measure":42}

I presume that people used to storing metadata in "preceding" json object members with duplicate names would want to decide, in the client requesting the data, what to do with that metadata and at what point to "drop" it, wouldn't they? :-)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)
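If someone really wanted that, a minimal sketch could be a domain over text whose check simply runs the value through json_in() via a cast (hypothetical, untested):

=# CREATE DOMAIN jsontext AS text
       CHECK ((VALUE::json) IS NOT NULL);  -- rejects anything that does not parse as json
=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::jsontext;  -- kept verbatim, no unescaping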


All the best,
Stefan.

