Re: [HACKERS] JSON and unicode surrogate pairs

Hannu Krosing Tue, 11 Jun 2013 06:58:59 -0700

On 06/11/2013 03:42 PM, Andrew Dunstan wrote:
>
> On 06/11/2013 09:16 AM, Hannu Krosing wrote:
>
>
>>>
>>> It's a pity that we don't have a non-error producing conversion function
>>> (or if we do that I haven't found it). Then we might adopt a rule for
>>> processing unicode escapes that said "convert unicode escapes to the
>>> database encoding
>> only when extracting JSON keys or values to text does it make sense to
>> unescape to database encoding.
>
> That's exactly the scenario we are talking about. When emitting JSON
> the functions have always emitted unicode escapes as they are in the
> text, and will continue to do so.
>
>>
>> strings inside JSON itself are by definition utf8
>
>
> We have deliberately extended that to allow JSON strings to be in any
> database server encoding. 
Ugh!


Does that imply that we do not just "allow" it, but rather "require" it?

Why are we then arguing about "unicode surrogate pairs" as a "JSON thing"?

Should it not be a "client to server encoding conversion thing" instead?

> That was argued back in the 9.2 timeframe and I am not interested in
> re-litigating it.
>
> The only issue at hand is how to handle unicode escapes (which in
> their string form are pure ASCII) when emitting text strings.
Unicode escapes in non-unicode strings seem to be something that is
ill-defined by nature ;)

That is, you can't come up with a good general answer for this.
>>> if possible, and if not then emit them unchanged." which might be a
>>> reasonable compromise.
>> I'd opt for "... and if not then emit them quoted". The default should
>> be not losing any data.
>>
>>
>>
>
>
> I don't know what this means at all. Quoted how? Let's say I have a
> Latin1 database and have the following JSON string: "\u20AC2.00". In a
> UTF8 database the text representation of this is €2.00 - what are you
> saying it should be in the Latin1 database?

utf8-quote the '€', i.e. leave it as the escape: "\u20AC2.00"

That is, convert unicode-->Latin1 wherever a character has a
correspondence, and utf8-quote anything that does not.

If we allow unicode escapes in non-unicode strings anyway, then this
seems the most logical thing to do.
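
For illustration only, here is a minimal sketch of the rule described above: unescape a \uXXXX sequence when the target encoding can represent the character, and otherwise leave the escape exactly as written. It is written in Python rather than the backend's C, it is not code from this thread or from PostgreSQL, and the function name unescape_where_possible is hypothetical. It assumes the input is the raw body of a JSON string, and it only deals with \u escapes and escaped backslashes, not with \n, \" and the other JSON escapes.

```python
import re

_HEX = "[0-9a-fA-F]"

# Alternatives, tried left to right at each position:
#   1. an escaped backslash (kept as is, so "\\u00e9" is not misread),
#   2. a surrogate pair written as two \uXXXX escapes,
#   3. a single \uXXXX escape.
_ESCAPE = re.compile(
    r"\\\\"
    r"|\\u([dD][89abAB]" + _HEX + r"{2})\\u([dD][c-fC-F]" + _HEX + r"{2})"
    r"|\\u(" + _HEX + r"{4})"
)


def unescape_where_possible(text, encoding):
    """Replace \\uXXXX escapes with real characters when `encoding` can
    represent them; leave every other escape untouched (hypothetical helper)."""

    def replace(match):
        high, low, single = match.groups()
        if high is not None:
            # Combine the surrogate pair into one code point above U+FFFF.
            code = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
        elif single is not None:
            code = int(single, 16)
        else:
            # Escaped backslash: nothing to convert.
            return match.group(0)
        char = chr(code)
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            # No correspondence in the target encoding (or a lone
            # surrogate): keep the original escape so no data is lost.
            return match.group(0)
        return char

    return _ESCAPE.sub(replace, text)


if __name__ == "__main__":
    print(unescape_where_possible(r"\u20AC2.00", "utf-8"))
    print(unescape_where_possible(r"\u20AC2.00", "latin-1"))
    print(unescape_where_possible(r"caf\u00e9 \u20AC2.00", "latin-1"))
```

With "latin-1" as the target, the \u00e9 in the last call becomes é while \u20AC stays escaped, which is the mixed result the rule implies for a Latin1 database.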


>
> cheers
>
> andrew
>
>


-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
