On 01/14/2012 06:11 PM, Joey Adams wrote:
On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <and...@dunslane.net> wrote:
Second, what should we do when the database encoding isn't UTF8? I'm
inclined to emit a \unnnn escape for any non-ASCII character (assuming it
has a unicode code point - are there any code points in the non-unicode
encodings that don't have unicode equivalents?). The alternative would be to
fail on non-ASCII characters, which might be ugly. Of course, anyone wanting
to deal with JSON should be using UTF8 anyway, but we still have to deal
with these things. What about SQL_ASCII? If there's a non-ASCII sequence
there we really have no way of telling what it should be. There at least I
think we should probably error out.
I don't think there is a satisfying solution to this problem.  Things
working against us:

  * Some server encodings support characters that don't map to Unicode
characters (e.g. unused slots in Windows-1252).  Thus, converting to
UTF-8 and back is lossy in general.

  * We want a normalized representation for comparison.  This will
involve a mixture of server and Unicode characters, unless the
encoding is UTF-8.

  * We can't efficiently convert individual characters to and from
Unicode with the current API.

  * What do we do about \u0000 ?  TEXT datums cannot contain NUL characters.

I'd say just ban Unicode escapes and non-ASCII characters unless the
server encoding is UTF-8, and ban all \u0000 escapes.  It's easy, and
whatever we support later will be a superset of this.
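
To make the proposed rule concrete, here is a minimal sketch in Python (not PostgreSQL code). The function name check_json_string and its parameters are hypothetical, chosen only for illustration. It assumes the input is the body of a JSON string literal with escapes still unprocessed, and it is not a full JSON validator.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def check_json_string(body: str, server_encoding: str) -> None:
    """Raise ValueError if `body` breaks the restriction proposed above.

    `body` is the text between the quotes of a JSON string literal,
    with escape sequences still in their raw form. The function name
    and signature are hypothetical.
    """
    is_utf8 = server_encoding.upper() in ("UTF8", "UTF-8")
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if body[i + 1 : i + 2] == "u":
                digits = body[i + 2 : i + 6]
                if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
                    raise ValueError("malformed \\u escape")
                if int(digits, 16) == 0:
                    # Banned in every encoding: TEXT cannot hold NUL.
                    raise ValueError("\\u0000 is not allowed")
                if not is_utf8:
                    raise ValueError("Unicode escapes need a UTF-8 server encoding")
                i += 6
            else:
                # Any other two-character escape, e.g. \n or an escaped backslash.
                i += 2
            continue
        if ord(ch) > 0x7F and not is_utf8:
            raise ValueError("non-ASCII characters need a UTF-8 server encoding")
        i += 1
```

Under this sketch, plain ASCII input passes in any encoding, while a \u escape or a non-ASCII character passes only when the encoding is UTF-8. An escaped backslash followed by "u0000" is not treated as a Unicode escape.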

Strategies for handling this situation have been discussed in prior
emails.  This is where things got stuck last time.



Well, from where I'm coming from, NULs are not a problem. But escape_json() is currently totally encoding-unaware. It produces \unnnn escapes for low ASCII characters, and just passes through characters with the high bit set. That's possibly OK for EXPLAIN output - we really don't want EXPLAIN failing. But maybe we should ban JSON output for EXPLAIN if the encoding isn't UTF8.
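
The behaviour described above can be modelled with a short Python sketch. This is not the PostgreSQL source: the name escape_json_bytes is hypothetical, and the table of short escapes is an assumption based on ordinary JSON escaping rules. The point is only to show what "encoding-unaware" means at the byte level.

```python
# Two-character escapes assumed for this sketch (standard JSON escapes).
SHORT_ESCAPES = {
    0x08: b"\\b",
    0x0C: b"\\f",
    0x0A: b"\\n",
    0x0D: b"\\r",
    0x09: b"\\t",
    0x22: b'\\"',
    0x5C: b"\\\\",
}


def escape_json_bytes(raw: bytes) -> bytes:
    """Quote `raw` as a JSON string without looking at its encoding.

    Control bytes below 0x20 become \\unnnn escapes (or a short escape);
    every other byte, including bytes with the high bit set, is copied
    through unchanged.
    """
    out = bytearray(b'"')
    for byte in raw:
        if byte in SHORT_ESCAPES:
            out += SHORT_ESCAPES[byte]
        elif byte < 0x20:
            out += b"\\u%04x" % byte
        else:
            out.append(byte)  # high-bit bytes pass through as-is
    out += b'"'
    return bytes(out)
```

With this model, a LATIN1-encoded "é" (the single byte 0xE9) is copied into the output untouched, so the result is not valid UTF-8 even though it looks like JSON. That is the case the proposed EXPLAIN restriction would avoid.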

Another question in my mind is what to do when the client encoding isn't UTF8.

None of these is an insurmountable problem, ISTM - we just need to make some decisions.

cheers

andrew
