On Sat, Jan 14, 2012 at 3:06 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> Second, what should we do when the database encoding isn't UTF8? I'm
> inclined to emit a \unnnn escape for any non-ASCII character (assuming it
> has a Unicode code point - are there any code points in the non-Unicode
> encodings that don't have Unicode equivalents?). The alternative would be
> to fail on non-ASCII characters, which might be ugly. Of course, anyone
> wanting to deal with JSON should be using UTF8 anyway, but we still have
> to deal with these things. What about SQL_ASCII? If there's a non-ASCII
> sequence there we really have no way of telling what it should be. There
> at least I think we should probably error out.
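For concreteness, the output-side escaping described above (replace each
non-ASCII character with \unnnn, or with a UTF-16 surrogate pair above
U+FFFF) could look roughly like the standalone sketch below.  The name
emit_unicode_escape is purely illustrative, and it assumes a Unicode code
point is already in hand, which is exactly what a non-UTF8 (or SQL_ASCII)
database can't always give us:

#include <stdio.h>

/* Print the \unnnn escape(s) for one Unicode code point.  JSON
 * (RFC 4627) represents code points above the BMP as a UTF-16
 * surrogate pair. */
static void
emit_unicode_escape(unsigned int cp)
{
    if (cp > 0xFFFF)
    {
        cp -= 0x10000;
        printf("\\u%04X", 0xD800 + (cp >> 10));
        printf("\\u%04X", 0xDC00 + (cp & 0x3FF));
    }
    else
        printf("\\u%04X", cp);
}

int
main(void)
{
    emit_unicode_escape(0x00E9);   /* e-acute        -> \u00E9        */
    emit_unicode_escape(0x1D11E);  /* musical G clef -> \uD834\uDD1E  */
    putchar('\n');
    return 0;
}
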
I don't think there is a satisfying solution to this problem.  Things
working against us:

 * Some server encodings support characters that don't map to Unicode
   characters (e.g. unused slots in Windows-1252).  Thus, converting to
   UTF-8 and back is lossy in general.

 * We want a normalized representation for comparison.  This will involve
   a mixture of server and Unicode characters, unless the encoding is
   UTF-8.

 * We can't efficiently convert individual characters to and from Unicode
   with the current API.

 * What do we do about \u0000?  TEXT datums cannot contain NUL characters.

I'd say just ban Unicode escapes and non-ASCII characters unless the server
encoding is UTF-8, and ban \u0000 escapes entirely (a rough sketch of that
rule is appended below).  It's easy, and whatever we support later will be
a superset of this.  Strategies for handling this situation were discussed
in earlier emails on this thread; this is where things got stuck last time.

- Joey
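Appended, to make the proposed rule concrete: a minimal standalone sketch,
not actual backend code.  The name check_json_escapes and the is_utf8 flag
are illustrative; in the backend the flag would come from something like
GetDatabaseEncoding() == PG_UTF8, and the check would live in the JSON
lexer (which would also validate the hex digits this sketch skips over).

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Scan a JSON string literal.  Reject non-ASCII bytes and \u escapes
 * unless the database encoding is UTF-8, and reject \u0000 always. */
static bool
check_json_escapes(const char *s, bool is_utf8, const char **errmsg)
{
    const unsigned char *p;

    for (p = (const unsigned char *) s; *p; p++)
    {
        if (*p > 0x7F && !is_utf8)
        {
            *errmsg = "non-ASCII character, but database encoding is not UTF8";
            return false;
        }
        if (*p == '\\' && p[1] == 'u')
        {
            /* \u0000 would put a NUL into a TEXT datum: reject always */
            if (strncmp((const char *) p + 2, "0000", 4) == 0)
            {
                *errmsg = "\\u0000 cannot be converted to text";
                return false;
            }
            /* other \u escapes are only lossless when the server
             * encoding is UTF-8, so ban them elsewhere for now */
            if (!is_utf8)
            {
                *errmsg = "\\u escapes require a UTF8 database encoding";
                return false;
            }
            p++;    /* step onto the 'u'; real lexer checks hex digits */
        }
        else if (*p == '\\' && p[1] != '\0')
            p++;    /* skip the escaped character (\", \\, \n, ...) */
    }
    *errmsg = NULL;
    return true;
}

int
main(void)
{
    const char *err;

    /* non-ASCII bytes: accepted in UTF-8, rejected elsewhere */
    printf("%d\n", check_json_escapes("caf\xc3\xa9", true, &err));   /* 1 */
    printf("%d\n", check_json_escapes("caf\xc3\xa9", false, &err));  /* 0 */
    /* \u0000 is rejected regardless of encoding */
    printf("%d\n", check_json_escapes("a\\u0000b", true, &err));     /* 0 */
    return 0;
}

Note that \u0000 is rejected even when the encoding is UTF-8: unescaping
it would put a NUL into a TEXT datum, which text cannot represent.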