On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> BTW, could the \uNNNN problem be finessed by leaving such escapes in
> source form?
Yes, it could.  However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
being compared.

Here's a Unicode handling policy I came up with.  It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed:

 * Whitespace characters surrounding tokens are removed.
 * Long escapes (like \u0022) are converted to short escapes (like \")
   where possible.
 * Unnecessary escapes of ASCII characters (e.g. \u0061 and even
   \u007F) are converted to their respective characters.
 * Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
   are converted to their respective characters, but only if the
   database encoding is UTF-8.

On output (json_out), non-ASCII characters are converted to \uXXXX
escapes, unless one or more of these very likely circumstances hold:

 * The client encoding and database encoding are the same.  No
   conversion is performed, so escaping characters would not prevent
   any conversion errors anyway.
 * The client encoding is UTF-8.  Escaping is not necessary because
   the client can encode all Unicode codepoints.
 * The client encoding and/or database encoding is SQL_ASCII.
   SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent.  If any escapes cannot be represented in
the database encoding, an error will be raised.  Note that:

 * If the database encoding is UTF-8, conversion will never fail.
 * If the database encoding is SQL_ASCII, conversion will fail if any
   escapes of non-ASCII characters are present.

---

However, I'm having a really hard time figuring out how comparison
would work in this framework.  Here are a few options:

 1. Convert the strings to UTF-8, convert the escapes to characters,
    and compare the strings.
 2. Convert the escapes to the database encoding, then compare the
    strings.
 3. If either string contains escapes of non-ASCII characters, do 1.
    Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.
(A rough sketch of the escape-decoding step it relies on is in the
P.S. below.)

Number 3 could produce confusing results.  If character set X has
three different representations of one Unicode codepoint, then we
could have scenarios like this (simplified; the two ♫'s below stand
for two different charset-X representations of the same codepoint):

    "aaa♫" != "aaa♫"

but:

    "aaa\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding.
This would have the nice property that all strings can be unescaped
server-side.  Problem is, what if a browser or other program
produces, say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a
database where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement.  It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey
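P.S. For concreteness, here is a minimal, self-contained C sketch of
the escape-decoding step that input canonicalization and comparison
option 1 both depend on: turning a \uXXXX escape (surrogate pairs
included) into UTF-8 bytes.  It is not taken from the patch, and
parse_hex4()/utf8_encode() are made-up names; a real implementation
would also have to reject lone surrogates and other malformed escapes.

/*
 * Hypothetical sketch only, not from the patch: decode one JSON \uXXXX
 * escape (here the surrogate pair \uD834\uDD1E used above) into UTF-8
 * bytes.  parse_hex4() and utf8_encode() are made-up helper names.
 */
#include <stdio.h>

/* Parse exactly four hex digits; return the value, or -1 if malformed. */
static int parse_hex4(const char *s)
{
    int value = 0;

    for (int i = 0; i < 4; i++)
    {
        int c = s[i], digit;

        if (c >= '0' && c <= '9')
            digit = c - '0';
        else if (c >= 'a' && c <= 'f')
            digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            digit = c - 'A' + 10;
        else
            return -1;
        value = (value << 4) | digit;
    }
    return value;
}

/* Encode a Unicode codepoint as UTF-8; return the number of bytes written. */
static int utf8_encode(long cp, unsigned char *out)
{
    if (cp < 0x80)
    {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

int main(void)
{
    long cp = parse_hex4("D834");       /* high surrogate */
    int  lo = parse_hex4("DD1E");       /* low surrogate */
    unsigned char buf[4];

    /* Combine the UTF-16 surrogate pair into a single codepoint. */
    if (cp >= 0xD800 && cp <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF)
        cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);

    int nbytes = utf8_encode(cp, buf);

    printf("U+%04lX ->", (unsigned long) cp);
    for (int i = 0; i < nbytes; i++)
        printf(" %02X", buf[i]);
    printf("\n");                       /* prints: U+1D11E -> F0 9D 84 9E */
    return 0;
}

Compiled with any C99 compiler, this prints "U+1D11E -> F0 9D 84 9E",
i.e. the UTF-8 bytes for MUSICAL SYMBOL G CLEF.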