On Thu, Apr 16, 2009 at 02:47:20PM +0300, Marko Kreen wrote:
> On 4/16/09, Sam Mason <s...@samason.me.uk> wrote:
> > Microsoft have also gone this way in C#, named code points are not
> > supported however.
>
> And it handles also non-BMP codepoints with \u escape similarly:
>
>   http://en.csharp-online.net/ECMA-334:_9.4.1_Unicode_escape_sequences
>
> This makes it even more standard.

I fail to see what you're pointing out here; as far as I understand it,
\u is for BMP code points and \U extends the range out to 32-bit code
points. I can't see anything about non-BMP and \u in the above link;
you appear free to write your own surrogate pairs, but that seems like
an independent issue.

I'd not realised before that C# is specified to use UTF-16 as its
internal encoding.

> > This would be following the BitC[2] project, especially if it was more
> > like:
> >
> >   \{U+xxxx}
>
> We already got yet-another-unique-way-of-escaping-unicode with U&.
>
> Now let's try to support some actual standard also.

That comes across *very* negatively; I hope it's just a language issue.

I read your parent post as soliciting opinions on possible ways to
encode Unicode characters in PG's literals. The U&'lit' was criticised,
you posted some suggestions, and I followed up with what I hoped to be a
useful addition.

It seems useful here to separate "de jure" from "de facto" standards;
implementing U&'lit' would be following the de jure standard, anything
else would be de facto. A survey of existing SQL implementations would
seem to be more appropriate as well:

  Oracle: UNISTR(string-literal) and \xxxx

    It looks as though Oracle originally used UCS-2 internally (i.e.
    BMP only), but more recently Unicode support has been improved to
    allow other planes.

  MS-SQL Server:

    Can't find anything remotely useful; the best option seems to be
    NCHAR(integer-expression), which looks somewhat unmaintainable.

  DB2: U&string-literal and \xxxxxx

    i.e. it follows the SQL-2003 spec.

  FireBird:

    Can't find much either; support looks somewhat thin on the ground.

  MySQL:

    Same again; it seems to assume the query is encoded in UTF-8.

Summary seems to be that either I'm bad at searching, or support for
Unicode doesn't seem very complete in the database world and people
work around it somehow.

> You did not read my mail carefully enough - the Java and also Python/C#
> already support non-BMP chars with '\u' and exactly the same (utf16) way.

Again, I think this may be a language issue; if not, then more verbose
explanations help, maybe something like "sorry, I obviously didn't
explain that very well". You will of course have felt that you explained
it perfectly well, but everybody enters a discussion with different
intuitions and biases; email has a nasty habit of accentuating these
differences and compounding them with language problems.

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point. The docs say that <D800 DF02> corresponds to
U+10302. Python would appear to follow my intuitions, in that:

  ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect. Is
this a bug in Python, my understanding, or something else?

-- 
  Sam  http://samason.me.uk/
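
For reference, the surrogate-pair arithmetic discussed above can be
sketched as follows. This is only an illustration in Python 3, not
anything taken from PG, C# or Java; the helper names combine_surrogates
and split_code_point are made up for this example. It shows how the
pair <D800 DF02> maps to U+10302 (decimal 66306) and back, and that
Python 3 treats the two \u escapes as two separate code points unless
they are explicitly passed through a UTF-16 codec.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> tuple:
    """Split a non-BMP code point into its UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


cp = combine_surrogates(0xD800, 0xDF02)
print(hex(cp), cp)  # 0x10302 66306

# In Python 3 the literal below is two code points (two lone surrogates),
# so ord() on it raises TypeError rather than returning 66306.
pair = "\ud800\udf02"
print(len(pair))  # 2

# Round-tripping through UTF-16 with the 'surrogatepass' error handler
# joins the two surrogates into the single character U+10302.
joined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(joined), ord(joined))  # 1 66306
```

Whether a string literal should do this joining itself, or leave it to
the encoding layer as Python 3 does here, is exactly the design question
being argued over in this thread.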