Re: [HACKERS] Unicode string literals versus the world

Andrew Dunstan Thu, 16 Apr 2009 08:34:54 -0700


Tom Lane wrote:

Sam Mason <s...@samason.me.uk> writes:

I'd never heard of UTF-16 surrogate pairs before this discussion and
hence didn't realise that it's valid to have a surrogate pair in place
of a single code point.  The docs say that <D800 DF02> corresponds to
U+10302, Python would appear to follow my intuitions in that:

  ord(u'\uD800\uDF02')

results in an error instead of giving back 66306, as I'd expect.  Is
this a bug in Python, my understanding, or something else?
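
For reference, the arithmetic behind the <D800 DF02> to U+10302 mapping can be sketched as below. This is only an illustration of the UTF-16 rule; the function name combine_surrogates is made up for this example and is not part of Python or PostgreSQL.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: U+%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: U+%04X" % low)
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF02)
print(hex(code_point), code_point)  # 0x10302 66306
```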


I might be wrong, but I think surrogate pairs are expressly forbidden in
all representations other than UTF16/UCS2.  We definitely forbid them
when validating UTF-8 strings --- that's per an RFC recommendation.
It sounds like Python is doing the same.
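
As a small illustration of that rule, here is how a strict UTF-8 decoder behaves. The sketch assumes Python 3, whose built-in decoder rejects encoded surrogates by default; the helper name is_valid_utf8 is hypothetical and has nothing to do with the PostgreSQL validation code.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10302 encoded directly as a four-byte UTF-8 sequence: accepted.
print(is_valid_utf8(b"\xf0\x90\x8c\x82"))
# The surrogate U+D800 pushed through the three-byte UTF-8 pattern: rejected.
print(is_valid_utf8(b"\xed\xa0\x80"))
```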

You mustn't encode the surrogate, but it's up to us how we allow people to designate a given code point.

Frankly, I think we shouldn't provide for using surrogates at all. I would prefer something like \uXXXX for BMP items and \UXXXXXXXX as the straight 32-bit designation of a higher codepoint.
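
To make the idea concrete, here is a rough sketch of how such escapes could be decoded while refusing surrogates outright. It is an illustration in Python 3 only, not a proposal for the actual lexer; the name decode_escapes and the error wording are made up, and it deliberately ignores details such as escaped backslashes.

```python
import re

# \uXXXX takes exactly four hex digits, \UXXXXXXXX exactly eight.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text):
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the designated code points."""
    def replace(match):
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            # No surrogate pairs: higher code points must use \UXXXXXXXX.
            raise ValueError("surrogate U+%04X is not allowed" % code_point)
        if code_point > 0x10FFFF:
            raise ValueError("U+%X is outside the Unicode range" % code_point)
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


print(ord(decode_escapes(r"\U00010302")))  # 66306
```

With this scheme \uD800\uDF02 is simply an error, and U+10302 has exactly one spelling.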


cheers

andrew
