I misspoke. I meant to ask: "How do you normalize away surrogate pairs in UTF-16?" It was a rhetorical question. The point was just that decomposed characters can be handled by implicit or explicit normalization. Surrogate pairs can only be similarly normalized away if your model allows you to represent their normalized forms; a UTF-16 character model would not.
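
Concretely, a quick sketch at the interactive prompt (using the unicodedata
module purely for illustration): normalization collapses the decomposed form
into a single code point, but a supplementary character such as U+1D11E still
occupies two code units in any UTF-16 view, and no normalization form changes
that.

    >>> import unicodedata
    >>> unicodedata.normalize("NFC", u"e\u0301") == u"\u00e9"  # decomposition gone
    True
    >>> len(u"\U0001D11E".encode("utf-16-le")) // 2            # still a surrogate pair
    2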

On 9/26/06, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
Paul Prescod wrote:
>  There is at least one big difference between surrogate pairs and
> decomposed characters. The user can typically normalize away
> decompositions. How do you normalize away decompositions in a language
> that only supports 16-bit representations?

I don't see the problem: You use UTF-16; all normal forms (NFC, NFD,
NFKC, NFKD) can be represented in UTF-16 just fine.
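
For instance (a rough sketch, again leaning on unicodedata for illustration):
both the composed and the decomposed form of é encode to UTF-16 without any
trouble, just with different numbers of code units.

    >>> import unicodedata
    >>> len(unicodedata.normalize("NFD", u"\u00e9").encode("utf-16-le")) // 2
    2
    >>> len(unicodedata.normalize("NFC", u"\u00e9").encode("utf-16-le")) // 2
    1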

It is somewhat tricky to implement a normalization algorithm in
UTF-16, since you must combine surrogate pairs first in order to
find out what the canonical decomposition of the code point is;
but it's just more code, and no problem in principle.
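
Roughly, the pairing step looks like this (a sketch at the prompt; the
unicodedata lookup assumes a build whose strings expose whole code points):

    >>> import unicodedata
    >>> hi, lo = 0xD834, 0xDD5E    # surrogate pair for U+1D15E MUSICAL SYMBOL HALF NOTE
    >>> cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    >>> hex(cp)
    '0x1d15e'
    >>> unicodedata.decomposition(u"\U0001D15E")   # the table lookup wants the code point
    '1D157 1D165'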

Regards,
Martin
