On Sun, 30 Aug 2009 06:54:21 +0200, Dieter Maurer wrote:

>> What you propose would break the property "unichr(i) always returns
>> a string of length one, if it returns anything at all".
>
> But getting a "ValueError" in some builds (and not in others)
> is rather worse than getting unicode strings of different length....

Not necessarily. If the code assumes that unichr() always returns a single-character string, it will silently produce bogus results when unichr() returns a pair of surrogates. An exception is usually preferable to silently producing bad data.

If unichr() returns a surrogate pair, what is e.g. unichr(i).isalpha() supposed to do?

Using surrogates is fine in an external representation (UTF-16), but it doesn't make sense as an internal representation.

Think: why do people use wchar_t[] rather than a char[] encoded in UTF-8? Because a wchar_t[] allows you to index *characters*, which you can't do with a multi-byte encoding. You can't do it with a multi-*word* encoding either.

UCS-2 and UTF-16 are superficially so similar that people forget that they're completely different beasts. UCS-2 is fixed-length, UTF-16 is variable-length. This makes UTF-16 semantically much closer to UTF-8 than to UCS-2 or UCS-4.

If your wchar_t is 16 bits, the only sane solution is to forgo support for characters outside of the BMP.

The alternative is to process wide strings in exactly the same way that you process narrow (mbcs) strings; e.g. extracting character N requires iterating over the string from the beginning until you have counted N-1 characters. This provides no benefit over using narrow strings except for a slight performance gain from halving the number of iterations. You still end up with indexing being O(n) rather than O(1).
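
To make the indexing argument concrete, here is a rough sketch of what treating a 16-bit wide string as UTF-16 involves. It is written for Python 3 and only simulates a 16-bit wchar_t[] with a list of code units; the helper names utf16_units() and char_at() are made up for this illustration and are not part of any library.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: stands in for a wchar_t[] with 16-bit elements.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def char_at(units, n):
    """Return the code point of character n (0-based).

    There is no way to jump straight to character n: every earlier
    code unit has to be inspected to see whether it starts a pair.
    """
    i = 0
    count = 0
    while i < len(units):
        paired = is_high_surrogate(units[i]) and i + 1 < len(units)
        if count == n:
            if paired:
                high, low = units[i], units[i + 1]
                return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            return units[i]
        i += 2 if paired else 1
        count += 1
    raise IndexError("character index out of range")


# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, outside the BMP), 'b'
text = "a\U0001D11Eb"
units = utf16_units(text)

print(len(text))               # number of characters
print(len(units))              # number of 16-bit code units
print(hex(units[2]))           # naive index 2 lands on a low surrogate
print(chr(char_at(units, 2)))  # the scan finds the real third character
```

Naive indexing with units[2] hands back half of a surrogate pair, while char_at() gets the right answer only by walking the string from the start, which is the O(n) cost described above.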