On 2008-07-03 15:21, Jeroen Ruigrok van der Werven wrote:
-On [20080703 15:00], M.-A. Lemburg ([EMAIL PROTECTED]) wrote:
Unicode if full of combining code points - if you break such a sequence,
the output will be just as wrong; regardless of UCS2 vs. UCS4.

In my opinion you are confusing two related, but very separated things here.
Combining characters have nothing to do with breaking up the encoding of a
single codepoint. Sure enough, if you arbitrary slice up codepoints that
consist of combining characters then your result is indeed odd looking.

I never said that nor is that the point I am making.

Please remember that lone surrogate pair code points are perfectly
valid Unicode code points, nevertheless. Just as a lone combining
code point is valid on its own.

Guido points out that Python supports surrogate pairs and says that if
Python is dealing wrongly with this in the core than it needs to be fixed.
I am pointing out that given the fact we allow surrogate pairs we deal
rather simplistic with it in the core. In fact, we do not consider them at
all. In essence: though we may accept full 21-bit codepoints in the form of
\U00000000 escape sequences and store them internally as UTF-16 (which I
still need to verify) we subsequently deal with them programmatically as
UCS-2, which is plain silly.

Python applies conversion from non-BMP code points to surroagtes
for UCS builds in a few places and I agree that we should probably
do that at a few more places.

However, these are mainly conversion issues of encoded Unicode
representations vs. the internal Unicode storage where you want
to avoid exceptions in favor of finding a solution that preserves
data.

To make it clear: UCS2 builds of Python do not support non-BMP
code points out of the box.

A programmer will always have to use a codec to map the internal storage
on these builds to the full Unicode code point range. The following
codecs support surrogates on UCS2 builds:

 * UTF-8
 * UTF-16
 * UTF-32
 * unicode-escape
 * raw-unicode-escape

You either commit yourself fully to UTF-16 and surrogate pairs or not. Not
some form in-between, because that will ultimately lead to more confusion
due to the difference in results when dealing with Unicode.

Programmers will have to be aware of the fact that on UCS2
builds of Python non-BMP code points will have to be treated
differently than on UCS4 builds.

I don't see that as a problem. It is in a way similar to
32-bit vs. 64-bit builds of Python or the fact that floating point
numbers work differently depending on the Python platform or
compiler being used.

BTW: Have you ever run into any problems with UCS2 vs. UCS4
in practice that were not easy to solve ?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2008-07-07: EuroPython 2008, Vilnius, Lithuania             3 days to go

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to