I think the discussion is going in the wrong direction:

The choice between UCS2 and UCS4 builds is really only meant
to make it easier to interface to native OS or application
APIs, e.g. Windows LIBC and Java use UTF-16, while glibc
on Unix uses UCS4.
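
To make the difference concrete, here is a small sketch in current
Python 3 syntax (an assumption; the builds discussed here predate it).
It only counts how many code units one non-BMP code point needs in each
encoding; the variable names are made up for illustration.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\U0001D11E"

# Count fixed-width code units in each encoding (little-endian, no BOM).
utf16_units = len(clef.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(clef.encode("utf-32-le")) // 4   # 32-bit code units

print(utf16_units)  # 2: a surrogate pair
print(utf32_units)  # 1: a single code unit
```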

The problem of slicing Unicode objects is far more complicated
than just breaking a surrogate pair. Unicode is full of combining
code points - if you break such a sequence, the output will be
just as wrong, regardless of UCS2 vs. UCS4.
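
A minimal illustration of that point, written in current Python 3
syntax (an assumption, not the Python of this thread). The string uses
only BMP code points, so no surrogates are involved, yet a plain slice
still separates a letter from its accent:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one
# accented letter, but it is stored as two separate code points.
text = "cafe\u0301"

# Slicing by code point index cuts between the base letter and its accent.
head, tail = text[:4], text[4:]

print(len(text))                        # 5 code points
print(head == "cafe")                   # True: the accent was cut off
print(unicodedata.combining(tail[0]))   # 230: tail starts with a combining mark
```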

A long time ago we had a discussion about these problems. I had
suggested a new module (unicodeindex IIRC) which takes care of indexing
Unicode strings based on code points (with support for surrogates),
glyphs (taking combining code points into account) and words (with
support for various breaking/non-breaking separation code points).
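
As a rough sketch of what the glyph-level part of such indexing could
look like (this is not the unicodeindex proposal itself; the function
name combining_sequences is hypothetical, and the grouping rule is a
simplification that only looks at the canonical combining class, not
the full Unicode text segmentation rules):

```python
import unicodedata

def combining_sequences(text):
    """Group each base code point with the combining marks that follow it.

    Simplified: a code point joins the previous group only if its
    canonical combining class is non-zero.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # attach the mark to its base character
        else:
            groups.append(ch)     # start a new group
    return groups

groups = combining_sequences("cafe\u0301s")
print(len(groups))                # 5 groups for 6 code points

# Slicing the list of groups keeps the accent with its base letter.
first_four = "".join(groups[:4])
print(first_four == "cafe\u0301") # True
```

Indexing the list of groups rather than the string itself is what keeps
this independent of how the code points are stored.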

Trying to solve such issues at the storage level is the wrong
approach, since the problem is application specific and thus requires
a higher-level set of possible solutions.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jul 03 2008)
   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611