On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum <gu...@python.org> wrote:
>> With narrow builds, code units can currently come into play
>> internally, but with PEP 393 everything internal will be working
>> directly with code points. Normalisation, combining characters and
>> bidi issues may still affect the correctness of unicode comparison and
>> slicing (and other text manipulation), but there are limits to how
>> much of the underlying complexity we can effectively hide without
>> being misleading.
>
> Let's just define a Unicode string to be a sequence of code points and
> let libraries deal with the rest. Ok, methods like lower() should
> consider characters, but indexing/slicing should refer to code points.
> Same for '=='; we can have a library that compares by applying (or
> assuming?) certain normalizations. Tom C tells me that case-less
> comparison cannot use a.lower() == b.lower(); fine, we can add that
> operation to the library too. But this exceeds the scope of PEP 393,
> right?

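For concreteness, here is a minimal sketch of the behaviour described in the quoted text: indexing and '==' operate on code points, while normalization-aware and case-less comparison live in helper functions. The helper names nfc_equal and caseless_equal are hypothetical, and the case-less helper assumes a Python version that provides str.casefold() (3.3 or later). It is a simplification, not full Unicode caseless matching.

```python
import unicodedata

# Two spellings of the same user-perceived character.
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# len() and indexing count code points, not characters.
assert len(composed) == 1
assert len(decomposed) == 2
assert decomposed[0] == "e"

# '==' compares the code point sequences, so these differ.
assert composed != decomposed


def nfc_equal(a, b):
    """Hypothetical helper: compare after normalizing both sides to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a, b):
    """Hypothetical helper: compare using case folding instead of lower().

    Simplified: a fuller version would also normalize before and after
    folding.
    """
    return a.casefold() == b.casefold()


assert nfc_equal(composed, decomposed)

# lower() is not enough for case-less comparison: "ß" has no single
# uppercase code point and lower() leaves it alone.
assert "STRASSE".lower() != "stra\u00dfe".lower()
assert caseless_equal("STRASSE", "stra\u00dfe")
```

Nothing here needs changes to the core type; both helpers are ordinary library code on top of code point based strings.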
Yep, I was agreeing with you on this point - I think you're right that
if we provide a solid code point based core Unicode type (perhaps with
some character based methods), then library support can fill the gap
between handling code points and handling characters.

In particular, a Unicode character based string type would be
significantly easier to write in Python than it would be in C (after
skimming Tom's bug report at http://bugs.python.org/issue12729, I
better understand the motivation and desire for that kind of interface
and it sounds like Terry's prototype is along those lines). Once those
mappings are thrashed out outside the core, then there may be something
to incorporate directly around the 3.4 timeframe (or potentially even
in 3.3, since it should already be possible to develop such a wrapper
based on UCS4 builds of 3.2).

However, there may be an important distinction to be made on the
Python-the-language vs CPython-the-implementation front: is another
implementation (e.g. PyPy) *allowed* to implement character based
indexing instead of code point based indexing for the 2.x unicode/3.x
str type? Or is the code point indexing part of the language spec, and
any character based indexing needs to be provided via a separate type
or module?

Regards,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev