Re: [Python-Dev] PEP 393: Flexible String Representation

Stefan Behnel Fri, 28 Jan 2011 02:32:42 -0800

Florian Weimer, 28.01.2011 10:35:
> * Stefan Behnel:
>
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can double
>>> as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code
>
> Is this really true?  I don't think I've seen any C API which actually
> uses wchar_t, beyond that what is provided by libc.  UTF-8 and even
> UTF-16 are much, much more common.

They are also much harder to use, unless you are really only interested in
7-bit ASCII data - which is the case for most C libraries, so I believe
that's what you meant here. However, this is the CPython runtime with
built-in Unicode support, not the C runtime where it comes as an add-on at
best, and where Unicode processing without being Unicode aware is common.

The nice thing about Py_UNICODE is that it basically gives you native
Unicode code points directly, without needing to decode UTF-8 byte runs and
the like. In Cython, it allows you to do things like this:


    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False

The loop runs in plain C, using the somewhat obvious implementation with a
loop over Py_UNICODE characters and a switch statement for the comparison.
This would look a *lot* more ugly with UTF-8 encoded byte strings.
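To make the comparison concrete, here is a rough pure-Python sketch of the two scanning strategies. It is only an illustration, not what Cython generates: all function names are hypothetical, the target set just mirrors the escapes used above, and the UTF-8 decoder assumes well-formed input and does no validation.

```python
# Hypothetical names throughout; TARGETS mirrors the escapes in the example.
TARGETS = frozenset(map(ord, "\u0356\u1012\u3359\u4567"))


def contains_target_code_points(code_points):
    """Scan integer code points: one membership test per character."""
    for cp in code_points:
        if cp in TARGETS:
            return True
    return False


def decode_utf8_step(data, i):
    """Decode one code point from well-formed UTF-8 starting at offset i.

    Returns (code_point, next_offset). Assumes valid input; a real decoder
    would also have to reject truncated or overlong sequences.
    """
    b0 = data[i]
    if b0 < 0x80:  # 1-byte sequence (ASCII)
        return b0, i + 1
    if b0 < 0xE0:  # 2-byte sequence
        return ((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F), i + 2
    if b0 < 0xF0:  # 3-byte sequence
        cp = ((b0 & 0x0F) << 12) | ((data[i + 1] & 0x3F) << 6)
        return cp | (data[i + 2] & 0x3F), i + 3
    # 4-byte sequence
    cp = ((b0 & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12)
    cp |= (data[i + 2] & 0x3F) << 6
    return cp | (data[i + 3] & 0x3F), i + 4


def contains_target_utf8(data):
    """Same scan over UTF-8 bytes: every step needs a variable-length decode."""
    i = 0
    while i < len(data):
        cp, i = decode_utf8_step(data, i)
        if cp in TARGETS:
            return True
    return False
```

The first function is all that is needed when the code points are directly available; the second needs the extra decoding step for every character, even with validation left out.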

Regarding Cython specifically, the above will still be *possible* under the
proposal, given that the memory layout of the strings will still represent
the Unicode code points. It will just be trickier to implement in Cython's
type system as there is no longer a (user visible) C type representation
for those code units. It can be any of uchar, ushort16 or uint32, none
of which is necessarily a 'native' representation of a Unicode character in
CPython. While I'm somewhat confident that I'll find a way to fix this in
Cython, my point is just that this adds a certain level of complexity to C
code using the new memory layout that simply wasn't there before.
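As a sketch of the extra dispatch this implies, the following pure-Python model stores a string with 1, 2 or 4 bytes per code unit and has to pick the unit width at run time before it can scan. It is an assumption-laden illustration, not the PEP's C API: the names `pack_string`, `iter_code_points` and `contains_target` are hypothetical, and the `struct` formats merely stand in for the three C integer types.

```python
import struct

# Hypothetical mapping from code unit width (bytes) to a struct format.
# "=" selects native byte order with standard sizes of 1, 2 and 4 bytes.
_UNIT_FORMATS = {1: "=B", 2: "=H", 4: "=I"}

TARGETS = frozenset(map(ord, "\u0356\u1012\u3359\u4567"))


def pack_string(s):
    """Store s with the narrowest unit width that fits its largest code point.

    Returns (buffer, unit_size). This only models the idea of a flexible
    layout; it is not how CPython allocates strings.
    """
    top = max(map(ord, s), default=0)
    unit_size = 1 if top < 0x100 else 2 if top < 0x10000 else 4
    fmt = _UNIT_FORMATS[unit_size]
    return b"".join(struct.pack(fmt, ord(c)) for c in s), unit_size


def iter_code_points(buffer, unit_size):
    """Yield code points; the unit width must be chosen at run time."""
    fmt = _UNIT_FORMATS[unit_size]
    for (cp,) in struct.iter_unpack(fmt, buffer):
        yield cp


def contains_target(buffer, unit_size):
    """The same scan as before, now parameterised on the unit width."""
    return any(cp in TARGETS for cp in iter_code_points(buffer, unit_size))
```

With a single fixed-width type, the width parameter and the format lookup disappear; here every consumer of the buffer has to carry them along, which is the added complexity described above.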


Stefan

