Florian Weimer, 28.01.2011 10:35:
> * Stefan Behnel:
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can double
>>> as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code
>
> Is this really true?  I don't think I've seen any C API which actually
> uses wchar_t, beyond what is provided by libc.  UTF-8 and even
> UTF-16 are much, much more common.

They are also much harder to use, unless you are really only interested in 7-bit ASCII data - which is the case for most C libraries, so I believe that's what you meant here. However, this is the CPython runtime with built-in Unicode support, not the C runtime where it comes as an add-on at best, and where Unicode processing without being Unicode aware is common.

The nice thing about Py_UNICODE is that it basically gives you native Unicode code points directly, without needing to decode UTF-8 byte runs and the like. In Cython, it allows you to do things like this:

    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False

The loop runs in plain C, using the somewhat obvious implementation with a loop over Py_UNICODE characters and a switch statement for the comparison. This would look a *lot* more ugly with UTF-8 encoded byte strings.
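
For illustration, the loop described above might look roughly like the following in C. This is a hand-written sketch, not Cython's actual generated code: the function name contains_any_of_those is made up, and it assumes the string is already available as an array of wchar_t (standing in for Py_UNICODE) together with its length.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not Cython output: returns 1 if any of the first
   len characters of s is one of the four code points, else 0. */
static int contains_any_of_those(const wchar_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* One array element is one character, so the membership test
           reduces to a switch over constant code point values. */
        switch (s[i]) {
        case 0x0356:
        case 0x1012:
        case 0x3359:
        case 0x4567:
            return 1;
        default:
            break;
        }
    }
    return 0;
}
```

All four values fit into 16 bits, so the sketch behaves the same whether wchar_t is 16 or 32 bits wide.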

Regarding Cython specifically, the above will still be *possible* under the proposal, given that the memory layout of the strings will still represent the Unicode code points. It will just be trickier to implement in Cython's type system as there is no longer a (user visible) C type representation for those code units. It can be any of uchar, ushort16 or uint32, none of which is necessarily a 'native' representation of a Unicode character in CPython. While I'm somewhat confident that I'll find a way to fix this in Cython, my point is just that this adds a certain level of complexity to C code using the new memory layout that simply wasn't there before.
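
To make that extra complexity concrete, here is a rough sketch of what reading a single character could involve once the storage width varies per string. The names unit_width, read_char and contains_code_point are purely hypothetical and not part of the proposal or of any CPython API; the only assumption taken from the discussion is that characters are stored in units of 1, 2 or 4 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: number of bytes used to store each character. */
enum unit_width { WIDTH_1 = 1, WIDTH_2 = 2, WIDTH_4 = 4 };

/* Read the i-th character as a code point, whatever the storage width. */
static uint32_t read_char(const void *data, enum unit_width width, size_t i)
{
    switch (width) {
    case WIDTH_1: return ((const uint8_t *)data)[i];
    case WIDTH_2: return ((const uint16_t *)data)[i];
    default:      return ((const uint32_t *)data)[i];
    }
}

/* Returns 1 if the code point wanted occurs in the first len characters. */
static int contains_code_point(const void *data, enum unit_width width,
                               size_t len, uint32_t wanted)
{
    for (size_t i = 0; i < len; i++) {
        if (read_char(data, width, i) == wanted)
            return 1;
    }
    return 0;
}
```

Every access now goes through a dispatch on the width (or the whole loop has to be written three times), whereas the Py_UNICODE version simply indexes one array type.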

Stefan
