Re: [Cython] Another string encoding idea

Stefan Behnel Fri, 27 Nov 2009 22:52:38 -0800

Hi Robert,

Robert Bradshaw, 27.11.2009 22:34:
> I had an epiphany when I realized that I find this burdensome not  
> because the user needs to specify an encoding, but that they have to  
> manually handle it every time they deal with a char*. So, my proposal  
> is this: let the user specify via a compiler directive an encoding to  
> use for all conversions.


Sounds better than defaulting to the Python system encoding (as Py2 does),
which is unrelated to the encoding used by any C libraries etc. It's also
explicit.

On the downside, while being explicit, it can still lead to all sorts of
unexpected behaviour for users because strings would pop up in non-obvious
types in their code. The conversion from char* to bytes would then have to
be explicit, even though it's certainly not uncommon when dealing with C
code, and totally normal in Py2.


> Cython could then transparently and  
> efficiently handle all char* <-> str (a.k.a. unicode) encodings in  
> Py3, and unicode -> char* in Py2.

As Greg pointed out, going directly from unicode to char* isn't trivial to
implement and the implications are certainly not obvious for most users and
not controllable by user code, so you can't just free memory by setting a
variable to None. I think that rules it out for not being explicit.

Currently, coercion from char*/bytes to unicode is an explicit step that is
easy to do via

    cdef char* s = ...
    u = s[:length].decode('UTF-8')

in 0.12. See

http://trac.cython.org/cython_trac/ticket/436

Your proposal would make that

    # cython: bytes-encoding=UTF-8

    cdef char* s = ...
    cdef unicode u = s[:length]

(well, I /hope/ you'd require the target to be typed, right?) or

    # cython: bytes-encoding=UTF-8

    cdef char* s = ...
    cdef str py_s = s[:length]

so you'd not really gain much in terms of typing and (IMO) lose readability.

Note that many encodings (e.g. the Asian 2-byte encodings) naturally
contain 0 bytes, so automatic conversion of char* can't even work in those
cases, as only the user code would know the correct length of the string.
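To illustrate the zero-byte problem, here is a minimal sketch in plain Python rather than Cython. It uses UTF-16-LE only because it is an encoding that certainly produces 0 bytes for ASCII text. The helper c_strlen_view is hypothetical and merely imitates code that stops at the first 0 byte, as anything relying on NUL termination would.

```python
# Sketch only: imitates what NUL-terminated handling would see.
text = u"abc"
encoded = text.encode("utf-16-le")  # every second byte is 0 here


def c_strlen_view(buf):
    """Hypothetical helper: return the bytes up to the first 0 byte."""
    end = buf.find(b"\x00")
    return buf if end < 0 else buf[:end]


# Stopping at the first 0 byte loses almost all of the data.
truncated = c_strlen_view(encoded)

# With the length known to user code, slicing and decoding round-trips.
length = len(encoded)
full = encoded[:length].decode("utf-16-le")
```

This is why the explicit s[:length].decode(...) form keeps working where an automatic char* conversion could not.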

I'm +0.3 on the opposite way in Py2 for the 'str' type, though, as I
already mentioned. I think that would a) fit the intention of users, b)
match the main use case of accepting both str and unicode as function
arguments in Py2 (and only in Py2!), and c) be free of memory handling
issues as the target would still be a Python object.

So I think it makes sense to support this only in Py2, and only for Python
objects, not for char*.

BTW, you keep talking about supporting all sorts of encodings here, whereas
the use cases you present seem to deal only with plain ASCII non-textual
data. Maybe it would be enough to make ASCII the default encoding for
unicode->str coercion of function arguments in Py2 then? Or (as my original
proposal went) to use the platform encoding for this, as CPython does, and
which is normally ASCII in Py2 anyway.
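What such a coercion would amount to can be sketched in plain Python. The helper as_bytes and its encoding parameter are hypothetical names for illustration, not anything Cython provides. It assumes byte strings pass through untouched and only text gets encoded, with ASCII as the default.

```python
def as_bytes(arg, encoding="ascii"):
    """Hypothetical helper: accept a byte string or text, return bytes."""
    if isinstance(arg, bytes):
        # Already a byte string (plain str in Py2): pass through unchanged.
        return arg
    # Text input: encode it, failing loudly on anything the encoding
    # cannot represent instead of guessing.
    return arg.encode(encoding)
```

With ASCII as the default, non-ASCII text raises UnicodeEncodeError, so a wrong assumption about the data shows up at the function boundary.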

Stefan
