On Dec 1, 2009, at 12:56 AM, Stefan Behnel wrote:

> Robert Bradshaw, 01.12.2009 04:23:
>> On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>>> I think the key from a user's perspective is that one is either
>>> working with "text": human readable stuff, or data. If text, then
>>> the natural python3 data type is a unicode string. If data, then
>>> bytes -- we should really follow that as best we can.
>>
>> unicode = char* + length + encoding
>> bytes = char* + length
>>
>> So what is the Python equivalent of char*? Neither, and what you want
>> depends on the application and context.
>
> Ok, so we agree that there are various different use cases that require
> different setups.
>
> As I indicated before, CPython's argument unpacking functions support
> various ways of dealing with unicode/bytes conversion to char* through
> their "s#", "u#" and "es#" formats. These are actually helpful, but not
> currently supported by Cython.
>
> Maybe a buffer emulation might help here, where Cython would set up a
> Py_buffer struct for a function argument and fill in the values from the
> Python string that was passed. That might be a way to handle all use
> cases in a uniform way, and we could easily extend this to an additional
> buffer option 'encoding', which would override the platform specific
> default encoding used to handle char* buffers.
>
> There's also still Dag's trac ticket about ctypedef support for buffer
> parameters:
>
> http://trac.cython.org/cython_trac/ticket/194
>
> This would allow users to define their own encoded char*+length type.
> The usage would be something like
>
>     ctypedef str[encoding='ASCII'] ascii_string
>
>     def func(ascii_string s):
>         print s[:s.len].decode('ASCII')
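[Editor's note: the semantics of the quoted `ctypedef str[encoding=...]` proposal can be sketched in pure Python. This is only an illustration of the proposed argument coercion, not existing Cython behavior; the helper name `as_ascii_string` is invented for the sketch.]

```python
# Hypothetical sketch of what a str[encoding='ASCII'] argument type
# might do at the call boundary: unicode text is encoded to the
# declared encoding, byte strings are validated against it, and
# anything else is rejected with a TypeError.

def as_ascii_string(arg):
    """Coerce a function argument to an ASCII-encoded byte string."""
    if isinstance(arg, str):           # unicode text: encode it
        return arg.encode('ascii')    # UnicodeEncodeError if non-ASCII
    if isinstance(arg, bytes):         # raw bytes: validate the encoding
        arg.decode('ascii')            # UnicodeDecodeError if invalid
        return arg
    raise TypeError("expected str or bytes, got %s" % type(arg).__name__)

def func(s):
    s = as_ascii_string(s)             # what the generated code would do
    print(s.decode('ascii'))
```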
I expect magic on the C <-> Python boundary, because conversion is
necessary. Implicit coercion from one Python type to another is a bit
less obvious. (If anything, I would want to introduce a new type rather
than overload str to have this meaning...)

> and would accept and encode Unicode arguments as well as arguments
> that support the 1D buffer protocol.

So it would try to decode a numpy char* array? I think I'd rather get a
type error than something implicit here.

> Given that there's the "es" and "et" formattings in CPython (not sure
> if they continue to work for bytes in Py3, BTW, as it seems that their
> documentation wasn't overhauled), we could also distinguish how bytes
> arguments are handled: should they be checked for having the correct
> encoding, or should they be passed through? Both use cases are
> legitimate and could be distinguished by another buffer option.
>
>> That is another idea. A new type would handle conversion to char*,
>> but not from char*. Bytes objects would still be returned by default
>> unless one did something extra there (which is fine for some uses,
>> but for other str is more natural).
>
> We could have a "cython.str()" function that converts char*+length or
> a char* buffer to bytes or unicode depending on the platform and using
> either the platform encoding or a different one passed as argument. So
> you'd return "cython.str(c_string, length)" (or "cython.str(s)" for
> the example above) and be happy.

That's a good idea, and should probably go in regardless of whatever
else happens.

> For function return types, we could also accept the Py3 syntax:
>
>     def func(str[encoding='ASCII'] s) -> cython.str:
>         ...
>
> that would handle the conversion on the fly, as would an equivalent
> declaration for cdef/cpdef functions.

The motivation for a directive was to avoid having to be explicit every
time a char* is used (or at least once for every function), which none
of the above ideas address.
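[Editor's note: the proposed `cython.str()` helper can be sketched in plain Python, modeling a C `char*` as a bytes object. This is an illustration of the proposal under discussion, not an existing Cython API; `cython_str` and its signature are assumptions.]

```python
import sys

# Hypothetical sketch of the proposed cython.str() helper: convert a
# char* plus optional length (modeled here as bytes) to the native
# text type, using the default encoding unless an explicit one is
# passed.  On Python 3 the result is always a unicode str.

def cython_str(c_string, length=None, encoding=None):
    data = c_string if length is None else c_string[:length]
    enc = encoding or sys.getdefaultencoding()
    return data.decode(enc)
```

With this, a function wrapping a C API could end with `return cython_str(c_string, length)` instead of hand-writing the slice-and-decode every time.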
- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
