Re: [Cython] Another string encoding idea

Stefan Behnel Tue, 01 Dec 2009 00:57:36 -0800

Robert Bradshaw, 01.12.2009 04:23:
> On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>> I think the key from a user's
>> perspective is that one is either working with "text": human readable
>> stuff, or data. If text, then the natural python3 data type is a  
>> unicode string. If data, then bytes -- we should really follow that as
>> best we can.
> 
> unicode = char* + length + encoding
> bytes = char* + length
> 
> So what is the Python equivalent of char*? Neither, and what you want  
> depends on the application and context.


Ok, so we agree that there are various different use cases that require
different setups.

As I indicated before, CPython's argument unpacking functions
(PyArg_ParseTuple() and friends) support various ways of dealing with
unicode/bytes conversion to char* through their "s#", "u#" and "es#"
formats. These are genuinely useful, but Cython does not currently support
them.

Maybe a buffer emulation might help here, where Cython would set up a
Py_buffer struct for a function argument and fill in the values from the
Python string that was passed. That might be a way to handle all use cases
in a uniform way, and we could easily extend this to an additional buffer
option 'encoding', which would override the platform specific default
encoding used to handle char* buffers.

There's also still Dag's trac ticket about ctypedef support for buffer
parameters:

http://trac.cython.org/cython_trac/ticket/194

This would allow users to define their own encoded char*+length type.

The usage would be something like this:

    ctypedef str[encoding='ASCII'] ascii_string

    def func(ascii_string s):
        print s[:s.len].decode('ASCII')

and would accept and encode Unicode arguments as well as arguments that
support the 1D buffer protocol.
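
To make the intended coercion a bit more concrete, here is a rough sketch
in plain Python (not Cython, and not an existing API). The helper name
as_char_buffer() and its 'encoding' parameter are made up for illustration;
memoryview stands in for the Py_buffer struct that Cython would fill in.

```python
def as_char_buffer(arg, encoding="ascii"):
    """Return a 1-D byte view of 'arg' (hypothetical helper)."""
    if isinstance(arg, str):
        # Unicode text: encode it first. UnicodeEncodeError is raised if
        # the text cannot be represented in the requested encoding.
        return memoryview(arg.encode(encoding))
    # Anything else must support the buffer protocol; memoryview() raises
    # TypeError otherwise.
    view = memoryview(arg)
    if view.ndim != 1 or view.itemsize != 1:
        raise TypeError("expected a 1-D buffer of single bytes")
    return view


def func(s):
    s = as_char_buffer(s, "ascii")
    # len(s) plays the role of s.len in the proposed syntax above.
    return bytes(s[:len(s)]).decode("ascii")
```

With this, func("abc"), func(b"abc") and func(bytearray(b"abc")) all take
the same code path once the argument has been turned into a byte view.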

CPython also has the "es" and "et" format units (I'm not sure they still
work for bytes in Py3, BTW, as their documentation doesn't seem to have
been overhauled). Following those, we could also distinguish how bytes
arguments are handled: should they be checked for being valid in the
declared encoding, or should they be passed through unchanged? Both use
cases are legitimate and could be selected by another buffer option.
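
The two behaviours could look roughly like this in plain Python. The
function name accept_bytes() and the 'validate' flag are hypothetical; they
only illustrate what such a buffer option might switch between.

```python
def accept_bytes(data, encoding="ascii", validate=True):
    """Check or pass through a bytes argument (hypothetical option)."""
    if validate:
        # A trial decode is the simplest check; it raises
        # UnicodeDecodeError if the bytes are not valid in 'encoding'.
        data.decode(encoding)
    # In both modes the bytes themselves are handed on unchanged.
    return data
```

So accept_bytes(b"\xff", "ascii") would be rejected, while the same call
with validate=False would pass the data through untouched.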


> That is another idea. A new type would handle conversion to char*, but  
> not from char*. Bytes objects would still be returned by default  
> unless one did something extra there (which is fine for some uses, but  
> for other str is more natural).

We could have a "cython.str()" function that converts char*+length or a
char* buffer to bytes or unicode depending on the platform and using either
the platform encoding or a different one passed as argument. So you'd
return "cython.str(c_string, length)" (or "cython.str(s)" for the example
above) and be happy.
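
A plain Python approximation of what such a function might do is sketched
below. The name native_str() is a stand-in, not an existing Cython
function. I'm assuming here that "depending on the platform" means bytes on
Python 2 and unicode on Python 3, and that sys.getdefaultencoding() is an
acceptable substitute for the platform encoding.

```python
import sys


def native_str(c_string, length=None, encoding=None):
    """Convert a bytes-like value to the native str type (sketch)."""
    # Honour an explicit length, as one would for char* + length.
    data = bytes(c_string) if length is None else bytes(c_string[:length])
    if sys.version_info[0] < 3:
        # Python 2: the native str type is a byte string already.
        return data
    # Python 3: the native str type is unicode, so decode.
    return data.decode(encoding or sys.getdefaultencoding())
```

A function would then simply end with "return native_str(c_string, length)".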

For function return types, we could also accept the Py3 syntax:

    def func(str[encoding='ASCII'] s) -> cython.str:
        ...

which would handle the conversion on the fly, as would an equivalent
declaration for cdef/cpdef functions.
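
In plain Python, the effect of such a return declaration can be imitated
with a decorator. This is only a sketch of the behaviour, not of how Cython
would implement it; returns_str() and greeting() are hypothetical names.

```python
import functools


def returns_str(encoding="ascii"):
    """Decode bytes-like return values into str (hypothetical decorator)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if isinstance(result, (bytes, bytearray, memoryview)):
                # Convert on the way out, as the declaration would.
                return bytes(result).decode(encoding)
            return result
        return wrapper
    return decorate


@returns_str("ascii")
def greeting(name):
    # The body works with bytes, much like C code working with char*.
    return b"hello " + name.encode("ascii")
```

The caller of greeting("bob") gets a str back, although the function body
only ever deals with bytes.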

Stefan


