Robert Bradshaw, 01.12.2009 04:23:
> On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>> I think the key from a user's perspective is that one is either
>> working with "text": human readable stuff, or data. If text, then the
>> natural python3 data type is a unicode string. If data, then bytes --
>> we should really follow that as best we can.
>
>   unicode = char* + length + encoding
>   bytes = char* + length
>
> So what is the Python equivalent of char*? Neither, and what you want
> depends on the application and context.

Ok, so we agree that there are various different use cases that require
different setups.

As I indicated before, CPython's argument unpacking functions support
various ways of dealing with unicode/bytes conversion to char* through
their "s#", "u#" and "es#" formats. These are actually helpful, but not
currently supported by Cython.

Maybe a buffer emulation might help here, where Cython would set up a
Py_buffer struct for a function argument and fill in the values from the
Python string that was passed. That might be a way to handle all use cases
in a uniform way, and we could easily extend this to an additional buffer
option 'encoding', which would override the platform specific default
encoding used to handle char* buffers.

There's also still Dag's trac ticket about ctypedef support for buffer
parameters:

http://trac.cython.org/cython_trac/ticket/194

This would allow users to define their own encoded char*+length type. The
usage would be something like

    ctypedef str[encoding='ASCII'] ascii_string

    def func(ascii_string s):
        print s[:s.len].decode('ASCII')

and would accept and encode Unicode arguments as well as arguments that
support the 1D buffer protocol.

Given that there's the "es" and "et" formattings in CPython (not sure if
they continue to work for bytes in Py3, BTW, as it seems that their
documentation wasn't overhauled), we could also distinguish how bytes
arguments are handled: should they be checked for having the correct
encoding, or should they be passed through? Both use cases are legitimate
and could be distinguished by another buffer option.

> That is another idea. A new type would handle conversion to char*, but
> not from char*. Bytes objects would still be returned by default
> unless one did something extra there (which is fine for some uses, but
> for other str is more natural).

We could have a "cython.str()" function that converts char*+length or a
char* buffer to bytes or unicode depending on the platform and using either
the platform encoding or a different one passed as argument. So you'd
return "cython.str(c_string, length)" (or "cython.str(s)" for the example
above) and be happy.

For function return types, we could also accept the Py3 syntax:

    def func(str[encoding='ASCII'] s) -> cython.str:
        ...

that would handle the conversion on the fly, as would an equivalent
declaration for cdef/cpdef functions.

Stefan
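
For illustration only, the two conversions discussed in this message can be sketched in plain Python. This is not Cython code and not an existing API: the names as_encoded_buffer and c_to_str, the check_bytes flag, and the use of sys.getdefaultencoding() as the "platform encoding" are all assumptions made for the sketch. A bytes object stands in for the char* buffer.

```python
import sys


def as_encoded_buffer(arg, encoding="ascii", check_bytes=False):
    """Hypothetical helper: emulate a str[encoding=...] argument.

    Returns (data, length), standing in for a char* plus its length.
    """
    if isinstance(arg, str):
        # Text input: encode it with the declared encoding.
        data = arg.encode(encoding)
    else:
        # Anything else must support the buffer protocol (bytes, bytearray, ...).
        data = memoryview(arg).tobytes()
        if check_bytes:
            # Optional validation: raises UnicodeDecodeError if the bytes
            # are not valid in `encoding`; otherwise they pass through.
            data.decode(encoding)
    return data, len(data)


def c_to_str(data, length=None, encoding=None):
    """Hypothetical stand-in for the proposed cython.str(c_string, length)."""
    if length is not None:
        # Honour an explicit length, as with char* + length.
        data = data[:length]
    if encoding is None:
        # Assumption: Python's default encoding plays the role of the
        # platform encoding mentioned above.
        encoding = sys.getdefaultencoding()
    return data.decode(encoding)
```

With these definitions, c_to_str(*as_encoded_buffer("abc")) round-trips the text, while as_encoded_buffer(b"\xff", check_bytes=True) rejects the input and the default check_bytes=False passes the same bytes through unchanged, which mirrors the two bytes-handling modes described above.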
