On Dec 1, 2009, at 12:56 AM, Stefan Behnel wrote:

> Robert Bradshaw, 01.12.2009 04:23:
>> On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>>> I think the key from a user's perspective is that one is either
>>> working with "text": human readable stuff, or data. If text, then
>>> the natural python3 data type is a unicode string. If data, then
>>> bytes -- we should really follow that as best we can.
>>
>> unicode = char* + length + encoding
>> bytes = char* + length
>>
>> So what is the Python equivalent of char*? Neither, and what you want
>> depends on the application and context.
>
> Ok, so we agree that there are various different use cases that require
> different setups.
>
> As I indicated before, CPython's argument unpacking functions support
> various ways of dealing with unicode/bytes conversion to char* through
> their "s#", "u#" and "es#" formats. These are actually helpful, but not
> currently supported by Cython.
>
> Maybe a buffer emulation might help here, where Cython would set up a
> Py_buffer struct for a function argument and fill in the values from the
> Python string that was passed. That might be a way to handle all use
> cases in a uniform way, and we could easily extend this to an additional
> buffer option 'encoding', which would override the platform specific
> default encoding used to handle char* buffers.
>
> There's also still Dag's trac ticket about ctypedef support for buffer
> parameters:
>
> http://trac.cython.org/cython_trac/ticket/194
>
> This would allow users to define their own encoded char*+length type.
> The usage would be something like
>
>     ctypedef str[encoding='ASCII'] ascii_string
>
>     def func(ascii_string s):
>         print s[:s.len].decode('ASCII')
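[Editor's note: the semantics of the quoted `ctypedef str[encoding=...]` proposal can be sketched in pure Python. This is only an illustration of the proposed argument coercion, not existing Cython behavior; the helper name `as_ascii_string` is invented for the sketch.]

```python
# Hypothetical sketch of what a str[encoding='ASCII'] argument type
# might do at the call boundary: unicode text is encoded to the
# declared encoding, byte strings are validated against it, and
# anything else is rejected with a TypeError.

def as_ascii_string(arg):
    """Coerce a function argument to an ASCII-encoded byte string."""
    if isinstance(arg, str):           # unicode text: encode it
        return arg.encode('ascii')    # UnicodeEncodeError if non-ASCII
    if isinstance(arg, bytes):         # raw bytes: validate the encoding
        arg.decode('ascii')            # UnicodeDecodeError if invalid
        return arg
    raise TypeError("expected str or bytes, got %s" % type(arg).__name__)

def func(s):
    s = as_ascii_string(s)             # what the generated code would do
    print(s.decode('ascii'))
```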
I expect magic on the C <-> Python boundary, because conversion is
necessary. Implicit coercion from one Python type to another is a bit
less obvious. (If anything, I would want to introduce a new type rather
than overload str to have this meaning...)

> and would accept and encode Unicode arguments as well as arguments
> that support the 1D buffer protocol.

So it would try to decode a numpy char* array? I think I'd rather get a
type error than something implicit here.

> Given that there's the "es" and "et" formattings in CPython (not sure
> if they continue to work for bytes in Py3, BTW, as it seems that their
> documentation wasn't overhauled), we could also distinguish how bytes
> arguments are handled: should they be checked for having the correct
> encoding, or should they be passed through? Both use cases are
> legitimate and could be distinguished by another buffer option.
>
>> That is another idea. A new type would handle conversion to char*,
>> but not from char*. Bytes objects would still be returned by default
>> unless one did something extra there (which is fine for some uses,
>> but for other str is more natural).
>
> We could have a "cython.str()" function that converts char*+length or
> a char* buffer to bytes or unicode depending on the platform and using
> either the platform encoding or a different one passed as argument. So
> you'd return "cython.str(c_string, length)" (or "cython.str(s)" for
> the example above) and be happy.

That's a good idea, and should probably go in regardless of whatever
else happens.

> For function return types, we could also accept the Py3 syntax:
>
>     def func(str[encoding='ASCII'] s) -> cython.str:
>         ...
>
> that would handle the conversion on the fly, as would an equivalent
> declaration for cdef/cpdef functions.

The motivation for a directive was to avoid having to be explicit every
time a char* is used (or at least once for every function), which none
of the above ideas address.
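[Editor's note: the proposed `cython.str()` helper can be sketched in plain Python, modeling a C `char*` as a bytes object. This is an illustration of the proposal under discussion, not an existing Cython API; `cython_str` and its signature are assumptions.]

```python
import sys

# Hypothetical sketch of the proposed cython.str() helper: convert a
# char* plus optional length (modeled here as bytes) to the native
# text type, using the default encoding unless an explicit one is
# passed.  On Python 3 the result is always a unicode str.

def cython_str(c_string, length=None, encoding=None):
    data = c_string if length is None else c_string[:length]
    enc = encoding or sys.getdefaultencoding()
    return data.decode(enc)
```

With this, a function wrapping a C API could end with `return cython_str(c_string, length)` instead of hand-writing the slice-and-decode every time.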
- Robert

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
