On Nov 27, 2009, at 2:33 PM, Lisandro Dalcin wrote:

> On Fri, Nov 27, 2009 at 7:23 PM, Dag Sverre Seljebotn
> <[email protected]> wrote:
>> Robert Bradshaw wrote:
>>> Though I usually try to avoid the topic, I've been thinking a lot
>>> about string handling in Cython lately. I think we've taken a great
>>> step forward in terms of usability with CEP 108, especially for those
>>> who never deal with external libraries, but all this explicit encoding
>>> and decoding still seems too heavy (though I understand why it's
>>> necessary to deal with anything but pure ASCII). For an application
>>> like lxml that is all about string processing, the verbosity and
>>> explicitness isn't burdensome and the issue naturally comes up, but
>>> this is not true of many applications. (For example the last time I
>>> had to use strings, my character set was limited to [0-9Ee+-.].) On
>>> the other hand, it's clear letting users just ignore the encoding
>>> issue is unacceptable and undesirable.
>>>
>>> I had an epiphany when I realized that I find this burdensome not
>>> because the user needs to specify an encoding, but that they have to
>>> manually handle it every time they deal with a char*. So, my proposal
>>> is this: let the user specify via a compiler directive an encoding to
>>> use for all conversions. Cython could then transparently and
>>> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
>>> Py3, and unicode -> char* in Py2. If no encoding is specified char*
>>> would still turn into bytes in Py3, and the conversions mentioned
>>> above would be disallowed.
>>>
>>> This might be a good compromise between explicitness, safety, and ease
>>> of use. Thoughts?
>>
>> I'm somewhat sceptical/undecided about char* being coerced to unicode
>> this way, i.e. char*->unicode. I don't have a problem with the idea for
>> unicode->char* (as long as bytes->char* is still OK as well).
>>
>
> I have the same feeling.
> However, I would accept to have two directives: one for
> unicode->char*, and another for char*->unicode.

That might be a good idea.

> And of course, we will need a mechanism to override the default
> encoding by using explicit encode()/decode() method call. For example,
> if you have to deal with both text and filenames in a char*, you may
> need to special-handle filenames (hello, ext* filesystems).

For sure. I'm imagining the mechanisms one uses now would still work,
as would stuff like

    cdef char* s = ...
    print <bytes>s

- Robert
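
To make the semantics discussed in this thread a bit more concrete, here is a rough pure-Python model of the two-directive idea. It is only a sketch: `StringCoercion`, `to_c`, `from_c`, `as_char_ptr` and `from_char_ptr` are made-up names for illustration, not Cython directives or API, and Python `bytes` stands in for `char*`. It assumes that bytes->char* always works, that unicode->char* is refused unless an encoding was chosen, and that char* stays bytes unless a decoding direction was chosen.

```python
from typing import Optional, Union


class StringCoercion:
    """Pure-Python model of the two proposed directives (hypothetical names)."""

    def __init__(self, to_c: Optional[str] = None,
                 from_c: Optional[str] = None) -> None:
        self.to_c = to_c      # encoding used for unicode -> char*
        self.from_c = from_c  # encoding used for char* -> unicode

    def as_char_ptr(self, obj: Union[bytes, str]) -> bytes:
        """Model passing a Python object where a char* is expected."""
        if isinstance(obj, bytes):
            return obj  # bytes -> char* needs no encoding
        if isinstance(obj, str):
            if self.to_c is None:
                # No directive given: the implicit conversion is disallowed.
                raise TypeError("unicode -> char* requires an encoding directive")
            return obj.encode(self.to_c)
        raise TypeError("expected bytes or str, got %s" % type(obj).__name__)

    def from_char_ptr(self, data: bytes) -> Union[bytes, str]:
        """Model a char* value coming back into Python."""
        if self.from_c is None:
            return data  # no directive: char* still turns into bytes
        return data.decode(self.from_c)


if __name__ == "__main__":
    conv = StringCoercion(to_c="utf-8", from_c="utf-8")
    print(conv.as_char_ptr("caf\u00e9"))   # implicit encode with the default
    print(conv.from_char_ptr(b"1.5e+3"))   # implicit decode with the default

    # Explicit override for data that does not follow the default encoding,
    # e.g. a filename stored as Latin-1 bytes: decode it by hand instead.
    raw_name = b"r\xe9sum\xe9.txt"
    print(raw_name.decode("latin-1"))
```

Keeping the two directions as separate settings means one can opt in to unicode->char* without also opting in to char*->unicode, which is the direction people in the thread are less sure about.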
