Robert Bradshaw wrote:
> Though I usually try to avoid the topic, I've been thinking a lot
> about string handling in Cython lately. I think we've taken a great
> step forward in terms of usability with CEP 108, especially for those
> who never deal with external libraries, but all this explicit encoding
> and decoding still seems too heavy (though I understand why it's
> necessary to deal with anything but pure ASCII). For an application
> like lxml that is all about string processing, the verbosity and
> explicitness isn't burdensome and the issue naturally comes up, but
> this is not true of many applications. (For example the last time I
> had to use strings, my character set was limited to [0-9Ee+-.].) On
> the other hand, it's clear letting users just ignore the encoding
> issue is unacceptable and undesirable.
>
> I had an epiphany when I realized that I find this burdensome not
> because the user needs to specify an encoding, but that they have to
> manually handle it every time they deal with a char*. So, my proposal
> is this: let the user specify via a compiler directive an encoding to
> use for all conversions. Cython could then transparently and
> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
> Py3, and unicode -> char* in Py2. If no encoding is specified char*
> would still turn into bytes in Py3, and the conversions mentioned
> above would be disallowed.
>
> This might be a good compromise between explicitness, safety, and ease
> of use. Thoughts?
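For concreteness, here is a minimal sketch of the two styles under
discussion. The directive spelling below is purely illustrative; the
proposal does not fix a name, and no such option exists today:

    # Today (CEP 108): every char* -> unicode conversion is spelled out.
    cdef extern from "string.h":
        size_t strlen(char *s)

    def get_label(char *c_name):
        # slicing a char* yields a bytes object; decoding is manual
        return c_name[:strlen(c_name)].decode('utf-8')

    # Under the proposal, a module-wide directive would supply the
    # encoding once, and Cython would insert the decode itself:
    #
    #   # cython: c_string_encoding=utf-8    (hypothetical spelling)
    #
    #   def get_label(char *c_name):
    #       return c_name[:strlen(c_name)]   # decoded to unicode implicitly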
I'm somewhat sceptical/undecided about char* being coerced to unicode
this way, i.e. char* -> unicode. I don't have a problem with the idea
for unicode -> char* (as long as bytes -> char* is still OK as well).

--
Dag Sverre
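The direction the reply is comfortable with might look like this under
the same hypothetical directive. A temporary bytes object is bound to a
local so the char* never outlives the buffer it points into:

    # cython: c_string_encoding=utf-8    (hypothetical spelling)
    cdef extern from "string.h":
        size_t strlen(char *s)

    def c_length(s):
        cdef bytes tmp = s      # unicode -> bytes via the directive's encoding
        cdef char *c_s = tmp    # bytes -> char* stays legal, exactly as today
        return strlen(c_s)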
