Though I usually try to avoid the topic, I've been thinking a lot about string handling in Cython lately. I think we've taken a great step forward in terms of usability with CEP 108, especially for those who never deal with external libraries, but all this explicit encoding and decoding still seems too heavy (though I understand why it's necessary for anything but pure ASCII). For an application like lxml that is all about string processing, the verbosity and explicitness aren't burdensome and the issue comes up naturally, but this is not true of many applications. (For example, the last time I had to use strings, my character set was limited to [0-9Ee+-.].) On the other hand, it's clear that letting users simply ignore the encoding issue is neither acceptable nor desirable.
I had an epiphany when I realized that what I find burdensome is not that the user needs to specify an encoding, but that they have to handle it manually every time they deal with a char*. So, my proposal is this: let the user specify, via a compiler directive, an encoding to use for all conversions. Cython could then transparently and efficiently handle all char* <-> str (a.k.a. unicode) conversions in Py3, and unicode -> char* conversions in Py2. If no encoding is specified, char* would still turn into bytes in Py3, and the conversions mentioned above would be disallowed. This might be a good compromise between explicitness, safety, and ease of use; a rough sketch is below. Thoughts?

- Robert
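
For concreteness, here is the kind of code the directive is meant to simplify. The directive name (c_string_encoding) and its spelling are purely illustrative placeholders, not a settled syntax; the point is only which encode/decode steps would become implicit:

    # Today every crossing of the char* / unicode boundary is explicit.
    # Under the proposal, something like
    #     # cython: c_string_encoding=utf-8   (directive name illustrative only)
    # would let Cython insert the encode/decode steps below by itself.

    cdef extern from "string.h":
        size_t strlen(const char *s)

    def c_length(text):
        # Explicit today: encode the unicode object with a chosen codec
        # before handing it to C; implicit under the proposal.
        py_bytes = text.encode('utf-8')
        cdef char *c_str = py_bytes   # borrowed pointer, valid while py_bytes lives
        return strlen(c_str)

    def roundtrip(const char *c_str):
        # Explicit today: decode back to str/unicode on the way out;
        # also implicit under the proposal.
        return c_str[:strlen(c_str)].decode('utf-8')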
