Though I usually try to avoid the topic, I've been thinking a lot  
about string handling in Cython lately. I think we've taken a great  
step forward in terms of usability with CEP 108, especially for those  
who never deal with external libraries, but all this explicit encoding  
and decoding still seems too heavy (though I understand why it's  
necessary to deal with anything but pure ASCII). For an application  
like lxml that is all about string processing, the verbosity and
explicitness aren't burdensome and the issue naturally comes up, but
this is not true of many applications. (For example the last time I  
had to use strings, my character set was limited to [0-9Ee+-.].) On  
the other hand, it's clear that letting users just ignore the encoding
issue is both unacceptable and undesirable.

I had an epiphany when I realized that I find this burdensome not  
because the user needs to specify an encoding, but because they have to
handle it manually every time they deal with a char*. So, my proposal
is this: let the user specify via a compiler directive an encoding to  
use for all conversions. Cython could then transparently and  
efficiently handle all char* <-> str (a.k.a. unicode) encodings in  
Py3, and unicode -> char* in Py2. If no encoding is specified, char*
would still turn into bytes in Py3, and the conversions mentioned  
above would be disallowed.
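
To make that concrete, here's a rough sketch of what user code could look
like. The directive name and the external C functions below are made up
purely for illustration; the explicit .encode()/.decode() calls are what's
needed today, and the commented-out lines show what the directive would let
Cython do automatically:

    # cython: c_string_encoding=utf-8    # hypothetical directive name

    cdef extern from "mylib.h":
        char* get_name()             # illustrative external C API
        void set_name(char* name)

    def rename(unicode new_name):
        # Today: explicit conversions at every char* boundary.
        old = get_name().decode('utf-8')        # char* -> unicode
        name_bytes = new_name.encode('utf-8')   # unicode -> bytes
        set_name(name_bytes)                    # bytes object must outlive the call
        # With the proposed directive, Cython would insert these
        # conversions for you:
        #     old = get_name()        # char* -> str, decoded as utf-8
        #     set_name(new_name)      # str -> char*, encoded as utf-8
        return old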

This might be a good compromise between explicitness, safety, and ease  
of use. Thoughts?

- Robert
