On Fri, Nov 27, 2009 at 7:23 PM, Dag Sverre Seljebotn <[email protected]> wrote:
> Robert Bradshaw wrote:
>> Though I usually try to avoid the topic, I've been thinking a lot
>> about string handling in Cython lately. I think we've taken a great
>> step forward in terms of usability with CEP 108, especially for those
>> who never deal with external libraries, but all this explicit encoding
>> and decoding still seems too heavy (though I understand why it's
>> necessary to deal with anything but pure ASCII). For an application
>> like lxml that is all about string processing, the verbosity and
>> explicitness isn't burdensome and the issue naturally comes up, but
>> this is not true of many applications. (For example the last time I
>> had to use strings, my character set was limited to [0-9Ee+-.].) On
>> the other hand, it's clear letting users just ignore the encoding
>> issue is unacceptable and undesirable.
>>
>> I had an epiphany when I realized that I find this burdensome not
>> because the user needs to specify an encoding, but that they have to
>> manually handle it every time they deal with a char*. So, my proposal
>> is this: let the user specify via a compiler directive an encoding to
>> use for all conversions. Cython could then transparently and
>> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
>> Py3, and unicode -> char* in Py2. If no encoding is specified char*
>> would still turn into bytes in Py3, and the conversions mentioned
>> above would be disallowed.
>>
>> This might be a good compromise between explicitness, safety, and ease
>> of use. Thoughts?
>
> I'm somewhat sceptical/undecided about char* being coerced to unicode
> this way, i.e. char*->unicode. I don't have a problem with the idea for
> unicode->char* (as long as bytes->char* is still OK as well).
>

I have the same feeling. However, I would be fine with having two
directives: one for unicode->char*, and another for char*->unicode. And
of course, we will need a mechanism to override the default encoding
with an explicit encode()/decode() method call. For example, if you have
to deal with both text and filenames in a char*, you may need to
special-case the filenames (hello, ext* filesystems).

-- 
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
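
To illustrate the idea in the reply above (separate defaults for each direction, plus an explicit override for special cases such as filenames), here is a minimal plain-Python sketch. It is not Cython syntax and not an existing Cython API: the two module-level constants stand in for the proposed directives, and `from_c_string` / `to_c_string` are hypothetical helper names used only for this illustration.

```python
# Hypothetical stand-ins for the two proposed directives (assumed names,
# not real Cython directives).
DEFAULT_C_TO_UNICODE = "utf-8"   # would govern char* -> unicode
DEFAULT_UNICODE_TO_C = "utf-8"   # would govern unicode -> char*


def from_c_string(raw, encoding=None):
    """Decode bytes that came from a char*.

    Uses the default encoding unless the caller passes an explicit one,
    which mimics overriding the directive with an explicit decode().
    """
    return raw.decode(encoding or DEFAULT_C_TO_UNICODE)


def to_c_string(text, encoding=None):
    """Encode text that is about to be passed as a char*."""
    return text.encode(encoding or DEFAULT_UNICODE_TO_C)


# Ordinary text: the default is enough in both directions.
number_text = from_c_string(b"1.5e+3")
street = to_c_string("G\u00fcemes")

# A filename is just bytes on an ext* filesystem and need not be valid
# UTF-8, so the default may fail and an explicit encoding is required.
raw_name = b"caf\xe9.txt"
try:
    from_c_string(raw_name)
    default_worked = True
except UnicodeDecodeError:
    default_worked = False

# Explicit override, assuming here that the name was written as Latin-1.
file_name = from_c_string(raw_name, "latin-1")
```

In this sketch the default handles the common case silently, while the filename needs the caller to choose an encoding; which encoding is right for a given filename is an assumption the caller has to supply.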
