Re: [Cython] Another string encoding idea

Robert Bradshaw Fri, 27 Nov 2009 14:58:36 -0800

On Nov 27, 2009, at 2:33 PM, Lisandro Dalcin wrote:

> On Fri, Nov 27, 2009 at 7:23 PM, Dag Sverre Seljebotn
> <[email protected]> wrote:
>> Robert Bradshaw wrote:
>>> Though I usually try to avoid the topic, I've been thinking a lot
>>> about string handling in Cython lately. I think we've taken a great
>>> step forward in terms of usability with CEP 108, especially for those
>>> who never deal with external libraries, but all this explicit encoding
>>> and decoding still seems too heavy (though I understand why it's
>>> necessary to deal with anything but pure ASCII). For an application
>>> like lxml that is all about string processing, the verbosity and
>>> explicitness isn't burdensome and the issue naturally comes up, but
>>> this is not true of many applications. (For example the last time I
>>> had to use strings, my character set was limited to [0-9Ee+-.].) On
>>> the other hand, it's clear letting users just ignore the encoding
>>> issue is unacceptable and undesirable.
>>>
>>> I had an epiphany when I realized that I find this burdensome not
>>> because the user needs to specify an encoding, but because they have
>>> to manually handle it every time they deal with a char*. So, my proposal
>>> is this: let the user specify via a compiler directive an encoding to
>>> use for all conversions. Cython could then transparently and
>>> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
>>> Py3, and unicode -> char* in Py2. If no encoding is specified char*
>>> would still turn into bytes in Py3, and the conversions mentioned
>>> above would be disallowed.
>>>
>>> This might be a good compromise between explicitness, safety, and ease
>>> of use. Thoughts?
>>
>> I'm somewhat sceptical/undecided about char* being coerced to unicode
>> this way, i.e. char*->unicode. I don't have a problem with the idea for
>> unicode->char* (as long as bytes->char* is still OK as well).
>>
>
> I have the same feeling. However, I would accept having two
> directives: one for unicode->char*, and another for char*->unicode.


That might be a good idea.

> And of course, we will need a mechanism to override the default
> encoding via explicit encode()/decode() method calls. For example,
> if you have to deal with both text and filenames in a char*, you may
> need to special-handle filenames (hello, ext* filesystems).

For sure. I'm imagining the mechanisms one uses now would still work,  
as would stuff like

cdef char* s = ...
print <bytes>s
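
To make the default-plus-override idea concrete, here is a rough sketch in
plain Python rather than Cython. Everything named in it is hypothetical:
DEFAULT_ENCODING stands in for whatever the proposed compiler directive would
set, and to_text()/to_bytes() stand in for the conversions Cython would insert
automatically. No such directive exists at this point; the bytes objects play
the role of char* data.

```python
# Hypothetical stand-in for the proposed compiler directive (not a real option).
DEFAULT_ENCODING = "utf-8"


def to_text(raw, encoding=None):
    """Decode bytes (the Py3 view of a char*) using the default or an override."""
    return raw.decode(encoding or DEFAULT_ENCODING)


def to_bytes(text, encoding=None):
    """Encode text back to bytes using the default or an override."""
    return text.encode(encoding or DEFAULT_ENCODING)


# Ordinary text goes through the module-wide default.
label = to_text(b"caf\xc3\xa9")

# A filename in a different encoding is special-cased explicitly.
filename = to_text(b"caf\xe9", "latin-1")

# Round trip back to bytes with the default encoding.
raw = to_bytes(label)
```

The explicit second argument plays the role of a manual encode()/decode()
call: it always wins over the default, which is the override mechanism
Lisandro asks for.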

- Robert
