On Apr 16, 2008, at 10:14 PM, Stefan Behnel wrote:
> Hi,
>
> Robert Bradshaw wrote:
>> On Apr 16, 2008, at 4:25 AM, Stefan Behnel wrote:
>> I think our C files should always be pure ascii.
>
> You mean with C string escapes?

Yes, that's what I mean.

>>> That's the main reason why Py3 has a well defined "bytes" type and a
>>> Unicode "str" type instead of a Unicode "unicode" type and an
>>> underdefined
>>> "str" type in Py2.
>>>
>>>> What hasn't been resolved is conversions
>>>>
>>>>      cdef object o = s # s is a char*
>>> Sure, the semantics are clear: char* is a byte sequence in C, so the
>>> result is the equivalent of a byte sequence in Python: a byte
>>> string, i.e.
>>> a str object in Python2 and a bytes object in Py3.
>>
>> I understand this distinction. Technically a char* is a byte string.
>> The problem is that people are going to want to implicitly handle
>> unicode <-> char* much more often.
>
> But they shouldn't do that. Python3 is very strict here. There is no
> automatic conversion between bytes and str. You must be explicit about
> the way you want to convert it. And believe me, they didn't break it
> doing that, they fixed it.

I fully agree that Python is moving in the right direction here.

There are two kinds of mistakes that languages can make with strings.  
The first is to assume 1 byte == 1 character, which is obviously bad  
and what Python is moving away from. The second, however, is to  
require explicit mention of the conversion for every trivial task. I  
don't want Cython to be like this.
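Both failure modes are easy to demonstrate in Python 3 itself (a small illustrative sketch, not taken from the original mail):

```python
# Mistake 1: assuming 1 byte == 1 character.
s = "café"
utf8 = s.encode("utf-8")
print(len(s))     # 4 characters
print(len(utf8))  # 5 bytes: 'é' takes two bytes in UTF-8

# Python 3 avoids mistake 1 by keeping str (text) and bytes separate,
# but the strictness cuts the other way: every mix of the two
# now demands an explicit conversion (mistake 2, if overdone).
try:
    b"eggs" + "spam"  # no implicit conversion in Py3
except TypeError:
    print("mixing bytes and str raises TypeError")

print(b"eggs" + "spam".encode("utf-8"))  # explicit, and verbose
```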

> Doing magic in Cython would actually be unexpected in that light.

I don't think it's a question of magic, it's a question of the  
relationship between bytes, unicode, and char*. I'm not saying there  
should be conversion between Python bytes and unicode, I'm saying  
that the C type char* corresponds better to the Python unicode type  
than the Python bytes type.

>>>> (That would break a lot of code...including really nice code like
>>>> declaring a function argument to be char*)
>>> That would still accept any kind of byte string or a bytes object
>>> in Py3, which is just fine IMHO.
>>
>> I think this significantly impacts usability. For example, if I have
>> a function
>>
>>      def foo(char* x):
>>          ...
>>
>> then users of my module won't be able to write foo("eggs") anymore,
>> they will have to write foo(b"eggs") or even foo(x.encode('UTF-8'))
>> if x is given to them from elsewhere.
>
> That's fine, because your function expects a byte string. It cannot
> handle a unicode string, and it even says so in its signature.
>
>> Likewise, if I have
>>
>>      def foo():
>>          cdef char* s
>>          ...
>>          return s
>>
>> Then the user won't be able to write
>>
>>      print "The answer is %s" % foo()
>>
>> or
>>
>>      foo() + "eggs"
>
> But again, that's because of Python semantics, not of Cython semantics.

You say "that's fine" but my issue was one of usability, which hasn't  
been addressed.

Technically, a char* is a pointer to a char. It doesn't even have a
length, which is one thing that distinguishes it from a Python bytes
object. But what char* means, in the conventional sense, is a C string.
A C string should get converted into a Python string (which, in Python
3000, is a unicode object). Put another way, the type of "foo" in C
should get converted to the type of "foo" in Python.

We get to decide what the relationship is between a char* and a
PyObject*. I am advocating that whenever there is an implicit conversion
between the two, char* be treated as a null-terminated UTF-8 string.
This will allow maximum backwards compatibility and ease of use.
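As a rough model of what that coercion would mean (plain Python, with hypothetical helper names c_to_py/py_to_c used only for illustration; Cython would of course do this at the C level):

```python
def c_to_py(raw: bytes) -> str:
    """char* -> object: decode a null-terminated UTF-8 buffer."""
    # A C char* carries no length, so conversion stops at the first NUL.
    end = raw.find(b"\x00")
    if end != -1:
        raw = raw[:end]
    return raw.decode("utf-8")

def py_to_c(text: str) -> bytes:
    """object -> char*: encode to UTF-8 and null-terminate."""
    return text.encode("utf-8") + b"\x00"

print(c_to_py(b"eggs\x00junk"))    # eggs  (stops at the NUL)
print(len(py_to_c("café")))        # 6: five UTF-8 bytes plus the NUL
```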

- Robert


_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
