On Apr 16, 2008, at 10:14 PM, Stefan Behnel wrote:
> Hi,
>
> Robert Bradshaw wrote:
>> On Apr 16, 2008, at 4:25 AM, Stefan Behnel wrote:
>> I think our C files should always be pure ascii.
>
> You mean with C string escapes?
Yes, that's what I mean.
>>> That's the main reason why Py3 has a well-defined "bytes" type and a
>>> Unicode "str" type instead of a Unicode "unicode" type and an
>>> underdefined "str" type in Py2.
>>>
>>>> What hasn't been resolved is conversions
>>>>
>>>> cdef object o = s # s is a char*
>>> Sure, the semantics are clear: char* is a byte sequence in C, so the
>>> result is the equivalent of a byte sequence in Python: a byte string,
>>> i.e. a str object in Python2 and a bytes object in Py3.
>>
>> I understand this distinction. Technically a char* is a byte string.
>> The problem is that people are going to want to implicitly handle
>> unicode <-> char* much more often.
>
> But they shouldn't do that. Python3 is very strict here. There is no
> automatic conversion between bytes and str. You must be explicit about
> the way you want to convert it. And believe me, they didn't break it
> doing that, they fixed it.
I fully agree that Python is moving in the right direction here. There
are two kinds of mistakes that languages can make with strings. The
first is to assume 1 byte == 1 character, which is obviously bad and
what Python is moving away from. The second, however, is to require an
explicit conversion for every trivial task. I don't want Cython to make
that second mistake.
> Doing magic in Cython would actually be unexpected in that light.
I don't think it's a question of magic; it's a question of the
relationship between bytes, unicode, and char*. I'm not saying there
should be implicit conversion between Python bytes and unicode; I'm
saying that the C type char* corresponds better to the Python unicode
type than to the Python bytes type.
>>>> (That would break a lot of code... including really nice code like
>>>> declaring a function argument to be char*)
>>> That would still accept any kind of byte string or a bytes object
>>> in Py3, which is just fine IMHO.
>>
>> I think this significantly impacts usability. For example, if I have
>> a function
>>
>> def foo(char* x):
>> ...
>>
>> then users of my module won't be able to write foo("eggs") anymore,
>> they will have to write foo(b"eggs") or even foo(x.encode('UTF-8'))
>> if x is given to them from elsewhere.
>
> That's fine, because your function expects a byte string. It cannot
> handle a unicode string, and it even says so in its signature.
>
>> Likewise, if I have
>>
>> def foo():
>> cdef char* s
>> ...
>> return s
>>
>> Then the user won't be able to write
>>
>> print "The answer is %s" % foo()
>>
>> or
>>
>> foo() + "eggs"
>
> But again, that's because of Python semantics, not of Cython semantics.
You say "that's fine" but my issue was one of usability, which hasn't
been addressed.
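The usability cost can be made concrete with a plain Python 3 sketch; foo here is a hypothetical stand-in for a function whose cdef char* result is exposed as bytes:

```python
# Hypothetical stand-in for a Cython function returning a C string
# as a bytes object rather than a text string.
def foo():
    return b"42"

# In Python 3, formatting a bytes value into text inserts its repr:
message = "The answer is %s" % foo()
print(message)  # -> The answer is b'42'

# The caller has to decode explicitly to get the intended result:
fixed = "The answer is %s" % foo().decode("utf-8")
print(fixed)    # -> The answer is 42
```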
Technically, a char* is a pointer to a char. It doesn't even have a
length, which is one thing that distinguishes it from a Python bytes
object. But what char* means, in the conventional sense, is a C
string. A C string should get converted into a Python string (which,
in Python 3000, is a unicode object). Put another way, the type of
"foo" in C should get converted to the type of "foo" in Python.
We get to decide what the relationship is between a char* and a
PyObject*. I am advocating that whenever an implicit conversion
between the two is needed, char* be treated as a null-terminated
UTF-8 string. This would allow maximum backwards compatibility and
ease of use.
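A rough Python model of the proposed rule, using ctypes to stand in for a C-side char* buffer (the names are illustrative, not Cython's implementation):

```python
import ctypes

# A NUL-terminated char* buffer on the C side, holding UTF-8 bytes.
buf = ctypes.create_string_buffer("naïve".encode("utf-8"))

# Under the proposed rule, an implicit char* -> object conversion would
# read up to the NUL terminator and decode the bytes as UTF-8:
py_text = buf.value.decode("utf-8")
print(py_text)  # -> naïve

# And object -> char* would be the reverse: encode to UTF-8, NUL-terminate.
c_bytes = py_text.encode("utf-8") + b"\x00"
```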
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev