Stefan Behnel wrote:
> Dag Sverre Seljebotn, 06.09.2010 20:30:
>
>> Stefan Behnel wrote:
>>
>>> Robert Bradshaw, 06.09.2010 19:01:
>>>
>>>
>>>> On Mon, Sep 6, 2010 at 9:36 AM, Dag Sverre Seljebotn
>>>>
>>>>
>>>>> I don't understand this suggestion. What happens in each of these cases,
>>>>> for different settings of "from __future__ import unicode_literals"?
>>>>>
>>>>> cdef char* x1 = 'abc\u0001'
>>>>>
>>>>>
>>> As I said in my other mail, I don't think anyone would use the above in
>>> real code. The alternative below is just too obvious and simple.
>>>
>>>>> cdef char* x2 = 'abc\x01'
>>>>>
>>>>>
>>>> from __future__ import unicode_literals (or -3)
>>>>
>>>> len(x1) == 4
>>>> len(x2) == 4
>>>>
>>>> Otherwise
>>>>
>>>> len(x1) == 9
>>>> len(x2) == 4
>>>>
>>>>
>>> Hmm, now *that* looks unexpected to me. The way I see it, a C string is the
>>> C equivalent of a Python byte string and should always and predictably
>>> behave like a Python byte string, regardless of the way Python object
>>> literals are handled.
>>>
>>>
>> While the "cdef char*" case isn't that horrible,
>>
>> f('abc\x01')
>>
>> is. Imagine throwing a type into the signature of f and then getting
>> different data in.
>>
>
> This case is unambiguous. But the following would change.
>
> # using default source code encoding UTF-8
>
> cdef char* cstring = 'abcüöä'
>
> charfunc('abcüöä')
>
> pyfunc('abcüöä')
>
> Here, 'cstring' is assigned a 9 byte long C string which is also passed
> into charfunc(). When unicode_literals is enabled, pyfunc() would receive
> u'abcüöä'; otherwise it would receive the same 9 byte long byte string.
>
> # encoding: ISO-8859-1
>
> cdef char* cstring = 'abcüöä'
>
> charfunc('abcüöä')
>
> pyfunc('abcüöä')
>
> assigns a 6 byte long C string, and the same goes for the charfunc() call.
> With unicode_literals, pyfunc() would receive u'abcüöä'; otherwise it would
> receive the 6 byte long byte string b'abcüöä'.
>
> With the ASCII-only proposal, both examples above would raise an error for
> the C string usage and behave as described for the Python strings.
>
>
> The same string as an escaped literal:
>
> cdef char* cstring = 'abc\xfc\xf6\xe4'
>
> cfunc('abc\xfc\xf6\xe4')
>
> pyfunc('abc\xfc\xf6\xe4')
>
> would assign/pass a 6 byte string, whereas it would equally be disallowed
> under the ASCII-only proposal. The Python case would pass a 6 character
> unicode string or a 6 byte long byte string, depending on unicode_literals.
>
> My point is that I don't see a reason for a compiler error. I find the
> above behaviour predictable and reasonable.
>
>> I really, really don't like having the value of a literal depend on the
>> type of the variable it gets assigned to (I know, I know about ints and
>> so on, but let's try to keep the number of instances down).
>>
>> My vote is for identifying a set of completely safe strings (no \x or \u
>> escapes, ASCII-only) that are the same regardless of any setting, and
>> allowing those. For anything else, demand a b'' prefix to assign to a
>> char*. Putting in a b'' isn't THAT hard.
>>
>
> Well, then why not keep it the way it was before and *always* require a 'b'
> prefix in front of char* literals when unicode_literals is enabled? After
> all, it's an explicit option, so users who want to enable it can be
> required to adapt their code accordingly.
>
If this can get any momentum, I'm all for it (I had dismissed the idea
because I thought it would meet opposition everywhere). To me it doesn't
really make sense to assign unicode literals to a char* in the first place,
and with -3 or unicode_literals enabled you're pretty much asking to have
to make such a change anyway.
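
Just to make that concrete, here is a rough, untested sketch (not tied to
any particular Cython version; demo() is only a stand-in) of what user code
would look like with an explicit b'' prefix required for char* literals:

# sketch.pyx -- illustrative only
from __future__ import unicode_literals

cdef extern from "string.h":
    size_t strlen(char *s)

def demo():
    # Explicit b'' prefix: the literal is a byte string no matter what
    # unicode_literals says, so the char* always sees the same 6 bytes.
    cdef char* raw = b'abc\xfc\xf6\xe4'
    assert strlen(raw) == 6

    # Plain ASCII literal without escapes: unambiguous in either mode,
    # so it could arguably still be allowed without a prefix.
    cdef char* safe = b'abc'
    assert strlen(safe) == 3

    # Unprefixed literal stays a Python object: with unicode_literals it
    # is a 6 character unicode string, otherwise a 6 byte byte string.
    py_obj = 'abc\xfc\xf6\xe4'
    return py_obj

The only change for existing code would be to add the b prefix wherever a
non-ASCII or escaped literal is coerced to a char*.
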
Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev