Stefan Behnel wrote:
> Dag Sverre Seljebotn, 06.09.2010 20:30:
>
>> Stefan Behnel wrote:
>>
>>> Robert Bradshaw, 06.09.2010 19:01:
>>>
>>>
>>>> On Mon, Sep 6, 2010 at 9:36 AM, Dag Sverre Seljebotn
>>>>
>>>>
>>>>> I don't understand this suggestion. What happens in each of these cases,
>>>>> for different settings of "from __future__ import unicode_literals"?
>>>>>
>>>>> cdef char* x1 = 'abc\u0001'
>>>>>
>>>>>
>>> As I said in my other mail, I don't think anyone would use the above in
>>> real code. The alternative below is just too obvious and simple.
>>>
>>>>> cdef char* x2 = 'abc\x01'
>>>>>
>>>>>
>>>> from __future__ import unicode_literals (or -3)
>>>>
>>>> len(x1) == 4
>>>> len(x2) == 4
>>>>
>>>> Otherwise
>>>>
>>>> len(x1) == 9
>>>> len(x2) == 4
>>>>
>>>>
>>> Hmm, now *that* looks unexpected to me. The way I see it, a C string is the
>>> C equivalent of a Python byte string and should always and predictably
>>> behave like a Python byte string, regardless of the way Python object
>>> literals are handled.
>>>
>>>
>> While the "cdef char*" case isn't that horrible,
>>
>> f('abc\x01')
>>
>> is. Imagine throwing a type into the signature of f and then getting
>> different data in.
>>
>
> This case is unambiguous. But the following would change.
>
> # using default source code encoding UTF-8
>
> cdef char* cstring = 'abcüöä'
>
> charfunc('abcüöä')
>
> pyfunc('abcüöä')
>
> Here, 'cstring' is assigned a 9 byte long C string which is also passed
> into charfunc(). When unicode_literals is enabled, pyfunc() would receive
> u'abcüöä'; otherwise it would receive the same 9 byte long byte string.
>
> # encoding: ISO-8859-1
>
> cdef char* cstring = 'abcüöä'
>
> charfunc('abcüöä')
>
> pyfunc('abcüöä')
>
> assigns a 6 byte long C string, and the same goes for the charfunc() call.
> With unicode_literals, pyfunc() would receive u'abcüöä'; otherwise it would
> receive the 6 byte long byte string b'abcüöä'.
>
> With the ASCII-only proposal, both examples above would raise an error for
> the C string usage and behave as described for the Python strings.
>
>
> The same string as an escaped literal:
>
> cdef char* cstring = 'abc\xfc\xf6\xe4'
>
> cfunc('abc\xfc\xf6\xe4')
>
> pyfunc('abc\xfc\xf6\xe4')
>
> would assign/pass a 6 byte string, whereas it would equally be disallowed
> under the ASCII-only proposal. The Python case would pass a 6 character
> unicode string or a 6 byte long byte string, depending on unicode_literals.
>
> My point is that I don't see a reason for a compiler error. I find the
> above behaviour predictable and reasonable.
>
>> I really, really don't like having the value of a literal depend on the
>> type of the variable it gets assigned to (I know, I know about ints and
>> so on, but let's try to keep the number of instances down).
>>
>> My vote is for identifying a set of completely safe strings (no \x or \u
>> escapes, ASCII-only) that are the same regardless of any setting, and
>> allowing those. For anything else, demand a b'' prefix to assign to a
>> char*. Putting in a b'' isn't THAT hard.
>>
>
> Well, then why not keep it the way it was before and *always* require a 'b'
> prefix in front of char* literals when unicode_literals is enabled? After
> all, it's an explicit option, so users who want to enable it can be
> required to adapt their code accordingly.
>
If this can get any momentum, I'm all for it (I had dismissed the idea
because I thought it would meet opposition everywhere). To me it doesn't
really make sense to assign unicode literals to a char* in the first place,
and with -3 or unicode_literals enabled you're pretty much asking to have
to make such a change anyway.
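
Just to make that concrete, here is a rough, untested sketch (not tied to
any particular Cython version; demo() is only a stand-in) of what user code
would look like with an explicit b'' prefix required for char* literals:

# sketch.pyx -- illustrative only
from __future__ import unicode_literals

cdef extern from "string.h":
    size_t strlen(char *s)

def demo():
    # Explicit b'' prefix: the literal is a byte string no matter what
    # unicode_literals says, so the char* always sees the same 6 bytes.
    cdef char* raw = b'abc\xfc\xf6\xe4'
    assert strlen(raw) == 6

    # Plain ASCII literal without escapes: unambiguous in either mode,
    # so it could arguably still be allowed without a prefix.
    cdef char* safe = b'abc'
    assert strlen(safe) == 3

    # Unprefixed literal stays a Python object: with unicode_literals it
    # is a 6 character unicode string, otherwise a 6 byte byte string.
    py_obj = 'abc\xfc\xf6\xe4'
    return py_obj

The only change for existing code would be to add the b prefix wherever a
non-ASCII or escaped literal is coerced to a char*.
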
Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev