Re: [Cython] C string literals

Robert Bradshaw Tue, 07 Sep 2010 01:28:24 -0700

On Mon, Sep 6, 2010 at 11:56 AM, Stefan Behnel wrote:



>> While the "cdef char*" case isn't that horrible,
>>
>> f('abc\x01')
>>
>> is. Imagine adding a type to the signature of f and then getting
>> different data in.
>
> This case is unambiguous. But the following would change.
>
>     # using default source code encoding UTF-8
>
>     cdef char* cstring = 'abcüöä'
>
>     charfunc('abcüöä')
>
>     pyfunc('abcüöä')
>
> Here, 'cstring' is assigned a 9-byte C string, which is also passed
> into charfunc(). When unicode_literals is enabled, pyfunc() would receive
> u'abcüöä'; otherwise it would receive the same 9-byte byte string.
>
>     # encoding: ISO-8859-1
>
>     cdef char* cstring = 'abcüöä'
>
>     charfunc('abcüöä')
>
>     pyfunc('abcüöä')
>
> assigns a 6-byte C string, and the same goes for the charfunc() call. With
> unicode_literals, pyfunc() would receive u'abcüöä'; otherwise, it would
> receive a 6-byte byte string b'abcüöä'.
>
> With the ASCII-only proposal, both examples above would raise an error for
> the C string usage and behave as described for the Python strings.
>
>
> The same string as an escaped literal:
>
>     cdef char* cstring = 'abc\xfc\xf6\xe4'
>
>     cfunc('abc\xfc\xf6\xe4')
>
>     pyfunc('abc\xfc\xf6\xe4')
>
> would assign/pass a 6-byte string, which would be equally disallowed
> under the ASCII-only proposal. The Python case would pass a 6-character
> unicode string or a 6-byte byte string, depending on unicode_literals.
>
> My point is that I don't see a reason for a compiler error. I find the
> above behaviour predictable and reasonable.

The reason I don't see this as predictable is that the value of the
literal depends on knowing the signatures of cfunc and pyfunc (which
probably will not be named as informatively...)
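
To make that concrete, here is a rough sketch in plain Python 3 (not
Cython). It takes the escaped literal from the example above and evaluates
the same source text once as a byte string and once as a unicode string,
roughly what a char* versus an object signature would select. The variable
names are made up for illustration.

```python
import ast

# The literal body exactly as it would appear between the quotes in the source.
literal_body = r'abc\xfc\xf6\xe4'

# Interpreted as a byte string (what a char* argument would want).
as_bytes = ast.literal_eval("b'" + literal_body + "'")

# Interpreted as a unicode string (what unicode_literals would produce).
as_text = ast.literal_eval("'" + literal_body + "'")

# Same source text, two different values: 6 bytes versus 6 characters,
# and the characters take 9 bytes once encoded as UTF-8.
utf8_length = len(as_text.encode('utf-8'))
```

Nothing at the call site says which of the two values is meant.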

Actually, b'abcüöä' is a syntax error ("bytes can only contain ASCII
literal characters"), which bolsters the argument for requiring Cython
to follow suit. In fact, if byte literals are interpreted as the
literal bytes in the file, then Python files can't be naively
re-encoded with a different encoding (with the header fixed to match)
without possibly changing the actual meaning of the program.

>> I really, really don't like having the value of a literal depend on type
>> of the variable it gets assigned to (I know, I know about ints and so
>> on, but let's try to keep the number of instances down).
>>
>> My vote is for identifying a set of completely safe strings (no \x or
>> \u, ASCII-only) that is the same regardless of any setting, and allow
>> that. Anything else, demand a b'' prefix to assign to a char*. Putting
>> in a b'' isn't THAT hard.
>
> Well, then why not keep it the way it was before and *always* require a 'b'
> prefix in front of char* literals when unicode_literals is enabled? After
> all, it's an explicit option, so users who want to enable it can be
> required to adapt their code accordingly.

Why require it if there's absolutely no ambiguity?
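
For what it's worth, the "completely safe" check could look something like
the following. This is a plain Python sketch, not Cython compiler code, and
is_safe_literal is a hypothetical name. It assumes the rule is exactly
"ASCII-only, no \x or \u escapes", and it treats \U the same way as \u,
which is my assumption. Octal escapes are not covered by the rule as
quoted, so they are left alone here.

```python
import re

def is_safe_literal(source_text):
    """Return True if the literal body means the same thing as bytes or unicode.

    source_text is the body of the literal as written in the file,
    without the surrounding quotes.
    """
    # Any non-ASCII character makes the value depend on the source encoding.
    if any(ord(ch) > 127 for ch in source_text):
        return False
    # Walk the escapes left to right, so an escaped backslash followed by
    # 'x' (as in \\x41) is not mistaken for a \x escape.
    for match in re.finditer(r'\\(.)', source_text, re.DOTALL):
        if match.group(1) in 'xuU':
            return False
    return True
```

Anything that fails the check would need an explicit b'' prefix before
being assigned to a char*.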

- Robert