Robert Bradshaw, 07.09.2010 01:54:
> On Mon, Sep 6, 2010 at 11:30 AM, Dag Sverre Seljebotn
>> While the "cdef char*" case isn't that horrible,
>>
>> f('abc\x01')
>>
>> is. Imagine throwing in a type in the signature of f and then get
>> different data in.
>>
>> I really, really don't like having the value of a literal depend on type
>> of the variable it gets assigned to (I know, I know about ints and so
>> on, but let's try to keep the number of instances down).
>
> +1. This is the main reason I'm arguing my point. Literals should not
> be re-interpreted based on context.

Well, they are, though. There's the context of the source code encoding,
the context of unicode_literals, and the special case of 1-character and
1-byte literals in integer contexts. There's also the runtime-specific
interpretation of 'str', but that only affects literals indirectly,
independent of their content.

In addition to that, the "ASCII-only" proposal adds a context of its own
on top, just as the "char* == bytes" proposal does. "ASCII-only" encodes
Unicode strings to ASCII and rejects everything that doesn't fit, including
explicit byte escapes. So you can't write cfunc("ao\xFF"), for example,
although the code itself only uses plain ASCII characters. This creates an
artificial difference between cfunc("ac\x7F") and cfunc("ac\x80"): one is
allowed, the other is rejected and requires code modifications.

"char* == bytes" encodes char* literals back to the byte sequence defined
by the source code encoding and, in addition, handles all byte escapes
properly. So cfunc("ao\xFF") behaves exactly as written and cfunc("aoäö")
is interpreted in the context of the source code encoding.

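
To make the difference between the two proposals concrete, here is a rough model of them in plain Python. This is only a sketch of the semantics described above, not what the Cython compiler does internally: the function names (split_literal, ascii_only, char_ptr_is_bytes) are made up for illustration, the literal is passed in as its raw source text, and only \xNN escapes are recognised (escaped backslashes and other escape forms are ignored for brevity).

```python
import re

# Matches a two-digit hex byte escape such as \xFF in the raw literal text.
_BYTE_ESCAPE = re.compile(r'\\x([0-9a-fA-F]{2})')


def split_literal(body):
    """Split raw literal text into characters (str) and byte escapes (int)."""
    parts = []
    pos = 0
    for match in _BYTE_ESCAPE.finditer(body):
        parts.extend(body[pos:match.start()])   # plain characters, one by one
        parts.append(int(match.group(1), 16))   # the escaped byte value
        pos = match.end()
    parts.extend(body[pos:])
    return parts


def ascii_only(body):
    """Model of "ASCII-only": anything above 0x7F is rejected, escapes included."""
    out = bytearray()
    for part in split_literal(body):
        value = part if isinstance(part, int) else ord(part)
        if value > 0x7F:
            raise ValueError("not representable as ASCII: %r" % (part,))
        out.append(value)
    return bytes(out)


def char_ptr_is_bytes(body, source_encoding="utf-8"):
    """Model of "char* == bytes": characters are encoded with the source code
    encoding, byte escapes are passed through unchanged."""
    out = bytearray()
    for part in split_literal(body):
        if isinstance(part, int):
            out.append(part)
        else:
            out.extend(part.encode(source_encoding))
    return bytes(out)
```

In this model, ascii_only(r"ac\x7F") is accepted while ascii_only(r"ac\x80") raises, and char_ptr_is_bytes(r"ao\xFF") yields exactly the three bytes written. The result of char_ptr_is_bytes("aoäö", enc) depends on enc, which is the source encoding dependency mentioned above.
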
>> My vote is for identifying a set of completely safe strings (no \x or
>> \u, ASCII-only) that is the same regardless of any setting, and allow
>> that. Anything else, demand a b'' prefix to assign to a char*. Putting
>> in a b'' isn't THAT hard.
>
> Sure. Many (most) libraries take char* for string values. I want to
> avoid requiring special incantations (not that the 'b' is hard, but
> one needs to know it) to write, e.g.,
>
> printf("Hello World\n")

But that would only apply when you enable unicode_literals. I think it's
reasonable to either a) require a 'b' prefix in that case or b) enforce
bytes semantics for char* automatically. To make life easy for users, we
can either apply b) or, for a), let Cython generate a patch (or script)
that prepends a 'b' prefix to every unprefixed string literal that is used
in a char* context. That way, the source code becomes safe regardless of
the unicode_literals setting.
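
A minimal sketch of what such a script could look like, written in plain Python on top of the stdlib tokenize module. It assumes that the compiler can report the (line, column) start positions of the literals that end up in a char* context; that reporting step does not exist here and is only represented by the positions argument. The function name add_b_prefixes is made up, and the sketch further assumes that the source can be tokenized with Python's lexical rules.

```python
import io
import tokenize


def add_b_prefixes(source, positions):
    """Return source with 'b' prepended to unprefixed string literals.

    positions is a set of (line, column) pairs, 1-based line and 0-based
    column, marking where the affected literals start. Literals that
    already carry a prefix (b, u, r, ...) are left alone.
    """
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (tok.type == tokenize.STRING
                and tok.start in positions
                and tok.string[0] in "'\""):   # starts with a quote: no prefix
            edits.append(tok.start)
    # Apply right-to-left so earlier columns on the same line stay valid.
    for row, col in sorted(edits, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "b" + line[col:]
    return "".join(lines)
```

Given the line printf("Hello World\n") and the position (1, 7), this rewrites the call to printf(b"Hello World\n") and leaves all other literals in the file untouched.
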
Stefan