Robert Bradshaw, 07.09.2010 01:54:
> On Mon, Sep 6, 2010 at 11:30 AM, Dag Sverre Seljebotn
>> While the "cdef char*" case isn't that horrible,
>>
>> f('abc\x01')
>>
>> is. Imagine throwing in a type in the signature of f and then getting
>> different data in.
>>
>> I really, really don't like having the value of a literal depend on type
>> of the variable it gets assigned to (I know, I know about ints and so
>> on, but let's try to keep the number of instances down).
>
> +1. This is the main reason I'm arguing my point. Literals should not
> be re-interpreted based on context.

Well, they are, though. There's the context of the source code encoding, 
the context of unicode_literals, and the special case of 1-character and 
1-byte literals in integer contexts. There's also the runtime specific 
interpretation of 'str', but that only affects literals indirectly, 
independent of their content.

In addition to that, the "ASCII-only" proposal adds a context of its own,
just like the "char* == bytes" proposal does. "ASCII-only" encodes Unicode
strings to ASCII and rejects everything that doesn't fit, including
explicit byte escapes. So you can't write cfunc("ao\xFF"), for example,
although the code itself only uses plain ASCII characters. This creates an
artificial difference between cfunc("ac\x7F") and cfunc("ac\x80"): the
first is accepted, while the second is rejected and forces a code change.
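
To make that boundary concrete, here is a rough sketch in plain Python (not
Cython, and not actual compiler code). The helper name ascii_only_literal is
made up for illustration; it only mimics the rule as described above, on the
assumption that the literal has already been decoded to a Unicode string.

```python
def ascii_only_literal(value):
    """Hypothetical helper: mimic the "ASCII-only" rule for a char* literal.

    `value` is the literal after escape processing, as a Unicode string.
    """
    try:
        # Everything up to U+007F fits into ASCII and passes through.
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        # Anything above U+007F is rejected, even if it was written
        # as a plain-ASCII escape such as \x80 or \xFF.
        raise ValueError("literal needs a b'' prefix: %r" % value) from exc


accepted = ascii_only_literal("ac\x7F")   # U+007F still fits

try:
    ascii_only_literal("ac\x80")          # U+0080 does not
    rejected = False
except ValueError:
    rejected = True
```

Both literals consist of plain ASCII characters in the source file, yet only
the first one gets through.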

"char* == bytes" encodes char* literals back to the byte sequence defined 
by the source code encoding, while properly handling all byte escapes in 
addition. So cfunc("ao\xFF") behaves exactly as written and cfunc("aoäö") 
will be interpreted in the context of the source code encoding.
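
As an illustration only, the following plain Python sketch shows how I read
the "char* == bytes" rule. The function char_ptr_bytes and its arguments are
hypothetical and not part of Cython. It assumes the literal body is available
as written in the source file (escapes not yet processed) and covers only
\xNN plus a handful of simple escapes.

```python
import re

# A backslash followed by either xNN or any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)
_SIMPLE = {"n": 0x0A, "t": 0x09, "r": 0x0D, "0": 0x00,
           "\\": 0x5C, "'": 0x27, '"': 0x22}


def char_ptr_bytes(literal_body, source_encoding):
    """Hypothetical helper: bytes for an unprefixed literal in a char* context."""
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(literal_body):
        # Text between escapes goes back to the source file's byte sequence.
        out += literal_body[pos:match.start()].encode(source_encoding)
        escape = match.group(1)
        if escape.startswith("x") and len(escape) == 3:
            # A byte escape yields exactly the byte that was written.
            out.append(int(escape[1:], 16))
        elif escape in _SIMPLE:
            out.append(_SIMPLE[escape])
        else:
            raise ValueError("escape not covered by this sketch: \\" + escape)
        pos = match.end()
    out += literal_body[pos:].encode(source_encoding)
    return bytes(out)


as_written = char_ptr_bytes(r"ao\xFF", "utf-8")
# "ao" followed by a-umlaut and o-umlaut, under two source encodings:
utf8_bytes = char_ptr_bytes("ao\u00e4\u00f6", "utf-8")
latin1_bytes = char_ptr_bytes("ao\u00e4\u00f6", "iso-8859-1")
```

The escaped byte comes out unchanged, while the non-ASCII characters come out
as whatever bytes the declared source encoding uses for them.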


>> My vote is for identifying a set of completely safe strings (no \x or
>> \u, ASCII-only) that is the same regardless of any setting, and allow
>> that. Anything else, demand a b'' prefix to assign to a char*. Putting
>> in a b'' isn't THAT hard.
>
> Sure. Many (most) libraries take char* for string values. I want to
> avoid requiring special incantations (not that the 'b' is hard, but
> one needs to know it) to write, e.g.,
>
> printf("Hello World\n")

But that would only apply when you enable unicode_literals. I think it's
reasonable to either a) require a 'b' prefix in that case or b) enforce
bytes semantics for char* automatically. To make life easy for users, we
can either go with b) or, for a), let Cython generate a patch (or script)
that prepends a 'b' prefix wherever an unprefixed string literal is used
in a char* context. That way, the source code becomes safe regardless of
the unicode_literals setting.
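
A minimal sketch of what such a script could look like, in plain Python. It
assumes the compiler can report the (line, column) start position of every
literal that ends up in a char* context. That list, and the function name
prepend_b_prefixes, are assumptions for illustration; Cython has no such
interface that I know of. The standard tokenize module is used only to locate
the literals, so this works on input that tokenizes as Python.

```python
import io
import tokenize


def prepend_b_prefixes(source, char_ptr_positions):
    """Hypothetical helper: add a 'b' prefix to the reported string literals.

    `char_ptr_positions` is a set of (line, column) token start positions,
    with 1-based lines and 0-based columns as used by tokenize.
    """
    lines = source.splitlines(keepends=True)
    targets = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (tok.type == tokenize.STRING
                and tok.start in char_ptr_positions
                and tok.string[0] in "'\""):   # skip already prefixed literals
            targets.append(tok.start)
    # Patch from the end so earlier column offsets stay valid.
    for row, col in sorted(targets, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "b" + line[col:]
    return "".join(lines)


example = 'printf("Hello World\\n")\nname = "abc"\n'
patched = prepend_b_prefixes(example, {(1, 7)})
```

Only the literal passed to printf is rewritten here; the second string is
left alone because it was not reported as being used in a char* context.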

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
