On Tue, Sep 7, 2010 at 1:14 AM, Stefan Behnel <[email protected]> wrote:
> Robert Bradshaw, 07.09.2010 01:54:
>> On Mon, Sep 6, 2010 at 11:30 AM, Dag Sverre Seljebotn
>>> While the "cdef char*" case isn't that horrible,
>>>
>>> f('abc\x01')
>>>
>>> is. Imagine throwing a type into the signature of f and then getting
>>> different data in.
>>>
>>> I really, really don't like having the value of a literal depend on type
>>> of the variable it gets assigned to (I know, I know about ints and so
>>> on, but let's try to keep the number of instances down).
>>
>> +1. This is the main reason I'm arguing my point. Literals should not
>> be re-interpreted based on context.
>
> Well, they are, though. There's the context of the source code encoding,
> the context of unicode_literals, and the special case of 1-character and
> 1-byte literals in integer contexts. There's also the runtime specific
> interpretation of 'str', but that only affects literals indirectly,
> independent of their content.
OK, there's some context going on, but I don't find any of these as
egregious as depending on a definition that may be in another file,
especially because it could change, for the sake of optimization, in a
surprisingly incompatible way.
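To make the hazard concrete, here's a minimal pure-Python sketch (not what
Cython generates, just a model) of how one and the same escaped literal
turns into different byte data depending on whether it is treated as text
or as raw char* bytes:

```python
literal = "abc\xff"  # one literal, two plausible interpretations

# Treated as a unicode string and encoded (UTF-8 here), \xff is the
# code point U+00FF and becomes the two bytes 0xC3 0xBF:
as_text = literal.encode("utf-8")

# Treated as raw char* data, \xff is the single byte 0xFF (latin-1
# stands in for a byte-for-byte mapping):
as_char_star = literal.encode("latin-1")

assert as_text == b"abc\xc3\xbf"
assert as_char_star == b"abc\xff"
assert as_text != as_char_star  # same source literal, different data
```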
> In addition to that, the "ASCII-only" proposal adds a similar context on
> top as the "char* == bytes" proposal. "ASCII-only" encodes Unicode strings
> to ASCII and rejects everything that doesn't fit, including explicit byte
> escapes. So you can't write cfunc("ao\xFF"), for example, although the code
> itself only uses plain ASCII characters. This creates an artificial
> difference between cfunc("ac\x7F") and cfunc("ac\x80") in the sense that
> one is allowed and the other is rejected and requires code modifications.
I'm fine with this distinction, or (as someone proposed) with marking a
string as unsafe if it has any escapes at all. Python 3 draws the line
at "ASCII only" for bytes types, though it does allow all escapes.
> "char* == bytes" encodes char* literals back to the byte sequence defined
> by the source code encoding, while properly handling all byte escapes in
> addition. So cfunc("ao\xFF") behaves exactly as written and cfunc("aoäö")
> will be interpreted in the context of the source code encoding.
>
>
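If I'm reading that right, a pure-Python sketch of the intended results
(assuming a UTF-8 source encoding) would be:

```python
# A byte escape stays a single raw byte; latin-1 models the
# byte-for-byte mapping of U+00FF -> 0xFF:
assert "ao\xff".encode("latin-1") == b"ao\xff"

# Non-ASCII source characters follow the source encoding (UTF-8 here),
# so each of the two umlauts becomes two bytes:
assert "aoäö".encode("utf-8") == b"ao\xc3\xa4\xc3\xb6"
```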
>>> My vote is for identifying a set of completely safe strings (no \x or
>>> \u, ASCII-only) that is the same regardless of any setting, and allow
>>> that. Anything else, demand a b'' prefix to assign to a char*. Putting
>>> in a b'' isn't THAT hard.
>>
>> Sure. Many (most) libraries take char* for string values. I want to
>> avoid requiring special incantations (not that the 'b' is hard, but
>> one needs to know it) to write, e.g.,
>>
>> printf("Hello World\n")
>
> But that would only apply when you enable unicode_literals.
Yep. But if we plan to move to -3 being the default eventually, I'd
like to make things easier, not harder.
> I think it's
> reasonable to either a) require a 'b' prefix in that case or b) enforce
> bytes semantics for char* automatically. To make life easy for users,
> either b) can be applied or for a), we can let Cython generate a patch (or
> script) that prepends 'b' prefixes to all places where unprefixed string
> literals are used in a char* context. That way, the source code becomes
> safe regardless of the unicode_literals setting.
Neither is obvious to a newcomer, whereas a compiler error can
give a hint.
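Python 3 itself gives the analogous hint at runtime when a str literal
meets a bytes API — a rough model of what a compile-time error could do
for the char* case:

```python
# Mixing an unprefixed literal into a bytes operation fails with an
# error that points at the mismatch:
try:
    b"Hello " + "World"
    mixed_ok = True
except TypeError:
    mixed_ok = False
assert not mixed_ok

# The fix the error suggests: a b'' prefix on the literal.
assert b"Hello " + b"World" == b"Hello World"
```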
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev