Robert Bradshaw, 06.09.2010 18:24:
> On Sat, Sep 4, 2010 at 10:59 PM, Stefan Behnel wrote:
>> Robert Bradshaw, 05.09.2010 07:06:
>>> On Sat, Sep 4, 2010 at 9:24 PM, Stefan Behnel wrote:
>>>> Robert Bradshaw, 04.09.2010 22:04:
>>>>> How about we parse the literals as unicode strings, and if used in a
>>>>> bytes context we raise a compile time error if any characters are
>>>>> larger than a char?
>>>>
>>>> Can't work because you cannot recover the original byte sequence from a
>>>> decoded Unicode string. It may have used escapes or not, and it may or may
>>>> not be encodable using the source code encoding.
>>>
>>> I'm saying we shouldn't care about using escapes, and should raise a
>>> compile time error if it's not encodable using the source encoding.
>>
>> In that case, you'd break most code that actually uses escapes. If the
>> byte values were correctly representable using the source encoding, the
>> escapes wouldn't be necessary in the first place.
>
> The most common escape is probably \n, followed by \0, \r, \t... As
> for \uXXXX, that is just a superset of \xXX that only works for
> unicode literals.

Sure, and '\u...' is the only escape sequence that really makes a 
difference here.
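
To make the distinction concrete, here is a minimal illustration of plain 
Python escape semantics (nothing Cython-specific, just the behaviour the 
discussion builds on):

    # In a bytes literal, \uXXXX is not an escape and stays as six characters;
    # in a unicode literal it denotes a single code point.
    assert len(b"\u0001") == 6          # backslash, 'u', '0', '0', '0', '1'
    assert b"\u0001" == b"\\u0001"
    assert len(u"\u0001") == 1          # one character, U+0001
    assert u"\u0001" == u"\x01"

    # The common byte escapes mean the same thing in both kinds of literal:
    assert b"\n\0\t" == b"\x0a\x00\x09"
    assert u"\n\0\t" == u"\x0a\x00\x09"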


>>> In other words, I'm not a fan of
>>>
>>>       foo("abc \u0001")
>>>
>>> behaving (in my opinion) very differently depending on whether foo
>>> takes a char* or object argument.
>>
>> It's Python compatible, though:
>
> No, it's not. Python doesn't have the concept of "used in a C context."

I meant the context of byte strings. Cython has always allowed unprefixed 
string literals to be used as C char* strings, and I would expect that most 
people have used that in their code. I also don't see a problem with that.
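
A hypothetical sketch of the case Robert describes (the function names are 
made up, and this only illustrates the semantics under discussion, not a 
settled behaviour):

    cdef void takes_char_ptr(char* s):
        pass

    def takes_object(s):
        pass

    takes_char_ptr("abc \u0001")   # char* context: treated as a plain byte string
    takes_object("abc \u0001")     # object context: follows the literal's Python
                                   # type (str, or unicode under unicode_literals)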


> When I see b"abc \u0001" or u"abc \u0001" I know exactly what it
> means.  When I see "abc \u0001" I have to know whether unicode literals
> are enabled to know what it means, but now you've changed it so that's
> not enough anymore.

C char* strings have always behaved like plain byte strings, and that's the 
right way to handle them. The only problem is that importing 
unicode_literals from __future__ breaks those literals. My change fixed that.

Besides, I really don't think that people will use Unicode escapes when 
writing char* literals, given that the normal byte escapes are so much 
shorter and more readable.
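
For illustration, a sketch of the case the fix targets (variable names are 
made up); under the __future__ import an unprefixed literal parses as 
unicode but may still end up in a char* context:

    from __future__ import unicode_literals

    cdef char* greeting = "hello"     # broke once unicode_literals turned the
                                      # literal into a unicode object; the fix
                                      # keeps plain char* usage like this working
    cdef char* explicit = b"hello"    # always unambiguous: an explicit bytes literal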


> I'm with Lisandro and Carl Witty--how about just letting the parser
> parse them as unicode literals and then only accepting conversion back
> to char* for plain ASCII rather than introducing more complicated
> logic and semantics?

As I said, that breaks non-ASCII strings. I don't see why we should make an 
exception only for ASCII when we can make it work in general. If you want, 
we can disallow (or warn about) Unicode escapes in those strings. I could 
live with that, and it's easy to implement. You can still write the byte 
sequence down by escaping the leading '\u' as '\\u' or by prepending a 'b' 
to the string, so nothing is lost, and users are prevented from falling 
into the trap of believing that their explicitly escaped Unicode string 
will be passed as such into a char*-accepting function (however unlikely it 
is that someone might get that idea...).
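
The two workarounds mentioned above would look roughly like this (again 
just a sketch):

    cdef char* escaped  = "abc \\u0001"   # escaped backslash: the six bytes \ u 0 0 0 1
    cdef char* prefixed = b"abc \u0001"   # explicit bytes literal: \u is not an
                                          # escape in bytes, so the same six bytes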

Plus, when you use non-ASCII characters from the source code encoding in 
such a string, you will get exactly the byte sequence of the source code. I 
think that's expected, too.
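
For example, assuming a .pyx file saved as UTF-8 with a matching coding 
declaration, the behaviour described above would mean:

    # -*- coding: utf-8 -*-
    cdef char* word = "Bäcker"        # the 'ä' ends up as the two bytes 0xC3 0xA4,
                                      # exactly as they appear in the source file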

I really cannot see anything wrong with my fix.

Stefan