Re: comparing Unicode and string
On 2006-11-10, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Marc 'BlackJack' Rintsch wrote:
>> Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?
>
> Because getting an exception when comparing a string with a unicode string is irritating. But I don't insist on my PEP. The example just shows another pitfall with Unicode and why I'd advise any beginner: never write text constants that contain non-ASCII chars as simple strings; always make them Unicode strings by prepending the u.

That doesn't do any good if you aren't writing them in unicode code points, though.

--
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also be well-mannered. --Voltaire
--
http://mail.python.org/mailman/listinfo/python-list
Re: comparing Unicode and string
On 2006-11-10, John Machin [EMAIL PROTECTED] wrote:
> Neil Cerutti wrote:
>> On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
>>> Hello, here is something that surprises me.
>>>
>>>   #coding: iso-8859-1
>>
>> I think that's supposed to be:
>>
>>   # -*- coding: iso-8859-1 -*-
>
> Not quite. As PEP 263 says: "More precisely, the first or second line must match the regular expression coding[:=]\s*([-\w.]+)."

Yep. I was erroneously going by the example in the Unicode Howto. Thanks for the correction.

--
Neil Cerutti
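The point about PEP 263 can be checked directly. The following is a minimal sketch that applies the regular expression quoted above to both spellings of the declaration; the helper name `declared_encoding` is made up for illustration, and the pattern is taken from the quotation in this thread rather than from the interpreter's own source.

```python
import re

# Pattern as quoted from PEP 263 in the message above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named in a source line, or None if there is none.

    Hypothetical helper for illustration only; it is not how the
    interpreter itself parses the declaration.
    """
    match = CODING_RE.search(line)
    return match.group(1) if match else None


# Both spellings discussed in the thread satisfy the pattern.
short_form = declared_encoding("#coding: iso-8859-1")
emacs_form = declared_encoding("# -*- coding: iso-8859-1 -*-")
no_decl = declared_encoding("# just a comment")
```

Under that pattern the short form `#coding: iso-8859-1` is as valid as the Emacs-style form, which is John Machin's point.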
Re: comparing Unicode and string
Neil Cerutti wrote:
> On 2006-11-10, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
>> Marc 'BlackJack' Rintsch wrote:
>>> Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?
>>
>> Because getting an exception when comparing a string with a unicode string is irritating. But I don't insist on my PEP. The example just shows another pitfall with Unicode and why I'd advise any beginner: never write text constants that contain non-ASCII chars as simple strings; always make them Unicode strings by prepending the u.
>
> That doesn't do any good if you aren't writing them in unicode code points, though.

You tell the interpreter what encoding your source code is in. It then knows precisely how to decode your string literals into Unicode. How do you write things in Unicode code points?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden
Re: comparing Unicode and string
On 2006-11-10, Steve Holden [EMAIL PROTECTED] wrote:
>>> But I don't insist on my PEP. The example just shows another pitfall with Unicode and why I'd advise any beginner: never write text constants that contain non-ASCII chars as simple strings; always make them Unicode strings by prepending the u.
>>
>> That doesn't do any good if you aren't writing them in unicode code points, though.
>
> You tell the interpreter what encoding your source code is in. It then knows precisely how to decode your string literals into Unicode. How do you write things in Unicode code points?

for = u"f\xfcr"

--
Neil Cerutti
Re: comparing Unicode and string
Neil Cerutti wrote:
> On 2006-11-10, Steve Holden [EMAIL PROTECTED] wrote:
>>>> But I don't insist on my PEP. The example just shows another pitfall with Unicode and why I'd advise any beginner: never write text constants that contain non-ASCII chars as simple strings; always make them Unicode strings by prepending the u.
>>>
>>> That doesn't do any good if you aren't writing them in unicode code points, though.
>>
>> You tell the interpreter what encoding your source code is in. It then knows precisely how to decode your string literals into Unicode. How do you write things in Unicode code points?
>
> for = u"f\xfcr"

Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is the same as u"für":

    >>> u"f\xfcr" is u"für"
    True

So there is no need to write unicode strings in hexadecimal representation of code points.

-- Leo
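The equivalence between the escaped and the directly typed spelling can be sketched as follows. This is written for Python 3, where every plain string literal is already Unicode (roughly what u"..." means in the thread), and it assumes the source file is saved as UTF-8, the Python 3 default. The variable names are illustrative only.

```python
# Python 3 sketch: str literals are Unicode, like u"..." in the thread.
escaped = "f\xfcr"   # the u-umlaut written as a hexadecimal escape
literal = "für"      # the same character typed directly (file assumed UTF-8)

# Both spellings denote the same three code points.
same_text = (escaped == literal)

# The second character is U+00FC in either case.
code_point = ord(literal[1])
```

Equality is used here rather than `is`: whether two equal strings are also the same object depends on interning and is not something to rely on.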
Re: comparing Unicode and string
Marc 'BlackJack' Rintsch wrote:
> Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?

Because getting an exception when comparing a string with a unicode string is irritating.

But I don't insist on my PEP. The example just shows another pitfall with Unicode and why I'd advise any beginner: never write text constants that contain non-ASCII chars as simple strings; always make them Unicode strings by prepending the u.

Luc
Re: comparing Unicode and string
Neil Cerutti wrote:
> On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
>> Hello, here is something that surprises me.
>>
>>   #coding: iso-8859-1
>
> I think that's supposed to be:
>
>   # -*- coding: iso-8859-1 -*-

Not quite. As PEP 263 says: "More precisely, the first or second line must match the regular expression coding[:=]\s*([-\w.]+)."
Re: comparing Unicode and string
I didn't mean that the *assignment* should raise an exception. I mean that any string constant that cannot be decoded using sys.getdefaultencoding() should be considered a kind of syntax error.

I agree of course with the argument of backward compatibility, which means that my suggestion is for Python 3.0, not earlier.

And I admit that my suggestion lacks a solution for Neil Cerutti's use of non-decodable simple strings. And I admit that there are certainly more competent people than me to think about this question. I just wanted to throw my penny into the pond :-)

Luc

Fredrik Lundh wrote:
> [EMAIL PROTECTED] wrote:
>> Suggestion: shouldn't an error already be raised when I try to assign s2?
>
> variables are not typed in Python. plain assignment will never raise an exception.
>
> /F
Re: comparing Unicode and string
In [EMAIL PROTECTED], [EMAIL PROTECTED] wrote:
> I didn't mean that the *assignment* should raise an exception. I mean that any string constant that cannot be decoded using sys.getdefaultencoding() should be considered a kind of syntax error.

Why? Python strings are *byte strings* and bytes have values in the range 0..255. Why would you restrict them to ASCII only?

Ciao,
Marc 'BlackJack' Rintsch
Re: comparing Unicode and string
[EMAIL PROTECTED] wrote:
> Thanks, John and Neil, for your explanations.
>
> Still I find it rather difficult to explain to a Python beginner why this error occurs.
>
> Suggestion: shouldn't an error already be raised when I try to assign s2? A normal string should never be allowed to contain characters that are not codable using the system encoding. This test could be made at compile time and would render Python more didactic.

This is impossible because of backward compatibility: your suggestion would break a lot of existing programs. The change is planned to happen in Python 3.0, where it's OK to break backward compatibility if needed.

-- Leo.
Re: comparing Unicode and string
[EMAIL PROTECTED] wrote:
> Suggestion: shouldn't an error already be raised when I try to assign s2?

variables are not typed in Python. plain assignment will never raise an exception.

/F
Re: comparing Unicode and string
Thanks, John and Neil, for your explanations.

Still I find it rather difficult to explain to a Python beginner why this error occurs.

Suggestion: shouldn't an error already be raised when I try to assign s2? A normal string should never be allowed to contain characters that are not codable using the system encoding. This test could be made at compile time and would render Python more didactic.

Luc

[EMAIL PROTECTED] wrote:
> Hello, here is something that surprises me.
>
>   #coding: iso-8859-1
>   s1=u"Frau Müller machte große Augen"
>   s2="Frau Müller machte große Augen"
>   if s1 == s2:
>       pass
>
> Running this code produces a UnicodeDecodeError:
>
>   Traceback (most recent call last):
>     File "tmp.py", line 4, in ?
>       if s1 == s2:
>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that s1 == s2 gives True... or maybe False... but raising an error here is unnecessary. I guess that the comparison operator decides to convert s2 to a Unicode but forgets that I said #coding: iso-8859-1 at the beginning of the file.
>
> TIA for any comments.
> Luc Saffre
Re: comparing Unicode and string
On 2006-10-19, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Suggestion: shouldn't an error already be raised when I try to assign s2?

There's been discussion on pydev about changing this, but for now I believe a str is a sequence of bytes in Python, rather than a string of characters.

My current project (an implementation of the Glk API in Python) would be more troublesome to write if I had to store all my latin-1 character strings as lists or arrays of bytes.

--
Neil Cerutti
Re: comparing Unicode and string
[EMAIL PROTECTED] wrote:
> Hello, here is something that surprises me.
>
>   #coding: iso-8859-1
>   s1=u"Frau Müller machte große Augen"
>   s2="Frau Müller machte große Augen"
>   if s1 == s2:
>       pass
>
> Running this code produces a UnicodeDecodeError:
>
>   Traceback (most recent call last):
>     File "tmp.py", line 4, in ?
>       if s1 == s2:
>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that s1 == s2 gives True... or maybe False... but raising an error here is unnecessary. I guess that the comparison operator decides to convert s2 to a Unicode but forgets that I said #coding: iso-8859-1 at the beginning of the file.

The #coding declaration is not effective at runtime. It's there strictly to guide the compiler in how to compile byte strings. The default encoding at run time is ascii unless it's been set to something else, which is why the error message specifies ascii.

John Roth

> TIA for any comments.
> Luc Saffre
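What happens during the comparison can be reproduced with explicit decoding. The sketch below is written for Python 3, which no longer converts implicitly, so the ASCII decode that Python 2 attempted behind the scenes is spelled out by hand. The variable names mirror the thread; `failure` and `equal_after_decode` are illustrative additions.

```python
# The Unicode string from the thread, with the non-ASCII characters escaped.
s1 = "Frau M\xfcller machte gro\xdfe Augen"

# The byte string as it would be stored from an iso-8859-1 source file.
s2 = s1.encode("iso-8859-1")

# Decoding with the ASCII codec, as the Python 2 default did, fails
# on the first byte outside the range 0..127.
failure = None
try:
    s2.decode("ascii")
except UnicodeDecodeError as exc:
    failure = (exc.encoding, exc.start, s2[exc.start])

# Decoding with the encoding the bytes were actually written in works.
equal_after_decode = (s1 == s2.decode("iso-8859-1"))
```

The captured details match the traceback above: the ascii codec, position 6, byte 0xfc. Naming the encoding explicitly when decoding is what makes the two values comparable.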
Re: comparing Unicode and string
On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> Hello, here is something that surprises me.
>
>   #coding: iso-8859-1

I think that's supposed to be:

  # -*- coding: iso-8859-1 -*-

The special comment changes only the encoding of unicode literals. In particular, it doesn't change the default encoding of str literals.

>   s1=u"Frau Müller machte große Augen"
>   s2="Frau Müller machte große Augen"
>   if s1 == s2:
>       pass

On my machine, the ü and ß in s2 are being stored in the code points of my terminal's encoding, cp437. Unfortunately cp437 code points from 127-255 are not the same as those in iso-8859-1.

To fix this, I have to do the following:

    >>> s1 == s2.decode('cp437')
    True

> Running this code produces a UnicodeDecodeError:
>
>   Traceback (most recent call last):
>     File "tmp.py", line 4, in ?
>       if s1 == s2:
>   UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6: ordinal not in range(128)
>
> I would have expected that s1 == s2 gives True... or maybe False... but raising an error here is unnecessary. I guess that the comparison operator decides to convert s2 to a Unicode but forgets that I said #coding: iso-8859-1 at the beginning of the file.

It's trying to interpret s2 as ascii, and failing, since 129 and 225 code points are out of range.

--
Neil Cerutti
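The difference between the two encodings mentioned above can be made concrete. This is a small Python 3 sketch using the standard cp437 and iso-8859-1 codecs; the sample text and variable names are chosen for illustration.

```python
# A short sample containing both characters under discussion.
text = "M\xfcller gro\xdfe"

# The same characters become different bytes in the two encodings.
as_cp437 = text.encode("cp437")
as_latin1 = text.encode("iso-8859-1")

# Decoding with the matching codec restores the text ...
round_trip = (as_cp437.decode("cp437") == text)

# ... while decoding cp437 bytes as iso-8859-1 silently yields other characters.
mismatch = (as_cp437.decode("iso-8859-1") == text)
```

In cp437 the ü and ß are bytes 129 and 225, the two values named in the message, whereas in iso-8859-1 they are 0xfc and 0xdf. That is why the byte reported in the traceback (0xfc) differs from the ones seen on a cp437 terminal.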