Re: comparing Unicode and string

2006-11-10 Thread Neil Cerutti
On 2006-11-10, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Marc 'BlackJack' Rintsch wrote:
 Why?  Python strings are *byte strings* and bytes have values in the range
 0..255.  Why would you restrict them to ASCII only?

 Because getting an exception when comparing a string with a unicode
 string is irritating.

 But I don't insist on my PEP. The example just shows
 another pitfall with Unicode and why I'll advise any
 beginner: Never write text constants that contain non-ASCII
 chars as simple strings; always make them Unicode strings by
 prepending the u.

That doesn't do any good if you aren't writing them in unicode
code points, though.

-- 
Neil Cerutti
To succeed in the world it is not enough to be stupid, you must also
be well-mannered. --Voltaire
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: comparing Unicode and string

2006-11-10 Thread Neil Cerutti
On 2006-11-10, John Machin [EMAIL PROTECTED] wrote:

 Neil Cerutti wrote:
 On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Hello,
 
  here is something that surprises me.
 
#coding: iso-8859-1

 I think that's supposed to be:

 # -*- coding: iso-8859-1 -*-


 Not quite. As PEP 263 says:

 
 More precisely, the first or second line must match the regular
 expression coding[:=]\s*([-\w.]+). 
 

Yep. I was erroneously going by the example in the Unicode Howto.
Thanks for the correction.

-- 
Neil Cerutti


Re: comparing Unicode and string

2006-11-10 Thread Steve Holden
Neil Cerutti wrote:
 On 2006-11-10, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Marc 'BlackJack' Rintsch wrote:
 Why?  Python strings are *byte strings* and bytes have values in the range
 0..255.  Why would you restrict them to ASCII only?
 Because getting an exception when comparing a string with a unicode
 string is irritating.

 But I don't insist on my PEP. The example just shows
 another pitfall with Unicode and why I'll advise any
 beginner: Never write text constants that contain non-ASCII
 chars as simple strings; always make them Unicode strings by
 prepending the u.
 
 That doesn't do any good if you aren't writing them in unicode
 code points, though.
 
You tell the interpreter what encoding your source code is in. It then 
knows precisely how to decode your string literals into Unicode. How do 
you write things in Unicode code points?

regards
  Steve
-- 
Steve Holden   +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd  http://www.holdenweb.com
Skype: holdenweb   http://holdenweb.blogspot.com
Recent Ramblings http://del.icio.us/steve.holden



Re: comparing Unicode and string

2006-11-10 Thread Neil Cerutti
On 2006-11-10, Steve Holden [EMAIL PROTECTED] wrote:
 But I don't insist on my PEP. The example just shows
 another pitfall with Unicode and why I'll advise any
 beginner: Never write text constants that contain non-ASCII
 chars as simple strings; always make them Unicode strings by
 prepending the u.
 
 That doesn't do any good if you aren't writing them in unicode
 code points, though.

 You tell the interpreter what encoding your source code is in.
 It then knows precisely how to decode your string literals into
 Unicode. How do you write things in Unicode code points?

for_ = u"f\xfcr"

-- 
Neil Cerutti


Re: comparing Unicode and string

2006-11-10 Thread Leo Kislov
Neil Cerutti wrote:
 On 2006-11-10, Steve Holden [EMAIL PROTECTED] wrote:
  But I don't insist on my PEP. The example just shows
  another pitfall with Unicode and why I'll advise any
  beginner: Never write text constants that contain non-ASCII
  chars as simple strings; always make them Unicode strings by
  prepending the u.
 
  That doesn't do any good if you aren't writing them in unicode
  code points, though.
 
  You tell the interpreter what encoding your source code is in.
  It then knows precisely how to decode your string literals into
  Unicode. How do you write things in Unicode code points?

 for_ = u"f\xfcr"

Unless you're using a Unicode-unfriendly editor or console, u"f\xfcr" is
the same as u"für":

>>> u"f\xfcr" == u"für"
True

So there is no need to write unicode strings in hexadecimal
representation of code points.
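
The point can be checked without relying on any particular editor. What follows is only a sketch in present-day Python 3, not the Python 2 session above; it assumes that `compile()` honours the coding declaration when given bytes, and the names `source_bytes` and `namespace` are made up for illustration.

```python
# Source text as it would sit on disk in ISO-8859-1: the ü is the single byte 0xFC.
source_bytes = (
    b"# coding: iso-8859-1\n"
    b"literal = 'f\xfcr'\n"      # raw ü byte, decoded via the declaration
    b"escaped = 'f\\xfcr'\n"     # backslash escape naming code point U+00FC
)

# Compile and run the source; the compiler decodes it as ISO-8859-1.
namespace = {}
exec(compile(source_bytes, "<demo>", "exec"), namespace)

# Both spellings should produce the same three-character string.
same = namespace["literal"] == namespace["escaped"]
print(same)
```

If the declaration named a different encoding than the one the file was actually saved in, the literal form would decode to other characters, while the escaped form would stay the same.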

  -- Leo


Re: comparing Unicode and string

2006-11-09 Thread [EMAIL PROTECTED]
Marc 'BlackJack' Rintsch wrote:
 Why?  Python strings are *byte strings* and bytes have values in the range
 0..255.  Why would you restrict them to ASCII only?

Because getting an exception when comparing a string with a unicode
string is irritating.

But I don't insist on my PEP. The example just shows another
pitfall with Unicode and why I'll advise any beginner: Never write
text constants that contain non-ASCII chars as simple strings; always
make them Unicode strings by prepending the u.

Luc



Re: comparing Unicode and string

2006-11-09 Thread John Machin

Neil Cerutti wrote:
 On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
  Hello,
 
  here is something that surprises me.
 
#coding: iso-8859-1

 I think that's supposed to be:

 # -*- coding: iso-8859-1 -*-


Not quite. As PEP 263 says:


More precisely, the first or second line must match the regular
expression coding[:=]\s*([-\w.]+). 
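
One way to see which spellings satisfy that expression is to try it with the `re` module. This is only a sketch in modern Python; the helper name `declared_encoding` is hypothetical and is not part of PEP 263 or of the interpreter.

```python
import re

# Pattern as quoted from PEP 263 above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding named in a source line, or None (hypothetical helper)."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None

# The short form from the original post and the Emacs-style form both match.
short_form = declared_encoding("#coding: iso-8859-1")
emacs_form = declared_encoding("# -*- coding: iso-8859-1 -*-")
no_cookie = declared_encoding("# just a comment")

print(short_form, emacs_form, no_cookie)
```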




Re: comparing Unicode and string

2006-10-23 Thread [EMAIL PROTECTED]
I didn't mean that the *assignment* should raise an exception. I mean that
any string constant that cannot be decoded using
sys.getdefaultencoding() should be considered a kind of syntax error.

I agree of course with the argument of backward compatibility, which
means that my suggestion is for Python 3.0, not earlier.

And I admit that my suggestion lacks a solution for Neil Cerutti's use
of non-decodable simple strings. And I admit that there are certainly
more competent people than me to think about this question. I just
wanted to throw my penny into the pond :-)

Luc

Fredrik Lundh wrote:
 [EMAIL PROTECTED] wrote:

  Suggestion: shouldn't an error raise already when I try to assign s2?

 variables are not typed in Python.  plain assignment will never raise an
 exception.
 
 /F



Re: comparing Unicode and string

2006-10-23 Thread Marc 'BlackJack' Rintsch
In [EMAIL PROTECTED],
[EMAIL PROTECTED] wrote:

 I didn't mean that the *assignment* should raise exception. I mean that
 any string constant that cannot be decoded using
 sys.getdefaultencoding() should be considered a kind of syntax error.

Why?  Python strings are *byte strings* and bytes have values in the range
0..255.  Why would you restrict them to ASCII only?
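
To make the 0..255 versus ASCII distinction concrete, here is a small sketch. It assumes Python 3, where the byte-string type is spelled `bytes`; the variable names are illustrative only.

```python
all_bytes = bytes(range(256))          # every possible byte value, 0..255

# Count how many single bytes the ASCII codec can decode on its own.
ascii_ok = 0
for value in all_bytes:
    try:
        bytes([value]).decode("ascii")
        ascii_ok += 1
    except UnicodeDecodeError:
        pass

print(len(all_bytes), ascii_ok)
```

Only the lower half of the range is ASCII; the rest are perfectly valid bytes that simply have no meaning until an encoding is named.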

Ciao,
Marc 'BlackJack' Rintsch


Re: comparing Unicode and string

2006-10-20 Thread Leo Kislov

[EMAIL PROTECTED] wrote:
 Thanks, John and Neil, for your explanations.

 Still I find it rather difficult to explain to a Python beginner why
 this error occurs.

 Suggestion: shouldn't an error be raised already when I try to assign s2? A
 normal string should never be allowed to contain characters that are
 not encodable using the system encoding. This test could be made at
 compile time and would render Python more didactic.

This is impossible because of backward compatibility: your suggestion
would break a lot of existing programs. The change is planned for
Python 3.0, where it's OK to break backward compatibility if needed.

  -- Leo.



Re: comparing Unicode and string

2006-10-20 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

 Suggestion: shouldn't an error raise already when I try to assign s2?

variables are not typed in Python.  plain assignment will never raise an
exception.

/F 





Re: comparing Unicode and string

2006-10-19 Thread [EMAIL PROTECTED]
Thanks, John and Neil, for your explanations.

Still I find it rather difficult to explain to a Python beginner why
this error occurs.

Suggestion: shouldn't an error be raised already when I try to assign s2? A
normal string should never be allowed to contain characters that are
not encodable using the system encoding. This test could be made at
compile time and would render Python more didactic.

Luc

[EMAIL PROTECTED] schrieb:

 Hello,

 here is something that surprises me.

   #coding: iso-8859-1
   s1=u"Frau Müller machte große Augen"
   s2="Frau Müller machte große Augen"
   if s1 == s2:
       pass

 Running this code produces a UnicodeDecodeError:

 Traceback (most recent call last):
   File "tmp.py", line 4, in ?
     if s1 == s2:
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
 ordinal not in range(128)

 I would have expected that s1 == s2 gives True... or maybe False...
 but raising an error here is unnecessary. I guess that the comparison
 operator decides to convert s2 to a Unicode but forgets that I said
 #coding: iso-8859-1 at the beginning of the file.
 
 TIA for any comments.
 
 Luc Saffre



Re: comparing Unicode and string

2006-10-19 Thread Neil Cerutti
On 2006-10-19, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Suggestion: shouldn't an error raise already when I try to
 assign s2?

There's been discussion on pydev about changing this, but for now
I believe a str is a sequence of bytes in Python, rather than a
string of characters. My current project (an implementation of
the Glk API in Python) would be more troublesome to write if I
had to store all my latin-1 character strings as lists or arrays
of bytes.

-- 
Neil Cerutti


Re: comparing Unicode and string

2006-10-16 Thread John Roth

[EMAIL PROTECTED] wrote:
 Hello,

 here is something that surprises me.

   #coding: iso-8859-1
   s1=u"Frau Müller machte große Augen"
   s2="Frau Müller machte große Augen"
   if s1 == s2:
       pass

 Running this code produces a UnicodeDecodeError:

 Traceback (most recent call last):
   File "tmp.py", line 4, in ?
     if s1 == s2:
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
 ordinal not in range(128)

 I would have expected that s1 == s2 gives True... or maybe False...
 but raising an error here is unnecessary. I guess that the comparison
 operator decides to convert s2 to a Unicode but forgets that I said
 #coding: iso-8859-1 at the beginning of the file.

The #coding declaration has no effect at run time. It's
there strictly to tell the compiler how to decode the source
file, which matters for Unicode literals; the bytes of plain
string literals are kept as written.

The default encoding at run time is ascii unless
it's been set to something else, which is why the
error message specifies ascii.
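
The failing step can be reproduced by doing explicitly what the comparison attempted implicitly. This is a sketch in Python 3, which no longer performs that implicit conversion, so the ASCII decode is spelled out by hand; the variable `error` is illustrative.

```python
# s2 as a byte string: the text encoded in ISO-8859-1, as in the original file.
s1 = "Frau Müller machte große Augen"
s2 = s1.encode("iso-8859-1")

# What the Python 2 comparison tried implicitly: decode with the ASCII default.
try:
    s2.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Naming the real encoding makes the conversion succeed.
matches = s1 == s2.decode("iso-8859-1")

print(error)
print(matches)
```

The exception should point at the same byte and position as the traceback in the original post, since the ü is the seventh character of the string.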

John Roth

 
 TIA for any comments.
 
 Luc Saffre



Re: comparing Unicode and string

2006-10-16 Thread Neil Cerutti
On 2006-10-16, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 Hello,

 here is something that surprises me.

   #coding: iso-8859-1

I think that's supposed to be:

# -*- coding: iso-8859-1 -*-

The special comment changes only the encoding of unicode
literals. In particular, it doesn't change the default encoding
of str literals.

   s1=u"Frau Müller machte große Augen"
   s2="Frau Müller machte große Augen"
   if s1 == s2:
       pass

On my machine, the ü and ß in s2 are being stored in the code
points of my terminal's encoding, cp437. Unfortunately, cp437 code
points from 128-255 are not the same as those in iso-8859-1.

To fix this, I have to do the following:

>>> s1 == s2.decode('cp437')
True

 Running this code produces a UnicodeDecodeError:

 Traceback (most recent call last):
   File "tmp.py", line 4, in ?
     if s1 == s2:
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 6:
 ordinal not in range(128)

 I would have expected that s1 == s2 gives True... or maybe
 False... but raising an error here is unnecessary. I guess that
 the comparison operator decides to convert s2 to a Unicode but
 forgets that I said #coding: iso-8859-1 at the beginning of the
 file.

It's trying to interpret s2 as ascii, and failing, since the byte
values 129 and 225 (ü and ß in cp437) are outside the ASCII range
0..127.
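
The difference between the two code pages can be seen by encoding the same two characters both ways. A minimal sketch, assuming Python 3 and its standard `cp437` and `iso-8859-1` codecs; the variable names are made up for illustration.

```python
text = "üß"

cp437_bytes = text.encode("cp437")         # byte values 129 and 225
latin1_bytes = text.encode("iso-8859-1")   # byte values 252 and 223

# Decoding with the wrong codec succeeds silently but yields other characters.
misread = cp437_bytes.decode("iso-8859-1")

# Decoding with the codec that produced the bytes restores the text.
roundtrip = cp437_bytes.decode("cp437")

print(list(cp437_bytes), list(latin1_bytes))
print(roundtrip == text, misread == text)
```

This is why the explicit decode has to name the encoding the bytes were really written in, not the one the file header claims.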

-- 
Neil Cerutti