Re: Problem with sets and Unicode strings

2006-06-29 Thread Dennis Benzinger
Diez B. Roggisch wrote:
 But I'd say that it's not intuitive that for sets x in y can be false
 (without raising an exception!) while doing the same with a tuple
 raises an exception. Where is this difference documented?
 
 2.3.7 Set Types -- set, frozenset
 
 ...
 
 Set elements are like dictionary keys; they need to define both __hash__ and
 __eq__ methods.
 ...
 
 And it has to hold that
 
 a == b => hash(a) == hash(b)
 
 but NOT
 
 hash(a) == hash(b) => a == b
 
 Thus if the hashes vary, the set doesn't bother to actually compare the
 values.
 [...]

Ok, I understand.
But isn't it a (minor) problem that using a set like this:

# -*- coding: UTF-8 -*-

FIELDS_SET = set(("Fächer", ))


print u"Fächer" in FIELDS_SET
print u"Fächer" == "Fächer"


shadows the error of not calling sys.setdefaultencoding()?


Dennis
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Problem with sets and Unicode strings

2006-06-29 Thread Robert Kern
Dennis Benzinger wrote:
 Ok, I understand.
 But isn't it a (minor) problem that using a set like this:
 
 # -*- coding: UTF-8 -*-
 
 FIELDS_SET = set(("Fächer", ))
 
 print u"Fächer" in FIELDS_SET
 print u"Fächer" == "Fächer"
 
 shadows the error of not calling sys.setdefaultencoding()?

You can't set the default encoding. If you could, then scripts that run on your 
machine wouldn't run on mine.

If there's an error, it's the fact that you use a regular string at the 
beginning ("Fächer") and a unicode string later (u"Fächer"). But set objects 
can't know that that's the problem or even if it *is* a problem.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth.
   -- Umberto Eco


Re: Problem with sets and Unicode strings

2006-06-29 Thread Dennis Benzinger
Robert Kern wrote:
 Dennis Benzinger wrote:
 Ok, I understand.
 But isn't it a (minor) problem that using a set like this:

 # -*- coding: UTF-8 -*-

 FIELDS_SET = set(("Fächer", ))

 print u"Fächer" in FIELDS_SET
 print u"Fächer" == "Fächer"

 shadows the error of not calling sys.setdefaultencoding()?
 
 You can't set the default encoding. If you could, then scripts that run 
 on your machine wouldn't run on mine.
 [...]

As Serge Orlov wrote in one of his posts you _can_ set the default 
encoding (at least in site.py). See 
http://docs.python.org/lib/module-sys.html


Bye,
Dennis

Re: Problem with sets and Unicode strings

2006-06-29 Thread Robert Kern
Dennis Benzinger wrote:
 Robert Kern wrote:
 Dennis Benzinger wrote:
 Ok, I understand.
 But isn't it a (minor) problem that using a set like this:

 # -*- coding: UTF-8 -*-

 FIELDS_SET = set(("Fächer", ))

 print u"Fächer" in FIELDS_SET
 print u"Fächer" == "Fächer"

 shadows the error of not calling sys.setdefaultencoding()?
 You can't set the default encoding. If you could, then scripts that run 
 on your machine wouldn't run on mine.
 [...]
 
 As Serge Orlov wrote in one of his posts you _can_ set the default 
 encoding (at least in site.py). See 
 http://docs.python.org/lib/module-sys.html

Okay, *don't* set the default encoding to anything other than 'ascii'. Doing so 
would be an error, not the other way around.

-- 
Robert Kern



Re: Problem with sets and Unicode strings

2006-06-29 Thread Jean-Paul Calderone
On Thu, 29 Jun 2006 21:19:30 +0200, Dennis Benzinger [EMAIL PROTECTED] wrote:
Robert Kern wrote:
 Dennis Benzinger wrote:
 Ok, I understand.
 But isn't it a (minor) problem that using a set like this:

 # -*- coding: UTF-8 -*-

 FIELDS_SET = set(("Fächer", ))

 print u"Fächer" in FIELDS_SET
 print u"Fächer" == "Fächer"

 shadows the error of not calling sys.setdefaultencoding()?

 You can't set the default encoding. If you could, then scripts that run
 on your machine wouldn't run on mine.
 [...]

As Serge Orlov wrote in one of his posts you _can_ set the default
encoding (at least in site.py). See
http://docs.python.org/lib/module-sys.html

But doing so is not useful so one should generally never do it.  You
cannot set the default encoding on any computers you don't directly
control, so any software you write which depends on this will not be
easily distributable.  Additionally, if you decide to use two packages
which use this feature and go to the trouble of modifying your own
site.py for them, you won't be able to, since there can only be one
default system encoding.  Only one will be able to work at a time.

The default encoding is ascii and should always be ascii.  If you want
another encoding, specify it in a call to .encode() or .decode().
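
A minimal sketch of what "specify it in a call to .encode() or .decode()"
can look like. It is written so that it also runs on Python 3; the variable
names are made up for illustration, and the two codecs are only examples.

```python
# Text is kept as a unicode string; bytes are produced only on request,
# and the codec is always named explicitly instead of relying on a default.
text = u"F\u00e4cher"  # u"Fächer"

utf8_bytes = text.encode("utf-8")      # two bytes for the a-umlaut
latin1_bytes = text.encode("latin-1")  # one byte for the a-umlaut

# Decoding names the same codec that was used for encoding.
round_trip_utf8 = utf8_bytes.decode("utf-8")
round_trip_latin1 = latin1_bytes.decode("latin-1")
```

Because the codec is named at each call, the result does not depend on
how the interpreter on a given machine was configured.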

Jean-Paul

Re: Problem with sets and Unicode strings

2006-06-29 Thread Fredrik Lundh
Dennis Benzinger wrote:

 shadows the error of not calling sys.setdefaultencoding()?
 
 You can't set the default encoding. If you could, then scripts that run 
 on your machine wouldn't run on mine.
 [...]
 
 As Serge Orlov wrote in one of his posts you _can_ set the default 
 encoding (at least in site.py). See 
 http://docs.python.org/lib/module-sys.html

yes, but you're not supposed to do that, for several reasons, including 
the reasons Robert provided: if you mess with the interpreter defaults, 
code you write isn't portable, and code written by others may not work 
on your machine.

the interpreter isn't fully encoding agnostic either; things are not 
guaranteed to work properly if you're not using the default.

/F



Re: Problem with sets and Unicode strings

2006-06-28 Thread Laurent Pointal
Dennis Benzinger wrote:
 No, byte strings contain characters which are at least 8-bit wide
 http://docs.python.org/ref/types.html. But I don't understand what
 Python is trying to decode and why the exception says something about
 the ASCII codec, because my file is encoded with UTF-8.

[addendum to others' replies]

The file encoding directive is used by Python to convert u"xxx" literals
into unicode objects, using the right conversion rules, when compiling the
code. A string written simply as "xxx" is an 8-bit string with NO encoding
data associated. When such strings must be converted, they are assumed to
use sys.getdefaultencoding() [generally ascii - forced ascii in python 2.5].

So a short reply: the utf8 directive has no effect on 8-bit strings;
use unicode strings to handle non-ascii text correctly.

A+

Laurent.



Re: Problem with sets and Unicode strings

2006-06-28 Thread Dennis Benzinger
Serge Orlov wrote:
 On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Serge Orlov wrote:
  On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
  Hi!
 
  The following program in an UTF-8 encoded file:
 
 
  # -*- coding: UTF-8 -*-
 
  FIELDS = ("Fächer", )
  FROZEN_FIELDS = frozenset(FIELDS)
  FIELDS_SET = set(FIELDS)
 
  print u"Fächer" in FROZEN_FIELDS
  print u"Fächer" in FIELDS_SET
  print u"Fächer" in FIELDS
 
 
  gives this output
 
 
  False
  False
  Traceback (most recent call last):
    File "test.py", line 9, in ?
      print u"Fächer" in FIELDS
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
  ordinal not in range(128)
 
 
  Why do the first two print statements succeed and the third one fails
  with an exception?
 
  Actually all three statements fail to produce correct result.

 So this is a bug in Python?
 
 No.
 
  Why does the use of set/frozenset remove the exception?
 
  Because sets use hash algorithm to find matches, whereas the last
  statement directly compares a unicode string with a byte string. Byte
  strings can only contain ascii characters, that's why python raises an
  exception. The problem is very easy to fix: use unicode strings for
  all non-ascii strings.

 No, byte strings contain characters which are at least 8-bit wide
 http://docs.python.org/ref/types.html.
 
 Yes, but later it's written that non-ascii characters do not have
 universal meaning assigned to them. In other words if you put byte
 0xE4 into a bytes string all python knows about it is that it's *some*
 character. If you put character U+00E4 into a unicode string python
 knows it's a latin small letter a with diaeresis. Trying to compare
 *some* character with a specific character is obviously undefined.
  [...]

But http://docs.python.org/ref/comparisons.html says:

Strings are compared lexicographically using the numeric equivalents 
(the result of the built-in function ord()) of their characters. Unicode 
and 8-bit strings are fully interoperable in this behavior.

Doesn't this mean that Unicode and 8-bit strings can be compared and 
this comparison is well defined? (even if it is not meaningful)



Thanks for your answers,
Dennis


Re: Problem with sets and Unicode strings

2006-06-28 Thread Dennis Benzinger
Robert Kern wrote:
 Dennis Benzinger wrote:
 Serge Orlov wrote:
 On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Hi!

 The following program in an UTF-8 encoded file:


 # -*- coding: UTF-8 -*-

 FIELDS = ("Fächer", )
 FROZEN_FIELDS = frozenset(FIELDS)
 FIELDS_SET = set(FIELDS)

 print u"Fächer" in FROZEN_FIELDS
 print u"Fächer" in FIELDS_SET
 print u"Fächer" in FIELDS


 gives this output


 False
 False
 Traceback (most recent call last):
   File "test.py", line 9, in ?
     print u"Fächer" in FIELDS
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
 ordinal not in range(128)


 Why do the first two print statements succeed and the third one fails
 with an exception?
 Actually all three statements fail to produce correct result.

 So this is a bug in Python?
 
 No.
 [...]

But I'd say that it's not intuitive that for sets x in y can be false 
(without raising an exception!) while doing the same with a tuple 
raises an exception. Where is this difference documented?


Thanks,
Dennis

Re: Problem with sets and Unicode strings

2006-06-28 Thread Diez B. Roggisch
 But http://docs.python.org/ref/comparisons.html says:
 
 Strings are compared lexicographically using the numeric equivalents
 (the result of the built-in function ord()) of their characters. Unicode
 and 8-bit strings are fully interoperable in this behavior.
 
 Doesn't this mean that Unicode and 8-bit strings can be compared and
 this comparison is well defined? (even if it is not meaningful)

Obviously not - otherwise you wouldn't have the problems you observed,
would you?

What happens, of course, is that in a string-to-unicode comparison the
string gets coerced to a unicode value - using the default encoding!


# -*- coding: latin1 -*-

print "ö".decode("latin1") == u"ö"
print "ö" == u"ö"



So - they are fully interoperable and the comparison is well defined - when
the coercion is successful.
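
The coercion step can be spelled out explicitly. The sketch below is an
illustration only: coerce_and_compare is a hypothetical helper, not
something Python provides, and it is written so that it also runs on
Python 3, where bytes are no longer coerced implicitly when compared with
text.

```python
def coerce_and_compare(raw, text, encoding="ascii"):
    """Decode raw bytes with the given codec, then compare with text.

    Hypothetical helper: it makes visible the decoding step that
    Python 2 performs implicitly with the default encoding.
    """
    return raw.decode(encoding) == text


raw = u"\u00f6".encode("latin-1")  # the single byte 0xF6, i.e. "ö" in latin1

# With the right codec the coercion succeeds and the comparison is defined.
explicit_result = coerce_and_compare(raw, u"\u00f6", encoding="latin-1")

# With the ASCII default the coercion itself fails before any comparison.
try:
    coerce_and_compare(raw, u"\u00f6")
    default_failed = False
except UnicodeDecodeError:
    default_failed = True
```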

Diez

Re: Problem with sets and Unicode strings

2006-06-28 Thread Diez B. Roggisch
 But I'd say that it's not intuitive that for sets x in y can be false
 (without raising an exception!) while doing the same with a tuple
 raises an exception. Where is this difference documented?

2.3.7 Set Types -- set, frozenset

...

Set elements are like dictionary keys; they need to define both __hash__ and
__eq__ methods.
...

And it has to hold that

a == b => hash(a) == hash(b)

but NOT

hash(a) == hash(b) => a == b

Thus if the hashes vary, the set doesn't bother to actually compare the
values.
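
A small sketch of that behaviour, assuming CPython's set implementation.
The class Key and its eq_calls counter are hypothetical, introduced only
to make the calls to __eq__ visible; the hash values are chosen by hand.

```python
class Key(object):
    """Hypothetical key type that counts how often __eq__ is called."""
    eq_calls = 0

    def __init__(self, value, hash_value):
        self.value = value
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


stored = Key("Faecher", hash_value=1)
same_value_other_hash = Key("Faecher", hash_value=2)
same_value_same_hash = Key("Faecher", hash_value=1)

keys = set([stored])

# Hashes differ, so the set answers "not found" without calling __eq__.
found_other_hash = same_value_other_hash in keys
calls_after_other_hash = Key.eq_calls

# Hashes match, so the set falls back to __eq__ to confirm the match.
found_same_hash = same_value_same_hash in keys
calls_after_same_hash = Key.eq_calls

# A tuple has no hash table: containment compares elements with ==.
found_in_tuple = same_value_other_hash in (stored,)
```

The tuple test finds the element because it goes straight to ==, which is
also why only the tuple case reaches the failing string comparison.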

Diez


Re: Problem with sets and Unicode strings

2006-06-27 Thread Serge Orlov
On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Hi!

 The following program in an UTF-8 encoded file:


 # -*- coding: UTF-8 -*-

 FIELDS = ("Fächer", )
 FROZEN_FIELDS = frozenset(FIELDS)
 FIELDS_SET = set(FIELDS)

 print u"Fächer" in FROZEN_FIELDS
 print u"Fächer" in FIELDS_SET
 print u"Fächer" in FIELDS


 gives this output


 False
 False
 Traceback (most recent call last):
   File "test.py", line 9, in ?
     print u"Fächer" in FIELDS
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
 ordinal not in range(128)


 Why do the first two print statements succeed and the third one fails
 with an exception?

Actually all three statements fail to produce the correct result.

 Why does the use of set/frozenset remove the exception?

Because sets use a hash algorithm to find matches, whereas the last
statement directly compares a unicode string with a byte string. Byte
strings can only contain ascii characters, that's why python raises an
exception. The problem is very easy to fix: use unicode strings for
all non-ascii strings.
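
A sketch of the suggested fix applied to the program from the question:
every literal carries the u prefix, so all three containment tests compare
unicode with unicode. It uses print() and intermediate variable names (not
in the original program) so that it also runs on Python 3.

```python
# -*- coding: UTF-8 -*-

FIELDS = (u"Fächer", )
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

# All three tests now compare unicode strings with unicode strings.
in_frozenset = u"Fächer" in FROZEN_FIELDS
in_set = u"Fächer" in FIELDS_SET
in_tuple = u"Fächer" in FIELDS

print(in_frozenset)
print(in_set)
print(in_tuple)
```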


Re: Problem with sets and Unicode strings

2006-06-27 Thread Dennis Benzinger
Serge Orlov wrote:
 On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Hi!

 The following program in an UTF-8 encoded file:


 # -*- coding: UTF-8 -*-

 FIELDS = ("Fächer", )
 FROZEN_FIELDS = frozenset(FIELDS)
 FIELDS_SET = set(FIELDS)

 print u"Fächer" in FROZEN_FIELDS
 print u"Fächer" in FIELDS_SET
 print u"Fächer" in FIELDS


 gives this output


 False
 False
 Traceback (most recent call last):
   File "test.py", line 9, in ?
     print u"Fächer" in FIELDS
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
 ordinal not in range(128)


 Why do the first two print statements succeed and the third one fails
 with an exception?
 
 Actually all three statements fail to produce correct result.

So this is a bug in Python?

 Why does the use of set/frozenset remove the exception?
 
 Because sets use hash algorithm to find matches, whereas the last
 statement directly compares a unicode string with a byte string. Byte
 strings can only contain ascii characters, that's why python raises an
 exception. The problem is very easy to fix: use unicode strings for
 all non-ascii strings.

No, byte strings contain characters which are at least 8-bit wide 
http://docs.python.org/ref/types.html. But I don't understand what 
Python is trying to decode and why the exception says something about 
the ASCII codec, because my file is encoded with UTF-8.


Dennis


Re: Problem with sets and Unicode strings

2006-06-27 Thread Serge Orlov
On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Serge Orlov wrote:
  On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
  Hi!
 
  The following program in an UTF-8 encoded file:
 
 
  # -*- coding: UTF-8 -*-
 
  FIELDS = ("Fächer", )
  FROZEN_FIELDS = frozenset(FIELDS)
  FIELDS_SET = set(FIELDS)
 
  print u"Fächer" in FROZEN_FIELDS
  print u"Fächer" in FIELDS_SET
  print u"Fächer" in FIELDS
 
 
  gives this output
 
 
  False
  False
  Traceback (most recent call last):
    File "test.py", line 9, in ?
      print u"Fächer" in FIELDS
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
  ordinal not in range(128)
 
 
  Why do the first two print statements succeed and the third one fails
  with an exception?
 
  Actually all three statements fail to produce correct result.

 So this is a bug in Python?

No.

  Why does the use of set/frozenset remove the exception?
 
  Because sets use hash algorithm to find matches, whereas the last
  statement directly compares a unicode string with a byte string. Byte
  strings can only contain ascii characters, that's why python raises an
  exception. The problem is very easy to fix: use unicode strings for
  all non-ascii strings.

 No, byte strings contain characters which are at least 8-bit wide
 http://docs.python.org/ref/types.html.

Yes, but later it's written that non-ascii characters do not have
universal meaning assigned to them. In other words if you put byte
0xE4 into a byte string, all python knows about it is that it's *some*
character. If you put character U+00E4 into a unicode string python
knows it's a latin small letter a with diaeresis. Trying to compare
*some* character with a specific character is obviously undefined.
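
To illustrate why a lone byte 0xE4 is only *some* character: the same byte
decodes to different characters depending on the codec you pick. The three
codecs below are arbitrary examples chosen for this sketch, and the code is
written so that it also runs on Python 3.

```python
raw = b"\xe4"  # one byte, no encoding information attached

# Decode the very same byte under three different single-byte codecs.
decoded = {}
for encoding in ("latin-1", "cp1251", "cp437"):
    decoded[encoding] = raw.decode(encoding)

# Each codec maps the byte to a different character.
distinct_characters = set(decoded.values())
```

Only under latin-1 does the byte come out as U+00E4, so comparing it with
u"\u00e4" has no single right answer until an encoding is named.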

 But I don't understand what
 Python is trying to decode and why the exception says something about
 the ASCII codec, because my file is encoded with UTF-8.

Because byte strings can come from different sources (network, files,
etc.), not only from the source of your program, python cannot assume
all of them are utf-8. It assumes they are ascii, because most
widespread text encodings are ascii-based. Actually it's a guess,
since there are utf-16, utf-32 and other non-ascii encodings. If you
want to experience life without guesses, put
sys.setdefaultencoding("undefined") into site.py


Re: Problem with sets and Unicode strings

2006-06-27 Thread Robert Kern
Dennis Benzinger wrote:
 Serge Orlov wrote:
 On 6/27/06, Dennis Benzinger [EMAIL PROTECTED] wrote:
 Hi!

 The following program in an UTF-8 encoded file:


 # -*- coding: UTF-8 -*-

 FIELDS = ("Fächer", )
 FROZEN_FIELDS = frozenset(FIELDS)
 FIELDS_SET = set(FIELDS)

 print u"Fächer" in FROZEN_FIELDS
 print u"Fächer" in FIELDS_SET
 print u"Fächer" in FIELDS


 gives this output


 False
 False
 Traceback (most recent call last):
   File "test.py", line 9, in ?
     print u"Fächer" in FIELDS
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
 ordinal not in range(128)


 Why do the first two print statements succeed and the third one fails
 with an exception?
 Actually all three statements fail to produce correct result.
 
 So this is a bug in Python?

No.

 Why does the use of set/frozenset remove the exception?

 Because sets use hash algorithm to find matches, whereas the last
 statement directly compares a unicode string with a byte string. Byte
 strings can only contain ascii characters, that's why python raises an
 exception. The problem is very easy to fix: use unicode strings for
 all non-ascii strings.
 
 No, byte strings contain characters which are at least 8-bit wide 
 http://docs.python.org/ref/types.html. But I don't understand what 
 Python is trying to decode and why the exception says something about 
 the ASCII codec, because my file is encoded with UTF-8.

Please read

   http://www.amk.ca/python/howto/unicode

The string in all of the containers (FIELDS, FROZEN_FIELDS, FIELDS_SET) is a 
regular byte string, not a Unicode string. The encoding declaration only 
controls how the file is parsed. The string literal that you use for FIELDS 
is a regular string literal, not a Unicode string literal, so the object it 
creates is an 8-bit byte string. The tuple containment test is attempting to 
compare your Unicode string object to the regular string object for equality. 
Python does these comparisons by attempting to decode the regular string into 
a Unicode string. Since there is no encoding information present on regular 
strings at this point (since the encoding declaration in your file only 
controls parsing, nothing else), Python assumes ASCII and raises an exception 
if that decoding fails.
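
The failing step can be reproduced in isolation. This sketch assumes the
source file is UTF-8, as in the original post, and performs by hand the
ASCII decode that Python 2 attempts implicitly during the comparison. The
variable names are made up for illustration, and the code also runs on
Python 3.

```python
text = u"F\u00e4cher"       # u"Fächer"
raw = text.encode("utf-8")  # the bytes a plain "Fächer" literal holds in a UTF-8 file

# Decode the bytes as ASCII, which is the default Python 2 falls back to.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding stopped and on which byte.
bad_position = error.start
bad_byte = raw[error.start:error.start + 1]
```

The position and byte match the traceback in the question: byte 0xc3 at
position 1 is the first half of the UTF-8 encoding of the a-umlaut.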

-- 
Robert Kern

