Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-08 Thread David Hopwood

Martin v. Löwis wrote:
> David Hopwood schrieb:
>> Michael Foord wrote:
>>> David Hopwood wrote:[snip..]
>>>
>>>>>> we should, of course, continue to use the one we always used (for
>>>>>> "ascii", there is no difference between the two).
>>>>>
>>>>> +1
>>>>>
>>>>> This seems the most (only ?) logical solution.
>>>>
>>>> No; always considering Unicode and non-ASCII byte strings to be distinct
>>>> is just as logical.
>>
>> I think you must have misread my comment:
>
> Indeed. The misunderstanding originates from your sentence starting with
> "no", when, in fact, you seem to be supporting the proposal I made.

I had misunderstood what the existing Python behaviour is. I now think
the current behaviour (which uses "B.decode(system_encoding) == U") is
definitely a bad idea, especially in cases where the system encoding is
not US-ASCII, but I agree that it can't be changed for 2.5.

-- 
David Hopwood <[EMAIL PROTECTED]>



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-08 Thread M.-A. Lemburg

Armin Rigo wrote:
> Hi,
> 
> On Thu, Aug 03, 2006 at 07:53:11PM +0200, M.-A. Lemburg wrote:
>>> I thought I'd heard (from Guido here or on the py3k list) that it was only 
>>> 1 < u'abc' that would raise an exception, and that 1 == u'abc' would still 
>>> evaluate to False.  Did I misunderstand?
>> Could be that I'm wrong.
> 
> I also seem to remember that TypeErrors should only signal ordering
> nonsense, not equality.  In this case, I'm of the opinion that unicode
> objects and completely-unrelated strings of random bytes should
> successfully compare as unequal, but I'm not enough of a unicode user to
> be sure.

Agreed - for Py3k where strings no longer exist and Unicode is
the only text type.

In Python 2.x the situation is a little different, since strings
are still very often used as container for text data.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Aug 08 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Martin v. Löwis

Armin Rigo schrieb:
> I also seem to remember that TypeErrors should only signal ordering
> nonsense, not equality.  In this case, I'm of the opinion that unicode
> objects and completely-unrelated strings of random bytes should
> successfully compare as unequal, but I'm not enough of a unicode user to
> be sure.

I believe this was the original intent for raising TypeErrors here in
the first place: string-vs-unicode comparison predates rich comparisons,
and there is no way to implement __cmp__ meaningfully if the strings
don't convert successfully under the system encoding (if they are
unequal, you wouldn't be able to tell which one is smaller).

With rich comparisons available, I see no reason to keep raising that
exception.
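
The split between equality and ordering described above can be sketched in
Python 3, which postdates this thread. The helper name try_less_than is
hypothetical, and the snippet only illustrates the rich-comparison idea; it
does not reproduce the 2.5 str/unicode code path.

```python
# Python 3 sketch: equality across unrelated types answers False,
# while ordering refuses to guess and raises TypeError.
def try_less_than(a, b):
    """Return the result of a < b, or the name of the exception raised."""
    try:
        return a < b
    except TypeError as exc:
        return type(exc).__name__


equal_result = (1 == "abc")             # no exception, simply unequal
order_result = try_less_than(1, "abc")  # ordering is undefined here

print(equal_result, order_result)
```

With rich comparisons, __eq__ can answer "not equal" without ever having to
decide which operand is smaller, which is the point being made here.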

As for unicode users: As others have said, they should avoid mixing
unicode and ascii strings. We provide a fallback for a limited case
(ascii); beyond that, Python assumes that non-ascii strings represent
uninterpreted bytes, not characters.

Regards,
Martin

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Martin v. Löwis

David Hopwood schrieb:
> Michael Foord wrote:
>> David Hopwood wrote:[snip..]
>>
>>>>> we should, of course, continue to use the one we always used (for
>>>>> "ascii", there is no difference between the two).
>>>>
>>>> +1
>>>>
>>>> This seems the most (only ?) logical solution.
>>> No; always considering Unicode and non-ASCII byte strings to be distinct
>>> is just as logical.
> 
> I think you must have misread my comment:

Indeed. The misunderstanding originates from your sentence starting with
"no", when, in fact, you seem to be supporting the proposal I made.

Regards,
Martin

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Ron Adam

Michael Foord wrote:
> David Hopwood wrote:[snip..]
>>   
>>>> we should, of course, continue to use the one we always used (for
>>>> "ascii", there is no difference between the two).

>>> +1
>>>
>>> This seems the most (only ?) logical solution.
>>> 
>> No; always considering Unicode and non-ASCII byte strings to be distinct
>> is just as logical.

Yes, that's true.  (But can't be done prior to P3k of course.) Consider 
the comparison of ...

[3] == (3,)   ->  False

These are not the same thing even though it may be trivial to treat them 
as being equivalent.  So how smart should an equivalence comparison be? 
I think testing for interchangeability and/or taking into account 
context is going down a very difficult road.  Which is what the string 
to Unicode comparison does by making an assumption that the string type 
is in the default encoding, which it may not be.

Purity in this would insist that comparing floats and integers always 
return False, but there is little ambiguity when it comes to whether 
numerical values are equivalent or not.  The rules for their comparisons 
are fairly well established.  So numerical equivalence can be the 
exception when comparing values of differing types and it's the expected 
behavior as well as the established practice in programming.

> Except there has been an implicit promise in Python for years now that 
> ascii byte-strings will compare equally to the unicode equivalent: lots 
> of code assumes this. Breaking this is fine in principle - but for Py3K 
> not Py 2.x.

Also True.  And I hope that a bytes to Unicode comparison in Py3k will 
always return False just like [3] == (3,) always returns False.

> That means Martin's solution is the best for the current problem. (IMHO 
> of course...)

I think (IMHO) in this particular case, maintaining "backwards 
compatibility" should take precedence (until Py3k) and be the stated 
reason for the continued behavior in the documents as well.  And so 
Unicode to String comparisons should be the second exception to not 
doing data form conversions when comparing two objects.  At least for 
pre-Py3k.

Are there other cases where different types of objects compare equal? 
(Not including those where the user writes or overrides a method to get 
that functionality of course.)
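
As a rough answer to the question above, here is a small Python 3 check of a
few built-in cross-type comparisons. Python 3 did not exist when this was
written, so this shows where things ended up rather than the 2.x behaviour
under discussion; the dictionary name cross_type is just for illustration.

```python
# Python 3: which built-in cross-type equality comparisons succeed?
cross_type = {
    "list vs tuple": [3] == (3,),
    "int vs float": 1 == 1.0,
    "bool vs int": True == 1,
    "set vs frozenset": {1, 2} == frozenset({1, 2}),
    "bytes vs str": b"abc" == "abc",
}

for label, result in cross_type.items():
    print(label, result)
```

Numbers (including bool) and set/frozenset compare by value across types,
while list/tuple and bytes/str do not.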

Cheers,
Ron


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread David Hopwood

Michael Foord wrote:
> David Hopwood wrote:[snip..]
> 
>>>> we should, of course, continue to use the one we always used (for
>>>> "ascii", there is no difference between the two).
>>>
>>> +1
>>>
>>> This seems the most (only ?) logical solution.
>>
>> No; always considering Unicode and non-ASCII byte strings to be distinct
>> is just as logical.
> 
> Except there has been an implicit promise in Python for years now that
> ascii byte-strings will compare equally to the unicode equivalent: lots
> of code assumes this.

I think you must have misread my comment:

  No; always considering Unicode and *non-ASCII* byte strings to be distinct
  is just as logical.

This says nothing about comparing Unicode and ASCII byte strings.

-- 
David Hopwood <[EMAIL PROTECTED]>



Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Michael Foord

David Hopwood wrote:[snip..]
>
>   
>>> we should, of course, continue to use the one we always used (for
>>> "ascii", there is no difference between the two).
>>>   
>> +1
>>
>> This seems the most (only ?) logical solution.
>> 
>
> No; always considering Unicode and non-ASCII byte strings to be distinct
> is just as logical.
>   
Except there has been an implicit promise in Python for years now that 
ascii byte-strings will compare equally to the unicode equivalent: lots 
of code assumes this. Breaking this is fine in principle - but for Py3K 
not Py 2.x.

That means Martin's solution is the best for the current problem. (IMHO 
of course...)

Michael
http://www.voidspace.org.uk/python/index.shtml



Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Armin Rigo

Hi,

On Thu, Aug 03, 2006 at 07:53:11PM +0200, M.-A. Lemburg wrote:
> > I thought I'd heard (from Guido here or on the py3k list) that it was only 
> > 1 < u'abc' that would raise an exception, and that 1 == u'abc' would still 
> > evaluate to False.  Did I misunderstand?
> 
> Could be that I'm wrong.

I also seem to remember that TypeErrors should only signal ordering
nonsense, not equality.  In this case, I'm of the opinion that unicode
objects and completely-unrelated strings of random bytes should
successfully compare as unequal, but I'm not enough of a unicode user to
be sure.

A bientot,

Armin.

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Martin v. Löwis

David Hopwood schrieb:
> I disagree. Unicode strings should always be considered distinct from
> non-ASCII byte strings. Implicitly encoding or decoding in order to
> perform a comparison is a bad idea; it is expensive and will often do
> the wrong thing.

That's a pretty irrelevant position at this point; Python has had
the notion of a system encoding since Unicode was introduced,
and we are not going to remove that just before a release candidate
of Python 2.5.

The question at hand is not whether certain object should compare
unequal, but whether comparing them should raise an exception.

>>> Which of the two conversions is selected is arbitrary; [...]
>
> It would not be arbitrary. In the common case where the byte encoding
> uses "precomposed" characters, using "U.encode(system_encoding) == B"
> will tend to succeed in more cases than "B.decode(system_encoding) == U",
> because alternative representations of the same abstract character in
> Unicode will be mapped to the same precomposed character.

No, they won't (although they should, perhaps):

py> u'o\u0308'.encode("latin-1")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0308' in
position 1: ordinal not in range(256)

In addition, it's also possible to find encodings (e.g. iso-2022) where
different byte sequences decode to the same Unicode string.
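
The latin-1 failure shown above still reproduces in Python 3. The sketch below
also shows that an explicit NFC normalization step (via unicodedata, which the
codec does not apply on its own) is what maps the decomposed form to the
precomposed character. Variable names are illustrative only.

```python
import unicodedata

decomposed = "o\u0308"  # 'o' followed by COMBINING DIAERESIS

# The codec does not compose characters, so encoding fails outright.
try:
    decomposed.encode("latin-1")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Only after explicit normalization does a precomposed character exist.
composed = unicodedata.normalize("NFC", decomposed)
encoded = composed.encode("latin-1")

print(encode_failed, repr(composed), encoded)
```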

Regards,
Martin


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread David Hopwood

Michael Foord wrote:
> Martin v. Löwis wrote:
> 
>>[snip..]
>>Expanding this view to Unicode should mean that a unicode
>>string U equals a byte string B if
>>U.encode(system_encoding) == B or B.decode(system_encoding) == U,
>>and that they don't equal otherwise (e.g. if the conversion
>>fails with a "not convertible" exception).

I disagree. Unicode strings should always be considered distinct from
non-ASCII byte strings. Implicitly encoding or decoding in order to
perform a comparison is a bad idea; it is expensive and will often do
the wrong thing.

The programmer should explicitly encode the Unicode string or decode
the byte string before comparison (which one of these is correct is
application-dependent).
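
A minimal Python 3 sketch of the explicit approach described above, assuming
the application knows which encoding its bytes are in. The helper name
same_text and the sample data are hypothetical.

```python
def same_text(raw, text, encoding):
    """Compare bytes with text after an explicit, caller-chosen decode."""
    return raw.decode(encoding) == text


raw = b"caf\xc3\xa9"  # UTF-8 bytes for "caf" + U+00E9

as_utf8 = same_text(raw, "caf\u00e9", "utf-8")      # right encoding
as_latin1 = same_text(raw, "caf\u00e9", "latin-1")  # wrong guess

print(as_utf8, as_latin1)
```

The same bytes compare equal or unequal depending on the encoding chosen,
which is why the choice is left to the application rather than made
implicitly.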

>>Which of the two conversions is selected is arbitrary; [...]

It would not be arbitrary. In the common case where the byte encoding
uses "precomposed" characters, using "U.encode(system_encoding) == B"
will tend to succeed in more cases than "B.decode(system_encoding) == U",
because alternative representations of the same abstract character in
Unicode will be mapped to the same precomposed character.

(Whether these are cases in which the comparison *should* succeed is,
as I said above, application-dependent.)

The special case of considering US-ASCII strings to compare equal to
the corresponding Unicode string, is more reasonable than this would be
for a general byte encoding, because:

 - it can be done with no (or only a trivial) conversion,
 - US-ASCII has no precomposed characters or combining marks, so it
   does not have multiple encodings for the same abstract character,
 - Unicode has a US-ASCII subset that uses exactly the same encoding
   model as US-ASCII (whereas in general, a byte encoding might use
   an arbitrarily different encoding model to Unicode, as for example
   is the case for ISCII).
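
The first and third points above can be checked directly. This Python 3
sketch, with an arbitrary sample string, shows that for ASCII text each byte
value is the same number as the corresponding code point, so the
"conversion" is trivial.

```python
text = "dictionary key"
raw = text.encode("ascii")

# Each byte value equals the code point of the matching character.
same_values = list(raw) == [ord(ch) for ch in text]

# Decoding gives back exactly the original text.
round_trip = raw.decode("ascii") == text

print(same_values, round_trip)
```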

>>we should, of course, continue to use the one we always used (for
>>"ascii", there is no difference between the two).
> 
> +1
> 
> This seems the most (only ?) logical solution.

No; always considering Unicode and non-ASCII byte strings to be distinct
is just as logical.

-- 
David Hopwood <[EMAIL PROTECTED]>


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Michael Foord

Martin v. Löwis wrote:
> [snip..]
> Expanding this view to Unicode should mean that a unicode
> string U equals a byte string B if
> U.encode(system_encoding) == B or B.decode(system_encoding) == U,
> and that they don't equal otherwise (e.g. if the conversion
> fails with a "not convertible" exception). Which of the
> two conversions is selected is arbitrary; we should, of
> course, continue to use the one we always used (for
> "ascii", there is no difference between the two).
>
>   
+1

This seems the most (only ?) logical solution.

Michael Foord

> Regards,
> Martin

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-07 Thread Martin v. Löwis

M.-A. Lemburg schrieb:
>> There's no disputing that an exception should be raised
>> if the string *must* be interpretable as characters in
>> order to continue. But that's not true here if you allow
>> for the interpretation that they're simply objects of
>> different (duck) type and therefore unequal.
> 
> Hmm, given that interpretation, 1 == 1.0 would have to be
> False.

No, but 1 == 1.5 would have to be False (and actually is).
In that analogy, int relates to float as ascii-bytes to
Unicode: some values are shared between int and float (e.g.
1 and 1.0), other values are not shared (e.g. 1.5 has no
equivalent in int). An int equals a float only  if both
values originate from the shared subset.

Now, int is a (nearly) true subset of float, so there are
no ints with no float equivalent (actually, there are, but
Python ignores that).

> Note that you do have to interpret the string as characters
> if you compare it to Unicode and there's nothing wrong with
> that.

Consider this:
py> int(3+4j)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: can't convert complex to int; use int(abs(z))
py> 3 == 3+4j
False

So even though the conversion raises an exception, the
values are determined to be not equal. Again, because int
is a nearly true subset of complex, the conversion goes
the other way, but *if* it would use the complex->int
conversion, then the TypeError should be taken as
a guarantee that the objects don't compare equal.
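
The numeric analogy can be replayed in Python 3. The exact TypeError message
differs between versions, so only the exception type is recorded. The last
line illustrates the remark that some ints have no float equivalent, using
2**53 + 1 as an example value.

```python
shared = (1 == 1.0)      # both values lie in the shared subset
not_shared = (1 == 1.5)  # 1.5 has no int equivalent

# Converting complex to int fails...
try:
    int(3 + 4j)
    conversion_error = None
except TypeError as exc:
    conversion_error = type(exc).__name__

# ...yet the equality test itself quietly answers False.
complex_equal = (3 == 3 + 4j)

# An int with no exact float equivalent: the conversion rounds it.
lossy = float(2**53 + 1) == float(2**53)

print(shared, not_shared, conversion_error, complex_equal, lossy)
```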

Expanding this view to Unicode should mean that a unicode
string U equals a byte string B if
U.encode(system_encoding) == B or B.decode(system_encoding) == U,
and that they don't equal otherwise (e.g. if the conversion
fails with a "not convertible" exception). Which of the
two conversions is selected is arbitrary; we should, of
course, continue to use the one we always used (for
"ascii", there is no difference between the two).
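
The proposed rule can be sketched as a standalone function in Python 3. This
only illustrates the semantics described above and is not the interpreter's
implementation; the name text_equals_bytes and the encoding parameter (which
stands in for the system encoding) are assumptions.

```python
def text_equals_bytes(u, b, encoding="ascii"):
    """Return True if b decodes to u; a failed decode means 'not equal'."""
    try:
        return b.decode(encoding) == u
    except UnicodeDecodeError:
        # Not convertible, so the two values cannot be equal.
        return False


print(text_equals_bytes("abc", b"abc"))              # decodes and matches
print(text_equals_bytes("\xa1", b"\xa1"))            # not ASCII: unequal
print(text_equals_bytes("\xa1", b"\xa1", "latin-1"))
```

The decode direction is used here because, per the message above, it is the
conversion that has always been used.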

Regards,
Martin

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Ralf Schmitt

Christopher Armstrong wrote:
> On 8/4/06, *Ralf Schmitt* <[EMAIL PROTECTED] > 
> wrote:
> 
> 
> Maybe this is all just a matter of choosing the right
> defaultencoding ? :)
> 
> 
> 
> Doing this is amazingly stupid. I can't believe how often I hear this 
> suggestion. Apparently the fact that you have to "reload(sys)" to do it 
> isn't warning enough?

Ahh, the twisted people, friendly as I know them. Did you actually 
notice the smiley?

- Ralf



Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Christopher Armstrong

On 8/4/06, Ralf Schmitt <[EMAIL PROTECTED]> wrote:
> Jean-Paul Calderone wrote:
> >
> > I like the exception that 2.5 raises.  I only wish it raised by default
> > when using 'ascii' and u'ascii' as keys in the same dictionary. ;)  Oh,
> > and that str and unicode did not hash like they do.  ;)
>
> No problem:
>
>  >>> import sys
>  >>> reload(sys)
>  >>> sys.setdefaultencoding("base64")
>  >>> "a"==u"a"
> Traceback (most recent call last):
> ...
> binascii.Error: Incorrect padding
>
> Maybe this is all just a matter of choosing the right defaultencoding ? :)

Doing this is amazingly stupid. I can't believe how often I hear this
suggestion. Apparently the fact that you have to "reload(sys)" to do it
isn't warning enough?

-- 
Christopher Armstrong
International Man of Twistery
http://radix.twistedmatrix.com/
http://twistedmatrix.com/
http://canonical.com/

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Bob Ippolito


On Aug 3, 2006, at 9:34 PM, Josiah Carlson wrote:

>
> Bob Ippolito <[EMAIL PROTECTED]> wrote:
>> On Aug 3, 2006, at 6:51 PM, Greg Ewing wrote:
>>
>>> M.-A. Lemburg wrote:
>>>
>>>> Perhaps we ought to add an exception to the dict lookup mechanism
>>>> and continue to silence UnicodeErrors ?!
>>>
>>> Seems to me that comparison of unicode and non-unicode
>>> strings for equality shouldn't raise exceptions in the
>>> first place.
>>
>> Seems like a slightly better idea than having dictionaries suppress
>> exceptions. Still not ideal though because sticking non-ASCII strings
>> that are supposed to be text and unicode in the same data structures
>> is *probably* still an error.
>
> If/when 'python -U -c "import test.testall"' runs without unexpected
> error (I doubt it will happen prior to the "all strings are unicode"
> conversion), then I think that we can say that there aren't any
> use-cases for strings and unicode being in the same dictionary.
>
> As an alternate idea, rather than attempting to .decode('ascii') when
> strings and unicode compare, why not .decode('latin-1')?  We lose the
> unicode decoding error, but "the right thing" happens (in my opinion)
> when u'\xa1' and '\xa1' compare.

Well, in this case it would cause different behavior if u'\xa1' and  
'\xa1' compared equal. It'd just be an even more subtle error.
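
A Python 3 sketch of how a latin-1 guess could go wrong quietly. The sample
encodings (UTF-8 and cp437) are chosen only for illustration; any bytes whose
real encoding is not latin-1 would do.

```python
text = "\u00a1"  # INVERTED EXCLAMATION MARK

# Bytes that really are this character, but in UTF-8: the latin-1
# guess silently yields different text, so the comparison is False.
utf8_bytes = text.encode("utf-8")
latin1_guess = utf8_bytes.decode("latin-1")
false_negative = (latin1_guess == text)

# Bytes that are a different character in cp437: the latin-1 guess
# silently makes them compare equal.
cp437_bytes = "\u00ed".encode("cp437")
false_positive = (cp437_bytes.decode("latin-1") == text)

print(utf8_bytes, false_negative, cp437_bytes, false_positive)
```

Neither case raises an exception, which is what makes the error more subtle
than the ASCII decoding failure.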

-bob


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread M.-A. Lemburg

Ralf Schmitt wrote:
> M.-A. Lemburg wrote:
>> Ralf Schmitt wrote:
>>> Does python 2.4 catch any exception when comparing keys (which are not 
>>> basestrings) in dictionaries?
>> Yes. It does so for all equality compares that need to be done
>> as part of the hash collision algorithm (not only w/r to strings
>> and Unicode, but in general).
>>
>> This was changed in 2.5, which now reports the exception.
>>
> 
> So, this thread isn't about "unicode hell" at all.

For some reason people always think of hell when dealing with Unicode.
Instead, they should think of hell when dealing with strings.

> I guess this change will break lots of code (or will reveal lots of 
> broken code...as it did in my case actually).

I don't think it will break a lot of code that wasn't already
broken.

Let's see how many reports we get for 2.5b3 and then decide.

If things turn out bad, we might silence the UnicodeError
and instead issue a warning every time this situation occurs.
In 2.6 we'd then revert to raising an exception.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread M.-A. Lemburg

Greg Ewing wrote:
> M.-A. Lemburg wrote:
> 
>> If a string
>> is not ASCII and thus causes the exception, there's not a lot you
>> can say, since you don't know the encoding of the string.
> 
> That's one way of looking at it.
> 
> Another is that any string containing chars > 127 is not
> text at all, but binary data, in which case it's not equal
> to *any* unicode string -- just like bytes objects will
> not be equal to strings in Py3k.
> 
>> All you
>> know is that it's not ASCII. Instead of guessing, Python then raises
>> an exception to let the programmer decide.
> 
> There's no disputing that an exception should be raised
> if the string *must* be interpretable as characters in
> order to continue. But that's not true here if you allow
> for the interpretation that they're simply objects of
> different (duck) type and therefore unequal.

Hmm, given that interpretation, 1 == 1.0 would have to be
False.

Note that you do have to interpret the string as characters
if you compare it to Unicode and there's nothing wrong with
that.

What's making this particular case interesting is that
the comparison is hidden in the dictionary implementation
and only triggers if you get a hash collision, which makes
the whole issue appear to be happening randomly.
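
The "only on a hash collision" effect can be reproduced with a toy key class
in Python 3. PickyKey and lookup are hypothetical names, and the raising
__eq__ stands in for the failing str/unicode comparison.

```python
class PickyKey:
    """Hypothetical key whose equality test always fails loudly."""

    def __init__(self, hash_value):
        self.hash_value = hash_value

    def __hash__(self):
        return self.hash_value

    def __eq__(self, other):
        raise ValueError("keys cannot be compared")


def lookup(table, key):
    """Return the stored value, or the name of the exception raised."""
    try:
        return table[key]
    except (KeyError, ValueError) as exc:
        return type(exc).__name__


table = {PickyKey(1): "stored"}

no_collision = lookup(table, PickyKey(2))  # different hash: no comparison
collision = lookup(table, PickyKey(1))     # same hash: __eq__ is called

print(no_collision, collision)
```

Whether the comparison runs at all depends on the hash values, so the error
shows up for some keys and not for others.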

This whole thread aside: it's never recommended to mix strings
and Unicode, unless you really have to.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Michael Hudson

"M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

> The point here is that a typical user won't expect any comparisons
> to be made when dealing with dictionaries, simply because the fact
> that you do need to make comparisons is an implementation detail.

Of course looking things up in a dictionary involves comparisons!  How
could it not?

> So in this particular case silencing the exception might be the
> more user friendly way of dealing with the problem.

Please, no.

> That said, the problem still lingers in that dictionary, so it may
> bite you in some other context, e.g. when iterating over the list
> of keys.

For this reason, and others.

Cheers,
mwh

-- 
   web in my head get it out get it out
-- from Twisted.Quotes

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Ralf Schmitt

M.-A. Lemburg wrote:
> Ralf Schmitt wrote:
>> Does python 2.4 catch any exception when comparing keys (which are not 
>> basestrings) in dictionaries?
> 
> Yes. It does so for all equality compares that need to be done
> as part of the hash collision algorithm (not only w/r to strings
> and Unicode, but in general).
> 
> This was changed in 2.5, which now reports the exception.
> 

So, this thread isn't about "unicode hell" at all.
I guess this change will break lots of code (or will reveal lots of 
broken code...as it did in my case actually).

- Ralf


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Greg Ewing

M.-A. Lemburg wrote:

> If a string
> is not ASCII and thus causes the exception, there's not a lot you
> can say, since you don't know the encoding of the string.

That's one way of looking at it.

Another is that any string containing chars > 127 is not
text at all, but binary data, in which case it's not equal
to *any* unicode string -- just like bytes objects will
not be equal to strings in Py3k.

> All you
> know is that it's not ASCII. Instead of guessing, Python then raises
> an exception to let the programmer decide.

There's no disputing that an exception should be raised
if the string *must* be interpretable as characters in
order to continue. But that's not true here if you allow
for the interpretation that they're simply objects of
different (duck) type and therefore unequal.

--
Greg


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread M.-A. Lemburg

Ralf Schmitt wrote:
> Does python 2.4 catch any exception when comparing keys (which are not 
> basestrings) in dictionaries?

Yes. It does so for all equality compares that need to be done
as part of the hash collision algorithm (not only w/r to strings
and Unicode, but in general).

This was changed in 2.5, which now reports the exception.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Ralf Schmitt

Jean-Paul Calderone wrote:
> 
> I like the exception that 2.5 raises.  I only wish it raised by default
> when using 'ascii' and u'ascii' as keys in the same dictionary. ;)  Oh,
> and that str and unicode did not hash like they do.  ;)

No problem:

 >>> import sys
 >>> reload(sys)
<module 'sys' (built-in)>
 >>> sys.setdefaultencoding("base64")
 >>> "a"==u"a"
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/exp/lib/python2.5/encodings/base64_codec.py", line 42, in base64_decode
     output = base64.decodestring(input)
   File "/exp/lib/python2.5/base64.py", line 321, in decodestring
     return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
 >>> "a"=="a"
True
 >>> d={u"a":1, "a":1}
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/exp/lib/python2.5/encodings/base64_codec.py", line 42, in base64_decode
     output = base64.decodestring(input)
   File "/exp/lib/python2.5/base64.py", line 321, in decodestring
     return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Maybe this is all just a matter of choosing the right defaultencoding ? :)


BTW, python 2.4 also suppresses this exception (when instantiating the 
dictionary)

Does python 2.4 catch any exception when comparing keys (which are not 
basestrings) in dictionaries?


- Ralf


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread Giovanni Bajo

M.-A. Lemburg <[EMAIL PROTECTED]> wrote:

>> At the very least, in the context of a dictionary
>> lookup.
>
> The point here is that a typical user won't expect any comparisons
> to be made when dealing with dictionaries, simply because the fact
> that you do need to make comparisons is an implementation detail.

I'd also weigh in the fact that this is a change of behaviour since the
introduction of unicode in 2.0, and it breaks real-world applications (as
demonstrated by this thread). I think Python ought to revert to the previous
behaviour and, in case there's interest, the discussion moved to the py3k list.

Giovanni Bajo


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-04 Thread M.-A. Lemburg

Delaney, Timothy (Tim) wrote:
> M.-A. Lemburg wrote:
> 
>> Perhaps we ought to add an exception to the dict lookup mechanism
>> and continue to silence UnicodeErrors ?!
> 
> I'd definitely consider a UnicodeError to be an indication that two
> objects are not equal. 

Not really: Python expects all strings to be ASCII whenever they
meet Unicode strings and have to be converted to Unicode. If a string
is not ASCII and thus causes the exception, there's not a lot you
can say, since you don't know the encoding of the string. All you
know is that it's not ASCII. Instead of guessing, Python then raises
an exception to let the programmer decide.
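
A minimal sketch of the rule described above, written for Python 3 (an
assumption: in the Python 2.x discussed here the decoding happens
implicitly, while the sketch spells it out with bytes.decode). The helper
name decodes_as_ascii is hypothetical.

```python
def decodes_as_ascii(raw):
    """Return True if the byte string can be decoded with the ASCII codec."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        # All we learn here is that the bytes are not ASCII;
        # the real encoding remains unknown.
        return False
    return True


plain = b"mas"        # pure ASCII bytes
accented = b"m\xe1s"  # byte 0xe1 lies outside the ASCII range

print(decodes_as_ascii(plain))
print(decodes_as_ascii(accented))
```

The failing case carries no hint about which encoding was intended, which
is the reason given above for raising instead of guessing.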

> At the very least, in the context of a dictionary
> lookup.

The point here is that a typical user won't expect any comparisons
to be made when dealing with dictionaries, simply because the fact
that you do need to make comparisons is an implementation detail.

So in this particular case silencing the exception might be the
more user friendly way of dealing with the problem.

That said, the problem still lingers in that dictionary, so it may
bite you in some other context, e.g. when iterating over the list
of keys.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Michael Urman

On 8/3/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> As an alternate idea, rather than attempting to .decode('ascii') when
> strings and unicode compare, why not .decode('latin-1')?  We lose the
> unicode decoding error, but "the right thing" happens (in my opinion)
> when u'\xa1' and '\xa1' compare.

Since I use utf-8 way more than I use latin-1, -1. Since others do
not, -1 on any not obviously correct encoding other than ascii, which
gets grandfathered.

This raises an exception for a good reason. Yes it's annoying at
times. We should fix those times, not the (unbroken) exception.

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Jean-Paul Calderone

On Thu, 03 Aug 2006 21:34:04 -0700, Josiah Carlson <[EMAIL PROTECTED]> wrote:
>
>Bob Ippolito <[EMAIL PROTECTED]> wrote:
>> On Aug 3, 2006, at 6:51 PM, Greg Ewing wrote:
>>
>> > M.-A. Lemburg wrote:
>> >
>> >> Perhaps we ought to add an exception to the dict lookup mechanism
>> >> and continue to silence UnicodeErrors ?!
>> >
>> > Seems to me that comparison of unicode and non-unicode
>> > strings for equality shouldn't raise exceptions in the
>> > first place.
>>
>> Seems like a slightly better idea than having dictionaries suppress
>> exceptions. Still not ideal though because sticking non-ASCII strings
>> that are supposed to be text and unicode in the same data structures
>> is *probably* still an error.
>
>If/when 'python -U -c "import test.testall"' runs without unexpected
>error (I doubt it will happen prior to the "all strings are unicode"
>conversion), then I think that we can say that there aren't any
>use-cases for strings and unicode being in the same dictionary.
>
>As an alternate idea, rather than attempting to .decode('ascii') when
>strings and unicode compare, why not .decode('latin-1')?  We lose the
>unicode decoding error, but "the right thing" happens (in my opinion)
>when u'\xa1' and '\xa1' compare.

It might be right for Latin-1 strings.

However, it would be even *more* surprising for the person who has to
figure out why his program works when his program gets a string containing
'\xc0' from one user but fails when it gets '\xe3\x81\x82' from another
user.
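
The surprise described above can be sketched in Python 3 (an assumption,
since the thread is about Python 2.x). The helper try_decode is
hypothetical; the two byte strings are the ones mentioned in the text.

```python
def try_decode(raw, encoding):
    """Decode raw bytes, returning None instead of raising on failure."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


latin1_input = b"\xc0"        # one character when read as Latin-1
utf8_input = b"\xe3\x81\x82"  # one character (U+3042) when read as UTF-8

for raw in (latin1_input, utf8_input):
    # Latin-1 never fails, so a wrong guess goes unnoticed;
    # UTF-8 rejects input that is not valid UTF-8.
    print(repr(raw), repr(try_decode(raw, "latin-1")), repr(try_decode(raw, "utf-8")))
```

Decoding the UTF-8 input as Latin-1 succeeds but yields three unrelated
characters, so a comparison against the intended text would quietly be
false rather than raise.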

I like the exception that 2.5 raises.  I only wish it raised by default
when using 'ascii' and u'ascii' as keys in the same dictionary. ;)  Oh,
and that str and unicode did not hash like they do.  ;)

Jean-Paul

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread James Y Knight


On Aug 4, 2006, at 12:34 AM, Josiah Carlson wrote:
> As an alternate idea, rather than attempting to .decode('ascii') when
> strings and unicode compare, why not .decode('latin-1')?  We lose the
> unicode decoding error, but "the right thing" happens (in my opinion)
> when u'\xa1' and '\xa1' compare.

Maybe you want those to compare equal, but _I_ want u'\xa1' and
'\xc2\xa1' to compare equal, so it should obviously use .decode('utf-8')!

(okay, no, I don't really want that.)

James

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Josiah Carlson

Bob Ippolito <[EMAIL PROTECTED]> wrote:
> On Aug 3, 2006, at 6:51 PM, Greg Ewing wrote:
> 
> > M.-A. Lemburg wrote:
> >
> >> Perhaps we ought to add an exception to the dict lookup mechanism
> >> and continue to silence UnicodeErrors ?!
> >
> > Seems to me that comparison of unicode and non-unicode
> > strings for equality shouldn't raise exceptions in the
> > first place.
> 
> Seems like a slightly better idea than having dictionaries suppress  
> exceptions. Still not ideal though because sticking non-ASCII strings  
> that are supposed to be text and unicode in the same data structures  
> is *probably* still an error.

If/when 'python -U -c "import test.testall"' runs without unexpected
error (I doubt it will happen prior to the "all strings are unicode"
conversion), then I think that we can say that there aren't any
use-cases for strings and unicode being in the same dictionary.

As an alternate idea, rather than attempting to .decode('ascii') when
strings and unicode compare, why not .decode('latin-1')?  We lose the
unicode decoding error, but "the right thing" happens (in my opinion)
when u'\xa1' and '\xa1' compare.

 - Josiah


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Bob Ippolito

On Aug 3, 2006, at 6:51 PM, Greg Ewing wrote:

> M.-A. Lemburg wrote:
>
>> Perhaps we ought to add an exception to the dict lookup mechanism
>> and continue to silence UnicodeErrors ?!
>
> Seems to me that comparison of unicode and non-unicode
> strings for equality shouldn't raise exceptions in the
> first place.

Seems like a slightly better idea than having dictionaries suppress  
exceptions. Still not ideal though because sticking non-ASCII strings  
that are supposed to be text and unicode in the same data structures  
is *probably* still an error.

-bob


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread James Y Knight

On Aug 3, 2006, at 5:47 PM, M.-A. Lemburg wrote:
>> The only way this error could be the right thing is if you were trying
>> to suggest that he shouldn't mix unicode and bytestrings at all.
>
> Good question. I wonder whether that's a reasonable approach for
> Python 2.x (I'd say it is for Py3k).

It's my understanding that in py3k, there will be no implicit  
conversion, bytestrings and unicodes will never be equal (no matter  
what the contents), and so this wouldn't be an issue. (as u"1" == "1"  
would be the same sort of situation as 1 == "1" is now)
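
For illustration, this is how the described behaviour looks when run under
a Python 3 interpreter (an assumption: the sketch relies on Python 3 as
eventually released, not on anything stated in this thread). The variable
names are arbitrary.

```python
text_key = "1"    # text string
bytes_key = b"1"  # byte string with the same ASCII content

# No implicit conversion takes place, so the two never compare equal.
same = (text_key == bytes_key)

# Both keys therefore occupy separate slots in one dictionary.
d = {text_key: "text", bytes_key: "bytes"}

print(same, len(d))
```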

James

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Greg Ewing

M.-A. Lemburg wrote:

> Perhaps we ought to add an exception to the dict lookup mechanism
> and continue to silence UnicodeErrors ?!

Seems to me that comparison of unicode and non-unicode
strings for equality shouldn't raise exceptions in the
first place.

--
Greg

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Michael Urman

On 8/3/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
> > ...but in the case of dictionaries this behaviour has changed and in
> > prior versions of python dictionaries did work as I expected them to.
> > Now they don't.
>
> Let's put it this way: Python 2.5 uncovered a bug in your
> application that has always been there. It's better to
> fix your application than arguing to cover up the bug again.

I would understand this assertion if Ralf were expecting dictionaries
to consider
{ u'm\xe1s': 1, 'm\xe1s': 1 } == { u'm\xe1s': 1 } == { 'm\xe1s': 1 }
This is clearly a mess waiting to explode.

But that's not what he said. He expects, as is the case in python2.4,
len({ u'm\xe1s': 1, 'm\xe1s': 1 }) == 2
because u'm\xe1s' clearly does not equal 'm\xe1s'. Because it raises
an exception, the dictionary shouldn't consider it equal, so there
should be the two keys which happen to be somewhat equivalent.

While this is in fact in the NEWS (Patch #1497053 & bug #1275608), I
think this should be raised for further discussion. Raising the
exception is good for debugging mistakes, but bad for dictionaries
holding unequal objects that happen to hash to the same value
and correctly raise exceptions on comparison.
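
The situation can be sketched without any unicode involved. The class
RaisingKey below is hypothetical: every instance has the same hash and
raises on comparison. The sketch is written for Python 3 and assumes a
current CPython, where (as in 2.5) the dictionary lets the comparison
error propagate.

```python
class RaisingKey:
    """Hypothetical key: all instances share one hash, and == always raises."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # force a hash collision between any two instances

    def __eq__(self, other):
        raise ValueError("refusing to compare %r" % (self.name,))


d = {}
d[RaisingKey("first")] = 1  # empty dict: nothing to compare against

try:
    # Same hash as the stored key, so the dict must compare the two keys.
    d[RaisingKey("second")] = 2
    outcome = "stored"
except ValueError:
    outcome = "raised"

print(outcome, len(d))
```

Under the 2.4 behaviour described in this thread, the error would have
been swallowed and the second key stored as a distinct entry.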

When we thought it was just a debugging tool, it made sense to put it
straight into 2.5. Since it actually can adversely affect behavior in
only slightly edgy cases, perhaps it should go through a warning phase
(which ideally could show the exception that was thrown, thus yielding
most or all of the intended debugging advantage).

Michael
-- 
Michael Urman  http://www.tortall.net/mu/blog

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Delaney, Timothy (Tim)

M.-A. Lemburg wrote:

> Perhaps we ought to add an exception to the dict lookup mechanism
> and continue to silence UnicodeErrors ?!

I'd definitely consider a UnicodeError to be an indication that two
objects are not equal. At the very least, in the context of a dictionary
lookup.
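
The idea of treating a decoding failure as "not equal" can be sketched as
follows. This is an illustration in Python 3, where the decoding step has
to be written out explicitly; the function name mixed_equal and its
encoding parameter are hypothetical and not part of any Python API.

```python
def mixed_equal(text, raw, encoding="ascii"):
    """Compare text with bytes; an undecodable byte string counts as unequal."""
    try:
        return text == raw.decode(encoding)
    except UnicodeDecodeError:
        # The bytes cannot be interpreted under the assumed encoding,
        # so they are reported as different from any text string.
        return False


print(mixed_equal("abc", b"abc"))
print(mixed_equal("m\xe1s", b"m\xe1s"))
```

With the default ASCII assumption the second call reports the keys as
different; passing encoding="latin-1" would make them equal, which is the
guessing that other posts in this thread object to.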

Tim Delaney

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread M.-A. Lemburg

Jim Jewett wrote:
> http://mail.python.org/pipermail/python-dev/2006-August/067934.html
> M.-A. Lemburg mal at egenix.com
> 
>> Ralf Schmitt wrote:
>>> Still trying to port our software. here's another thing I noticed:
> 
>>> d = {}
>>> d[u'm\xe1s'] = 1
>>> d['m\xe1s'] = 1
>>> print d
> 
> (a 2-element dictionary, because they are not equal)
> 
>>> With python 2.5 I get: [ a traceback ending in ]
> 
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1:
>>> ordinal not in range(128)
> 
>> Let's put it this way: Python 2.5 uncovered a bug in your
>> application that has always been there.
> 
> No; the application would only have a bug if he expected those two
> objects to compare equal.  Trying to stick something hashable into a
> dictionary should not raise an Exception just because there is already
> a similar key, (regardless of whether or not the other key is equal or
> identical).

Hmm, you have a point there...

>>> d = {}

# Two different objects
>>> x = 'a'
>>> y = hash(x)
>>> x
'a'
>>> y
12416037344

# ... with the same hash value
>>> hash(x)
12416037344
>>> hash(y)
12416037344

# Put them in the dictionary, causing a hash collision ...
>>> d[x] = 1
>>> d[y] = 2

# ... which is resolved by comparing the two for equality
# and assigning them to two different slots:
>>> d
{'a': 1, 12416037344: 2}

Since Python 2.5 propagates the compare exception, you get the
exception. Python 2.4 silenced the exception.
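
A comparable demonstration for Python 3 is sketched below. The integer
trick from the session above is not reliable there, because integer
hashes are reduced modulo sys.hash_info.modulus, so hash(hash(x)) need
not equal hash(x). The class SameHashAs is a hypothetical stand-in that
borrows another object's hash but is equal only to itself.

```python
class SameHashAs:
    """Hypothetical key that reuses another object's hash but equals only itself."""

    def __init__(self, other):
        self._hash = hash(other)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        return self is other


x = "a"
y = SameHashAs(x)  # a different object with the same hash value

d = {}
d[x] = 1
d[y] = 2  # collision: the dict compares the keys, finds them unequal

print(hash(x) == hash(y), len(d))
```

Because the comparison returns False instead of raising, both keys end
up in separate slots, as in the 2.4-style result shown above.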

> The only way this error could be the right thing is if you were trying
> to suggest that he shouldn't mix unicode and bytestrings at all.

Good question. I wonder whether that's a reasonable approach for
Python 2.x (I'd say it is for Py3k).

Currently you can't safely mix non-ASCII strings with Unicode
keys in the same dictionary.

Perhaps we ought to add an exception to the dict lookup mechanism
and continue to silence UnicodeErrors ?!

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread M.-A. Lemburg

John J Lee wrote:
> On Thu, 3 Aug 2006, M.-A. Lemburg wrote:
> [...]
>> It's actually a good preparation for Py3k where 1 == u'abc' will
>> (likely) also raise an exception.
> 
> I thought I'd heard (from Guido here or on the py3k list) that it was only 
> 1 < u'abc' that would raise an exception, and that 1 == u'abc' would still 
> evaluate to False.  Did I misunderstand?

Could be that I'm wrong.

All such mixed-type compares are wrong depending on the scale of
brokenness you apply ;-)

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread John J Lee

On Thu, 3 Aug 2006, M.-A. Lemburg wrote:
[...]
> It's actually a good preparation for Py3k where 1 == u'abc' will
> (likely) also raise an exception.

I thought I'd heard (from Guido here or on the py3k list) that it was only 
1 < u'abc' that would raise an exception, and that 1 == u'abc' would still 
evaluate to False.  Did I misunderstand?
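
For reference, the distinction between equality and ordering can be
checked with a short script. It assumes a Python 3 interpreter as
eventually released, which is not something this thread could know; the
variable names are arbitrary.

```python
# Equality between unrelated types simply answers False.
equal = (1 == "abc")

# Ordering between unrelated types is refused with a TypeError.
try:
    1 < "abc"
    ordering = "allowed"
except TypeError:
    ordering = "TypeError"

print(equal, ordering)
```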

John

Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread M.-A. Lemburg

Ralf Schmitt wrote:
>>>> Still trying to port our software. here's another thing I noticed:
>>>>
>>>> d = {}
>>>> d[u'm\xe1s'] = 1
>>>> d['m\xe1s'] = 1
>>>> print d
>>>>
>>>> With python 2.5 I get:
>>>>
>>>> $ python2.5 t2.py
>>>> Traceback (most recent call last):
>>>>   File "t2.py", line 3, in <module>
>>>>     d['m\xe1s'] = 1
>>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1:
>>>> ordinal not in range(128)

>> This is because Unicode and 8-bit string keys only work
>> in the same way if and only if they are plain ASCII.
> 
> This is okay. But in the case where one is not ASCII I would prefer to 
> be able to compare them (not equal) instead of getting a UnicodeError.
> I know it's too late to change this, ...

It is too late to change this, since it was always like this ;-)

Seriously, Unicode is doing the right thing here: you should
really always get an exception if you compare apples and
oranges, rather than reverting to comparing the ids of apples
and oranges as fall-back solution.

I believe that Py3k will implement this.

>> The reason lies in the hash function used by Unicode: it is
>> crafted to make hash(u) == hash(s) for all ASCII s, such
>> that s == u.
>>
>> For non-ASCII strings, there are no guarantees as to the
>> hash value of the strings or whether they match or not.
>>
>> This has been like that since Unicode was introduced, so it's
>> not new in Python 2.5.
>>
> 
> ...but in the case of dictionaries this behaviour has changed and in 
> prior versions of python dictionaries did work as I expected them to.
> Now they don't.

Let's put it this way: Python 2.5 uncovered a bug in your
application that has always been there. It's better to
fix your application than arguing to cover up the bug again.

> When working with unicode strings and (accidentally) mixing them with str
> strings, things might seem to work until the first non-ASCII string
> is given to some code and one gets that UnicodeDecodeError (e.g. when
> comparing them).
>
> If one mixes unicode strings and str strings as keys in a dictionary,
> things might seem to work far longer, until one tries to put in some
> non-ASCII string with the "wrong" hash value and suddenly things go boom.
> I'd rather keep the pre-2.5 behaviour.

It's actually a good preparation for Py3k where 1 == u'abc' will
(likely) also raise an exception.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Ralf Schmitt

M.-A. Lemburg wrote:
> Ralf Schmitt wrote:
>> Ralf Schmitt wrote:
>>> Still trying to port our software. here's another thing I noticed:
>>>
>>> d = {}
>>> d[u'm\xe1s'] = 1
>>> d['m\xe1s'] = 1
>>> print d
>>>
>>> With python 2.4 I can add those two keys to the dictionary and get:
>>> $ python2.4 t2.py
>>> {u'm\xe1s': 1, 'm\xe1s': 1}
>>>
>>> With python 2.5 I get:
>>>
>>> $ python2.5 t2.py
>>> Traceback (most recent call last):
>>>   File "t2.py", line 3, in <module>
>>>     d['m\xe1s'] = 1
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: 
>>> ordinal not in range(128)
>>>
>>> Is this intended behaviour? I guess this might break lots of programs 
>>> and the way python 2.4 works looks right to me.
>>> I think it should be possible to mix str/unicode keys in dicts and let 
>>> non-ascii strings compare not-equal to any unicode string.
>> Also this behaviour makes your programs break randomly, that is, it will 
>> break when the string you add hashes to the same value that the unicode 
>> string has (at least that's what I guess..)
> 
> This is because Unicode and 8-bit string keys only work
> in the same way if and only if they are plain ASCII.

This is okay. But in the case where one is not ASCII I would prefer to 
be able to compare them (not equal) instead of getting a UnicodeError.
I know it's too late to change this, ...

> 
> The reason lies in the hash function used by Unicode: it is
> crafted to make hash(u) == hash(s) for all ASCII s, such
> that s == u.
> 
> For non-ASCII strings, there are no guarantees as to the
> hash value of the strings or whether they match or not.
> 
> This has been like that since Unicode was introduced, so it's
> not new in Python 2.5.
> 

...but in the case of dictionaries this behaviour has changed and in 
prior versions of python dictionaries did work as I expected them to.
Now they don't.

When working with unicode strings and (accidentally) mixing them with str
strings, things might seem to work until the first non-ASCII string
is given to some code and one gets that UnicodeDecodeError (e.g. when
comparing them).

If one mixes unicode strings and str strings as keys in a dictionary,
things might seem to work far longer, until one tries to put in some
non-ASCII string with the "wrong" hash value and suddenly things go boom.
I'd rather keep the pre-2.5 behaviour.

- Ralf


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Bob Ippolito


On Aug 3, 2006, at 9:51 AM, M.-A. Lemburg wrote:

> Ralf Schmitt wrote:
>> Ralf Schmitt wrote:
>>> Still trying to port our software. here's another thing I noticed:
>>>
>>> d = {}
>>> d[u'm\xe1s'] = 1
>>> d['m\xe1s'] = 1
>>> print d
>>>
>>> With python 2.4 I can add those two keys to the dictionary and get:
>>> $ python2.4 t2.py
>>> {u'm\xe1s': 1, 'm\xe1s': 1}
>>>
>>> With python 2.5 I get:
>>>
>>> $ python2.5 t2.py
>>> Traceback (most recent call last):
>>>   File "t2.py", line 3, in <module>
>>>     d['m\xe1s'] = 1
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1:
>>> ordinal not in range(128)
>>>
>>> Is this intended behaviour? I guess this might break lots of programs
>>> and the way python 2.4 works looks right to me.
>>> I think it should be possible to mix str/unicode keys in dicts and let
>>> non-ascii strings compare not-equal to any unicode string.
>>
>> Also this behaviour makes your programs break randomly, that is, it will
>> break when the string you add hashes to the same value that the unicode
>> string has (at least that's what I guess..)
>
> This is because Unicode and 8-bit string keys only work
> in the same way if and only if they are plain ASCII.
>
> The reason lies in the hash function used by Unicode: it is
> crafted to make hash(u) == hash(s) for all ASCII s, such
> that s == u.
>
> For non-ASCII strings, there are no guarantees as to the
> hash value of the strings or whether they match or not.
>
> This has been like that since Unicode was introduced, so it's
> not new in Python 2.5.

What is new is that the exception raised on "u == s" after hash  
collision is no longer silently swallowed.

-bob


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread M.-A. Lemburg

Ralf Schmitt wrote:
> Ralf Schmitt wrote:
>> Still trying to port our software. here's another thing I noticed:
>>
>> d = {}
>> d[u'm\xe1s'] = 1
>> d['m\xe1s'] = 1
>> print d
>>
>> With python 2.4 I can add those two keys to the dictionary and get:
>> $ python2.4 t2.py
>> {u'm\xe1s': 1, 'm\xe1s': 1}
>>
>> With python 2.5 I get:
>>
>> $ python2.5 t2.py
>> Traceback (most recent call last):
>>   File "t2.py", line 3, in <module>
>>     d['m\xe1s'] = 1
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: 
>> ordinal not in range(128)
>>
>> Is this intended behaviour? I guess this might break lots of programs 
>> and the way python 2.4 works looks right to me.
>> I think it should be possible to mix str/unicode keys in dicts and let 
>> non-ascii strings compare not-equal to any unicode string.
> 
> Also this behaviour makes your programs break randomly, that is, it will 
> break when the string you add hashes to the same value that the unicode 
> string has (at least that's what I guess..)

This is because Unicode and 8-bit string keys only work
in the same way if and only if they are plain ASCII.

The reason lies in the hash function used by Unicode: it is
crafted to make hash(u) == hash(s) for all ASCII s, such
that s == u.

For non-ASCII strings, there are no guarantees as to the
hash value of the strings or whether they match or not.

This has been like that since Unicode was introduced, so it's
not new in Python 2.5.
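
The invariant behind this design (objects that compare equal must hash
equal) can be illustrated with numbers. The sketch is written for
Python 3 and uses numeric types as a stand-in, an assumption made because
text and byte strings never compare equal there; the variable names are
arbitrary.

```python
from fractions import Fraction

# Four objects of different types that all compare equal to 1.
values = [1, 1.0, Fraction(1, 1), True]

all_equal = all(v == values[0] for v in values)
same_hash = len({hash(v) for v in values}) == 1

# Equal keys with equal hashes land in the same dictionary slot:
# the first key is kept and its value is overwritten each time.
d = {}
for v in values:
    d[v] = type(v).__name__

print(all_equal, same_hash, d)
```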

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] unicode hell/mixing str and unicode as dictionary keys

2006-08-03 Thread Ralf Schmitt

Ralf Schmitt wrote:
> Still trying to port our software. here's another thing I noticed:
> 
> d = {}
> d[u'm\xe1s'] = 1
> d['m\xe1s'] = 1
> print d
> 
> With python 2.4 I can add those two keys to the dictionary and get:
> $ python2.4 t2.py
> {u'm\xe1s': 1, 'm\xe1s': 1}
> 
> With python 2.5 I get:
> 
> $ python2.5 t2.py
> Traceback (most recent call last):
>   File "t2.py", line 3, in <module>
>     d['m\xe1s'] = 1
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 1: 
> ordinal not in range(128)
> 
> Is this intended behaviour? I guess this might break lots of programs 
> and the way python 2.4 works looks right to me.
> I think it should be possible to mix str/unicode keys in dicts and let 
> non-ascii strings compare not-equal to any unicode string.

Also this behaviour makes your programs break randomly, that is, it will 
break when the string you add hashes to the same value that the unicode 
string has (at least that's what I guess..)

- Ralf


