[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-24 Thread Benjamin Peterson

Benjamin Peterson  added the comment:

On Tue, Apr 24, 2018, at 04:33, Pekka Klärck wrote:
> 
> Pekka Klärck  added the comment:
> 
> I didn't submit this as a bug report but as an enhancement request. From a 
> usability point of view, saying that the results differ but you just cannot 
> see the difference is not very helpful.
> 
> The exact reason I didn't submit this as an enhancement request for 
> unittest, pytest, and all other modules/tools being affected is that 
> "I'm not sure if there's a good way to detect whether two unicode 
> strings are going to display confusingly similarly". Enhancing `repr()` 
> would be a logical solution to this problem.

I should have said "there's no way to unambiguously represent a particular 
unicode string except as a sequence of integers, which isn't normally what 
anyone wants to see". This decomposition problem is only one of many. Even in 
ASCII land, fonts often have very similar glyphs for "l", "I", and "1".
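
For illustration only, the "sequence of integers" view of the two strings from 
this issue can be produced with `ord()`. This is a minimal sketch of the idea, 
not a proposed output format; the variable names are made up for the example.

```python
# Composed (NFC) and decomposed (NFD) spellings of the same word.
composed = 'hyv\xe4'
decomposed = 'hyva\u0308'

# One integer per code point: unambiguous, but hard to read.
composed_points = [ord(ch) for ch in composed]
decomposed_points = [ord(ch) for ch in decomposed]

print(composed_points)    # [104, 121, 118, 228]
print(decomposed_points)  # [104, 121, 118, 97, 776]
```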

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-24 Thread Pekka Klärck

Pekka Klärck  added the comment:

I didn't submit this as a bug report but as an enhancement request. From a 
usability point of view, saying that the results differ but you just cannot see 
the difference is not very helpful.

The exact reason I didn't submit this as an enhancement request for unittest, 
pytest, and all other modules/tools being affected is that "I'm not sure if 
there's a good way to detect whether two unicode strings are going to display 
confusingly similarly". Enhancing `repr()` would be a logical solution to this 
problem.

Finally, would any harm be done if `repr('hyva\u0308')` were changed to return 
`'hyva\\u0308'`? I don't see it being any different from `repr('foo\x00')` 
being `'foo\\x00'`; in both cases you can `eval()` the result to get the 
original value back, as `repr()` is supposed to allow when possible. Most 
importantly, the result would show what the value actually contains, as you 
generally expect `repr()` to do.
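
The round-trip argument can be checked with a short sketch. Here 
`proposed_repr` is a hand-written stand-in for what the enhancement would 
return (current `repr()` does not produce it), and `ast.literal_eval` is used 
in place of `eval()` purely as a safer equivalent for string literals.

```python
import ast

original = 'hyva\u0308'
# Assumed output of the proposed repr(); written by hand for this example.
proposed_repr = "'hyva\\u0308'"

# The escaped literal evaluates back to the original string.
round_tripped = ast.literal_eval(proposed_repr)
assert round_tripped == original

# Existing behaviour for non-printable characters has the same property.
nul_repr = repr('foo\x00')
assert nul_repr == "'foo\\x00'"
assert ast.literal_eval(nul_repr) == 'foo\x00'
```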




[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-23 Thread Benjamin Peterson

Benjamin Peterson  added the comment:

As stated, the bug report is invalid: the repr _does_ differ, it's just not 
presented that way by however you're viewing the two reprs. Distinct codepoint 
sequences that look identical under certain circumstances can arise in many 
different ways with Unicode. repr's humble mission is to produce a Python 
literal equivalent to its argument, not to produce unambiguous representations 
of codepoint sequences after font rendering.

Possibly, this could be converted to a unittest RFE, but I'm not sure if 
there's a good way to detect whether two unicode strings are going to display 
confusingly similarly.
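
As a partial illustration, one narrow case can be detected: strings that are 
unequal but canonically equivalent. The helper name below is hypothetical, and 
the check says nothing about look-alike glyphs such as "l" and "I", so it is a 
sketch of one heuristic rather than a general answer.

```python
import unicodedata

def differ_only_by_normalization(a, b):
    """Hypothetical helper: True if a != b but their NFC forms are equal."""
    if a == b:
        return False
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

same_text = differ_only_by_normalization('hyv\xe4', 'hyva\u0308')  # True
lookalike = differ_only_by_normalization('l', 'I')                 # False
```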

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-20 Thread Pekka Klärck

Pekka Klärck  added the comment:

Thanks for pointing out `ascii()`. Seems to do exactly what I want.

`repr()` showing combining characters would, in my opinion, still be useful to 
avoid problems like the ones I demonstrated with unittest and pytest. I doubt 
it's a good idea for them to use `ascii()` instead of `repr()` by default, 
because on Python 3 the latter generally works much better with non-ASCII text.
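
For reference, this is what `ascii()` gives for the two strings from the 
original report on Python 3 (a small sketch; the variable names are only for 
the example):

```python
a = 'hyv\xe4'
b = 'hyva\u0308'

# ascii() escapes every non-ASCII character, so the two forms are visibly
# different even though repr() renders them alike.
ascii_a = ascii(a)
ascii_b = ascii(b)

print(ascii_a)  # 'hyv\xe4'
print(ascii_b)  # 'hyva\u0308'
```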




[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-20 Thread Serhiy Storchaka

Serhiy Storchaka  added the comment:

Use ascii() in Python 3 if you want the behavior of repr() in Python 2. It 
escapes all non-ascii characters.

But escaping only combining characters, in addition to non-printable 
characters, in repr() looks like an interesting idea.
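
A rough pure-Python sketch of that idea follows. The function name is 
hypothetical and this is not how `repr()` is implemented; it post-processes 
the normal `repr()` output and treats a character as combining when 
`unicodedata.combining()` returns a non-zero class, which is an assumption 
that misses marks whose combining class is 0.

```python
import unicodedata

def repr_escaping_combining(s):
    """Hypothetical helper: repr(s) with combining characters escaped."""
    pieces = []
    for ch in repr(s):
        if unicodedata.combining(ch):
            cp = ord(ch)
            # Same escape style Python uses for str literals.
            pieces.append('\\u%04x' % cp if cp <= 0xFFFF else '\\U%08x' % cp)
        else:
            pieces.append(ch)
    return ''.join(pieces)

nfd_repr = repr_escaping_combining('hyva\u0308')
nfc_repr = repr_escaping_combining('hyv\xe4')
print(nfd_repr)  # 'hyva\u0308'
```

The precomposed string is left untouched, so the two forms no longer look the 
same in the output.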

--
components: +Interpreter Core, Unicode
nosy: +benjamin.peterson, ezio.melotti, lemburg, serhiy.storchaka, vstinner
type:  -> enhancement
versions: +Python 3.8 -Python 3.4, Python 3.5, Python 3.6




[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-20 Thread Pekka Klärck

Pekka Klärck  added the comment:

Forgot to mention that this doesn't affect Python 2:

>>> a = u'hyv\xe4'
>>> b = u'hyva\u0308'
>>> print(repr(a))
u'hyv\xe4'
>>> print(repr(b))
u'hyva\u0308'


In addition to hoping `repr()` will be enhanced in future Python 3 versions, 
I'm also looking for a way to show differences between strings that look the 
same but are different. Currently the best I've found is this:

>>> print('hyva\u0308'.encode('unicode_escape').decode('ASCII'))
hyva\u0308

--
versions: +Python 3.4, Python 3.5, Python 3.6




[issue33317] `repr()` of string in NFC and NFD forms does not differ

2018-04-20 Thread Pekka Klärck

New submission from Pekka Klärck :

If I have two strings that look the same but are in different Unicode 
normalization forms, it's very hard to see where the problem actually is:

>>> a = 'hyv\xe4'
>>> b = 'hyva\u0308'
>>> print(a)
hyvä
>>> print(b)
hyvä
>>> a == b
False
>>> print(repr(a))
'hyvä'
>>> print(repr(b))
'hyvä'

This affects, for example, test automation frameworks using `repr()` in error 
reporting. For example, both unittest and pytest report 
`self.assertEqual('hyv\xe4', 'hyva\u0308')` like this:

AssertionError: 'hyvä' != 'hyvä'
- hyvä
+ hyvä

Because the NFC form is used by strings by default, I propose that `repr()` 
show the decomposed form if the string is in NFD. In practice I'd like 
`repr('hyva\u0308')` to yield `'hyva\\u0308'`.
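
For readers unfamiliar with the two forms, the relationship between the 
strings above can be confirmed with `unicodedata.normalize` (a minimal sketch; 
the variable names are only for the example):

```python
import unicodedata

a = 'hyv\xe4'       # precomposed: U+00E4
b = 'hyva\u0308'    # decomposed: 'a' followed by U+0308 COMBINING DIAERESIS

# Normalizing either string to the other's form makes them compare equal.
nfd_of_a = unicodedata.normalize('NFD', a)
nfc_of_b = unicodedata.normalize('NFC', b)

print(nfd_of_a == b)    # True
print(nfc_of_b == a)    # True
print(len(a), len(b))   # 4 5
```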

--
messages: 315504
nosy: pekka.klarck
priority: normal
severity: normal
status: open
title: `repr()` of string in NFC and NFD forms does not differ
