[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-22 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Thank you, Ma Lin. Closed as a duplicate of issue17252.
--
nosy: +serhiy.storchaka
resolution: -> duplicate
stage: -> resolved
status: open -> closed
superseder: -> Latin Capital Letter I with Dot Above

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-21 Thread Ma Lin
Ma Lin added the comment: There was a discussion about "Latin Capital Letter I with Dot Above" https://bugs.python.org/issue17252 -- nosy: +Ma Lin

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-21 Thread INADA Naoki
INADA Naoki added the comment: Maybe we should update UnicodeData?

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Steven D'Aprano
Steven D'Aprano added the comment: It has never been the case that upper() or lower() are guaranteed to preserve string length in Unicode. For example, some characters decompose into a base plus combining characters. Ligatures are another example. See here for more
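To illustrate the point about case mappings not preserving length, here is a minimal sketch that scans every code point and records those whose `str.lower()` or `str.upper()` result is not exactly one code point long. The helper name `length_changing` is made up for this example, and the exact lists depend on the Unicode database shipped with the interpreter.

```python
import sys


def length_changing(method_name):
    """Return the code points whose case mapping is not one code point long.

    method_name is "lower" or "upper" (hypothetical helper, not a stdlib API).
    """
    result = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if len(getattr(ch, method_name)()) != 1:
            result.append(cp)
    return result


lower_expanders = length_changing("lower")
upper_expanders = length_changing("upper")

# U+0130 (decimal 304) grows when lowercased; U+00DF (ß) and the
# ligature U+FB01 (ﬁ) grow when uppercased.
print(304 in lower_expanders, 223 in upper_expanders, 0xFB01 in upper_expanders)
```

The scan covers about 1.1 million code points per call, so it takes a moment to run, but it avoids hard-coding any list of special cases.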

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Kiril Dimitrov
Kiril Dimitrov added the comment: This is roughly my use case:
zip("ßx", [0.5, 0.3]) is [('ß', 0.5), ('x', 0.3)]
zip("ßx".upper(), [0.5, 0.3]) will be [('S', 0.5), ('S', 0.3)]
In the latter case you never get to see the value for 'x'. At least my expectation was that
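One possible way to keep the values aligned in a use case like this is to pair each original character with its value first and apply the case mapping per character afterwards. This is only a sketch: `upper_with_weights` is a hypothetical helper, not something proposed in the thread, and per-character mapping can differ from whole-string mapping where the rules are context-sensitive (for example Greek final sigma when lowercasing).

```python
def upper_with_weights(text, weights):
    """Pair each character with its weight, then uppercase each character.

    A character that expands (such as 'ß' -> 'SS') stays a single entry,
    so no weight is dropped or shifted.
    """
    return [(ch.upper(), w) for ch, w in zip(text, weights)]


pairs = upper_with_weights("ßx", [0.5, 0.3])
print(pairs)  # [('SS', 0.5), ('X', 0.3)]
```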

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread INADA Naoki
INADA Naoki added the comment: Another example:
>>> s = "ß"
>>> len(s)
1
>>> len(s.upper())
2
>>> s.upper()
'SS'
>>> ord(s)
223
> This breaks unicode text matching.
What are you talking about? The re module? -- nosy: +inada.naoki

[issue33108] Unicode char 304 in lowercase has len = 2

2018-03-20 Thread Kiril Dimitrov
Change by Kiril Dimitrov: -- title: Unicode char 304 in lowercase has len 2 -> Unicode char 304 in lowercase has len = 2

[issue33108] Unicode char 304 in lowercase has len 2

2018-03-20 Thread Kiril Dimitrov
New submission from Kiril Dimitrov:
>>> chr(304)
'İ'
>>> chr(304).lower()
'i̇'
>>> len(chr(304).lower())
2
This breaks Unicode text matching. There is no other Unicode character with the same behaviour (in 3.6.2 and 3.6.4). -- components: Unicode messages:
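For readers wondering where the second code point comes from, a small sketch using the standard `unicodedata` module can name the pieces of the lowercased string. The variable names are chosen for this example only.

```python
import unicodedata

original = chr(304)
lowered = original.lower()

# Name the source character and each code point of its lowercase form.
original_name = unicodedata.name(original)
lowered_names = [unicodedata.name(c) for c in lowered]

print(original_name)
print(lowered_names)
```

The lowercase form is an ordinary 'i' followed by U+0307 COMBINING DOT ABOVE, which is why `len()` reports 2 even though the result renders as a single glyph.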