[issue31193] re.IGNORECASE strips combining character from lower case of LATIN CAPITAL LETTER I WITH DOT ABOVE

Matthew Barnett Mon, 14 Aug 2017 10:57:54 -0700

Matthew Barnett added the comment:

The re module works with codepoints; it doesn't understand canonical
equivalence.


For example, it doesn't recognise that "\N{LATIN CAPITAL LETTER E}\N{COMBINING 
ACUTE ACCENT}" is equivalent to "\N{LATIN CAPITAL LETTER E WITH ACUTE}".
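
A minimal sketch of that behaviour, plus one possible workaround: normalising
both the pattern and the text with unicodedata.normalize() before matching.
The helper name nfc() is made up for this illustration, and the normalisation
step is something the caller does, not something re does for you.

```python
import re
import unicodedata

decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"  # 2 codepoints
precomposed = "\N{LATIN CAPITAL LETTER E WITH ACUTE}"                # 1 codepoint


def nfc(text):
    """Hypothetical helper: return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# re compares codepoint by codepoint, so the two spellings don't match.
raw_match = re.fullmatch(precomposed, decomposed)

# After normalising both sides to the same form, they do.
normalised_match = re.fullmatch(nfc(precomposed), nfc(decomposed))

print(raw_match)                     # None
print(normalised_match is not None)  # True
```

This only helps for literal text. A pattern containing character classes or
escapes would need more care than simply normalising the pattern string.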

This is true for Python in general, except for identifiers, which are
normalised (to NFKC) when the source is parsed:

>>> "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"
'É'
>>> É = 0
>>> "\N{LATIN CAPITAL LETTER E WITH ACUTE}"
'É'
>>> É
0

This also means that '.', for example, will match only one _codepoint_, even
when a base letter is followed by a combining mark.
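
A short sketch of the '.' case, using the same decomposed string as above
(the variable names are only for illustration):

```python
import re

# "E" followed by a combining acute accent: one visible character, two codepoints.
decomposed = "\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}"

first = re.match(".", decomposed).group()    # '.' consumes only the base letter
one_dot = re.fullmatch(".", decomposed)      # a single '.' can't cover both codepoints
two_dots = re.fullmatch("..", decomposed)    # two are needed

print(len(decomposed))       # 2
print(first == "E")          # True
print(one_dot)               # None
print(two_dots is not None)  # True
```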

----------
nosy: +mrabarnett

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue31193>
_______________________________________
