New submission from Chris Adams:

I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'

In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)

----------
components: Regular Expressions
messages: 183705
nosy: acdha, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: IGNORECASE breaks unicode literal range matching
type: behavior
versions: Python 2.7, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17381>
_______________________________________
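
To check whether a particular interpreter shows the behavior reported above, the two re.sub calls can be compared in a small script. This is only a sketch, assuming Python 3; the helper name strip_cyrillic and the printed labels are made up for illustration and are not part of the original report.

```python
import re

# Same pattern as in the report: whitespace plus the range U+0400..U+0527.
PATTERN = r'[\s\u0400-\u0527]+'
SAMPLE = 'Архангельская губерния'


def strip_cyrillic(text, flags=0):
    """Hypothetical helper: delete whitespace and characters in the range."""
    return re.sub(PATTERN, '', text, flags=flags)


plain = strip_cyrillic(SAMPLE)
ignorecase = strip_cyrillic(SAMPLE, flags=re.IGNORECASE)

# If the two results differ, IGNORECASE changed how the range was matched,
# which is the behavior described in the report.
print('flags=0          ->', repr(plain))
print('flags=IGNORECASE ->', repr(ignorecase))
print('affected' if plain != ignorecase else 'not affected')
```

If the script reports "affected", the regex module mentioned in the report may be worth trying as a workaround.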