New submission from Chris Adams:

I noticed an interesting failure while using re.match / re.sub to look for
non-Cyrillic characters in allegedly Russian text. With re.IGNORECASE set, the
\u0400-\u0527 range in the character class no longer matches:

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния',
...        flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same is true in Python 2.7, although you need to use ur'' patterns for the
\u escapes to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния',
...        flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'


In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния',
...           flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)
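
For anyone who wants to check their own interpreter, the comparison above can be sketched as a small Python 3 script. The names PATTERN, SAMPLE, strip_matches and ignorecase_agrees are hypothetical helpers invented for this illustration, not part of re or regex. The script only reports whether IGNORECASE changes the result; it does not assume either outcome.

```python
import re

# Hypothetical names for illustration; the pattern and sample text are the
# ones used in the transcript above.
PATTERN = r'[\s\u0400-\u0527]+'
SAMPLE = 'Архангельская губерния'


def strip_matches(text, flags=0):
    """Remove every run of whitespace or U+0400-U+0527 characters."""
    return re.sub(PATTERN, '', text, flags=flags)


def ignorecase_agrees(text):
    """Return True if IGNORECASE gives the same result as no flags."""
    return strip_matches(text, re.IGNORECASE) == strip_matches(text)


if __name__ == '__main__':
    # Without flags, the whole sample is expected to be consumed.
    print(repr(strip_matches(SAMPLE)))
    # False here would indicate the behaviour described in this report.
    print('IGNORECASE agrees:', ignorecase_agrees(SAMPLE))
```

Since the range U+0400-U+0527 already contains both upper- and lowercase Cyrillic letters, calling strip_matches without flags may serve as a workaround when case-insensitivity is not otherwise needed.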

----------
components: Regular Expressions
messages: 183705
nosy: acdha, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: IGNORECASE breaks unicode literal range matching
type: behavior
versions: Python 2.7, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17381>
_______________________________________