New submission from Serhiy Storchaka:

Locale-specific case-insensitive regular expression matching works only when the pattern was compiled under the same locale that is in effect at match time. Because compiled patterns are cached, this can cause unexpected results.
The attached script demonstrates this (it requires two locales: ru_RU.koi8-r and ru_RU.cp1251). The output is:

locale ru_RU.koi8-r
  b'1\xa3' ('1ё') matches b'1\xb3' ('1Ё')
  b'1\xa3' ('1ё') doesn't match b'1\xbc' ('1╪')
locale ru_RU.cp1251
  b'1\xa3' ('1Ј') doesn't match b'1\xb3' ('1і')
  b'1\xa3' ('1Ј') matches b'1\xbc' ('1ј')
locale ru_RU.cp1251
  b'2\xa3' ('2Ј') doesn't match b'2\xb3' ('2і')
  b'2\xa3' ('2Ј') matches b'2\xbc' ('2ј')
locale ru_RU.koi8-r
  b'2\xa3' ('2ё') doesn't match b'2\xb3' ('2Ё')
  b'2\xa3' ('2ё') matches b'2\xbc' ('2╪')

Under the KOI8-R locale, b'\xa3' matches b'\xb3' if the pattern was compiled under the KOI8-R locale, but matches b'\xbc' if the pattern was compiled under the CP1251 locale.

I see three possible ways to solve this issue:

1. Avoid caching locale-dependent case-insensitive patterns. This will definitely decrease the performance of locale-dependent case-insensitive regexps (unless the user does their own caching) and may slightly decrease the performance of other regexps.

2. Clear the cache of precompiled regexps on every locale change. This may look simpler, but it is vulnerable to locale changes made from extensions.

3. Do not lowercase characters at compile time (in locale-dependent case-insensitive patterns). This requires introducing a new opcode for case-insensitive matching, or at least rewriting the implementation of the current opcodes (less efficient). On the other hand, this is a more correct implementation than the current one. The problem is that it is incompatible with distributions that update only the Python library but not the statically linked binary (e.g. Vim with Python support). Maybe there are some workarounds.
----------
components: Extension Modules, Library (Lib), Regular Expressions
files: re_locale_caching.py
messages: 226874
nosy: ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Locale dependent regexps on different locales
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file36616/re_locale_caching.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22410>
_______________________________________