[issue22407] re.LOCALE is nonsensical for Unicode

Serhiy Storchaka Tue, 16 Sep 2014 09:11:27 -0700

Serhiy Storchaka added the comment:

Yes, one solution is to deprecate re.LOCALE for unicode strings and then 
make it incompatible with unicode strings. But I think it would be better to 
implement locale-aware matching.


Example.

>>> import re
>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I): print(a, '~', b)
... 
I ~ i
I ~ İ
i ~ I
i ~ İ
İ ~ I
İ ~ i

This result is incorrect for Turkish: capital dotless "I" matches capital "İ" 
with dot above, and small dotless "ı" doesn't match anything.
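One way to see where these pairings come from is to look at the 
locale-independent case mappings that Python's str methods apply. This is 
only an illustration, under the assumption that re.I relies on comparable 
locale-independent mappings; the names chars and mappings are made up for 
this sketch.

```python
# The four characters from the example: I, i, İ (U+0130), ı (U+0131).
chars = 'Ii\u0130\u0131'

# Record the locale-independent lower/upper mapping of each character.
mappings = {c: (c.lower(), c.upper()) for c in chars}

for c, (low, up) in mappings.items():
    # ascii() makes combining characters visible as escapes.
    print(ascii(c), 'lower:', ascii(low), 'upper:', ascii(up))
```

No character here lowercases to "ı", and "ı" uppercases to plain "I", so 
the Turkish pairing of "I" with "ı" and of "i" with "İ" cannot come out of 
these mappings alone.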

The regex module produces more relevant output, which includes the matches 
for both Turkish and English:

I ~ i
I ~ ı
i ~ I
i ~ İ
İ ~ i
ı ~ I

With the tr_TR.utf8 locale (and with the patch applied):

>>> for a in 'Ii\u0130\u0131':
...     for b in 'Ii\u0130\u0131':
...         if a != b and re.match(a, b, re.I|re.L): print(a, '~', b)
... 
I ~ ı
i ~ İ
İ ~ i
ı ~ I

This is the correct result in Turkish.

Therefore there is a use case for this feature.
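For comparison, the Turkish pairing can be approximated in pure Python 
without re.LOCALE. This is a minimal sketch and not the patch itself: 
turkish_lower is a hypothetical helper that only handles the two 
Turkish-specific letters before falling back to str.lower().

```python
# Hypothetical helper, not part of re or of the patch discussed here.
# Turkish lowercasing: dotless capital I -> dotless small ı (U+0131),
# dotted capital İ (U+0130) -> dotted small i.
_TR_LOWER = str.maketrans({'I': '\u0131', '\u0130': 'i'})

def turkish_lower(s):
    """Lowercase s using the Turkish rules for I and İ."""
    return s.translate(_TR_LOWER).lower()

chars = 'Ii\u0130\u0131'

# Collect distinct characters that compare equal under Turkish lowercasing.
pairs = [(a, b) for a in chars for b in chars
         if a != b and turkish_lower(a) == turkish_lower(b)]

for a, b in pairs:
    print(a, '~', b)
```

This prints the same four pairs as the tr_TR.utf8 session above, but it 
compares whole strings only and does not give locale-aware matching inside 
a regular expression.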

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue22407>
_______________________________________