regular expressions and the LOCALE flag

Baz Walter Tue, 03 Aug 2010 11:00:07 -0700

the python docs say that re.LOCALE makes certain character classes "dependent on the current locale".


here's what i currently see on my system:


>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
>>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

it seems wrong to me that re.LOCALE fails to give the "right" result when the local encoding is utf8 - i think it should give the same result as re.UNICODE.
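
for comparison, here is a minimal sketch of the result i would expect, using re.UNICODE instead of re.L on the same string. the variable names are only for illustration. the byte-length check at the end is a guess at what might be relevant, not a confirmed explanation: the accented characters take one byte each in latin-1 but two bytes each in utf-8.

```python
import re

text = u'a b c \xe5 \xe6 \xe7'

# re.UNICODE classifies characters using the Unicode character database,
# so the result does not depend on the current locale setting
unicode_matches = re.findall(r'\w', text, re.UNICODE)
print(unicode_matches)

# number of bytes each accented character occupies in the two encodings
# used in the session above (assumption: this difference may matter to re.L)
accented = u'\xe5\xe6\xe7'
latin1_lengths = [len(ch.encode('latin-1')) for ch in accented]
utf8_lengths = [len(ch.encode('utf-8')) for ch in accented]
print(latin1_lengths)
print(utf8_lengths)
```

with re.UNICODE all six characters match regardless of what locale.setlocale() was last called with, which is the behaviour i was hoping re.L would give under a utf8 locale.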


is this a bug, or does the documentation just need to be made clearer?
