The Python docs say that re.LOCALE makes certain character classes "dependent on the current locale".

Here's what I currently see on my system:

>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
>>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

It seems wrong to me that re.LOCALE fails to give the "right" result when the locale's encoding is UTF-8; I think it should give the same result as re.UNICODE.
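
For comparison, here is a minimal sketch of the re.UNICODE behaviour I would expect re.LOCALE to match under a UTF-8 locale. The variable names (text, unicode_matches) are just for illustration, and the snippet deliberately avoids calling locale.setlocale so that it does not depend on which locales are installed:

```python
import re

# Same sample text as in the session above: three ASCII letters
# followed by three Latin-1 letters (a-ring, ae ligature, c-cedilla).
text = u'a b c \xe5 \xe6 \xe7'

# re.UNICODE classifies characters using the Unicode character
# database, so the result does not depend on the current locale.
unicode_matches = re.findall(r'\w', text, re.UNICODE)
print(unicode_matches)
```

With re.UNICODE all six letters should match, which is the result the session above only produces under re.LOCALE once the locale is switched to ISO 8859-1.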

Is this a bug, or does the documentation just need to be made clearer?