regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter
the python docs say that re.LOCALE makes certain character classes 
"dependent on the current locale".


here's what i currently see on my system:

>>> import re, locale
>>> locale.getdefaultlocale()
('en_GB', 'UTF8')
>>> locale.getlocale()
(None, None)
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
'en_GB.ISO 8859-1'
>>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
[u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
[u'a', u'b', u'c']

it seems wrong to me that re.LOCALE fails to give the "right" result 
when the local encoding is utf8 - i think it should give the same result 
as re.UNICODE.
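
for comparison, re.UNICODE does match those letters - a quick check:

>>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.UNICODE)
[u'a', u'b', u'c', u'\xe5', u'\xe6', u'\xe7']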


is this a bug, or does the documentation just need to be made clearer?


Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB

Baz Walter wrote:
> the python docs say that re.LOCALE makes certain character classes
> "dependent on the current locale".
>
>
> here's what i currently see on my system:
>
>  >>> import re, locale
>  >>> locale.getdefaultlocale()
> ('en_GB', 'UTF8')
>  >>> locale.getlocale()
> (None, None)
>  >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
> [u'a', u'b', u'c']
>  >>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO 8859-1')
> 'en_GB.ISO 8859-1'
>  >>> re.findall(r'\w', u'\xe5 \xe6 \xe7 a b c', re.L)
> [u'\xe5', u'\xe6', u'\xe7', u'a', u'b', u'c']
>  >>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
> 'en_GB.UTF-8'
>  >>> re.findall(r'\w', u'a b c \xe5 \xe6 \xe7', re.L)
> [u'a', u'b', u'c']
>
> it seems wrong to me that re.LOCALE fails to give the "right" result
> when the local encoding is utf8 - i think it should give the same result
> as re.UNICODE.
>
>
> is this a bug, or does the documentation just need to be made clearer?


re.LOCALE just passes the character to the underlying C library. It
really only works on bytestrings which have 1 byte per character. UTF-8
encodes codepoints outside the ASCII range to multiple bytes per
codepoint, and the re module will treat each of those bytes as a
separate character.

And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
all those string literals starting with the 'u' prefix are Unicode
strings!
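
For example (a rough illustration in a Python 2 session):

>>> u'\xe5'.encode('utf-8')       # a UTF-8 bytestring: two bytes
'\xc3\xa5'
>>> u'\xe5'.encode('iso-8859-1')  # a 1-byte-per-character bytestring
'\xe5'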

Locale encodings are more trouble than they're worth. Unicode is better.
:-)


Re: regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter

On 03/08/10 19:40, MRAB wrote:

> Baz Walter wrote:
>
>> the python docs say that re.LOCALE makes certain character classes
>> "dependent on the current locale".
>
>
> re.LOCALE just passes the character to the underlying C library. It
> really only works on bytestrings which have 1 byte per character.


the re docs don't specify 8-bit encodings: they just refer to the 
'current locale'.



> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
> all those string literals starting with the 'u' prefix are Unicode
> strings!


not sure what you mean by this: if the string was encoded as utf8, '\w' 
still wouldn't match any of the non-ascii characters.



> Locale encodings are more trouble than they're worth. Unicode is better.
> :-)


yes, i'm really just trying to decide whether i should offer 'locale' as 
an option in my program. given the unintuitive way re.LOCALE works, i'm 
not sure that i should.


are you saying that it only really makes sense for *bytestrings* to be 
used with re.LOCALE?


if so, the re docs certainly don't make that clear.


Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB

Baz Walter wrote:

> On 03/08/10 19:40, MRAB wrote:
>
>> Baz Walter wrote:
>
>>> the python docs say that re.LOCALE makes certain character classes
>>> "dependent on the current locale".
>
>
>> re.LOCALE just passes the character to the underlying C library. It
>> really only works on bytestrings which have 1 byte per character.
>
>
> the re docs don't specify 8-bit encodings: they just refer to the
> 'current locale'.
>
>
>
>> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
>> all those string literals starting with the 'u' prefix are Unicode
>> strings!
>
>
> not sure what you mean by this: if the string was encoded as utf8, '\w'
> still wouldn't match any of the non-ascii characters.



Strings with the 'u' prefix are Unicode strings, not bytestrings. They
don't have an encoding. A UTF-8 string is a bytestring in which the
bytes represent Unicode codepoints encoded as UTF-8.
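
A quick way to see the difference (Python 2):

>>> s = u'\xe5'               # Unicode string: one character, no encoding
>>> b = s.encode('utf-8')     # UTF-8 bytestring: two bytes
>>> type(s), len(s)
(<type 'unicode'>, 1)
>>> type(b), len(b)
(<type 'str'>, 2)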


>> Locale encodings are more trouble than they're worth. Unicode is better.
>> :-)
>
>
> yes, i'm really just trying to decide whether i should offer 'locale' as
> an option in my program. given the unintuitive way re.LOCALE works, i'm
> not sure that i should.
>
>
> are you saying that it only really makes sense for *bytestrings* to be
> used with re.LOCALE?
>
>
> if so, the re docs certainly don't make that clear.


The re module can match against 3 types of string:

1. ASCII (default in Python 2): bytestring with characters in the ASCII
   range (1 byte per character). However, it doesn't complain if it sees
   bytes/characters outside the ASCII range.

2. LOCALE: bytestring with characters in the current locale (but only 1
   byte per character). Characters are categorised according to the
   underlying C library; for example, 'a' is a letter if isalpha('a')
   returns true.

3. UNICODE (default in Python 3): Unicode string.
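
For example (a rough sketch; it assumes an en_GB.ISO8859-1 locale is
installed, and the exact results may differ on your system):

>>> import re, locale
>>> re.findall(r'\w', 'a \xe5')                     # 1. ASCII (default)
['a']
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO8859-1')
'en_GB.ISO8859-1'
>>> re.findall(r'\w', 'a \xe5', re.LOCALE)          # 2. LOCALE, latin-1 bytes
['a', '\xe5']
>>> re.findall(r'\w', u'a \xe5', re.UNICODE)        # 3. UNICODE
[u'a', u'\xe5']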


Re: regular expressions and the LOCALE flag

2010-08-03 Thread Baz Walter

On 03/08/10 21:24, MRAB wrote:

>>> And, BTW, none of your examples pass a UTF-8 bytestring to re.findall:
>>> all those string literals starting with the 'u' prefix are Unicode
>>> strings!
>
>
>> not sure what you mean by this: if the string was encoded as utf8,
>> '\w' still wouldn't match any of the non-ascii characters.
>
>
> Strings with the 'u' prefix are Unicode strings, not bytestrings. They
> don't have an encoding.


well, they do if they are given one, as i suggested!

to be explicit, if the local encoding is 'utf8', none of the following 
will get a hit:


(1) re.findall(r'\w', '\xe5 \xe6 \xe7', re.L)
(2) re.findall(r'\w', u'\xe5 \xe6 \xe7'.encode('utf8'), re.L)
(3) re.findall(r'\w', u'\xe5 \xe6 \xe7', re.L)

so i still don't know what you meant about passing a 'UTF-8 bytestring' 
in your first comment :)


only (3) could feasibly get a hit - and then only if the re module was 
smart enough to fall back to re.UNICODE for utf8 (and any other 
encodings of unicode it might know about).



> 2. LOCALE: bytestring with characters in the current locale (but only 1
> byte per character). Characters are categorised according to the
> underlying C library; for example, 'a' is a letter if isalpha('a')
> returns true.


this is actually what my question was about. i suspected something like 
this might be the case, but i can't actually see it stated anywhere in 
the docs. maybe it's just me, but 'current locale' doesn't naturally 
imply 'only 8-bit encodings'. i would have thought it implied 'whatever 
encoding is discovered on the local system' - and these days, that's 
very commonly utf8.


is there actually a use case for it working the way it currently does? 
it seems just broken to have it depending so heavily on implementation 
details.



> 3. UNICODE (default in Python 3): Unicode string.


i've just read the python3 re docs, and they do now make an explicit 
distinction between matching bytes (with the new re.ASCII flag) and 
matching textual characters (i.e. unicode, the default). the re.LOCALE 
flag is still there, and there are now warnings about its unreliability 
- but it still doesn't state that it can only work properly if the local 
encoding is 8-bit.
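
e.g. on python 3 (rough sketch, not tested here):

>>> import re
>>> re.findall(r'\w', 'a \xe5')            # str pattern: unicode matching by default
['a', 'å']
>>> re.findall(r'\w', 'a \xe5', re.ASCII)  # restrict \w etc. to ASCII
['a']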



Re: regular expressions and the LOCALE flag

2010-08-03 Thread MRAB

Baz Walter wrote:
> On 03/08/10 21:24, MRAB wrote:
>>>> And, BTW, none of your examples pass a UTF-8 bytestring to
>>>> re.findall: all those string literals starting with the 'u' prefix
>>>> are Unicode strings!
>>>
>>> not sure what you mean by this: if the string was encoded as utf8,
>>> '\w' still wouldn't match any of the non-ascii characters.
>>>
>> Strings with the 'u' prefix are Unicode strings, not bytestrings.
>> They don't have an encoding.
>
> well, they do if they are given one, as i suggested!
>
> to be explicit, if the local encoding is 'utf8', none of the following
> will get a hit:
>
> (1) re.findall(r'\w', '\xe5 \xe6 \xe7', re.L)

This passes, for example, 0xE5 to the C library function isalnum() to
check whether it's alphanumeric. Apparently it's returning false when
the locale is set to UTF-8.

> (2) re.findall(r'\w', u'\xe5 \xe6 \xe7'.encode('utf8'), re.L)

u'\xe5' is encoded to '\xc3\xa5'. Both 0xC3 and 0xA5 are passed to the C
library function isalnum() to check whether they're alphanumeric.
Apparently it's returning false for both when the locale is set to
UTF-8.

> (3) re.findall(r'\w', u'\xe5 \xe6 \xe7', re.L)
>
Same as (1) above.
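
You can poke at the C library directly to see this (an untested sketch;
it assumes a Unix-like libc and that both locales are installed):

>>> import ctypes, ctypes.util, locale
>>> libc = ctypes.CDLL(ctypes.util.find_library('c'))
>>> locale.setlocale(locale.LC_ALL, 'en_GB.ISO8859-1')
'en_GB.ISO8859-1'
>>> bool(libc.isalnum(0xE5))
True
>>> locale.setlocale(locale.LC_ALL, 'en_GB.UTF-8')
'en_GB.UTF-8'
>>> bool(libc.isalnum(0xE5))
False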

> so i still don't know what you meant about passing a 'UTF-8
> bytestring' in your first comment :)
>
> only (3) could feasibly get a hit - and then only if the re module was
> smart enough to fall back to re.UNICODE for utf8 (and any other
> encodings of unicode it might know about).
>
LOCALE was really intended for all those 1-byte-per-character character
sets like CP1252. Trying to implement regex when different characters
occupy different numbers of bytes is, well, challenging! :-)

>> 2. LOCALE: bytestring with characters in the current locale (but only
>> 1 byte per character). Characters are categorised according to the
>> underlying C library; for example, 'a' is a letter if isalpha('a')
>> returns true.
>
> this is actually what my question was about. i suspected something
> like this might be the case, but i can't actually see it stated
> anywhere in the docs. maybe it's just me, but 'current locale' doesn't
> naturally imply 'only 8-bit encodings'. i would have thought it
> implied 'whatever encoding is discovered on the local system' - and
> these days, that's very commonly utf8.
>
> is there actually a use case for it working the way it currently does?
> it seems just broken to have it depending so heavily on implementation
> details.
>
As I said, it's for old-style 1-byte-per-character character sets. If
you have UTF-8, then you can decode to Unicode.

Is it broken? Well, it works well enough for its intended use. Could the
re module work with bytes which represent characters in an arbitrary
encoding? Would you like to have a go at implementing it? I wouldn't...

It would be easier to just decode to Unicode and work with that.
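
Something like this, for instance (just a sketch; the function name is
made up):

import re, locale

def find_words(raw_bytes):
    # decode using whatever encoding the current locale advertises,
    # then let re.UNICODE do the classification
    encoding = locale.getpreferredencoding()
    return re.findall(r'\w+', raw_bytes.decode(encoding), re.UNICODE)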

>> 3. UNICODE (default in Python 3): Unicode string.
>
> i've just read the python3 re docs, and they do now make an explicit
> distinction between matching bytes (with the new re.ASCII flag) and
> matching textual characters (i.e. unicode, the default). the re.LOCALE
> flag is still there, and there are now warnings about its
> unreliability - but it still doesn't state that it can only work
> properly if the local encoding is 8-bit.

The recommendation for text is to use UTF-8 externally (input, output
and storage in files) and Unicode internally when processing.