[issue45869] Unicode and acii regular expressions do not agree on ascii space characters

Steven D'Aprano Mon, 22 Nov 2021 23:23:01 -0800


Steven D'Aprano <[email protected]> added the comment:


Hi Joran,

I'm not sure why you think that \s should agree between ASCII and Unicode. That 
seems like an unjustified assumption to me.

You say: "The expectation would be that the re.A (or re.ASCII) flag should not 
impact the matching behavior of a regular expression on strings consisting only 
of ASCII characters."

But I'm not sure why you have that expectation. Is it documented somewhere? The 
docs clearly say that for character classes, "the characters they match depends 
on whether ASCII or LOCALE mode is in force." I am unable to find anything that 
says that the differences are limited only to non-ASCII code points.
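
To see exactly which ASCII code points are affected, here is a minimal sketch 
that compares \s with and without re.ASCII over the whole ASCII range. The 
names unicode_ws, ascii_ws and differing are mine, purely for illustration:

```python
import re

# \s compiled in the default (Unicode) mode for str patterns, and in ASCII mode.
unicode_ws = re.compile(r'\s')
ascii_ws = re.compile(r'\s', re.ASCII)

# Collect every ASCII code point on which the two modes disagree.
differing = [
    chr(cp)
    for cp in range(128)
    if bool(unicode_ws.fullmatch(chr(cp))) != bool(ascii_ws.fullmatch(chr(cp)))
]

print([hex(ord(ch)) for ch in differing])
```

On the CPython versions I have tried, the only ASCII code points where the two 
modes differ are '\x1c' through '\x1f'.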

I don't think there is any standard definition of "whitespace", either in the 
ASCII standard or across the many different regex engines (Perl, .NET, Java, 
ECMAScript, etc).

Unicode does have an official whitespace character property, and as far as I 
can see '\x1c' through '\x1f' (File Separator, Group Separator, Record 
Separator and Unit Separator) are not considered whitespace:

https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace

But the str.isspace() method does consider them as whitespace, while 
bytes.isspace() does not.


>>> '\x1c'.isspace()
True
>>> b'\x1c'.isspace()
False
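
As I read the docs for str.isspace(), a character counts as whitespace if its 
general category is Zs or its bidirectional class is one of WS, B or S. That 
would explain the result above. A rough sketch to check this for the four 
separator characters, using the unicodedata module (the rows list is only 
there for illustration):

```python
import unicodedata

# For each of the four ASCII separators, record the general category, the
# bidirectional class, and what str.isspace() and bytes.isspace() report.
rows = []
for cp in range(0x1c, 0x20):
    ch = chr(cp)
    rows.append((
        hex(cp),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        ch.isspace(),
        bytes([cp]).isspace(),
    ))

for row in rows:
    print(row)
```

If my reading is right, each row should show category Cc with a bidirectional 
class of B or S, True for str.isspace() and False for bytes.isspace().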

----------
nosy: +steven.daprano

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue45869>
_______________________________________