Josh Rosenberg <shadowranger+pyt...@gmail.com> added the comment:

The definition of \w, historically, has corresponded to the set of characters 
that can occur in legal variable names in C (alphanumeric ASCII plus 
underscores, making it equivalent to [a-zA-Z0-9_] for ASCII regex). That's why, 
on top of the definitely wordy alphabetic characters, and the arguably wordy 
numerics, it includes the underscore, _.

That definition predates Unicode entirely, and Python is just building on it by 
expanding the definition of "alphanumeric" to encompass all alphanumeric 
characters in Unicode.
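
To make that concrete, here is a minimal sketch, assuming Python 3 str 
patterns with no flags. The sample strings are illustrative only and do not 
come from the report:

```python
import re

# Illustrative samples: ASCII identifier, Latin letter with diacritic,
# Cyrillic word, Arabic-Indic digits.
samples = [
    "foo_bar42",
    "caf\u00e9",
    "\u0441\u043b\u043e\u0432\u043e",
    "\u0663\u0664",
]

# str patterns are Unicode-aware by default, so each sample is entirely \w.
unicode_hits = [re.fullmatch(r"\w+", s) is not None for s in samples]

# The underscore matches \w on its own; the hyphen does not.
underscore_ok = re.fullmatch(r"\w", "_") is not None
hyphen_ok = re.fullmatch(r"\w", "-") is not None

print(unicode_hits, underscore_ok, hyphen_ok)
```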

We definitely can't remove underscores from the definition without breaking 
existing code which assumes a common subset of PCRE support (every regex flavor 
I know of includes underscores in \w). Adding the zero width characters seems 
of limited benefit (especially in the non-joiner case; if you're trying to pull 
out words, presumably you don't want to group letters across a non-joining 
boundary?). Basically, you're reading "Unicode word characters" as "Unicode's 
definition of word characters", when it really means "all word characters, 
not just ASCII".
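
A rough sketch of the current behavior around the zero width non-joiner, 
assuming Python 3 str patterns. The Persian sample text is an illustrative 
choice, not taken from the report:

```python
import re

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# Two runs of Persian letters separated by a ZWNJ (illustrative sample).
first = "\u0645\u06cc"
second = "\u062e\u0648\u0627\u0647\u0645"
text = first + ZWNJ + second

# The ZWNJ itself is not matched by \w ...
zwnj_is_word = re.fullmatch(r"\w", ZWNJ) is not None

# ... so \w+ stops at it and yields two separate runs of letters.
runs = re.findall(r"\w+", text)

print(zwnj_is_word, len(runs))
```

Whether splitting there is the right outcome depends on the use case; the 
sketch only shows what \w does today.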

You omitted the clarifying remarks from the documentation, though. The full 
description is:

> Matches Unicode word characters; this includes most characters that can be 
> part of a word in any language, as well as numbers and the underscore. If the 
> ASCII flag is used, only [a-zA-Z0-9_] is matched.
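
A small sketch of the difference the ASCII flag makes, assuming Python 3 str 
patterns; the sample text is made up for illustration:

```python
import re

# Illustrative text containing non-ASCII letters and an underscore.
text = "na\u00efve_caf\u00e9 42"

# Default (Unicode) matching keeps the accented letters inside the word.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the accented letters
# act as separators.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# The inline flag (?a) is equivalent to passing re.ASCII.
inline_words = re.findall(r"(?a)\w+", text)

print(unicode_words)
print(ascii_words)
```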

That's about as precise as I think we can make it (because technically, some of 
the things that count as "word characters" aren't actually part of an 
"alphabet" in the technical definition). If you think there is a clearer way of 
expressing it, please suggest a better phrasing, and this can be fixed as a 
documentation bug.

----------
nosy: +josh.r

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38566>
_______________________________________