New submission from James Gerity <snoopj...@gmail.com>:

The documentation for the `re` library¹ describes the behavior of the specifier '\w' as matching "Unicode word characters," which is very vague. The closest thing I can find that corresponds to this language is the guidance offered in Unicode Technical Standard #18², which defines the class `<word_character>` to include all alphabetic and decimal codepoints, as well as U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER. This does not appear to be a correct description of `re`, however, as these zero-width characters are not counted when matching '\w', e.g.:
```
>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>
```

It seems from examining the CPython source³ that SRE treats '\w' as meaning any alphanumeric character OR U+005F LOW LINE (the underscore), which does not match any Unicode class definition I've been able to find.

Can anyone provide clarification on what part of Unicode this documentation is referring to? If there is some other definition, the documentation should refer to it specifically (preferably with a link). If instead the documentation is incorrect, the language should be changed to describe the true meaning of \w.

¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125

----------
assignee: docs@python
components: Documentation
messages: 355239
nosy: docs@python, snoopjedi
priority: normal
severity: normal
status: open
title: Description of '\w' behavior is vague in `re` documentation

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38566>
_______________________________________
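
For anyone who wants to check the reading of the SRE source given in the report, here is a minimal sketch. It assumes that, for `str` patterns, '\w' behaves like `str.isalnum()` plus the underscore. The names `is_word_char` and `find_mismatches` are hypothetical helpers written for this illustration and are not part of `re`.

```python
import re

# Compiled once; str patterns use Unicode matching by default in Python 3.
WORD = re.compile(r"\w")


def is_word_char(ch: str) -> bool:
    """Hypothetical helper: the assumed rule, alphanumeric OR underscore."""
    return ch.isalnum() or ch == "_"


def find_mismatches(code_points):
    """Return the code points where the assumed rule and re disagree."""
    return [
        cp
        for cp in code_points
        if is_word_char(chr(cp)) != bool(WORD.fullmatch(chr(cp)))
    ]


if __name__ == "__main__":
    # Compare the two on a few characters, including the zero-width joiners.
    for ch in ("A", "_", "\u00e9", "\u200c", "\u200d"):
        print(f"U+{ord(ch):04X}", is_word_char(ch), bool(WORD.fullmatch(ch)))
    # List any disagreements in a sample range of code points.
    print(find_mismatches(range(0x3000)))
```

If the assumption holds, the mismatch list is empty for whatever range is passed in, and U+200C and U+200D are rejected by both checks, unlike the UTS #18 definition of `<word_character>`.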