New submission from James Gerity <snoopj...@gmail.com>:

The documentation for the `re` library¹ describes the behavior of the specifier 
'\w' as matching "Unicode word characters," which is very vague. The closest 
thing I can find that corresponds to this language is the guidance offered in 
Unicode Technical Standard #18², which defines the class `<word_character>` to 
include all alphabetic and decimal codepoints, as well as U+200C ZERO WIDTH 
NON-JOINER and U+200D ZERO WIDTH JOINER. This does not appear to be a correct 
description of `re`, however, as these zero-width characters are not counted 
when matching '\w', e.g.:

```
>>> import re
>>> re.match(r'\w*', 'Auf\u200Clage')
<re.Match object; span=(0, 3), match='Auf'>
```

It seems from examining the CPython source³ that SRE treats '\w' as meaning any 
alphanumeric character OR U+005F LOW LINE (the underscore), which does not match 
any Unicode class definition I've been able to find.
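
The rule inferred above can be checked against a handful of code points. The following is only a minimal sketch: `is_word_guess` is a hypothetical helper that encodes my reading of the source (alphanumeric or underscore), not an official definition, and the sample characters are arbitrary picks for illustration.

```python
import re

WORD = re.compile(r'\w')

def is_word_guess(ch: str) -> bool:
    """Hypothetical helper: the rule inferred from _sre.c, not a documented one."""
    return ch.isalnum() or ch == '_'

# Arbitrary sample code points, chosen for illustration only.
samples = {
    'A': 'LATIN CAPITAL LETTER A',
    '\u00e9': 'LATIN SMALL LETTER E WITH ACUTE',
    '5': 'DIGIT FIVE',
    '\u0663': 'ARABIC-INDIC DIGIT THREE',
    '_': 'LOW LINE',
    '-': 'HYPHEN-MINUS',
    '\u0301': 'COMBINING ACUTE ACCENT',
    '\u200c': 'ZERO WIDTH NON-JOINER',
    '\u200d': 'ZERO WIDTH JOINER',
}

# Compare what the compiled pattern does with what the guessed rule predicts.
for ch, name in samples.items():
    matched = WORD.fullmatch(ch) is not None
    print(f'U+{ord(ch):04X} {name}: \\w={matched}, guess={is_word_guess(ch)}')

# Collect any sample where the pattern and the guessed rule disagree.
mismatches = [ch for ch in samples
              if (WORD.fullmatch(ch) is not None) != is_word_guess(ch)]
print('mismatches:', mismatches)
```

If the guess holds, the list of mismatches is empty, and both joiners are reported as non-matching, which is where the behavior departs from the UTS #18 class.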

Can anyone clarify which part of Unicode this documentation is referring to? 
If there is some other definition, the documentation should cite it 
specifically, preferably with a link. If instead the documentation is 
incorrect, this language should be changed to describe the actual meaning of 
'\w'.

¹ https://docs.python.org/3/library/re.html#index-32
² http://unicode.org/reports/tr18/
³ https://github.com/python/cpython/blob/master/Modules/_sre.c#L125

----------
assignee: docs@python
components: Documentation
messages: 355239
nosy: docs@python, snoopjedi
priority: normal
severity: normal
status: open
title: Description of '\w' behavior is vague in `re` documentation

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38566>
_______________________________________