On 2017-11-16 10:23, Serhiy Storchaka wrote:
Currently the re module ignores only 6 ASCII whitespace characters in
re.VERBOSE mode:

       U+0009 CHARACTER TABULATION
       U+000A LINE FEED
       U+000B LINE TABULATION
       U+000C FORM FEED
       U+000D CARRIAGE RETURN
       U+0020 SPACE
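
For illustration, here is a minimal sketch of how this can be checked. The helper name ignored_in_verbose is made up for this example: it places one character between two literals in a verbose str pattern and tests whether the pattern still matches "ab" exactly.

```python
import re

# The 6 ASCII whitespace characters listed above.
ASCII_WHITESPACE = "\t\n\x0b\x0c\r "


def ignored_in_verbose(ch):
    """Return True if re.VERBOSE drops ch from a str pattern.

    Hypothetical helper for this example only: ch is placed between
    two literals, and the resulting pattern must match "ab" exactly.
    """
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None


for ch in ASCII_WHITESPACE:
    print(f"U+{ord(ch):04X}", ignored_in_verbose(ch))
```

The same helper can be called with any other character, for example "\x85", to see whether a given Python version treats it as ignorable.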

Perl ignores the characters that Unicode calls "Pattern White Space" in
/x mode. That adds 5 non-ASCII characters:

       U+0085 NEXT LINE
       U+200E LEFT-TO-RIGHT MARK
       U+200F RIGHT-TO-LEFT MARK
       U+2028 LINE SEPARATOR
       U+2029 PARAGRAPH SEPARATOR

The regex module simply ignores characters for which str.isspace() returns
True. Compared with Perl, it ignores 20 additional characters, most of
them non-ASCII, including the characters U+001C..U+001F, whose
classification as whitespace is questionable, but it doesn't ignore
LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.

       U+001C [FILE SEPARATOR]
       U+001D [GROUP SEPARATOR]
       U+001E [RECORD SEPARATOR]
       U+001F [UNIT SEPARATOR]
       U+00A0 NO-BREAK SPACE
       U+1680 OGHAM SPACE MARK
       U+2000 EN QUAD
       U+2001 EM QUAD
       U+2002 EN SPACE
       U+2003 EM SPACE
       U+2004 THREE-PER-EM SPACE
       U+2005 FOUR-PER-EM SPACE
       U+2006 SIX-PER-EM SPACE
       U+2007 FIGURE SPACE
       U+2008 PUNCTUATION SPACE
       U+2009 THIN SPACE
       U+200A HAIR SPACE
       U+202F NARROW NO-BREAK SPACE
       U+205F MEDIUM MATHEMATICAL SPACE
       U+3000 IDEOGRAPHIC SPACE

str.isspace() appears to be Unicode "White_Space" plus those 4 "questionable" code points.
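
The two lists can be checked against str.isspace() with a short sketch like the one below. The names PATTERN_WS_EXTRA, ISSPACE_EXTRA and is_space are made up for this example, and the results depend on the Unicode database shipped with the running Python.

```python
# The 5 non-ASCII "Pattern White Space" characters from the first list.
PATTERN_WS_EXTRA = [0x0085, 0x200E, 0x200F, 0x2028, 0x2029]

# The 20 further characters from the second list.
ISSPACE_EXTRA = (
    [0x001C, 0x001D, 0x001E, 0x001F, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))  # EN QUAD .. HAIR SPACE
    + [0x202F, 0x205F, 0x3000]
)


def is_space(cp):
    """Return str.isspace() for the character with code point cp."""
    return chr(cp).isspace()


for cp in PATTERN_WS_EXTRA + ISSPACE_EXTRA:
    print(f"U+{cp:04X}  isspace={is_space(cp)}")
```

If the description above is right, the only lines reporting isspace=False should be U+200E and U+200F.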

Is it worth extending the set of ignored whitespace characters to
"Pattern White Space"? Would it add any benefit? Or add confusion? Should
this depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in
verbose bytes patterns?
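
As a reference point for the bytes question, here is a small sketch that probes how a verbose bytes pattern treats b'\x85'. Since the re module ignores only the 6 ASCII characters listed earlier, I would expect the byte to be kept as a literal, but that is an assumption worth checking on the Python version at hand.

```python
import re

# Verbose bytes pattern with b'\x85' between two literals.
pattern = re.compile(b"a\x85b", re.VERBOSE)

# If b'\x85' were ignored, the pattern would match b"ab".
print(pattern.fullmatch(b"ab"))

# If it is kept as a literal, the pattern matches b"a\x85b".
print(pattern.fullmatch(b"a\x85b"))
```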

And there is a similar question about the Python parser. If Python uses
the Unicode definition of identifiers, shouldn't it accept non-ASCII
"Pattern White Space" characters as whitespace? There will be technical
problems with supporting this, but are there any benefits?
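
To see what the parser does today, a sketch along these lines can be used. The helper name compiles is made up for this example. It substitutes each non-ASCII "Pattern White Space" character for the spaces in a trivial assignment and reports whether compile() accepts the result.

```python
def compiles(source):
    """Return True if Python accepts source as a module."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# ASCII space between tokens, as a baseline.
print(compiles("x = 1"))

# The same statement with a non-ASCII "Pattern White Space"
# character in place of each space.
for cp in (0x0085, 0x200E, 0x200F, 0x2028, 0x2029):
    print(f"U+{cp:04X}", compiles(f"x{chr(cp)}={chr(cp)}1"))
```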


https://perldoc.perl.org/perlre.html
https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
https://unicode.org/L2/L2005/05012r-pattern.html
