Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
21.11.17 04:20, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>
>  > I agree. But if there is a special part of the Unicode standard for
>  > Pattern White Spaces which includes non-ASCII characters, perhaps
>  > there is a need for them. I asked in case Python developers from
>  > very different cultures need additional whitespaces in regular
>  > expressions, but I don't know. It seems nobody has claimed such a
>  > need.
>
> I doubt that Japanese would want it. I do use \N{IDEOGRAPHIC SPACE} a
> bit as a *target* of regular expressions, but I would never want it as
> non-syntactic in re.VERBOSE. (Of course, I'm not a native Japanese,
> but I have never heard a Japanese developer wish for use of that
> character in any programming language, outside of literal strings.)
>
>  > In particular, I don't know how helpful supporting right-to-left
>  > and left-to-right marks in verbose regular expressions would be
>
> That's a good question. Interpretation and display of R2L in
> programming constructs came up briefly in the discussions about BIDI
> on the emacs-devel list. I'll ask Eli Zaretskii, who implemented it
> for Emacs.

Thank you Stephen.

I would prefer not to change anything (because supporting additional whitespaces would complicate and slow down the code, could add subtle bugs, and would likely confuse users). But I want to know whether there is a real need for supporting additional whitespaces and RTL and LTR marks in regular expressions and Python syntax.

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
Serhiy Storchaka writes:

 > I agree. But if there is a special part of the Unicode standard for
 > Pattern White Spaces which includes non-ASCII characters, perhaps
 > there is a need for them. I asked in case Python developers from
 > very different cultures need additional whitespaces in regular
 > expressions, but I don't know. It seems nobody has claimed such a
 > need.

I doubt that Japanese would want it. I do use \N{IDEOGRAPHIC SPACE} a bit as a *target* of regular expressions, but I would never want it as non-syntactic in re.VERBOSE. (Of course, I'm not a native Japanese, but I have never heard a Japanese developer wish for use of that character in any programming language, outside of literal strings.)

 > In particular, I don't know how helpful supporting right-to-left
 > and left-to-right marks in verbose regular expressions would be

That's a good question. Interpretation and display of R2L in programming constructs came up briefly in the discussions about BIDI on the emacs-devel list. I'll ask Eli Zaretskii, who implemented it for Emacs.

Steve

-- 
Associate Professor              Division of Policy and Planning Science
http://turnbull/sk.tsukuba.ac.jp/     Faculty of Systems and Information
Email: turnb...@sk.tsukuba.ac.jp                   University of Tsukuba
Tel: 029-853-5175                 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN
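To make the "target, not layout" distinction concrete, here is a small sketch of matching U+3000 on purpose inside a verbose pattern. The sample text and the names `sample` and `pair` are made up for illustration; the only assumption about the library is that re interprets `\uXXXX` escapes in str patterns, which it documents.

```python
import re

# Hypothetical sample: two words separated by U+3000 IDEOGRAPHIC SPACE.
sample = "\u6771\u4eac\N{IDEOGRAPHIC SPACE}\u5927\u962a"

# Raw string: the re module itself interprets the \u3000 escape, so the
# character stays syntactic while the layout spaces and comments are ignored.
pair = re.compile(r"""
    (\w+)     # first word
    \u3000    # one ideographic space, written as an escape
    (\w+)     # second word
""", re.VERBOSE)

match = pair.fullmatch(sample)
words = match.groups() if match else ()
print(words)
```

Writing the character as an escape keeps the intent visible regardless of which bare characters re.VERBOSE decides to skip.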
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
For consistency, we should probably have "whitespace" for re equal to whatever "\s" matches, since this is what the engine itself considers as whitespace (and then also covers the special case where you use the re.ASCII flag).

Still, the only practical case I could imagine, where extending the list would indeed make sense, is to have the no-break space character (\u00a0) qualify as whitespace for re.VERBOSE, since this can sometimes be introduced via copy&paste from other sources (e.g. web pages showing a regular expression).

Due to whitespace being what it is, it's hard to tell whether you've just copied a \u0020 or a \u00a0. The latter can easily render the regular expression non-working with the current interpretation of re.VERBOSE.

-- 
Marc-Andre Lemburg
eGenix.com
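The copy&paste hazard can be reproduced in a few lines. This is only a sketch: the pattern and the names `template`, `with_space`, `with_nbsp` and `matches` are invented for the example, and the outcome relies on re.VERBOSE ignoring just the six ASCII whitespace characters, as described in this thread.

```python
import re

# Same pattern text twice; only the separator character differs.
template = "(\\d+){sp}-{sp}(\\d+)"
with_space = template.format(sp="\u0020")  # ordinary SPACE
with_nbsp = template.format(sp="\u00a0")   # NO-BREAK SPACE, e.g. pasted from a web page

def matches(pattern, text):
    """Return True if the verbose pattern matches the whole text."""
    return re.fullmatch(pattern, text, re.VERBOSE) is not None

space_ok = matches(with_space, "10-20")                 # spaces are skipped as layout
nbsp_ok = matches(with_nbsp, "10-20")                   # NBSP is treated as a literal
nbsp_literal = matches(with_nbsp, "10\u00a0-\u00a020")  # ...so it must appear in the text
print(space_ok, nbsp_ok, nbsp_literal)
```

On screen the two patterns look identical, which is exactly why the failing one is hard to debug.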
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
20.11.17 10:13, Stephen J. Turnbull wrote:
> Otherwise I'm with Paul, who writes:
>
>  > My instinct is not to worry about it unless someone has actually hit
>  > the issue in practice and raised a bug.
>
> After the tabs vs. spaces fiasco, I lean steeply to the right for
> code -- including embedded languages like regexes. *We* say what is
> allowed there, and *you* can find an editor that does it our way.

I agree. But if there is a special part of the Unicode standard for Pattern White Spaces which includes non-ASCII characters, perhaps there is a need for them. I asked in case Python developers from very different cultures need additional whitespaces in regular expressions, but I don't know. It seems nobody has claimed such a need.

In particular, I don't know how helpful supporting right-to-left and left-to-right marks in verbose regular expressions (or even in Python code) would be. Or would it just add confusion? Unicode identifiers can already be misused to cause confusion because of homoglyphs. The problem is not that a correct-looking program can be rejected by the compiler, but that the program can work differently from what is expected because it uses different names that look the same.
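The homoglyph point can be illustrated with two one-letter names. This is a hedged sketch, not something from the thread: the source text is built from escapes so the difference is visible, and `source`, `namespace` and `names` are illustrative names only.

```python
import unicodedata

# Source text built from escapes so the two different letters are visible here.
# The second name is U+0430 CYRILLIC SMALL LETTER A, which looks like Latin "a".
source = "a = 1\n\u0430 = 2\n"

namespace = {}
exec(source, namespace)

# Two distinct bindings exist even though the names render alike.
names = sorted(k for k in namespace if k != "__builtins__")
for name in names:
    print(ascii(name), unicodedata.name(name), namespace[name])
```

The compiler accepts both assignments without complaint; the second one simply creates a new variable instead of rebinding the first.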
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
All,

If we *do* seriously consider adding those characters as ignorable in re.VERBOSE, that's because somebody is using them in text a lot, and it slips into their coding. Given frequent use, we should consider how a lot more whitespace characters can be conveniently searched individually and readably, because whitespace characters are the ultimate confusables. This may be a no-op, given the \N and \u notations, but \u is pretty opaque and \N leads to character-per-line regexes. ;-)

Otherwise I'm with Paul, who writes:

 > My instinct is not to worry about it unless someone has actually hit
 > the issue in practice and raised a bug.

After the tabs vs. spaces fiasco, I lean steeply to the right for code -- including embedded languages like regexes. *We* say what is allowed there, and *you* can find an editor that does it our way.

The point of re.VERBOSE is to allow writing regexes the way we write Python code, formatting to emphasize structure and improve readability. I don't see why we would want to allow more than we already do, given that any fancy whitespace formatting for "literate programming" will be done by the code formatting engine of the document preparation system anyway.
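One way to keep such searches readable without one \N escape per line is to build the character class from character names. The following is a sketch under stated assumptions: `SPACE_NAMES`, `char_class`, `FANCY_SPACE` and `normalize_spaces` are hypothetical helpers, and the `\uXXXX` form used here only covers characters in the Basic Multilingual Plane.

```python
import re
import unicodedata

# Hypothetical selection of "fancy" spaces to search for.
SPACE_NAMES = [
    "NO-BREAK SPACE",
    "EN SPACE",
    "EM SPACE",
    "THIN SPACE",
    "IDEOGRAPHIC SPACE",
]

def char_class(names):
    """Build a regex character class of \\uXXXX escapes from Unicode names."""
    escapes = ("\\u%04x" % ord(unicodedata.lookup(name)) for name in names)
    return "[" + "".join(escapes) + "]"

FANCY_SPACE = char_class(SPACE_NAMES)

def normalize_spaces(text):
    """Replace each listed space character with a plain U+0020 SPACE."""
    return re.sub(FANCY_SPACE, " ", text)

print(FANCY_SPACE)
print(normalize_spaces("a\u00a0b\u3000c"))
```

The names stay greppable in the source, while the pattern handed to re contains only escapes, so it is unaffected by whatever re.VERBOSE treats as ignorable.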
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
I put the actual space characters here so you can see them in a non-proportional font (which I assume most Python programmers use).

https://gist.github.com/stephanh42/7c1c122154fd3f26d864233a40d8

The control characters aren't rendered at all (Vim renders them as ^\ ^] ^^ ^_, respectively). Most of the other spaces are rendered exactly like the normal space. The only ones which render differently are

U+1680 | | OGHAM SPACE MARK
U+3000 | | IDEOGRAPHIC SPACE

I understand Ogham has recently (since 6th century CE) seen a decline in popularity.

However, I think Python should totally adopt U+3000 as a new whitespace character and start promoting it as the One True Way to indent code, so as to finally end the age-old spaces vs tabs conflict.

[That was supposed to be a joke.]

Stephan
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
I don't think that we need more than space (U+0020) and Unix newline (U+000A) ;-)

Victor
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
17.11.17 00:09, MRAB wrote:
> On 2017-11-16 21:44, Serhiy Storchaka wrote:
>> 16.11.17 19:38, Guido van Rossum wrote:
>>> Who would benefit from changing this? Let's not change things just
>>> because we can, or because Perl 6 does it.
>>
>> I don't know. I know the disadvantages of making this change, and I
>> ask what is the benefit. If there is a benefit, and it is important
>> for Python, I could implement this feature in re and regex.
>
> You could see what some more languages, e.g. C#, do. If there isn't a
> consensus of some kind, it's best to leave it.

I haven't found this in the documentation, but according to the sources C# uses only 5 ASCII whitespaces (excluding \v). Java uses 6 ASCII whitespaces.
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
On 2017-11-16 21:44, Serhiy Storchaka wrote:
> 16.11.17 19:38, Guido van Rossum wrote:
>> Who would benefit from changing this? Let's not change things just
>> because we can, or because Perl 6 does it.
>
> I don't know. I know the disadvantages of making this change, and I
> ask what is the benefit. If there is a benefit, and it is important
> for Python, I could implement this feature in re and regex.

You could see what some more languages, e.g. C#, do. If there isn't a consensus of some kind, it's best to leave it.
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
16.11.17 19:38, Guido van Rossum wrote:
> Who would benefit from changing this? Let's not change things just
> because we can, or because Perl 6 does it.

I don't know. I know the disadvantages of making this change, and I ask what is the benefit. If there is a benefit, and it is important for Python, I could implement this feature in re and regex.
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
Who would benefit from changing this? Let's not change things just because we can, or because Perl 6 does it.

-- 
--Guido van Rossum (python.org/~guido)
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
On 2017-11-16 10:23, Serhiy Storchaka wrote:
> [...]
> The regex module just ignores characters for which str.isspace()
> returns True. It ignores additional 20 non-ASCII whitespace
> characters, including characters U+001C..001F whose classification as
> whitespaces is questionable, but doesn't ignore LEFT-TO-RIGHT MARK and
> RIGHT-TO-LEFT MARK.
>
>    U+001C [FILE SEPARATOR]
>    U+001D [GROUP SEPARATOR]
>    U+001E [RECORD SEPARATOR]
>    U+001F [UNIT SEPARATOR]
>    U+00A0 NO-BREAK SPACE
>    [...]

str.isspace appears to be Unicode "Whitespace" plus those 4 "questionable" codepoints.
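That observation can be checked mechanically. The sketch below assumes a hand-transcribed copy of the Unicode White_Space property (the `WHITE_SPACE` set is my transcription from PropList.txt, not something the standard library exposes), and compares it with what str.isspace() reports in the running interpreter.

```python
import sys

# Assumption: code points carrying the Unicode White_Space property,
# transcribed by hand from PropList.txt (not available via unicodedata).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))            # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))          # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Every code point for which str.isspace() is true in this interpreter.
isspace_set = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

extra = sorted(isspace_set - WHITE_SPACE)    # isspace() but not White_Space
missing = sorted(WHITE_SPACE - isspace_set)  # White_Space but not isspace()

print("extra:  ", ["U+%04X" % cp for cp in extra])
print("missing:", ["U+%04X" % cp for cp in missing])
```

If the transcription is right, `extra` should contain only the four separator controls U+001C..U+001F and `missing` should be empty; results may differ on interpreters built against a much older Unicode database.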
Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
My instinct is not to worry about it unless someone has actually hit the issue in practice and raised a bug.

Paul
[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode
Currently the re module ignores only 6 ASCII whitespaces in the re.VERBOSE mode:

   U+0009 CHARACTER TABULATION
   U+000A LINE FEED
   U+000B LINE TABULATION
   U+000C FORM FEED
   U+000D CARRIAGE RETURN
   U+0020 SPACE

Perl ignores characters that Unicode calls "Pattern White Space" in the /x mode. It ignores 5 additional non-ASCII characters.

   U+0085 NEXT LINE
   U+200E LEFT-TO-RIGHT MARK
   U+200F RIGHT-TO-LEFT MARK
   U+2028 LINE SEPARATOR
   U+2029 PARAGRAPH SEPARATOR

The regex module just ignores characters for which str.isspace() returns True. It ignores 20 additional whitespace characters, including characters U+001C..001F whose classification as whitespaces is questionable, but doesn't ignore LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.

   U+001C [FILE SEPARATOR]
   U+001D [GROUP SEPARATOR]
   U+001E [RECORD SEPARATOR]
   U+001F [UNIT SEPARATOR]
   U+00A0 NO-BREAK SPACE
   U+1680 OGHAM SPACE MARK
   U+2000 EN QUAD
   U+2001 EM QUAD
   U+2002 EN SPACE
   U+2003 EM SPACE
   U+2004 THREE-PER-EM SPACE
   U+2005 FOUR-PER-EM SPACE
   U+2006 SIX-PER-EM SPACE
   U+2007 FIGURE SPACE
   U+2008 PUNCTUATION SPACE
   U+2009 THIN SPACE
   U+200A HAIR SPACE
   U+202F NARROW NO-BREAK SPACE
   U+205F MEDIUM MATHEMATICAL SPACE
   U+3000 IDEOGRAPHIC SPACE

Is it worth extending the set of ignored whitespaces to "Pattern Whitespaces"? Would it add any benefit? Or add confusion? Should this depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in verbose bytes patterns?

And there is a similar question about the Python parser. If Python uses the Unicode definition for identifiers, shouldn't it accept non-ASCII "Pattern Whitespaces" as whitespaces? There will be technical problems with supporting this, but are there any benefits?

https://perldoc.perl.org/perlre.html
https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
https://unicode.org/L2/L2005/05012r-pattern.html
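The set that re.VERBOSE actually skips can be probed empirically rather than read from the sources. This is a rough sketch: `candidates`, `ignored_in_verbose` and `ignored` are illustrative names, and the probe is only applied to whitespace-like candidates because regex metacharacters such as `*` or `?` would give false positives.

```python
import re
import sys
import unicodedata

# The five non-ASCII Pattern White Space characters listed above.
PATTERN_WHITE_SPACE_EXTRA = "\u0085\u200e\u200f\u2028\u2029"

# Candidates: everything str.isspace() accepts, plus the Pattern White Space extras.
candidates = {chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}
candidates.update(PATTERN_WHITE_SPACE_EXTRA)

def ignored_in_verbose(ch):
    """True if a bare `ch` is skipped in a verbose pattern.

    If `ch` is ignored, "a" + ch + "b" reduces to the pattern "ab";
    otherwise `ch` is a literal and the text "ab" cannot match.
    """
    return re.fullmatch("a" + ch + "b", "ab", re.VERBOSE) is not None

ignored = sorted(ch for ch in candidates if ignored_in_verbose(ch))
for ch in ignored:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<control>")))
```

On an interpreter that behaves as described in this message, the loop prints only the six ASCII characters from the first list; the output is a property of the installed re module, not a guarantee.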