Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-21 Thread Serhiy Storchaka

21.11.17 04:20, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>  > I agree. But if there is a special part of the Unicode standard for
>  > Pattern White Space that includes non-ASCII characters, perhaps there
>  > is a need for them. I asked in case Python developers from very
>  > different cultures need additional whitespaces in regular
>  > expressions, but I don't know. It seems nobody has claimed such a need.
>
> I doubt that Japanese would want it.  I do use \N{IDEOGRAPHIC SPACE} a
> bit as a *target* of regular expressions, but I would never want it as
> non-syntactic in re.VERBOSE.  (Of course, I'm not a native Japanese, but
> I have never heard a Japanese developer wish for use of that character
> in any programming language, outside of literal strings.)
>
>  > In particular, I don't know how helpful it would be to support
>  > right-to-left and left-to-right marks in verbose regular expressions
>
> That's a good question.  Interpretation and display of R2L in
> programming constructs came up briefly in the discussions about BIDI
> on the emacs-devel list.  I'll ask Eli Zaretskii, who implemented it
> for Emacs.


Thank you Stephen. I would prefer not to change anything (because
supporting additional whitespaces will complicate and slow down the
code, can add subtle bugs, and will likely add confusion for
users). But I want to know whether there is a real need to support
additional whitespaces and RTL and LTR marks in regular expressions and
Python syntax.


___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-20 Thread Stephen J. Turnbull
Serhiy Storchaka writes:

 > I agree. But if there is a special part of the Unicode standard for
 > Pattern White Space that includes non-ASCII characters, perhaps there
 > is a need for them. I asked in case Python developers from very
 > different cultures need additional whitespaces in regular
 > expressions, but I don't know. It seems nobody has claimed such a need.

I doubt that Japanese would want it.  I do use \N{IDEOGRAPHIC SPACE} a
bit as a *target* of regular expressions, but I would never want it as
non-syntactic in re.VERBOSE.  (Of course, I'm not a native Japanese, but
I have never heard a Japanese developer wish for use of that character
in any programming language, outside of literal strings.)

 > In particular, I don't know how helpful it would be to support
 > right-to-left and left-to-right marks in verbose regular expressions

That's a good question.  Interpretation and display of R2L in
programming constructs came up briefly in the discussions about BIDI
on the emacs-devel list.  I'll ask Eli Zaretskii, who implemented it
for Emacs.

Steve


-- 
Associate Professor  Division of Policy and Planning Science
http://turnbull/sk.tsukuba.ac.jp/ Faculty of Systems and Information
Email: turnb...@sk.tsukuba.ac.jp   University of Tsukuba
Tel: 029-853-5175 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN


Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-20 Thread M.-A. Lemburg
For consistency, we should probably have "whitespace" for re
equal to whatever "\s" matches, since this is what the engine
itself considers as whitespace (and then also covers the special
case where you use the re.ASCII flag).

Still, the only practical case I could imagine, where extending the
list would indeed make sense, is to have the no-break space (U+00A0)
character qualify as whitespace for re.VERBOSE, since this can sometimes
be introduced via copy&paste from other sources (e.g. web pages showing
a regular expression).

Due to whitespace being what it is, it's hard to tell whether you've
just copied a \u0020 or a \u00a0. The latter can easily render the
regular expression non-working with the current interpretation of
re.VERBOSE.
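
A minimal sketch of the copy-and-paste hazard described above, assuming the
re module as it behaves in current CPython. The decimal-number pattern and
the sample strings are made up for illustration; the no-break space is
written as an escape so that it stays visible.

```python
import re

# Two patterns that look identical on screen. The first contains U+0020 SPACE,
# the second contains U+00A0 NO-BREAK SPACE in the same position.
with_space = re.compile(r"\d+ \. \d+", re.VERBOSE)
with_nbsp = re.compile(r"\d+" + "\u00a0" + r"\. \d+", re.VERBOSE)

# U+0020 is ignored in verbose mode, so this is effectively \d+\.\d+
print(bool(with_space.fullmatch("3.14")))

# U+00A0 is not ignored: it is a literal that must appear in the input.
print(bool(with_nbsp.fullmatch("3.14")))
print(bool(with_nbsp.fullmatch("3\u00a0.14")))

# \s, by contrast, matches U+00A0 in str patterns unless re.ASCII is set.
print(bool(re.fullmatch(r"\s", "\u00a0")))
print(bool(re.fullmatch(r"\s", "\u00a0", re.ASCII)))
```

The last two lines show the gap between what "\s" matches and what
re.VERBOSE ignores, which is the inconsistency the first paragraph refers to.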

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-20 Thread Serhiy Storchaka

20.11.17 10:13, Stephen J. Turnbull wrote:
> Otherwise I'm with Paul, who writes:
>
>  > My instinct is not to worry about it unless someone has actually hit
>  > the issue in practice and raised a bug.
>
> After the tabs vs. spaces fiasco, I lean steeply to the right for code
> -- including embedded languages like regexes.  *We* say what is
> allowed there, and *you* can find an editor that does it our way.


I agree. But if there is a special part of the Unicode standard for
Pattern White Space that includes non-ASCII characters, perhaps there
is a need for them. I asked in case Python developers from very
different cultures need additional whitespaces in regular
expressions, but I don't know. It seems nobody has claimed such a need.

In particular, I don't know how helpful it would be to support
right-to-left and left-to-right marks in verbose regular expressions (or
even in Python code), or whether this would just add confusion. Unicode
identifiers can already be misused to confuse readers because of
homoglyphs. The problem is not that a correct-looking program can be
rejected by the compiler, but that the program can behave differently
from what is expected because it uses different names that look the same.
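
A small sketch of the homoglyph problem, assuming nothing beyond the
standard library. The identifier "account" is a made-up example; one copy
starts with a Latin letter and the other with its Cyrillic look-alike.

```python
import unicodedata

latin = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A

# Assign to two names that render the same but are distinct identifiers.
namespace = {}
exec(f"{latin}ccount = 1\n{cyrillic}ccount = 2", namespace)

names = sorted(name for name in namespace if name.endswith("ccount"))
print(len(names))
for name in names:
    # The first character is the only difference between the two names.
    print(repr(name), unicodedata.name(name[0]), namespace[name])
```

Both assignments are accepted, and the namespace ends up with two separate
variables, which is the "works differently from expected" case rather than
a compiler error.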




Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-20 Thread Stephen J. Turnbull
All,

If we *do* seriously consider adding those characters as ignorable in
re.VERBOSE, that's because somebody is using them in text a lot, and
it slips into their coding.  Given frequent use, we should consider
how a lot more whitespace characters can be conveniently searched
individually and readably, because whitespace characters are the
ultimate confusables.  This may be a no-op, given the \N and \u
notations, but \u is pretty opaque and \N leads to character-per-line
regexes. ;-)
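
For illustration, a sketch of the "character-per-line" style that \N leads
to. The choice of characters in the class is arbitrary, and FANCY_SPACE is a
hypothetical name. The \N{...} escapes here are ordinary string-literal
escapes, so this does not rely on any \N support inside the re module.

```python
import re

# A character class for a few "fancy" spaces, one readable name per line.
FANCY_SPACE = re.compile(
    "["
    "\N{NO-BREAK SPACE}"
    "\N{EN SPACE}"
    "\N{EM SPACE}"
    "\N{IDEOGRAPHIC SPACE}"
    "]"
)

text = "a\u3000b\u00a0c d"
found = [f"U+{ord(m.group()):04X}" for m in FANCY_SPACE.finditer(text)]
print(found)  # ['U+3000', 'U+00A0']
```

The plain U+0020 between "c" and "d" is not reported because it is not in
the class.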

Otherwise I'm with Paul, who writes:

 > My instinct is not to worry about it unless someone has actually hit
 > the issue in practice and raised a bug.

After the tabs vs. spaces fiasco, I lean steeply to the right for code
-- including embedded languages like regexes.  *We* say what is
allowed there, and *you* can find an editor that does it our way.

The point of re.VERBOSE is to allow writing regexes the way we write
Python code, formatting to emphasize structure and improve
readability.  I don't see why we would want to allow more than we
already do, given that any fancy whitespace formatting for "literate
programming" will be done by the code formatting engine of the
document preparation system anyway.



Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-17 Thread Stephan Houben
I put the actual space characters here so you can see them
in a non-proportional font (which I assume most Python programmers use).

https://gist.github.com/stephanh42/7c1c122154fd3f26d864233a40d8

The control characters aren't rendered at all (Vim renders them as
^\ ^] ^^ ^_, respectively). Most of the other spaces are rendered
exactly like the normal space.

The only ones which render differently are
U+1680 | | OGHAM SPACE MARK
U+3000 | | IDEOGRAPHIC SPACE

I understand Ogham has recently (since 6th century CE) seen a decline in
popularity.

However, I think Python should totally adopt U+3000 as a new whitespace
character and start promoting it as the One True Way to indent code,
so as to finally end the age-old spaces vs tabs conflict.

[That was supposed to be a joke.]

Stephan



2017-11-17 16:38 GMT+01:00 Victor Stinner :

> I don't think that we need more than space (U+0020) and Unix newline
> (U+000A) ;-)
>
> Victor


Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-17 Thread Victor Stinner
I don't think that we need more than space (U+0020) and Unix newline
(U+000A) ;-)

Victor



Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-17 Thread Serhiy Storchaka

17.11.17 00:09, MRAB wrote:
> On 2017-11-16 21:44, Serhiy Storchaka wrote:
>> 16.11.17 19:38, Guido van Rossum wrote:
>>> Who would benefit from changing this? Let's not change things just
>>> because we can, or because Perl 6 does it.
>>
>> I don't know. I know the disadvantages of making this change, and I ask
>> what is the benefit. If there is a benefit, and it is important for
>> Python, I could implement this feature in re and regex.
>
> You could see what some more languages, e.g. C#, do. If there isn't a
> consensus of some kind, it's best to leave it.

I haven't found this in the documentation, but according to the sources
C# uses only 5 ASCII whitespaces (excluding \v).

Java uses 6 ASCII whitespaces.



Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread MRAB

On 2017-11-16 21:44, Serhiy Storchaka wrote:
> 16.11.17 19:38, Guido van Rossum wrote:
>> Who would benefit from changing this? Let's not change things just
>> because we can, or because Perl 6 does it.
>
> I don't know. I know the disadvantages of making this change, and I ask
> what is the benefit. If there is a benefit, and it is important for
> Python, I could implement this feature in re and regex.

You could see what some more languages, e.g. C#, do. If there isn't a 
consensus of some kind, it's best to leave it.



Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread Serhiy Storchaka

16.11.17 19:38, Guido van Rossum wrote:
> Who would benefit from changing this? Let's not change things just
> because we can, or because Perl 6 does it.


I don't know. I know the disadvantages of making this change, and I ask 
what is the benefit. If there is a benefit, and it is important for 
Python, I could implement this feature in re and regex.




Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread Guido van Rossum
Who would benefit from changing this? Let's not change things just because
we can, or because Perl 6 does it.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread MRAB

On 2017-11-16 10:23, Serhiy Storchaka wrote:
> Currently the re module ignores only 6 ASCII whitespaces in the
> re.VERBOSE mode:
>
>    U+0009 CHARACTER TABULATION
>    U+000A LINE FEED
>    U+000B LINE TABULATION
>    U+000C FORM FEED
>    U+000D CARRIAGE RETURN
>    U+0020 SPACE
>
> Perl ignores the characters that Unicode calls "Pattern White Space" in
> /x mode. That adds 5 non-ASCII characters:
>
>    U+0085 NEXT LINE
>    U+200E LEFT-TO-RIGHT MARK
>    U+200F RIGHT-TO-LEFT MARK
>    U+2028 LINE SEPARATOR
>    U+2029 PARAGRAPH SEPARATOR
>
> The regex module simply ignores the characters for which str.isspace()
> returns True. That adds 20 more characters, including the control
> characters U+001C..U+001F, whose classification as whitespace is
> questionable, but it does not ignore LEFT-TO-RIGHT MARK and
> RIGHT-TO-LEFT MARK:
>
>    U+001C [FILE SEPARATOR]
>    U+001D [GROUP SEPARATOR]
>    U+001E [RECORD SEPARATOR]
>    U+001F [UNIT SEPARATOR]
>    U+00A0 NO-BREAK SPACE
>    U+1680 OGHAM SPACE MARK
>    U+2000 EN QUAD
>    U+2001 EM QUAD
>    U+2002 EN SPACE
>    U+2003 EM SPACE
>    U+2004 THREE-PER-EM SPACE
>    U+2005 FOUR-PER-EM SPACE
>    U+2006 SIX-PER-EM SPACE
>    U+2007 FIGURE SPACE
>    U+2008 PUNCTUATION SPACE
>    U+2009 THIN SPACE
>    U+200A HAIR SPACE
>    U+202F NARROW NO-BREAK SPACE
>    U+205F MEDIUM MATHEMATICAL SPACE
>    U+3000 IDEOGRAPHIC SPACE

str.isspace appears to be Unicode "Whitespace" plus those 4 
"questionable" codepoints.

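One way to check the observation about str.isspace, sketched under the
assumption that the hand-transcribed WHITE_SPACE set below matches the
White_Space property in the Unicode version bundled with your Python. The
set is copied by hand, not read from the Unicode data files, so treat it as
something to verify.

```python
import sys

# Code points with the Unicode White_Space property (hand-transcribed,
# an assumption to double-check against PropList.txt for your version).
WHITE_SPACE = (
    set(range(0x0009, 0x000E))        # U+0009..U+000D
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))      # U+2000..U+200A
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Every code point that str.isspace() accepts on this interpreter.
isspace_set = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

extra = sorted(isspace_set - WHITE_SPACE)
missing = sorted(WHITE_SPACE - isspace_set)
print("only in str.isspace():", [f"U+{cp:04X}" for cp in extra])
print("only in White_Space:  ", [f"U+{cp:04X}" for cp in missing])
```

On a recent CPython the first list should contain just U+001C..U+001F and
the second should be empty.
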


> Is it worth extending the set of ignored whitespaces to "Pattern
> Whitespaces"? Would it add any benefit? Or add confusion? Should this
> depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in
> verbose bytes patterns?
>
> And there is a similar question about the Python parser. If Python uses
> the Unicode definition of identifiers, shouldn't it accept non-ASCII
> "Pattern Whitespaces" as whitespaces? There will be technical problems
> with supporting this, but are there any benefits?
>
>
> https://perldoc.perl.org/perlre.html
> https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
> https://unicode.org/L2/L2005/05012r-pattern.html




Re: [Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread Paul Moore
My instinct is not to worry about it unless someone has actually hit
the issue in practice and raised a bug.
Paul



[Python-ideas] Ignorable whitespaces in the re.VERBOSE mode

2017-11-16 Thread Serhiy Storchaka
Currently the re module ignores only 6 ASCII whitespaces in the 
re.VERBOSE mode:


 U+0009 CHARACTER TABULATION
 U+000A LINE FEED
 U+000B LINE TABULATION
 U+000C FORM FEED
 U+000D CARRIAGE RETURN
 U+0020 SPACE
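
A quick way to observe this, as a sketch: place each character between two
letters in a verbose pattern and see whether the pattern still matches the
bare letters. The "ab" probe and the names ASCII_WS and ignored are just
illustrative choices.

```python
import re

ASCII_WS = "\t\n\x0b\x0c\r "  # the six characters listed above

# If the character is ignored, "a<ch>b" compiles to the same thing as "ab".
ignored = {
    f"U+{ord(ch):04X}": bool(re.compile("a" + ch + "b", re.VERBOSE).fullmatch("ab"))
    for ch in ASCII_WS
}
print(ignored)

# For contrast: U+001C is not ignored, so it stays a literal in the pattern.
print(bool(re.compile("a\x1cb", re.VERBOSE).fullmatch("ab")))
```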

Perl ignores the characters that Unicode calls "Pattern White Space" in
/x mode. That adds 5 non-ASCII characters:


 U+0085 NEXT LINE
 U+200E LEFT-TO-RIGHT MARK
 U+200F RIGHT-TO-LEFT MARK
 U+2028 LINE SEPARATOR
 U+2029 PARAGRAPH SEPARATOR
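
To see how Python currently treats these five characters, here is a small
sketch that checks both the re module and str.isspace() for each of them.
It only probes existing behaviour; the variable names are illustrative.

```python
import re
import unicodedata

NON_ASCII_PATTERN_WS = "\x85\u200e\u200f\u2028\u2029"

report = {}
for ch in NON_ASCII_PATTERN_WS:
    # True would mean re.VERBOSE drops the character from the pattern.
    ignored_by_re = bool(re.compile("a" + ch + "b", re.VERBOSE).fullmatch("ab"))
    report[ord(ch)] = (ignored_by_re, ch.isspace())
    # U+0085 is a control character without a name, hence the default.
    name = unicodedata.name(ch, "<control>")
    print(f"U+{ord(ch):04X} {name}: re ignores={ignored_by_re}, isspace={ch.isspace()}")
```

None of the five is ignored by re, and str.isspace() accepts three of them
but not the two directional marks.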

The regex module simply ignores the characters for which str.isspace()
returns True. That adds 20 more characters, including the control
characters U+001C..U+001F, whose classification as whitespace is
questionable, but it does not ignore LEFT-TO-RIGHT MARK and
RIGHT-TO-LEFT MARK:


 U+001C [FILE SEPARATOR]
 U+001D [GROUP SEPARATOR]
 U+001E [RECORD SEPARATOR]
 U+001F [UNIT SEPARATOR]
 U+00A0 NO-BREAK SPACE
 U+1680 OGHAM SPACE MARK
 U+2000 EN QUAD
 U+2001 EM QUAD
 U+2002 EN SPACE
 U+2003 EM SPACE
 U+2004 THREE-PER-EM SPACE
 U+2005 FOUR-PER-EM SPACE
 U+2006 SIX-PER-EM SPACE
 U+2007 FIGURE SPACE
 U+2008 PUNCTUATION SPACE
 U+2009 THIN SPACE
 U+200A HAIR SPACE
 U+202F NARROW NO-BREAK SPACE
 U+205F MEDIUM MATHEMATICAL SPACE
 U+3000 IDEOGRAPHIC SPACE
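
The reason the four separators count as whitespace can be seen from the
character properties. The str.isspace() documentation says a character is
whitespace if its general category is Zs or its bidirectional class is WS,
B, or S. The sketch below prints both properties for every character that
str.isspace() accepts.

```python
import sys
import unicodedata

spaces = [chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()]

for ch in spaces:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),       # general category, e.g. Zs or Cc
        unicodedata.bidirectional(ch),  # bidirectional class, e.g. WS, B, S
        unicodedata.name(ch, "<control>"),
    )

# The directional marks have neither property, so they are excluded.
print("\u200e".isspace(), "\u200f".isspace())
```

U+001C..U+001F are control characters (Cc) that qualify only through their
bidirectional class, which is why their status is debatable.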

Is it worth extending the set of ignored whitespaces to "Pattern
Whitespaces"? Would it add any benefit? Or add confusion? Should this
depend on the re.ASCII mode? Should the byte b'\x85' be ignorable in
verbose bytes patterns?
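
For reference, a sketch of the current behaviour for bytes patterns. The
byte 0x85 is a literal today, while the ASCII space is ignored as usual.

```python
import re

with_x85 = re.compile(b"a\x85b", re.VERBOSE)
with_space = re.compile(b"a b", re.VERBOSE)

print(bool(with_space.fullmatch(b"ab")))     # space is ignored
print(bool(with_x85.fullmatch(b"ab")))       # 0x85 is not ignored
print(bool(with_x85.fullmatch(b"a\x85b")))   # it must appear in the input
```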


And there is a similar question about the Python parser. If Python uses
the Unicode definition of identifiers, shouldn't it accept non-ASCII
"Pattern Whitespaces" as whitespaces? There will be technical problems
with supporting this, but are there any benefits?
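
A sketch of what the parser accepts today, using the built-in compile().
The helper name compiles and the one-line sources are hypothetical examples.

```python
def compiles(source: str) -> bool:
    """Return True if the source string is accepted by the Python parser."""
    try:
        compile(source, "<sketch>", "exec")
    except SyntaxError:
        return False
    return True

print(compiles("x = 1"))         # ordinary spaces
print(compiles("x\x0c= 1"))      # U+000C FORM FEED between tokens
print(compiles("x\u00a0= 1"))    # U+00A0 NO-BREAK SPACE
print(compiles("x\u200e = 1"))   # U+200E LEFT-TO-RIGHT MARK
```

The first two are accepted, and the last two are rejected with a
SyntaxError rather than being treated as whitespace.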



https://perldoc.perl.org/perlre.html
https://www.unicode.org/reports/tr31/tr31-4.html#Pattern_Syntax
https://unicode.org/L2/L2005/05012r-pattern.html
