Re: [Python-Dev] re performance

2017-02-02 Thread MRAB

On 2017-02-02 04:37, Franklin? Lee wrote:

On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze  wrote:

Hi folks,

I recently refreshed regular expressions theoretical basics *indulging in
reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html

However, reaching the chart in the lower third of the article, I saw Python
2.4 measured against a naive Thompson matching implementation. And I was
surprised about how bad it performed compared to an unoptimized version of
an older than dirt algorithm.



From my perspective, I can say, that regular expressions might worth
optimizing especially for web applications (url matching usually uses
regexes) but also for other applications where I've seen many tight loops
using regexes as well. So, I am probing interest on this topic here.


What I (think I) know:
- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows
capture groups (but only up to a limit) and bounded repetitions (up to
a limit).
- Perl has started to optimize some regex patterns.


[snip]

re and regex use different C backends. Both are based on NFA.

re2 is based on DFA, with a fallback to re.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-02-01 Thread Franklin? Lee
On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze  wrote:
> Hi folks,
>
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
>
> However, reaching the chart in the lower third of the article, I saw Python
> 2.4 measured against a naive Thompson matching implementation. And I was
> surprised about how bad it performed compared to an unoptimized version of
> an older than dirt algorithm.

> From my perspective, I can say, that regular expressions might worth
> optimizing especially for web applications (url matching usually uses
> regexes) but also for other applications where I've seen many tight loops
> using regexes as well. So, I am probing interest on this topic here.

What I (think I) know:
- Both re and regex use the same C backend, which is not based on NFA.
- The re2 library, which the writer of that article made, allows
capture groups (but only up to a limit) and bounded repetitions (up to
a limit).
- Perl has started to optimize some regex patterns.

What I think:
- The example is uncharacteristic of what people write, like URL matching.
- But enabling naive code to perform well is usually a good thing. The
fewer details newbies need to know to write code, the better.
- It's possible for Python to optimize a large category of patterns,
even if not all.
- It's also possible to optimize large parts of patterns with
backrefs. E.g. If there's a backref, the group that the backref refers
to can still be made into an NFA.
- To do the above, you'd need a way to generate all possible matches.
- Optimization can be costly. The full NFA construction could be
generated only upon request, or maybe the code automatically tries to
optimize after 100 uses (like a JIT). This should only be considered
if re2's construction really is costly.

If people want NFAs, I think the "easiest" way is to use re2. Jakub
Wilk mentioned it before, but here it is again.
https://github.com/google/re2

re2 features:
https://github.com/google/re2/wiki/Syntax

Named references aren't supported, but I think that can be worked
around with some renumbering. It's just a matter of translating the
pattern.

It also doesn't like lookarounds, which AFAIK are perfectly doable
with NFAs. Looks like lookaheads and lookbehinds are hard to do
without a lot of extra space (https://github.com/google/re2/issues/5).

Facebook has a Python wrapper for re2.
https://github.com/facebook/pyre2/

In a post linked to from this thread, Serhiy mentioned another Python
wrapper for re2. This wrapper is designed to be like re, and should
probably be the basis of any efforts. It's not been updated for 11
months, though.
https://pypi.python.org/pypi/re2/
https://github.com/axiak/pyre2/
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-02-01 Thread Wang, Peter Xihong
+1

We'd like to get more details on how to try this "new mode", and do a 
full/comprehensive comparison between the "re" vs "regex".

Peter


 
-Original Message-
From: Victor Stinner [mailto:victor.stin...@gmail.com] 
Sent: Wednesday, February 01, 2017 12:58 PM
To: Lukasz Langa 
Cc: Wang, Peter Xihong ; python-dev@python.org
Subject: Re: [Python-Dev] re performance

2017-02-01 20:42 GMT+01:00 Lukasz Langa :
> However, this benchmark is incomplete in the sense that it only checks 
> the compatibility mode of `regex`, whereas it's the new mode that 
> lends the biggest performance gains. So, providing checks for the 
> other engine would show us the full picture.

Would you mind to write a pull request for performance to add a command line 
option to test "re" (stdlib) or "regex" (PyPI, in the new mode)? Or maybe even 
regex and regex_compatibility_mode :-)

Source:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_regex_v8.py#L1789

Example of benchmark with cmdline options:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_raytrace.py#L385

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-02-01 Thread Victor Stinner
2017-02-01 20:42 GMT+01:00 Lukasz Langa :
> However, this benchmark is incomplete in the sense that it only checks the
> compatibility mode of `regex`, whereas it's the new mode that lends the
> biggest performance gains. So, providing checks for the other engine would
> show us the full picture.

Would you mind to write a pull request for performance to add a
command line option to test "re" (stdlib) or "regex" (PyPI, in the new
mode)? Or maybe even regex and regex_compatibility_mode :-)

Source:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_regex_v8.py#L1789

Example of benchmark with cmdline options:
https://github.com/python/performance/blob/master/performance/benchmarks/bm_raytrace.py#L385

Victor
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-02-01 Thread Serhiy Storchaka

On 31.01.17 21:40, Wang, Peter Xihong wrote:

Regarding to the performance difference between "re" and "regex" and packaging 
related options, we did a performance comparison using Python 3.6.0 to run some micro-benchmarks in 
the Python Benchmark Suite (https://github.com/python/performance):

Results in ms, and the lower the better (running on Ubuntu 15.10)
re  regex (via pip install regex, and a replacement of 
"import re" with "import regex as re")
bm_regex_compile.py 229 298
bm_regex_dna.py 171 267
bm_regex_effbot.py  2.773.04
bm_regex_v8.py  24.814.1
This data shows "re" is better than "regex" in term of performance in 3 out of 
4 above micro-benchmarks.


bm_regex_v8 is the one that is purposed to reflect real-world use of 
regular expressions.


See also different comparison at 
https://mail.python.org/pipermail/speed/2016-March/000311.html. In some 
tests regex surpasses re, in other tests re surpasses regex. re2 is much 
faster than other engines in all tests except the one in which it is 
much slower (and this engine is the least featured).



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-02-01 Thread Lukasz Langa

> On Jan 31, 2017, at 11:40 AM, Wang, Peter Xihong 
>  wrote:
> 
> Regarding to the performance difference between "re" and "regex" and 
> packaging related options, we did a performance comparison using Python 3.6.0 
> to run some micro-benchmarks in the Python Benchmark Suite 
> (https://github.com/python/performance 
> ):
> 
> Results in ms, and the lower the better (running on Ubuntu 15.10)
>   re  regex (via pip install regex, 
> and a replacement of "import re" with "import regex as re")
> bm_regex_compile.py   229 298
> bm_regex_dna.py   171 267
> bm_regex_effbot.py2.773.04
> bm_regex_v8.py24.814.1
> This data shows "re" is better than "regex" in term of performance in 3 out 
> of 4 above micro-benchmarks.


This is very informative, thank you! This clearly shows we should rather pursue 
the PyPI route (with a documentation endorsement and possible bundling for 3.7) 
than full-blown replacement.

However, this benchmark is incomplete in the sense that it only checks the 
compatibility mode of `regex`, whereas it's the new mode that lends the biggest 
performance gains. So, providing checks for the other engine would show us the 
full picture. We'd need to add checks that prove the regular expressions in 
said benchmarks end up with equivalent matches, to be sure we're testing the 
same thing.

- Ł___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-31 Thread Wang, Peter Xihong
Regarding to the performance difference between "re" and "regex" and packaging 
related options, we did a performance comparison using Python 3.6.0 to run some 
micro-benchmarks in the Python Benchmark Suite 
(https://github.com/python/performance):

Results in ms, and the lower the better (running on Ubuntu 15.10)
re  regex (via pip install regex, 
and a replacement of "import re" with "import regex as re")
bm_regex_compile.py 229 298
bm_regex_dna.py 171 267
bm_regex_effbot.py  2.773.04
bm_regex_v8.py  24.814.1
This data shows "re" is better than "regex" in term of performance in 3 out of 
4 above micro-benchmarks.

Anyone searching for "regular expression python" will get a first hit at the 
Python documentation on "re".  Naturally, any new developer could start with 
"re" since day 1 and not bother to look elsewhere for alternatives later on.

We did a query for "import re" against the big cloud computing software 
application, OpenStack (with 3.7 million lines of source codes and majority of 
them written in Python), and got ~1000 hits.

With that being said, IMHO, it would be nice to capture ("borrow") the 
performance benefit from "regex" and merged into "re", without knowing or 
worrying about packaging/installing stuff.

Cheers,

Peter

 

-Original Message-
From: Python-Dev 
[mailto:python-dev-bounces+peter.xihong.wang=intel@python.org] On Behalf Of 
Nick Coghlan
Sent: Tuesday, January 31, 2017 1:54 AM
To: Barry Warsaw 
Cc: python-dev@python.org
Subject: Re: [Python-Dev] re performance

On 30 January 2017 at 15:26, Barry Warsaw  wrote:
> On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
>
>>I think there are 3 main candidates that could fit that bill:
>>
>>- requests
>>- setuptools
>>- regex
>
> Actually, I think pkg_resources would make an excellent candidate.  
> The setuptools crew is working on a branch that would allow for 
> setuptools and pkg_resources to be split, which would be great for 
> other reasons.  Splitting them may mean that pkg_resources could 
> eventually be added to the stdlib, but as an intermediate step, it 
> could also test out this idea.  It probably has a lot less of the baggage 
> that you outline.

Yep, if/when pkg_resources is successfully split out from the rest of 
setuptools, I agree it would also be a good candidate for stdlib bundling - 
version independent runtime access to the database of installed packages is a 
key capability for many use cases, and not currently something we support 
especially well.

It's also far more analogous to the existing pip bundling, since 
setuptools/pkg_resources are also maintained under the PyPA structure.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/peter.xihong.wang%40intel.com
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-31 Thread Nick Coghlan
On 30 January 2017 at 15:26, Barry Warsaw  wrote:
> On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:
>
>>I think there are 3 main candidates that could fit that bill:
>>
>>- requests
>>- setuptools
>>- regex
>
> Actually, I think pkg_resources would make an excellent candidate.  The
> setuptools crew is working on a branch that would allow for setuptools and
> pkg_resources to be split, which would be great for other reasons.  Splitting
> them may mean that pkg_resources could eventually be added to the stdlib, but
> as an intermediate step, it could also test out this idea.  It probably has a
> lot less of the baggage that you outline.

Yep, if/when pkg_resources is successfully split out from the rest of
setuptools, I agree it would also be a good candidate for stdlib
bundling - version independent runtime access to the database of
installed packages is a key capability for many use cases, and not
currently something we support especially well.

It's also far more analogous to the existing pip bundling, since
setuptools/pkg_resources are also maintained under the PyPA structure.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-30 Thread Lukasz Langa

> On Jan 30, 2017, at 6:26 AM, Barry Warsaw  wrote:
> 
> Actually, I think pkg_resources would make an excellent candidate.  The
> setuptools crew is working on a branch that would allow for setuptools and
> pkg_resources to be split, which would be great for other reasons.  Splitting
> them may mean that pkg_resources could eventually be added to the stdlib, but
> as an intermediate step, it could also test out this idea.  It probably has a
> lot less of the baggage that you outline.

FWIW this is what we do at Facebook. We ship pkg_resources with everything but 
not setuptools. So there's definitely value in that. It's a little surprising 
we haven't been doing that yet, given that zipped applications need it.

- Ł
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-30 Thread Barry Warsaw
On Jan 30, 2017, at 12:38 PM, Nick Coghlan wrote:

>I think there are 3 main candidates that could fit that bill:
>
>- requests
>- setuptools
>- regex

Actually, I think pkg_resources would make an excellent candidate.  The
setuptools crew is working on a branch that would allow for setuptools and
pkg_resources to be split, which would be great for other reasons.  Splitting
them may mean that pkg_resources could eventually be added to the stdlib, but
as an intermediate step, it could also test out this idea.  It probably has a
lot less of the baggage that you outline.

Cheers,
-Barry


pgpD4mq5b21qp.pgp
Description: OpenPGP digital signature
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-30 Thread Berker Peksağ
On Mon, Jan 30, 2017 at 2:56 PM, Antoine Pitrou  wrote:
> On Sun, 29 Jan 2017 20:30:38 +
> Steve Holden  wrote:
>> Why not declare re deprecated and remove it in Python 4?
>
> Why deprecate and remove a library that's perfectly usable and
> satisfactory for the vast majority of regular expression usage?  It's
> not like regex presents a radically different API...
>
> Do we really have to face another wave of stdlib bashing on this
> mailing-list?

I agree with Antoine. And note that re has an active maintainer who
has fixed a bunch of bugs and added some new features in the last few
years.

--Berker
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-30 Thread Antoine Pitrou
On Sun, 29 Jan 2017 20:30:38 +
Steve Holden  wrote:
> Why not declare re deprecated and remove it in Python 4?

Why deprecate and remove a library that's perfectly usable and
satisfactory for the vast majority of regular expression usage?  It's
not like regex presents a radically different API...

Do we really have to face another wave of stdlib bashing on this
mailing-list?

Regards

Antoine.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-30 Thread Nick Coghlan
On 28 January 2017 at 18:07, Barry Warsaw  wrote:
> On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
>
>>I still think it could be a good candidate for a first "bundled"
>>module, where we don't migrate it fully into the CPython development
>>process, but *do* officially bless it and provide it by default in the
>>form of a bundled wheel file (similar to what we do with pip).
>
> How would that work exactly.  I.e. is there a PEP?

There's no PEP, and I don't think it would make sense to have one
without a specific concrete example to drive the discussion.

I think there are 3 main candidates that could fit that bill:

- requests
- setuptools
- regex

Using requests to drive it would bring in a lot of incidental
complexity around SSL/TLS certificate management and integrating with
platform trust stores, which I think would overshadow the core policy
discussion of switching from a pure co-development model for the
standard library to one where some components are independently
versioned with recent snapshots being included in CPython feature and
maintenance branches.

For setuptools, one of our main goals in PyPA is to make it just one
build system option amongst many, as well as ensuring that it's easy
to do service deployments that don't have any build system present at
all, so bundling it by default with the stdlib would run counter to
those goals

regex by contrast is a relatively simple standalone module that offers
a more actively developed alternative to the standard library's re
module, but would still require a PEP to address the following
technical and policy questions that are normally handled by copying
code wholesale into the CPython tree under a contributor license
agreement:

- what licensing and other IP assurances would the PSF need to be
comfortable redistributing the bundled component?
- what maintenance sustainability questions would need to be
addressed? (e.g. succession planning for the ability to publish new
releases)
- how would bundled modules be handled in the CPython test suite?
- how would bundled modules be handled during local development,
during installation on supported OSes, and when creating virtual
environments?
- how does platform support in bundled modules relate to the platform
support guidelines given in PEP 11?
- how are additional build requirements for bundled modules (e.g. C++
compilers) handled?
- would bundled modules be permitted to gain features in CPython
maintenance releases, or would we expect any bundled modules to offer
conservative maintenance branches for the life of a given CPython
feature release?
- how would maintenance collaboration and issue escalation work? Would
bundled module maintainers need to follow both their own issue tracker
and the CPython one?

In many respects, PEP 453 and the bundling of pip already offers
precedent for how these are handled (e.g. it's collectively maintained
by PyPA, and that group is as closely associated with the PSF from a
legal and support perspective as python-dev is), so any PEP along
these lines would be about deciding which parts of what we did for pip
were a technical and policy precedent for 3rd party module bundling in
general, and which are unique to pip.

> While I think it could be a good idea to bundle (more?) third party packages
> for a variety of reasons, I want to make sure it's done in a way that's still
> friendly to downstream repackagers, as I'm sure you do to. :)

Indeed :)

> For Debian/Ubuntu, we already have some additional machinations to make
> ensurepip and such work properly in a distro packaging environment.  We'd want
> to be able to do the same for any bundled packages as well.

Yeah, I suspect any development along these lines would make it even
more important to have something like PEP 534 in place so that the
bundled modules could be omitted by default in some cases (such as in
virtual environments), and require people to depend on them explicitly
if they needed them.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Greg Ewing

Armin Rigo wrote:


The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules gives a match object,
which lets you ask for the subgroups' location, for example.

Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching.


These things aren't part of the classical theory of REs that is
usually taught, but it should be possible to do them in linear time.
They can be done for context-free languages using e.g. an LALR parser,
and regular languages are a subset of context-free languages.

--
Greg
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Serhiy Storchaka

On 29.01.17 12:18, Jakub Wilk wrote:

* Armin Rigo , 2017-01-28, 12:44:

The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules gives a match object,
which lets you ask for the subgroups' location, for example. Strange
at it may seem, I am not aware of a way to do that using the
linear-time approach of the theory---if it answers "yes", then you
have no way of knowing *where* the subgroups matched.

Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching.


RE2 has linear execution time, and it supports both capture groups and
greedy/non-greedy matching.

The implementation is explained in this article:
https://swtch.com/~rsc/regexp/regexp3.html


Not all features of Python regular expressions can be implemented with 
linear complexity. It is possible to compile the part of regular 
expressions to the implementation with linear complexity. Patches are 
welcome.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Serhiy Storchaka

On 29.01.17 22:30, Steve Holden wrote:

Why not declare re deprecated and remove it in Python 4? I am pretty
sure everyone wants to keep re in all 3.x releases, but that support
need not extend beyond. So Py4 would have no battery for re, but it
would (should!) be common knowledge that regex was the go-to module for
general-purpose pattern matching. If re has advantages in certain
situations someone might upgrade the 3.x implementation and provide it
as a 3rd-party module, though the effort involved would be significant,
so someone would have to be motivated to keep it.


Regular expressions are used in a number of standard modules and 
scripts. Excluding them from the stdlib would require excluding or 
rewriting the large part of the stdlib.



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Paul Moore
On 29 January 2017 at 20:30, Steve Holden  wrote:
> Why not declare re deprecated and remove it in Python 4? I am pretty sure
> everyone wants to keep re in all 3.x releases, but that support need not
> extend beyond. So Py4 would have no battery for re, but it would (should!)
> be common knowledge that regex was the go-to module for general-purpose
> pattern matching. If re has advantages in certain situations someone might
> upgrade the 3.x implementation and provide it as a 3rd-party module, though
> the effort involved would be significant, so someone would have to be
> motivated to keep it.

Not having regex capability distributed with Python is a pretty big
regression. There are still a lot of users who don't have access to
3rd party modules.

Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Steve Holden
Why not declare re deprecated and remove it in Python 4? I am pretty sure
everyone wants to keep re in all 3.x releases, but that support need not
extend beyond. So Py4 would have no battery for re, but it would (should!)
be common knowledge that regex was the go-to module for general-purpose
pattern matching. If re has advantages in certain situations someone might
upgrade the 3.x implementation and provide it as a 3rd-party module, though
the effort involved would be significant, so someone would have to be
motivated to keep it.

regards
 Steve

Steve Holden

On Sun, Jan 29, 2017 at 4:13 PM, Antoine Pitrou  wrote:

> On Sat, 28 Jan 2017 12:07:05 -0500
> Barry Warsaw  wrote:
> > On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
> >
> > >I still think it could be a good candidate for a first "bundled"
> > >module, where we don't migrate it fully into the CPython development
> > >process, but *do* officially bless it and provide it by default in the
> > >form of a bundled wheel file (similar to what we do with pip).
> >
> > How would that work exactly.  I.e. is there a PEP?
> >
> > While I think it could be a good idea to bundle (more?) third party
> packages
> > for a variety of reasons, I want to make sure it's done in a way that's
> still
> > friendly to downstream repackagers, as I'm sure you do to. :)
>
> That sounds like a lot of effort and maintenance...  Don't we bundle pip
> *exactly* so that we don't have to bundle other third-party packages
> and instead tell users to "just use `pip install `"?
>
> To sum it up, how about we simply add an official suggestion to use
> regex in the docs?
>
> Regards
>
> Antoine.
>
>
> ___
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/
> steve%40holdenweb.com
>
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Antoine Pitrou
On Sat, 28 Jan 2017 12:07:05 -0500
Barry Warsaw  wrote:
> On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:
> 
> >I still think it could be a good candidate for a first "bundled"
> >module, where we don't migrate it fully into the CPython development
> >process, but *do* officially bless it and provide it by default in the
> >form of a bundled wheel file (similar to what we do with pip).  
> 
> How would that work exactly.  I.e. is there a PEP?
> 
> While I think it could be a good idea to bundle (more?) third party packages
> for a variety of reasons, I want to make sure it's done in a way that's still
> friendly to downstream repackagers, as I'm sure you do to. :)

That sounds like a lot of effort and maintenance...  Don't we bundle pip
*exactly* so that we don't have to bundle other third-party packages
and instead tell users to "just use `pip install `"?

To sum it up, how about we simply add an official suggestion to use
regex in the docs?

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-29 Thread Jakub Wilk

* Armin Rigo , 2017-01-28, 12:44:
The theoretical kind of regexp is about giving a "yes/no" answer, whereas the 
concrete "re" or "regexp" modules gives a match object, which lets you ask for 
the subgroups' location, for example. Strange at it may seem, I am not aware 
of a way to do that using the linear-time approach of the theory---if it 
answers "yes", then you have no way of knowing *where* the subgroups matched.


Another issue is that the theoretical engine has no notion of 
greedy/non-greedy matching.


RE2 has linear execution time, and it supports both capture groups and 
greedy/non-greedy matching.


The implementation is explained in this article:
https://swtch.com/~rsc/regexp/regexp3.html

--
Jakub Wilk
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Serhiy Storchaka

On 28.01.17 20:04, Brett Cannon wrote:

Maybe regex should get a mention in the docs like requests does under
urllib.request?


https://bugs.python.org/issue22594

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Brett Cannon
On Fri, 27 Jan 2017 at 13:26 MRAB  wrote:

> On 2017-01-27 17:03, Łukasz Langa wrote:
> >
> >> On Jan 26, 2017, at 5:16 PM, MRAB  >> > wrote:
> >
> >>> So, it seems as if regex already uses a better algorithm although I
> >>> couldn't find any reference to any regex theoretical framework like
> dfa,
> >>> nfa, thompson multiple-state simulation or something.
> >>>
> >> It still uses backtracking, like in the re module.
> >
> > What’s the status of regex inclusion in the stdlib?
> >
> >
> I'm not bothered about it. It's quite a bit bigger than the re module,
> and, anyway, keeping it as a third-party module gives me more freedom to
> make updates, which are available for a range of Python versions.
>

Maybe regex should get a mention in the docs like requests does under
urllib.request?
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Barry Warsaw
On Jan 28, 2017, at 03:43 PM, Nick Coghlan wrote:

>I still think it could be a good candidate for a first "bundled"
>module, where we don't migrate it fully into the CPython development
>process, but *do* officially bless it and provide it by default in the
>form of a bundled wheel file (similar to what we do with pip).

How would that work exactly.  I.e. is there a PEP?

While I think it could be a good idea to bundle (more?) third party packages
for a variety of reasons, I want to make sure it's done in a way that's still
friendly to downstream repackagers, as I'm sure you do to. :)

For Debian/Ubuntu, we already have some additional machinations to make
ensurepip and such work properly in a distro packaging environment.  We'd want
to be able to do the same for any bundled packages as well.

Cheers,
-Barry
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Terry Reedy

On 1/28/2017 9:43 AM, Nick Coghlan wrote:

On 27 January 2017 at 22:24, MRAB  wrote:

On 2017-01-27 17:03, Łukasz Langa wrote:



What’s the status of regex inclusion in the stdlib?


I'm not bothered about it. It's quite a bit bigger than the re module, and,
anyway, keeping it as a third-party module gives me more freedom to make
updates, which are available for a range of Python versions.


I still think it could be a good candidate for a first "bundled"
module, where we don't migrate it fully into the CPython development
process, but *do* officially bless it and provide it by default in the
form of a bundled wheel file (similar to what we do with pip).


So like pip, we would bundle the current version, install it in 
site-packages, and users who use it could upgrade as desired, even after 
x.y bugfix ends.


What would be nice would be to have a short entry in the docs for 
bundled modules that links to the real doc, wherever it is.


--
Terry Jan Reedy


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Nick Coghlan
On 27 January 2017 at 22:24, MRAB  wrote:
> On 2017-01-27 17:03, Łukasz Langa wrote:
>>
>>
>>> On Jan 26, 2017, at 5:16 PM, MRAB >> > wrote:
>>
>>
 So, it seems as if regex already uses a better algorithm although I
 couldn't find any reference to any regex theoretical framework like dfa,
 nfa, thompson multiple-state simulation or something.

>>> It still uses backtracking, like in the re module.
>>
>>
>> What’s the status of regex inclusion in the stdlib?
>>
>>
> I'm not bothered about it. It's quite a bit bigger than the re module, and,
> anyway, keeping it as a third-party module gives me more freedom to make
> updates, which are available for a range of Python versions.

I still think it could be a good candidate for a first "bundled"
module, where we don't migrate it fully into the CPython development
process, but *do* officially bless it and provide it by default in the
form of a bundled wheel file (similar to what we do with pip).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-28 Thread Armin Rigo
Hi Sven,

On 26 January 2017 at 22:13, Sven R. Kunze  wrote:
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html

Theoretical regular expressions and what Python/Perl/etc. call regular
expressions are a bit different.  You can read more about it at
https://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times
.

Discussions about why they are different often focus on
backreferences, which is a rare feature.  Let me add two other points.

The theoretical kind of regexp is about giving a "yes/no" answer,
whereas the concrete "re" or "regexp" modules gives a match object,
which lets you ask for the subgroups' location, for example.  Strange
at it may seem, I am not aware of a way to do that using the
linear-time approach of the theory---if it answers "yes", then you
have no way of knowing *where* the subgroups matched.

Another issue is that the theoretical engine has no notion of
greedy/non-greedy matching.  Basically, you walk over the source
character and it answers "yes" or "no" after each of them.  This is
different from a typical backtracking implementation.  In Python:

>>> re.match(r'a*', 'aaa')
>>> re.match(r'a*?', 'aaa')

This matches either three or zero characters in Python.  The two
versions are however indistinguishable for the theoretical engine.


A bientôt,

Armin.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-27 Thread MRAB

On 2017-01-27 22:15, Michael Felt wrote:

On 27/01/2017 22:24, MRAB wrote:
> I'm not bothered about it. It's quite a bit bigger than the re module,
> and, anyway, keeping it as a third-party module gives me more freedom
> to make updates, which are available for a range of Python versions.

I tried packaging it (pip build) and ran into a minor syntax error (on
AIX using xlc). How do I go about reporting what I found?


Report it on the project's issue tracker.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-27 Thread Michael Felt

On 27/01/2017 22:24, MRAB wrote:
I'm not bothered about it. It's quite a bit bigger than the re module, 
and, anyway, keeping it as a third-party module gives me more freedom 
to make updates, which are available for a range of Python versions. 


I tried packaging it (pip build) and ran into a minor syntax error (on 
AIX using xlc). How do I go about reporting what I found?


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-27 Thread MRAB

On 2017-01-27 17:03, Łukasz Langa wrote:


On Jan 26, 2017, at 5:16 PM, MRAB > wrote:



So, it seems as if regex already uses a better algorithm although I
couldn't find any reference to any regex theoretical framework like dfa,
nfa, thompson multiple-state simulation or something.


It still uses backtracking, like in the re module.


What’s the status of regex inclusion in the stdlib?


I'm not bothered about it. It's quite a bit bigger than the re module, 
and, anyway, keeping it as a third-party module gives me more freedom to 
make updates, which are available for a range of Python versions.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-27 Thread Mark Lawrence via Python-Dev

On 27/01/2017 17:03, Łukasz Langa wrote:



On Jan 26, 2017, at 5:16 PM, MRAB mailto:pyt...@mrabarnett.plus.com>> wrote:



So, it seems as if regex already uses a better algorithm although I
couldn't find any reference to any regex theoretical framework like dfa,
nfa, thompson multiple-state simulation or something.


It still uses backtracking, like in the re module.


What’s the status of regex inclusion in the stdlib?

- Ł



I've asked about this in the past, but have now come to the conclusion 
that it is way better to leave regex, and many other third party 
modules, on pypi rather than have them tied into the Python release 
cycle.  If YMMV so be it.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-27 Thread Łukasz Langa

> On Jan 26, 2017, at 5:16 PM, MRAB  wrote:

>> So, it seems as if regex already uses a better algorithm although I
>> couldn't find any reference to any regex theoretical framework like dfa,
>> nfa, thompson multiple-state simulation or something.
>> 
> It still uses backtracking, like in the re module.

What’s the status of regex inclusion in the stdlib?

- Ł

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-26 Thread MRAB

On 2017-01-26 21:46, Sven R. Kunze wrote:

On 26.01.2017 22:33, Vlastimil Brom wrote:

Hi,
I can't speak about the details of mrab's implementation, but using
regex, I get the resulting match instantly: [...]


Nice! I focused on the stdlib re module as this is mainly used by other
frameworks (like Django).


(I personally prefer to use regex for other advantages, than this
artificial case, but it certainly doesn't hurt to have better
performance here too:)


Me, too.

So, it seems as if regex already uses a better algorithm although I
couldn't find any reference to any regex theoretical framework like dfa,
nfa, thompson multiple-state simulation or something.


It still uses backtracking, like in the re module.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-26 Thread MRAB

On 2017-01-26 21:13, Sven R. Kunze wrote:

Hi folks,

I recently refreshed regular expressions theoretical basics *indulging
in reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html

However, reaching the chart in the lower third of the article, I saw
Python 2.4 measured against a naive Thompson matching implementation.
And I was surprised about how bad it performed compared to an
unoptimized version of an older than dirt algorithm.

So, I decided to give it a try with Python2.7 and Python3.5. Eh, voilà,
100% cpu and no results so far. Nothing has changed at all since 2007.

  >>> import re
  >>> re.match(r'a?'*30 + r'a'*30, 'a'*30)
 (still waiting)

Quoting from the article: " Some might argue that this test is unfair to
the backtracking implementations, since it focuses on an uncommon corner
case. This argument misses the point: given a choice between an
implementation with a predictable, consistent, fast running time on all
inputs or one that usually runs quickly but can take years of CPU time
(or more) on some inputs, the decision should be easy."

Victor, as the head of Python performance department, and Matthew, as
the maintainer of the new regex module, what is your stance on this
particular issue?

  From my perspective, I can say, that regular expressions might worth
optimizing especially for web applications (url matching usually uses
regexes) but also for other applications where I've seen many tight
loops using regexes as well. So, I am probing interest on this topic here.
It's possible to add some checks to reduce the problem, as I've done in 
the regex module, but I'm not sure how easy it would be to retrofit it 
into the existing re code. (It wasn't pleasant in the regex module, so...).


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-26 Thread Sven R. Kunze

On 26.01.2017 22:33, Vlastimil Brom wrote:

Hi,
I can't speak about the details of mrab's implementation, but using
regex, I get the resulting match instantly: [...]


Nice! I focused on the stdlib re module as this is mainly used by other 
frameworks (like Django).



(I personally prefer to use regex for other advantages, than this
artificial case, but it certainly doesn't hurt to have better
performance here too:)


Me, too.

So, it seems as if regex already uses a better algorithm although I 
couldn't find any reference to any regex theoretical framework like dfa, 
nfa, thompson multiple-state simulation or something.



Cheers,
Sven
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] re performance

2017-01-26 Thread Vlastimil Brom
2017-01-26 22:13 GMT+01:00 Sven R. Kunze :
> Hi folks,
>
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
>
> However, reaching the chart in the lower third of the article, I saw Python
> 2.4 measured against a naive Thompson matching implementation. And I was
> surprised about how bad it performed compared to an unoptimized version of
> an older than dirt algorithm.
>
> So, I decided to give it a try with Python2.7 and Python3.5. Eh, voilà, 100%
> cpu and no results so far. Nothing has changed at all since 2007.
>
 import re
 re.match(r'a?'*30 + r'a'*30, 'a'*30)
>  (still waiting)
>
> Quoting from the article: " Some might argue that this test is unfair to the
> backtracking implementations, since it focuses on an uncommon corner case.
> This argument misses the point: given a choice between an implementation
> with a predictable, consistent, fast running time on all inputs or one that
> usually runs quickly but can take years of CPU time (or more) on some
> inputs, the decision should be easy."
>
> Victor, as the head of Python performance department, and Matthew, as the
> maintainer of the new regex module, what is your stance on this particular
> issue?
>
> From my perspective, I can say, that regular expressions might worth
> optimizing especially for web applications (url matching usually uses
> regexes) but also for other applications where I've seen many tight loops
> using regexes as well. So, I am probing interest on this topic here.
>
> Best,
> Sven
>

Hi,
I can't speak about the details of mrab's implementation, but using
regex, I get the resulting match instantly:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900
32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import regex
>>> regex.match(r'a?'*30 + r'a'*30, 'a'*30)

>>>

(I personally prefer to use regex for other advantages, than this
artificial case, but it certainly doesn't hurt to have better
performance here too:)

regards,
vbr
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com