Piet van Oostrum wrote:
MRAB <pyt...@mrabarnett.plus.com> (M) wrote:

M> Hi all,
M> I've been working on a new implementation of the re module. The details
M> are at http://bugs.python.org/issue2636, specifically from
M> http://bugs.python.org/issue2636#msg90954. I've included a .pyd file for
M> Python 2.6 on Windows if you want to try it out.

M> I'm interested in how fast it is generally, compared with the current re
M> module, but especially when faced with those 'pathological' regular
M> expressions which seem to take a long time to finish, for example:

M>     re.search(r"^(.+|D)*A$", "x" * 25 + "B")

M> which on my PC (1.8GHz) takes 18.98secs with the re module but <0.01secs
M> with this new implementation.

Is this version also going to use the Thompson approach?

That approach is what inspired the method in the new implementation, but
its speed comes from asking _whether_ there's a match, not _where_
there's a match, so it can ignore the route taken to get the match.
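To make that concrete, here's a toy sketch of the Thompson-style
simulation (my own illustration, not code from either engine). It
hand-encodes a simplified NFA equivalent to the quoted pattern
^(.+|D)*A$ and only tracks *which* states are live after each
character, never how it got there, so it answers the "whether"
question in a single left-to-right pass:

    def thompson_matches(text):
        # Simplified NFA for ^(.+|D)*A$: state 0 = before/inside the
        # repeated group, state 1 = the final "A" has just been read.
        current = {0}
        for ch in text:
            nxt = set()
            if 0 in current:
                nxt.add(0)        # (.+|D)* can absorb any character
                if ch == "A":
                    nxt.add(1)    # ...or this is the closing "A"
            current = nxt         # only the live states, no routes kept
            if not current:
                return False
        return 1 in current       # anchored at $, so we must end on "A"

    print(thompson_matches("x" * 25 + "B"))   # False, and instantly
    print(thompson_matches("x" * 25 + "A"))   # True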

"grep", for example, selects those lines which match a pattern, whereas
the typical Python (and Perl) usage is to extract part of a string or
which part of the string matches.
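For example, what a typical stdlib caller wants back is the span and
the captured groups, which is exactly the part a whether-only matcher
doesn't provide (a trivial re sketch, nothing to do with either
engine's internals):

    import re

    m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released on 2009-07-26, see notes")
    if m:
        print(m.span())     # (12, 22): *where* the match sits in the string
        print(m.groups())   # ('2009', '07', '26'): the extracted pieces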

Trying to support lazy and possessive quantifiers in Thompson's approach
is tricky (at least, I haven't been able to figure it out), and
back-references are right out! :-)
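A back-reference is the clearest case: the pattern below only matches
if the same word appears twice, so the engine has to remember what
group 1 captured, and a plain state-set simulation has nowhere to keep
that (a minimal stdlib example, not tied to either implementation):

    import re

    doubled = re.compile(r"\b(\w+) \1\b")             # a word, then the same word again
    print(bool(doubled.search("it was was fine")))    # True: "was was"
    print(bool(doubled.search("it was quite fine")))  # False: nothing repeats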

There's always the chance that I could come up with some sort of hybrid
scheme, applying different approaches depending on certain conditions.
It already switches from depth-first to breadth-first when it's taken
too many iterations without advancing, and it merges similar states,
which is why it's so much better with 'pathological' patterns; I just
have to get the depth-first phase to be as fast as the current 're'
module.
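A back-of-the-envelope illustration of why the merging matters (my own
numbers, not measurements of either engine): against "x" * n + "B", a
depth-first engine has to try every way (.+|D)* can carve the run of
"x"s into non-empty chunks before it can report failure, and that count
doubles with each extra character, whereas a breadth-first pass that
merges routes landing in the same state only tracks a couple of live
states per position:

    # Each of the n-1 gaps between the "x"s can either split or not,
    # so there are 2**(n-1) candidate segmentations to backtrack over.
    for n in (10, 15, 20, 25):
        print(n, 2 ** (n - 1))
    # 25 -> 16777216, hence the ~19 second search with the current re,
    # versus a handful of merged states per character breadth-first.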