On Thu, Jan 26, 2017 at 4:13 PM, Sven R. Kunze <srku...@mail.de> wrote: > Hi folks, > > I recently refreshed regular expressions theoretical basics *indulging in > reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html > > However, reaching the chart in the lower third of the article, I saw Python > 2.4 measured against a naive Thompson matching implementation. And I was > surprised about how bad it performed compared to an unoptimized version of > an older than dirt algorithm.
> From my perspective, I can say, that regular expressions might worth > optimizing especially for web applications (url matching usually uses > regexes) but also for other applications where I've seen many tight loops > using regexes as well. So, I am probing interest on this topic here. What I (think I) know: - Both re and regex use the same C backend, which is not based on NFA. - The re2 library, which the writer of that article made, allows capture groups (but only up to a limit) and bounded repetitions (up to a limit). - Perl has started to optimize some regex patterns. What I think: - The example is uncharacteristic of what people write, like URL matching. - But enabling naive code to perform well is usually a good thing. The fewer details newbies need to know to write code, the better. - It's possible for Python to optimize a large category of patterns, even if not all. - It's also possible to optimize large parts of patterns with backrefs. E.g. If there's a backref, the group that the backref refers to can still be made into an NFA. - To do the above, you'd need a way to generate all possible matches. - Optimization can be costly. The full NFA construction could be generated only upon request, or maybe the code automatically tries to optimize after 100 uses (like a JIT). This should only be considered if re2's construction really is costly. If people want NFAs, I think the "easiest" way is to use re2. Jakub Wilk mentioned it before, but here it is again. https://github.com/google/re2 re2 features: https://github.com/google/re2/wiki/Syntax Named references aren't supported, but I think that can be worked around with some renumbering. It's just a matter of translating the pattern. It also doesn't like lookarounds, which AFAIK are perfectly doable with NFAs. Looks like lookaheads and lookbehinds are hard to do without a lot of extra space (https://github.com/google/re2/issues/5). Facebook has a Python wrapper for re2. https://github.com/facebook/pyre2/ In a post linked to from this thread, Serhiy mentioned another Python wrapper for re2. This wrapper is designed to be like re, and should probably be the basis of any efforts. It's not been updated for 11 months, though. https://pypi.python.org/pypi/re2/ https://github.com/axiak/pyre2/ _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com