On Mon, Mar 31, 2008 at 9:04 AM, Antony Scriven <[EMAIL PROTECTED]> wrote:
>
> On 31/03/2008, Ian Young <[EMAIL PROTECTED]> wrote:
>
> > [...]
> >
> > I did a couple of informal benchmarks when I was preparing for
> > a short talk I gave at my school. The slides are at
> > <http://vim-soc-regexp.googlecode.com/files/ThursExtra.odp>,
> > and near the end are some graphs with the data I collected.
> > Short version: the pathological cases mirror Russ Cox's
> > results pretty closely. The non-pathological cases tend to be
> > a bit slower with the new engine than the old, but we're
> > talking differences of a few nanoseconds in most cases. A few
> > cases had more substantial differences (on the order of
> > hundreds of nanoseconds), which is why I believe some work
> > should go into optimizing before we incorporate the new
> > engine into a release.
>
> Do you know why the new engine is slower than the old one in
> these cases? Are your benchmarks for matching only, or do
> they include the compilation phase? Also, doesn't the old
> engine use something like Boyer-Moore on string literals in
> the regexp to home in quickly on a potential match? Could
> you make use of that code too? --Antony
To some extent, a minor slowdown on many regexps is unavoidable. When the backtracking engine guesses right on one of its first tries, it executes very quickly. In contrast, the NFA implementation tries all paths of execution simultaneously until at least one match is found, so on regexps with a lot of branches the NFA engine will often do more work. The idea is that it's worth trading a few nanoseconds here and there to avoid the really massive delays on cases the backtracking engine handles poorly (such as those illustrated in Russ' article and my slides).

The graphs in the slideshow show only execution time. Compilation with the new engine is slightly slower across the board, but nothing in compilation topped 20ns, so it seemed less important (although our compilation phase is certainly not as streamlined as it could be).

I forgot to link to the second file, which is a more complete set of data for the curious. You can find it at <http://vim-soc-regexp.googlecode.com/files/times_basic.ods>. It includes graphs of both compilation and execution times, plus all the raw data: time dumps for each run, and line numbers that let you match each bar on the graph to the regexp that produced it. You can see in this file which of the regexps I tested were noticeably slower in the new implementation: generally those with lots of branches where the greedy backtracking approach guesses right the first time, such as /a*a*a*a*a*b/ against 'aaaaaaaaab'.

As for some sort of optimization for string literals, it's certainly a possibility, but we haven't looked into it (we've been focusing on getting it running correctly before we bother optimizing). I don't know whether the old code has such a thing; I haven't looked at it carefully enough.

Ian

You received this message from the "vim_dev" maillist. For more information, visit http://www.vim.org/maillist.php
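[The "tries all paths simultaneously" behaviour described above can be sketched in a few lines of Python. This is not the actual Vim or GSoC engine code; the token representation and the function name are invented for the sketch, which supports only patterns built from literal characters and starred literals, enough to express /a*a*a*a*a*b/. Note the simulation visits each input character once, tracking a set of pattern positions, so its time is linear in the text length regardless of how many branches the pattern has.]

```python
def nfa_match(tokens, text):
    """Set-of-states NFA simulation for patterns made of 'c' and 'c*' tokens.

    tokens: list of (char, is_star) pairs, e.g. ('a', True) means 'a*'.
    """
    n = len(tokens)

    def closure(states):
        # Epsilon-closure: a starred token may be skipped without
        # consuming any input, so position i implies position i + 1.
        out = set(states)
        changed = True
        while changed:
            changed = False
            for i in list(out):
                if i < n and tokens[i][1] and i + 1 not in out:
                    out.add(i + 1)
                    changed = True
        return out

    states = closure({0})
    for ch in text:
        nxt = set()
        for i in states:
            if i < n and tokens[i][0] == ch:
                # A starred token can be repeated (stay at i); a plain
                # literal is consumed once (advance to i + 1).
                nxt.add(i if tokens[i][1] else i + 1)
        states = closure(nxt)
        if not states:
            return False      # every path died: no match possible
    return n in states        # matched iff some path reached the end

# /a*a*a*a*a*b/: five starred 'a' tokens plus a literal 'b'
pattern = [('a', True)] * 5 + [('b', False)]
print(nfa_match(pattern, 'aaaaaaaaab'))  # True
print(nfa_match(pattern, 'aaaaaaaaaa'))  # False
```

[On 'aaaaaaaaab' a backtracking engine must try exponentially many ways of splitting the a's among the five stars before the trailing 'b' fails or succeeds; the state-set simulation above just carries {0..5} forward one character at a time.]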

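[On the Boyer-Moore point in the quoted question: the general prefilter idea is to run the full regex engine only where a literal substring that the pattern requires actually occurs, since a fast literal scan can reject most of the input cheaply. A minimal sketch, with invented function names, in which Python's built-in substring search stands in for a Boyer-Moore scan:]

```python
import re

def prefiltered_findall(pattern, required_literal, lines):
    """Run the regex only on lines containing a literal the pattern
    requires; all other lines are skipped by the cheap literal test."""
    rx = re.compile(pattern)
    hits = []
    for line in lines:
        if required_literal in line:   # fast literal scan as prefilter
            hits.extend(rx.findall(line))
    return hits

lines = ['no match here', 'foo123bar', 'also nothing', 'foo999bar']
print(prefiltered_findall(r'foo\d+bar', 'foo', lines))
# ['foo123bar', 'foo999bar']
```

[Extracting a suitable required literal from an arbitrary regexp automatically is the harder part, which is presumably what the old engine's code would help with if it exists.]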