Re: regexp: does anyone really need ordered alternation?

nicolasweber Fri, 28 Mar 2008 03:12:50 -0700

Hi,

> >  > During the development of the new regexp, one thing confuses me a lot:
> >  > ordered alternation. (e.g. given r.e. 'ab\|abc' and text 'abc', 'ab'
> >  > matched, not 'abc')
>
> For what it's worth, I disagree strongly.  This behavior is nothing
> but a bug in the existing implementation - a documented bug, but a bug
> nonetheless.  In this particular case, I definitely think that we
> should strive for compatibility with other regex engines, rather than
> backwards compatibility with older vim versions.

I'm not sure what you're arguing about here. You say that given the
regex 'ab\|abc' and the string 'abc', the regex does currently match
the 'ab' part of the string, but it should match 'abc' because that's
what posix says and that's what other regex engines do, right? (If
not, please ignore the rest of this mail.)

I agree that Posix ("leftmost longest") requires that the match is
'abc'. But let's see what other regex engines do:

    nico$ python
    Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more
information.
    >>> import re
    >>> r = re.compile('ab|abc')
    >>> m = r.match('abc')
    >>> m.group(0)
    'ab'
    >>> ^D

    nico$ irb
    >> m = /(ab|abc)/.match 'abc'
    => #<MatchData:0x5fc47c>
    >> m.to_s
    => "ab"

    nico$ perl -e 'if( "abc" =~ /(ab|abc)/) {  print "$1\n" }'
    ab

So Python, Ruby and Perl all decide not to be Posix-compliant. This is
for example documented in the pcre documentation ( http://www.pcre.org/pcre.txt
):

       If a leaf node is reached, a matching string has  been  found,
and  at
       that  point the algorithm stops. Thus, if there is more than
one possi-
       ble match, this algorithm returns the first one that it finds.
Whether
       this  is the shortest, the longest, or some intermediate length
depends
       on the way the greedy and ungreedy repetition quantifiers are
specified
       in the pattern.

Jeffrey Friedel discusses this in his book "Mastering Regular
Expressions" in chapter 4, section "NFA, DFA, and POSIX". He argues
that NFA engines naturally exhibit the behaviour documented by the
pcre documentation, so if you're writing an NFA matcher, you shouldn't
have a problem with this (?). Jeffrey writes: "If efficiency is an
issue with a Traditional NFA (and with backtracking, believe me, it
can be), it is doubly so with a POSIX NFA since there can be so much
more backtracking". A "Traditional NFA" backtracks until _any_ match
is found and then returns, while a POSIX NFA has to backtrack through
_all_ possible matches so it can be sure it has seen the longest
match.

It's been a while since I read the book, but I believe the gist was
that a DFA always gives you leftmost-longest in a fast way, but it
offers far less useful features (no backreferences). An NFA gives you
all kinds of features, but Posix-compliance makes it slow, so in
practice most regex engines are NFAs without leftmost longest. For
more details, read the book.

To sum up: Changing from current vim behaviour to leftmost longest
does _not_ buy us compatibility with other regex engines. Keeping the
current behaviour shouldn't be a problem with an NFA either.

Nico
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: regexp: does anyone really need ordered alternation?

Raspunde prin e-mail lui