Re: regexp: does anyone really need ordered alternation?

Mike Williams Fri, 28 Mar 2008 05:45:27 -0700

On 28/03/2008 10:11, [EMAIL PROTECTED] wrote:
> Hi,
> 
>>>  > During the development of the new regexp, one thing confuses me a lot:
>>>  > ordered alternation. (e.g. given r.e. 'ab\|abc' and text 'abc', 'ab'
>>>  > matched, not 'abc')
>> For what it's worth, I disagree strongly.  This behavior is nothing
>> but a bug in the existing implementation - a documented bug, but a bug
>> nonetheless.  In this particular case, I definitely think that we
>> should strive for compatibility with other regex engines, rather than
>> backwards compatibility with older vim versions.
> 
> I'm not sure what you're arguing about here. You say that given the
> regex 'ab\|abc' and the string 'abc', the regex does currently match
> the 'ab' part of the string, but it should match 'abc' because that's
> what posix says and that's what other regex engines do, right? (If
> not, please ignore the rest of this mail.)
> 
> I agree that Posix ("leftmost longest") requires that the match is
> 'abc'. But let's see what other regex engines do:
> 
>     nico$ python
>     Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
>     [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
>     Type "help", "copyright", "credits" or "license" for more
> information.
>     >>> import re
>     >>> r = re.compile('ab|abc')
>     >>> m = r.match('abc')
>     >>> m.group(0)
>     'ab'
>     >>> ^D
> 
>     nico$ irb
>     >> m = /(ab|abc)/.match 'abc'
>     => #<MatchData:0x5fc47c>
>     >> m.to_s
>     => "ab"
> 
>     nico$ perl -e 'if( "abc" =~ /(ab|abc)/) {  print "$1\n" }'
>     ab
> 
> So Python, Ruby and Perl all decide not to be Posix-compliant.


Interesting selection of languages to try ;-)  You may have picked the 
ones with a common regex code base.  Russ Cox's article certainly shows 
similar performance/behaviour (See section and graph on Performance at 
http://swtch.com/~rsc/regexp/regexp1.html)  I wonder what Tcl and awk do?

 > This is
> for example documented in the pcre documentation ( 
> http://www.pcre.org/pcre.txt
> ):
> 
>        If a leaf node is reached, a matching string has  been  found,
> and  at
>        that  point the algorithm stops. Thus, if there is more than
> one possi-
>        ble match, this algorithm returns the first one that it finds.
> Whether
>        this  is the shortest, the longest, or some intermediate length
> depends
>        on the way the greedy and ungreedy repetition quantifiers are
> specified
>        in the pattern.
> 
> Jeffrey Friedel discusses this in his book "Mastering Regular
> Expressions" in chapter 4, section "NFA, DFA, and POSIX". He argues
> that NFA engines naturally exhibit the behaviour documented by the
> pcre documentation, so if you're writing an NFA matcher, you shouldn't
> have a problem with this (?). Jeffrey writes: "If efficiency is an
> issue with a Traditional NFA (and with backtracking, believe me, it
> can be), it is doubly so with a POSIX NFA since there can be so much
> more backtracking". A "Traditional NFA" backtracks until _any_ match
> is found and then returns, while a POSIX NFA has to backtrack through
> _all_ possible matches so it can be sure it has seen the longest
> match.
> 
> It's been a while since I read the book, but I believe the gist was
> that a DFA always gives you leftmost-longest in a fast way, but it
> offers far less useful features (no backreferences). An NFA gives you
> all kinds of features, but Posix-compliance makes it slow, so in
> practice most regex engines are NFAs without leftmost longest. For
> more details, read the book.
> 
> To sum up: Changing from current vim behaviour to leftmost longest
> does _not_ buy us compatibility with other regex engines. Keeping the
> current behaviour shouldn't be a problem with an NFA either.
> 
> Nico
> > 
> 

Mike
-- 
I'm somewhere between the age of consent and collapse.

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: regexp: does anyone really need ordered alternation?

Raspunde prin e-mail lui