On 28/03/2008 10:11, [EMAIL PROTECTED] wrote:
> Hi,
>
>>> > During the development of the new regexp, one thing confuses me a lot:
>>> > ordered alternation. (e.g. given r.e. 'ab\|abc' and text 'abc', 'ab'
>>> > matched, not 'abc')
>> For what it's worth, I disagree strongly. This behavior is nothing
>> but a bug in the existing implementation - a documented bug, but a bug
>> nonetheless. In this particular case, I definitely think that we
>> should strive for compatibility with other regex engines, rather than
>> backwards compatibility with older vim versions.
>
> I'm not sure what you're arguing about here. You say that given the
> regex 'ab\|abc' and the string 'abc', the regex does currently match
> the 'ab' part of the string, but it should match 'abc' because that's
> what posix says and that's what other regex engines do, right? (If
> not, please ignore the rest of this mail.)
>
> I agree that Posix ("leftmost longest") requires that the match is
> 'abc'. But let's see what other regex engines do:
>
> nico$ python
> Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
> [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
> Type "help", "copyright", "credits" or "license" for more
> information.
> >>> import re
> >>> r = re.compile('ab|abc')
> >>> m = r.match('abc')
> >>> m.group(0)
> 'ab'
> >>> ^D
>
> nico$ irb
> >> m = /(ab|abc)/.match 'abc'
> => #<MatchData:0x5fc47c>
> >> m.to_s
> => "ab"
>
> nico$ perl -e 'if( "abc" =~ /(ab|abc)/) { print "$1\n" }'
> ab
>
> So Python, Ruby and Perl all decide not to be Posix-compliant.
Interesting selection of languages to try ;-) You may have picked the
ones with a common regex code base. Russ Cox's article certainly shows
similar performance/behaviour (See section and graph on Performance at
http://swtch.com/~rsc/regexp/regexp1.html) I wonder what Tcl and awk do?
> This is
> for example documented in the pcre documentation (
> http://www.pcre.org/pcre.txt
> ):
>
> If a leaf node is reached, a matching string has been found,
> and at
> that point the algorithm stops. Thus, if there is more than
> one possi-
> ble match, this algorithm returns the first one that it finds.
> Whether
> this is the shortest, the longest, or some intermediate length
> depends
> on the way the greedy and ungreedy repetition quantifiers are
> specified
> in the pattern.
>
> Jeffrey Friedel discusses this in his book "Mastering Regular
> Expressions" in chapter 4, section "NFA, DFA, and POSIX". He argues
> that NFA engines naturally exhibit the behaviour documented by the
> pcre documentation, so if you're writing an NFA matcher, you shouldn't
> have a problem with this (?). Jeffrey writes: "If efficiency is an
> issue with a Traditional NFA (and with backtracking, believe me, it
> can be), it is doubly so with a POSIX NFA since there can be so much
> more backtracking". A "Traditional NFA" backtracks until _any_ match
> is found and then returns, while a POSIX NFA has to backtrack through
> _all_ possible matches so it can be sure it has seen the longest
> match.
>
> It's been a while since I read the book, but I believe the gist was
> that a DFA always gives you leftmost-longest in a fast way, but it
> offers far less useful features (no backreferences). An NFA gives you
> all kinds of features, but Posix-compliance makes it slow, so in
> practice most regex engines are NFAs without leftmost longest. For
> more details, read the book.
>
> To sum up: Changing from current vim behaviour to leftmost longest
> does _not_ buy us compatibility with other regex engines. Keeping the
> current behaviour shouldn't be a problem with an NFA either.
>
> Nico
> >
>
Mike
--
I'm somewhere between the age of consent and collapse.
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---