On 17.03.2012 10:59, Philippe Sigaud wrote:
On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky<dmitry.o...@gmail.com> wrote:
That's one of the caveats on PEG. That and greedy operators.
'a*a' never succeeds because 'a*' consumes all the available a's.
Hey, wait, I thought there has to be backtrack here, i.e. when second 'a'
finally fails?
PEG only do local backtracking in 'or' choices, not in sequences. I
think that's because the original author was dealing with packrat
parsing and its O(input-size) guarantee.
Ok, let's agree on fact that semantically
a* is :
As <- a As / a
and a*? is this:
As <- a / a As
Now that's local ;)
I remember trying compile-time backtracking in another similar library
in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
but I don't know the consequences. How do you do that in std.regex?
There are two fundamental ways to get around the problem, the easiest
to implement is to use a stack to record a position in input + number of
alternative/production (I call it a PC - program counter) every time
there is a "fork" like that. Then when the current path ultimately fails
pop stack, reset input and go over again until you either match or empty
the stack. That's how to avoid deeeep recursion that happens here.
The problematic thing now is combinatorial explosion of
a* a* a* .... //simplified, but you get the idea
The trick to avoid this sort of crap is to
1) agree that whatever matches first has higher priority (left-most
match) that's a simple disambiguation rule.
2) record what positions + pc you place on stack e.g. use a set, or in
fact, a bitmap to push every unique combination (pc,index) only once.
Voila! Now you have a linear worst-case algorithm with (M*N) complexity
where M is number of all possible PCs, and N is number of all possible
indexes in input.
Now if we put aside all crap talk about mathematical purity and monads,
and you name it, a packrat is essentially this, a dynamic programming
trick applied to recursive top-down parsing. The trick could be extended
even more to avoid extra checks on input characters, but the essence is
this "memoization".
(nice article btw, I learnt some about regexes)
Thanks, I hope it makes them more popular.
Might as well keep me busy fixing bugs :)
--
Dmitry Olshansky