On 17.03.2012 10:59, Philippe Sigaud wrote:
On Wed, Mar 14, 2012 at 10:48, Dmitry Olshansky<dmitry.o...@gmail.com>  wrote:

That's one of the caveats on PEG. That and greedy operators.

'a*a' never succeeds because 'a*' consumes all the available a's.


Hey, wait, I thought there has to be backtrack here, i.e. when second 'a'
finally fails?

PEG only do local backtracking in 'or' choices, not in sequences. I
think that's because the original author was dealing with packrat
parsing and its O(input-size) guarantee.


Ok, let's agree on fact that semantically
a* is :
As <- a As / a

and a*? is this:
As <- a / a As

Now that's local ;)


I remember trying compile-time backtracking in another similar library
in D 1-2 years ago and getting lots of pb. I might add that in Pegged,
but I don't know the consequences. How do you do that in std.regex?


There are two fundamental ways to get around the problem, the easiest
to implement is to use a stack to record a position in input + number of alternative/production (I call it a PC - program counter) every time there is a "fork" like that. Then when the current path ultimately fails pop stack, reset input and go over again until you either match or empty the stack. That's how to avoid deeeep recursion that happens here.

The problematic thing now is combinatorial explosion of
a* a* a* .... //simplified, but you get the idea

The trick to avoid this sort of crap is to
1) agree that whatever matches first has higher priority (left-most match) that's a simple disambiguation rule. 2) record what positions + pc you place on stack e.g. use a set, or in fact, a bitmap to push every unique combination (pc,index) only once.

Voila! Now you have a linear worst-case algorithm with (M*N) complexity where M is number of all possible PCs, and N is number of all possible indexes in input.

Now if we put aside all crap talk about mathematical purity and monads, and you name it, a packrat is essentially this, a dynamic programming trick applied to recursive top-down parsing. The trick could be extended even more to avoid extra checks on input characters, but the essence is this "memoization".


(nice article btw, I learnt some about regexes)

Thanks, I hope it makes them more popular.
Might as well keep me busy fixing bugs :)

--
Dmitry Olshansky

Reply via email to