[PEG] Breaking the Mizushima(2010) cut operator

Ger Hobbelt Thu, 10 Jul 2014 07:15:46 -0700

L.S.,

I'm looking for feedback = to be corrected about mistakes, any, more or
less obvious, because I think I ran into an issue and after having
doublechecked several times what I consider to be the cause several times,
I can't find any goof(s) in it anymore.


The observation below is written as a fact, but I know sometimes I overlook
the obvious elephant in the room. :'-(  And I want to make sure I got this
one nailed.

Thanks!

...

# Intro #

Take grammar1:

G = A + B / B               (1)
B = A + C / C               (2)
A = a
C = c

Matching Inputs:

c                            (1)
a+c                          (2)
a+a+c                        (3)

Application of M's cut operator rules will *not* add a cut (marked as '#')
in rule G(1) as the ExtFirst(e) set for both the A+B and B choices will
include the 'a' prefix (full common 'first' prefix: 'a+').
It does add an automatic cut operator to rule B(2) as there the union of
the ExtFirst of both choices is empty:

B = A # + C / C               (2)

And this automatic cut insertion is correct as it does not alter the
behaviour of the grammar in any way.

If we would be using, as suggested in the next paper, BITES (Redziejowski,
2011) rather than the FIRST basic building block used by Mizushima et
al(2010), things don't change: The inferred ExtBites(e) would deliver 'a+'
rather than just 'a' but that's about it for this example grammar.


# A Breaking Grammar #

Take grammar2, which is a minimal alteration of grammar1 which accepts the
same input set ('language') as grammar1:

G = B / A + B               (1)
B = A + C / C               (2)
A = a
C = c

Matching Inputs:

c                            (1)
a+c                          (2)
a+a+c                        (3)

Note that rule G has 'flipped' both choices but still parses the same input.
Mazushima's algorithm does not mind, so, once again, only rule B gets a cut:

G = B / A + B               (1)
B = A # + C / C             (2)
A = a
C = c

as the ExtFirst() sets turn up the same results: B-choice1 = {'a'},
B-choice2 = {'c'}. Also upgrading to Redziejowski's BITES does not change
things.


## The Problem ##

However, the cut in B will now kill the memoization cache for the initial
'a' parsed by B in the inputs 'a+c' and 'a+a+c'. When we now inspect the
latter, it will fail the first B choice, backtracking to G, where it will
be matched by the second choice in G, but this time around, without any
help of memoization for the initial 'a' it would otherwise have received in
grammar1.

In other words: for the input 'a+a+c', the second evaluation of the leading
matching 'A = a' at input column 0 (now in rule G(1), in grammar1 its
second evaluation is in the first choice of rule B(2)) does not receive
memoization support anymore.
This means that using the cut operator, using the Mazushima/Redziejowski
ExtFirst/ExtBites approach, for clearing out and reducing your memoization
cache footprint, breaks=removes the memoization support unintended, hence
the grammar does not exhibit guaranteed linear time behaviour any more.


## Side Notes ##

This is the minimal grammar example that I can devise to showcase the
issue.
Having rule G as

G = B / A + B               (1)

looks suspicious to the human eye, when you're aware of 'prefix hiding',
even while the notation as such does *not* imply 'prefix hiding' as we're
looking at non-terminals here, but that suspicion can take a long time to
develop when this construct occurs in a production grammar along the lines
of grammar3 below:

G = B / D                   (1)
B = A # + C / C             (2)
D = A + B                   (3)
A = a
C = c

where rule D is 'obfuscating' the 'A +' common prefix in both choices of
rule G(1). Of course, in a production grammar, the B(2) and D(3)
definitions will very probably be further apart if this issue would not
indirectly be caught by 'probable trouble' in the visual patterns of the
grammar definition itself.
('probable trouble' being a human brain thing, not something a machine
would use. Neither is 'probable trouble' a sure-fire thing; it's just that
when you look at grammars long and often enough, you may get the feeling
'something is odd with this one'. And you *may* be right. Then again, you
may be not.)


# Conclusion #

As the Mazushima cut operator only looks into the future (ExtFirst) but
does not consider history/context of the rule currently under scrutiny
(i.e. the rule G's behaviour when we determine rule B(2) can have a cut
inserted) I expect a solution/fix to this issue would be derived from
propagating the invoking rule's First/Bites info into the
rule-under-scrutiny in order to add the outer context into the condition
set for the automatic insertion of the cut operator. (To be addressed
further in another email.)


# References #

Mizushima et al, 2010: Packrat Parsers can handle practical grammars in
mostly constant space (Mizushima, Maeda, Yamaguchi, 2010)

Redziejowski, 2011: BITES instead of FIRST for parsing expression grammar
(Redziejowski, 2011) -- 'upgrading' Mizushima's algorithm (ExtFirst ->
ExtBites) is already suggested in its ch.1:Introduction.

...

I have searched the Nets to the best of my ability but wasn't able to find
earlier mention of this issue; if I missed it, I love to hear about it and
a reference to it. :-)  (Because if I missed this one, I'm bound to have
missed others in that vicinity too.)


Thank you for bearing with me.


Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   g...@hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

[PEG] Breaking the Mizushima(2010) cut operator

Reply via email to