L.S., I'm looking for feedback = to be corrected about mistakes, any, more or less obvious, because I think I ran into an issue and after having doublechecked several times what I consider to be the cause several times, I can't find any goof(s) in it anymore.
The observation below is written as a fact, but I know sometimes I overlook the obvious elephant in the room. :'-( And I want to make sure I got this one nailed. Thanks! ... # Intro # Take grammar1: G = A + B / B (1) B = A + C / C (2) A = a C = c Matching Inputs: c (1) a+c (2) a+a+c (3) Application of M's cut operator rules will *not* add a cut (marked as '#') in rule G(1) as the ExtFirst(e) set for both the A+B and B choices will include the 'a' prefix (full common 'first' prefix: 'a+'). It does add an automatic cut operator to rule B(2) as there the union of the ExtFirst of both choices is empty: B = A # + C / C (2) And this automatic cut insertion is correct as it does not alter the behaviour of the grammar in any way. If we would be using, as suggested in the next paper, BITES (Redziejowski, 2011) rather than the FIRST basic building block used by Mizushima et al(2010), things don't change: The inferred ExtBites(e) would deliver 'a+' rather than just 'a' but that's about it for this example grammar. # A Breaking Grammar # Take grammar2, which is a minimal alteration of grammar1 which accepts the same input set ('language') as grammar1: G = B / A + B (1) B = A + C / C (2) A = a C = c Matching Inputs: c (1) a+c (2) a+a+c (3) Note that rule G has 'flipped' both choices but still parses the same input. Mazushima's algorithm does not mind, so, once again, only rule B gets a cut: G = B / A + B (1) B = A # + C / C (2) A = a C = c as the ExtFirst() sets turn up the same results: B-choice1 = {'a'}, B-choice2 = {'c'}. Also upgrading to Redziejowski's BITES does not change things. ## The Problem ## However, the cut in B will now kill the memoization cache for the initial 'a' parsed by B in the inputs 'a+c' and 'a+a+c'. When we now inspect the latter, it will fail the first B choice, backtracking to G, where it will be matched by the second choice in G, but this time around, without any help of memoization for the initial 'a' it would otherwise have received in grammar1. In other words: for the input 'a+a+c', the second evaluation of the leading matching 'A = a' at input column 0 (now in rule G(1), in grammar1 its second evaluation is in the first choice of rule B(2)) does not receive memoization support anymore. This means that using the cut operator, using the Mazushima/Redziejowski ExtFirst/ExtBites approach, for clearing out and reducing your memoization cache footprint, breaks=removes the memoization support unintended, hence the grammar does not exhibit guaranteed linear time behaviour any more. ## Side Notes ## This is the minimal grammar example that I can devise to showcase the issue. Having rule G as G = B / A + B (1) looks suspicious to the human eye, when you're aware of 'prefix hiding', even while the notation as such does *not* imply 'prefix hiding' as we're looking at non-terminals here, but that suspicion can take a long time to develop when this construct occurs in a production grammar along the lines of grammar3 below: G = B / D (1) B = A # + C / C (2) D = A + B (3) A = a C = c where rule D is 'obfuscating' the 'A +' common prefix in both choices of rule G(1). Of course, in a production grammar, the B(2) and D(3) definitions will very probably be further apart if this issue would not indirectly be caught by 'probable trouble' in the visual patterns of the grammar definition itself. ('probable trouble' being a human brain thing, not something a machine would use. Neither is 'probable trouble' a sure-fire thing; it's just that when you look at grammars long and often enough, you may get the feeling 'something is odd with this one'. And you *may* be right. Then again, you may be not.) # Conclusion # As the Mazushima cut operator only looks into the future (ExtFirst) but does not consider history/context of the rule currently under scrutiny (i.e. the rule G's behaviour when we determine rule B(2) can have a cut inserted) I expect a solution/fix to this issue would be derived from propagating the invoking rule's First/Bites info into the rule-under-scrutiny in order to add the outer context into the condition set for the automatic insertion of the cut operator. (To be addressed further in another email.) # References # Mizushima et al, 2010: Packrat Parsers can handle practical grammars in mostly constant space (Mizushima, Maeda, Yamaguchi, 2010) Redziejowski, 2011: BITES instead of FIRST for parsing expression grammar (Redziejowski, 2011) -- 'upgrading' Mizushima's algorithm (ExtFirst -> ExtBites) is already suggested in its ch.1:Introduction. ... I have searched the Nets to the best of my ability but wasn't able to find earlier mention of this issue; if I missed it, I love to hear about it and a reference to it. :-) (Because if I missed this one, I'm bound to have missed others in that vicinity too.) Thank you for bearing with me. Met vriendelijke groeten / Best regards, Ger Hobbelt -------------------------------------------------- web: http://www.hobbelt.com/ http://www.hebbut.net/ mail: g...@hobbelt.com mobile: +31-6-11 120 978 --------------------------------------------------
_______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg