> I have to revisit that paper, because my experience seems to indicate
> that O(n*g) is likely for input that belongs to the language, but
> worse when the input does not.

I'm pretty damn sure it's O(n*g) in general. I mean, the worst case is
that you'll hit every grammar rule at each input position. Once you've
done that, the only way it could get worse is by performing a
super-linear number of memo lookups. But since everything is already
memoized, there is no way to get any "deeper" into the parsing
expression graph when you try to match a parsing expression.
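To make the counting argument concrete, here is a minimal sketch of
the packrat idea (my own illustration, not the API of any particular
PEG library): with g rules and n input positions there are at most n*g
distinct (rule, position) pairs, each computed once, and every further
lookup is O(1).

    # Hypothetical packrat driver, for illustration only.
    def make_packrat(rules):
        """rules: dict of name -> fn(apply, text, pos) -> new_pos or None."""
        memo = {}

        def apply(name, text, pos):
            key = (name, pos)
            if key not in memo:        # each pair is computed at most once...
                memo[key] = rules[name](apply, text, pos)
            return memo[key]           # ...and every later hit is O(1)

        return apply

    # Toy rule A <- 'a' A / 'a': it backtracks, but the memo table
    # still holds at most one entry per (rule, position).
    def rule_a(apply, text, pos):
        if pos < len(text) and text[pos] == "a":
            rest = apply("A", text, pos + 1)
            return rest if rest is not None else pos + 1
        return None

    parse = make_packrat({"A": rule_a})
    assert parse("A", "aaaa", 0) == 4

The table can never hold more than len(rules) * (len(text) + 1)
entries, which is exactly the O(n*g) bound.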
> A trivial translation of LL(k) to PEG would be to place the LL(k)
> lookaheads as PEG lookaheads in front of each choice at O(k*g).

Clever!
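Here is roughly how I read that translation (an illustrative sketch,
nobody's shipping implementation): an LL(k) choice such as

    stmt = if_stmt | while_stmt | expr_stmt

becomes a PEG ordered choice whose alternatives are guarded by their
k-token first sets,

    stmt <- &'if' if_stmt / &'while' while_stmt / expr_stmt

so a non-admissible alternative is rejected after inspecting at most k
tokens. As a combinator:

    # Hypothetical guarded-choice combinator, for illustration only.
    def guarded_choice(*alternatives):
        """alternatives: (first_set, parse_fn) pairs, where first_set
        holds the admissible next-k-token tuples for that branch."""
        k = max(len(t) for first_set, _ in alternatives for t in first_set)

        def parse(tokens, pos):
            ahead = tuple(tokens[pos:pos + k])
            for first_set, parse_fn in alternatives:
                if any(ahead[:len(t)] == t for t in first_set):
                    result = parse_fn(tokens, pos)
                    if result is not None:  # PEG semantics: on failure,
                        return result       # fall through to the next branch
            return None

        return parse

    # Usage sketch, with stand-in parsers over a token list.
    if_stmt   = lambda toks, pos: pos + 1
    expr_stmt = lambda toks, pos: pos + 1
    stmt = guarded_choice(({("if",)}, if_stmt),
                          ({("id",), ("num",)}, expr_stmt))
    assert stmt(["if", "id"], 0) == 1

The guards cost O(k) per alternative, hence the O(k*g) overhead.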
> I think it depends on the language and the input. Remember that I've
> parsed COBOL and NaturalAG with PEG. Incorrect input with many
> ambiguous constructs makes the parsers take much longer than with
> correct input. Sources of incorrect input? Variations in the language
> among vendors, for example.

I have no doubt, but I'm talking about *true exponential* (in the
input size) grammars. If the study of PEG performance proves anything,
it is that even linear bound guarantees can lead to poor practical
performance.
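To make "true exponential (in the input size)" concrete, here is a toy
reconstruction of the classic doubling pattern (my own illustration,
and suitably puzzle-like in that only deeply nested input triggers
it). With the grammar

    E <- T '+' E / T
    T <- '(' E ')' / '1'

and no memoization at all, every failed '+' forces E to reparse T from
scratch, so the work doubles at each nesting level:

    # Unmemoized recursive descent for the grammar above; counts E calls.
    calls = 0

    def parse_E(text, pos):
        global calls
        calls += 1
        p = parse_T(text, pos)                                 # T '+' E
        if p is not None and p < len(text) and text[p] == '+':
            q = parse_E(text, p + 1)
            if q is not None:
                return q
        return parse_T(text, pos)                              # / T (reparse!)

    def parse_T(text, pos):
        if pos < len(text) and text[pos] == '(':               # '(' E ')'
            p = parse_E(text, pos + 1)
            if p is not None and p < len(text) and text[p] == ')':
                return p + 1
            return None
        if pos < len(text) and text[pos] == '1':               # '1'
            return pos + 1
        return None

    for depth in (6, 12, 18):
        calls = 0
        parse_E('(' * depth + '1' + ')' * depth, 0)
        print(depth, calls)   # 127, 8191, 524287: 2^(depth+1) - 1

That is O(2^(n/2)) in the input length n, and memoizing (rule,
position) pairs collapses it back to linear.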
> That is so. I'm exploring PEG for Python because the maintainers of
> its LL(1) parser expressed concerns about the inconvenience of
> maintaining an LL(1) grammar. But bare-metal speed is a requirement,
> so LL(not-too-small-k) may be the best solution (a problem is that
> most LL parser generators for C seem abandoned). I'll keep working on
> the PEG grammar for Python just to be able to measure.

An interesting direction / research question is how we might do
non-naive parser generation with PEG. It's entirely possible to
implement selective memoization and table lookup as custom parsers.
You could even drive the whole process from a corpus of input, where
the parser generator knows a set of optimizations and follows a couple
of heuristics plus test runs over the corpus to find an effective
allocation of optimizations in the generated parser. I feel this could
really help close the gap with LL/LR parsers. It is, however, a lot of
work.
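To sketch the shape of it (everything below is hypothetical, just to
fix ideas): profile a corpus to find the rules where memoization would
actually pay off, then memoize only those.

    # Hypothetical selective-memoization driver, for illustration only;
    # 'rules' follows the same convention as the packrat sketch above.
    def selective_packrat(rules, memoized_names):
        """Like full packrat, but caches only the rules in memoized_names."""
        memo = {}

        def apply(name, text, pos):
            if name not in memoized_names:
                return rules[name](apply, text, pos)  # plain descent
            key = (name, pos)
            if key not in memo:
                memo[key] = rules[name](apply, text, pos)
            return memo[key]

        return apply

    # Toy corpus-driven heuristic: memoize the rules most often
    # re-invoked at a position where they were already tried.
    def pick_rules_to_memoize(rules, corpus, start="start", budget=5):
        counts = {name: 0 for name in rules}
        seen = set()

        def profiling_apply(name, text, pos):
            if (name, pos) in seen:
                counts[name] += 1   # redundant work a memo would remove
            seen.add((name, pos))
            return rules[name](profiling_apply, text, pos)

        for text in corpus:
            seen.clear()
            profiling_apply(start, text, 0)  # assumes a 'start' rule
        ranked = sorted(counts, key=counts.get, reverse=True)
        return set(ranked[:budget])

The same skeleton extends to other optimizations (lookahead tables,
inlining), with the corpus runs deciding where each one is applied.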
Nicolas LAURENT

On Tue, 21 May 2019 at 04:50, Juancarlo Añez <apal...@gmail.com> wrote:

> Hello Nicolas,
>
> Thanks for your thoughtful reply!
>
>> First of all, with full memoization, PEG is O(n) for all PEG
>> languages. This is in the original Ford paper, but basically, at
>> most you'll invoke every parsing expression at every position, so
>> O(n*grammar_size) where grammar_size is a constant that depends on
>> the grammar.
>
> I have to revisit that paper, because my experience seems to indicate
> that O(n*g) is likely for input that belongs to the language, but
> worse when the input does not.
>
>> Beware that full memoization has been reported to be slower than no
>> memoization at all on most languages, with the best performance
>> resulting from memoizing only a few critical rules.
>
> You'd need to qualify "most languages". Again, in my experience, it
> depends on how the grammar is written. I've opted for grammars that
> are easy to write and maintain, and that are easy to work with for
> producing useful ASTs, so they tend to contain many repeated
> expressions.
>
>> Without memoization, an LL(1) language trivially admits an O(n) PEG
>> grammar, which should generally look very much like the LL(1)
>> grammar (I think they might even be provably identical, but I
>> haven't thought this through, so don't quote me on this).
>
> I won't quote you, because the eagerness in PEG makes choice ordering
> a must, so some amount of tweaking is required before an LL(1)
> grammar can be used with PEG. It should be provable, though, that the
> PEG grammar with just choice ordering is still LL(1).
>
>> A more interesting question would be whether it's possible to derive
>> an O(n) PEG grammar from any PEG grammar for an LL(1) language. I'm
>> not sure. And if it is possible (via a left-factoring process), then
>> I really doubt it's possible without risking getting a new grammar
>> that is exponentially larger than the original (even if a small
>> version exists!).
>
> A trivial translation of LL(k) to PEG would be to place the LL(k)
> lookaheads as PEG lookaheads in front of each choice at O(k*g).
>
>> - It's almost impossible to involuntarily write an exponential
>> grammar in PEG. I had trouble coming up with examples of exponential
>> grammars. Necessarily, these will look like puzzles. If you write
>> one of these, you deserve whatever happens to you.
>
> I think it depends on the language and the input. Remember that I've
> parsed COBOL and NaturalAG with PEG. Incorrect input with many
> ambiguous constructs makes the parsers take much longer than with
> correct input. Sources of incorrect input? Variations in the language
> among vendors, for example.
>
>> - There exists a set of problematic grammar patterns that have
>> non-exponential complexity but are very slow nonetheless. These all
>> have to do with parsing things that look like binary operators.
>
> Yes.
>
>> Things are even worse when some kind of left-recursion support is
>> present. Without entering the details (which I can supply if
>> requested), you get into a bad backtracking pattern and you incur
>> things like O(2^(num of operators)) to parse a "primitive
>> expression". So to ground this, in Java this is ~2^10 parser
>> invocations to parse a mere integer. It's easy to fall into these
>> bad patterns, because some of them are the straight transposition of
>> how operators are handled in CFGs. My own solution is to supply
>> explicit parser combinators to enable parsing left- / right-associative
>> expressions.
>
> My experience is that it is best to avoid recursion in PEG grammars
> unless the construct being parsed is recursive.
>
>     addition = term '+' addition / term
>
> behaves differently from:
>
>     addition = term ('+' term)*
>
>> What I'm getting at here is that it seems that if you map out the
>> potential pitfalls that give bad performance and supply tools to
>> replace them, the risk of accidentally incurring catastrophic
>> performance is much reduced. Static (and potentially heuristic)
>> analysis of grammars could also help a lot here.
>
> Yes.
>
>> As for Python: there is absolutely no reason to doubt that it's
>> feasible to do a Python PEG. The real question is the performance of
>> the resulting parser compared to LL(1) and LL(k), not in terms of
>> complexity, but rather in terms of raw metal speed. Table-based
>> lookup does tend to be faster than trying out every alternative
>> (even if the PEG is written in an LL(1) style, and if not then it
>> will almost certainly lose out). But then again, LL(1) is ugly and
>> hard to maintain, so that's a trade-off.
>
> That is so. I'm exploring PEG for Python because the maintainers of
> its LL(1) parser expressed concerns about the inconvenience of
> maintaining an LL(1) grammar. But bare-metal speed is a requirement,
> so LL(not-too-small-k) may be the best solution (a problem is that
> most LL parser generators for C seem abandoned). I'll keep working on
> the PEG grammar for Python just to be able to measure.
>
> --
> Juancarlo *Añez*

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg