Fascinating work, I’m always interested to see a new PEG parsing algorithm, and handling left-recursion as cleanly as you did is impressive.
I’m not sure if it would make a difference, but it might be interesting to use just the seed parent relationship rather than the subclause relationship in your pseudo-topological sort of clauses – any clause other than the first one will be in a later column, and thus available in the bottom-up dynamic programming order you defined. (for what it’s worth, I found the textual description of the modifications to topological sort to handle cycles somewhat confusing (though clarified by the code listing), but I don’t have any concrete suggestions for improvement). I’d also be interested to see performance results for a grammar of intermediate complexity between your expression grammar and Java. I have a JSON grammar in the repository for my own experimental parsing algorithm[1], along with some test input (hat tip to Kota Mizushima for providing both). One of the tricky things about new PEG parsing algorithms is that recursive descent is actually a linear-time algorithm for many practical inputs [2, s.5.1], so it’s hard to beat it. I’d also be curious to see a comparison of memory usage between your algorithm and ANTLR4 and Parboiled – recursive descent is often faster in practice than straight packrat, and a lot of the reason is packrat’s much-higher memory use. [1] https://github.com/bruceiv/egg/tree/deriv/perf_test [2] http://eptcs.web.cse.unsw.edu.au/paper.cgi?AFL2017.18.pdf One concern I have is that your claimed O(n) runtime depends on overhead of spurious matches being constant per character, and doesn’t account for the left-recursion features you’ve added to the grammar. (I didn’t do a re-read of your whole paper after noticing this, but it’s possible you avoided claiming it’s a worst-case bound, though your language does suggest that interpretation.). Given that recursion in the grammar may result in the same clause matching multiple times at the same position (as long as the new match is longer), I think the worst case-bound on h is O(n), not O(1), which would make pika parsing’s worst-case time bound O(n^2). That said, trimming a linear factor off the O(n^3) bound for the general-purpose CFG parsing algorithms is still very impressive, and (given that grammar recursion is generally bounded by a small constant in practice), I expect pika parsing to be linear-time in practice. [I found your discussion of grammar looping conflated the issue of left-recursion and grammar looping in a somewhat confusing fashion, when whether or not the recursion is at the same position in the input is an important distinction. My comments on your performance bounds may only apply to left-recursive grammars.] Also, there’s a small typo on p.18: "parsing speed for the expression grammar ... is approximately 8.9 times higher than the parsing speed for the [Java] grammar" Again, neat work, and a fascinating paper. Best, Aaron Moss From: PEG [mailto:peg-boun...@lists.csail.mit.edu] On Behalf Of Luke Hutchison Sent: Tuesday, July 7, 2020 23:22 To: peg@lists.csail.mit.edu Subject: [PEG] Paper on bottom-up, right-to-left PEG parsing EXTERNAL EMAIL: This email originated from outside the University of Portland. Do not click links or open attachments unless you recognize the sender and know the content is safe. ________________________________ I published a preprint of the following paper on arXiv. After writing the paper, I realized that Bryan Ford briefly described the idea behind this algorithm in his 2002 Master's thesis; however, as far as I can tell, he did not pursue the idea to implementation, preferring the higher efficiency of packrat parsing. I submitted this to ACM TOPLAS for review. Feedback welcome. Luke Hutchison --- Pika parsing: reformulating packrat parsing as a dynamic programming algorithm solves the left recursion and error recovery problems Abstract: A recursive descent parser is built from a set of mutually-recursive functions, where each function directly implements one of the nonterminals of a grammar. A packrat parser uses memoization to reduce the time complexity for recursive descent parsing from exponential to linear in the length of the input. Recursive descent parsers are extremely simple to write, but suffer from two significant problems: (i) left-recursive grammars cause the parser to get stuck in infinite recursion, and (ii) it can be difficult or impossible to optimally recover the parse state and continue parsing after a syntax error. Both problems are solved by the pika parser, a novel reformulation of packrat parsing as a dynamic programming algorithm, which requires parsing the input in reverse: bottom-up and right to left, rather than top-down and left to right. This reversed parsing order enables pika parsers to handle grammars that use either direct or indirect left recursion to achieve left associativity, simplifying grammar writing, and also enables optimal recovery from syntax errors, which is a crucial property for IDEs and compilers. Pika parsing maintains the linear-time performance characteristics of packrat parsing as a function of input length. The pika parser was benchmarked against the widely-used Parboiled2 and ANTLR4 parsing libraries. The pika parser performed significantly better than the other parsers for an expression grammar, although for a complex grammar implementing the Java language specification, a large constant performance impact was incurred per input character. Therefore, if performance is important, pika parsing is best applied to simple to moderate-sized grammars, or to very large inputs, if other parsing alternatives do not scale linearly in the length of the input. Several new insights into precedence, associativity, and left recursion are presented. Preprint: https://arxiv.org/abs/2005.06444<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Farxiv.org%2Fabs%2F2005.06444&data=02%7C01%7Cmossa%40up.edu%7C3f24e5e7d2a6499eeba008d82307700d%7Cea8f3949231c40b6a33f56873af96f87%7C0%7C1%7C637297862149291999&sdata=%2FyADiuBgL3jYF0W0fQQSd0%2F47%2F0NheHpq8mRsrZ4oq0%3D&reserved=0>
_______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg