Fascinating work, I’m always interested to see a new PEG parsing algorithm, and 
handling left-recursion as cleanly as you did is impressive.

I’m not sure if it would make a difference, but it might be interesting to use 
just the seed parent relationship rather than the subclause relationship in 
your pseudo-topological sort of clauses – any clause other than the first one 
will be in a later column, and thus available in the bottom-up dynamic 
programming order you defined. (for what it’s worth, I found the textual 
description of the modifications to topological sort to handle cycles somewhat 
confusing (though clarified by the code listing), but I don’t have any concrete 
suggestions for improvement).

I’d also be interested to see performance results for a grammar of intermediate 
complexity between your expression grammar and Java. I have a JSON grammar in 
the repository for my own experimental parsing algorithm[1], along with some 
test input (hat tip to Kota Mizushima for providing both). One of the tricky 
things about new PEG parsing algorithms is that recursive descent is actually a 
linear-time algorithm for many practical inputs [2, s.5.1], so it’s hard to 
beat it. I’d also be curious to see a comparison of memory usage between your 
algorithm and ANTLR4 and Parboiled – recursive descent is often faster in 
practice than straight packrat, and a lot of the reason is packrat’s 
much-higher memory use.

[1] https://github.com/bruceiv/egg/tree/deriv/perf_test
[2] http://eptcs.web.cse.unsw.edu.au/paper.cgi?AFL2017.18.pdf

One concern I have is that your claimed O(n) runtime depends on overhead of 
spurious matches being constant per character, and doesn’t account for the 
left-recursion features you’ve added to the grammar. (I didn’t do a re-read of 
your whole paper after noticing this, but it’s possible you avoided claiming 
it’s a worst-case bound, though your language does suggest that 
interpretation.). Given that recursion in the grammar may result in the same 
clause matching multiple times at the same position (as long as the new match 
is longer), I think the worst case-bound on h is O(n), not O(1), which would 
make pika parsing’s worst-case time bound O(n^2). That said, trimming a linear 
factor off the O(n^3) bound for the general-purpose CFG parsing algorithms is 
still very impressive, and (given that grammar recursion is generally bounded 
by a small constant in practice), I expect pika parsing to be linear-time in 
practice. [I found your discussion of grammar looping conflated the issue of 
left-recursion and grammar looping in a somewhat confusing fashion, when 
whether or not the recursion is at the same position in the input is an 
important distinction. My comments on your performance bounds may only apply to 
left-recursive grammars.]

Also, there’s a small typo on p.18: "parsing speed for the expression grammar 
... is approximately 8.9 times higher than the parsing speed for the [Java] 
grammar"

Again, neat work, and a fascinating paper.

Best,
Aaron Moss

From: PEG [mailto:peg-boun...@lists.csail.mit.edu] On Behalf Of Luke Hutchison
Sent: Tuesday, July 7, 2020 23:22
To: peg@lists.csail.mit.edu
Subject: [PEG] Paper on bottom-up, right-to-left PEG parsing

EXTERNAL EMAIL: This email originated from outside the University of Portland. 
Do not click links or open attachments unless you recognize the sender and know 
the content is safe.
________________________________
I published a preprint of the following paper on arXiv.

After writing the paper, I realized that Bryan Ford briefly described the idea 
behind this algorithm in his 2002 Master's thesis; however, as far as I can 
tell, he did not pursue the idea to implementation, preferring the higher 
efficiency of packrat parsing.

I submitted this to ACM TOPLAS for review. Feedback welcome.

Luke Hutchison

---

Pika parsing: reformulating packrat parsing as a dynamic programming algorithm 
solves the left recursion and error recovery problems
Abstract: A recursive descent parser is built from a set of mutually-recursive 
functions, where each function directly implements one of the nonterminals of a 
grammar. A packrat parser uses memoization to reduce the time complexity for 
recursive descent parsing from exponential to linear in the length of the 
input. Recursive descent parsers are extremely simple to write, but suffer from 
two significant problems: (i) left-recursive grammars cause the parser to get 
stuck in infinite recursion, and (ii) it can be difficult or impossible to 
optimally recover the parse state and continue parsing after a syntax error. 
Both problems are solved by the pika parser, a novel reformulation of packrat 
parsing as a dynamic programming algorithm, which requires parsing the input in 
reverse: bottom-up and right to left, rather than top-down and left to right. 
This reversed parsing order enables pika parsers to handle grammars that use 
either direct or indirect left recursion to achieve left associativity, 
simplifying grammar writing, and also enables optimal recovery from syntax 
errors, which is a crucial property for IDEs and compilers. Pika parsing 
maintains the linear-time performance characteristics of packrat parsing as a 
function of input length. The pika parser was benchmarked against the 
widely-used Parboiled2 and ANTLR4 parsing libraries. The pika parser performed 
significantly better than the other parsers for an expression grammar, although 
for a complex grammar implementing the Java language specification, a large 
constant performance impact was incurred per input character. Therefore, if 
performance is important, pika parsing is best applied to simple to 
moderate-sized grammars, or to very large inputs, if other parsing alternatives 
do not scale linearly in the length of the input. Several new insights into 
precedence, associativity, and left recursion are presented.
Preprint: 
https://arxiv.org/abs/2005.06444<https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Farxiv.org%2Fabs%2F2005.06444&data=02%7C01%7Cmossa%40up.edu%7C3f24e5e7d2a6499eeba008d82307700d%7Cea8f3949231c40b6a33f56873af96f87%7C0%7C1%7C637297862149291999&sdata=%2FyADiuBgL3jYF0W0fQQSd0%2F47%2F0NheHpq8mRsrZ4oq0%3D&reserved=0>

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to