Note that PEG does not impose to use packrat parsing, even
though it
was developed to use it. I think it's a historical 'accident'
that put
the two together: Bryan Ford thesis used the two together.
Interesting. After trying to use ANTLR-C# several years back, I
got disillusioned because nobody was interested in fixing the
bugs in it (ANTLR's author is a Java guy first and foremost) and
the source code of the required libraries didn't have source code
or a license (wtf.)
So, for awhile I was thinking about how I might make my own
parser generator that was "better" than ANTLR. I liked the syntax
of PEG descriptions, but I was concerned about the performance
hit of packrat and, besides, I already liked the syntax and
flexibility of ANTLR. So my idea was to make something that was
LL(k) and mixed the syntax of ANTLR and PEG while using more sane
(IMO) semantics than ANTLR did at the time (I've no idea if ANTLR
3 still uses the same semantics today...) All of this is 'water
under the bridge' now, but I hand-wrote a lexer to help me plan
out how my parser-generator would produce code. The output code
was to be both more efficient and significantly more readable
than ANTLR's output. I didn't get around to writing the
parser-generator itself but I'll have a look back at my handmade
lexer for inspiration.
However, as I found a few hours ago, Packrat parsing
(typically used to
handle PEG) has serious disadvantages: it complicates
debugging because of
frequent backtracking, it has problems with error recovery,
and typically
disallows to add actions with side effects (because of
possibility of
backtracking). These are important enough to reconsider my
plans of using
Pegged. I will try to analyze whether the issues are so
fundamental that I
(or somebody else) will have to create an ANTLR-like parser
instead, or
whether it is possible to introduce changes into Pegged that
would fix these
problems.
I don't like the sound of this either. Even if PEGs were fast,
difficulty in debugging, error handling, etc. would give me
pause. I insist on well-rounded tools. For example, even though
LALR(1) may be the fastest type of parser (is it?), I prefer not
to use it due to its inflexibility (it just doesn't like some
reasonable grammars), and the fact that the generated code is
totally unreadable and hard to debug (mind you, when I learned
LALR in school I found that it is possible to visualize how it
works in a pretty intuitive way--but debuggers won't do that for
you.)
While PEGs are clearly far more flexible than LALR and probably
more flexible than LL(k), I am a big fan of old-fashioned
recursive descent because it's very flexible (easy to insert
actions during parsing, and it's possible to use custom parsing
code in certain places, if necessary*) and the parser generator's
output is potentially very straightforward to understand and
debug. In my mind, the main reason you want to use a parser
generator instead of hand-coding is convenience, e.g. (1) to
compress the grammar down so you can see it clearly, (2) have the
PG compute the first-sets and follow-sets for you, (3) get
reasonably automatic error handling.
* (If the language you want to parse is well-designed, you'll
probably not need much custom parsing. But it's a nice thing to
offer in a general-purpose parser generator.)
I'm not totally sure yet how to support good error messages,
efficiency and straightforward output at the same time, but by
the power of D I'm sure I could think of something...
I would like to submit another approach to parsing that I dare
say is my favorite, even though I have hardly used it at all yet.
ANTLR offers something called "tree parsing" that is extremely
cool. It parses trees instead of linear token streams, and
produces other trees as output. I don't have a good sense of how
tree parsing works, but I think that some kind of tree-based
parser generator could become the basis for a very flexible and
easy-to-understand D front-end. If a PG operates on trees instead
of linear token streams, I have a sneaky suspicion that it could
revolutionize how a compiler front-end works.
Why? because right now parsers operate just once, on the user's
input, and from there you manipulate the AST with "ordinary"
code. But if you have a tree parser, you can routinely manipulate
and transform parts of the tree with a sequence of independent
parsers and grammars. Thus, parsers would replace a lot of things
for which you would otherwise use a visitor pattern, or
something. I think I'll try to sketch out this idea in more
detail later.