On 07/07/2012 02:23 PM, David Piepgrass wrote:
Note that PEG does not require packrat parsing, even though it
was developed to use it. I think it's a historical 'accident' that put
the two together: Bryan Ford's thesis used the two together.

Interesting. After trying to use ANTLR-C# several years back, I got
disillusioned because nobody was interested in fixing the bugs in it
(ANTLR's author is a Java guy first and foremost), and the required
libraries came without source code or a license (wtf.)

So, for a while I was thinking about how I might make my own parser
generator that was "better" than ANTLR. I liked the syntax of PEG
descriptions, but I was concerned about the performance hit of packrat
and, besides, I already liked the syntax and flexibility of ANTLR. So my
idea was to make something that was LL(k) and mixed the syntax of ANTLR
and PEG while using more sane (IMO) semantics than ANTLR did at the time
(I've no idea if ANTLR 3 still uses the same semantics today...) All of
this is 'water under the bridge' now, but I hand-wrote a lexer to help
me plan out how my parser-generator would produce code. The output code
was to be both more efficient and significantly more readable than
ANTLR's output. I didn't get around to writing the parser-generator
itself but I'll have a look back at my handmade lexer for inspiration.

However, as I found a few hours ago, packrat parsing (typically used to
handle PEG) has serious disadvantages: it complicates debugging because
of frequent backtracking, it has problems with error recovery, and it
typically disallows adding actions with side effects (because of the
possibility of backtracking). These are important enough to reconsider
my plans of using Pegged. I will try to analyze whether the issues are
so fundamental that I (or somebody else) will have to create an
ANTLR-like parser instead, or whether it is possible to introduce
changes into Pegged that would fix these problems.

I don't like the sound of this either. Even if PEGs were fast,
difficulty in debugging, error handling, etc. would give me pause. I
insist on well-rounded tools. For example, even though LALR(1) may be
the fastest type of parser (is it?), I prefer not to use it due to its
inflexibility (it just doesn't like some reasonable grammars), and the
fact that the generated code is totally unreadable and hard to debug
(mind you, when I learned LALR in school I found that it is possible to
visualize how it works in a pretty intuitive way--but debuggers won't do
that for you.)

While PEGs are clearly far more flexible than LALR and probably more
flexible than LL(k), I am a big fan of old-fashioned recursive descent
because it's very flexible (easy to insert actions during parsing, and
it's possible to use custom parsing code in certain places, if
necessary*) and the parser generator's output is potentially very
straightforward to understand and debug. In my mind, the main reason you
want to use a parser generator instead of hand-coding is convenience,
e.g. (1) it compresses the grammar down so you can see it clearly, (2)
it computes the first-sets and follow-sets for you, and (3) it gives you
reasonably automatic error handling.

* (If the language you want to parse is well-designed, you'll probably
not need much custom parsing. But it's a nice thing to offer in a
general-purpose parser generator.)
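To make the point about flexibility concrete, here is a minimal
hand-written recursive-descent sketch in D (the grammar and names are
made up purely for illustration). Each rule is an ordinary function, so
semantic actions slot directly into the parsing code:

```d
import std.ascii : isDigit;

// Hypothetical toy parser for sums like "12+3+40".
// Each grammar rule becomes one function; actions (here, evaluation)
// are written inline, which is what makes this style so flexible.
struct Parser
{
    string input;
    size_t pos;

    // expr := term ('+' term)*
    int expr()
    {
        int value = term();
        while (pos < input.length && input[pos] == '+')
        {
            ++pos;            // consume '+'
            value += term();  // semantic action, right where it belongs
        }
        return value;
    }

    // term := digit+
    int term()
    {
        int value = 0;
        while (pos < input.length && isDigit(input[pos]))
        {
            value = value * 10 + (input[pos] - '0');
            ++pos;
        }
        return value;
    }
}

unittest
{
    auto p = Parser("12+3+40");
    assert(p.expr() == 55);
}
```

Dropping into custom parsing code at any point is trivial here: you just
write another function and call it from the rule that needs it.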

I'm not totally sure yet how to support good error messages, efficiency
and straightforward output at the same time, but by the power of D I'm
sure I could think of something...

I would like to submit another approach to parsing that I dare say is my
favorite, even though I have hardly used it at all yet. ANTLR offers
something called "tree parsing" that is extremely cool. It parses trees
instead of linear token streams, and produces other trees as output. I
don't have a good sense of how tree parsing works, but I think that some
kind of tree-based parser generator could become the basis for a very
flexible and easy-to-understand D front-end. If a PG operates on trees
instead of linear token streams, I have a sneaky suspicion that it could
revolutionize how a compiler front-end works.

Why? Because right now parsers operate just once, on the user's input,
and from there you manipulate the AST with "ordinary" code. But if you
have a tree parser, you can routinely manipulate and transform parts of
the tree with a sequence of independent parsers and grammars. Thus,
parsers would replace a lot of things for which you would otherwise use
a visitor pattern, or something. I think I'll try to sketch out this
idea in more detail later.

I was thinking the same thing.

My intent is to create a kind of regular-expression-of-nodes with push/pop operators to recognize ascent and descent on the tree. Such a regular expression would allow one to capture subtrees out of generalized patterns and then place them into new trees that then become the input for the next pattern or set of patterns. I think this is much closer to how I conceptualize semantic analysis than how semantic analysis is done in front ends like DMD: it should be done with pattern recognition and substitution, not with myriad nested if-statements and while-loops.

My vision is to have code similar to this in the front-end:

/+
Lower
        while ( boolExpr )
        {
                statements;
        }

Into

        loopAgain:
        if ( !boolExpr )
                goto exitLoop
        statements;
        goto loopAgain
        exitLoop:
+/
void lowerWhileStatement( SyntaxElement* syntaxNode )
{
        auto captures = syntaxNode.matchNodes(
                TOK_WHILE_NODE,
                OP_ENTER_NODE,
                        OP_CAPTURE(0),
                        OP_BEGIN,
                                TOK_EXPRESSION,
                        OP_END,
                        OP_CAPTURE(1),
                        OP_BEGIN,
                                TOK_STATEMENT,
                        OP_END,
                OP_LEAVE_NODE);
        
        if ( captures is null )
                return;
        
        syntaxNode.replaceWith(
                LabelNode("loopAgain"),
                TOK_IF_STATEMENT,
                OP_INSERT,
                OP_BEGIN,
                        TOK_NEGATE,
                        OP_INSERT,
                        OP_BEGIN,
                                captures[0], // Expression
                        OP_END,
                        GotoStatement("exitLoop"),
                OP_END,
                captures[1], // statements
                GotoStatement("loopAgain"),
                LabelNode("exitLoop")
                );
}


The specifics will easily change. One problem with the above code is that it could probably stand to use more templates and compile-time logic to allow patterns that belong to the same pass to be merged into one expression, thus preventing any unnecessary rescanning. It all becomes DFAs or DPDAs operating on syntax trees.

In this vision I do not use classes and inheritance for my AST. Instead I use structs that contain some kind of nodeType member that would be one of the tokens/symbols in the grammar, like TOK_WHILE_NODE in the above code. Dynamic dispatch is instead performed by (very fast) DFAs recognizing parts of the AST.
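A bare-bones sketch of what such a node might look like (all names here
are hypothetical, just echoing the TOK_* style used above):

```d
import std.algorithm.searching : canFind;

// Hypothetical node kinds, mirroring the TOK_* names in the earlier example.
enum NodeType { TOK_WHILE_NODE, TOK_EXPRESSION, TOK_STATEMENT, TOK_IF_STATEMENT }

// One flat struct for every kind of node: the nodeType tag replaces
// class inheritance, and children are a uniform array rather than
// per-class fields.
struct SyntaxElement
{
    NodeType nodeType;
    SyntaxElement*[] children;
}

// A pass can assert up front which node kinds it is designed to handle,
// instead of relying on virtual dispatch to sort them out.
bool handles(const(SyntaxElement)* node, const(NodeType)[] accepted)
{
    return accepted.canFind(node.nodeType);
}
```

Dispatching on a plain tag like this is also what would let the
DFA-style matchers walk the tree without any virtual calls.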

This kind of architecture leads to other interesting benefits, like being able to assert which symbols a pattern is designed to handle or which symbols are allowed to exist in the AST at any point in time. Thus if you write a lowering that introduces nodes that a later pass can't handle, you'll know very quickly, at least in principle.

I wanted to make such a front-end so that I could easily make a C backend. I believe such a compiler would be able to do that with great ease. I really want a D compiler that can output ANSI C code that can be used with few or no OS/CPU dependencies. I would be willing to lose a lot of the nifty parallelism/concurrency stuff and deal with reference counting instead of full garbage collection, as long as it lets me EASILY target new systems (any phone, console platform, and some embedded microcontrollers). Then what I have is something that's as ubiquitous as C, but adds a lot of useful features like exception handling, dynamic arrays, templates, CTFE, etc etc. My ideas for how to deal with ASTs in pattern recognition and substitution followed from this.

Needing to use D in places where it isn't available is a real pain-point for me right now, and I'll probably find ways to spend time on it eventually.
