Wow. I have *finally* wrapped my head around combinators. It's like the
joke about the MIT professor: "Yes, I was right, it *is* obvious." I want
to try to capture why wrapping my head around this stuff was difficult.
Maybe some poor soul will find this note someday using a search engine and
be able to step past what I just went through.

I actually had a number of issues. Some had to do with different behaviors
in different libraries. One had to do with the way that combinators violate
"stuff every compiler author knows". The last one was the degree of
obfuscation that creeps in when the phase distinction between tokenizer and
parser is removed. Keean stated at one point on LtU that he thought nobody
still used that distinction. Actually, *every* compiler that has any desire
to be performant maintains that distinction rigorously.

AST FABRICATION

Very few of the [f]parsec examples build any ASTs. It wasn't initially
clear to me why >>. and .>> and friends needed to exist. I think I prefer
Keean's idiom in which parsers that should *not* produce output are wrapped
in the form discard(parser), and then the sequencing operation is
overloaded to discard on the right or the left according to type
resolution. I'd be very happy to see an operator overload for discard() in
languages that support that; I'm simply saying that the proliferation of
similar operators in F# like .>> and >>. confused me. Perhaps they exist in
F# to work around the absence of ad hoc polymorphism in the language? It
seems puzzling. Come to that, I don't understand why this proliferation
exists in the Haskell parsec library either. Perhaps the "discard" idea
didn't occur to the original implementor and the existing operators took on
a life of their own. Combinator libraries seem to have been copied one from
another over time; it wouldn't be surprising if restrictions from one
language occasionally got carried over into a language that might have
supported other choices. Who knows!
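
For what it's worth, the same pair exists in Haskell's parsec as the
applicative <* and *>. A minimal sketch (toy names, nothing from a real
grammar), just to pin down what "discard on the left/right" means:

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- <* and *> sequence two parsers and keep only one result,
    -- exactly the roles FParsec gives to .>> and >>. .
    identThenSemi :: Parser String
    identThenSemi = many1 letter <* char ';'   -- keep the identifier, drop ';'

    semiThenIdent :: Parser String
    semiThenIdent = char ';' *> many1 letter   -- drop ';', keep the identifier

Dispatching one overloaded sequencing operator on a discard() wrapper
instead would need exactly the ad hoc polymorphism mentioned above.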

Anyway, since none of the readily discovered examples built ASTs, I was
baffled about how to do so. Finally I read through Laurent LeBrun's parser
for GLSL. When I got to the if/then/else case, I realized what the pipeN
parsers are for. I also came to more fully appreciate just how ugly the
lack of any phase distinction really makes things as I looked at the
scattering of whitespace handlers throughout the text.

So: if you're trying to figure out how to build an AST that takes more than
one parse result as an input, consider the pipeN parsers and have a look at
LeBrun's work.
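
To make that concrete: FParsec's pipe3 feeds three sub-results into a
three-argument function, and in parsec terms the same thing is spelled
applicatively. A toy sketch (the Expr type and grammar here are mine,
not LeBrun's):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Expr = Var String
              | If Expr Expr Expr
              deriving Show

    symbol :: String -> Parser ()
    symbol s = string s >> spaces

    variable :: Parser Expr
    variable = Var <$> (many1 letter <* spaces)

    -- FParsec spelling: pipe3 pCond pThen pElse (fun c t e -> If (c, t, e))
    ifExpr :: Parser Expr
    ifExpr = If <$> (symbol "if"   *> variable)
                <*> (symbol "then" *> variable)
                <*> (symbol "else" *> variable)

Note the spaces handling scattered through even this tiny example; that
is the whitespace ugliness I mentioned above.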

BACKTRACKING

My first confusion arose from backtracking. It happened that the first
combinator library I looked at was a backtracking combinator library. Not
all combinator libraries use backtracking. Many use single "token"
lookahead (though more on that below). But I stumbled into a backtracking
library first, and that set my assumptions as I continued to read.
Backtracking raises the question: "When do we know that some amount of
input has been durably consumed and we can stop buffering that part?"
Because otherwise you have to slurp the entire input into memory and hold
on to it, and there are an awful lot of inputs in the world that are WAY
too big for that.

Backtracking is only useful when a parse is ambiguous, and that can
arise in three ways. The grammar itself can be ambiguous; this reflects
poor language design, but you don't always get to pick your language.
The parse can be ambiguous through limitations of the algorithm (e.g.
you need two tokens of lookahead but your tool only gives you one);
sometimes you have to live with imperfect tools. And the grammar can
become ambiguous because error-correcting productions have been
introduced; this is ubiquitous in production parsers.

There's an interesting side-note here: all of the really *good* parsers
support some degree of backtracking (or something similar like k>1
lookahead) for the sake of error correction. Given this, the argument that
parsers should avoid backtracking because backtracking is O(n^3) is a
little silly. It's silly even if we ignore the packrat ideas that reduce it
back to linear time as a matter of implementation: error detection and correction
isn't really optional. :-)

Anyway... the problem with backtracking is that you can hypothetically
backtrack all the way to the initial token, so you have to buffer the
entire input. And yes, Mathilda, people really *do* use parsers on stream
inputs that won't fit in memory. It took me a while to figure out what's
going on in combinator libraries so as to avoid this.

The first thing going on is that many combinator libraries only do one
token of lookahead. Once you "accept" a token and proceed to the next
[sub]parser, that token has been consumed and cannot be re-examined. This
kind of combinator is non-backtracking (of course), and it introduces a
subtle issue of ordering: if cases are not checked in the correct order,
some parses cannot occur. For example, if you match "identifier" before you
check all of the keywords that are withheld from the space of identifiers,
you'll never match the keywords. This error cannot be diagnosed by a
combinator library in its current form.
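
Here is the trap in a few lines of parsec (toy names):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    ident :: Parser String
    ident = many1 letter

    -- Broken: ident happily matches "while", so the keyword branch
    -- can never fire, and nothing warns you.
    tokBad :: Parser String
    tokBad = ident <|> string "while"

    -- The keyword check has to come first (and even this still has a
    -- prefix problem; see the tokenizing section below).
    tokGood :: Parser String
    tokGood = try (string "while") <|> ident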

The second thing going on is that many combinator libraries have an
explicit "attempt" or "try" type of construct. This says "try the following
parser, and if it doesn't work restore the parse state to where it is now
so that we can try something else". The sample implementations of "state"
that I looked at simply recorded the input position and restored it. It
took me a while to realize that a different implementation could simply use
the try points as "markers" on the input stream (and a variant of this can
be done even in pure idiom). The [positionally] first "try" point marks the
earliest location in the input stream that cannot [yet] be discarded as
consumed. In this approach, we still have the issue that the input buffer
may grow fairly large (e.g. an entire top-level form), but there is at
least a well-defined commit/cut point at which we can discard fully
consumed input. The "attempt/try" combinator introduces a *different* issue
of ordering: it admits the possibility of ambiguous grammars. Whether that
is a bug or a feature depends on what you are trying to do.
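
To pin down the save/restore idea, here is a from-scratch toy (this is
the shape of the thing, not parsec's actual internals): a parser
reports whether it consumed input before failing, try converts a
consuming failure into a non-consuming one, and alternation only falls
through on non-consuming failures. The saved input in try is exactly
the "marker" below which a streaming implementation must keep its
buffer.

    -- result plus remaining input, or failure; the Bool records
    -- whether the failure happened after consuming input
    data Reply a = Ok a String
                 | Err Bool

    newtype P a = P { runP :: String -> Reply a }

    -- try/attempt: run p, but on failure restore the saved state by
    -- reporting the failure as non-consuming
    try' :: P a -> P a
    try' (P p) = P $ \input ->
      case p input of
        Err _ -> Err False
        ok    -> ok

    -- alternation only falls through when nothing was consumed, so
    -- the original input (the marker) is still valid to hand to q
    orElse :: P a -> P a -> P a
    orElse (P p) (P q) = P $ \input ->
      case p input of
        Err False -> q input
        reply     -> reply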

The combination of attempt/try with aggressive token consumption is
actually very clever if properly used, but there are other issues with
tokenization to consider.

TOKENIZING AND PHASE DISTINCTIONS

Almost all of the combinator libraries handle tokenization by doing string
comparisons against the next input position. You check for "do", and then
you check for "if", and so forth. This is far and away the worst misfeature
of combinators. It is a potential source of bugs that are very hard to
detect, and it makes a lie out of the notion that combinator-style parsers
can be naively composed.

From a performance perspective, it isn't as bad as it looks. Most tokens
will be rejected in the first character, most others in the second, and
then a few will suffer from common prefix issues (e.g. "defunion" vs.
"defstruct"). As long as you are dealing with an in-memory input, this can
all be done fairly quickly. Not so much if you have to buffer input from a
stream!

But consider what happens if your language includes both "do" and "done".
In the absence of backtracking, you had better check those in the right
order. Check "do" first and you'll never see "done" at all, because the
token comparator doesn't have any notion of maximal match or the
possibility of token collision. You can collide on keywords and identifiers
in the same way. And the problem with composition is that this sort of
match can be buried deep in some parser in a way that you might not readily
spot. What has happened here is that the "maximal munch" rule for
tokenization has been lost. I suspect the only reason we do not trip over
this routinely in combinator parsers is that potentially conflicting
keywords rarely arise in the same left context and most of the languages
people are parsing are small.
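
The do/done trap, and a hand-rolled repair, in parsec (keyword is my
helper, not a library function):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Broken: string "do" succeeds on the prefix of "done", so the
    -- second branch is unreachable.
    kwBad :: Parser String
    kwBad = try (string "do") <|> string "done"

    -- Restoring maximal munch by hand: try the longer keyword first,
    -- and refuse a keyword that is merely the prefix of a longer word.
    keyword :: String -> Parser String
    keyword s = try (string s <* notFollowedBy alphaNum)

    kwGood :: Parser String
    kwGood = keyword "done" <|> keyword "do"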

As near as I can tell, this cannot be fixed as long as combinators produce
functions. If we want to get this cleaned up (and incidentally restore our
phase distinction), then combinators need to accept and return a (parser
function, tokenizer) pair. The parser functions get combined just as they
do now. The tokenizers get merged. If you want to embed a sub-language
(e.g. parsing javascript inside an HTML script tag), you introduce a new
tokenizer at the point of the sub-parser call. Simply by gathering all of
the tokens and regular expressions into a single tokenizer object you admit
the possibility of dynamically optimizing the token evaluator in the same
way that grep (and friends) do.
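
I know of no library that works this way, so the following is pure
sketch (all names invented): each combinator value carries both a parse
function and the token set it depends on, and every combining form
merges the sets. A real version would compile the merged set into a
single automaton rather than keep literal strings.

    import qualified Data.Set as Set

    type TokenSpec = Set.Set String    -- stand-in for real regexps

    -- a parse function over a token stream, paired with the tokens
    -- that this [sub]parser contributes to the tokenizer
    data P a = P { run  :: [String] -> Maybe (a, [String])
                 , toks :: TokenSpec }

    -- combining the parse functions also merges the token sets
    orElse :: P a -> P a -> P a
    orElse p q = P step (toks p `Set.union` toks q)
      where step ts = case run p ts of
                        Nothing -> run q ts
                        ok      -> ok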

Of course, this assumes that you don't actually *want* context-sensitive
tokenization. If you do, then of course you don't want the tokenizers
merged and I'm not sure *what* to do. I still suspect that it's useful to
have a tokenizer object, if only because you can move all of the whitespace
handling into that and avoid having it spill out all over your parser.

It's interesting to me that the parsec library actually *had* a tokenizer
object and did away with it. I'm not clear why they did that.

It seems pretty clear that one can supply tokenizer-based alternatives to
the character parsers that are present in any of the combinator libraries.
That is: you can build a parser that preserves a phase distinction while
still using parser combinators. I'm not clear why I can't find any examples
of this in the wild.
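
Parsec at least has the hook for it: its parsers are generic over the
token type, so you can lex in a separate phase and parse a token stream
instead of characters. A minimal sketch (Tok and the helpers are mine;
source-position tracking is stubbed out for brevity):

    import Text.Parsec

    data Tok = TIdent String | TKw String
      deriving (Eq, Show)

    -- tokenPrim is parsec's primitive for consuming one token from
    -- an arbitrary stream; here the stream is simply [Tok]
    tok :: (Tok -> Maybe a) -> Parsec [Tok] () a
    tok = tokenPrim show (\pos _ _ -> pos)

    kw :: String -> Parsec [Tok] () ()
    kw s = tok (\t -> if t == TKw s then Just () else Nothing)

    identTok :: Parsec [Tok] () String
    identTok = tok (\t -> case t of { TIdent s -> Just s; _ -> Nothing })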


SPECULATION

And just to wrap this up, I'm now convinced that a phase distinction could
be done for the parser too, *provided* we know all of the primitives and
all of the combining operations. Basically, we could carry around and
incrementally merge a "parser" object in the same way that we can carry
around a "tokenizer" object. In this approach [sub]parsers aren't
functions; they are objects containing sets of productions and actions. We
then introduce a "run" operator that does parsing in compile-and-go form.

Of course: with a powerful enough mechanism for compile-time execution, we
could build constant [sub]parsers in this fashion and compute the parse
tables *at compile time*, including any error checks that might be
performed by the selected parse algorithm.
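
If you squint, the shape is something like this (entirely hypothetical;
no existing library that I know of):

    data Sym = T String | NT String    -- terminal / nonterminal

    data Production = Production
      { lhs :: String
      , rhs :: [Sym] }

    -- a [sub]parser is data, a set of productions, not a function
    data Grammar = Grammar
      { start :: String
      , prods :: [Production] }

    -- combining [sub]parsers just merges production sets, the same
    -- way the tokenizer objects above merge token sets
    merge :: Grammar -> Grammar -> Grammar
    merge g h = g { prods = prods g ++ prods h }

    -- "run" hands the assembled grammar to whatever algorithm you
    -- like; with compile-time evaluation the tables (and any
    -- ambiguity checks) could be built before the program ever runs
    run :: Grammar -> [String] -> Bool
    run = error "elided: table construction and parse driver"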


It just makes your head spin...


shap