Wow. I have *finally* wrapped my head around combinators. It's like the joke about the MIT professor: "Yes, I was right, it *is* obvious." I want to try to capture why wrapping my head around this stuff was difficult. Maybe some poor soul will find this note someday using a search engine and be able to step past what I just went through.
I actually had a number of issues. Some had to do with different behaviors in different libraries. One had to do with the way that combinators violate "stuff every compiler author knows". The last one was the degree of obfuscation that creeps in when the phase distinction between tokenizer and parser is removed. Keean stated at one point on LtU that he thought nobody still used that distinction. Actually, *every* compiler that has any desire to be performant maintains that distinction rigorously.

AST FABRICATION

Very few of the [f]parsec examples build any ASTs. It wasn't initially clear to me why >>. and .>> and friends needed to exist. I think I prefer Keean's idiom, in which parsers that should *not* produce output are wrapped in the form discard(parser), and the sequencing operation is then overloaded to discard on the right or the left according to type resolution. I'd be very happy to see an operator overload for discard() in languages that support that; I'm simply saying that the proliferation of similar operators in F# like .>> and >>. confused me. Perhaps they exist in F# to work around the absence of ad hoc polymorphism in the language? It seems puzzling. Come to that, I don't understand why this proliferation exists in the Haskell parsec library either. Perhaps the "discard" idea didn't occur to the original implementor and the existing operators took on a life of their own. Combinator libraries seem to have been copied one from another over time; it wouldn't be surprising if restrictions from one language occasionally got carried over into a language that might have supported other choices. Who knows!

Anyway, since none of the readily discovered examples built ASTs, I was baffled about how to do so. Finally I read through Laurent LeBrun's parser for GLSL. When I got to the if/then/else case, I realized what the pipeN parsers are for. I also came to more fully appreciate just how ugly the lack of any phase distinction really makes things as I looked at the scattering of whitespace handlers throughout the text. So: if you're trying to figure out how to build an AST that takes more than one parse result as an input, consider the pipeN parsers and have a look at LeBrun's work.

BACKTRACKING

My first confusion arose from backtracking. It happened that the first combinator library I looked at was a backtracking combinator library. Not all combinator libraries use backtracking. Many use single "token" lookahead (though more on that below). But I stumbled into a backtracking library first, and that set my assumptions as I continued to read.

Backtracking raises the question: "When do we know that some amount of input has been durably consumed and we can stop buffering that part?" Because otherwise you have to slurp the entire input into memory and hold on to it, and there are an awful lot of inputs in the world that are WAY too big for that.

Backtracking is only useful when a language is ambiguous. This can arise in three ways. The language itself (the grammar) can be ambiguous; this reflects poor language design, but you don't always get to pick your language. The language can be ambiguous through limitations of algorithm (e.g. you need 2-token lookahead but your tool only gives you one token of lookahead); sometimes you have to live with imperfect tools. The language can become ambiguous because error-correcting productions have been introduced; this is ubiquitous in all production parsers.
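To make a couple of the points above concrete, here is a minimal sketch in Haskell's parsec (an fparsec version would use pipe2 where this uses <$> and <*>, and >>./.>>  where this uses *> and <*). It shows an AST node being fabricated from more than one parse result, and also the second kind of ambiguity just mentioned: both alternatives start with an identifier, so one token of lookahead is not enough and an attempt/try is needed. The Stmt type and the toy grammar are invented for illustration; only the library operators are real.

    -- A sketch of AST fabrication plus a local ambiguity, in Haskell parsec.
    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Stmt = Assign String Integer     -- x = 5
              | Bare   String             -- x
              deriving Show

    -- Whitespace handling leaks into the parser; exactly the ugliness
    -- complained about above.
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    ident :: Parser String
    ident = lexeme (many1 letter)

    number :: Parser Integer
    number = read <$> lexeme (many1 digit)

    -- Two results are combined into one AST node: the pipe2 pattern.
    assign :: Parser Stmt
    assign = Assign <$> ident <*> (lexeme (char '=') *> number)

    bare :: Parser Stmt
    bare = Bare <$> ident

    -- Both alternatives begin with an identifier, so one token of lookahead
    -- is not enough. Without 'try', a failed 'assign' has already consumed
    -- the identifier and 'bare' never runs.
    stmt :: Parser Stmt
    stmt = try assign <|> bare

    -- parseTest stmt "x = 5"   ==>  Assign "x" 5
    -- parseTest stmt "x"       ==>  Bare "x"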
There's an interesting side-note here: all of the really *good* parsers support some degree of backtracking (or something similar, like k>1 lookahead) for the sake of error correction. Given this, the argument that parsers should avoid backtracking because backtracking is O(n^3) is a little silly. It's silly even if we ignore the packrat ideas that reduce it back to linear time as a matter of implementation: error detection and correction isn't really optional. :-)

Anyway... the problem with backtracking is that you can hypothetically backtrack all the way to the initial token, so you have to buffer the entire input. And yes, Mathilda, people really *do* use parsers on stream inputs that won't fit in memory. It took me a while to figure out what's going on in combinator libraries so as to avoid this.

The first thing going on is that many combinator libraries only do one token of lookahead. Once you "accept" a token and proceed to the next [sub]parser, that token has been consumed and cannot be re-examined. This kind of combinator is non-backtracking (of course), and it introduces a subtle issue of ordering: if cases are not checked in the correct order, some parses cannot occur. For example, if you match "identifier" before you check all of the keywords that are withheld from the space of identifiers, you'll never match the keywords. This error cannot be diagnosed by a combinator library in its current form.

The second thing going on is that many combinator libraries have an explicit "attempt" or "try" type of construct. This says "try the following parser, and if it doesn't work, restore the parse state to where it is now so that we can try something else". The sample implementations of "state" that I looked at simply recorded the input position and restored it. It took me a while to realize that a different implementation could simply use the try points as "markers" on the input stream (and a variant of this can be done even in pure idiom). The [positionally] first "try" point marks the earliest location in the input stream that cannot [yet] be discarded as consumed. In this approach we still have the issue that the input buffer may grow fairly large (e.g. an entire top-level form), but there is at least a well-defined commit/cut point at which we can discard fully consumed input.

The "attempt/try" combinator introduces a *different* issue of ordering: it admits the possibility of ambiguous grammars. Whether that is a bug or a feature depends on what you are trying to do. The combination of attempt/try with aggressive token consumption is actually very clever if properly used, but there are other issues with tokenization to consider.

TOKENIZING AND PHASE DISTINCTIONS

Almost all of the combinator libraries handle tokenization by doing string comparisons against the next input position. You check for "do", and then you check for "if", and so forth. This is far and away the worst misfeature of combinators. It is a potential source of bugs that are very hard to detect, and it makes a lie out of the notion that combinator-style parsers can be naively composed.

From a performance perspective, it isn't as bad as it looks. Most tokens will be rejected on the first character, most others on the second, and then a few will suffer from common-prefix issues (e.g. "defunion" vs. "defstruct"). As long as you are dealing with an in-memory input, this can all be done fairly quickly. Not so much if you have to buffer input from a stream! But consider what happens if your language includes both "do" and "done".
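Here is that trap in miniature, again as a Haskell parsec sketch (the same thing happens with fparsec's pstring); the paragraph below spells out why it goes wrong. The Tok type and the workarounds are mine, invented for illustration; only try and notFollowedBy are library combinators.

    -- A sketch of the lost maximal-munch rule, in Haskell parsec.
    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Tok = KwDo | KwDone deriving Show

    -- BROKEN: "do" is tested first, so on input "done" it matches "do",
    -- leaves "ne" behind, and "done" can never be recognized here.
    badTok :: Parser Tok
    badTok =  (KwDo   <$ string "do")
          <|> (KwDone <$ string "done")

    -- Workaround 1: order the longer token first. 'try' is needed because
    -- "done" and "do" share a prefix; without it, a failed "done" leaves
    -- input consumed and the alternative is never tried.
    okTok1 :: Parser Tok
    okTok1 =  (KwDone <$ try (string "done"))
          <|> (KwDo   <$ string "do")

    -- Workaround 2: restore a crude maximal-munch rule by demanding that a
    -- keyword not be followed by further identifier characters.
    kw :: String -> Parser ()
    kw s = try (string s *> notFollowedBy alphaNum)

    okTok2 :: Parser Tok
    okTok2 =  (KwDo   <$ kw "do")
          <|> (KwDone <$ kw "done")

    -- parseTest badTok "done"   ==>  KwDo   ("ne" is left unconsumed)
    -- parseTest okTok1 "done"   ==>  KwDone
    -- parseTest okTok2 "done"   ==>  KwDone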
In the absence of backtracking, you had better check those in the right order. Check "do" first and you'll never see "done" at all, because the token comparator has no notion of maximal match or of the possibility of token collision. You can collide on keywords and identifiers in the same way. And the problem with composition is that this sort of match can be buried deep in some parser in a way that you might not readily spot. What has happened here is that the "maximal munch" rule for tokenization has been lost. I suspect the only reason we do not trip over this routinely in combinator parsers is that potentially conflicting keywords rarely arise in the same left context, and most of the languages people are parsing are small.

As near as I can tell, this cannot be fixed as long as combinators produce functions. If we want to get this cleaned up (and incidentally restore our phase distinction), then combinators need to accept and return a (parser function, tokenizer) pair. The parser functions get combined just as they do now. The tokenizers get merged. If you want to embed a sub-language (e.g. parsing JavaScript inside an HTML script tag), you introduce a new tokenizer at the point of the sub-parser call. Simply by gathering all of the tokens and regular expressions into a single tokenizer object, you admit the possibility of dynamically optimizing the token evaluator in the same way that grep (and friends) do.

Of course, this assumes that you don't actually *want* context-sensitive tokenization. If you do, then of course you don't want the tokenizers merged, and I'm not sure *what* to do. I still suspect that it's useful to have a tokenizer object, if only because you can move all of the whitespace handling into it and avoid having it spill out all over your parser (a sketch of what this can look like appears at the end of this note). It's interesting to me that the parsec library actually *had* a tokenizer object and did away with it. I'm not clear why they did that.

It seems pretty clear that one can supply tokenizer-based alternatives to the character parsers that are present in any of the combinator libraries. That is: you can build a parser that preserves a phase distinction while still using parser combinators. I'm not clear why I can't find any examples of this in the wild.

SPECULATION

And just to wrap this up, I'm now convinced that a phase distinction could be done for the parser too, *provided* we know all of the primitives and all of the combining operations. Basically, we could carry around and incrementally merge a "parser" object in the same way that we can carry around a "tokenizer" object. In this approach, [sub]parsers aren't functions; they are objects containing sets of productions and actions. We then introduce a "run" operator that does parsing in compile-and-go form.

Of course, with a powerful enough mechanism for compile-time execution, we could build constant [sub]parsers in this fashion and compute the parse tables *at compile time*, including any error checks that might be performed by the selected parse algorithm. It just makes your head spin...

shap
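For concreteness, here is roughly what leaning on a tokenizer object looks like in Haskell's parsec, via its Text.Parsec.Token module (which the parsec versions I have used still ship). You declare the reserved words once; the derived reserved and identifier parsers then take care of trailing whitespace, maximal munch, and the keyword/identifier collision, regardless of the order in which alternatives are listed. The toy language definition below is invented for illustration.

    -- A sketch of the "tokenizer object" idea using parsec's Text.Parsec.Token.
    -- The reserved words are declared once; the derived parsers take care of
    -- trailing whitespace, keyword/identifier collisions, and do/done style
    -- prefix clashes.
    import Text.Parsec
    import Text.Parsec.String (Parser)
    import qualified Text.Parsec.Token as Tok
    import Text.Parsec.Language (emptyDef)

    langDef :: Tok.LanguageDef ()
    langDef = emptyDef
      { Tok.reservedNames = ["do", "done", "if", "then", "else"]
      , Tok.commentLine   = "--"
      }

    lexer :: Tok.TokenParser ()
    lexer = Tok.makeTokenParser langDef

    -- All whitespace handling lives behind these; it no longer spills out
    -- all over the grammar.
    reserved :: String -> Parser ()
    reserved = Tok.reserved lexer

    identifier :: Parser String
    identifier = Tok.identifier lexer

    -- "done" is never mistaken for "do", and identifiers never capture
    -- reserved words, regardless of the order of the alternatives.
    stmt :: Parser String
    stmt =  ("block start" <$ reserved "do")
        <|> ("block end"   <$ reserved "done")
        <|> identifier

A real grammar would also run whiteSpace from the same lexer once at the very start of the input to skip leading whitespace and comments; after that, the lexeme parsers keep the whitespace handling out of the grammar proper.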
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
