On 3 June 2015 at 14:53, Jonathan S. Shapiro <[email protected]> wrote:
>
> I actually had a number of issues. Some had to do with different behaviors in different libraries. One had to do with the way that combinators violate "stuff every compiler author knows". The last one was the degree of obfuscation that creeps in when the phase distinction between tokenizer and parser is removed. Keean stated at one point on LtU that he thought nobody still used that distinction. Actually, *every* compiler that has any desire to be performant maintains that distinction rigorously.
>
Mine doesn't, so that statement is obviously false by one counter-example :-) I use virtual tokenization, which is itself a parser combinator: I take a parser description written in combinators and convert it into a tokenizer. It does both in one pass, effectively using lazy tokenization.

> AST FABRICATION
>
> Very few of the [f]parsec examples build any ASTs. It wasn't initially clear to me why >>. and .>> and friends needed to exist. I think I prefer Keean's idiom in which parsers that should *not* produce output are wrapped in the form discard(parser), and then the sequencing operation is overloaded to discard on the right or the left according to type resolution. I'd be very happy to see an operator overload for discard() in languages that support that; I'm simply saying that the proliferation of similar operators in F# like .>> and >>. confused me. Perhaps they exist in F# to work around the absence of ad hoc polymorphism in the language? It seems puzzling. Come to that, I don't understand why this proliferation exists in the Haskell parsec library either. Perhaps the "discard" idea didn't occur to the original implementor and the existing operators took on a life of their own. Combinator libraries seem to have been copied one from another over time; it wouldn't be surprising if restrictions from one language occasionally got carried over into a language that might have supported other choices. Who knows!

The general principle is that the combinators build an abstract representation (it doesn't have to be a tree: a Prolog clause database, for example, is a map from predicates to term lists, and in SQL you have lists of grouping and ordering terms to apply to a query). You then read the abstract syntax out into the target concrete syntax. Everything is a compiler :-) For parser combinators the transformation is often a no-op, so you go straight from the combinators to the actual parser objects / functions.
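For what it's worth, in Haskell the same keep-one-side sequencing already falls out of Applicative: <* keeps the result on its left and *> keeps the result on its right, which is roughly what FParsec's .>> and >>. are doing. A minimal Parsec-flavoured sketch (the toy AST and the names in it are mine, purely for illustration):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- A toy AST node, purely for illustration.
    data Expr = Var String | If Expr Expr Expr
      deriving Show

    -- Skipping trailing spaces here stands in for the whitespace
    -- handling that otherwise gets scattered through the grammar.
    symbol :: String -> Parser String
    symbol s = string s <* spaces

    variable :: Parser Expr
    variable = Var <$> (many1 letter <* spaces)

    -- '*>' discards the keyword on its left; '<*' would discard on the
    -- right. These play the role of FParsec's >>. and .>> respectively.
    ifExpr :: Parser Expr
    ifExpr =
      If <$> (symbol "if"   *> variable)
         <*> (symbol "then" *> variable)
         <*> (symbol "else" *> variable)

The applicative chain in ifExpr is also essentially what the pipeN parsers mentioned below are for: feeding several sub-results into one constructor.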
> Anyway, since none of the readily discovered examples built ASTs, I was baffled about how to do so. Finally I read through Laurent LeBrun's parser for GLSL. When I got to the if/then/else case, I realized what the pipeN parsers are for. I also came to more fully appreciate just how ugly the lack of any phase distinction really makes things as I looked at the scattering of whitespace handlers throughout the text.
>
> So: if you're trying to figure out how to build an AST that takes more than one parse result as an input, consider the pipeN parsers and have a look at LeBrun's work.
>
> BACKTRACKING
>
> My first confusion arose from backtracking. It happened that the first combinator library I looked at was a backtracking combinator library. Not all combinator libraries use backtracking. Many use single 'token' lookahead (though more on that below). But I stumbled into a backtracking library first, and that set my assumptions as I continued to read. Backtracking raises the question: "When do we know that some amount of input has been durably consumed and we can stop buffering that part?" Because otherwise you have to slurp the entire input into memory and hold on to it, and there are an awful lot of inputs in the world that are WAY too big for that.

Use file buffering :-) I use mmap for high performance, but seeking on the input stream is reasonably fast within a buffer page. The OS does a pretty good job of buffering file reads.

> Backtracking is only useful when a language is ambiguous. This can arise in three ways. The language itself (the grammar) can be ambiguous; this reflects poor language design, but you don't always get to pick your language. The language can be ambiguous through limitations of algorithm (e.g. you need 2-token lookahead but your tool only gives you one token of lookahead); sometimes you have to live with imperfect tools. The language can become ambiguous because error-correcting productions have been introduced; this is ubiquitous in all production parsers.
>
> There's an interesting side-note here: all of the really *good* parsers support some degree of backtracking (or something similar like k>1 lookahead) for the sake of error correction. Given this, the argument that parsers should avoid backtracking because backtracking is O(n^3) is a little silly. It's silly even if we ignore the packrat ideas that reduce it back to O(1) as a matter of implementation: error detection and correction isn't really optional. :-)
>
> Anyway... the problem with backtracking is that you can hypothetically backtrack all the way to the initial token, so you have to buffer the entire input. And yes, Mathilda, people really *do* use parsers on stream inputs that won't fit in memory. It took me a while to figure out what's going on in combinator libraries so as to avoid this.

Just seek the file back to the beginning :-)

> The first thing going on is that many combinator libraries only do one token of lookahead. Once you "accept" a token and proceed to the next [sub]parser, that token has been consumed and cannot be re-examined. This kind of combinator is non-backtracking (of course), and it introduces a subtle issue of ordering: if cases are not checked in the correct order, some parses cannot occur. For example, if you match "identifier" before you check all of the keywords that are withheld from the space of identifiers, you'll never match the keywords. This error cannot be diagnosed by a combinator library in its current form.
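That particular ordering trap is usually defused by making the identifier parser itself withhold the reserved words, rather than relying on clause order. A rough Parsec-style sketch (the keyword list here is made up):

    import Text.Parsec
    import Text.Parsec.String (Parser)
    import Control.Monad (when)

    -- Hypothetical reserved-word list for the example.
    keywords :: [String]
    keywords = ["if", "then", "else", "do", "done"]

    -- An identifier parser that refuses the reserved words, so it no
    -- longer matters whether it is tried before or after the keywords.
    identifier :: Parser String
    identifier = try $ do
      s <- many1 letter
      when (s `elem` keywords) $
        unexpected ("keyword " ++ show s)
      return s

This is more or less what parsec's Token module does for its identifier and reserved parsers, with the trailing-whitespace handling folded in as well.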
> The second thing going on is that many combinator libraries have an explicit "attempt" or "try" type of construct. This says "try the following parser, and if it doesn't work restore the parse state to where it is now so that we can try something else". The sample implementations of "state" that I looked at simply recorded the input position and restored it. It took me a while to realize that a different implementation could simply use the try points as "markers" on the input stream (and a variant of this can be done even in pure idiom). The [positionally] first "try" point marks the earliest location in the input stream that cannot [yet] be discarded as consumed. In this approach, we still have the issue that the input buffer may grow fairly large (e.g. an entire top-level form), but there is at least a well-defined commit/cut point at which we can discard fully consumed input. The "attempt/try" combinator introduces a *different* issue of ordering: it admits the possibility of ambiguous grammars. Whether that is a bug or a feature depends on what you are trying to do.
>
> The combination of attempt/try with aggressive token consumption is actually very clever if properly used, but there are other issues with tokenization to consider.
>
> TOKENIZING AND PHASE DISTINCTIONS
>
> Almost all of the combinator libraries handle tokenization by doing string comparisons against the next input position. You check for "do", and then you check for "if", and so forth. This is far and away the worst misfeature of combinators. It is a potential source of bugs that are very hard to detect, and it makes a lie out of the notion that combinator-style parsers can be naively composed.
>
> From a performance perspective, it isn't as bad as it looks. Most tokens will be rejected on the first character, most others on the second, and then a few will suffer from common-prefix issues (e.g. "defunion" vs. "defstruct"). As long as you are dealing with an in-memory input, this can all be done fairly quickly. Not so much if you have to buffer input from a stream!
>
> But consider what happens if your language includes both "do" and "done". In the absence of backtracking, you had better check those in the right order. Check "do" first and you'll never see "done" at all, because the token comparator doesn't have any notion of maximal match or the possibility of token collision. You can collide on keywords and identifiers in the same way. And the problem with composition is that this sort of match can be buried deep in some parser in a way that you might not readily spot. What has happened here is that the "maximal munch" rule for tokenization has been lost. I suspect the only reason we do not trip over this routinely in combinator parsers is that potentially conflicting keywords rarely arise in the same left context and most of the languages people are parsing are small.

I don't see the do / done problem. One parser looks for "d", "o", (optional space), then some other token; the other looks for "d", "o", "n", "e". There is no ambiguity: the "do" parser should clearly reject "don...".
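To make that concrete: with bare string matching the collision is real, but a keyword combinator that refuses to match when another identifier character follows behaves the way I mean. A rough Parsec-flavoured sketch (nothing here is specific to any particular library):

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Naive version: bare string matching. On input "done ..." the first
    -- alternative consumes "do" and returns, so "done" is never recognised.
    naive :: Parser String
    naive = string "do" <|> string "done"

    -- Keyword version: require that the match is not followed by another
    -- identifier character, and wrap in 'try' so a partial match does not
    -- consume input. This restores maximal-munch behaviour.
    keyword :: String -> Parser String
    keyword s = try (string s <* notFollowedBy alphaNum)

    better :: Parser String
    better = keyword "do" <|> keyword "done"

With the keyword wrapper the order of the alternatives stops mattering, which is the property a separate tokenizer's maximal-munch rule normally gives you for free.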
> As near as I can tell, this cannot be fixed as long as combinators produce functions. If we want to get this cleaned up (and incidentally restore our phase distinction), then combinators need to accept and return a (parser function, tokenizer) pair. The parser functions get combined just as they do now. The tokenizers get merged. If you want to embed a sub-language (e.g. parsing JavaScript inside an HTML script tag), you introduce a new tokenizer at the point of the sub-parser call. Simply by gathering all of the tokens and regular expressions into a single tokenizer object you admit the possibility of dynamically optimizing the token evaluator in the same way that grep (and friends) do.
>
> Of course, this assumes that you don't actually *want* context-sensitive tokenization. If you do, then of course you don't want the tokenizers merged and I'm not sure *what* to do. I still suspect that it's useful to have a tokenizer object, if only because you can move all of the whitespace handling into that and avoid having it spill out all over your parser.
>
> It's interesting to me that the parsec library actually *had* a tokenizer object and did away with it. I'm not clear why they did that.
>
> It seems pretty clear that one can supply tokenizer-based alternatives to the character parsers that are present in any of the combinator libraries. That is: you can build a parser that preserves a phase distinction while still using parser combinators. I'm not clear why I can't find any examples of this in the wild.
>
> SPECULATION
>
> And just to wrap this up, I'm now convinced that a phase distinction could be done for the parser too, *provided* we know all of the primitives and all of the combining operations. Basically, we could carry around and incrementally merge a "parser" object in the same way that we can carry around a "tokenizer" object. In this approach [sub]parsers aren't functions; they are objects containing sets of productions and actions. We then introduce a "run" operator that does parsing in compile-and-go form.
>
> Of course: with a powerful enough mechanism for compile-time execution, we could build constant [sub]parsers in this fashion and compute the parse tables *at compile time*, including any error checks that might be performed by the selected parse algorithm.

Yes, I think this is an interesting idea. I am not convinced a separate tokenizer is a good idea, as traditionally you end up with hacks to make it work for a given grammar. You could generate a tokenizer, but it may end up context-sensitive, which could be problematic.

Keean.
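P.S. For the record, here is a toy version of the "(parser function, tokenizer) pair" idea: each combinator carries the set of literal tokens it mentions alongside its parse function, so the token sets get merged as the parsers are combined and can be inspected (or, in principle, compiled) before anything is run. Purely a sketch of my own, not any real library's API:

    import qualified Data.Set as Set
    import Data.Set (Set)

    -- A combinator value pairs an ordinary parse function with the set
    -- of literal tokens it can consume, so the "tokenizer half" can be
    -- merged and examined separately from the "parser half".
    data P a = P
      { tokens :: Set String                        -- literals this parser mentions
      , run    :: [String] -> Maybe (a, [String])   -- toy parse function over a token stream
      }

    -- Primitive: match one literal token.
    tok :: String -> P String
    tok s = P (Set.singleton s) step
      where
        step (t:ts) | t == s = Just (t, ts)
        step _               = Nothing

    -- Sequencing merges the token sets as well as composing the functions.
    andThen :: P a -> P b -> P (a, b)
    andThen p q = P (tokens p `Set.union` tokens q) step
      where
        step input = do
          (a, rest)  <- run p input
          (b, rest') <- run q rest
          return ((a, b), rest')

    -- Choice does the same merge.
    orElse :: P a -> P a -> P a
    orElse p q = P (tokens p `Set.union` tokens q) step
      where
        step input = maybe (run q input) Just (run p input)

At that point the merged token set is exactly the thing you could hand to a generated tokenizer (or check for keyword collisions) before running anything, which is where both the phase distinction above and my worry about it turning out context-sensitive come in.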
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
