On 3 June 2015 at 14:53, Jonathan S. Shapiro <[email protected]> wrote:
>
> I actually had a number of issues. Some had to do with different behaviors
> in different libraries. One had to do with the way that combinators violate
> "stuff every compiler author knows". The last one was the degree of
> obfuscation that creeps in when the phase distinction between tokenizer and
> parser is removed. Keean stated at one point on LtU that he thought nobody
> still used that distinction. Actually, *every* compiler that has any desire
> to be performant maintains that distinction rigorously.
>

Mine doesn't, so that statement is obviously false by one counter-example
:-)

I use virtual tokenization, which is itself a parser combinator: I take a
parser description in combinators and convert it into a tokenizer. It does
both in one pass, by effectively using lazy tokenization.
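
Roughly, the shape of the idea in parsec-style terms (just a sketch, not my
actual implementation; lexeme, symbol and number are illustrative helpers):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Illustrative sketch only.  A "virtual token" is a character-level
-- parser wrapped so that trailing whitespace is consumed on demand:
-- tokenization happens lazily, as a side effect of the parser asking
-- for the next token.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

number :: Parser Integer
number = lexeme (read <$> many1 digit)

-- Both "phases" run in one pass: asking for '(', a number and ')'
-- tokenizes exactly as much input as the parse needs.
parenNumber :: Parser Integer
parenNumber = symbol "(" *> number <* symbol ")"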


>
> AST FABRICATION
>
> Very few of the [f]parsec examples build any ASTs. It wasn't initially
> clear to me why >>. and .>> and friends needed to exist. I think I prefer
> Keean's idiom in which parsers that should *not* produce output are wrapped
> in the form discard(parser), and then the sequencing operation is
> overloaded to discard on the right or the left according to type
> resolution. I'd be very happy to see an operator overload for discard() in
> languages that support that; I'm simply saying that the proliferation of
> similar operators in F# like .>> and >>. confused me. Perhaps they exist in
> F# to work around the absence of ad hoc polymorphism in the language? It
> seems puzzling. Come to that, I don't understand why this proliferation
> exists in the Haskell parsec library either. Perhaps the "discard" idea
> didn't occur to the original implementor and the existing operators took on
> a life of their own. Combinator libraries seem to have been copied one from
> another over time; it wouldn't be surprising if restrictions from one
> language occasionally got carried over into a language that might have
> supported other choices. Who knows!
>
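
As an aside, Haskell parsec already gets the keep-left / keep-right effect
from the standard Applicative operators, and a discard wrapper is easy to
layer on top; a quick sketch (inParens and discard are illustrative names):

import Text.Parsec
import Text.Parsec.String (Parser)

-- p <* q runs both parsers but keeps p's result; p *> q keeps q's.
-- These are the parsec-side analogue of fparsec's .>> and >>. .
inParens :: Parser a -> Parser a
inParens p = char '(' *> p <* char ')'

-- An explicit discard, if you prefer that reading: the result is
-- thrown away and replaced by ().
discard :: Parser a -> Parser ()
discard p = () <$ p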

The general principle is that the combinators build an abstract
representation (it doesn't have to be a tree: a Prolog clause database is a
map from predicates to term lists, and in SQL you have lists of grouping and
ordering terms to apply to a query). You then read the abstract syntax out
into the target concrete syntax. Everything is a compiler :-) For parser
combinators the transformation is often a no-op, so you go straight from the
combinators to the actual parser objects / functions.
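
For instance, a toy parsec sketch that reads straight out into a little AST;
the Expr type and the helper names are invented for illustration, and the
applicative style is roughly what fparsec's pipe3 does:

import Text.Parsec
import Text.Parsec.String (Parser)

-- A tiny abstract syntax, invented for illustration.
data Expr = Num Integer
          | If Expr Expr Expr
          deriving Show

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol s = lexeme (string s)

number :: Parser Expr
number = Num . read <$> lexeme (many1 digit)

-- Three sub-results feed one constructor.
ifExpr :: Parser Expr
ifExpr = If <$> (symbol "if"   *> expr)
            <*> (symbol "then" *> expr)
            <*> (symbol "else" *> expr)

expr :: Parser Expr
expr = ifExpr <|> number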


> Anyway, since none of the readily discovered examples built ASTs, I was
> baffled about how to do so. Finally I read through Laurent LeBrun's parser
> for GLSL. When I got to the if/then/else case, I realized what the pipeN
> parsers are for. I also came to more fully appreciate just how ugly the
> lack of any phase distinction really makes things as I looked at the
> scattering of whitespace handlers throughout the text.
>
> So: if you're trying to figure out how to build an AST that takes more
> than one parse result as an input, consider the pipeN parsers and have a
> look at LeBrun's work.
>
> BACKTRACKING
>
> My first confusion arose from backtracking. It happened that the first
> combinator library I looked at was a backtracking combinator library. Not
> all combinator libraries use backtracking. Many use single "token"
> lookahead (though more on that below). But I stumbled into a backtracking
> library first, and that set my assumptions as I continued to read.
> Backtracking raises the question: "When do we know that some amount of
> input has been durably consumed and we can stop buffering that part?"
> Because otherwise you have to slurp the entire input into memory and hold
> on to it, and there are an awful lot of inputs in the world that are WAY
> too big for that.
>

Use file buffering :-) I use mmap for high performance, but seeking on the
input stream is reasonably fast within a buffer page. The OS does a pretty
good job of buffering file reads.
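
One way to lean on that buffering from the parser side (a sketch, not what my
parser actually does) is to hand the combinators a lazily read input, so
chunks are only paged in as the parse reaches them:

import qualified Data.ByteString.Lazy.Char8 as L
import Text.Parsec
import Text.Parsec.ByteString.Lazy (Parser)

-- Count the lines of a possibly huge file without slurping it eagerly:
-- L.readFile reads lazily, chunk by chunk, so the OS page cache does
-- the buffering and the parser only forces what it needs.
countLines :: Parser Int
countLines = length <$> many (manyTill anyChar newline)

main :: IO ()
main = do
  input <- L.readFile "big-input.txt"   -- lazy, chunked read
  print (parse countLines "big-input.txt" input)

Whether the earlier chunks can actually be dropped still depends on where the
backtrack points are, which is really your point about try markers further
down.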


> Backtracking is only useful when a language is ambiguous. This can arise
> in three ways. The language itself (the grammar) can be ambiguous; this
> reflects poor language design, but you don't always get to pick your
> language. The language can be ambiguous through limitations of algorithm
> (e.g. you need 2 token lookahead but your tool only gives you one token
> lookahead); sometimes you have to live with imperfect tools. The language
> can become ambiguous because error correcting productions have been
> introduced; this is ubiquitous in all production parsers.
>
> There's an interesting side-note here: all of the really *good* parsers
> support some degree of backtracking (or something similar like k>1
> lookahead) for the sake of error correction. Given this, the argument that
> parsers should avoid backtracking because backtracking is O(n^3) is a
> little silly. It's silly even if we ignore the packrat ideas that bring it
> back to linear time as a matter of implementation: error detection and correction
> isn't really optional. :-)
>
> Anyway... the problem with backtracking is that you can hypothetically
> backtrack all the way to the initial token, so you have to buffer the
> entire input. And yes, Mathilda, people really *do* use parsers on stream
> inputs that won't fit in memory. It took me a while to figure out what's
> going on in combinator libraries so as to avoid this.
>

Just seek the file back to the beginning :-)
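
In Haskell terms that is just saving the handle position before an attempt
and seeking back on failure; a rough sketch (attemptIO is an invented name):

import System.IO

-- Save the current file position, run an attempted read/parse, and
-- seek back on failure.  The saved positions are the "markers": the
-- earliest outstanding one is the point before which input can be
-- treated as durably consumed.
attemptIO :: Handle -> IO (Maybe a) -> IO (Maybe a)
attemptIO h action = do
  posn   <- hGetPosn h                          -- marker: remember where we were
  result <- action
  case result of
    Nothing -> hSetPosn posn >> return Nothing  -- backtrack: seek back
    Just _  -> return result                    -- commit: marker is dropped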


> The first thing going on is that many combinator libraries only do one
> token of lookahead. Once you "accept" a token and proceed to the next
> [sub]parser, that token has been consumed and cannot be re-examined. This
> kind of combinator is non-backtracking (of course), and it introduces a
> subtle issue of ordering: if cases are not checked in the correct order,
> some parses cannot occur. For example, if you match "identifier" before you
> check all of the keywords that are withheld from the space of identifiers,
> you'll never match the keywords. This error cannot be diagnosed by a
> combinator library in its current form.
>
> The second thing going on is that many combinator libraries have an
> explicit "attempt" or "try" type of construct. This says "try the following
> parser, and if it doesn't work restore the parse state to where it is now
> so that we can try something else". The sample implementations of "state"
> that I looked at simply recorded the input position and restored it. It
> took me a while to realize that a different implementation could simply use
> the try points as "markers" on the input stream (and a variant of this can
> be done even in pure idiom). The [positionally] first "try" point marks the
> earliest location in the input stream that cannot [yet] be discarded as
> consumed. In this approach, we still have the issue that the input buffer
> may grow fairly large (e.g. an entire top-level form), but there is at
> least a well-defined commit/cut point at which we can discard fully
> consumed input. The "attempt/try" combinator introduces a *different* issue
> of ordering: it admits the possibility of ambiguous grammars. Whether that
> is a bug or a feature depends on what you are trying to do.
>
> The combination of attempt/try with aggressive token consumption is
> actually very clever if properly used, but there are other issues with
> tokenization to consider.
>
> TOKENIZING AND PHASE DISTINCTIONS
>
> Almost all of the combinator libraries handle tokenization by doing string
> comparisons against the next input position. You check for "do", and then
> you check for "if", and so forth. This is far and away the worst misfeature
> of combinators. It is a potential source of bugs that are very hard to
> detect, and it makes a lie out of the notion that combinator-style parsers
> can be naively composed.
>
> From a performance perspective, it isn't as bad as it looks. Most tokens
> will be rejected in the first character, most others in the second, and
> then a few will suffer from common prefix issues (e.g. "defunion" vs.
> "defstruct"). As long as you are dealing with an in-memory input, this can
> all be done fairly quickly. Not so much if you have to buffer input from a
> stream!
>
> But consider what happens if your language includes both "do" and "done".
> In the absence of backtracking, you had better check those in the right
> order. Check "do" first and you'll never see "done" at all, because the
> token comparator doesn't have any notion of maximal match or the
> possibility of token collision. You can collide on keywords and identifiers
> in the same way. And the problem with composition is that this sort of
> match can be buried deep in some parser in a way that you might not readily
> spot. What has happened here is that the "maximal munch" rule for
> tokenization has been lost. I suspect the only reason we do not trip over
> this routinely in combinator parsers is that potentially conflicting
> keywords rarely arise in the same left context and most of the languages
> people are parsing are small.
>

I don't see the do / done problem. One parser looks for "d" "o" (optional
space) then some other token, the other for "d" "o" "n" "e"; there is no
ambiguity. The "do" parser should clearly reject "don...".


>
> As near as I can tell, this cannot be fixed as long as combinators produce
> functions. If we want to get this cleaned up (and incidentally restore our
> phase distinction), then combinators need to accept and return a (parser
> function, tokenizer) pair. The parser functions get combined just as they
> do now. The tokenizers get merged. If you want to embed a sub-language
> (e.g. parsing javascript inside an HTML script tag), you introduce a new
> tokenizer at the point of the sub-parser call. Simply by gathering all of
> the tokens and regular expressions into a single tokenizer object you admit
> the possibility of dynamically optimizing the token evaluator in the same
> way that grep (and friends) do.
>
> Of course, this assumes that you don't actually *want* context-sensitive
> tokenization. If you do, then of course you don't want the tokenizers
> merged and I'm not sure *what* to do. I still suspect that it's useful to
> have a tokenizer object, if only because you can move all of the whitespace
> handling into that and avoid having it spill out all over your parser.
>
> It's interesting to me that the parsec library actually *had* a tokenizer
> object and did away with it. I'm not clear why they did that.
>
> It seems pretty clear that one can supply tokenizer-based alternatives to
> the character parsers that are present in any of the combinator libraries.
> That is: you can build a parser that preserves a phase distinction while
> still using parser combinators. I'm not clear why I can't find any examples
> of this in the wild.
>
>
> SPECULATION
>
> And just to wrap this up, I'm now convinced that a phase distinction could
> be done for the parser too, *provided* we know all of the primitives and
> all of the combining operations. Basically, we could carry around and
> incrementally merge a "parser" object in the same way that we can carry
> around a "tokenizer" object. In this approach [sub]parsers aren't
> functions; they are objects containing sets of productions and actions. We
> then introduce a "run" operator that does parsing in compile-and-go form.
>
> Of course: with a powerful enough mechanism for compile-time execution, we
> could build constant [sub]parsers in this fashion and compute the parse
> tables *at compile time*, including any error checks that might be
> performed by the selected parse algorithm.
>

Yes, I think this is an interesting idea. I am not convinced a separate
tokenizer is a good idea, as traditionally you end up with hacks to make it
work for a given grammar. You could generate a tokenizer, but it may end up
context-sensitive, which could be problematic.
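
For what it's worth, here is a rough sketch of what carrying the tokenizer
around as data, merged as combinators are combined, might look like. All of
the names are invented and this is only the shape of the idea, not a worked
implementation:

import qualified Data.Set as Set

-- A token class: literal keywords plus a couple of character classes.
data Tok = Keyword String | Ident | Number
  deriving (Eq, Ord, Show)

-- A combinator value is not just a function: it also carries the set
-- of tokens it can consume, so the "tokenizer" is merged as parsers
-- are combined and can be inspected (or optimized) before any input
-- is seen.
data P a = P
  { tokens :: Set.Set Tok                 -- the merged tokenizer
  , run    :: [Tok] -> Maybe (a, [Tok])   -- the usual parser function
  }

tok :: Tok -> P Tok
tok t = P (Set.singleton t) step
  where step (x:xs) | x == t = Just (x, xs)
        step _              = Nothing

-- Sequencing combines the parser functions as usual *and* merges the
-- token sets, preserving a phase distinction we can still inspect.
andThen :: P a -> P b -> P (a, b)
andThen p q = P (tokens p `Set.union` tokens q) step
  where step input = do (a, rest)  <- run p input
                        (b, rest') <- run q rest
                        return ((a, b), rest')

orElse :: P a -> P a -> P a
orElse p q = P (tokens p `Set.union` tokens q) step
  where step input = maybe (run q input) Just (run p input)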


Keean.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
