On Fri, Sep 5, 2014 at 10:28 AM, Jonathan S. Shapiro
<[email protected]> wrote:
> I'm still mired in lexical analysis, though nearly out of the woods; was on
> vacation for a while, which slowed me down. Vacations are stressful! :-(
>
> Traditional compilers have made a strong phase distinction between the
> tokenizer and the parser. The tokenizer turns an input character stream into
> a (position-annotated) token stream. The parser takes a token stream and
> constructs an AST. Some, but not all, tokens have corresponding AST
> node types; e.g. identifiers get an AST node. Though some tokenizers implement a
> notion of a "start state" that can be changed, most tokenizers are
> context-insensitive: they do not consult the parser's state.
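[A minimal sketch of such a context-independent, position-annotated tokenizer; the token set and all names here are invented for illustration, not BitC's:]

```python
# Sketch of a context-independent tokenizer that annotates each
# token with its source position (line, column). Unmatched
# characters are silently skipped -- fine for a sketch only.
import re
from typing import Iterator, NamedTuple

class Token(NamedTuple):
    kind: str   # e.g. "IDENT", "NUMBER", "OP"
    text: str
    line: int
    col: int

TOKEN_RE = re.compile(r"""
    (?P<IDENT>[A-Za-z_]\w*)
  | (?P<NUMBER>\d+)
  | (?P<OP>[+\-*/=])
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(src: str) -> Iterator[Token]:
    line, col = 1, 1
    for m in TOKEN_RE.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind != "WS":
            yield Token(kind, text, line, col)
        # advance the position annotation past this lexeme
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)
```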
>
> There are numerous cases where we need limited lookahead and/or backtracking
> in common languages. Historically this was best done at the token stream
> layer, but with a modern, memoizing parser it isn't clear that this is still
> motivated.
>
> LR and LL grammars are strictly more powerful than regular expressions.
> There is nothing we can express in a tokenizer that cannot be expressed just
> as well in a grammar.
>
> As I have been working on the parser generator, I have realized that there
> are tokenization sub-patterns that I want to re-use. E.g. character escapes
> want the same syntax inside character literals that is used inside strings.
> This led me in the direction of a tokenizer specification that looks a lot
> like a grammar specification.
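[The kind of sharing described here can be sketched as two literal rules calling a single escape sub-rule, exactly as a grammar-style tokenizer spec would factor it. This is a hypothetical sketch; the escape table and function names are invented, not BitC's:]

```python
# One escape rule, reused by both the character-literal and the
# string-literal rules. Each parse function takes (source, index)
# and returns (decoded value, index just past what it consumed).
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", "'": "'", '"': '"'}

def parse_escape(src: str, i: int) -> tuple[str, int]:
    # src[i] is the backslash introducing the escape
    assert src[i] == "\\"
    ch = src[i + 1]
    if ch not in ESCAPES:
        raise SyntaxError(f"bad escape \\{ch}")
    return ESCAPES[ch], i + 2

def parse_char_literal(src: str, i: int) -> tuple[str, int]:
    assert src[i] == "'"
    if src[i + 1] == "\\":
        ch, j = parse_escape(src, i + 1)   # shared sub-rule
    else:
        ch, j = src[i + 1], i + 2
    assert src[j] == "'"
    return ch, j + 1

def parse_string_literal(src: str, i: int) -> tuple[str, int]:
    assert src[i] == '"'
    out, j = [], i + 1
    while src[j] != '"':
        if src[j] == "\\":
            ch, j = parse_escape(src, j)   # same sub-rule reused
        else:
            ch, j = src[j], j + 1
        out.append(ch)
    return "".join(out), j + 1
```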
>
> All of which is leading me to the following questions:
>
> Is the tokenizer/parser phase distinction actually useful in modern front
> ends?
> What value is it providing, and should it be preserved?
> Why use two algorithms (regular expression engine, parse algorithm) when one
> will suffice?
>
> The only place in BitC v0 where the phase distinction was important was the
> layout engine, and that actually does seem difficult to express in a
> conventional grammar. It can be expressed in a predicated grammar with
> suitable predicates.
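[One way such a predicate might look, sketched as a column check attached to a hand-written block rule; this is entirely hypothetical and not the BitC layout rules:]

```python
# A layout rule expressed as a semantic predicate: a block keeps
# consuming items only while the next token's column satisfies the
# block's indentation requirement. Tokens are (text, column) pairs.
from typing import Callable

Pred = Callable[[int], bool]

def at_least(col: int) -> Pred:
    # predicate on token position, decided when the block opens
    return lambda tok_col: tok_col >= col

def parse_block(tokens: list[tuple[str, int]], i: int,
                indent: int) -> tuple[list[str], int]:
    pred = at_least(indent)
    items = []
    while i < len(tokens) and pred(tokens[i][1]):
        items.append(tokens[i][0])
        i += 1
    return items, i   # items plus index of first token outside the block
```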
>
> Opinions?

I'd guess that if there are any good reasons, they'll be all about
error reporting.  Error reporting probably points in the other
direction though, towards using a single algorithm: parsers are
usually forced to have fairly fancy error-reporting mechanisms
anyway, since that's where all the complicated errors occur.

Geoffrey
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
