On Fri, Sep 5, 2014 at 10:35 AM, Jonathan S. Shapiro <[email protected]>
wrote:

> I'm still mired in lexical analysis, though nearly out of the woods; was
> on vacation for a while, which slowed me down. Vacations are stressful! :-(
>
> Traditional compilers have made a strong phase distinction between the
> tokenizer and the parser. The tokenizer turns an input character stream
> into a (position-annotated) token stream. The parser takes a token stream
> and constructs an AST. Some, but not all, tokens have corresponding
> AST node types; e.g., identifiers get an AST node. Though some tokenizers
> implement a notion of a "start state" that can be changed, most tokenizers
> are context-free.
>
> There are numerous cases where we need limited lookahead and/or
> backtracking in common languages. Historically this was best done at the
> token stream layer, but with a modern, memoizing parser it isn't clear that
> this is still motivated.
>
> LR and LL grammars are strictly more powerful than regular expressions.
> There is nothing we can express in a tokenizer that cannot be expressed
> just as well in a grammar.
>
> As I have been working on the parser generator, I have realized that there
> are tokenization sub-patterns that I want to re-use. E.g. character escapes
> want the same syntax inside character literals that is used inside strings.
> This led me in the direction of a tokenizer specification that looks a lot
> like a grammar specification.
>
> All of which is leading me to the following questions:
>
>    1. Is the tokenizer/parser phase distinction actually useful in modern
>    front ends?
>    2. What value is it providing, and should it be preserved?
>    3. Why use two algorithms (regular expression engine, parse algorithm)
>    when one will suffice?
>
> The only place in BitC v0 where the phase distinction was important was
> the layout engine, and *that* actually *does* seem difficult to express
> in a conventional grammar. It *can* be expressed in a predicated grammar
> with suitable predicates.
>
> Opinions?
>
>
> shap
>
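
The shared sub-pattern point above (escapes wanting the same syntax inside
character literals and strings) can be sketched in a few lines. This is an
illustrative Python/regex sketch, not anything from an actual BitC tokenizer;
the escape set and names (ESCAPE, CHAR_LIT, STRING_LIT) are hypothetical:

```python
import re

# Hypothetical escape sub-pattern, written once and reused by both
# literal forms: \n \t \r \0 \\ \' \" and \xHH hex escapes.
ESCAPE = r'\\(?:[ntr0\\\'"]|x[0-9A-Fa-f]{2})'

# Both token rules splice in the same sub-pattern, which is exactly the
# kind of reuse a grammar gives you for free via a shared nonterminal.
CHAR_LIT = re.compile(r"'(?:" + ESCAPE + r"|[^'\\])'")
STRING_LIT = re.compile(r'"(?:' + ESCAPE + r'|[^"\\])*"')

print(bool(CHAR_LIT.fullmatch(r"'\n'")))        # escape valid in a char literal
print(bool(STRING_LIT.fullmatch(r'"a\x41b"')))  # same escape valid in a string
```

In a tokenizer specification that looks like a grammar, ESCAPE would simply be
a nonterminal referenced from both the char-literal and string-literal rules.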

I think it is a good question to ask: after more than half a century of
formal grammar theory and practice, and of parser generator experience,
how important is it really *in practice* to have the token stream vs.
parse tree/AST distinction?  Note that I appreciate the intellectual
appeal -- I am just asking from the _practice_ point of view.

To give one example: I don't want my syntax colorizer to colorize mere
keywords; I want it to colorize based on meaning, i.e. the semantics of
program fragments.

Quite a few widely used contemporary programming languages that have
stubbornly tried to honor that phase distinction have found themselves in
awkward positions, and I have never gotten the opportunity to experience
the real practical value of the phase distinction -- only its downsides,
most of the time.  The situation is even worse when you consider the
traditional C translation phases, in particular preprocessing tokens vs.
tokens proper and the confusion that they harbor.
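
The one-algorithm alternative the thread is circling can be sketched
concretely. In the hypothetical scannerless fragment below (illustrative
only; the class and rule names are made up), "lexical" rules like ident and
number are ordinary grammar rules over characters, handled by the same
recursive-descent machinery as "syntactic" rules like sum_expr, so the
phase distinction simply disappears:

```python
# Sketch of a scannerless recursive-descent parser: no separate tokenizer,
# token-level and phrase-level rules use the same mechanism.
class Scannerless:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def ident(self):
        # ident ::= alpha (alpha | digit)*   -- a "lexical" rule
        start = self.pos
        if not self.peek().isalpha():
            raise SyntaxError('expected identifier')
        while self.peek().isalnum():
            self.pos += 1
        return ('ident', self.text[start:self.pos])

    def number(self):
        # number ::= digit+                  -- another "lexical" rule
        start = self.pos
        if not self.peek().isdigit():
            raise SyntaxError('expected number')
        while self.peek().isdigit():
            self.pos += 1
        return ('num', int(self.text[start:self.pos]))

    def primary(self):
        return self.number() if self.peek().isdigit() else self.ident()

    def sum_expr(self):
        # sum ::= primary ('+' primary)*     -- a "syntactic" rule, same machinery
        node = self.primary()
        while self.peek() == '+':
            self.pos += 1
            node = ('+', node, self.primary())
        return node

print(Scannerless('x+12').sum_expr())  # ('+', ('ident', 'x'), ('num', 12))
```

Of course this toy dodges the hard cases the thread raises (layout, lookahead,
C-style preprocessing); a memoizing parser is one way to keep such a design
practical.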

-- Gaby
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
