On Fri, Sep 5, 2014 at 10:35 AM, Jonathan S. Shapiro <[email protected]> wrote:
> I'm still mired in lexical analysis, though nearly out of the woods; was
> on vacation for a while, which slowed me down. Vacations are stressful! :-(
>
> Traditional compilers have made a strong phase distinction between the
> tokenizer and the parser. The tokenizer turns an input character stream
> into a (position-annotated) token stream. The parser takes a token stream
> and constructs an AST. Some, but not all, tokens have corresponding AST
> node types; e.g. identifiers get an AST node. Though some tokenizers
> implement a notion of a "start state" that can be changed, most
> tokenizers are context-free.
>
> There are numerous cases where we need limited lookahead and/or
> backtracking in common languages. Historically this was best done at the
> token stream layer, but with a modern, memoizing parser it isn't clear
> that this is still motivated.
>
> LR and LL grammars are strictly more powerful than regular expressions.
> There is nothing we can express in a tokenizer that cannot be expressed
> just as well in a grammar.
>
> As I have been working on the parser generator, I have realized that
> there are tokenization sub-patterns that I want to re-use. E.g. character
> escapes want the same syntax inside character literals that is used
> inside strings. This led me in the direction of a tokenizer specification
> that looks a lot like a grammar specification.
>
> All of which is leading me to the following questions:
>
>    1. Is the tokenizer/parser phase distinction actually useful in
>       modern front ends?
>    2. What value is it providing, and should it be preserved?
>    3. Why use two algorithms (a regular expression engine and a parse
>       algorithm) when one will suffice?
>
> The only place in BitC v0 where the phase distinction was important was
> the layout engine, and *that* actually *does* seem difficult to express
> in a conventional grammar. It *can* be expressed in a predicated grammar
> with suitable predicates.
>
> Opinions?
>
> shap

I think it is a good question to ask: after more than half a century of
formal grammar theory and practice, and of parser generator experience,
how important is it really *in practice* to have the token stream vs.
parse tree/AST distinction? Note that I appreciate the intellectual
appeal -- I am just asking from the _practice_ point of view.

I don't want my syntax colorizer to colorize keywords; I want it to
colorize based on meaning, e.g. the semantics of program fragments.

Quite a few widely used contemporary programming languages that have
stubbornly tried to honor that phase distinction have found themselves in
awkward positions, and I have never gotten the opportunity to experience
the real practical value of the phase distinction -- only the downsides,
most of the time. The situation is even worse when you consider the
traditional C translation phases, in particular preprocessing tokens vs.
tokens and the confusion that they harbor.

-- Gaby
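[Editor's note] Shapiro's point about reusable tokenization sub-patterns
(character escapes wanting the same syntax inside character literals and
strings) falls out naturally in a scannerless setting, where the escape
rule is just another grammar production invoked from both places. Below is
a minimal recursive-descent sketch illustrating the idea; the rule set
(`escape`, `char_literal`, `string_literal`) is hypothetical and not
actual BitC syntax:

```python
# Scannerless sketch: no separate tokenizer pass. The `escape` production
# is shared by character literals and string literals, exactly the kind of
# sub-pattern reuse a phase-separated lexer makes awkward.
# (Hypothetical grammar rules, not BitC syntax.)

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else None

    def expect(self, ch):
        if self.peek() != ch:
            raise SyntaxError(f"expected {ch!r} at offset {self.pos}")
        self.pos += 1

    def escape(self):
        # escape := '\\' ('n' | 't' | '\\' | '\'' | '"')
        self.expect('\\')
        c = self.peek()
        if c not in ('n', 't', '\\', "'", '"'):
            raise SyntaxError(f"bad escape at offset {self.pos}")
        self.pos += 1
        return {'n': '\n', 't': '\t'}.get(c, c)

    def char_literal(self):
        # char_literal := '\'' (escape | plain-char) '\''
        self.expect("'")
        c = self.escape() if self.peek() == '\\' else self._plain("'")
        self.expect("'")
        return c

    def string_literal(self):
        # string_literal := '"' (escape | plain-char)* '"'
        self.expect('"')
        out = []
        while self.peek() not in ('"', None):
            out.append(self.escape() if self.peek() == '\\' else self._plain('"'))
        self.expect('"')
        return ''.join(out)

    def _plain(self, quote):
        # Any single character other than the closing quote.
        c = self.peek()
        if c is None or c == quote:
            raise SyntaxError(f"unexpected end at offset {self.pos}")
        self.pos += 1
        return c
```

For example, `Parser(r"'\n'").char_literal()` and
`Parser(r'"a\tb"').string_literal()` both resolve their escapes through
the single shared `escape` rule.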
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
