I'm still mired in lexical analysis, though nearly out of the woods; I was on vacation for a while, which slowed me down. Vacations are stressful! :-(
Traditional compilers have made a strong phase distinction between the tokenizer and the parser. The tokenizer turns an input character stream into a (position-annotated) token stream. The parser takes a token stream and constructs an AST. Some, but not all, tokens have corresponding AST node types; identifiers, for example, get an AST node.

Though some tokenizers implement a notion of a "start state" that can be changed, most tokenizers are context-free. There are numerous cases in common languages where we need limited lookahead and/or backtracking. Historically this was best done at the token-stream layer, but with a modern, memoizing parser it isn't clear that this is still warranted.

LR and LL grammars are strictly more powerful than regular expressions. There is nothing we can express in a tokenizer that cannot be expressed just as well in a grammar.
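To make that concrete, here is a minimal sketch (Python; all names hypothetical, and a real version would carry line/column annotations) of a "token" written directly as a character-level rule in a recursive-descent parser. The identifier rule recognizes exactly the language a tokenizer regex like [A-Za-z][A-Za-z0-9]* would:

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def identifier(self):
        # ident ::= alpha (alpha | digit)*  -- a grammar rule over
        # characters, doing the job of a separate tokenizer regex
        start = self.pos
        if not self.peek().isalpha():
            raise SyntaxError("expected identifier at %d" % self.pos)
        while self.peek().isalnum():
            self.pos += 1
        return self.text[start:self.pos]

print(Parser("foo42 ").identifier())    # -> foo42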
As I have been working on the parser generator, I have realized that there are tokenization sub-patterns that I want to re-use. E.g. character escapes want the same syntax inside character literals that is used inside strings (sketched below). This led me in the direction of a tokenizer specification that looks a lot like a grammar specification.
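As a hedged illustration of that sharing (continuing the hypothetical Parser sketch above): a single escape rule reused by both the character-literal and the string-literal rules, which a conventional tokenizer would have to duplicate across two regexes or start states:

ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', '"': '"', "'": "'"}

class LitParser(Parser):            # extends the sketch above
    def expect(self, c):
        if self.peek() != c:
            raise SyntaxError("expected %r at %d" % (c, self.pos))
        self.pos += 1

    def escape(self):
        # escape ::= '\' (n | t | '\' | '"' | "'")  -- one rule, two users
        self.expect('\\')
        c = self.peek()
        if c not in ESCAPES:
            raise SyntaxError("bad escape at %d" % self.pos)
        self.pos += 1
        return ESCAPES[c]

    def plain(self):
        c = self.peek()
        self.pos += 1
        return c

    def char_literal(self):
        # charlit ::= "'" (escape | plain) "'"
        self.expect("'")
        ch = self.escape() if self.peek() == '\\' else self.plain()
        self.expect("'")
        return ch

    def string_literal(self):
        # string ::= '"' (escape | plain)* '"'
        self.expect('"')
        out = []
        while self.peek() != '"':
            if self.peek() == '':
                raise SyntaxError("unterminated string")
            out.append(self.escape() if self.peek() == '\\' else self.plain())
        self.expect('"')
        return ''.join(out)

print(LitParser(r"'\n'").char_literal() == '\n')    # -> True
print(LitParser(r'"a\tb"').string_literal())        # -> a   b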
All of which is leading me to the following questions:

1. Is the tokenizer/parser phase distinction actually useful in modern front ends?
2. What value is it providing, and should it be preserved?
3. Why use two algorithms (regular expression engine, parse algorithm) when one will suffice?

The only place in BitC v0 where the phase distinction was important was the layout engine, and *that* actually *does* seem difficult to express in a conventional grammar. It *can* be expressed in a predicated grammar with suitable predicates; a sketch of one such predicate follows.
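Here is a hedged sketch of the kind of predicate involved (the action names and column bookkeeping are my assumptions, not the BitC v0 layout engine): the parser keeps a stack of indentation columns, and the predicate classifies the next token by its column -- exactly the context a context-free tokenizer cannot see:

def layout_predicate(next_token_col, indent_stack):
    # Classify the next token's column against the innermost open block.
    if next_token_col > indent_stack[-1]:
        return 'CONTINUE'       # deeper: continuation of the current form
    if next_token_col == indent_stack[-1]:
        return 'NEW_ITEM'       # same column: next item in the block
    return 'CLOSE_BLOCK'        # shallower: implicitly close the block

indent_stack = [0, 4]           # an enclosing block opened at column 4
print(layout_predicate(8, indent_stack))    # -> CONTINUE
print(layout_predicate(4, indent_stack))    # -> NEW_ITEM
print(layout_predicate(0, indent_stack))    # -> CLOSE_BLOCK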
Opinions?

Jonathan
