On Fri, Sep 5, 2014 at 10:28 AM, Jonathan S. Shapiro <[email protected]> wrote:

> I'm still mired in lexical analysis, though nearly out of the woods; I was
> on vacation for a while, which slowed me down. Vacations are stressful! :-(
>
> Traditional compilers have made a strong phase distinction between the
> tokenizer and the parser. The tokenizer turns an input character stream
> into a (position-annotated) token stream. The parser takes a token stream
> and constructs an AST. Some, but not all, tokens have corresponding AST
> node types; e.g. identifiers get an AST node. Though some tokenizers
> implement a notion of a "start state" that can be changed, most tokenizers
> are context-free.
>
> There are numerous cases where we need limited lookahead and/or
> backtracking in common languages. Historically this was best done at the
> token-stream layer, but with a modern, memoizing parser it isn't clear
> that this is still motivated.
>
> LR and LL grammars are strictly more powerful than regular expressions.
> There is nothing we can express in a tokenizer that cannot be expressed
> just as well in a grammar.
>
> As I have been working on the parser generator, I have realized that
> there are tokenization sub-patterns that I want to re-use. E.g. character
> escapes want the same syntax inside character literals that is used
> inside strings. This led me in the direction of a tokenizer specification
> that looks a lot like a grammar specification.
>
> All of which is leading me to the following questions:
>
> Is the tokenizer/parser phase distinction actually useful in modern
> front ends? What value is it providing, and should it be preserved? Why
> use two algorithms (regular expression engine, parse algorithm) when one
> will suffice?
>
> The only place in BitC v0 where the phase distinction was important was
> the layout engine, and that actually does seem difficult to express in a
> conventional grammar. It can be expressed in a predicated grammar with
> suitable predicates.
>
> Opinions?
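The shared sub-pattern point can be sketched as a scannerless recursive-descent fragment in which a single escape-sequence rule is invoked from both the string-literal and character-literal rules. This is a minimal illustration only; the names and the escape table are hypothetical and not taken from BitC:

```python
# Sketch: one escape-sequence rule shared by two literal forms.
# All names and the escape set are illustrative, not BitC's grammar.

ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', '"': '"', "'": "'"}

def parse_escape(src, i):
    """Parse a backslash escape at src[i]; return (char, next_index)."""
    assert src[i] == '\\'
    c = src[i + 1]
    if c not in ESCAPES:
        raise SyntaxError(f"unknown escape \\{c} at offset {i}")
    return ESCAPES[c], i + 2

def parse_string_literal(src, i):
    """Parse a "..." literal at src[i], reusing parse_escape."""
    assert src[i] == '"'
    i += 1
    out = []
    while src[i] != '"':
        if src[i] == '\\':
            ch, i = parse_escape(src, i)   # shared rule
        else:
            ch, i = src[i], i + 1
        out.append(ch)
    return ''.join(out), i + 1

def parse_char_literal(src, i):
    """Parse a '.' literal at src[i]; the same escape rule applies."""
    assert src[i] == "'"
    i += 1
    if src[i] == '\\':
        ch, i = parse_escape(src, i)       # same shared rule
    else:
        ch, i = src[i], i + 1
    assert src[i] == "'"
    return ch, i + 1
```

In a tokenizer built from flat regular expressions, the escape pattern would typically be duplicated inside the string and character token definitions; treating it as a nonterminal, as above, is exactly the grammar-like factoring the mail describes.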
I'd guess that if there are any good reasons, they'll be all about error reporting. Error reporting probably points in the other direction though, towards using a single algorithm: parsers are usually forced to have fairly fancy error-reporting mechanisms, since that's where all the complicated errors occur.

Geoffrey

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
