On Fri, Sep 5, 2014 at 8:56 PM, Gabriel Dos Reis <[email protected]> wrote:
> I think it is a good question to ask: after more than half a century
> of formal grammar theory and practice and parser generator experience,
> how important is it really *in practice* to have the token stream vs.
> parse tree/AST distinction?

So I've had more chance to think about this, and I think I actually
knew the answer all along. Actually, there are two.

1. White Space

White space is significant to the tokenizer, because it is a token
separator. So, for example, if we write the tokenizer rules in parser
style, then in

    tk_if ::= 'i' 'f'

we do *not* intend to allow spaces between the 'i' and the 'f'. But in
this rule:

    if_expr ::= tk_if expr tk_then block tk_else block

we intend to allow arbitrary white space between the right-hand side
elements.

But that's not the main one.
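To make the asymmetry concrete, here is a minimal hand-written sketch
in C (illustrative only; the names are mine, not taken from any real
lexer). Inside a token the characters must be adjacent; between tokens
the lexer eats arbitrary whitespace, so the parser never sees it:

    /* Hypothetical sketch: whitespace separates tokens but may not
       appear inside one. Error handling and other token kinds are
       elided. */
    #include <ctype.h>
    #include <string.h>

    typedef enum { TK_IF, TK_IDENT, TK_EOF } TokKind;

    static const char *src;   /* cursor into the input text */

    static TokKind next_token(void) {
      while (isspace((unsigned char)*src))  /* skipped between tokens */
        src++;
      if (*src == '\0') return TK_EOF;
      const char *start = src;
      while (isalnum((unsigned char)*src))  /* adjacent within a token */
        src++;
      /* 'i' immediately followed by 'f', and nothing more, is the
         keyword; "i f" would be two identifiers: */
      if (src - start == 2 && memcmp(start, "if", 2) == 0)
        return TK_IF;
      return TK_IDENT;
    }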
2. Performance

For most languages, tokenization is the performance-limiting factor in
the compiler. Or at least this has been true historically. E.g. a big
selling point of Turbo Pascal was speed, and most of that came from
tokenization performance improvements. One of the ways they
accomplished that was by doing binary word I/O rather than character
I/O and using a variant of Duff's Device to manage the incoming
character stream. The resulting tokenizer was faster than hell and
essentially unmaintainable. All of that obfuscation was motivated by
the desire to avoid calling the stdio library once per character.
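Stripped of the Duff's Device obfuscation, the general shape of that
trick looks something like the sketch below (assuming POSIX read(2);
the names are mine, not Turbo Pascal's): refill a large buffer with one
system call, and make the common path of fetching a character a pointer
increment rather than a library call.

    #include <unistd.h>

    #define BUFSZ (64 * 1024)
    static char buf[BUFSZ];
    static char *cur = buf, *lim = buf;
    static int fd = 0;            /* assumed: already-open input fd */

    static int nextc(void) {
      if (cur == lim) {           /* slow path: one read() per block */
        ssize_t n = read(fd, buf, BUFSZ);
        if (n <= 0) return -1;    /* EOF or error */
        cur = buf; lim = buf + n;
      }
      return (unsigned char)*cur++;  /* fast path: no library call */
    }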
With characters being wider these days, and processors being so much
faster relative to I/O, it isn't clear that methods like that still
carry the benefit they once did.

One thing that strikes me is that the tokenization pass introduces
another layer of buffering. *Every* tokenizer involves two layers of
character I/O: one in the standard library and another in the token
reader. The token reader implements its own buffer because a variable
length buffer turns out to be the simplest way to deal with arbitrary
length identifiers. A smart tokenizer operates by recording the
starting offset in the input buffer, advancing across characters
(possibly coalescing the input buffer during refill), and then grabbing
the entire token string once it knows the token length. If that buffer
is big enough, you can bypass the buffering provided by stdio. In
practice you want to do that anyway if you are implementing something
like a REPL, so that you can do things like input refresh.
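Here is a sketch of that offset-recording scan, reusing the
buf/cur/lim/fd scheme from the previous sketch (again illustrative, not
real code from any compiler). On refill, the partial token is slid to
the front of the buffer -- the coalescing step -- so the token text is
contiguous when we finally measure its length:

    #include <ctype.h>
    #include <string.h>

    static char *tok_start;       /* where the current token began */

    static int refill(void) {
      size_t pending = (size_t)(lim - tok_start);
      /* A real lexer would grow the buffer here when pending == BUFSZ,
         to handle arbitrarily long identifiers. */
      memmove(buf, tok_start, pending);  /* keep partial token contiguous */
      tok_start = buf;
      cur = buf + pending;
      ssize_t n = read(fd, cur, BUFSZ - pending);
      if (n <= 0) return 0;
      lim = cur + n;
      return 1;
    }

    static size_t scan_ident(void) {
      tok_start = cur;
      for (;;) {
        while (cur < lim && isalnum((unsigned char)*cur))
          cur++;
        if (cur < lim || !refill()) break;  /* token ended, or input did */
      }
      return (size_t)(cur - tok_start);  /* text is tok_start..cur */
    }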
Separate tokenizers came into their own in languages that were LALR,
LR(1), or LL(1). In such languages you need a push-back mechanism for a
single token, but nothing more than that, so you *don't* really get an
accumulation of tokens at the input end of things. More recent
languages require deep or even infinite lookahead, which effectively
means that you need to buffer the token-level stream in addition to the
character-level stream. The memory space burden for that can be
surprisingly large. Once position information is involved, the token
structure is going to be several times the size of the token it
contains, so you want the "fixed" token structures to get freed as
quickly as possible.
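Roughly, the two regimes look like this (Token and lex_next() are
hypothetical stand-ins for whatever the lexer provides):

    #include <stdlib.h>

    typedef struct {
      int kind;
      int line, col;              /* position info dominates the size */
      const char *text;
    } Token;

    Token lex_next(void);         /* assumed: the tokenizer */

    /* Regime 1: LL(1)/LR(1) -- one pushed-back token suffices. */
    static Token pushback;
    static int have_pushback = 0;

    static Token get_token(void) {
      if (have_pushback) { have_pushback = 0; return pushback; }
      return lex_next();
    }
    static void unget_token(Token t) { pushback = t; have_pushback = 1; }

    /* Regime 2: deep lookahead -- a growable queue of full Token
       structures accumulates behind the parse cursor (qpos).
       Allocation error handling elided. */
    static Token *q;
    static size_t qlen, qcap, qpos;

    static Token peek(size_t k) { /* k tokens ahead of the cursor */
      while (qpos + k >= qlen) {
        if (qlen == qcap) {
          qcap = qcap ? 2 * qcap : 64;
          q = realloc(q, qcap * sizeof *q);
        }
        q[qlen++] = lex_next();
      }
      return q[qpos + k];
    }

In the second regime, every buffered entry carries its position fields
whether or not the parse ever inspects them, which is where the memory
burden comes from.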
shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
