On Fri, Sep 5, 2014 at 8:56 PM, Gabriel Dos Reis <
[email protected]> wrote:

>
> I think it is a good question to ask: after more than half century of
> formal grammar theory and practice and parser generator experience, how
> important is it really *in practice* to have the token stream vs. parse
> tree/AST distinction?
>

So I've had more of a chance to think about this, and I think I actually
knew the answer all along. Actually, there are two answers.

1. White Space

White space is significant to the tokenizer, because it is a token
separator. So, for example, if we write the tokenizer rules in parser
style, then in

tk_if ::= 'i' 'f'


we do *not* intend to allow spaces between the 'i' and the 'f'. But in this
rule:

if_expr ::= tk_if expr tk_then block tk_else block


we intend to allow arbitrary white space between the right side elements.
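By contrast, a scannerless grammar (one with no tokenizer/parser split) has
to thread the white space through explicitly, something like:

if_expr ::= tk_if ws expr ws tk_then ws block ws tk_else ws block

which quickly clutters every production in the grammar.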

But that's not the main one.

2. Performance

For most languages, tokenization is the performance-limiting factor in the
compiler. Or at least this has been true historically. E.g. a big selling
point of Turbo Pascal was speed, and most of that speed came from
tokenization improvements. One of the ways they accomplished that was by
doing binary word I/O rather than character I/O and using a variant of
Duff's Device to manage the incoming character stream. The resulting
tokenizer was faster than hell and essentially unmaintainable. All of that
obfuscation was motivated by the desire to avoid calling the stdio library
once per character.

With characters being wider these days, and processors being so much faster
relative to I/O, it isn't clear that methods like that still carry the
benefit that they once did.


One thing that strikes me is that the tokenization pass introduces another
layer of buffering. *Every* tokenizer involves two layers of character I/O:
one in the standard library and another in the token reader. The token
reader implements its own buffer because a variable length buffer turns out
to be the simplest way to deal with arbitrary length identifiers. A smart
tokenizer operates by recording the starting offset in the input buffer,
advancing across characters (possibly coalescing the input buffer during
refill) and then grabbing the entire token string after it knows the token
length. If that buffer is big enough, you can bypass the buffering provided
by stdio. In practice you want to do that anyway if you are implementing
something like a REPL loop so that you can do things like input refresh.

Separate tokenizers came into their own in languages that were LALR, LR(1),
or LL(1). In such languages, you need a push-back mechanism for a single
token, but nothing more than that. So you *don't* really get an
accumulation of tokens at the input end of things.

More recent languages require deep or even infinite lookahead, which
effectively means that you need to buffer the token-level stream in
addition to the character-level stream. The memory space burden for that
can be surprisingly large. Once position information is included, the
token structure is going to be several times the size of the token it
contains, so you want the "fixed" token structures to get freed as quickly
as possible.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
