"... a big selling point of Turbo Pascal was speed"

That was a long time ago. Machines are a lot faster now, and if you really
want performance, parallelism and in-memory processing could give big
gains. I would have thought you could parse code files in parallel and
insert into a large tree.
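Something like this, roughly (a toy sketch in Python using the stdlib ast
module as a stand-in front end; a real compiler would of course have its
own parser, and the cross-unit linking step is hand-waved):

```python
# Sketch: parse each compilation unit in parallel, then collect the
# per-file trees into one map. parse_file is a stand-in, not a real
# compiler front end.
import ast
import concurrent.futures

def parse_file(name, source):
    # Each worker parses one unit independently of the others.
    return name, ast.parse(source)

sources = {
    "a.py": "def f(x):\n    return x + 1\n",
    "b.py": "def g(x):\n    return x * 2\n",
}

with concurrent.futures.ThreadPoolExecutor() as pool:
    trees = dict(pool.map(lambda kv: parse_file(*kv), sources.items()))

# 'trees' now maps each unit to its AST; a compiler would link symbol
# tables across units at this point.
print(sorted(trees))
```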

"With characters being wider these days, and processors being so much
faster relative to I/O, it isn't clear that methods like that still carry
the benefit that they once did."

Yep.

"The memory space burden for that can be surprisingly large."

You don't really need to go to disk again after the initial load unless you
need more than 2G. I think it's quite reasonable for a compiler to use 1G
these days if needed. If your compilation units are small enough, an
in-memory tree (built during parsing) would allow much faster parallel
processing. However, that comes at a complexity cost. It's better to just
use smaller compilation units where everything can be buffered, and run
these in parallel.
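To make the buffering point concrete (a minimal Python sketch; the point
is the single up-front read, with all later scanning done against the
in-memory buffer rather than per-character I/O):

```python
# Read the whole unit into memory once, then tokenize by scanning
# that single buffer instead of going back to the file.
import io
import tokenize

source = "if x > 1:\n    y = x\n"

# One read pulls the entire unit into memory; everything after this
# works on the in-memory buffer.
buf = io.StringIO(source)
names = [tok.string for tok in tokenize.generate_tokens(buf.readline)
         if tok.type == tokenize.NAME]
print(names)
```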

And you surely could generate this tree from events emitted during
parsing, but IMHO this is not a problem that needs solving. The only
serious compilation problem I see these days is around C++
templates/headers and automatic GUI generators.
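For what I mean by "generate this from events": the parser emits
enter/leaf/exit events and a small builder folds them into a nested tree.
The event names below are made up for illustration:

```python
# A parser emitting SAX-style events; the builder keeps a stack of
# open nodes and folds the events into a tree.
events = [
    ("enter", "if_expr"),
    ("leaf", "cond"),
    ("enter", "block"),
    ("leaf", "stmt"),
    ("exit", "block"),
    ("exit", "if_expr"),
]

def build(events):
    root = {"kind": "root", "children": []}
    stack = [root]
    for kind, name in events:
        if kind == "enter":
            node = {"kind": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)   # node stays open until its "exit"
        elif kind == "leaf":
            stack[-1]["children"].append({"kind": name, "children": []})
        else:  # "exit" closes the current node
            stack.pop()
    return root

tree = build(events)
print(tree["children"][0]["kind"])
```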

Regards,

Ben




On Sun, Sep 7, 2014 at 4:44 AM, Jonathan S. Shapiro <[email protected]>
wrote:

> On Fri, Sep 5, 2014 at 8:56 PM, Gabriel Dos Reis <
> [email protected]> wrote:
>
>>
>> I think it is a good question to ask: after more than half century of
>> formal grammar theory and practice and parser generator experience, how
>> important is it really *in practice* to have the token stream vs. parse
>> tree/AST distinction?
>>
>
> So I've had more chance to think about this, and I think I actually knew
> the answer all along. Actually, there are two.
>
> 1. White Space
>
> White space is significant to the tokenizer, because it is a token
> separator. So, for example, if we write the tokenizer rules in parser
> style, then in
>
> tk_if ::= 'i' 'f'
>
>
> we do *not* intend to allow spaces between the 'i' and the 'f'. But in
> this rule:
>
> if_expr ::= tk_if expr tk_then block tk_else block
>
>
> we intend to allow arbitrary white space between the right side elements.
>
> But that's not the main one.
>
> 2. Performance
>
> For most languages, tokenization is the performance-limiting factor in the
> compiler. Or at least this has been true historically. E.g. a big selling
> point of Turbo Pascal was speed, and most of that was had by tokenization
> performance improvements. One of the ways they accomplished that was by
> doing binary word I/O rather than character I/O and using a variant of
> Duff's Device to manage the incoming character stream. The resulting
> tokenizer was faster than hell and essentially unmaintainable. All of that
> obfuscation was motivated by the desire to avoid calling the stdio library
> once per character.
>
> With characters being wider these days, and processors being so much
> faster relative to I/O, it isn't clear that methods like that still carry
> the benefit that they once did.
>
>
> One thing that strikes me is that the tokenization pass introduces another
> layer of buffering. *Every* tokenizer involves two layers of character
> I/O: one in the standard library and another in the token reader. The token
> reader implements its own buffer because a variable length buffer turns out
> to be the simplest way to deal with arbitrary length identifiers. A smart
> tokenizer operates by recording the starting offset in the input buffer,
> advancing across characters (possibly coalescing the input buffer during
> refill) and then grabbing the entire token string after it knows the token
> length. If that buffer is big enough, you can bypass the buffering provided
> by stdio. In practice you want to do that anyway if you are implementing
> something like a REPL loop so that you can do things like input refresh.
>
> Separate tokenizers came into their own in languages that were LALR,
> LR(1), or LL(1). In such languages, you need a push-back mechanism for a
> single token, but nothing more than that. So you *don't* really get an
> accumulation of tokens at the input end of things.
>
> More recent languages require deep or even infinite lookahead, which
> effectively means that you need to buffer the token-level stream in
> addition to the character-level stream. The memory space burden for that
> can be surprisingly large. Once position information is included, the
> token structure is going to be several times the size of the token it
> contains, so you want the "fixed" token structures to get freed as quickly
> as possible.
>
>
> shap
>
> _______________________________________________
> bitc-dev mailing list
> [email protected]
> http://www.coyotos.org/mailman/listinfo/bitc-dev
>
>
