On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> I think you are wasting much more memory, and giving up performance, by storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT *not* to store all the tokens, since I need each one only in turn? And since I'll be highlighting in the editor, I will need column and line information. That means I'll have to perform that O(log(n)) lookup for every token.
>
> So you see, for the simplest use case of a lexer, the performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information, I just copy it into the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
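For reference, the token-at-a-time usage pattern described above looks roughly like this (a sketch only; Lexer, markKeyword and TokenKind.Keyword are hypothetical names, not the current DCT API):

// Highlight keywords while consuming tokens one by one, storing none.
void highlightKeywords(Lexer lexer)
{
    while (!lexer.empty)         // input-range style interface
    {
        Token tok = lexer.front; // look at the current token...
        if (tok.kind == TokenKind.Keyword)
            markKeyword(tok.startIndex, tok.spelling.length);
        lexer.popFront();        // ...then move on and let it go
    }
}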

Currently I am considering making Token a class instead of a struct.

A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // meta information: whether the token is valid, whether
                       // an integer literal is signed, long, hexadecimal, etc.
}
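
For comparison, the class version would be (a sketch, not meant to compile alongside the struct above; the fields are unchanged, only the aggregate kind differs):

// The same token as a class: array elements become 8-byte references,
// and each object itself lives on the GC heap.
class Token
{
    size_t startIndex;
    string spelling;
    TokenKind kind;
    ubyte annotations;
}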

Making it a class would give several benefits:

* avoid allocating one big contiguous array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5 MB as an array of almost 500K tokens, each struct taking 32 bytes (8-byte startIndex + 16-byte spelling + 4-byte int-based kind + 1-byte annotations, padded). An array of class references would need a contiguous chunk only a quarter that size, because each element would be an 8-byte reference instead of a 32-byte struct.

* allow subclassing, for example, for storing strongly typed literal values (see the sketches after this list); this flexibility could also facilitate future extensibility (though it's difficult to predict which kinds of extension may be needed)

* there would be no need to copy data from tokens into the AST; passing an object around would be enough (again, copying 8 bytes instead of 32). The same applies to passing tokens into methods: no need to pass by ref to minimise overhead (also sketched below)
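
Sketches for the last two points (IntegerLiteral and Identifier are hypothetical names, assuming the class version of Token):

// Hypothetical subclass storing a strongly typed literal value, decoded
// once during lexing instead of being re-parsed from spelling later.
class IntegerLiteral : Token
{
    long value;
}

// Hypothetical AST node: it keeps the token reference instead of copying
// startIndex/spelling/kind/annotations field by field.
class Identifier
{
    Token token; // 8-byte reference, shared with the lexer's output
    this(Token token) { this.token = token; }
}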

It would incur some additional memory overhead (at least 8 bytes per token; on 64-bit, the class object header alone, a vtable pointer plus a monitor, is typically 16 bytes). There is also an additional price for accessing token members because of the indirection, and possibly worse cache friendliness, since token instances may be allocated anywhere in memory rather than next to each other.
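
The actual per-instance footprint is easy to check at compile time (a sketch, assuming the class version of Token):

// Prints the object size during compilation: the header (vtable pointer
// plus monitor) plus the fields, including padding.
pragma(msg, __traits(classInstanceSize, Token));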

These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that I currently see a token as merely a data structure without associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?
