On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
> I think you are wasting much more memory, and giving up performance, by storing all the tokens in the lexer.
>
> Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT *not* to store all the tokens, since I need each one only in turn? And since I'll be highlighting in the editor, I will need column and line information. That means I'll have to perform that O(log(n)) lookup for every token.
>
> So you see, for the simplest use case of a lexer, the performance of DCT is awful.
>
> Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information, I just copy it into the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
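For reference, the token-at-a-time usage pattern described above looks roughly like this (a sketch only; Lexer, markKeyword and TokenKind.Keyword are hypothetical names, not the current DCT API):

// Highlight keywords while consuming tokens one by one, storing none.
void highlightKeywords(Lexer lexer)
{
    while (!lexer.empty)         // input-range style interface
    {
        Token tok = lexer.front; // look at the current token...
        if (tok.kind == TokenKind.Keyword)
            markKeyword(tok.startIndex, tok.spelling.length);
        lexer.popFront();        // ...then move on and let it go
    }
}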

Currently I am considering making Token a class instead of a struct.

A token (from https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
    size_t startIndex; // position of the first code unit in the source string
    string spelling;   // characters from which this token has been lexed
    TokenKind kind;    // enum; each keyword and operator has a dedicated kind
    ubyte annotations; // meta information: whether the token is valid, whether
                       // an integer literal is signed, long, hexadecimal, etc.
}
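
For comparison, the class version would be (a sketch, not meant to compile alongside the struct above; the fields are unchanged, only the aggregate kind differs):

// The same token as a class: array elements become 8-byte references,
// and each object itself lives on the GC heap.
class Token
{
    size_t startIndex;
    string spelling;
    TokenKind kind;
    ubyte annotations;
}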

Making it a class would give several benefits:

* avoid allocating one big contiguous array of tokens. E.g., on a 64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes 13.5 MB as an array of almost 500K tokens, each struct taking 32 bytes (8-byte startIndex + 16-byte spelling + 4-byte int-based kind + 1-byte annotations, padded). An array of class references would need a contiguous chunk only a quarter that size, because each element would be an 8-byte reference instead of a 32-byte struct.

* allow subclassing, for example, for storing strongly typed literal values (see the sketches after this list); this flexibility could also facilitate future extensibility (though it's difficult to predict which kinds of extension may be needed)

* there would be no need to copy data from tokens into the AST; passing an object around would be enough (again, copying 8 bytes instead of 32). The same applies to passing tokens into methods: no need to pass by ref to minimise overhead (also sketched below)
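
Sketches for the last two points (IntegerLiteral and Identifier are hypothetical names, assuming the class version of Token):

// Hypothetical subclass storing a strongly typed literal value, decoded
// once during lexing instead of being re-parsed from spelling later.
class IntegerLiteral : Token
{
    long value;
}

// Hypothetical AST node: it keeps the token reference instead of copying
// startIndex/spelling/kind/annotations field by field.
class Identifier
{
    Token token; // 8-byte reference, shared with the lexer's output
    this(Token token) { this.token = token; }
}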

It would incur some additional memory overhead (at least 8 bytes per token; on 64-bit, the class object header alone, a vtable pointer plus a monitor, is typically 16 bytes). There is also an additional price for accessing token members because of the indirection, and possibly worse cache friendliness, since token instances may be allocated anywhere in memory rather than next to each other.
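
The actual per-instance footprint is easy to check at compile time (a sketch, assuming the class version of Token):

// Prints the object size during compilation: the header (vtable pointer
// plus monitor) plus the fields, including padding.
pragma(msg, __traits(classInstanceSize, Token));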

These considerations are mostly about performance. I think there is also some impact on design, but I couldn't find anything significant (given that I currently see a token as merely a data structure without associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?
