On 14/05/2012 17:00, Roman D. Boiko wrote:
On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
I think you are wasting much more memory and performance by storing
all the tokens in the lexer.

Imagine I want to implement a simple syntax highlighter: just
highlight keywords. How can I tell DCT *not* to store all the tokens,
given that I only need each one in turn? And since I'll be highlighting
in the editor I will need column and line information. That means I'll
have to do that O(log(n)) operation for every token.

So you see, for the simplest use case of a lexer the performance of
DCT is awful.
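
Roughly, that simplest use case only needs something like this (a
sketch; the Tok type, the range-based API and highlightKeywords are
assumptions for illustration, not DCT's actual interface):

import std.range : isInputRange;
import std.stdio : writefln;

enum TokKind { identifier, keyword, operator, literal }

// Assumed token shape, with line/column already computed by the lexer.
struct Tok
{
    size_t line, column;
    string spelling;
    TokKind kind;
}

// Consumes tokens lazily, one at a time; nothing is ever stored.
void highlightKeywords(R)(R tokens) if (isInputRange!R)
{
    foreach (tok; tokens)
    {
        if (tok.kind == TokKind.keyword)
            writefln("keyword '%s' at %s:%s", tok.spelling, tok.line, tok.column);
    }
}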

Now imagine I want to build an AST. Again, I consume the tokens one by
one, probably peeking in some cases. If I want to store line and
column information I just copy them to the AST. You say the tokens are
discarded but their data is not, and that's why their data is usually
copied.
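
A rough sketch of that copy (names are illustrative, not DCT's): the
node keeps only the couple of fields it needs, and the token itself can
be thrown away.

// Illustrative only: the AST node stores its own copy of the position and
// spelling, so the Token value does not have to outlive the parse.
class IdentifierNode
{
    size_t startIndex;
    string name;

    this(size_t tokenStart, string tokenSpelling)
    {
        startIndex = tokenStart;  // copied from the token
        name = tokenSpelling;     // a slice of the source text, so no character copy
    }
}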

Currently I'm thinking about making Token a class instead of a struct.

A token (from
https://github.com/roman-d-boiko/dct/blob/master/fe/core.d) is:

// Represents a lexed token
struct Token
{
    size_t startIndex;  // position of the first code unit in the source string
    string spelling;    // characters from which this token has been lexed
    TokenKind kind;     // enum; each keyword and operator has a dedicated kind
    ubyte annotations;  // meta information, e.g. whether a token is valid, or whether
                        // an integer literal is signed, long, hexadecimal, etc.
}

Making it a class would give several benefits:

* no need to worry about allocating one big array of tokens. E.g., on a
64-bit OS the largest module in Phobos (IIRC, std.datetime) consumes
13.5MB as an array of almost 500K tokens. An array of class references
would require a contiguous chunk four times smaller, because each
element would occupy only 8 bytes instead of 32.
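
The arithmetic behind those numbers, as a sketch (assuming a 64-bit
build and an int-sized TokenKind; TokenS and TokenC are hypothetical
names, not DCT types):

struct TokenS { size_t startIndex; string spelling; int kind; ubyte annotations; }
class  TokenC { size_t startIndex; string spelling; int kind; ubyte annotations; }

static assert(TokenS.sizeof == 32); // 32 bytes per element in an array of structs
static assert(TokenC.sizeof == 8);  // an array of class references stores 8 bytes per element;
                                    // each instance then lives in its own GC allocation
// ~440K tokens: roughly 13.5MB contiguous vs ~3.4MB contiguous
// (plus per-object heap overhead for the class version)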


Why is this a benefit?

* allow subclassing, for example to store strongly typed literal
values; this flexibility could also facilitate future extensibility
(but it's difficult to predict which kind of extension may be needed)
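
What that subclassing might look like, purely as an illustration (the
class-based Token is the proposal under discussion, not existing DCT
code):

// Purely illustrative: the class-based Token under discussion, with a
// literal subclass carrying its decoded value.
class Token
{
    size_t startIndex;
    string spelling;
}

class IntegerLiteralToken : Token
{
    ulong value;     // the decoded numeric value
    bool  isSigned;  // instead of packing this into an annotations byte
}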


I'm pretty sure that D's tokens will not change that much. If the need isn't identified right now, I'd advocate for YAGNI.

* there would be no need to copy data from tokens into the AST; passing
an object reference would be enough (again, copying 8 bytes instead of
32); the same applies to passing tokens into methods - no need to pass
by ref to minimise overhead
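
In code the difference is roughly this (hypothetical names and
signatures, not DCT's):

struct TokenStruct { size_t startIndex; string spelling; int kind; ubyte annotations; }
class  TokenClass  { size_t startIndex; string spelling; int kind; ubyte annotations; }

// With the struct, avoiding a 32-byte copy means asking for ref explicitly:
size_t startOf(ref const TokenStruct tok) { return tok.startIndex; }

// With the class, only the 8-byte reference is ever copied:
size_t startOf(const TokenClass tok) { return tok.startIndex; }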


Yes, but now you add pressure on the GC and add indirections. I'm not sure it's worth it. It seems to me like a premature optimization.

It would incur some additional memory overhead (at least 8 bytes per
token), but that's hardly significant. There is also an additional price
for accessing token members because of the indirection, and possibly
worse cache friendliness (token instances may be allocated anywhere in
memory, not close to each other).

These considerations are mostly about performance. I think there is also
some impact on design, but I couldn't find anything significant (given
that I currently see a token as merely a data structure with no
associated behavior).

Could anybody suggest other pros and cons? Which option would you choose?

You are over-engineering the whole thing.
