On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
On 2012-05-11 10:58, Roman D. Boiko wrote:
Each token contains:
* start index (position in the original encoding; 0 corresponds to the first code unit after the BOM),
* token value encoded as a UTF-8 string,
* token kind (e.g., token.kind = TokenKind.Float),
* possibly an enum with annotations (e.g., token.annotations = FloatAnnotation.Hex | FloatAnnotation.Real)
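
A rough sketch of that layout (illustrative only, not the actual DCT declarations):

enum TokenKind : ubyte { Identifier, Integer, Float, String /* ... */ }

enum FloatAnnotation : ubyte
{
    None = 0,
    Hex  = 1 << 0,
    Real = 1 << 1,
}

struct Token
{
    size_t startIndex; // position in the original encoding; 0 is the first code unit after the BOM
    string value;      // token text re-encoded as UTF-8
    TokenKind kind;    // e.g. TokenKind.Float
    ubyte annotations; // e.g. FloatAnnotation.Hex | FloatAnnotation.Real
}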

What about line and column information?
The indices of the first code unit of each line are stored inside the lexer, and a function will compute the Location (line number, column number, file specification) for any index. This keeps the size of a Token instance to a minimum. The assumption is that a Location can be computed on demand and is not needed frequently, so the column is calculated by walking back to the previous end of line, etc. It will be possible to compute locations either taking special token sequences (e.g., #line 3 "ab/c.d") into account or discarding them.
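
One possible shape for that lookup over the stored line-start indices (a hypothetical sketch, using a binary search rather than a literal reverse walk, and ignoring #line sequences):

struct Location
{
    size_t line;   // 1-based line number
    size_t column; // 1-based column, counted in code units
    string file;
}

struct LineIndex
{
    size_t[] lineStarts; // index of the first code unit of each line, recorded during lexing
    string file;

    // Find the last line start <= index; the column is the distance from it.
    Location locate(size_t index)
    {
        size_t lo = 0, hi = lineStarts.length;
        while (hi - lo > 1)
        {
            const mid = lo + (hi - lo) / 2;
            if (lineStarts[mid] <= index) lo = mid;
            else                          hi = mid;
        }
        return Location(lo + 1, index - lineStarts[lo] + 1, file);
    }
}

Only the line-start array lives in the lexer; tokens themselves stay small.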

* Does it convert numerical literals and similar to their actual values?
It is planned to add a post-processor for that as part of the parser; please see README.md for more details.

Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid lexing the token value again). But separating this into a dedicated post-processing phase leads to a much cleaner design (IMO), and it also suits uses where such values are not needed. I also don't think performance would improve, given the ratio of literals to the total number of tokens and the need to store additional information per token if it were done in the lexer. I will elaborate on that later.
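
Just to illustrate the idea (purely hypothetical, not DCT's actual API), such a post-processing step could expose literal values lazily, so only consumers that care about the value pay for the conversion:

import std.array : replace;
import std.conv : to;

// Hypothetical helper: compute the value of a float literal on demand,
// so tokens whose values are never inspected carry no extra payload.
double floatValue(Token token)
{
    assert(token.kind == TokenKind.Float);
    // Digit separators are stripped before conversion; suffixes and hex
    // floats would need additional handling in a real implementation.
    return token.value.replace("_", "").to!double;
}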


