On Friday, 11 May 2012 at 09:08:24 UTC, Jacob Carlborg wrote:
On 2012-05-11 10:58, Roman D. Boiko wrote:
Each token contains:
* start index (position in the original encoding, 0 corresponds
to the first code unit after BOM),
* token value encoded as UTF-8 string,
* token kind (e.g., token.kind = TokenKind.Float),
* optionally, an enum with annotations (e.g., token.annotations =
FloatAnnotation.Hex | FloatAnnotation.Real)
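The token layout above could be sketched roughly as follows. This is a minimal Python illustration, not the actual D implementation; the names Token, TokenKind and FloatAnnotation mirror the examples in this post, but the real fields and types may differ.

```python
from dataclasses import dataclass
from enum import Enum, Flag, auto

class TokenKind(Enum):        # illustrative subset of kinds
    Identifier = auto()
    Integer = auto()
    Float = auto()

class FloatAnnotation(Flag):  # combinable annotations, as in Hex | Real
    NoAnnotation = 0
    Hex = auto()
    Real = auto()

@dataclass
class Token:
    start: int    # index in the original encoding; 0 = first code unit after BOM
    value: str    # token text re-encoded as UTF-8
    kind: TokenKind
    annotations: FloatAnnotation = FloatAnnotation.NoAnnotation

# e.g. a hex real literal such as 0x1.8p3
tok = Token(start=12, value="0x1.8p3", kind=TokenKind.Float,
            annotations=FloatAnnotation.Hex | FloatAnnotation.Real)
```

Note that the token carries only the start index, not a line/column pair, which keeps the instance small.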
What about line and column information?
The indices of the first code unit of each line are stored inside
the lexer, and a function will compute a Location (line number,
column number, file specification) for any index. This way the
size of a Token instance is reduced to a minimum. It is assumed
that a Location can be computed on demand and is not needed
frequently, so the column is calculated by a reverse walk to the
previous end of line, etc. It will be possible to calculate
Locations either taking into account special token sequences
(e.g., #line 3 "ab/c.d") or discarding them.
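The line/column lookup described above can be sketched like this. This is a hedged illustration, not the project's code: it uses a binary search over the stored line-start indices to find the line (instead of the literal reverse walk), then the column falls out as the distance back to that line's start.

```python
from bisect import bisect_right

def location(line_starts, index):
    """Map a code-unit index to a 1-based (line, column) pair.

    line_starts is a sorted list of the indices of the first code
    unit of each line, as kept inside the lexer. Nothing is stored
    per token; the Location is computed on demand.
    """
    line = bisect_right(line_starts, index)     # 1-based line number
    column = index - line_starts[line - 1] + 1  # 1-based column
    return line, column

# "ab\ncdef\ng" -> lines start at indices 0, 3, 8
starts = [0, 3, 8]
loc = location(starts, 5)  # index of 'e' on the second line
```

Storing only line starts makes the per-token cost zero and the per-query cost O(log n) in the number of lines, which matches the assumption that Locations are needed infrequently.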
* Does it convert numerical literals and similar to their
actual values
It is planned to add a post-processor for that as part of the
parser; please see README.md for more details.
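A separate conversion pass might look something like the following sketch (hypothetical shapes throughout; tokens are represented here as plain (kind, text) pairs, which is not the actual token type). It walks the token stream once and builds a side table of converted values, leaving the tokens themselves untouched.

```python
def convert_literals(tokens):
    """Post-lexing pass: turn numeric-literal text into native values.

    Returns a dict keyed by token index. Keeping values in a side
    table instead of inside each token keeps Token small and lets
    callers skip this pass entirely when values are not needed.
    """
    values = {}
    for i, (kind, text) in enumerate(tokens):
        if kind == "Integer":
            values[i] = int(text, 0)   # base 0 accepts 0x.., 0o.., decimal
        elif kind == "Float":
            values[i] = float(text)
        # all other token kinds are left alone
    return values

toks = [("Identifier", "x"), ("Integer", "0x1F"), ("Float", "2.5")]
vals = convert_literals(toks)
```

(Real D literals have suffixes and underscores that this toy conversion does not handle.)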
Isn't that a job for the lexer?
That might be done in the lexer for efficiency reasons (to avoid
lexing the token value again). But separating this into a
dedicated post-processing phase leads to a much cleaner design
(IMO), and it also suits uses where such values are not needed. I
also don't think performance would improve, given the ratio of
literals to total tokens and the need to store additional
information per token if conversion is done in the lexer. I will
elaborate on that later.
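The "pay only when you ask" trade-off argued for here can be made concrete with a small memoizing wrapper (again an illustrative Python sketch with hypothetical names, not the project's API): no value storage is added per token, and a literal's text is re-examined at most once, only if some consumer actually requests the value.

```python
class LiteralValues:
    """On-demand, memoized literal conversion over a token stream.

    Tokens are (kind, text) pairs for illustration. Consumers that
    never need literal values pay nothing; others pay one conversion
    per literal they touch.
    """
    def __init__(self, tokens):
        self._tokens = tokens
        self._cache = {}   # token index -> converted value

    def value(self, i):
        if i not in self._cache:
            kind, text = self._tokens[i]
            if kind == "Integer":
                self._cache[i] = int(text, 0)
            elif kind == "Float":
                self._cache[i] = float(text)
            else:
                raise ValueError("token %d is not a numeric literal" % i)
        return self._cache[i]

lits = LiteralValues([("Integer", "42"), ("Identifier", "x")])
v = lits.value(0)   # converted lazily on first request
```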