On 10/22/10 16:28 CDT, Sean Kelly wrote:
Andrei Alexandrescu Wrote:
I have in mind the entire implementation of a simple design, but never
had the time to execute on it. The tokenizer would work like this:
    alias Lexer!(
        "+", "PLUS",
        "-", "MINUS",
        "+=", "PLUS_EQ",
        ...
        "if", "IF",
        "else", "ELSE"
        ...
    ) DLexer;
Such a declaration generates numeric values (DLexer.PLUS etc.) and
efficient code that extracts a stream of tokens from a stream of text.
Each token in the token stream carries its ID and its text.
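
For illustration, here is a minimal sketch of how such a declaration could produce the constants and a token stream. Nothing in it is an existing Phobos API: Token, enumSource, and tokenize are hypothetical names, whitespace handling is reduced to skipping spaces, and the matcher is a naive unrolled longest-match loop rather than the efficient generated code described above.

```d
// Sketch only: all names here are hypothetical, not a library API.
struct Token
{
    int id;       // one of the generated constants, e.g. DLexer.PLUS
    string text;  // the matched slice of the input
}

// Builds the source of an anonymous enum from the (text, name) pairs,
// e.g. "enum { PLUS, MINUS, PLUS_EQ, }". Evaluated at compile time.
string enumSource(pairs...)()
{
    string s = "enum { ";
    static foreach (i; 0 .. pairs.length / 2)
    {
        s ~= pairs[2 * i + 1] ~ ", ";
    }
    return s ~ "}";
}

struct Lexer(pairs...)
{
    static assert(pairs.length % 2 == 0, "expected (text, name) pairs");

    // Declares Lexer.PLUS, Lexer.MINUS, ... numbered 0, 1, ... in table order.
    mixin(enumSource!pairs());

    static Token[] tokenize(string input)
    {
        Token[] tokens;
        while (input.length > 0)
        {
            if (input[0] == ' ') { input = input[1 .. $]; continue; }

            int id = -1;
            size_t len = 0;
            // Unrolled at compile time: one comparison per table entry,
            // keeping the longest match so that "+=" wins over "+".
            static foreach (i; 0 .. pairs.length / 2)
            {
                if (pairs[2 * i].length > len
                    && input.length >= pairs[2 * i].length
                    && input[0 .. pairs[2 * i].length] == pairs[2 * i])
                {
                    id = cast(int) i;
                    len = pairs[2 * i].length;
                }
            }
            if (id < 0)
                throw new Exception("unrecognized input: " ~ input);

            tokens ~= Token(id, input[0 .. len]);
            input = input[len .. $];
        }
        return tokens;
    }
}

alias DLexer = Lexer!("+", "PLUS", "-", "MINUS", "+=", "PLUS_EQ");

unittest
{
    auto toks = DLexer.tokenize("+= - +");
    assert(toks.length == 3);
    assert(toks[0] == Token(DLexer.PLUS_EQ, "+="));
    assert(toks[1] == Token(DLexer.MINUS, "-"));
    assert(toks[2] == Token(DLexer.PLUS, "+"));
}
```

The unittest block can be run with `dmd -unittest -main -run file.d`. A real implementation would presumably generate a switch or trie instead of the linear scan used here.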
What about, say, floating-point literals? It seems like the first element of a
pair might have to be a regex pattern.
Yah, with regard to such regular patterns (strings, comments, numbers,
identifiers) there are at least two possibilities that I see:
1. Go the full route of allowing regexen in the definition. This is very
hard because you need to generate an efficient (N|D)FA during compilation.
2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
compile-time table matches, just call onUnrecognizedString(). In
conjunction with a few simple specialized functions, that makes it very
simple to define arbitrarily complex lexers where the bulk of the work
(and the most tedious part) is done by the D compiler.
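
To make option 2 concrete, here is a rough sketch under several assumptions: the signature of onUnrecognizedString, the NUMBER and IDENTIFIER IDs, the spanWhile helper, and the free function tokenize are all invented for illustration. Table entries get their pair index as ID, and the table is tried first, with longest match winning.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit;

struct Token { int id; string text; }

// Hypothetical IDs for tokens produced by the fallthrough routine; they only
// need to stay distinct from the table indices used for fixed tokens.
enum : int { NUMBER = 1000, IDENTIFIER = 1001 }

// Small specialized helper: length of the leading run of chars satisfying pred.
size_t spanWhile(alias pred)(string s)
{
    size_t n = 0;
    while (n < s.length && pred(s[n])) ++n;
    return n;
}

// Fallthrough routine, called only when no table entry matches.
// Assumed contract: return a non-empty token that is a prefix of input.
Token onUnrecognizedString(string input)
{
    if (isDigit(input[0]))
    {
        // digits, optionally followed by '.' and more digits
        size_t n = spanWhile!isDigit(input);
        if (n + 1 < input.length && input[n] == '.' && isDigit(input[n + 1]))
            n += 1 + spanWhile!isDigit(input[n + 1 .. $]);
        return Token(NUMBER, input[0 .. n]);
    }
    if (isAlpha(input[0]) || input[0] == '_')
    {
        size_t n = spanWhile!(c => isAlphaNum(c) || c == '_')(input);
        return Token(IDENTIFIER, input[0 .. n]);
    }
    throw new Exception("unrecognized input: " ~ input);
}

// Compile-time table first; fall through to `fallback` if nothing matches.
Token[] tokenize(alias fallback, pairs...)(string input)
{
    static assert(pairs.length % 2 == 0, "expected (text, name) pairs");
    Token[] tokens;
    while (input.length > 0)
    {
        if (input[0] == ' ') { input = input[1 .. $]; continue; }

        int id = -1;
        size_t len = 0;
        static foreach (i; 0 .. pairs.length / 2)
        {
            if (pairs[2 * i].length > len
                && input.length >= pairs[2 * i].length
                && input[0 .. pairs[2 * i].length] == pairs[2 * i])
            {
                id = cast(int) i;
                len = pairs[2 * i].length;
            }
        }

        Token t = id >= 0 ? Token(id, input[0 .. len]) : fallback(input);
        assert(t.text.length > 0, "fallback must consume input");
        tokens ~= t;
        input = input[t.text.length .. $];
    }
    return tokens;
}

unittest
{
    auto toks = tokenize!(onUnrecognizedString,
        "+", "PLUS", "-", "MINUS", "+=", "PLUS_EQ")("x1 += 3.14 - 2");
    assert(toks.length == 5);
    assert(toks[0] == Token(IDENTIFIER, "x1"));
    assert(toks[1] == Token(2, "+="));      // index of the "+=" pair
    assert(toks[2] == Token(NUMBER, "3.14"));
}
```

In this sketch the floating-point case Sean raised is handled entirely by the hand-written routine, so the compile-time table only ever has to deal with fixed strings.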
Andrei