On 2010-10-22 at 21:48:49, Andrei Alexandrescu
<seewebsiteforem...@erdani.org> wrote:
> On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
>> Interesting idea. Here's another: D will soon need bindings for CORBA,
>> Thrift, etc., so lexers will have to be written over and over to grok
>> interface files. Perhaps a generic tokenizer that can be parametrized
>> with a lexical grammar would bring more ROI; I have a hunch D's
>> templates are strong enough to pull this off without any source code
>> generation à la JavaCC. The books I've read on compilers say
>> tokenization is a solved problem, so the theory part on what a good
>> abstraction should be is done. What do you think?
> Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer
> generator.
>
> I have in mind the entire implementation of a simple design, but never
> had the time to execute on it. The tokenizer would work like this:
>     alias Lexer!(
>         "+", "PLUS",
>         "-", "MINUS",
>         "+=", "PLUS_EQ",
>         ...
>         "if", "IF",
>         "else", "ELSE",
>         ...
>     ) DLexer;
Yes. One remark: native language constructs scale better for a grammar:
    enum TokenDef : string {
        Digit = "[0-9]",
        Letter = "[a-zA-Z_]",
        // a letter followed by any number of letters or digits
        Identifier = Letter ~ "(" ~ Letter ~ "|" ~ Digit ~ ")*",
        ...
        Plus = "+",
        Minus = "-",
        PlusEq = "+=",
        ...
        If = "if",
        Else = "else",
        ...
    }

    alias Lexer!TokenDef DLexer;
BTW, there's a related bug:
http://d.puremagic.com/issues/show_bug.cgi?id=2950
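To make the "scales better" point concrete, here's a minimal sketch of
how a Lexer template could walk such an enum at compile time. The Lexer
template and dumpGrammar are hypothetical names for illustration, not an
actual implementation:

    import std.stdio;

    enum TokenDef : string {
        Plus = "+",
        Minus = "-",
        If = "if",
    }

    // Hypothetical: enumerate the grammar at compile time via
    // introspection; a real Lexer would build its match tables from it.
    template Lexer(alias Defs) {
        void dumpGrammar() {
            foreach (name; __traits(allMembers, Defs))
                writefln("%s -> %s", name,
                         cast(string) __traits(getMember, Defs, name));
        }
    }

    void main() {
        Lexer!TokenDef.dumpGrammar();
    }

Adding a token kind is then just adding an enum member; there are no
parallel string pairs to keep in sync.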
> Such a declaration generates numeric values DLexer.PLUS etc. and
> efficient code that extracts a stream of tokens from a stream of text.
> Each token in the token stream has the ID and the text.
All good ideas.
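For illustration, a toy of the interface that description implies: the
lexer is an input range of tokens, each carrying an ID and the matched
slice. Everything here is an assumption about the shape of the API; the
real code would be generated from the grammar, and this hand-rolled toy
only knows '+' and '-':

    import std.stdio;

    enum TokId { PLUS, MINUS }

    struct Token {
        TokId id;     // e.g. TokId.PLUS
        string text;  // the matched slice of the source
    }

    struct ToyLexer {
        string src;
        bool empty() { return src.length == 0; }
        Token front() {
            return src[0] == '+' ? Token(TokId.PLUS, src[0 .. 1])
                                 : Token(TokId.MINUS, src[0 .. 1]);
        }
        void popFront() { src = src[1 .. $]; }
    }

    void main() {
        foreach (tok; ToyLexer("+-+"))
            writefln("%s: %s", tok.id, tok.text);  // prints PLUS: + etc.
    }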
> Comments, strings, etc. can be handled in one of several ways, but
> that's a longer discussion.
The discussion's started anyhow. So what're the options?
--
Tomek