On 10/22/10 14:02 CDT, Tomek Sowiński wrote:
On 22-10-2010 at 00:01:21, Walter Bright <newshou...@digitalmars.com>
wrote:

As we all know, tool support is important for D's success. Making
tools easier to build will help with that.

To that end, I think we need a lexer for the standard library -
std.lang.d.lex. It would be helpful in writing color syntax
highlighting filters, pretty printers, REPLs, doc generators, static
analyzers, and even D compilers.

It should:

1. support a range interface for its input, and a range interface for
its output (a toy sketch follows this list)
2. optionally not generate lexical errors, but just try to recover and
continue
3. optionally return comments and ddoc comments as tokens
4. the tokens should be a value type, not a reference type
5. generally follow along with the C++ one so that they can be
maintained in tandem
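
A toy sketch of what points 1, 2, and 4 could look like in D (everything
below is illustrative only, not the proposed std.lang.d.lex API):

import std.ascii : isAlphaNum, isWhite;

// Toy token kinds; a real lexer would enumerate all of D's tokens.
enum TokKind { identifier, plus, minus }

// Point 4: a token is a plain value type, cheap to copy.
struct Token
{
    TokKind kind;
    string text;   // slice of the original source
}

// Point 1: the lexer is itself an input range of Tokens.
struct TokenRange
{
    string src;
    Token current;
    bool done;

    this(string s) { src = s; popFront(); }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        while (src.length && isWhite(src[0]))
            src = src[1 .. $];
        if (!src.length) { done = true; return; }
        if (src[0] == '+' || src[0] == '-')
        {
            current = Token(src[0] == '+' ? TokKind.plus : TokKind.minus,
                            src[0 .. 1]);
            src = src[1 .. $];
            return;
        }
        size_t n;
        while (n < src.length && isAlphaNum(src[n]))
            ++n;
        if (n == 0)   // point 2: skip an unexpected character and keep going
        {
            src = src[1 .. $];
            popFront();
            return;
        }
        current = Token(TokKind.identifier, src[0 .. n]);
        src = src[n .. $];
    }
}

unittest
{
    import std.algorithm : equal, map;
    assert(TokenRange("a + b").map!(t => t.text).equal(["a", "+", "b"]));
}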

It can also serve as the basis for creating a JavaScript
implementation that can be embedded into web pages for syntax
highlighting, and eventually an std.lang.d.parse.

Anyone want to own this?

Interesting idea. Here's another: D will soon need bindings for CORBA,
Thrift, etc., so lexers will have to be written over and over to grok
interface files. Perhaps a generic tokenizer that can be parametrized
with a lexical grammar would bring more ROI. I have a hunch D's templates
are strong enough to pull this off without any source code generation
à la JavaCC. The books I've read on compilers say tokenization is a
solved problem, so the theory on what a good abstraction should be is
done. What do you think?

Yes. IMHO writing a D tokenizer is a wasted effort. We need a tokenizer generator.

I have in mind the entire implementation of a simple design, but never had the time to execute on it. The tokenizer would work like this:

alias Lexer!(
    "+", "PLUS",
    "-", "MINUS",
    "+=", "PLUS_EQ",
    ...
    "if", "IF",
    "else", "ELSE"
    ...
) DLexer;

Such a declaration generates numeric values DLexer.PLUS etc. and efficient code that extracts a stream of tokens from a stream of text. Each token in the token stream carries its ID and its text.
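
The ID-generation half of such a template is already expressible with a compile-time string mixin. A hypothetical sketch (the real work, generating an efficient matching engine from the patterns, is elided):

import std.conv : to;

// Hypothetical sketch, not the committed design: generate the numeric
// token IDs from the (pattern, name) pairs via a string mixin.
template Lexer(spec...)
{
    static assert(spec.length % 2 == 0, "expected (pattern, name) pairs");

    private string idCode()
    {
        string code;
        foreach (i, member; spec)
        {
            static if (i % 2 == 1)
                code ~= "enum uint " ~ member ~ " = " ~ to!string(i / 2) ~ ";\n";
        }
        return code;
    }
    mixin(idCode());
}

alias Lexer!("+", "PLUS", "-", "MINUS", "if", "IF") DLexer;
static assert(DLexer.PLUS == 0 && DLexer.IF == 2);

The matching code could be generated along the same lines, e.g. by building a trie over the patterns at compile time.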

Comments, strings, etc. can be handled in one of several ways, but that's a longer discussion.

The undertaking is doable but nontrivial.


Andrei
