On 10/23/10 11:44 CDT, Sean Kelly wrote:
Andrei Alexandrescu <seewebsiteforem...@erdani.org> wrote:
On 10/22/10 16:28 CDT, Sean Kelly wrote:
Andrei Alexandrescu Wrote:

I have in mind the entire implementation of a simple design, but never
had the time to execute on it. The tokenizer would work like this:

alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
       "if", "IF",
       "else", "ELSE"
       ...
) DLexer;

Such a declaration generates numeric values DLexer.PLUS etc. and
efficient code that extracts a stream of tokens from a stream of text.
Each token in the token stream carries its ID and its text.
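
To make the shape of the output concrete, here is a minimal sketch of
what the generated pieces might boil down to; the names TokenId and
Token are illustrative only, not part of the actual design:

// Hypothetical shape of the generated artifacts (names are illustrative).
enum TokenId { PLUS, MINUS, PLUS_EQ, IF, ELSE }

struct Token
{
    TokenId id;   // e.g. TokenId.PLUS for the table entry "PLUS"
    string  text; // the slice of source text that was matched, e.g. "+"
}

unittest
{
    auto tok = Token(TokenId.PLUS, "+");
    assert(tok.id == TokenId.PLUS && tok.text == "+");
}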

What about, say, floating-point literals?  It seems like the first
element of a pair might have to be a regex pattern.


Yah, with regard to such regular patterns (strings, comments, numbers,
identifiers) there are at least two possibilities that I see:

1. Go the full route of allowing regexen in the definition. This is
very hard because you need to generate an efficient (N|D)FA during
compilation.

2. Pragmatically allow "fallthrough" routines, i.e. if nothing in the
compile-time table matches, just call onUnrecognizedString(). In
conjunction with a few simple specialized functions, that makes it
very simple to define arbitrarily complex lexers where the bulk of the
work (and the most tedious part) is done by the D compiler.
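
A rough sketch of what such a fallthrough routine might do, hand-lexing
identifiers and integer literals when no table entry matches; the name
onUnrecognizedString and its signature here are placeholders, not an
actual API:

import std.ascii : isAlpha, isAlphaNum, isDigit;

// Placeholder fallthrough hook: hand-lex an identifier or an integer
// literal starting at input[0], return the matched slice, and advance
// input past it. Anything unknown is consumed one character at a time
// so the caller can report an error.
string onUnrecognizedString(ref string input)
{
    size_t n = 1;
    if (isAlpha(input[0]) || input[0] == '_')
    {
        while (n < input.length && (isAlphaNum(input[n]) || input[n] == '_'))
            ++n;
    }
    else if (isDigit(input[0]))
    {
        while (n < input.length && isDigit(input[n]))
            ++n;
    }
    auto matched = input[0 .. n];
    input = input[n .. $];
    return matched;
}

unittest
{
    string s = "foo42 + 13";
    assert(onUnrecognizedString(s) == "foo42" && s == " + 13");
}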

For the second, that may push the work of recognizing some lexical
elements into the parser. For example, a comment may be written as /**/,
which, if the lexer has no definition for comments, comes through as
four distinct valid tokens: div mul mul div.

I was thinking comments could be easily caught by simple routines:

alias Lexer!(
       "+", "PLUS",
       "-", "MINUS",
       "+=", "PLUS_EQ",
       ...
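       // a q{...} code string in place of a token name is recognized as
       // code and called by the generated lexer when its prefix matches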
       "/*", q{parseNonNestedComment("*/")},
       "/+", q{parseNestedComment("+/")},
       "//", q{parseOneLineComment()},
       ...
       "if", "IF",
       "else", "ELSE",
       ...
) DLexer;

During compilation, such non-token entries are recognized as code by the
lexer generator and invoked appropriately. A comprehensive set of such
routines would round out a useful lexing library.
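
To illustrate the flavor of one such routine (the signature and the way
the remaining input is threaded through are assumptions, not the
generator's actual contract), a nested-comment skipper might look
something like this:

import std.algorithm.searching : startsWith;

// Skip a nested /+ ... +/ comment. `input` starts just past the opening
// "/+"; the returned slice starts just past the matching "+/".
string parseNestedComment(string input, string close = "+/", string open = "/+")
{
    int depth = 1;
    while (depth > 0 && input.length >= close.length)
    {
        if (input.startsWith(open))
        {
            ++depth;
            input = input[open.length .. $];
        }
        else if (input.startsWith(close))
        {
            --depth;
            input = input[close.length .. $];
        }
        else
            input = input[1 .. $];
    }
    assert(depth == 0, "unterminated nested comment");
    return input;
}

unittest
{
    assert(parseNestedComment("a /+ b +/ c +/ rest") == " rest");
}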


Andrei
