On 10/2/13 7:41 AM, Dicebot wrote:
> After brief discussion with Brian and gathering data from the
> review thread, I have decided to start voting for `std.d.lexer`
> inclusion into Phobos.
>
> Thanks all involved for the work, first of all Brian.
I have the proverbial good news and bad news. The only bad news
is that I'm voting "no" on this proposal.
But there's plenty of good news.
1. I am not attempting to veto this, so just consider it a
normal vote when tallying.
2. I do vote for inclusion in the /etc/ package for the time
being.
3. The work is good and the code valuable, so even in the case
my suggestions (below) are followed, virtually all the code
pulp that gets work done can be reused.
Vision
======
I'd been following the related discussions for a while, but I
made up my mind today as I was working on a C++ lexer. The C++
lexer is for Facebook's internal linter; I'm translating the
lexer from C++.
Before long I realized two simple things. First, I can't reuse
anything from Brian's code (without copying it and doing
surgery on it), although it is extremely similar to what I'm
doing.
Second, I figured that it is almost trivial to implement a
simple, generic, and reusable (across languages and tasks)
static trie searcher that takes a compile-time array with all
tokens and keywords and returns the token at the front of a
range with minimum comparisons.
Such a trie searcher is not intelligent, but is very composable
and extremely fast. It is just smart enough to do maximum munch
(e.g. interprets "==" and "foreach" as one token each, not
two), but is not smart enough to distinguish an identifier
"whileTrue" from the keyword "while" (it claims "while" was
found and stops right at the beginning of "True" in the
stream). This is for generality so applications can define how
identifiers work (e.g. Lisp allows "-" in identifiers but D
doesn't etc). The trie finder doesn't do numbers or comments
either. No regexen of any kind.
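To make the maximum-munch behavior concrete, here is roughly the
kind of code the generator would emit for the tiny table ["<",
"<<", "<<="]. This is a hand-written sketch (matchShift is my
name, not part of any proposed API), not the actual generated
code:

string matchShift(string r) {
    // One comparison per input character, nested like the
    // generated switch statements would be.
    if (r.length == 0 || r[0] != '<') return null;
    if (r.length > 1 && r[1] == '<') {
        if (r.length > 2 && r[2] == '=') return "<<=";
        return "<<";
    }
    return "<";
}

void main() {
    assert(matchShift("<<= 1") == "<<="); // maximum munch: longest token wins
    assert(matchShift("<< 1") == "<<");
    assert(matchShift("< 1") == "<");
    assert(matchShift("while") is null);  // not in this tiny table
}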
The beauty of it all is that all of these more involved bits
(many of which are language specific) can be implemented
modularly and trivially as a postprocessing step after the trie
finder. For example the user specifies "/*" as a token to the
trie finder. Whenever a comment starts, the trie finder will
find and return it; then the user implements the alternate
grammar of multiline comments.
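A minimal sketch of that postprocessing step (skipBlockComment is
a hypothetical helper name): once the trie finder has matched the
"/*" token, the caller consumes input until the closing "*/" and
then resumes ordinary lexing:

string skipBlockComment(string r) {
    import std.string : indexOf;
    auto i = r.indexOf("*/");
    assert(i >= 0, "unterminated block comment");
    return r[i + 2 .. $]; // resume ordinary lexing here
}

void main() {
    // The trie finder already consumed "/*"; the rest is the
    // caller's job.
    assert(skipBlockComment(" comment */int x;") == "int x;");
}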
To encode the tokens returned by the trie, we must do away with
definitions such as
enum TokenType : ushort { invalid, assign, ... }
These are fine for a tokenizer written in C, but are needless
duplication from a D perspective. I think a better approach is:
struct TokenType {
    string symbol;
    ...
}

TokenType tok(string s)() {
    static immutable string interned = s;
    return TokenType(interned);
}
Instead of associating token types with small integers, we
associate them with string addresses. (For efficiency we may
use pointers to zero-terminated strings, but I don't think
that's necessary). Token types are interned by design, i.e. to
compare two tokens for equality it suffices to compare the
strings with "is" (this can be extended to general identifiers,
not only statically-known tokens). Then, each token type has a
natural representation that doesn't require the user to
remember the name of the token. The left shift token is simply
tok!"<<" and is application-global.
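Fleshing out the snippet above into a complete program shows the
interning at work (a sketch; the real TokenType would likely
carry more fields):

struct TokenType {
    string symbol;
}

TokenType tok(string s)() {
    // One immutable copy of s per template instantiation, so the
    // string's address identifies the token type uniquely.
    static immutable string interned = s;
    return TokenType(interned);
}

void main() {
    auto a = tok!"<<";
    auto b = tok!"<<";
    assert(a.symbol is b.symbol);            // identity comparison suffices
    assert(tok!"<<".symbol !is tok!"<=".symbol);
    assert(a.symbol == "<<");                // self-describing representation
}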
The static trie finder does not even build a trie - it simply
generates a bunch of switch statements. The signature I've used
is:
Tuple!(size_t, size_t, Token)
staticTrieFinder(alias TokenTable, R)(R r) {
It returns a tuple with (a) whitespace characters before token,
(b) newlines before token, and (c) the token itself, returned
as tok!"whatever". To use for C++:
alias CppTokenTable = TypeTuple!(
"~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
"<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=",
"==", "!", "!=",
"^", "^=", "*", "*=",
":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
"-", "--", "-=", "->", "->*",
"/", "/=", "//", "/*",
"\\",
".",
"'",
"\"",
"#", "##",
"and",
"and_eq",
"asm",
"auto",
...
);
Then the code uses staticTrieFinder!([CppTokenTable])(range).
Of course, it's also possible to define the table itself as an
array. I'm exploring right now in search of the most
advantageous choices.
I think the above would be a true lexer in the D spirit:
- exploits D's string templates to essentially define
non-alphanumeric symbols that are easy to use and understand,
not confined to predefined tables (that enum!) and cheap to
compare;
- exploits D's code generation abilities to generate really
fast code using inlined trie searching;
- offers an API that is generic, flexible, and infinitely
reusable.
If what we need at this point is a conventional lexer for the D
language, std.d.lexer is the ticket. But I think it wouldn't be
difficult to push our ambitions way beyond that. What say you?