On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu wrote:
On 10/2/13 7:41 AM, Dicebot wrote:
After brief discussion with Brian and gathering data from the review thread, I have decided to start voting for `std.d.lexer` inclusion into
Phobos.

Thanks all involved for the work, first of all Brian.

I have the proverbial good news and bad news. The only bad news is that I'm voting "no" on this proposal.

But there's plenty of good news.

1. I am not attempting to veto this, so just consider it a normal vote when tallying.

2. I do vote for inclusion in the /etc/ package for the time being.

3. The work is good and the code valuable, so even if my suggestions (below) are followed, virtually all of the code that does the actual work can be reused.

Vision
======

I'd been following the related discussions for a while, but I made up my mind today while working on a C++ lexer. The C++ lexer is for Facebook's internal linter; I'm translating it from C++ to D.

Before long I realized two simple things. First, I can't reuse anything from Brian's code (without copying it and doing surgery on it), although it is extremely similar to what I'm doing.

Second, I figured that it is almost trivial to implement a simple, generic, and reusable (across languages and tasks) static trie searcher that takes a compile-time array with all tokens and keywords and returns the token at the front of a range with minimum comparisons.

Such a trie searcher is not intelligent, but it is very composable and extremely fast. It is just smart enough to do maximum munch (e.g. it interprets "==" and "foreach" as one token each, not two), but not smart enough to distinguish the identifier "whileTrue" from the keyword "while" (it claims "while" was found and stops right at the beginning of "True" in the stream). This is for generality, so applications can define how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't, etc.). The trie finder doesn't do numbers or comments either. No regexen of any kind.
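That identifier fix-up amounts to a one-character post-check. A minimal ASCII-only sketch (the helper name is mine, and real D identifiers also admit Unicode letters, ignored here for brevity):

  // Hypothetical post-check: the finder just claimed the keyword "while"
  // and stopped at the next character. If that character continues an
  // identifier (as in "whileTrue"), the application reclassifies the
  // token as an identifier instead.
  bool continuesIdentifier(dchar c) {
    return c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9');
  }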

The beauty of it all is that all of these more involved bits (many of which are language specific) can be implemented modularly and trivially as a postprocessing step after the trie finder. For example, the user specifies "/*" as a token to the trie finder. Whenever a comment starts, the trie finder will find and return it; then the user implements the alternate grammar of multiline comments.
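Concretely, the comment case could be as small as this (a sketch; the helper and its recovery strategy are mine, not part of any proposal):

  // Hypothetical postprocessing step: the trie finder has just returned
  // the "/*" token; the code below consumes the rest of the multiline
  // comment, something the finder deliberately knows nothing about.
  void skipMultilineComment(R)(ref R r) {
    import std.range.primitives : empty, front, popFront;
    bool sawStar = false;
    while (!r.empty) {
      immutable c = r.front;
      r.popFront();
      if (sawStar && c == '/') return; // found the closing "*/"
      sawStar = (c == '*');
    }
    // unterminated comment: the application decides how to diagnose it
  }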

To encode the tokens returned by the trie, we must do away with definitions such as

enum TokenType : ushort { invalid, assign, ... }

These are fine for a tokenizer written in C, but are needless duplication from a D perspective. I think a better approach is:

struct TokenType {
  string symbol; // the interned spelling of the token, e.g. "<<"
  ...
}

TokenType tok(string s)() {
  // one static immutable per instantiation: a unique address per token
  static immutable string interned = s;
  return TokenType(interned);
}

Instead of associating token types with small integers, we associate them with string addresses. (For efficiency we may use pointers to zero-terminated strings, but I don't think that's necessary). Token types are interned by design, i.e. to compare two tokens for equality it suffices to compare the strings with "is" (this can be extended to general identifiers, not only statically-known tokens). Then, each token type has a natural representation that doesn't require the user to remember the name of the token. The left shift token is simply tok!"<<" and is application-global.
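Given those definitions, the comparison is worth spelling out: `is` on strings compares the slice's pointer and length, not the characters, so testing two token types for equality costs two word comparisons.

  unittest {
    assert(tok!"<<".symbol is tok!"<<".symbol); // same interned address
    assert(tok!"<<".symbol !is tok!"<".symbol); // distinct token, distinct address
  }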

The static trie finder does not even build a trie - it simply generates a bunch of switch statements. The signature I've used is:

// Tuple is std.typecons.Tuple; the generated body is elided here.
Tuple!(size_t, size_t, Token)
staticTrieFinder(alias TokenTable, R)(R r);
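To make "a bunch of switch statements" concrete, here is roughly the shape the generated code could take for the family "<", "<<", "<<=", "<=" (a hand-written sketch, not actual generator output, and using the TokenType above where the signature says Token):

  // Each nesting level is one level of the implicit trie; maximum munch
  // falls out naturally because longer matches are tried before a
  // shorter one is returned.
  TokenType matchLess(R)(ref R r) {
    import std.range.primitives : empty, front, popFront;
    r.popFront(); // consume the leading '<' the caller dispatched on
    if (!r.empty) switch (r.front) {
      case '=':
        r.popFront();
        return tok!"<=";
      case '<':
        r.popFront();
        if (!r.empty && r.front == '=') {
          r.popFront();
          return tok!"<<=";
        }
        return tok!"<<";
      default:
        break;
    }
    return tok!"<";
  }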

staticTrieFinder returns a tuple with (a) the number of whitespace characters before the token, (b) the number of newlines before the token, and (c) the token itself, returned as tok!"whatever". To use it for C++:

alias CppTokenTable = TypeTuple!(
  "~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
"<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=", "==", "!", "!=",
  "^", "^=", "*", "*=",
  ":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
  "-", "--", "-=", "->", "->*",
  "/", "/=", "//", "/*",
  "\\",
  ".",
  "'",
  "\"",
  "#", "##",
  "and",
  "and_eq",
  "asm",
  "auto",
  ...
);

Then the code uses staticTrieFinder!([CppTokenTable])(range). Of course, it's also possible to define the table itself as an array. I'm exploring right now in search of the most advantageous choice.
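Tying the sketches together, a top-level loop might read as follows. This is hypothetical glue: handleToken is invented, Token is assumed to be the TokenType above, and the finder is assumed to advance the range (the by-value signature quoted earlier leaves that detail open):

  // Hypothetical driver; identifiers, numbers, and string literals would
  // get their own branches next to the comment case.
  void lexAll(R)(ref R r) {
    import std.range.primitives : empty;
    while (!r.empty) {
      // assumed to advance r past the token it finds
      auto t = staticTrieFinder!([CppTokenTable])(r);
      if (t[2].symbol is tok!"/*".symbol)
        skipMultilineComment(r); // the alternate comment grammar takes over
      else
        handleToken(t[2]); // hypothetical application callback
    }
  }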

I think the above would be a true lexer in the D spirit:

- exploits D's string templates to essentially define non-alphanumeric symbols that are easy to use and understand, not confined to predefined tables (that enum!), and cheap to compare;

- exploits D's code generation abilities to generate really fast code using inlined trie searching;

- offers an API that is generic, flexible, and infinitely reusable.

If what we need at this point is a conventional lexer for the D language, std.d.lexer is the ticket. But I think it wouldn't be difficult to push our ambitions way beyond that. What say you?


How quickly do you think this vision could be realized? If soon, I'd say it's worth delaying a decision on the currently proposed lexer, if not ... well, jam tomorrow, perfect is the enemy of good, and all that ...
