On Saturday, 5 October 2013 at 00:24:22 UTC, Andrei Alexandrescu wrote:
On 10/2/13 7:41 AM, Dicebot wrote:
After brief discussion with Brian and gathering data from the review thread, I have decided to start voting for `std.d.lexer` inclusion into
Phobos.

Thanks all involved for the work, first of all Brian.

I have the proverbial good news and bad news. The only bad news is that I'm voting "no" on this proposal.

But there's plenty of good news.

1. I am not attempting to veto this, so just consider it a normal vote when tallying.

2. I do vote for inclusion in the /etc/ package for the time being.

3. The work is good and the code valuable, so even if my suggestions (below) are followed, virtually all of the code that does the actual work can be reused.

Vision
======

I'd been following the related discussions for a while, but I made up my mind today while working on a C++ lexer. The C++ lexer is for Facebook's internal linter; I'm translating it from C++ to D.

Before long I realized two simple things. First, I can't reuse anything from Brian's code (without copying it and doing surgery on it), although it is extremely similar to what I'm doing.

Second, I figured that it is almost trivial to implement a simple, generic, and reusable (across languages and tasks) static trie searcher that takes a compile-time array with all tokens and keywords and returns the token at the front of a range with minimum comparisons.

Such a trie searcher is not intelligent, but it is very composable and extremely fast. It is just smart enough to do maximum munch (e.g. it interprets "==" and "foreach" as one token each, not two), but not smart enough to distinguish the identifier "whileTrue" from the keyword "while" (it claims "while" was found and stops right at the beginning of "True" in the stream). This is for generality, so applications can define how identifiers work (e.g. Lisp allows "-" in identifiers but D doesn't, etc.). The trie finder doesn't do numbers or comments either. No regexen of any kind.
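That identifier fix-up amounts to a one-character post-check. A minimal ASCII-only sketch (the helper name is mine, and real D identifiers also admit Unicode letters, ignored here for brevity):

  // Hypothetical post-check: the finder just claimed the keyword "while"
  // and stopped at the next character. If that character continues an
  // identifier (as in "whileTrue"), the application reclassifies the
  // token as an identifier instead.
  bool continuesIdentifier(dchar c) {
    return c == '_'
        || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9');
  }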

The beauty of it all is that all of these more involved bits (many of which are language specific) can be implemented modularly and trivially as a postprocessing step after the trie finder. For example, the user specifies "/*" as a token to the trie finder. Whenever a comment starts, the trie finder will find and return it; then the user implements the alternate grammar of multiline comments.
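Concretely, the comment case could be as small as this (a sketch; the helper and its recovery strategy are mine, not part of any proposal):

  // Hypothetical postprocessing step: the trie finder has just returned
  // the "/*" token; the code below consumes the rest of the multiline
  // comment, something the finder deliberately knows nothing about.
  void skipMultilineComment(R)(ref R r) {
    import std.range.primitives : empty, front, popFront;
    bool sawStar = false;
    while (!r.empty) {
      immutable c = r.front;
      r.popFront();
      if (sawStar && c == '/') return; // found the closing "*/"
      sawStar = (c == '*');
    }
    // unterminated comment: the application decides how to diagnose it
  }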

To encode the tokens returned by the trie, we must do away with definitions such as

enum TokenType : ushort { invalid, assign, ... }

These are fine for a tokenizer written in C, but are needless duplication from a D perspective. I think a better approach is:

struct TokenType {
  string symbol; // the interned spelling of the token, e.g. "<<"
  ...
}

TokenType tok(string s)() {
  // one static immutable per instantiation: a unique address per token
  static immutable string interned = s;
  return TokenType(interned);
}

Instead of associating token types with small integers, we associate them with string addresses. (For efficiency we may use pointers to zero-terminated strings, but I don't think that's necessary). Token types are interned by design, i.e. to compare two tokens for equality it suffices to compare the strings with "is" (this can be extended to general identifiers, not only statically-known tokens). Then, each token type has a natural representation that doesn't require the user to remember the name of the token. The left shift token is simply tok!"<<" and is application-global.
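Given those definitions, the comparison is worth spelling out: `is` on strings compares the slice's pointer and length, not the characters, so testing two token types for equality costs two word comparisons.

  unittest {
    assert(tok!"<<".symbol is tok!"<<".symbol); // same interned address
    assert(tok!"<<".symbol !is tok!"<".symbol); // distinct token, distinct address
  }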

The static trie finder does not even build a trie - it simply generates a bunch of switch statements. The signature I've used is:

// Tuple is std.typecons.Tuple; the generated body is elided here.
Tuple!(size_t, size_t, Token)
staticTrieFinder(alias TokenTable, R)(R r);
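To make "a bunch of switch statements" concrete, here is roughly the shape the generated code could take for the family "<", "<<", "<<=", "<=" (a hand-written sketch, not actual generator output, and using the TokenType above where the signature says Token):

  // Each nesting level is one level of the implicit trie; maximum munch
  // falls out naturally because longer matches are tried before a
  // shorter one is returned.
  TokenType matchLess(R)(ref R r) {
    import std.range.primitives : empty, front, popFront;
    r.popFront(); // consume the leading '<' the caller dispatched on
    if (!r.empty) switch (r.front) {
      case '=':
        r.popFront();
        return tok!"<=";
      case '<':
        r.popFront();
        if (!r.empty && r.front == '=') {
          r.popFront();
          return tok!"<<=";
        }
        return tok!"<<";
      default:
        break;
    }
    return tok!"<";
  }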

staticTrieFinder returns a tuple with (a) the number of whitespace characters before the token, (b) the number of newlines before the token, and (c) the token itself, returned as tok!"whatever". To use it for C++:

alias CppTokenTable = TypeTuple!(
  "~", "(", ")", "[", "]", "{", "}", ";", ",", "?",
"<", "<<", "<<=", "<=", ">", ">>", ">>=", "%", "%=", "=", "==", "!", "!=",
  "^", "^=", "*", "*=",
  ":", "::", "+", "++", "+=", "&", "&&", "&=", "|", "||", "|=",
  "-", "--", "-=", "->", "->*",
  "/", "/=", "//", "/*",
  "\\",
  ".",
  "'",
  "\"",
  "#", "##",
  "and",
  "and_eq",
  "asm",
  "auto",
  ...
);

Then the code uses staticTrieFinder!([CppTokenTable])(range). Of course, it's also possible to define the table itself as an array. I'm exploring right now in search of the most advantageous choice.
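Tying the sketches together, a top-level loop might read as follows. This is hypothetical glue: handleToken is invented, Token is assumed to be the TokenType above, and the finder is assumed to advance the range (the by-value signature quoted earlier leaves that detail open):

  // Hypothetical driver; identifiers, numbers, and string literals would
  // get their own branches next to the comment case.
  void lexAll(R)(ref R r) {
    import std.range.primitives : empty;
    while (!r.empty) {
      // assumed to advance r past the token it finds
      auto t = staticTrieFinder!([CppTokenTable])(r);
      if (t[2].symbol is tok!"/*".symbol)
        skipMultilineComment(r); // the alternate comment grammar takes over
      else
        handleToken(t[2]); // hypothetical application callback
    }
  }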

I think the above would be a true lexer in the D spirit:

- exploits D's string templates to essentially define non-alphanumeric symbols that are easy to use and understand, not confined to predefined tables (that enum!), and cheap to compare;

- exploits D's code generation abilities to generate really fast code using inlined trie searching;

- offers an API that is generic, flexible, and infinitely reusable.

If what we need at this point is a conventional lexer for the D language, std.d.lexer is the ticket. But I think it wouldn't be difficult to push our ambitions way beyond that. What say you?


How quickly do you think this vision could be realized? If soon, I'd say it's worth delaying a decision on the currently proposed lexer, if not ... well, jam tomorrow, perfect is the enemy of good, and all that ...
