Given the various proposals for a lexer module for Phobos, I thought I'd share some characteristics it ought to have.

First of all, it should be suitable for, at a minimum:

1. compilers

2. syntax highlighting editors

3. source code formatters

4. html creation

To that end:

1. It should accept as input an input range of UTF8. I feel it is a mistake to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16 should use an 'adapter' range to convert the input to UTF8. (This is what component programming is all about.)

2. It should output an input range of tokens.

3. Tokens should be values, not classes.

4. It should avoid memory allocation as much as possible.

5. It should not read or write any mutable global state outside of its "Lexer" instance

6. A single "Lexer" instance should be able to serially accept input ranges, sharing and updating one identifier table

7. It should accept a callback delegate for errors. That delegate should decide whether to:
   1. ignore the error (and "Lexer" will try to recover and continue)
   2. print an error message (and "Lexer" will try to recover and continue)
   3. throw an exception, in which case "Lexer" is done with that input range

8. Lexer should be configurable as to whether it should collect information about comments and ddoc comments or not

9. Comments and ddoc comments should be attached to the token that follows them; they should not themselves be tokens

10. High speed matters a lot

11. Tokens should have begin/end line/column markers, though most of the time this can be implicitly determined

12. It should come with unittests that, using -cov, show 100% coverage
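
To make the shape of this concrete, here is a rough sketch of what such an interface might look like. It is only an illustration under my own assumptions, not a proposed design: every name in it (TokenKind, Token, ErrorHandler, Lexer, TokenRange, intern, lex) is hypothetical, it handles ASCII identifiers, digits and punctuation only, it ignores comments entirely, and it allocates when building identifier text, which a real implementation would want to avoid.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isPunctuation, isWhite;
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;

enum TokenKind : ubyte { identifier, number, punctuation }

// Tokens are plain values (a struct, not a class).
struct Token
{
    TokenKind kind;
    uint id;           // index into the identifier table (identifiers only)
    uint line, column; // 1-based position of the first character
}

// The handler may ignore the error, print it, or throw to abandon the input.
alias ErrorHandler = void delegate(string message, uint line, uint column);

struct Lexer
{
    uint[string] identifiers; // one table, shared by every input this instance lexes
    ErrorHandler onError;

    uint intern(string name)
    {
        if (auto p = name in identifiers) return *p;
        immutable id = cast(uint) identifiers.length;
        identifiers[name] = id;
        return id;
    }

    // Accepts any input range of UTF-8 code units, returns an input range of tokens.
    auto lex(R)(R input)
        if (isInputRange!R && is(Unqual!(ElementType!R) == char))
    {
        return TokenRange!R(&this, input);
    }
}

struct TokenRange(R)
{
    private Lexer* lexer;
    private R input;
    private Token current;
    private bool done;
    private uint line = 1, column = 1;

    this(Lexer* lexer, R input)
    {
        this.lexer = lexer;
        this.input = input;
        popFront(); // prime the first token
    }

    @property bool empty() const { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        for (;;)
        {
            while (!input.empty && isWhite(input.front)) advance();
            if (input.empty) { done = true; return; }
            immutable char c = input.front;
            current = Token(TokenKind.punctuation, 0, line, column);
            if (isAlpha(c) || c == '_')
            {
                char[] text; // simplification: this allocates
                while (!input.empty && (isAlphaNum(input.front) || input.front == '_'))
                {
                    text ~= input.front;
                    advance();
                }
                current.kind = TokenKind.identifier;
                current.id = lexer.intern(text.idup);
                return;
            }
            if (isDigit(c))
            {
                while (!input.empty && isDigit(input.front)) advance();
                current.kind = TokenKind.number;
                return;
            }
            if (isPunctuation(c)) { advance(); return; }
            // Anything else is an error: report it, then skip the character and go on.
            if (lexer.onError !is null) lexer.onError("unexpected character", line, column);
            advance();
        }
    }

    private void advance()
    {
        if (input.front == '\n') { ++line; column = 1; }
        else ++column;
        input.popFront();
    }
}

unittest
{
    import std.array : array;
    import std.utf : byCodeUnit;

    Lexer lx;
    uint errors;
    lx.onError = (string m, uint l, uint c) { ++errors; };
    auto a = lx.lex("foo 42 bar".byCodeUnit).array;
    assert(a.length == 3 && a[0].id == 0 && a[2].id == 1);
    auto b = lx.lex("bar \x01 foo".byCodeUnit).array; // second input, same table
    assert(b.length == 2 && b[0].id == 1 && b[1].id == 0 && errors == 1);
}
```

Note that a string has to go through std.utf.byCodeUnit here, since a plain string presents itself as a range of dchar. That is the same 'adapter' idea as in point 1. The unittest feeds two inputs through one instance to show the identifier table carrying over, per point 6.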


Basically, I don't want anyone to be motivated to do a separate one after seeing this one.
