To help with performance comparisons, I ripped dmd's lexer out and got it building as a few .d files. It's very crude. It's got tons of casts (more than the original C++ version). I attempted no cleanup or any change beyond the minimum needed to get it to build and run. Obviously there's tons of room for cleanup, but that's not the point... it's just useful as a baseline.
The branch: https://github.com/braddr/phobos/tree/dmd_lexer
The commit with the changes: https://github.com/braddr/phobos/commit/040540ef3baa38997b15a56be3e9cd9c4bfa51ab

On my desktop (far from idle; it's running 2 of the auto testers), it consistently takes 0.187s to lex all of the .d files in phobos.

Later,
Brad

On 8/1/2012 5:10 PM, Walter Bright wrote:
> Given the various proposals for a lexer module for Phobos, I thought I'd
> share some characteristics it ought to have.
>
> First of all, it should be suitable for, at a minimum:
>
> 1. compilers
>
> 2. syntax highlighting editors
>
> 3. source code formatters
>
> 4. html creation
>
> To that end:
>
> 1. It should accept as input an input range of UTF8. I feel it is a mistake
> to templatize it for UTF16 and UTF32. Anyone desiring to feed it UTF16
> should use an 'adapter' range to convert the input to UTF8. (This is what
> component programming is all about.)
>
> 2. It should output an input range of tokens
>
> 3. tokens should be values, not classes
>
> 4. It should avoid memory allocation as much as possible
>
> 5. It should not read or write any mutable global state outside of its
> "Lexer" instance
>
> 6. A single "Lexer" instance should be able to serially accept input ranges,
> sharing and updating one identifier table
>
> 7. It should accept a callback delegate for errors. That delegate should
> decide whether to:
>    1. ignore the error (and "Lexer" will try to recover and continue)
>    2. print an error message (and "Lexer" will try to recover and continue)
>    3. throw an exception, "Lexer" is done with that input range
>
> 8. Lexer should be configurable as to whether it should collect information
> about comments and ddoc comments or not
>
> 9. Comments and ddoc comments should be attached to the next following token;
> they should not themselves be tokens
>
> 10. High speed matters a lot
>
> 11. Tokens should have begin/end line/column markers, though most of the time
> this can be implicitly determined
>
> 12. It should come with unittests that, using -cov, show 100% coverage
>
> Basically, I don't want anyone to be motivated to do a separate one after
> seeing this one.
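As a side note on point 1, the "adapter range" idea is exactly what later landed in Phobos as std.utf.byUTF (which postdates this thread): a lazy range that transcodes UTF-16 (or UTF-32) input into UTF-8 code units, so the lexer itself only ever has to handle one encoding. A minimal sketch:

```d
import std.utf : byUTF;
import std.array : array;

void main()
{
    // UTF-16 source text, as an editor buffer might hand us.
    wstring src = "int x = 42;"w;

    // Adapter range: lazily transcodes to UTF-8 code units, so a
    // UTF-8-only lexer can consume it without any templatization.
    auto utf8 = src.byUTF!char;

    assert(utf8.array == "int x = 42;");
}
```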
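To make points 2, 3, 7, and 11 concrete, here is a toy sketch of the API shape being described: a struct Lexer that is itself an input range of value-type Tokens carrying begin line/column markers, with an error-callback delegate whose return value decides whether to recover or stop. The names, the trivial identifier/number grammar, and the recovery strategy are mine for illustration; they are not from dmd or any actual Phobos proposal.

```d
import std.ascii : isAlpha, isAlphaNum, isDigit, isWhite;

enum TOK { identifier, number, eof }

// Point 3: a token is a plain value, not a class.
struct Token
{
    TOK kind;
    string text;
    size_t line, column;   // point 11: begin markers
}

// Point 2: the lexer is an input range of tokens.
struct Lexer
{
    string src;
    size_t pos, line = 1, col = 1;
    // Point 7: error callback; returning true means "recover and continue".
    bool delegate(string msg) onError;
    Token current;
    bool exhausted;

    this(string src, bool delegate(string) onError)
    {
        this.src = src;
        this.onError = onError;
        popFront();   // prime the range with the first token
    }

    @property bool empty() const { return exhausted; }
    @property Token front() const { return current; }

    void popFront()
    {
        if (current.kind == TOK.eof) { exhausted = true; return; }
        while (pos < src.length && isWhite(src[pos])) advance();
        if (pos >= src.length) { current = Token(TOK.eof, "", line, col); return; }
        immutable startLine = line, startCol = col, start = pos;
        if (isAlpha(src[pos]) || src[pos] == '_')
        {
            while (pos < src.length && (isAlphaNum(src[pos]) || src[pos] == '_'))
                advance();
            current = Token(TOK.identifier, src[start .. pos], startLine, startCol);
        }
        else if (isDigit(src[pos]))
        {
            while (pos < src.length && isDigit(src[pos])) advance();
            current = Token(TOK.number, src[start .. pos], startLine, startCol);
        }
        else
        {
            // Unknown character: report it, then recover by skipping it,
            // or stop lexing if the callback says so.
            if (onError !is null && !onError("unexpected character"))
            { current = Token(TOK.eof, "", line, col); return; }
            advance();
            popFront();
        }
    }

    private void advance()
    {
        if (src[pos] == '\n') { ++line; col = 1; } else ++col;
        ++pos;
    }
}

void main()
{
    size_t errors;
    auto lex = Lexer("foo 42 $ bar", (msg) { ++errors; return true; });
    string[] texts;
    foreach (tok; lex)
        if (tok.kind != TOK.eof)
            texts ~= tok.text;
    assert(texts == ["foo", "42", "bar"]);
    assert(errors == 1);   // the '$' was reported once and skipped
}
```

Because the tokens are values and the only state lives inside the Lexer instance (points 3 and 5), the same struct could serially accept further input ranges by resetting src/pos while keeping, say, a shared identifier table as a member (point 6).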