On 10/05/13 13:45, Jacob Carlborg wrote:
> I think we can have both. A hand written lexer, specifically targeted for D
> that is very fast. Then a more general lexer that can be used for many
> languages.
The assumption that a hand-written lexer will be much faster than a generated one is wrong. If there's any significant perf difference, then it's just a matter of improving the generator. An automatically generated lexer is much more flexible (the source spec can be reused without a single modification for anything from an intelligent LOC-like counter or a syntax highlighter to a compiler), easier to maintain/review, and less buggy.

Compare the perf numbers previously posted here for the various lexers with:

   $ time ./tokenstats stats std/datetime.d
   Lexed 1589336 bytes, found 461315 tokens, 13770 keywords, 65946 identifiers.
   Comments: Line: 958 @ ~40.16  Block: 1 @ ~16  Nesting: 534 @ ~441.7  [count @ avg_len]
   0m0.010s user 0m0.001s system 0m0.011s elapsed 99.61% CPU

   $ time ./tokenstats dump-no-io std/datetime.d
   0m0.013s user 0m0.001s system 0m0.014s elapsed 99.78% CPU

'tokenstats' is built from a PEG-like spec plus a bit of CT magic. The generator supports inline rules written in D too, but the only ones actually written in D are for defining what an identifier is, matching EOLs, and handling DelimitedStrings.

Initially, performance was not a consideration at all and there's some very low-hanging fruit in there; there's still room for improvement. Unfortunately, the language and compiler situation has prevented me from doing any work on this for the last half year or so. The code won't work with any current compiler and needs a lot of cleanups (which I have been planning to do /after/ updating the tooling, which seems very unlikely to be possible now), hence it's not in a releasable state. [1]

artur

[1] If anyone wants to play with it, use it as a reference etc., and isn't afraid of running a binary, a linux x86 one can be gotten from http://d-h.st/xtX
The only really useful functionality is 'tokenstats dump file.d', which will dump all found tokens with line and column numbers. It's just a tool I've been using for identifying regressions and benching.
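P.S. To illustrate the "one spec, many tools" argument above with something concrete: this is not the actual tokenstats code (which is generated from a PEG-like grammar with CT magic), just a toy, regex-based stand-in in Python. The point is that a single token table drives both a stats-style counter and a position-tracking dumper without any modification; all names and patterns here are made up for the sketch.

```python
import re

# Hypothetical miniature token spec: one table drives every tool built on it.
# (A real generator would consume a PEG-like grammar; regexes are a stand-in.)
SPEC = [
    ("comment",    r"//[^\n]*"),
    ("keyword",    r"\b(?:if|else|while|return|int)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("number",     r"\d+"),
    ("op",         r"[-+*/=<>!;(){}]"),
    ("ws",         r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def lex(src):
    """Yield (kind, text, line, col) for every non-whitespace token in src."""
    line, col = 1, 1
    for m in MASTER.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind != "ws":
            yield kind, text, line, col
        # Track position so a 'dump'-style tool gets line/column for free.
        newlines = text.count("\n")
        if newlines:
            line += newlines
            col = len(text) - text.rfind("\n")
        else:
            col += len(text)

# Tool 1 (stats-like counts) and tool 2 (token dump) share the same spec.
src = "int n = 42; // answer\nif (n) return n;\n"
toks = list(lex(src))
stats = {"keywords":    sum(t[0] == "keyword" for t in toks),
         "identifiers": sum(t[0] == "identifier" for t in toks)}
```

Swapping the spec table is all it takes to retarget the same machinery at another language, which is the flexibility argument in a nutshell.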