On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
> Better, but still slow.
I implemented the various suggestions from a past thread, made
the lexer work only on ubyte[] (to avoid Phobos converting
everything to dchar all the time), and gave the tokenizer
instance a character buffer that it re-uses.
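For illustration, here's a minimal sketch of that setup (the names are hypothetical, not the actual dscanner code): the lexer walks raw ubyte[] so Phobos never auto-decodes to dchar, and lexeme bytes accumulate in one buffer that survives across tokens.

struct Tokenizer
{
    const(ubyte)[] input; // raw bytes; no dchar decoding anywhere
    ubyte[] buffer;       // re-used for every token's text

    string nextIdentifier()
    {
        size_t len = 0;
        while (input.length && isIdentChar(input[0]))
        {
            // grow the scratch buffer only when it's too small
            if (len >= buffer.length)
                buffer.length = buffer.length ? buffer.length * 2 : 64;
            buffer[len++] = input[0];
            input = input[1 .. $];
        }
        return cast(string) buffer[0 .. len].idup; // the copy discussed below
    }
}

bool isIdentChar(ubyte c) pure nothrow
{
    return c == '_' || (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}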
Results:
$ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d
------------------------
Total time (ms): 13861.8
Repetitions : 200
Sample mode    : 69 (90 occurrences)
Median time : 69.0745
Avg time : 69.3088
Std dev. : 0.670203
Minimum : 68.613
Maximum : 72.635
95% conf.int. : [67.9952, 70.6223] e = 1.31357
99% conf.int. : [67.5824, 71.0351] e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016] e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308] e = 0.12207
If my math is right, that means it's getting 4.9 million
tokens/second now. According to Valgrind, the only way to really
improve things now is to require that the input to the lexer
support slicing. (Remember the secret of Tango's XML parser...)
The bottleneck is now the calls to .idup that construct the
token strings from slices of the buffer.
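To make the slicing idea concrete, here's a hedged sketch (the helper names are invented): if the lexer may require that the whole source be an immutable, sliceable array, a token's text can simply alias the source, and the per-token .idup allocation goes away, the same trick Tango's XML parser used.

// Today: the lexeme must be copied out of the re-used scratch buffer.
string copiedToken(const(ubyte)[] buffer, size_t len)
{
    return cast(string) buffer[0 .. len].idup; // one GC allocation per token
}

// With sliceable input: the token text is a zero-copy slice of the source.
string slicedToken(immutable(ubyte)[] source, size_t start, size_t end)
{
    return cast(string) source[start .. end]; // no allocation at all
}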
I guess that at some point
pure nothrow TokenType lookupTokenType(const string input)
might become a bottleneck. (DMD does not generate near-optimal
string switches, I think.)
Right now that's a fairly small box in KCachegrind.
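For anyone curious, here's a rough sketch of that function's shape (the TokenType members are invented for the example): a plain string switch over the keywords, which DMD lowers to a runtime string lookup rather than anything like a perfect hash, hence the worry that it could eventually dominate the profile.

enum TokenType { identifier, if_, for_, return_ /* ... */ }

pure nothrow TokenType lookupTokenType(const string input)
{
    switch (input)
    {
        case "if":     return TokenType.if_;
        case "for":    return TokenType.for_;
        case "return": return TokenType.return_;
        default:       return TokenType.identifier; // not a keyword
    }
}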