On Monday, 28 January 2013 at 21:03:21 UTC, Timon Gehr wrote:
> Better, but still slow.

I implemented the various suggestions from a past thread, made the lexer work only on ubyte[] (to avoid Phobos converting everything to dchar all the time), and gave the tokenizer instance a character buffer that it reuses.
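Roughly, those two changes look like this (a simplified sketch with hypothetical names, not the actual dscanner code):

struct Tokenizer
{
    const(ubyte)[] input;  // raw source bytes; nothing gets decoded to dchar
    size_t position;
    ubyte[] buffer;        // scratch space shared by every token
    size_t bufferLength;   // how much of the buffer the current token uses

    void lexIdentifier()
    {
        bufferLength = 0;  // reuse the existing allocation
        while (position < input.length && isIdentChar(input[position]))
        {
            if (bufferLength >= buffer.length)
                buffer.length = buffer.length + 1024;  // grow only when needed
            buffer[bufferLength++] = input[position++];
        }
        // the token's text is buffer[0 .. bufferLength]
    }
}

bool isIdentChar(ubyte c) pure nothrow
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9') || c == '_';
}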

Results:

$ avgtime -q -r 200 ./dscanner --tokenCount ../phobos/std/datetime.d

------------------------
Total time (ms): 13861.8
Repetitions    : 200
Sample mode    : 69 (90 ocurrences)
Median time    : 69.0745
Avg time       : 69.3088
Std dev.       : 0.670203
Minimum        : 68.613
Maximum        : 72.635
95% conf.int.  : [67.9952, 70.6223]  e = 1.31357
99% conf.int.  : [67.5824, 71.0351]  e = 1.72633
EstimatedAvg95%: [69.2159, 69.4016]  e = 0.0928836
EstimatedAvg99%: [69.1867, 69.4308]  e = 0.12207

If my math is right, that means it's getting 4.9 million tokens/second now. According to Valgrind, the only way to really improve things from here is to require that the input to the lexer support slicing. (Remember the secret of Tango's XML parser...) The bottleneck is now the calls to .idup that construct the token strings from slices of the buffer.
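In other words, the difference is between copying each token's text out of the scratch buffer and simply aliasing the source. A rough sketch of the two paths (hypothetical helpers, not the actual dscanner code):

struct Token
{
    string value;  // the token's text
    size_t index;  // byte offset in the source
}

// Current situation: the token's bytes live in a reusable scratch buffer,
// so they must be duplicated before the next token overwrites them.
// This .idup is what dominates the Valgrind profile.
Token tokenFromBuffer(const(ubyte)[] buffer, size_t index)
{
    return Token(cast(string)(buffer.idup), index);
}

// With a sliceable input: the token just aliases a slice of the source,
// so no per-token allocation is needed at all.
Token tokenFromSlice(immutable(ubyte)[] source, size_t start, size_t end)
{
    return Token(cast(string)(source[start .. end]), start);
}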

> I guess that at some point
>
> pure nothrow TokenType lookupTokenType(const string input)
>
> might become a bottleneck. (DMD does not generate near-optimal string switches, I think.)

Right now that's a fairly small box on KCachegrind.
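
For reference, the function in question is essentially a keyword table implemented as a string switch, along these lines (a simplified sketch, not the actual dscanner table):

enum TokenType { identifier, tIf, tElse, tWhile /* , ... */ }

pure nothrow TokenType lookupTokenType(const string input)
{
    // DMD implements string switches with runtime string comparisons
    // rather than, say, a perfect-hash keyword table, which is the
    // codegen concern mentioned above.
    switch (input)
    {
        case "if":    return TokenType.tIf;
        case "else":  return TokenType.tElse;
        case "while": return TokenType.tWhile;
        default:      return TokenType.identifier;
    }
}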