On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
Try this benchmark comparing various classification schemes:
bool isIdentifierChar1(ubyte c)
{
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A') &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
}
I'd like to point out that this is quite a complicated function to
begin with, so it doesn't generalize to all the isXXX ASCII
classifiers, for which the tests would be far simpler.
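For instance, a digit test needs only a single range check (a
sketch; isDigit1 is a hypothetical name, not one of the benchmarked
functions):

// A far simpler classifier: one range test, no chained trickery.
// (Hypothetical example, not part of Walter's benchmark.)
bool isDigit1(ubyte c)
{
    return '0' <= c && c <= '9';
}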
In any case (on my win32 machine), I can go from 810 msecs to
500 msecs using this function instead:
bool isIdentifierChar1(ubyte c)
{
    return c <= 'z' && (
        'a' <= c ||
        ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
        c == '$');
}
That said, I'm abusing the fact that 50% of your bench is for
chars over 0x80. If I loop only over the ASCII you'd actually find
in text (0x20 - 0x80), then those numbers "only" go from "320" =>
"300". Only slightly better, but still a win.
*BUT*, if your functions were to accept any arbitrary codepoint,
this version would absolutely murder the original.
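To make that concrete, here's a hedged sketch of a dchar-accepting
variant (my extrapolation, not code from the thread): any code point
above 'z', which includes all of non-ASCII, fails the very first
comparison.

// Hypothetical dchar overload: the leading c <= 'z' test rejects
// every non-ASCII code point with a single comparison.
bool isIdentifierChar1(dchar c)
{
    return c <= 'z' && (
        'a' <= c ||
        ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
        c == '$');
}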