On Tuesday, 1 April 2014 at 18:35:50 UTC, Walter Bright wrote:
Try this benchmark comparing various classification schemes:

bool isIdentifierChar1(ubyte c)
{
    // true iff c is one of [0-9A-Za-z_$]
    return ((c >= '0' || c == '$') &&
            (c <= '9' || c >= 'A')  &&
            (c <= 'Z' || c >= 'a' || c == '_') &&
            (c <= 'z'));
}
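
(For concreteness, here's the kind of harness I'm timing with; a minimal sketch of my setup, not Walter's actual benchmark. The buffer size, the byte distribution, and the msecs reporting are my assumptions.)

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

void main()
{
    // Assumed workload: every byte value 0x00-0xFF cycled over 10M
    // bytes, which is what puts half the input above 0x80.
    auto data = new ubyte[](10_000_000);
    foreach (i, ref b; data)
        b = cast(ubyte)(i & 0xFF);

    auto sw = StopWatch(AutoStart.yes);
    size_t hits;
    foreach (c; data)
        hits += isIdentifierChar1(c);   // bool converts to 0/1
    sw.stop();
    writefln("%s hits in %s msecs", hits, sw.peek.total!"msecs");
}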

I'd like to point out that this is quite a complicated function to begin with, so the result doesn't generalize to all the isXXX ASCII classifiers, whose tests would be substantially simpler.

In any case (on my win32 machine), I can go from 810 msecs to 500 msecs using this function instead:

bool isIdentifierChar1(ubyte c)
{
    // cheapest rejection first: anything above 'z' (including every
    // byte >= 0x80) exits on the very first comparison
    return c <= 'z' && (
            'a' <= c ||
            ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
            c == '$');
}
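
Quick sanity check that the rewrite classifies every byte exactly like the quoted version (a sketch; assume Walter's function above is renamed isIdentifierChar1Orig so both can coexist):

unittest
{
    // Exhaustive: there are only 256 possible inputs, so compare them all.
    foreach (c; 0 .. 256)
        assert(isIdentifierChar1(cast(ubyte) c) ==
               isIdentifierChar1Orig(cast(ubyte) c));
}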

That said, I'm abusing the fact that 50% of your bench consists of chars over 0x80. If I loop only over the ASCII you'd actually find in text (0x20 - 0x80), then those numbers "only" go from 320 msecs to 300 msecs. Only slightly better, but still a win.
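
For that measurement I only change the fill in the harness sketch above so every byte stays in the printable range; something like:

    // Restrict the workload to printable ASCII, 0x20 .. 0x7F.
    foreach (i, ref b; data)
        b = cast(ubyte)(0x20 + (i % 0x60));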

*BUT*, if your functions were to accept any arbitrary codepoint, this version would absolutely murder the original.
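
To make that concrete, this is what I have in mind (a hypothetical dchar overload of the same test, my sketch):

bool isIdentifierChar1(dchar c)
{
    // the first comparison alone throws out every code point above
    // U+007A, so non-ASCII input costs a single branch
    return c <= 'z' && (
            'a' <= c ||
            ('0' <= c && (c <= '9' || c == '_' || ('A' <= c && c <= 'Z'))) ||
            c == '$');
}

The quoted version, by contrast, has to fall through its whole chain of range checks before it can reject a large code point.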
