Hello everyone

I've been experimenting a bit with faster utf8 operations (and
conversion that does as little work as possible). I'm writing down the
results so they don't get forgotten, as well as trying to put them
into rpython comments.

As far as non-SSE algorithms go, for things like splitlines, split
etc. it is important to walk the utf8 string quickly and to check
properties of individual characters.

So far the main finding has been that a lookup table, for example:

from rpython.rlib import runicode

def next_codepoint_pos(code, pos):
    chr1 = ord(code[pos])
    if chr1 < 0x80:
        return pos + 1
    # look up the sequence length for the lead byte in a 128-entry table
    return pos + ord(runicode._utf8_code_length[chr1 - 0x80])

is significantly slower than the following code (neither version does
error checking):

def next_codepoint_pos(code, pos):
    chr1 = ord(code[pos])
    if chr1 < 0x80:
        return pos + 1
    # classify the lead byte with range checks instead of a table load
    if chr1 >= 0xC2 and chr1 <= 0xDF:
        return pos + 2
    if chr1 >= 0xE0 and chr1 <= 0xEF:
        return pos + 3
    return pos + 4
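
Since the table itself isn't shown above, here is an illustrative
stand-in (my own sketch, not the actual contents of
runicode._utf8_code_length) matching the ord() access pattern in the
first version:

# illustrative stand-in only -- one length byte per lead byte
# 0x80..0xFF, stored as a string so the lookup is one index plus ord()
_utf8_code_length = (
    '\x01' * 66 +   # 0x80-0xC1: continuation/invalid; 1, since no error checking
    '\x02' * 30 +   # 0xC2-0xDF: two-byte sequences
    '\x03' * 16 +   # 0xE0-0xEF: three-byte sequences
    '\x04' * 16     # 0xF0-0xFF: four-byte sequences
)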

The exact difference depends on how many multi-byte characters there
are and how big the strings are. It's up to 40%, but as a general
rule: the more ascii characters there are, the smaller the impact, and
the larger the strings, the more memory/L2/L3 cache effects dominate.
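
To give an idea of how such numbers can be measured, here is a
hypothetical harness (not the one the figures above came from; it
assumes the next_codepoint_pos above is in scope, python 2):

import timeit

def walk(code):
    # the kind of loop splitlines and friends do internally
    pos = 0
    while pos < len(code):
        pos = next_codepoint_pos(code, pos)

ascii_heavy = 'hello world, ' * 100000
multibyte = (u'\u0105\u0107\u017c' * 100000).encode('utf-8')

for name, data in [('ascii', ascii_heavy), ('multibyte', multibyte)]:
    print(name, timeit.timeit(lambda: walk(data), number=10))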

PS. SSE will be faster still, but we might not want SSE for just
splitlines.
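
(The core of the SSE idea is a bulk "is this chunk pure ascii" check;
just to illustrate it, here is the same trick done with 8-byte words
in plain python -- a sketch of mine, not actual pypy code. Real SSE
code would use a movemask over 16-byte vectors instead.)

import struct

def skip_ascii_run(code, pos):
    # advance past 8-byte chunks in which no byte has the high bit
    # set, i.e. chunks that are pure ascii
    while pos + 8 <= len(code):
        word, = struct.unpack_from('<Q', code, pos)
        if word & 0x8080808080808080:
            break
        pos += 8
    return pos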

Cheers,
fijal