Hi, as we discussed at the sprint, I have now experimented with an SSE/AVX implementation of 'len(utf8 string)' (this includes a check that the string is valid UTF-8). Since it is related to this mailing list thread I'll just add it here!
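To illustrate the core idea (this is only a minimal sketch, not the implementation I benchmarked, and the names are made up for the example): a new code point starts at every byte that is not a UTF-8 continuation byte (0b10xxxxxx), so the length can be computed by counting non-continuation bytes 16 at a time with SSE. The validity check that my implementation does on top of this is left out here.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sketch only: count UTF-8 code points by counting bytes that are NOT
   continuation bytes, i.e. bytes with (byte & 0xC0) != 0x80.  Assumes
   the input is already valid UTF-8; no validity check is done here. */
static size_t utf8_codepoint_count_sse(const uint8_t *s, size_t len)
{
    size_t count = 0, i = 0;
    const __m128i mask = _mm_set1_epi8((char)0xC0);
    const __m128i tag  = _mm_set1_epi8((char)0x80);
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* a byte is a continuation byte iff (byte & 0xC0) == 0x80 */
        __m128i cont = _mm_cmpeq_epi8(_mm_and_si128(chunk, mask), tag);
        int bits = _mm_movemask_epi8(cont);
        count += 16 - (size_t)__builtin_popcount((unsigned)bits);
    }
    for (; i < len; i++)        /* scalar tail */
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}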
I ran some small measurements on it. Here is some explanation of the names:

pypy-seq-.*: sequential implementation in C, nothing fancy, just a baseline
pypy-vec-sse4-.*: implementation using SSE4 (128-bit registers)
pypy-vec-avx2-.*: implementation using AVX2 (256-bit registers)
libunistring-.*: benchmarking the function u8_check in that GNU library; NO length is calculated
mystrlenutf8-.*: a length calculation found on the web (no validity check) that uses 64-bit words instead of per-byte iteration (see [1])

.*-news-de: HTML of a German news website (quite a lot of 2-byte code points), ~1 MB
.*-news-cn: worldjournarl.com -> Mandarin (HTML website with lots of 4-byte code points), ~700 KB
.*-tipitaka-thai: XML page of a religious text with lots of 3-byte code points, ~4.5 MB (the original 300 KB file copied many times)

Why is u8u16 missing? As far as I can tell there is no function in u8u16 that returns the length of a UTF-8 string and checks its validity at the same time, so it cannot be benchmarked without rewriting it. u8u16 is really just for transforming UTF-8 to UTF-16.

Each benchmark run reads the content from a file (e.g. .*-news-de, the German HTML news website), calls the UTF-8 get-length-and-check function written in C ten times in a loop, and sums up the time of those calls (measured with clock_t clock(void) in C, see man 3 clock); a sketch of that loop is below.
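(The timing harness is nothing fancier than the following shape; this is a sketch with trimmed file handling and a hypothetical function name, not the actual benchmark code.)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* hypothetical function under test: returns the number of code points,
   or -1 if the buffer is not valid UTF-8 */
extern long utf8_check_and_length(const unsigned char *buf, size_t len);

int main(int argc, char **argv)
{
    /* read the whole input file into memory (error handling trimmed) */
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    unsigned char *buf = malloc((size_t)size);
    fread(buf, 1, (size_t)size, f);
    fclose(f);

    /* call the length+check function 10 times and sum the CPU time */
    clock_t total = 0;
    long n = 0;
    for (int i = 0; i < 10; i++) {
        clock_t start = clock();
        n = utf8_check_and_length(buf, (size_t)size);
        total += clock() - start;
    }
    printf("%ld code points, %.3f us for 10 iterations\n",
           n, 1e6 * (double)total / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}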
pypy-seq-news-de: Median +- std dev: 76.0 us +- 1.4 us
pypy-sse4-vec-news-de: Median +- std dev: 5.16 us +- 0.14 us
pypy-avx2-vec-news-de: Median +- std dev: 384 ns +- 11 ns
libunistring-news-de: Median +- std dev: 33.0 us +- 0.4 us
mystrlenutf8-news-de: Median +- std dev: 9.25 us +- 0.22 us

pypy-seq-news-cn: Median +- std dev: 59.8 us +- 1.2 us
pypy-sse4-vec-news-cn: Median +- std dev: 7.70 us +- 0.12 us
pypy-avx2-vec-news-cn: Median +- std dev: 23.3 ns +- 0.4 ns
libunistring-news-cn: Median +- std dev: 30.5 us +- 0.4 us
mystrlenutf8-news-cn: Median +- std dev: 6.54 us +- 0.20 us

pypy-seq-tipitaka-thai: Median +- std dev: 939 us +- 39 us
pypy-sse4-vec-tipitaka-thai: Median +- std dev: 425 us +- 7 us
pypy-avx2-vec-tipitaka-thai: Median +- std dev: 19.9 ns +- 0.3 ns
libunistring-tipitaka-thai: Median +- std dev: 615 us +- 28 us
WARNING: the benchmark seems unstable, the standard deviation is high (stdev/median: 17%); try to rerun the benchmark with more runs, samples and/or loops
mystrlenutf8-tipitaka-thai: Median +- std dev: 45.1 us +- 7.9 us

What do you think? I think it would even be a good idea to take a look at AVX-512 (which gives you a crazy 512 bits, or 64 bytes, per vector register).

The AVX2 numbers are a bit fishy (compare pypy-avx2-vec-tipitaka-thai and pypy-avx2-vec-news-cn). I need to recheck that; it would not make sense to process 10x 4.5 MB in ~20 ns and 10x 700 KB in ~23 ns. As soon as I have ironed out that issue I'll start to think about indexing...

Cheers,
Richard

[1] http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

On 03/04/2017 07:01 PM, Maciej Fijalkowski wrote:
> Hello everyone
>
> I've been experimenting a bit with faster utf8 operations (and
> conversion that does not do much). I'm writing down the results so
> they don't get forgotten, as well as trying to put them in rpython
> comments.
>
> As far as non-SSE algorithms go, for things like splitlines, split
> etc. it is important to walk the utf8 string quickly and check
> properties of characters.
>
> So far the finding has been that a lookup table, for example:
>
> def next_codepoint_pos(code, pos):
>     chr1 = ord(code[pos])
>     if chr1 < 0x80:
>         return pos + 1
>     return pos + ord(runicode._utf8_code_length[chr1 - 0x80])
>
> is significantly slower than the following code (both don't do error checking):
>
> def next_codepoint_pos(code, pos):
>     chr1 = ord(code[pos])
>     if chr1 < 0x80:
>         return pos + 1
>     if chr1 >= 0xC2 and chr1 <= 0xDF:
>         return pos + 2
>     if chr1 >= 0xE0 and chr1 <= 0xEF:
>         return pos + 3
>     return pos + 4
>
> The exact difference depends on how many multi-byte characters there
> are and how big the strings are. It's up to 40%, but as a general
> rule, the more ASCII characters there are, the less of an impact it
> has, and the larger the strings are, the more impact the memory/L2/L3
> caches have.
>
> PS. SSE will be faster still, but we might not want SSE for just splitlines
>
> Cheers,
> fijal

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev