On 28/03/2021 18:29, Kristoffer Brånemyr via GNU coreutils General Discussion
wrote:
Hi,
I wanted to practice some more using vector intrinsics, so I made a small AVX2
optimization for wc -l. Depending on line length, it is about 2-5x faster than
the previous version. (Looking only at user time, the speedup is much larger
than that.)
I put the patch at https://github.com/coreutils/coreutils/pull/50 . Maybe this
is a pointless optimization, since I guess not many people run wc -l on
gigabytes of data, but maybe it could be useful for someone...
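For readers unfamiliar with the technique, here is a minimal sketch of how
newline counting with AVX2 intrinsics can look. It is not the code from the
pull request above. The function names (count_lines_scalar, count_lines_avx2,
count_lines) are hypothetical, and the sketch assumes GCC or Clang on x86,
since it relies on the target attribute and the __builtin_cpu_supports and
__builtin_popcount builtins.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Portable reference: count '\n' bytes one at a time.  */
static uint64_t
count_lines_scalar (const unsigned char *buf, size_t len)
{
  uint64_t lines = 0;
  for (size_t i = 0; i < len; i++)
    lines += buf[i] == '\n';
  return lines;
}

/* Hypothetical AVX2 variant (an illustration, not the actual patch).
   The target attribute lets this one function use AVX2 instructions
   even when the rest of the file is compiled without -mavx2.  */
static uint64_t __attribute__ ((target ("avx2")))
count_lines_avx2 (const unsigned char *buf, size_t len)
{
  const __m256i newline = _mm256_set1_epi8 ('\n');
  uint64_t lines = 0;
  size_t i = 0;

  /* Process 32 bytes per iteration.  */
  for (; i + 32 <= len; i += 32)
    {
      /* Unaligned load, so BUF needs no particular alignment.  */
      __m256i chunk = _mm256_loadu_si256 ((const __m256i *) (buf + i));
      /* Each byte equal to '\n' becomes 0xFF, every other byte 0x00.  */
      __m256i eq = _mm256_cmpeq_epi8 (chunk, newline);
      /* Collect the top bit of each byte into a 32-bit mask.  */
      uint32_t mask = (uint32_t) _mm256_movemask_epi8 (eq);
      /* One set bit per newline in this chunk.  */
      lines += (uint64_t) __builtin_popcount (mask);
    }

  /* Handle the remaining tail of fewer than 32 bytes.  */
  return lines + count_lines_scalar (buf + i, len - i);
}

/* Pick the AVX2 path only when the running CPU supports it.  */
static uint64_t
count_lines (const unsigned char *buf, size_t len)
{
  if (__builtin_cpu_supports ("avx2"))
    return count_lines_avx2 (buf, len);
  return count_lines_scalar (buf, len);
}
```

A caller would read the file into a buffer in blocks and add up the results of
count_lines (buf, n) for each block. The runtime check keeps the same binary
usable on CPUs without AVX2.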
As an aside, I think that .gitignore should be updated to include
src/libcksum_pclmul.a, and, if the patch is accepted, also the libwc_avx2.a I
added in this patch.
Good call. I've excluded all future .a files with:
https://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=f20797f1d
Some informal benchmark results with big files already in the file cache:
-----
An HTML formatted e-book, concatenated many times. Most lines around 80 chars,
with some shorter.
avx2 (1.98x faster)
38256750 /disk2/download/storfil4
real 0m0,292s
user 0m0,040s
sys 0m0,252s
normal wc
38256750 /disk2/download/storfil4
real 0m0,580s
user 0m0,346s
sys 0m0,234s
-------
A big file with only \n characters in it.
avx2 (4.9x faster)
1328545792 /disk2/download/storfil6_bara_radbryt
real 0m0,160s
user 0m0,012s
sys 0m0,148s
normal wc
1328545792 /disk2/download/storfil6_bara_radbryt
real 0m0,768s
user 0m0,626s
sys 0m0,142s
----
A big file with no \n at all.
avx2 (I think they are basically equally fast, since over several runs it
varied which one was faster)
0 /disk2/download/storfil7_inga_radbryt
real 0m0,277s
user 0m0,035s
sys 0m0,242s
normal wc
0 /disk2/download/storfil7_inga_radbryt
real 0m0,269s
user 0m0,039s
sys 0m0,230s
Excellent results.
I'll review this very soon.
thanks!
Pádraig