On 31/03/2024 18:58, Evgeny Nizhibitsky wrote:
Yes, it's true that the simplification and the speed-up from the bufsize increase are two
different things, although the former enabled the latter.

I just ran more tests with hyperfine on various configurations, comparing the current master
version against the new approach with bufsizes ranging from 16 KiB up to 1 MiB. The inputs were
1 billion yes'es like you did (1by), a file generated for the recent 1 billion row challenge (1brc, with
entries like "<station name>;<temperature:0.2f>"), and the first 100 million
rows of each (100my and 100mrc, respectively), all in /dev/shm and, again, on a 7800X3D:

The timings reported by hyperfine (mean ± σ) are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 21.3 ms ± 1.0 ms | 163.1 ms ± 1.5 ms | 197.1 ms ± 3.0 ms | 1.680 s ± 0.010 s |
| 16 KiB | 21.0 ms ± 1.1 ms | 163.7 ms ± 2.1 ms | 194.3 ms ± 2.5 ms | 1.658 s ± 0.015 s |
| 32 KiB | 20.2 ms ± 0.7 ms | 158.9 ms ± 3.0 ms | 194.6 ms ± 6.4 ms | 1.620 s ± 0.023 s |
| 64 KiB | 19.8 ms ± 0.6 ms | 154.0 ms ± 5.3 ms | 187.5 ms ± 7.2 ms | 1.553 s ± 0.013 s |
| 128 KiB | 18.8 ms ± 0.6 ms | 148.9 ms ± 5.4 ms | 178.4 ms ± 1.3 ms | 1.530 s ± 0.013 s |
| 256 KiB | 19.2 ms ± 0.8 ms | 145.8 ms ± 1.5 ms | 176.4 ms ± 1.6 ms | 1.522 s ± 0.017 s |
| 512 KiB | 19.6 ms ± 0.7 ms | 146.4 ms ± 1.0 ms | 183.0 ms ± 5.0 ms | 1.512 s ± 0.014 s |
| 1 MiB | 19.3 ms ± 0.7 ms | 145.7 ms ± 1.8 ms | 188.4 ms ± 6.2 ms | 1.499 s ± 0.012 s |

And the corresponding speed-ups relative to master are as follows:

| version | 100my | 100mrc | 1by | 1brc |
| ------- | ------- | ------- | ------- | ------- |
| master | 0% | 0% | 0% | 0% |
| 16 KiB | 1% | 0% | 1% | 1% |
| 32 KiB | 5% | 3% | 1% | 4% |
| 64 KiB | 8% | 6% | 5% | 8% |
| 128 KiB | 13% | 10% | 10% | 10% |
| 256 KiB | 11% | 12% | 12% | 10% |
| 512 KiB | 9% | 11% | 8% | 11% |
| 1 MiB | 10% | 12% | 5% | 12% |
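The post doesn't spell out how the speed-up percentages were derived. They are consistent with computing `t_master / t_bufsize - 1` from the mean timings and rounding to whole percent; the short sketch below (my reconstruction, not necessarily the script that was actually used) reproduces every cell of the table from the timings above. The names `MASTER`, `TIMINGS` and `speedup_pct` are purely illustrative.

```python
# Mean timings copied from the hyperfine table above.
# Columns: 100my, 100mrc, 1by (ms) and 1brc (s); units cancel within each ratio.
INPUTS = ("100my", "100mrc", "1by", "1brc")
MASTER = (21.3, 163.1, 197.1, 1.680)
TIMINGS = {
    "16 KiB":  (21.0, 163.7, 194.3, 1.658),
    "32 KiB":  (20.2, 158.9, 194.6, 1.620),
    "64 KiB":  (19.8, 154.0, 187.5, 1.553),
    "128 KiB": (18.8, 148.9, 178.4, 1.530),
    "256 KiB": (19.2, 145.8, 176.4, 1.522),
    "512 KiB": (19.6, 146.4, 183.0, 1.512),
    "1 MiB":   (19.3, 145.7, 188.4, 1.499),
}


def speedup_pct(t_master, t_new):
    """Assumed formula: speed-up of t_new over t_master in whole percent."""
    return round((t_master / t_new - 1) * 100)


def speedup_table(master, timings):
    """Apply speedup_pct column by column for every bufsize row."""
    return {
        size: [speedup_pct(m, t) for m, t in zip(master, row)]
        for size, row in timings.items()
    }


# Print the result in the same markdown-table layout as above.
for size, row in speedup_table(MASTER, TIMINGS).items():
    print(f"| {size} | " + " | ".join(f"{p}%" for p in row) + " |")
```

Note that the 16 KiB run on 100mrc is actually about 0.4% slower than master, which rounds to the 0% shown in the table.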

So, again, in my case the new approach is on par with the old one, while a bufsize
of 256 KiB seems to be the sweet spot.
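The thread doesn't say which criterion picks the sweet spot. One plausible choice, assumed here for illustration, is the highest mean speed-up across the four inputs, with ties broken by the worst-case (minimum) speed-up. Applied to the speed-up table above, this selects 256 KiB (mean 11.25%, never below 10%). The helper `best_bufsize` is hypothetical.

```python
from statistics import mean

# Speed-ups in percent, copied from the table above.
# Columns: 100my, 100mrc, 1by, 1brc.
SPEEDUPS = {
    "16 KiB":  [1, 0, 1, 1],
    "32 KiB":  [5, 3, 1, 4],
    "64 KiB":  [8, 6, 5, 8],
    "128 KiB": [13, 10, 10, 10],
    "256 KiB": [11, 12, 12, 10],
    "512 KiB": [9, 11, 8, 11],
    "1 MiB":   [10, 12, 5, 12],
}


def best_bufsize(table):
    """Assumed criterion: highest mean speed-up, ties broken by the worst case."""
    return max(table, key=lambda size: (mean(table[size]), min(table[size])))


best = best_bufsize(SPEEDUPS)
print(best, mean(SPEEDUPS[best]), min(SPEEDUPS[best]))
```

Ranking by the worst case alone would tie 128 KiB and 256 KiB at 10%, which is why the mean is used as the primary key in this sketch.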

Still, more testing on different CPUs and sample files should probably be
done.

Excellent.
This agrees with my testing of this patch on my laptop,
and my testing of 256 KiB buffer sizes with:
https://github.com/coreutils/coreutils/commit/fcfba90d0

I'll test on a few other systems and adjust the configure check before 
committing.

thanks!
Pádraig
