On Thu, 25 Jan 2024, 06:22 Hans Henrik Bergan, <divinit...@gmail.com> wrote:
> On Wed, 24 Jan 2024 at 17:59, Marco Pivetta <ocram...@gmail.com> wrote:
> >
> > Depends on the actual numbers: is there any way to make a comparison that
> > is relatively stable across architectures?
> >
> > Would it be feasible to start with the
> > cross-platform-let-the-compiler-do-its-job version (that somebody may
> > actually be capable of auditing), and then introduce other versions when
> > the jump is significant enough?
> >
>
> don't know about "relatively stable across architectures" but wrote
> some benchmarking code, keep reading.
>
>
> On Wed, 24 Jan 2024 at 17:55, tag Knife <fennic...@gmail.com> wrote:
> > Should we even be considering the specific instruction implementations?
> > I've always been in the camp of you are not smarter than the compiler.
> > As even the best human written ASM code can be slower than the obscure
> > instructions the compiler might choose to use in a weird and wonderful
> > way.
>
> The BLAKE3 team is smarter than GCC11.4, even with -march=native
> -mtune=native, which is *not* commonly used in PHP,
> the compiler didn't stand a chance against the hand-optimized assembly
> versions,
>
> wrote some benchmarks, but the TL;DR is:
> portable -O2 usually used by PHP managed 1126MB/s,
> portable -O2 -march=native managed 533MB/s (wtf? gcc obviously got
> something wrong here),
> hand-written -O2 SSE2 managed 3144MB/s,
> hand-written -O2 SSE41 managed 3332MB/s,
> hand-written -O2 AVX2 managed 6554MB/s,
> hand-written -O2 AVX512 managed 8913MB/s,
> on my AMD Ryzen 9 7950X,
> benchmarking code:
> https://gist.github.com/divinity76/5729472dd5d77e94cd0acb245aac2226
> raw output:
> array(6) {
>   ["O2-portable-march"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(29295)
>     ["mb_per_second"]=>
>     float(533.3674688513398)
>   }
>   ["O2-portable"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(13876)
>     ["mb_per_second"]=>
>     float(1126.0449697319111)
>   }
>   ["O2-sse2"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(4969)
>     ["mb_per_second"]=>
>     float(3144.4958744214127)
>   }
>   ["O2-sse41"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(4688)
>     ["mb_per_second"]=>
>     float(3332.977815699659)
>   }
>   ["O2-avx2"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(2384)
>     ["mb_per_second"]=>
>     float(6554.1107382550335)
>   }
>   ["O2-avx512"]=>
>   array(2) {
>     ["microseconds_for_16_kib"]=>
>     int(1753)
>     ["mb_per_second"]=>
>     float(8913.291500285226)
>   }
> }
>

Oh yes, the AVX jump is impressive 😵
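
For anyone who wants to sanity-check the quoted figures, here is a rough sketch in Python (not the language of the linked gist) that recomputes the throughput from the raw timings. One assumption is mine and not stated in the thread: the `mb_per_second` values only line up if each timing covers 1000 passes over a 16 KiB buffer, with 1 MB taken as 1024 * 1024 bytes. The names `BUFFER_BYTES`, `ITERATIONS`, `mb_per_second` and `speedup_vs_portable` are illustrative only.

```python
# Timings in microseconds, copied from the raw var_dump output quoted above.
TIMINGS_US = {
    "O2-portable-march": 29295,
    "O2-portable": 13876,
    "O2-sse2": 4969,
    "O2-sse41": 4688,
    "O2-avx2": 2384,
    "O2-avx512": 1753,
}

# ASSUMPTION (inferred from the numbers, not stated in the thread):
# each timing covers 1000 passes over a 16 KiB buffer, and "MB" means
# 1024 * 1024 bytes.
BUFFER_BYTES = 16 * 1024
ITERATIONS = 1000
BYTES_PER_MB = 1024 * 1024


def mb_per_second(microseconds: int) -> float:
    """Throughput in MB/s for one timing, under the assumptions above."""
    total_mb = BUFFER_BYTES * ITERATIONS / BYTES_PER_MB
    return total_mb / (microseconds / 1_000_000)


def speedup_vs_portable(name: str) -> float:
    """How many times faster a variant is than the plain portable -O2 build."""
    return TIMINGS_US["O2-portable"] / TIMINGS_US[name]


if __name__ == "__main__":
    for name, us in TIMINGS_US.items():
        print(
            f"{name:18s} {mb_per_second(us):9.1f} MB/s  "
            f"{speedup_vs_portable(name):5.2f}x vs portable -O2"
        )
```

Under that assumption the recomputed values match the quoted `mb_per_second` floats. On this one machine, AVX512 comes out at roughly 7.9x the portable -O2 build, while the -march=native build runs at a bit under half its speed.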