On 07/02/2024 06:29, Alexander Monakov wrote:
> On Tue, 6 Feb 2024, Elena Ufimtseva wrote:
>> Hello Alexander
>>
>> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amona...@ispras.ru> wrote:
>>
>>> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
>>> routines are invoked much more rarely in normal use when most buffers
>>> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
>>> frequency and voltage transition periods during which the CPU operates
>>> at reduced performance, as described in
>>> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
>>
>> I would like to point out that the frequency scaling is not currently an
>> issue on AMD Zen4 Genoa CPUs, for example.
>> The microarchitecture is described here:
>> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
>> Although the CPU frequency downscaling mentioned in the above document is
>> only in relation to floating point operations.
>> But from other online discussions I gather that the data path for the
>> integer registers in Zen4 is also 256 bits, which allows it to avoid
>> frequency downscaling for FP and heavy instructions.
>
> Yes, that's correct: in particular, on Zen 4, 512-bit vector loads occupy the
> load ports for two consecutive cycles, so from a load-throughput perspective
> there is no difference between 256-bit vectors and 512-bit vectors. Generally
> AVX-512 still has benefits on Zen 4 since it's a richer instruction set (it
> also reduces pressure in the CPU front-end and is more power-efficient), but
> as the new AVX2 buffer_is_zero is saturating the load ports, I would expect
> AVX-512 to exceed its performance only by a small margin, if at all, not
> anywhere close to 2x.
>
>> And looking at the optimizations for AVX2 in your other patch, would
>> unrolling the loop for AVX512 ops benefit from the speedup, given that the
>> data path has the same width?
>
> No, the 256-bit datapath on Zen 4 means that it's easier to saturate it with
> 512-bit loads than with 256-bit loads, so an AVX-512 loop is roughly
> comparable to a similar AVX2 loop unrolled twice.
>
> Aside: the AVX-512 variant needs a little more thought to use VPTERNLOG
> properly.
>
>> If the frequency downscaling is not observed on some of the CPUs, can
>> AVX512 be maintained and used selectively for some
>> of the CPUs?
>
> Please note that a properly optimized buffer_is_zero is limited by load
> throughput, not ALUs. On Zen 4, AVX2 is sufficient to saturate L1 cache load
> bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits
> of AVX-512 diminish further.
>
> I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial
> there for buffer_is_zero for reasons like reaching higher turbo clocks or
> higher memory parallelism.
FWIW, this frequency downscaling problem, which was most prominent on Skylake, is /supposedly/ no longer observed on Intel Sapphire Rapids either: https://www.phoronix.com/review/intel-sapphirerapids-avx512/8