On 07/02/2024 06:29, Alexander Monakov wrote:
> On Tue, 6 Feb 2024, Elena Ufimtseva wrote:
>> Hello Alexander
>>
>> On Tue, Feb 6, 2024 at 12:50 PM Alexander Monakov <amona...@ispras.ru>
>> wrote:
>>
>>> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
>>> routines are invoked much more rarely in normal use when most buffers
>>> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
>>> frequency and voltage transition periods during which the CPU operates
>>> at reduced performance, as described in
>>> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
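
For context, the "early checks" mean that the inline wrapper probes a few
bytes and rejects most non-zero buffers before any SIMD routine runs. A rough
sketch of the idea (the names are made up; this is not the actual QEMU
wrapper):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical out-of-line SIMD scan, standing in for the real routine. */
bool buffer_is_zero_simd(const void *buf, size_t len);

static inline bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const uint8_t *p = buf;

    if (len == 0) {
        return true;
    }
    /* Most non-zero buffers fail one of these cheap probes, so the
     * out-of-line SIMD routine below is reached only rarely. */
    if (p[0] | p[len / 2] | p[len - 1]) {
        return false;
    }
    return buffer_is_zero_simd(p, len);
}
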
>>
>> I would like to point out that frequency scaling is not currently an issue
>> on AMD Zen 4 Genoa CPUs, for example. The microarchitecture is described in
>> this white paper:
>> https://www.amd.com/system/files/documents/4th-gen-epyc-processor-architecture-white-paper.pdf
>> The CPU frequency downscaling mentioned in the above document is discussed
>> only in relation to floating-point operations, but from other online
>> discussions I gather that the data path for the integer registers in Zen 4
>> is also 256 bits, which allows frequency downscaling to be avoided for FP
>> and other heavy instructions.
> 
> Yes, that's correct: in particular, on Zen 4, 512-bit vector loads occupy load
> ports for two consecutive cycles, so from a load-throughput perspective there
> is no difference between 256-bit vectors and 512-bit vectors. Generally
> AVX-512 still has benefits on Zen 4 since it's a richer instruction set (it
> also reduces pressure in the CPU front-end and is more power-efficient), but
> as the new AVX2 buffer_is_zero saturates the load ports, I would expect that
> AVX512 can exceed its performance only by a small margin, if at all, not
> anywhere close to 2x.
> 
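
To make the "saturating load ports" point concrete: assuming two vector loads
per cycle on Zen 4, two 32-byte AVX2 loads move 64 bytes per cycle, and since
a 64-byte AVX-512 load holds a load port for two cycles it also tops out at
64 bytes per cycle, so the peak L1 load bandwidth is the same either way. An
AVX2 inner loop along these lines (an illustrative sketch, not the actual
patch; tail handling is omitted) already keeps both load ports busy:

#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch only: four 32-byte loads per iteration, OR-accumulated and tested
 * once at the end.  Assumes len is a multiple of 128 bytes. */
static bool is_zero_avx2_sketch(const void *buf, size_t len)
{
    const __m256i *p = buf;
    const __m256i *end = (const __m256i *)((const char *)buf + len);
    __m256i acc = _mm256_setzero_si256();

    for (; p + 4 <= end; p += 4) {
        __m256i t0 = _mm256_or_si256(_mm256_loadu_si256(p),
                                     _mm256_loadu_si256(p + 1));
        __m256i t1 = _mm256_or_si256(_mm256_loadu_si256(p + 2),
                                     _mm256_loadu_si256(p + 3));
        acc = _mm256_or_si256(acc, _mm256_or_si256(t0, t1));
    }
    return _mm256_testz_si256(acc, acc);
}
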
>> And looking at the optimizations for AVX2 in your other patch, would
>> unrolling the loop for AVX512 ops benefit from the same speedup, given that
>> the data path has the same width?
> 
> No, the 256-bit datapath on Zen 4 means that it's easier to saturate it with
> 512-bit loads than with 256-bit loads, so an AVX512 loop is roughly comparable
> to a similar AVX-256 loop unrolled twice.
> 
> Aside: the AVX512 variant needs a little more thought to use VPTERNLOG properly.
> 
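
On the VPTERNLOG aside: a single vpternlogq with immediate 0xFE computes
a | b | c, so an AVX-512 loop can fold three 64-byte loads with one ALU
operation. Something along these lines (illustrative only, not the patch):

#include <immintrin.h>

/* Sketch only: OR three 64-byte vectors with one vpternlogq (imm8 0xFE is
 * the truth table of a | b | c), then fold the result into the accumulator. */
static inline __m512i accum3(__m512i acc, const __m512i *p)
{
    __m512i t = _mm512_ternarylogic_epi64(_mm512_loadu_si512(p),
                                          _mm512_loadu_si512(p + 1),
                                          _mm512_loadu_si512(p + 2),
                                          0xFE);
    return _mm512_or_si512(acc, t);
}
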
>> If the frequency downscaling is not observed on some CPUs, can AVX512 be
>> maintained and used selectively on those CPUs?
> 
> Please note that a properly optimized buffer_is_zero is limited by load
> throughput, not ALUs. On Zen 4 AVX2 is sufficient to saturate L1 cache load
> bandwidth in buffer_is_zero. For data outside of L1 cache, the benefits
> of AVX-512 diminish more and more.
> 
> I don't have Zen 4 based machines at hand to see if AVX-512 is beneficial
> there for buffer_is_zero for reasons like reaching higher turbo clocks or
> higher memory parallelism.
> 

FWIW, the frequency downscaling problem, which was more prominent on Skylake,
is /supposedly/ no longer observed on Intel Sapphire Rapids either:

https://www.phoronix.com/review/intel-sapphirerapids-avx512/8
