https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81616

--- Comment #51 from Martin Jambor <jamborm at gcc dot gnu.org> ---
(In reply to Andrew Roberts from comment #50)
> with the matrix.c benchmark on Ryzen and looking at the other options when
> using -march=znver1 and -mtune=znver1
> 
> mult took 225281 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=128
> mult took 185961 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=256
> mult took 187577 clocks -march=znver1 -mtune=znver1 -mprefer-vector-width=512
> 
> adding -mno-avx2 has no effect on the above baseline.
> 
> adding in -mno-fma
> 
> mult took 223302 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=128 -mno-fma
> mult took 123773 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=256 -mno-fma
> mult took 124690 clocks -march=znver1 -mtune=znver1
> -mprefer-vector-width=512 -mno-fma
> 
> Is the patch in trunk yet? I was assuming it was from the other comments.

Yes, but by default (on Zen) it only prevents generating FMAs for
128bit operands (or smaller).  Originally, AMD kept 256bit ones or
larger intact in their splitting patch (and in a conversation they
hinted that they might be beneficial in some scenarios) and I kept the
condition there because 256bit vectors are not well understood and I
had little time.
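
For readers who want to see the effect on a small case, here is a minimal
sketch of a multiply-add loop.  It is a hypothetical kernel, not code from
matrix.c, and the name axpy and its signature are assumptions made for
illustration only.  Each iteration feeds a multiplication into an addition,
which is the pattern the compiler may contract into an FMA.

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from matrix.c: y[i] += a * x[i].
   The restrict qualifiers tell the compiler that x and y do not overlap,
   which makes the loop easier to vectorize. */
void axpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];   /* multiply feeding an add: an FMA candidate */
}
```

One way to compare the variants is to compile to assembly, for example with
gcc -O3 -S -march=znver1 -mprefer-vector-width=256, once as is and once with
-mno-fma added, and then look for vfmadd instructions in the output.  Which
vector width the vectorizer actually picks for a loop like this is not
guaranteed, so the result should be checked rather than assumed.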

We will definitely look at this when examining AVX256 on Zen.  I am not
sure whether we want to lift the restriction based only on matrix.c in
stage 4.  But I would not oppose it.
