https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #13 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "loop" cost of the relevant instructions now?
> > > > 1. AVX256 Load
> > > > 2. FMA3 ymm,ymm,ymm
> > > > 3. AVX256 Regmove
> > > > 4. FMA3 mem,ymm,ymm
> > > 
> > > For skylake, outside of register allocation.
> > > 
> > > they are
> > > 1. AVX256 Load  ---- 10
> > > 2. FMA3 ymm,ymm,ymm --- 16
> > > 3. AVX256 Regmove  --- 2
> > > 4. FMA3 mem,ymm,ymm --- 32
> > > 
> > > In RA, no direct cost for fma instrcutions, but we can disparage memory
> > > alternative in FMA instructions, but again, it may hurt performance in 
> > > some
> > > cases.
> > > 
> > > 1. AVX256 Load  ---- 10
> > > 3. AVX256 Regmove  --- 2
> > > 
> > > BTW: we have done a lot of experiments with different cost models and no
> > > significant performance impact on SPEC2017.
> > 
> > Thank you.
> > With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> > in conditions of heavy registers pressure. So, why it generates it in my
> > loop, where registers pressure in the innermost loop is light and even in
> > the next outer level the pressure isn't heavy?
> > What am I missing?
> 
> the actual transformation gcc did is
> 
> vmovuxx (mem1), %ymmA     pass_combine     
> vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
> vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA
> 
> then RA works like
>                             RA
> vmovuxx (mem1), %ymmA      ---->  %vmovaps %ymmB, %ymmA
> vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA
> 
> it "look like" but actually not this one.
> 
>  vmovuxx      (mem), %ymmA
>  vfnmadd231xx %ymmB, %ymmC, %ymmA
> transformed to
>  vmovaxx      %ymmB, %ymmA
>  vfnmadd213xx (mem), %ymmC, %ymmA
> 
> ymmB is allocate for (mem1) not (mem)

Thank you.
Now compiler's reasoning is starting to make more sense.
Still I don't understand why compiler does not compare the cost of full loop
body after combining to the cost before combining and does not come to
conclusion that combining increased the cost.

Reply via email to