https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #13 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Hongtao.liu from comment #11)
> (In reply to Michael_S from comment #10)
> > (In reply to Hongtao.liu from comment #9)
> > > (In reply to Michael_S from comment #8)
> > > > What are values of gcc "loop" cost of the relevant instructions now?
> > > > 1. AVX256 Load
> > > > 2. FMA3 ymm,ymm,ymm
> > > > 3. AVX256 Regmove
> > > > 4. FMA3 mem,ymm,ymm
> > >
> > > For skylake, outside of register allocation.
> > >
> > > they are
> > > 1. AVX256 Load ---- 10
> > > 2. FMA3 ymm,ymm,ymm --- 16
> > > 3. AVX256 Regmove --- 2
> > > 4. FMA3 mem,ymm,ymm --- 32
> > >
> > > In RA, no direct cost for fma instructions, but we can disparage memory
> > > alternative in FMA instructions, but again, it may hurt performance in
> > > some cases.
> > >
> > > 1. AVX256 Load ---- 10
> > > 3. AVX256 Regmove --- 2
> > >
> > > BTW: we have done a lot of experiments with different cost models and no
> > > significant performance impact on SPEC2017.
> >
> > Thank you.
> > With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> > in conditions of heavy registers pressure. So, why it generates it in my
> > loop, where registers pressure in the innermost loop is light and even in
> > the next outer level the pressure isn't heavy?
> > What am I missing?
>
> the actual transformation gcc did is
>
> vmovuxx (mem1), %ymmA            pass_combine
> vmovuxx (mem), %ymmD             ---->          vmovuxx (mem1), %ymmA
> vfmadd213 %ymmD,%ymmC,%ymmA                     vfmadd213 (mem),%ymmC,%ymmA
>
> then RA works like
>                                  RA
> vmovuxx (mem1), %ymmA            ---->          %vmovaps %ymmB, %ymmA
> vfmadd213 (mem),%ymmC,%ymmA                     vfmadd213 (mem),%ymmC,%ymmA
>
> it "look like" but actually not this one.
>
> vmovuxx (mem), %ymmA
> vfnmadd231xx %ymmB, %ymmC, %ymmA
> transformed to
> vmovaxx %ymmB, %ymmA
> vfnmadd213xx (mem), %ymmC, %ymmA
>
> ymmB is allocate for (mem1) not (mem)

Thank you.
Now the compiler's reasoning is starting to make more sense.
Still, I don't understand why the compiler does not compare the cost of the full loop body after combining with the cost before combining, and conclude that combining increased the cost.
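
To make the question concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the four Skylake costs quoted from comment #9 and assumes, purely for illustration, that the cost of a sequence is the plain sum of its instruction costs. That additive model, the `sequence_cost` helper, and the "no combine, after RA" row are assumptions of this sketch, not a description of how gcc's combine pass or RA actually account for cost.

```python
# Per-instruction "loop" costs for Skylake, as quoted in comment #9.
COST = {
    "load": 10,     # AVX256 Load
    "fma_reg": 16,  # FMA3 ymm,ymm,ymm
    "regmove": 2,   # AVX256 Regmove
    "fma_mem": 32,  # FMA3 mem,ymm,ymm
}


def sequence_cost(insns):
    """Hypothetical helper: assume a sequence costs the sum of its parts."""
    return sum(COST[name] for name in insns)


# The three-instruction pattern from comment #11, at each stage.
SEQUENCES = {
    "before combine": ["load", "load", "fma_reg"],
    "after combine": ["load", "fma_mem"],
    "after combine + RA": ["regmove", "fma_mem"],
    # Hypothetical: RA replaces the (mem1) load with a regmove
    # even if combine had not folded the (mem) load into the FMA.
    "no combine, after RA": ["regmove", "load", "fma_reg"],
}

for label, insns in SEQUENCES.items():
    print(f"{label:22s} {sequence_cost(insns)}")
```

Under these assumptions the combined form is more expensive than the uncombined one at both stages (42 vs 36 before RA, 34 vs 28 after), which is the comparison I would expect the compiler to make.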