pr57193.c scan-assembler-times movdqa 2

vmakarov at gcc dot gnu.org Fri, 01 Mar 2019 14:39:05 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716


Vladimir Makarov <vmakarov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #4 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
  I don't think it can be easily fixed.  We have the following code in
IRA (here - means a removed insn, pref means the preferred hard reg for
the destination pseudo, a hard reg in () means the assigned hard reg, and
copy and constrain mean a preference for two pseudos to have the same
hard reg):

  -28: r109(di)=di; REG_DEAD di;pref di
  -29: r110(si)=si; REG_DEAD si;pref si
  -30: r111(dx)=dx; REG_DEAD dx;pref dx
  -31: r112(xmm0)=xmm0; REG_DEAD xmm0;pref xmm0
    5: r100(xmm3)=r112(xmm0); REG_DEAD r112 ->copy(100,112)
  -32: r113(xmm1)=xmm1; REG_DEAD xmm1;pref xmm1
   -6: r101(xmm1)=r113(xmm1); REG_DEAD r113 ->copy(101,113)
   10: r103(xmm2)=[r109(di)]; REG_DEAD r109
   11: r102(xmm2)=trunc(zero_extend(r103(xmm2))+zero_extend([r110(si)])+const_vector
0>>0x1); REG_DEAD r110, r103->constrain(102,103)
   14: r104(xmm0)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel)
   16: r105(xmm2)=vec_select(vec_concat(r102(xmm2),r101(xmm3)),parallel);
REG_DEAD r102, r101->constrain(102,105)
   19: r106(xmm0)=trunc(zero_extend(r104(xmm0))*zero_extend(r100(xmm3))
0>>0x10); REG_DEAD r104->constrain(106,104)
   21: r107(xmm2)=trunc(zero_extend(r105(xmm2))*zero_extend(r100(xmm3))
0>>0x10); REG_DEAD r105, r100->constrain(107,105)(107,100)
   23: r108(xmm0)=vec_concat(us_truncate(r106(xmm0)),us_truncate(r107(xmm2)));
REG_DEAD r107, r106->constrain(108,106)
   25: [r111(dx)]=r108(xmm0); REG_DEAD r111, r108

We form threads of pseudos to have the same hard reg:

Threads:
  1. freq 9000: a2r107(2000) a5r105(2000) a8r102(3000) a10r103(2000)
  2. freq 6000: a1r108(2000) a3r106(2000) a6r104(2000)
  3. freq 5000: a4r100(3000) a13r112(2000); pref xmm0
  4. freq 5000: a7r101(3000) a12r113(2000); pref xmm1
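
As a rough illustration of the dump above: the thread frequencies are
consistent with a plain sum of the member allocno frequencies.  The sketch
below only checks that arithmetic.  It takes the grouping as given and
does not model how IRA decides which allocnos may share a thread; the
names threads and thread_freq are made up for the example.

```python
# Allocno frequencies copied from the "Threads:" dump above.
# The grouping itself is taken as given (assumption: we only model
# the frequency arithmetic, not how the threads are formed).
threads = {
    1: {"r107": 2000, "r105": 2000, "r102": 3000, "r103": 2000},
    2: {"r108": 2000, "r106": 2000, "r104": 2000},
    3: {"r100": 3000, "r112": 2000},
    4: {"r101": 3000, "r113": 2000},
}


def thread_freq(members):
    """Thread frequency modelled as the sum of its members' frequencies."""
    return sum(members.values())


freqs = {tid: thread_freq(members) for tid, members in threads.items()}
for tid, freq in freqs.items():
    print(f"thread {tid}: freq {freq}")
```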

Then the coloring algorithm prefers pushing pseudos onto the coloring
stack by threads when the other priorities are the same.  In this case we
basically assign by threads:

      r102  -- assign reg 22(xmm2)
      r107  -- assign reg 22(xmm2)
      r105  -- assign reg 22(xmm2)
      r103  -- assign reg 22(xmm2)
      r108  -- assign reg 20(xmm0)
      r106  -- assign reg 20(xmm0)
      r104  -- assign reg 20(xmm0)
      r100  -- assign reg 23(xmm3)
      r112  -- assign reg 20(xmm0)
      r101  -- assign reg 21(xmm1)
      r113  -- assign reg 21(xmm1)
      r111  -- assign reg 1(dx)
      r110  -- assign reg 4(si)
      r109  -- assign reg 5(di)
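
A simplified ordering rule happens to reproduce the order of the first
eleven assignments in the list above.  This is only a sketch under
assumptions, not the actual IRA comparison code: threads are taken in
decreasing frequency (ties keep the dump order), and inside a thread the
pseudos go by decreasing frequency, then by increasing allocno number.
r111, r110 and r109 are not in any listed thread and are left out.

```python
# (allocno number, pseudo, frequency) per thread, copied from the dump.
threads = [
    [(2, "r107", 2000), (5, "r105", 2000), (8, "r102", 3000), (10, "r103", 2000)],
    [(1, "r108", 2000), (3, "r106", 2000), (6, "r104", 2000)],
    [(4, "r100", 3000), (13, "r112", 2000)],
    [(7, "r101", 3000), (12, "r113", 2000)],
]


def assignment_order(threads):
    """Return pseudos in the assumed assignment order."""
    # Heavier threads first; sorted() is stable, so threads with equal
    # frequency keep the order they have in the dump.
    by_thread = sorted(threads, key=lambda t: -sum(freq for _, _, freq in t))
    order = []
    for thread in by_thread:
        # Inside a thread: higher frequency first, then lower allocno number.
        for _, pseudo, _ in sorted(thread, key=lambda a: (-a[2], a[0])):
            order.append(pseudo)
    return order


order = assignment_order(threads)
print(" ".join(order))
```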

We assign xmm2 (the first SSE reg after xmm0 and xmm1) to pseudos in the
1st thread because threads 3 and 4 prefer xmm0 and xmm1.

In LRA:

  As insn 14 requires r104 and r102 to be in the same hard reg, we
generate an additional insn:
      r114(xmm0) = r102(xmm2)
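
To make the check concrete, here is a small sketch that compares the IRA
assignment against the 2-op constraints.  The pairs are read off the
constrain annotations in the dump, plus the insn 14 requirement stated
above.  Treating insn 21 as satisfied when either of its two listed pairs
matches is an assumption of the sketch, and the names assigned, two_op
and unsatisfied are made up for it.

```python
# Hard register assigned to each pseudo, copied from the IRA assignment list.
assigned = {
    "r102": "xmm2", "r107": "xmm2", "r105": "xmm2", "r103": "xmm2",
    "r108": "xmm0", "r106": "xmm0", "r104": "xmm0",
    "r100": "xmm3", "r101": "xmm1",
}

# insn -> alternative (destination, source) pairs that must share a hard reg.
# Assumption: a constraint counts as met if any listed alternative matches.
two_op = {
    11: [("r102", "r103")],
    14: [("r104", "r102")],
    16: [("r105", "r102")],
    19: [("r106", "r104")],
    21: [("r107", "r105"), ("r107", "r100")],
    23: [("r108", "r106")],
}


def unsatisfied(two_op, assigned):
    """Return the insns whose 2-op constraint needs an extra move."""
    failing = []
    for insn, alternatives in sorted(two_op.items()):
        if not any(assigned[d] == assigned[s] for d, s in alternatives):
            failing.append(insn)
    return failing


failing = unsatisfied(two_op, assigned)
for insn in failing:
    dest, src = two_op[insn][0]
    # The extra move copies the source into the destination's hard reg.
    print(f"insn {insn}: {assigned[dest]} = {assigned[src]}")
```

With the assignment above only insn 14 is reported, and the move it
prints has the same hard regs as the additional insn generated by LRA.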

We could get the desired allocation if we started assignments with
pseudos from threads with lower priority (in order to assign xmm3 to
pseudos from the first thread).  But it would worsen performance in
the common case.

RA is all about heuristic solutions.  In some cases they work, in some
cases they don't.  We should see the whole picture.  Actually, in this
case RA removes 5 copies out of 6 and satisfies 5 out of 6 2-op
constraints without additional movement.

Probably an additional RA subpass which swaps pseudo-register
assignments in order to improve allocation could help.  But right now
I don't see how to implement this effectively or whether it is really
worth doing.

[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2
