[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

2019-03-27 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716

Jeffrey A. Law  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

2019-03-01 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716

Vladimir Makarov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #4 from Vladimir Makarov  ---
  I don't think it can be easily fixed.  We have the following code in
IRA (here "-" marks a removed insn, "pref" gives the preferred hard reg
for the destination pseudo, a hard reg in () is the assigned hard reg,
and "copy" and "constrain" mean a preference for two pseudos to get the
same hard reg):

  -28: r109(di)=di; REG_DEAD di;pref di
  -29: r110(si)=si; REG_DEAD si;pref si
  -30: r111(dx)=dx; REG_DEAD dx;pref dx
  -31: r112(xmm0)=xmm0; REG_DEAD xmm0;pref xmm0
    5: r100(xmm3)=r112(xmm0); REG_DEAD r112 ->copy(100,112)
  -32: r113(xmm1)=xmm1; REG_DEAD xmm1;pref xmm1
   -6: r101(xmm1)=r113(xmm1); REG_DEAD r113 ->copy(101,113)
   10: r103(xmm2)=[r109(di)]; REG_DEAD r109
   11: r102(xmm2)=trunc(zero_extend(r103(xmm2))+zero_extend([r110(si)])
       +const_vector 0>>0x1); REG_DEAD r110, r103->constrain(102,103)
   14: r104(xmm0)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel)
   16: r105(xmm2)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel);
       REG_DEAD r102, r101->constrain(102,105)
   19: r106(xmm0)=trunc(zero_extend(r104(xmm0))*zero_extend(r100(xmm3))
       0>>0x10); REG_DEAD r104->constrain(106,104)
   21: r107(xmm2)=trunc(zero_extend(r105(xmm2))*zero_extend(r100(xmm3))
       0>>0x10); REG_DEAD r105, r100->constrain(107,105)(107,100)
   23: r108(xmm0)=vec_concat(us_truncate(r106(xmm0)),us_truncate(r107(xmm2)));
       REG_DEAD r107, r106->constrain(108,106)
   25: [r111(dx)]=r108(xmm0); REG_DEAD r111, r108

We form threads of pseudos that should get the same hard reg:

Threads:
  1. freq 9000: a2r107(2000) a5r105(2000) a8r102(3000) a10r103(2000)
  2. freq 6000: a1r108(2000) a3r106(2000) a6r104(2000)
  3. freq 5000: a4r100(3000) a13r112(2000); pref xmm0
  4. freq 5000: a7r101(3000) a12r113(2000); pref xmm1
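
The grouping above can be reproduced with a small model.  This is only a
sketch under stated assumptions, not IRA's code: live ranges are read off
the insn listing as (position of the defining insn, position of the insn
carrying REG_DEAD), two pseudos are assumed to conflict when those ranges
strictly overlap, and a copy/constrain pair merges two threads only if no
member of one conflicts with a member of the other.  The names LIVE, FREQ,
PAIRS, conflicts and form_threads are made up for the illustration.

```python
# Positions follow the listing order: 28,29,30,31,5,32,6,10,11,14,16,19,21,23,25.
# LIVE maps an SSE pseudo to (def position, death position); an assumption
# derived by hand from the REG_DEAD notes in the listing.
LIVE = {
    112: (3, 4), 100: (4, 12), 113: (5, 6), 101: (6, 10),
    103: (7, 8), 102: (8, 10), 104: (9, 11), 105: (10, 12),
    106: (11, 13), 107: (12, 13), 108: (13, 14),
}
# Per-pseudo frequencies as printed in the "Threads:" dump.
FREQ = {107: 2000, 105: 2000, 102: 3000, 103: 2000, 108: 2000, 106: 2000,
        104: 2000, 100: 3000, 112: 2000, 101: 3000, 113: 2000}
# copy(...) and constrain(...) pairs in listing order.
PAIRS = [(100, 112), (101, 113), (102, 103), (102, 105),
         (106, 104), (107, 105), (107, 100), (108, 106)]


def conflicts(a, b):
    """True if the live ranges overlap; dying where the other is born is fine."""
    (def_a, death_a), (def_b, death_b) = LIVE[a], LIVE[b]
    return def_a < death_b and def_b < death_a


def form_threads(pairs):
    """Merge pseudos linked by a pair unless the two threads conflict."""
    thread = {r: {r} for r in LIVE}
    for a, b in pairs:
        ta, tb = thread[a], thread[b]
        if ta is tb:
            continue
        if any(conflicts(x, y) for x in ta for y in tb):
            continue  # e.g. (107,100): r100 is live across r102 and r105
        ta |= tb
        for r in tb:
            thread[r] = ta
    unique = {id(t): t for t in thread.values()}
    # Highest total frequency first, as in the dump.
    return sorted(unique.values(), key=lambda t: -sum(FREQ[r] for r in t))


if __name__ == "__main__":
    for t in form_threads(PAIRS):
        print(sum(FREQ[r] for r in t), sorted(t))
```

Under these assumptions the pair (107,100) is the only one rejected, which
is why r100 ends up outside the first thread.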

Then the coloring algorithm prefers to push pseudos onto the coloring
stack thread by thread when the other priorities are the same.  In this
case we basically assign by threads:

  r102  -- assign reg 22(xmm2)
  r107  -- assign reg 22(xmm2)
  r105  -- assign reg 22(xmm2)
  r103  -- assign reg 22(xmm2)
  r108  -- assign reg 20(xmm0)
  r106  -- assign reg 20(xmm0)
  r104  -- assign reg 20(xmm0)
  r100  -- assign reg 23(xmm3)
  r112  -- assign reg 20(xmm0)
  r101  -- assign reg 21(xmm1)
  r113  -- assign reg 21(xmm1)
  r111  -- assign reg 1(dx)
  r110  -- assign reg 4(si)
  r109  -- assign reg 5(di)

We assign xmm2 (the first SSE reg after xmm0 and xmm1) to pseudos in the
1st thread because threads 3 and 4 prefer xmm0 and xmm1.

In LRA:

  As insn 14 requires r104 and r102 to be in the same hard reg, we
generate an additional insn:
  r114(xmm0) = r102(xmm2)

We could get the desired allocation if we started assignments with
pseudos from threads with lower priority (in order to assign xmm3 to
pseudos from the first thread).  But it would worsen performance in
the common case.

RA is all about heuristic solutions.  In some cases they work, in some
cases they don't.  We should see the whole picture.  Actually, in this
case RA removes 5 copies out of 6 and satisfies 5 out of 6 2-op
constraints without additional moves.
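
The 2-op tally can be re-derived from the assignment listed earlier.  A
minimal sketch, assuming the tied pairs are (destination, first source) as
I read them from insns 11, 14, 16, 19, 21 and 23 of the listing; ASSIGN,
TIED and unsatisfied are names invented for the illustration:

```python
# Hard regs assigned to the SSE pseudos, copied from the assignment list.
ASSIGN = {
    100: "xmm3", 101: "xmm1", 102: "xmm2", 103: "xmm2", 104: "xmm0",
    105: "xmm2", 106: "xmm0", 107: "xmm2", 108: "xmm0",
    112: "xmm0", 113: "xmm1",
}
# (insn, destination pseudo, source pseudo tied to the destination).
# Assumption: the tied source is the first input operand of each insn.
TIED = [(11, 102, 103), (14, 104, 102), (16, 105, 102),
        (19, 106, 104), (21, 107, 105), (23, 108, 106)]


def unsatisfied(tied, assign):
    """Return the insns whose tied operands got different hard regs."""
    return [insn for insn, dest, src in tied if assign[dest] != assign[src]]


if __name__ == "__main__":
    bad = unsatisfied(TIED, ASSIGN)
    print(len(TIED) - len(bad), "of", len(TIED), "satisfied; needs a move:", bad)
```

Only insn 14 fails the check, which is the insn LRA has to fix up with the
extra move.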

Probably an additional RA subpass which swaps pseudo-register
assignments in order to improve allocation could help.  But right now
I don't see how to implement this effectively, or whether it is really
worth doing.

[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

2018-10-24 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716

--- Comment #3 from Segher Boessenkool  ---
(and swap xmm0 and xmm3 in all later instructions).

Yes.  But it seems IRA doesn't figure this out.

[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

2018-10-23 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716

H.J. Lu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-10-24
   Target Milestone|---                         |9.0
     Ever confirmed|0                           |1

--- Comment #2 from H.J. Lu  ---
We currently generate:

test1:
movdqa  (%rdi), %xmm2
pavgb   (%rsi), %xmm2
movdqa  %xmm0, %xmm3
movdqa  %xmm2, %xmm0
punpckhbw   %xmm1, %xmm2
punpcklbw   %xmm1, %xmm0
pmulhuw %xmm3, %xmm2
pmulhuw %xmm3, %xmm0
packuswb    %xmm2, %xmm0
movaps  %xmm0, (%rdx)
ret

One of

movdqa  %xmm0, %xmm3
movdqa  %xmm2, %xmm0

is redundant. We should generate

movdqa  %xmm2, %xmm3
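
The redundancy can be checked mechanically.  The sketch below is a plain
text rewrite, not anything GCC does: it drops the copy from xmm0 to xmm3
and swaps the two register names in every later instruction, then compares
the result with the sequence an older compiler produced (quoted in comment
#1).  The helper name drop_copy_and_swap and the list names are invented
for the illustration, and whitespace in the instructions is normalized.

```python
# Code generated now (comment #2) and by a slightly older compiler (comment #1).
CURRENT = [
    "movdqa (%rdi), %xmm2",
    "pavgb (%rsi), %xmm2",
    "movdqa %xmm0, %xmm3",
    "movdqa %xmm2, %xmm0",
    "punpckhbw %xmm1, %xmm2",
    "punpcklbw %xmm1, %xmm0",
    "pmulhuw %xmm3, %xmm2",
    "pmulhuw %xmm3, %xmm0",
    "packuswb %xmm2, %xmm0",
    "movaps %xmm0, (%rdx)",
    "ret",
]
OLDER = [
    "movdqa (%rdi), %xmm2",
    "pavgb (%rsi), %xmm2",
    "movdqa %xmm2, %xmm3",
    "punpckhbw %xmm1, %xmm2",
    "punpcklbw %xmm1, %xmm3",
    "pmulhuw %xmm0, %xmm2",
    "pmulhuw %xmm0, %xmm3",
    "packuswb %xmm2, %xmm3",
    "movaps %xmm3, (%rdx)",
    "ret",
]


def drop_copy_and_swap(insns, copy, reg_a, reg_b):
    """Remove `copy` and exchange reg_a/reg_b in all instructions after it."""
    i = insns.index(copy)

    def swap(text):
        # Three-step rename through a placeholder that cannot occur in asm.
        return text.replace(reg_a, "\0").replace(reg_b, reg_a).replace("\0", reg_b)

    return insns[:i] + [swap(s) for s in insns[i + 1:]]


if __name__ == "__main__":
    result = drop_copy_and_swap(CURRENT, "movdqa %xmm0, %xmm3", "%xmm0", "%xmm3")
    print(result == OLDER)
    print(sum(s.startswith("movdqa") for s in result), "movdqa insns remain")
```

The rewritten sequence has two movdqa instructions instead of three, which
is the count the scan-assembler-times check in pr57193.c expects.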

[Bug rtl-optimization/87716] [9 Regression] FAIL: gcc.target/i386/pr57193.c scan-assembler-times movdqa 2

2018-10-23 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716

--- Comment #1 from Segher Boessenkool  ---
A slightly older compiler gave

test1:
movdqa  (%rdi), %xmm2
pavgb   (%rsi), %xmm2
movdqa  %xmm2, %xmm3
punpckhbw   %xmm1, %xmm2
punpcklbw   %xmm1, %xmm3
pmulhuw %xmm0, %xmm2
pmulhuw %xmm0, %xmm3
packuswb    %xmm2, %xmm3
movaps  %xmm3, (%rdx)
ret

What is so super strange about the current generated code?