https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87716
Vladimir Makarov <vmakarov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vmakarov at gcc dot gnu.org

--- Comment #4 from Vladimir Makarov <vmakarov at gcc dot gnu.org> ---
I don't think it can be easily fixed.

We have the following code in IRA (here "-" means a removed insn, "pref" means
the preferred hard reg for the destination pseudo, a hard reg in () is the
assigned hard reg, and "copy" and "constrain" mean a preference for two pseudos
to have the same hard reg):

 -28: r109(di)=di; REG_DEAD di;pref di
 -29: r110(si)=si; REG_DEAD si;pref si
 -30: r111(dx)=dx; REG_DEAD dx;pref dx
 -31: r112(xmm0)=xmm0; REG_DEAD xmm0;pref xmm0
   5: r100(xmm3)=r112(xmm0); REG_DEAD r112 ->copy(100,112)
 -32: r113(xmm1)=xmm1; REG_DEAD xmm1;pref xmm1
  -6: r101(xmm1)=r113(xmm1); REG_DEAD r113 ->copy(101,113)
  10: r103(xmm2)=[r109(di)]; REG_DEAD r109
  11: r102(xmm2)=trunc(zero_extend(r103(xmm2))+zero_extend([r110(si)])+const_vector 0>>0x1);REG_DEAD r110,r103->constrain(102,103)
  14: r104(xmm0)=vec_select(vec_concat(r102(xmm2),r101(xmm1)),parallel)
  16: r105(xmm2)=vec_select(vec_concat(r102(xmm2),r101(xmm3)),parallel); REG_DEAD r102, r101->constrain(102,105)
  19: r106(xmm0)=trunc(zero_extend(r104(xmm0))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r104->constrain(106,104)
  21: r107(xmm2)=trunc(zero_extend(r105(xmm2))*zero_extend(r100(xmm3)) 0>>0x10); REG_DEAD r105, r100->constrain(107,105)(107,100)
  23: r108(xmm0)=vec_concat(us_truncate(r106(xmm0)),us_truncate(r107(xmm2))); REG_DEAD r107, r106->constrain(108,106)
  25: [r111(dx)]=r108(xmm0); REG_DEAD r111, r108

We form threads of pseudos that should get the same hard reg:

Threads:
  1. freq 9000: a2r107(2000) a5r105(2000) a8r102(3000) a10r103(2000)
  2. freq 6000: a1r108(2000) a3r106(2000) a6r104(2000)
  3. freq 5000: a4r100(3000) a13r112(2000); pref xmm0
  4. freq 5000: a7r101(3000) a12r113(2000); pref xmm1

The coloring algorithm then prefers to push pseudos onto the coloring stack
thread by thread when the other priorities are the same. In this case we
basically assign by threads:

  r102 -- assign reg 22(xmm2)
  r107 -- assign reg 22(xmm2)
  r105 -- assign reg 22(xmm2)
  r103 -- assign reg 22(xmm2)
  r108 -- assign reg 20(xmm0)
  r106 -- assign reg 20(xmm0)
  r104 -- assign reg 20(xmm0)
  r100 -- assign reg 23(xmm3)
  r112 -- assign reg 20(xmm0)
  r101 -- assign reg 21(xmm1)
  r113 -- assign reg 21(xmm1)
  r111 -- assign reg 1(dx)
  r110 -- assign reg 4(si)
  r109 -- assign reg 5(di)

We assign xmm2 (the first SSE reg after xmm0 and xmm1) to the pseudos in the
1st thread because threads 3 and 4 prefer xmm0 and xmm1.

In LRA: as insn 14 requires p104 and p102 to be in the same hard reg, we
generate an additional insn:

  r114(xmm0) = r102(xmm2)

We could get the desired allocation if we started the assignment with pseudos
from threads with lower priority (in order to assign xmm3 to the pseudos from
the first thread). But that would worsen performance in the common case.

RA is all about heuristic solutions. In some cases they work, in some cases
they don't. We should look at the whole picture. Actually, in this case RA
removes 5 copies out of 6 and satisfies 5 out of 6 2-op constraints without
additional moves.

Probably an additional RA subpass that swaps pseudo-register assignments in
order to improve the allocation could help. But right now I don't see how to
implement this effectively, or whether it is really worth doing.
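
For readers who want to experiment with the heuristic described above, here is a
toy Python model of it. It is only a sketch and not GCC's IRA code: the insn
positions and live ranges are reconstructed from the dump, pairs are processed
in dump order (an assumption about ordering), the cost weights in color() are
invented for illustration, and the assignment order is copied from the list
above rather than derived. The names form_threads, color and ASSIGN_ORDER are
hypothetical. Under these assumptions the model yields the same four threads and
the same SSE register assignment, including the xmm0/xmm2 mismatch between r104
and r102 that makes LRA add a move for insn 14.

```python
# Toy model only: NOT GCC's IRA implementation. All data is transcribed or
# reconstructed from the dump in the comment; cost weights are assumptions.
INSNS = [28, 29, 30, 31, 5, 32, 6, 10, 11, 14, 16, 19, 21, 23, 25]
POS = {insn: i for i, insn in enumerate(INSNS)}

# pseudo -> (defining insn, insn with its REG_DEAD note, allocno frequency)
PSEUDOS = {
    112: (31, 5, 2000), 100: (5, 21, 3000), 113: (32, 6, 2000),
    101: (6, 16, 3000), 103: (10, 11, 2000), 102: (11, 16, 3000),
    104: (14, 19, 2000), 105: (16, 21, 2000), 106: (19, 23, 2000),
    107: (21, 23, 2000), 108: (23, 25, 2000),
}
# copy/constrain pairs, in the order they appear in the dump (assumed order)
PAIRS = [(100, 112), (101, 113), (102, 103), (102, 105), (106, 104),
         (107, 105), (107, 100), (108, 106)]
PREF = {112: "xmm0", 113: "xmm1"}  # hard reg preferences from insns 31 and 32
REGS = ["xmm0", "xmm1", "xmm2", "xmm3"]
# assignment order of the SSE pseudos, copied from the comment
ASSIGN_ORDER = [102, 107, 105, 103, 108, 106, 104, 100, 112, 101, 113]


def conflicts(a, b):
    """True if the half-open live ranges [def, death) of a and b overlap."""
    (def_a, dead_a, _), (def_b, dead_b, _) = PSEUDOS[a], PSEUDOS[b]
    return POS[def_a] < POS[dead_b] and POS[def_b] < POS[dead_a]


def thread_freq(thread):
    return sum(PSEUDOS[p][2] for p in thread)


def form_threads():
    """Merge paired pseudos unless any member of one thread conflicts
    with any member of the other; return threads sorted by frequency."""
    thread = {p: [p] for p in PSEUDOS}
    for a, b in PAIRS:
        ta, tb = thread[a], thread[b]
        if ta is tb or any(conflicts(x, y) for x in ta for y in tb):
            continue
        ta.extend(tb)
        for p in tb:
            thread[p] = ta
    unique = list({id(t): t for t in thread.values()}.values())
    return sorted(unique, key=thread_freq, reverse=True)


def color(threads, order):
    """Greedy assignment: a reg held by a conflicting pseudo is unusable;
    among the rest, lowest cost wins and ties go to the lowest reg."""
    owner = {p: t for t in threads for p in t}
    tpref = {p: next((PREF[q] for q in owner[p] if q in PREF), None)
             for p in PSEUDOS}
    assigned = {}
    for p in order:
        rivals = [q for q in PSEUDOS if q != p and conflicts(p, q)]
        taken = {assigned[q] for q in rivals if q in assigned}
        mates = {assigned[q] for q in owner[p] if q in assigned}
        wanted = {tpref[q] for q in rivals if q not in assigned}
        free = [r for r in REGS if r not in taken]
        # +1: reg preferred by an unassigned rival; -2: reg already used by a
        # thread mate; -3: reg is this thread's own preference (made-up weights)
        assigned[p] = min(free, key=lambda r: (r in wanted)
                          - 2 * (r in mates) - 3 * (r == tpref[p]))
    return assigned


threads = form_threads()
for number, t in enumerate(threads, start=1):
    print(number, "freq", thread_freq(t), sorted(t))
print(color(threads, ASSIGN_ORDER))
```

Changing ASSIGN_ORDER or the weights in color() is a cheap way to see how
sensitive the result is to the order in which threads are colored.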