On Wed, Jun 26, 2024 at 2:52 PM Richard Biener <richard.guent...@gmail.com> wrote: > > On Wed, Jun 26, 2024 at 8:09 AM liuhongt <hongtao....@intel.com> wrote: > > > > 416.gamess regressed 4-6% on x86_64 since my r15-882-g1d6199e5f8c1c0. > > The commit adjust rtx_cost of mem to reduce cost of (add op0 disp). > > But Cost of ADDR could be cheaper than XEXP (addr, 0) when it's a lea. > > It is the case in the PR, the patch uses lower cost to enable more > > simplication and fix the regression. > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ok for trunk? > > > > gcc/ChangeLog: > > > > PR target/115462 > > * config/i386/i386.cc (ix86_rtx_costs): Use cost of addr when > > it's lower than rtx_cost (XEXP (addr, 0)) + 1. > > > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr115462.c: New test. > > --- > > gcc/config/i386/i386.cc | 9 +++++++-- > > gcc/testsuite/gcc.target/i386/pr115462.c | 22 ++++++++++++++++++++++ > > 2 files changed, 29 insertions(+), 2 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr115462.c > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index d4ccc24be6e..83dab8220dd 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -22341,8 +22341,13 @@ ix86_rtx_costs (rtx x, machine_mode mode, int > > outer_code_i, int opno, > > if (GET_CODE (addr) == PLUS > > && x86_64_immediate_operand (XEXP (addr, 1), Pmode)) > > { > > - *total += 1; > > - *total += rtx_cost (XEXP (addr, 0), Pmode, PLUS, 0, speed); > > + /* PR115462: Cost of ADDR could be cheaper than XEXP (addr, 0) > > + when it's a lea, use lower cost to enable more > > + simplification. */ > > + unsigned cost1 = rtx_cost (addr, Pmode, MEM, 0, speed); > > + unsigned cost2 = rtx_cost (XEXP (addr, 0), Pmode, > > + PLUS, 0, speed) + 1; > > Just as comment - this is a bit ugly, why would we not always use the > address cost? (and why are you using 'MEM'?) Should this be better > handled on the insn_cost level when it's clear the PLUS is separate address > calculation (LEA) rather than address calculation in a MEM context? For MEM, rtx_cost doesn't use address_cost but iterates each subrtx, and adds up the costs, So for MEM (reg) and MEM (reg + 4), the former costs 5, the latter costs 9, it is not accurate for x86. Ideally address_cost should be used, but it reduces cost too much(range from 1-3). (I've tried that, it regressed many testcases, because two many registers are propagated into addr and increase register pressure). So the current solution is to make constant disp as cheap as possible so more constant can be propagated into the address(but not registers).
> > > + *total += MIN (cost1, cost2); > > return true; > > } > > } > > diff --git a/gcc/testsuite/gcc.target/i386/pr115462.c > > b/gcc/testsuite/gcc.target/i386/pr115462.c > > new file mode 100644 > > index 00000000000..ad50a6382bc > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/i386/pr115462.c > > @@ -0,0 +1,22 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -mavx2 -fno-tree-vectorize -fno-pic" } */ > > +/* { dg-final { scan-assembler-times {(?n)movl[ \t]+.*, p1\.0\+[0-9]*\(,} > > 3 } } */ > > + > > +int > > +foo (long indx, long indx2, long indx3, long indx4, long indx5, long > > indx6, long n, int* q) > > +{ > > + static int p1[10000]; > > + int* p2 = p1 + 1000; > > + int* p3 = p1 + 4000; > > + int* p4 = p1 + 8000; > > + > > + for (long i = 0; i != n; i++) > > + { > > + /* scan for movl %edi, p1.0+3996(,%rax,4), > > + p1.0+3996 should be propagted into the loop. */ > > + p2[indx++] = q[indx++]; > > + p3[indx2++] = q[indx2++]; > > + p4[indx3++] = q[indx3++]; > > + } > > + return p1[indx6] + p1[indx5]; > > +} > > -- > > 2.31.1 > > -- BR, Hongtao