Re: [PATCH 1/2] Implementation of new RISCV optimizations pass: fold-mem-offsets.

Jeff Law via Gcc-patches Wed, 07 Jun 2023 22:37:33 -0700



On 5/25/23 06:35, Manolis Tsamis wrote:

Implementation of the new RISC-V optimization pass for memory offset
calculations, documentation and testcases.

gcc/ChangeLog:

        * config.gcc: Add riscv-fold-mem-offsets.o to extra_objs.
        * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Schedule a new
        pass.
        * config/riscv/riscv-protos.h (make_pass_fold_mem_offsets): Declare.
        * config/riscv/riscv.opt: New options.
        * config/riscv/t-riscv: New build rule.
        * doc/invoke.texi: Document new option.
        * config/riscv/riscv-fold-mem-offsets.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/fold-mem-offsets-1.c: New test.
        * gcc.target/riscv/fold-mem-offsets-2.c: New test.
        * gcc.target/riscv/fold-mem-offsets-3.c: New test.

So not going into the guts of the patch yet.

From a benchmark standpoint the only two that get out of the +-0.05%range are mcf and deepsjeng (from a dynamic instruction standpoint). Sofrom an evaluation standpoint we can probably focus our efforts there.And as we know, mcf is actually memory bound, so while improving itsdynamic instruction count is good, the end performance improvement maybe marginal.

As I mentioned to Philipp many months ago this reminds me a lot of aproblem I've seen before. Basically register elimination emits codethat can be terrible in some circumstances. So I went and poked at thisagain.

I think the key difference between now and what I was dealing withbefore is for the cases that really matter for rv64 we have a shNaddinsn in the sequence. That private port I was working on before did nothave shNadd (don't ask, I probably can't tell). Our target also hadreg+reg addressing modes. What I can't remember was if we were tryingharder to fold the constant terms into the memory reference or if wewere more focused on the reg+reg. Ultimately it's probably not thatimportant to remember -- the key is there are very significantdifferences in the target's capabilities which impact how we should begenerating code in this case. Those differences affect the code wegenerate *and* the places where we can potentially get control and dosome address rewriting.

A key sequence in mcf looks something like this in IRA, others havesimilar structure:

(insn 237 234 239 26 (set (reg:DI 377)
        (plus:DI (ashift:DI (reg:DI 200 [ _173 ])
                (const_int 3 [0x3]))
            (reg/f:DI 65 frame))) "pbeampp.c":139:15 333 {*shNadd}
     (nil))
(insn 239 237 235 26 (set (reg/f:DI 380)
        (plus:DI (reg:DI 513)
            (reg:DI 377))) "pbeampp.c":139:15 5 {adddi3}
     (expr_list:REG_DEAD (reg:DI 377)
        (expr_list:REG_EQUAL (plus:DI (reg:DI 377)
                (const_int -32768 [0xffffffffffff8000]))
            (nil))))

[ ... ]

(insn 240 235 255 26 (set (reg/f:DI 204 [ _177 ])
        (mem/f:DI (plus:DI (reg/f:DI 380)
                (const_int 280 [0x118])) [7 *_176+0 S8 A64])) 
"pbeampp.c":139:15 179 {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 380)
        (nil)))

The key here is insn 237. It's generally going to be bad to have FPshow up in a shadd insn because its going to be eliminated intosp+offset. That'll generate an input reload before insn 237 and wecan't do any combination with the constant in insn 239.


After LRA it looks like this:

(insn 1540 234 1541 26 (set (reg:DI 11 a1 [750])
        (const_int 32768 [0x8000])) "pbeampp.c":139:15 179 {*movdi_64bit}
     (nil))
(insn 1541 1540 1611 26 (set (reg:DI 12 a2 [749])
        (plus:DI (reg:DI 11 a1 [750])
            (const_int -272 [0xfffffffffffffef0]))) "pbeampp.c":139:15 5 
{adddi3}
     (expr_list:REG_EQUAL (const_int 32496 [0x7ef0])

(nil)))(insn 1611 1541 1542 26 (set (reg:DI 29 t4 [795])

        (plus:DI (reg/f:DI 2 sp)
            (const_int 64 [0x40]))) "pbeampp.c":139:15 5 {adddi3}
     (nil))
(insn 1542 1611 237 26 (set (reg:DI 12 a2 [749])
        (plus:DI (reg:DI 12 a2 [749])
            (reg:DI 29 t4 [795]))) "pbeampp.c":139:15 5 {adddi3}
     (nil))
(insn 237 1542 239 26 (set (reg:DI 12 a2 [377])
        (plus:DI (ashift:DI (reg:DI 14 a4 [orig:200 _173 ] [200])
                (const_int 3 [0x3]))
            (reg:DI 12 a2 [749]))) "pbeampp.c":139:15 333 {*shNadd}
     (nil))
(insn 239 237 235 26 (set (reg/f:DI 12 a2 [380])
        (plus:DI (reg:DI 10 a0 [513])
            (reg:DI 12 a2 [377]))) "pbeampp.c":139:15 5 {adddi3}
     (expr_list:REG_EQUAL (plus:DI (reg:DI 12 a2 [377])
            (const_int -32768 [0xffffffffffff8000]))

(nil)))

[ ... ]

(insn 240 235 255 26 (set (reg/f:DI 14 a4 [orig:204 _177 ] [204])
        (mem/f:DI (plus:DI (reg/f:DI 12 a2 [380])
                (const_int 280 [0x118])) [7 *_176+0 S8 A64])) 
"pbeampp.c":139:15 179 {*movdi_64bit}
     (nil))



Reload/LRA made an absolute mess of that code.

But before we add a new pass (target specific or generic), I think itmay be in our best interest experiment a bit of creative rewriting topreserve the shadd, but without the frame pointer. Perhaps also lookingfor a way to fold the constants, both the explicit ones and the implicitone from FP elimination.


This looks particularly promising:

Trying 237, 239 -> 240:
  237: r377:DI=r200:DI<<0x3+frame:DI
      REG_DEAD r200:DI
  239: r380:DI=r513:DI+r377:DI
      REG_DEAD r377:DI
      REG_EQUAL r377:DI-0x8000
  240: r204:DI=[r380:DI+0x118]
      REG_DEAD r380:DI
Failed to match this instruction:
(set (reg/f:DI 204 [ _177 ])
    (mem/f:DI (plus:DI (plus:DI (plus:DI (mult:DI (reg:DI 200 [ _173 ])
                        (const_int 8 [0x8]))
                    (reg/f:DI 65 frame))
                (reg:DI 513))
            (const_int 280 [0x118])) [7 *_176+0 S8 A64]))



We could reassociate this as

t1 = r200 * 8 + r513
t2 = frame + 280
t3 = t1 + t2
r204 = *t3

Which after elimination would be

t1 = r2000 * 8 + r513
t2 = sp + C + 280
t3 = t1 + t2
r204 = *t3

C + 280 will simplify. And we'll probably end up in the addptr3 casewhich I think gives us a chance to write this a bit so that we end up wit

t1 = r200 * 8 + r513
t2 = sp + t1
r204 = *(t2 + 280 + C)

Or at least I *think* we might be able to get there. Anyway, as I said,I think this deserves a bit of playing around before we jump straightinto adding a new pass.


jeff

Re: [PATCH 1/2] Implementation of new RISCV optimizations pass: fold-mem-offsets.

Reply via email to