On 5/25/23 06:35, Manolis Tsamis wrote:
Implementation of the new RISC-V optimization pass for memory offset
calculations, documentation and testcases.

gcc/ChangeLog:

        * config.gcc: Add riscv-fold-mem-offsets.o to extra_objs.
        * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Schedule a new
        pass.
        * config/riscv/riscv-protos.h (make_pass_fold_mem_offsets): Declare.
        * config/riscv/riscv.opt: New options.
        * config/riscv/t-riscv: New build rule.
        * doc/invoke.texi: Document new option.
        * config/riscv/riscv-fold-mem-offsets.cc: New file.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/fold-mem-offsets-1.c: New test.
        * gcc.target/riscv/fold-mem-offsets-2.c: New test.
        * gcc.target/riscv/fold-mem-offsets-3.c: New test.
So not going into the guts of the patch yet.

From a benchmark standpoint the only two that get out of the +-0.05% range are mcf and deepsjeng (from a dynamic instruction standpoint). So from an evaluation standpoint we can probably focus our efforts there. And as we know, mcf is actually memory bound, so while improving its dynamic instruction count is good, the end performance improvement may be marginal.

As I mentioned to Philipp many months ago this reminds me a lot of a problem I've seen before. Basically register elimination emits code that can be terrible in some circumstances. So I went and poked at this again.

I think the key difference between now and what I was dealing with before is for the cases that really matter for rv64 we have a shNadd insn in the sequence. That private port I was working on before did not have shNadd (don't ask, I probably can't tell). Our target also had reg+reg addressing modes. What I can't remember was if we were trying harder to fold the constant terms into the memory reference or if we were more focused on the reg+reg. Ultimately it's probably not that important to remember -- the key is there are very significant differences in the target's capabilities which impact how we should be generating code in this case. Those differences affect the code we generate *and* the places where we can potentially get control and do some address rewriting.

A key sequence in mcf looks something like this in IRA, others have similar structure:

(insn 237 234 239 26 (set (reg:DI 377)
        (plus:DI (ashift:DI (reg:DI 200 [ _173 ])
                (const_int 3 [0x3]))
            (reg/f:DI 65 frame))) "pbeampp.c":139:15 333 {*shNadd}
     (nil))
(insn 239 237 235 26 (set (reg/f:DI 380)
        (plus:DI (reg:DI 513)
            (reg:DI 377))) "pbeampp.c":139:15 5 {adddi3}
     (expr_list:REG_DEAD (reg:DI 377)
        (expr_list:REG_EQUAL (plus:DI (reg:DI 377)
                (const_int -32768 [0xffffffffffff8000]))
            (nil))))
[ ... ]
(insn 240 235 255 26 (set (reg/f:DI 204 [ _177 ])
        (mem/f:DI (plus:DI (reg/f:DI 380)
                (const_int 280 [0x118])) [7 *_176+0 S8 A64])) 
"pbeampp.c":139:15 179 {*movdi_64bit}
     (expr_list:REG_DEAD (reg/f:DI 380)
        (nil)))


The key here is insn 237. It's generally going to be bad to have FP show up in a shadd insn because its going to be eliminated into sp+offset. That'll generate an input reload before insn 237 and we can't do any combination with the constant in insn 239.

After LRA it looks like this:

(insn 1540 234 1541 26 (set (reg:DI 11 a1 [750])
        (const_int 32768 [0x8000])) "pbeampp.c":139:15 179 {*movdi_64bit}
     (nil))
(insn 1541 1540 1611 26 (set (reg:DI 12 a2 [749])
        (plus:DI (reg:DI 11 a1 [750])
            (const_int -272 [0xfffffffffffffef0]))) "pbeampp.c":139:15 5 
{adddi3}
     (expr_list:REG_EQUAL (const_int 32496 [0x7ef0])
(nil))) (insn 1611 1541 1542 26 (set (reg:DI 29 t4 [795])
        (plus:DI (reg/f:DI 2 sp)
            (const_int 64 [0x40]))) "pbeampp.c":139:15 5 {adddi3}
     (nil))
(insn 1542 1611 237 26 (set (reg:DI 12 a2 [749])
        (plus:DI (reg:DI 12 a2 [749])
            (reg:DI 29 t4 [795]))) "pbeampp.c":139:15 5 {adddi3}
     (nil))
(insn 237 1542 239 26 (set (reg:DI 12 a2 [377])
        (plus:DI (ashift:DI (reg:DI 14 a4 [orig:200 _173 ] [200])
                (const_int 3 [0x3]))
            (reg:DI 12 a2 [749]))) "pbeampp.c":139:15 333 {*shNadd}
     (nil))
(insn 239 237 235 26 (set (reg/f:DI 12 a2 [380])
        (plus:DI (reg:DI 10 a0 [513])
            (reg:DI 12 a2 [377]))) "pbeampp.c":139:15 5 {adddi3}
     (expr_list:REG_EQUAL (plus:DI (reg:DI 12 a2 [377])
            (const_int -32768 [0xffffffffffff8000]))
(nil)))
[ ... ]
(insn 240 235 255 26 (set (reg/f:DI 14 a4 [orig:204 _177 ] [204])
        (mem/f:DI (plus:DI (reg/f:DI 12 a2 [380])
                (const_int 280 [0x118])) [7 *_176+0 S8 A64])) 
"pbeampp.c":139:15 179 {*movdi_64bit}
     (nil))


Reload/LRA made an absolute mess of that code.

But before we add a new pass (target specific or generic), I think it may be in our best interest experiment a bit of creative rewriting to preserve the shadd, but without the frame pointer. Perhaps also looking for a way to fold the constants, both the explicit ones and the implicit one from FP elimination.

This looks particularly promising:

Trying 237, 239 -> 240:
  237: r377:DI=r200:DI<<0x3+frame:DI
      REG_DEAD r200:DI
  239: r380:DI=r513:DI+r377:DI
      REG_DEAD r377:DI
      REG_EQUAL r377:DI-0x8000
  240: r204:DI=[r380:DI+0x118]
      REG_DEAD r380:DI
Failed to match this instruction:
(set (reg/f:DI 204 [ _177 ])
    (mem/f:DI (plus:DI (plus:DI (plus:DI (mult:DI (reg:DI 200 [ _173 ])
                        (const_int 8 [0x8]))
                    (reg/f:DI 65 frame))
                (reg:DI 513))
            (const_int 280 [0x118])) [7 *_176+0 S8 A64]))


We could reassociate this as

t1 = r200 * 8 + r513
t2 = frame + 280
t3 = t1 + t2
r204 = *t3

Which after elimination would be

t1 = r2000 * 8 + r513
t2 = sp + C + 280
t3 = t1 + t2
r204 = *t3

C + 280 will simplify. And we'll probably end up in the addptr3 case which I think gives us a chance to write this a bit so that we end up wit
t1 = r200 * 8 + r513
t2 = sp + t1
r204 = *(t2 + 280 + C)


Or at least I *think* we might be able to get there. Anyway, as I said, I think this deserves a bit of playing around before we jump straight into adding a new pass.

jeff

Reply via email to