[Bug middle-end/108016] RISC-V：Bad codegen in scalar code comparing to LLVM

alexey.merzlyakov at samsung dot com via Gcc-bugs Wed, 05 Feb 2025 03:28:58 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016


Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexey.merzlyakov at samsung dot com

--- Comment #4 from Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> ---
I suspect there are three separate items related to the case reported in this
ticket.

=== Item1 ===

This is related to Jeff's analysis above, where an unnecessary stack
store-load pair is generated.
The simplified C testcase:

  struct value {
    unsigned int a;
    unsigned char b;
  };

  struct value func(unsigned int a, unsigned int b) {
    struct value out;
    out.a = a;
    out.b = a > b;
    return out;
  }

What I am wondering here is: even if the optimized tree goes through memory,
why does DSE not optimize out the unnecessary store-load chain into something
like:

  (insn 16 14 17 2 (set (mem/c:QI (plus:DI (reg/f:DI 65 frame)
                (const_int -4 [0xfffffffffffffffc])))
        (subreg:QI (reg:DI 140) 0))
  (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
        (zero_extend:DI (mem/c:SI (plus:DI (reg/f:DI 65 frame)
                (const_int -4 [0xfffffffffffffffc])))))
  ->
  (insn 41 14 17 2 (set (reg:DI 154) (reg:DI 140)))
  (insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
        (zero_extend:DI (subreg/s/u:QI (reg:DI 154) 0))) ?

At least in the testcase above, DSE does perform a similar optimization for
another store-load chain, the one at [sp-0x8].

=== Item2 ===

Even if we get rid of the unnecessary store-load generation, stack is still
being allocated for the function frame. The assembly generated for the
following testcase contains an unnecessary stack allocation in the
prologue/epilogue (thanks to my colleague Ankit Mahato for helping with the
reduced testcase for this situation):

  struct value {
    unsigned int a;
    unsigned int b;
  };

  struct value func(void) {
    struct value out;
    out.a = 0;
    out.b = 1;
    return out;
  }

As in the previous case, the tree-optimized dump contains MEM usage. However,
for this testcase the memory store-load was optimized out by DSE1, and the
function's stack frame no longer holds any locals. GCC nevertheless still
generates an unnecessary stack allocation:

  func():
        addi    sp,sp,-16
        li      a0,1
        slli    a0,a0,32
        addi    sp,sp,16
        jr      ra
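
To make clear why no stack slot is needed here, the sketch below models the
returned register value in plain C. It assumes little-endian RV64 where the
8-byte struct comes back in a0 with field a in bits 0..31 and field b in bits
32..63; pack_in_a0 and check_func_result are hypothetical helpers introduced
only for illustration and are not part of GCC or of the testcase.

```c
#include <assert.h>
#include <stdint.h>

struct value {
  unsigned int a;
  unsigned int b;
};

/* The reduced testcase from above. */
struct value func(void) {
  struct value out;
  out.a = 0;
  out.b = 1;
  return out;
}

/* Hypothetical helper: models how the struct would sit in a0, assuming
   little-endian RV64 (a in the low 32 bits, b in the high 32 bits). */
uint64_t pack_in_a0(struct value v) {
  return (uint64_t)v.a | ((uint64_t)v.b << 32);
}

/* Hypothetical self-check: the packed value is the constant that
   "li a0,1; slli a0,a0,32" builds without touching memory. */
void check_func_result(void) {
  assert(pack_in_a0(func()) == ((uint64_t)1 << 32));
}
```

Under these assumptions the whole result is a compile-time constant in a
register, so the two sp adjustments in the listing above do no useful work.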

It looks like the frame size, once calculated in RTL expand, is never changed
afterwards. Even after DSE removed all unnecessary stack usage, the stack
offset value was not corrected. I suppose that if DSE changes the stack usage
(for whatever reason), it could also correct the "frame_offset" value, which
determines the SP adjustment later emitted in pro_and_epilogue.

To try that, I added a simple and ugly function that optimizes out the
"frame_offset" value (after checking that no insns contain RTX_FRAME) at the
end of the DSE pass, and it works for me on the above testcase.

The main question here: should GCC change the frame size after RTL expansion,
or is it not intended to work like this?

=== Item3 ===

The generation of the extra sext.w instruction does not seem to be related to
the previous items.
The following testcase uses the ADD_OVERFLOW built-in function:

  unsigned int func(unsigned int a, unsigned int b) {
    unsigned int out;
    unsigned int overflow = __builtin_add_overflow(a, b, &out);
    return overflow & out;
  }

It produces an extra sign_extend, whereas LLVM does not:

  func(unsigned int, unsigned int):
        addw    a1,a0,a1
        sext.w  a5,a1
        sltu    a0,a5,a0
        and     a0,a0,a1
        ret

The uaddv<mode>4 implementation in riscv.md contains "riscv_emit_binary (PLUS,
operands[0], operands[1], operands[2])", which produces the sum of the two
values in SI mode:

  (insn 11 10 12 (set (reg:SI 141)
        (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
                (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0)))

The sign-extension of this sum is done separately by calling "emit_insn
(gen_extend_insn (t4, operands[0], DImode, SImode, 0))":

  (insn 12 11 13 (set (reg:DI 143)
        (sign_extend:DI (reg:SI 141)))

The (a+b) sum in DI mode is used for the expansion of the conditional branch:

  (jump_insn 13 12 14 (set (pc)
        (if_then_else (ltu (reg:DI 143 [a + b])
                        (reg:DI 142 [a]))
                (label_ref 16)
                (pc)))

while the output of "uaddv<mode>4", operands[0] == (a+b):SI, is used
further in the code.
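
As a rough C model of the three insns above (a sketch, not the actual
expander): sext_w, overflow_via_sext and overflow_builtin are hypothetical
helper names, and the model assumes that the [a] operand of the comparison is
the sign-extended 64-bit form of a. Sign-extension preserves unsigned ordering
of 32-bit values, which is why the DI-mode ltu gives the same answer as the
built-in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models sext.w: sign-extend a 32-bit value to 64 bits. */
uint64_t sext_w(uint32_t x) {
  uint64_t high = (x & 0x80000000u) ? 0xFFFFFFFF00000000ull : 0;
  return high | x;
}

/* Overflow flag computed the way the expanded insns do it. */
bool overflow_via_sext(uint32_t a, uint32_t b) {
  uint32_t sum = a + b;          /* insn 11: plus:SI (wraps modulo 2^32) */
  uint64_t sum_di = sext_w(sum); /* insn 12: sign_extend:DI */
  uint64_t a_di = sext_w(a);     /* assumption: a is held sign-extended */
  return sum_di < a_di;          /* insn 13: ltu in DI mode */
}

/* Reference: the GCC/Clang built-in used in the testcase. */
bool overflow_builtin(uint32_t a, uint32_t b) {
  unsigned int out;
  return __builtin_add_overflow(a, b, &out);
}
```

In this model the 32-bit sum and its sign-extended copy are two separate
values with separate uses, which mirrors the situation in the RTL: the SI
result feeds later code, while the DI copy feeds only the comparison.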

It appears that instructions generated this way cannot be optimized out by the
subsequent fwprop, crprop, combine, etc. pipeline, because of the extra
dependency from outside the PLUS-SIGN_EXTEND chain. Thus, sext.w remains
untouched.

However, if the "riscv_emit_binary (PLUS, operands[0], operands[1],
operands[2]);" call in "uaddv<mode>4" is changed to something that produces
its result in DI mode, say "emit_insn (gen_add3_insn (operands[0],
operands[1], operands[2]));", then expand produces the following unoptimized
code for ADD_OVERFLOW:

  (insn 8 7 9 (set (reg:DI 142)
          (sign_extend:DI (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
                  (subreg/s/u:SI (reg/v:DI 140 [ b ]) 0)))) "add_overflow3.c":5:19 -1
  (insn 9 8 10 (set (reg:SI 141) <- optimized-out by fwprop1
          (subreg/s/u:SI (reg:DI 142) 0)) "add_overflow3.c":5:19 -1
  (insn 10 9 11 (set (reg:DI 143) <- optimized-out by crprop1 (DCE)
          (sign_extend:DI (reg:SI 141))) "add_overflow3.c":5:19 -1

This unoptimized code is then cleaned up by fwprop and crprop, and the final
code does not contain the sext.w instruction. I do not love this tricky
approach, as we are initially expanding unoptimized code for ADD_OVERFLOW in
the hope that later optimizations will get rid of the unnecessary stuff. But I
have not yet found a better way in the RTL optimizations to handle the extra
sign_extend in the initial case. Any suggestions? It feels like an even number
of errors leads to the right result, while an odd number leaves one error
behind.
