https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016
Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |alexey.merzlyakov at samsung dot com
--- Comment #4 from Alexey Merzlyakov <alexey.merzlyakov at samsung dot com> ---
I suspect there might be 3 items related to the case reported in this
ticket.
=== Item1 ===
This is related to the above analysis provided by Jeff, where an unnecessary
stack store-load is generated.
The simplified C-testcase:
struct value {
    unsigned int a;
    unsigned char b;
};
struct value func(unsigned int a, unsigned int b) {
    struct value out;
    out.a = a;
    out.b = a > b;
    return out;
}
Here I am just wondering: even if the optimized tree goes through memory, why
does DSE not optimize the unnecessary store-load chain into something like:
(insn 16 14 17 2 (set (mem/c:QI (plus:DI (reg/f:DI 65 frame)
(const_int -4 [0xfffffffffffffffc])))
(subreg:QI (reg:DI 140) 0))
(insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
(zero_extend:DI (mem/c:SI (plus:DI (reg/f:DI 65 frame)
(const_int -4 [0xfffffffffffffffc])))))
->
(insn 41 14 17 2 (set (reg:DI 154) (reg:DI 140)))
(insn 27 21 28 2 (set (reg:DI 148 [ D.2377+4 ])
(zero_extend:DI (subreg/s/u:QI (reg:DI 154) 0))) ?
At least in the testcase above, it performs a similar optimization for another
store-load chain, the one at [sp-0x8].
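For reference, the register-only result one would hope for in Item1 can be
modelled in plain C. This is only a sketch under stated assumptions: a
little-endian target, a 4-byte unsigned int, sizeof(struct value) == 8 with
"b" at byte offset 4, and padding treated as zero. The names pack_value and
image_of are hypothetical helpers for illustration, not anything from GCC.

```c
#include <stdint.h>
#include <string.h>

struct value {
    unsigned int a;
    unsigned char b;
};

/* The testcase from the report. */
struct value func(unsigned int a, unsigned int b) {
    struct value out;
    out.a = a;
    out.b = a > b;
    return out;
}

/* Hypothetical helper: builds the same 8-byte image with shifts only,
   i.e. without any store-load through the stack.
   Assumes little-endian layout with b at byte offset 4. */
uint64_t pack_value(unsigned int a, unsigned int b) {
    return (uint64_t)a | ((uint64_t)(a > b) << 32);
}

/* Hypothetical helper: copies the two fields of v into a zeroed 64-bit
   image, so that padding bytes do not influence the comparison. */
uint64_t image_of(struct value v) {
    uint64_t bits = 0;
    memcpy(&bits, &v.a, sizeof v.a);                      /* bytes 0..3 */
    memcpy((unsigned char *)&bits + 4, &v.b, sizeof v.b); /* byte 4 */
    return bits;
}
```

Under those assumptions, image_of(func(a, b)) and pack_value(a, b) agree for
any inputs, which is the value the store-load chain is computing.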
=== Item2 ===
Even if we get rid of the unnecessary store-load generation, stack is still
being allocated for the func frame. The assembly generated for the following
test-case contains an unnecessary stack allocation in the prologue/epilogue
(thanks to my colleague Ankit Mahato for helping with the reduced testcase for
this situation):
struct value {
    unsigned int a;
    unsigned int b;
};
struct value func(void) {
    struct value out;
    out.a = 0;
    out.b = 1;
    return out;
}
Here, as in the previous case, the tree-optimized dump contains MEM usage.
However, for this test-case the memory store-load was optimized out by DSE1,
and the function's stack no longer holds any locals. But GCC still generates
an unnecessary stack allocation:
func():
        addi sp,sp,-16
        li a0,1
        slli a0,a0,32
        addi sp,sp,16
        jr ra
It looks like the frame size, once calculated in expand-rtl, is never changed
afterwards. Even after DSE removed all unnecessary stack usage, the stack
offset value was not corrected. I suppose that if DSE changes the usage of the
stack (for whatever reason), it could also correct the "frame_offset" value,
which is responsible for the subsequent SP size allocation in
pro_and_epilogue.
To try that, I added a simple and ugly function that resets the "frame_offset"
value (after checking that no insns contain RTX_FRAME) to the end of the DSE
optimization, and it works for me on the above test-case.
The main question here: should GCC change the frame size after the RTX has
been expanded, or is it not intended to work like this?
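The idea of the experiment described above can be sketched as a toy model in
C. This is not GCC code and does not use GCC internals: toy_insn,
toy_function, frame_is_unused and shrink_frame_after_dse are all hypothetical
names, and frame_size merely stands in for the role that "frame_offset" plays
in the description.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an insn: records only whether it references the frame. */
struct toy_insn {
    bool uses_frame;
};

/* Toy stand-in for per-function state. frame_size models the value that
   later decides how much stack the prologue/epilogue allocates. */
struct toy_function {
    const struct toy_insn *insns;
    size_t n_insns;
    long frame_size;
};

/* Returns true when no remaining insn references the frame. */
bool frame_is_unused(const struct toy_function *fn) {
    for (size_t i = 0; i < fn->n_insns; i++)
        if (fn->insns[i].uses_frame)
            return false;
    return true;
}

/* Meant to run at the end of the (modelled) DSE pass: if nothing uses the
   frame any more, drop the frame size so no stack gets allocated. */
void shrink_frame_after_dse(struct toy_function *fn) {
    if (frame_is_unused(fn))
        fn->frame_size = 0;
}
```

The model leaves the frame size alone as soon as a single insn still touches
the frame, which is the conservative behaviour one would want from such a
check.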
=== Item3 ===
Generation of the extra sext.w instruction does not seem to be related to the
previous items.
The following test-case contains ADD_OVERFLOW built-in function:
unsigned int func(unsigned int a, unsigned int b) {
    unsigned int out;
    unsigned int overflow = __builtin_add_overflow(a, b, &out);
    return overflow & out;
}
It produces an extra sign_extend, whereas LLVM does not:
func(unsigned int, unsigned int):
        addw a1,a0,a1
        sext.w a5,a1
        sltu a0,a5,a0
        and a0,a0,a1
        ret
The riscv.md > uaddv<mode>4 implementation contains "riscv_emit_binary (PLUS,
operands[0], operands[1], operands[2])" which produces the sum of two values in
SI-mode:
(insn 11 10 12 (set (reg:SI 141)
(plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
(subreg/s/u:SI (reg/v:DI 140 [ b ]) 0)))
The sign-extension of this sum is then done separately by calling "emit_insn
(gen_extend_insn (t4, operands[0], DImode, SImode, 0))":
(insn 12 11 13 (set (reg:DI 143)
(sign_extend:DI (reg:SI 141)))
The (a+b) sum in DI-mode is used for the expansion of the conditional branch:
(jump_insn 13 12 14 (set (pc)
(if_then_else (ltu (reg:DI 143 [a + b])
(reg:DI 142 [a]))
(label_ref 16)
(pc)))
while the output of "uaddv<mode>4", operands[0] == (a+b):SI, is used further
in the code.
It appears that instructions generated this way cannot be optimized out by the
later fwprop, crprop, combine, etc. passes, because of an extra dependency
outside the PLUS-SIGN_EXTEND chain. Thus, sext.w remains untouched.
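That the sext.w is redundant here can be checked with a small C model of the
generated sequence. This is a sketch only: sext_w and addw are hand-written
models of the RV64 instructions, the argument registers are assumed to hold
sign-extended 32-bit values on entry, and func_with_sext, func_without_sext
and func_ref are names made up for this illustration.

```c
#include <stdint.h>

/* Model of sext.w: sign-extend the low 32 bits of x to 64 bits. */
uint64_t sext_w(uint64_t x) {
    uint64_t low = x & UINT64_C(0xffffffff);
    return (low ^ UINT64_C(0x80000000)) - UINT64_C(0x80000000);
}

/* Model of addw: 32-bit add whose result is sign-extended to 64 bits. */
uint64_t addw(uint64_t x, uint64_t y) {
    return sext_w(x + y);
}

/* The sequence from the report: addw, sext.w, sltu, and. */
uint64_t func_with_sext(uint64_t a0, uint64_t a1) {
    a1 = addw(a0, a1);
    uint64_t a5 = sext_w(a1); /* no-op: a1 is already sign-extended */
    a0 = a5 < a0;             /* sltu */
    return a0 & a1;
}

/* The same sequence with the sext.w dropped. */
uint64_t func_without_sext(uint64_t a0, uint64_t a1) {
    a1 = addw(a0, a1);
    a0 = a1 < a0;             /* sltu */
    return a0 & a1;
}

/* Source-level reference, same body as the testcase in the report. */
unsigned int func_ref(unsigned int a, unsigned int b) {
    unsigned int out;
    unsigned int overflow = __builtin_add_overflow(a, b, &out);
    return overflow & out;
}
```

Since addw already yields a sign-extended result, applying sext_w to it again
changes nothing, so both modelled sequences return the same value and match
func_ref when called with sext_w(a) and sext_w(b).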
However, if we change the "riscv_emit_binary (PLUS, operands[0], operands[1],
operands[2]);" call in "uaddv<mode>4" to something producing its result in
DI-mode, say "emit_insn (gen_add3_insn (operands[0], operands[1],
operands[2]));", expand produces the following unoptimized code for
ADD_OVERFLOW:
(insn 8 7 9 (set (reg:DI 142)
(sign_extend:DI (plus:SI (subreg/s/u:SI (reg/v:DI 139 [ a ]) 0)
(subreg/s/u:SI (reg/v:DI 140 [ b ]) 0))))
"add_overflow3.c":5:19 -1
(insn 9 8 10 (set (reg:SI 141) <- optimized-out by fwprop1
(subreg/s/u:SI (reg:DI 142) 0)) "add_overflow3.c":5:19 -1
(insn 10 9 11 (set (reg:DI 143) <- optimized-out by crprop1 (DCE)
(sign_extend:DI (reg:SI 141))) "add_overflow3.c":5:19 -1
This unoptimized code is then cleaned up by fwprop and crprop, and the final
code does not contain the sext.w instruction. I do not love this tricky
approach, as we initially expand unoptimized code for ADD_OVERFLOW in the hope
that later optimizations will get rid of the unnecessary stuff. But I have not
yet found a better way in the RTL optimizations to handle the extra
sign_extend in the initial case. Any suggestions? It feels like an even number
of errors leads to the right result, while an odd number leaves one error
behind.