[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #18 from YunQiang Su ---
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/654956.html
--- Comment #17 from YunQiang Su ---
I have sent the patch here. So we may need some more testing.
--- Comment #16 from Siarhei Volkau ---
Might it be that LoongArch has a register-reuse dependency? I observed similar
behavior on XBurst with a load/store/reuse pattern, e.g. this code:

```
LW  $v0, 0($t1)   # XBurst load latency is 4, but it has a bypass
SW  $v0, 0($t2)   #   to the subsequent store operation, thus no stall here
ADD $v0, $t1, $t2 # but it stalls here, because of register reuse,
                  #   until the LW op has completed
```
--- Comment #15 from Siarhei Volkau ---
Created attachment 58437
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58437
application to test performance of shift

Here is the test application (MIPS32-specific) I wrote. It can detect
execution cycles and extra pipeline stalls for SLL, if they take place.

For XBurst 1 (jz4725b) the results are the following:

```
`SLL to use latency test`               median: 168417 ns, min: 168416 ns
`SLL to use latency test with nop`      median: 196250 ns, min: 196166 ns
`SLL to branch latency test`            median: 196250 ns, min: 196166 ns
`SLL to branch latency test with nop`   median: 224000 ns, min: 224000 ns
`SLL by 7 to use latency test`          median: 168417 ns, min: 168416 ns
`SLL by 15 to use latency test`         median: 168417 ns, min: 168416 ns
`SLL by 23 to use latency test`         median: 168417 ns, min: 168416 ns
`SLL by 31 to use latency test`         median: 168417 ns, min: 168416 ns
`LUI>AND>BEQZ reference test`           median: 196250 ns, min: 196166 ns
`SLL>BGEZ reference test`               median: 168417 ns, min: 168416 ns
```

And what it means:
- `SLL to use latency test` at 168417 ns vs `... with nop` at 196250 ns means
  there are no extra stall cycles between SLL and a subsequent ALU use.
- The `SLL to branch latency test` and `... with nop` results mean there are
  no extra stall cycles between SLL and a subsequent branch.
- The `SLL by N` results mean that SLL execution time does not depend on the
  shift amount.
- Finally, the reference test results show that the SLL>BGEZ approach is
  faster than LUI>AND>BEQZ.
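For reference, the two reference sequences exercise the same bit-16 predicate. A minimal C rendering of both shapes (the function names are illustrative, not from the attachment):

```c
#include <stdint.h>

/* LUI+AND+BEQZ shape: materialize the 0x10000 mask with LUI, AND, compare. */
int bit16_lui_and(uint32_t a)
{
    return (a & 0x10000u) != 0;
}

/* SLL+BGEZ shape: shift bit 16 into the sign position (31 - 16 = 15),
   then test the sign with a BGEZ-style comparison. */
int bit16_sll(uint32_t a)
{
    return (int32_t)(a << 15) < 0;
}
```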
--- Comment #14 from YunQiang Su ---
And it seems that the performance of SLL is related to the operand. Just
iterating from 0 to 1e9, this:

```
0b00 <f32>:
 b00: 000223c0  sll   a0,v0,0xf  <-- this code is wrong as normal code: it
                                     reads v0 here, and v0 will be 100 or 1000.
 b04: 04810003  bgez  a0,b14
 b08: 00000000  nop
 b0c: 03e00008  jr    ra
 b10: 240203e8  li    v0,1000
 b14: 03e00008  jr    ra
 b18: 24020064  li    v0,100
 b1c: 00000000  nop
```

is slower than:

```
0b00 <f32>:
 b00: 000223c0  sll   a0,a0,0xf
 b04: 04810003  bgez  a0,b14
 b08: 00000000  nop
 b0c: 03e00008  jr    ra
 b10: 240203e8  li    v0,1000
 b14: 03e00008  jr    ra
 b18: 24020064  li    v0,100
 b1c: 00000000  nop
```

I have no idea how to make a trade-off here.
--- Comment #13 from YunQiang Su ---
I tried to insert

    li $3, 500
    li $5, 500

between SLL/BGEZ and between LUI+AND/BNE. The latter is still somewhat faster
on Loongson 3A4000.

I notice something like this in the 74K's software manual:

    The 74K core's ALU is pipelined. Some ALU instructions complete the
    operation and bypass the results in this cycle. These instructions are
    referred to as single-cycle ops and they include all logical instructions
    (AND, ANDI, OR, ORI, XOR, XORI, LUI), some shift instructions
    (SLL sa<=8, SRL 25<=sa<=31), and some arithmetic instructions
    (ADD rt=0, ADDU rt=0, SLT, SLTI, SLTU, SLTIU, SEH, SEB, ZEH, ZEB).
    In addition, add instructions (ADD, ADDU, ADDI, ADDIU) complete the
    operation and bypass results to the ALU pipe in this cycle.

I guess it means that if sa>8, SLL may be somewhat slow. On Loongson 3A4000,
the threshold seems to be 20/21. It may mean that we should be careful about
this for 64-bit. Can you run a test on XBurst 1?
--- Comment #12 from Siarhei Volkau ---
Highly likely it's because of a data dependency, and not the direct cost of
shift operations on LoongArch, although I can't find information to prove
that. So, I guess it still might give a performance benefit in cases where
the scheduler can put some instruction(s) between the SLL and the BGEZ.

Since you have access to the hardware, you can measure the performance of two
variants:
1) SLL+BGEZ
2) SLL+NOT+BGEZ
If their performance is equal, then I'm correct, and the scheduling automaton
for GS464 seems to need fixing.

From my side, I can confirm that SLL+BGEZ is faster than LUI+AND+BEQ on
Ingenic XBurst 1 cores.
YunQiang Su changed:

           What        |Removed      |Added
           ---------------------------------------
           Resolution  |---          |INVALID
           Status      |UNCONFIRMED  |RESOLVED

--- Comment #11 from YunQiang Su ---
For -Os, let's track it with this one:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115473
--- Comment #10 from YunQiang Su ---
I ran some performance tests: sll+bgez is somewhat slower than lui+and+beqz.
On Loongson 3A4000, it is about 10% slower. So this "optimization" makes
sense only for -Os.
--- Comment #9 from YunQiang Su ---
I see. About condmove: it has been broken since GCC 14.

```
int f32(int a)
{
    int p = (a & (1 << 16));
    if (p)
        return 100;
    else
        return 1000;
}
```
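For what it's worth, the shift form this report asks for is functionally equivalent to the masked test above: bit 16 reaches the sign bit after a left shift by 31 - 16 = 15. `f32_shift` below is a hypothetical hand-written C rendering of the SLL+BGEZ sequence, useful as a sanity check:

```c
#include <stdint.h>

int f32(int a)
{
    int p = (a & (1 << 16));
    if (p)
        return 100;
    else
        return 1000;
}

/* Same predicate via the shift form: bit 16 lands in bit 31 after << 15,
   so a signed "< 0" test replaces the mask-and-compare. */
int f32_shift(int a)
{
    if ((int32_t)((uint32_t)a << 15) < 0)
        return 100;
    else
        return 1000;
}
```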
--- Comment #8 from Siarhei Volkau ---
Created attachment 58377
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58377
condmove testcase

Tested with the current GCC master branch:
- Working with -Os: confirmed.
- The condmove issue is present in GCC 11 but not in the current master.

Even for GCC 11 it is a very rare case, although I found one relatively
simple reproducer: it is an excerpt from Python 3.8.x, reduced as much as I
can. Compilation flags tested: {-O2|-Os} -mips32 -DNDEBUG -mbranch-cost={1|10}

So, in my opinion, the patch you propose is perfectly fine. The condmove
issue seems not relevant anymore.
--- Comment #7 from YunQiang Su ---
Ohh, I need to add "&&" before "!reload_completed". It seems to work with
-Os then. Can you give me your test code? I cannot construct a C testcase
whose condmove it breaks. With a constant less than 0x, ANDI+BEQ/BNE does
get generated with -Os, but not with -O2.
--- Comment #6 from Siarhei Volkau ---
Well, it works mostly well. However, it still has issues that my patch
addressed:
1) It doesn't work for -Os: highly likely a costing issue.
2) It breaks condmoves, as mine does. I have no idea how to avoid that,
   though.
3) It takes over the preferable ANDI+BEQ/BNE cases (which don't break
   condmoves).
I think it will be okay once 1 and 3 are fixed.

PS: tested by applying the patch to GCC 11; I will try with upstream this
weekend.
--- Comment #5 from YunQiang Su ---
I copied the RTL pattern from RISC-V, and it seems to work:

```
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -6253,6 +6253,40 @@ (define_insn "*branch_bit_inverted"
 }
   [(set_attr "type" "branch")
    (set_attr "branch_likely" "no")])
+
+(define_insn_and_split "*branch_on_bit"
+  [(set (pc)
+	(if_then_else
+	 (match_operator 0 "equality_operator"
+	  [(zero_extract:GPR (match_operand:GPR 2 "register_operand" "d")
+			     (const_int 1)
+			     (match_operand:GPR 3 "const_int_operand"))
+	   (const_int 0)])
+	 (label_ref (match_operand 1))
+	 (pc)))]
+  "!ISA_HAS_BBIT && !ISA_HAS_EXT_INS && !TARGET_MIPS16"
+  "#"
+  "!reload_completed"
+  [(set (match_dup 4)
+	(ashift:GPR (match_dup 2) (match_dup 3)))
+   (set (pc)
+	(if_then_else
+	 (match_op_dup 0 [(match_dup 4) (const_int 0)])
+	 (label_ref (match_operand 1))
+	 (pc)))]
+{
+  int shift = GET_MODE_BITSIZE (<MODE>mode) - 1 - INTVAL (operands[3]);
+  operands[3] = GEN_INT (shift);
+  operands[4] = gen_reg_rtx (<MODE>mode);
+
+  if (GET_CODE (operands[0]) == EQ)
+    operands[0] = gen_rtx_GE (<MODE>mode, operands[4], const0_rtx);
+  else
+    operands[0] = gen_rtx_LT (<MODE>mode, operands[4], const0_rtx);
+}
+  [(set_attr "type" "branch")])
```
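The split relies on the identity that testing bit `pos` for zero is the same as shifting it into the sign position (shift = bitsize - 1 - pos) and testing the sign, with EQ mapping to GE and NE to LT. A quick C check of that identity for 32-bit values (function names are illustrative only):

```c
#include <stdint.h>

/* (x & (1 << pos)) == 0, the zero_extract form before the split. */
int bit_clear_direct(uint32_t x, unsigned pos)
{
    return (x & (1u << pos)) == 0;
}

/* (int32_t)(x << (31 - pos)) >= 0, the ashift + GE form after the split. */
int bit_clear_shifted(uint32_t x, unsigned pos)
{
    return (int32_t)(x << (31u - pos)) >= 0;
}

/* Check the identity over all bit positions for a few sample values. */
int identity_holds(void)
{
    const uint32_t xs[] = { 0u, 1u, 0x10000u, 0x80000000u, 0xdeadbeefu, ~0u };
    for (unsigned pos = 0; pos < 32; pos++)
        for (unsigned i = 0; i < sizeof xs / sizeof xs[0]; i++)
            if (bit_clear_direct(xs[i], pos) != bit_clear_shifted(xs[i], pos))
                return 0;
    return 1;
}
```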
--- Comment #4 from YunQiang Su ---
Ohh, RISC-V has solved this problem in a recent release. So we can just do
similar work.
--- Comment #3 from Siarhei Volkau ---
I know that the patch breaks condmove cases; that's why it is silly.
--- Comment #2 from YunQiang Su ---
(In reply to YunQiang Su from comment #1)
> RISC-V has this problem, too.
> Maybe we can try to combine it in the `combine` pass, while it may not be
> easy. It may break some code like:
>
> ```
> int f1();
> int f2();
>
> int f(int a) {
>   int p = (a & 0x8);
>   if (p)
>     return p;
>   else
>     return f2();
> }
> ```
>
> And in fact your patch also breaks it.

Ohh, this comment is not correct.
[Bug target/111376] missed optimization of one bit test on MIPS32r1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376 --- Comment #1 from YunQiang Su --- RISC-V has this problem, too. Maybe we can try to combine it in `combine` pass, while it may be not easy. It may break some code like: ``` int f1(); int f2(); int f(int a) { int p = (a & 0x8); if (p) return p; else return f2(); } ``` And in fact your patch also break it.