[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-19 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #18 from YunQiang Su  ---
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/654956.html

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-18 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #17 from YunQiang Su  ---
I sent the patch here.
So we may need some more tests.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #16 from Siarhei Volkau  ---
Might it be that LoongArch has a register-reuse dependency?

I observed similar behavior on XBurst with a load/store/reuse pattern:

e.g. this code
LW  $v0, 0($t1)   # XBurst load latency is 4, but it has a bypass
SW  $v0, 0($t2)   # to the subsequent store operation, thus no stall here
ADD $v0, $t1, $t2 # but it stalls here, because of register reuse,
                  # until the LW op has completed.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #15 from Siarhei Volkau  ---
Created attachment 58437
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58437&action=edit
application to test performance of shift

Here is the test application (MIPS32-specific) I wrote.

It detects execution cycles and extra pipeline stalls for SLL, if any take
place.

For XBurst 1 (jz4725b) the results are as follows:

`SLL to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL to use latency test with nop` execution median: 196250 ns, min: 196166 ns

`SLL to branch latency test` execution median: 196250 ns, min: 196166 ns
`SLL to branch latency test with nop` execution median: 224000 ns, min: 224000 ns

`SLL by 7 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 15 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 23 to use latency test` execution median: 168417 ns, min: 168416 ns
`SLL by 31 to use latency test` execution median: 168417 ns, min: 168416 ns

`LUI>AND>BEQZ reference test` execution median: 196250 ns, min: 196166 ns
`SLL>BGEZ reference test` execution median: 168417 ns, min: 168416 ns



What it means:
`SLL to use latency test` at 168417 ns vs `.. with nop` at 196250 ns means
there are no extra stall cycles between SLL and a subsequent use by an ALU
operation (the inserted NOP costs a full extra cycle, so there was no stall
for it to hide).

The `SLL to branch latency test` and `.. with nop` results mean there are no
extra stall cycles between SLL and a subsequent use by a branch operation.

The `SLL by N` results mean that SLL execution time doesn't depend on the
shift amount.

And finally, the reference test results show that the SLL>BGEZ approach is
faster than LUI>AND>BEQZ.
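
(For illustration only, this is not the attached application: a minimal C sketch of the measurement idea, timing a long run of an SLL-plus-consumer sequence and the same run with a NOP inserted, then comparing. The iteration count, timing source, and lack of unrolling are assumptions.)

```
/* Minimal sketch of an "SLL to use" latency test; NOT the attached
   application.  Build for MIPS32, e.g. -O2 -mips32.  A real test would
   unroll the measured sequence to amortize loop overhead and would report
   median/min over several runs.  */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define ITERS 1000000

static long long run (int with_nop)
{
  struct timespec t0, t1;
  int x = 0x12345;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < ITERS; i++)
    {
      if (with_nop)
        /* SLL, an unrelated NOP, then an ALU use of the result.  */
        __asm__ volatile ("sll %0, %0, 15\n\tnop\n\taddu %0, %0, %1"
                          : "+r" (x) : "r" (i));
      else
        /* SLL immediately followed by an ALU use of the result.  */
        __asm__ volatile ("sll %0, %0, 15\n\taddu %0, %0, %1"
                          : "+r" (x) : "r" (i));
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  return (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
}

int main (void)
{
  printf ("SLL to use:          %lld ns\n", run (0));
  printf ("SLL to use with nop: %lld ns\n", run (1));
  return 0;
}
```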

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #14 from YunQiang Su  ---
And it seems that the performance of SLL is related to its operand.

Just iterate from 0 to 1e9:

```
00000b00 <f32>:
 b00:   000223c0        sll     a0,v0,0xf   <-- this code is somewhat wrong:
                                                normal code would access a0
                                                here, not v0; v0 will be
                                                100 or 1000.
 b04:   04810003        bgez    a0,b14 <f32+0x14>
 b08:   00000000        nop
 b0c:   03e00008        jr      ra
 b10:   240203e8        li      v0,1000
 b14:   03e00008        jr      ra
 b18:   24020064        li      v0,100
 b1c:   00000000        nop
```

is slower than

```
00000b00 <f32>:
 b00:   000223c0        sll     a0,a0,0xf
 b04:   04810003        bgez    a0,b14 <f32+0x14>
 b08:   00000000        nop
 b0c:   03e00008        jr      ra
 b10:   240203e8        li      v0,1000
 b14:   03e00008        jr      ra
 b18:   24020064        li      v0,100
 b1c:   00000000        nop
```
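
(The driver loop is not shown here; below is a minimal sketch of what "iterate from 0 to 1e9" presumably looks like, assuming f32 is the function from comment #9. It also shows why reading v0 matters: v0 still holds the previous call's return value, 100 or 1000.)

```
/* Hypothetical driver, not the actual benchmark: call f32 for every value
   from 0 to 1e9.  Between calls, v0 holds the previous return value, which
   is exactly the register the "wrong" variant above reads in its SLL.  */
extern int f32 (int a);

long long drive (void)
{
  long long sum = 0;
  for (int i = 0; i < 1000000000; i++)
    sum += f32 (i);
  return sum;
}
```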


I have no idea how to make a trade-off here.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-15 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #13 from YunQiang Su  ---
I tried to insert
li  $3, 500
li  $5, 500
between SLL/BGEZ and between LUI+AND/BNE.

The latter is still somewhat faster on Loongson 3A4000.
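
(A sketch of the modified SLL/BGEZ sequence with the two filler LIs, written as GCC extended asm for concreteness; the label, constraints, and surrounding harness are assumptions, not the actual benchmark.)

```
/* Two independent LI instructions inserted between SLL and BGEZ (and,
   analogously, between LUI+AND and BNE) so the branch does not consume the
   shift result in the very next slot.  Sketch only; the assembler fills the
   branch delay slot.  */
static inline int sll_bgez_with_fillers (int a)
{
  int r;
  __asm__ volatile ("sll   %0, %1, 15\n\t"
                    "li    $3, 500\n\t"
                    "li    $5, 500\n\t"
                    "bgez  %0, 1f\n"
                    "1:"
                    : "=&r" (r)
                    : "r" (a)
                    : "$3", "$5");
  return r;
}
```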

I noticed something like this in the 74K core's software manual:

The 74K core’s ALU is pipelined. Some ALU instructions complete the operation
and bypass the results in this cycle. These instructions are referred to as
single-cycle ops and they include all logical instructions (AND, ANDI, OR, ORI,
XOR, XORI, LUI), some shift instructions (SLL sa<=8, SRL 31<=sa<=25), and some
arithmetic instructions (ADD rt=0, ADDU rt=0, SLT, SLTI, SLTU, SLTIU, SEH, SEB,
ZEH, ZEB). In addition, add instructions (ADD, ADDU, ADDI, ADDIU) complete the
operation and bypass results to the ALU pipe in this cycle.


I guess it means that if sa>8, SLL may be somewhat slower.
On Loongson 3A4000, the threshold seems to be 20/21. It may mean that we
should be careful about this for 64-bit.

Can you run a test on XBurst 1?

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-14 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #12 from Siarhei Volkau  ---
Highly likely it's because of a data dependency, not the direct cost of shift
operations on LoongArch, although I can't find information to prove that.
So, I guess it still might give a performance benefit in cases where the
scheduler can put some instruction(s) between SLL and BGEZ.

Since you have access to the hardware, you can measure the performance of two
variants:
1) SLL+BGEZ
2) SLL+NOT+BGEZ
If their performance is equal, then I'm correct and the scheduling automaton
for GS464 needs to be fixed.

From my side I can confirm that SLL+BGEZ is faster than LUI+AND+BEQ on Ingenic
XBurst 1 cores.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-14 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

YunQiang Su  changed:

           What    |Removed        |Added
------------------------------------------------
     Resolution    |---            |INVALID
         Status    |UNCONFIRMED    |RESOLVED

--- Comment #11 from YunQiang Su  ---
For -Os, let's track it with this one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115473

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-14 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #10 from YunQiang Su  ---
I did some performance testing.

sll+bgez is somewhat slower than lui+and+beqz.
On Loongson 3A4000, it is about 10% slower.

So this "optimization" makes sense only for -Os.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-12 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #9 from YunQiang Su  ---
I see; about condmove: it has been broken since GCC 14.

int
f32(int a)
{
  int p = (a & (1<<16));
  if (p)
    return 100;
  else
    return 1000;
}
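
(For reference, "condmove" here means selecting between the two constants without a branch, e.g. MOVZ/MOVN on MIPS. The snippet below is only a hand-written illustration of that branchless shape, not GCC's actual output for any version.)

```
/* Branchless equivalent of f32 above: build an all-ones/all-zeros mask from
   bit 16 and use it to select between 100 and 1000.  Illustrative only.  */
int f32_branchless (int a)
{
  int mask = -((a >> 16) & 1);           /* -1 if bit 16 is set, else 0 */
  return (100 & mask) | (1000 & ~mask);
}
```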

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-07 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #8 from Siarhei Volkau  ---
Created attachment 58377
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58377&action=edit
condmove testcase

Tested with current GCC master branch:

- Working with -Os is confirmed.
- The condmove issue is present in GCC 11 but not in current master. Even for
GCC 11 it is a very rare case, although I found one relatively simple to
reproduce: it is an excerpt from Python 3.8.x, reduced as much as I could.
Compilation flags tested: {-O2|-Os} -mips32 -DNDEBUG -mbranch-cost={1|10}

So, in my opinion, the patch you propose is perfectly fine.
The condmove issue doesn't seem relevant anymore.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-06 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #7 from YunQiang Su  ---
Ohh, I need to add "&&" before "!reload_completed".

It seems to work with -Os.
Can you give me your test code?

I cannot come up with a condmove C test case that it breaks.


With a constant less than 0x, ANDI+BEQ/BNE does get generated with -Os,
but not for -O2.
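
(For illustration, a hypothetical ANDI-friendly case: a bit below bit 16 fits ANDI's 16-bit immediate, so neither LUI nor the SLL trick is needed. Function name made up.)

```
/* Bit 7 fits ANDI's 16-bit immediate, so the test can be done as
     andi  $t0, $a0, 0x80
     bnez/beqz ...
   Hypothetical example, not from the thread's testcases.  */
int bit7_set (int a)
{
  return (a & (1 << 7)) ? 100 : 1000;
}
```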

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-06 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #6 from Siarhei Volkau  ---
Well, it works mostly well.
However, it still has issues, addressed in my patch:
 1) Doesn't work for -Os: highly likely a costing issue.
 2) Breaks condmoves, as mine does. I have no idea how to avoid that, though.
 3) Overlaps the preferable ANDI+BEQ/BNE cases (which don't break condmoves).

I think it will be okay once 1 and 3 are fixed.

PS: tested by applying the patch on GCC 11; will try with upstream this
weekend.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-05 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #5 from YunQiang Su  ---
I copied the RTL pattern from RISC-V, and it seems to work:

```
--- a/gcc/config/mips/mips.md
+++ b/gcc/config/mips/mips.md
@@ -6253,6 +6253,40 @@ (define_insn "*branch_bit<bbv><mode>_inverted"
 }
   [(set_attr "type" "branch")
    (set_attr "branch_likely" "no")])
+
+(define_insn_and_split "*branch_on_bit<mode>"
+  [(set (pc)
+       (if_then_else
+        (match_operator 0 "equality_operator"
+         [(zero_extract:GPR (match_operand:GPR 2 "register_operand" "d")
+                            (const_int 1)
+                            (match_operand:GPR 3 "const_int_operand"))
+          (const_int 0)])
+        (label_ref (match_operand 1))
+        (pc)))]
+  "!ISA_HAS_BBIT && !ISA_HAS_EXT_INS && !TARGET_MIPS16"
+  "#"
+  "!reload_completed"
+  [(set (match_dup 4)
+       (ashift:GPR (match_dup 2) (match_dup 3)))
+   (set (pc)
+       (if_then_else
+        (match_op_dup 0 [(match_dup 4) (const_int 0)])
+        (label_ref (match_operand 1))
+        (pc)))]
+{
+  int shift = GET_MODE_BITSIZE (<MODE>mode) - 1 - INTVAL (operands[3]);
+  operands[3] = GEN_INT (shift);
+  operands[4] = gen_reg_rtx (<MODE>mode);
+
+  if (GET_CODE (operands[0]) == EQ)
+    operands[0] = gen_rtx_GE (<MODE>mode, operands[4], const0_rtx);
+  else
+    operands[0] = gen_rtx_LT (<MODE>mode, operands[4], const0_rtx);
+}
+[(set_attr "type" "branch")])
+
+
```
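
(For context, a small hypothetical example of the C shape this pattern targets; the split should turn the LUI+AND+BEQZ sequence for it into SLL+BGEZ. The function names are made up.)

```
/* Single-bit test feeding a conditional branch: combine represents the test
   as (zero_extract a 1 16) != 0, which the pattern above splits into a shift
   of the bit into the sign position plus bgez/bltz.  */
extern void g (void);

int bit16_set (int a)
{
  if (a & (1 << 16))
    g ();
  return 0;
}
```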

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-05 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #4 from YunQiang Su  ---
Ohh, RISC-V has solved this problem in a recent release.
So we can just do similar work.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-04 Thread lis8215 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #3 from Siarhei Volkau  ---
I know that the patch breaks condmove cases, that's why it is silly.

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-04 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #2 from YunQiang Su  ---
(In reply to YunQiang Su from comment #1)
> RISC-V has this problem, too.
> Maybe we can try to handle it in the `combine` pass, though it may not be easy.
> It may break some code like:
> 
> ```
> int f1();
> int f2();
> 
> int f(int a) {
>     int p = (a & 0x8);
>     if (p)
>         return p;
>     else
>         return f2();
> }
> ```
> 
> And in fact your patch also breaks it.

Ohh, this comment is not correct

[Bug target/111376] missed optimization of one bit test on MIPS32r1

2024-06-04 Thread syq at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376

--- Comment #1 from YunQiang Su  ---
RISC-V has this problem, too.
Maybe we can try to handle it in the `combine` pass, though it may not be easy.
It may break some code like:

```
int f1();
int f2();

int f(int a) {
    int p = (a & 0x8);
    if (p)
        return p;
    else
        return f2();
}
```

And in fact your patch also breaks it.