On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:
I guess the problem here is that the floating-point compare instruction is
much more costly than other instructions, but this fact is not correctly
modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised fp_add cost (which is used for estimating floating-
point compare cost) to 5 instructions and see if it solves your problem
without LOGICAL_OP_NON_SHORT_CIRCUIT?
I think this is not the same issue as the cost of floating-point
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
affects how the short-circuit branch, such as (A AND-IF B), is executed,
and it is not directly related to the cost of floating-point comparison
instructions. I will try to test it using SPECCPU 2017.
The point is that if the cost of floating-point comparison is very high, the
middle end *should* short-circuit floating-point comparisons even if
LOGICAL_OP_NON_SHORT_CIRCUIT = 1.
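
For readers following along, here is a minimal C sketch of the two evaluation
strategies under discussion.  The function names are hypothetical and this is
hand-written source, not compiler output; it only illustrates the shape the
middle end chooses between for (A OR-IF B) when both operands are free of side
effects.

```c
#include <stdbool.h>

/* Short-circuit form: the second comparison is evaluated only when the
   first one is false, at the price of an extra conditional branch.  */
static bool
any_short_circuit (float a, float b, float c, float d)
{
  if (a > b)
    return true;
  return c < d;
}

/* Non-short-circuit form: both comparisons are always evaluated and the
   results are combined with a bitwise OR, leaving a single test.  */
static bool
any_non_short_circuit (float a, float b, float c, float d)
{
  int first = a > b;
  int second = c < d;
  return (first | second) != 0;
}
```

Both functions return the same value for every input; they differ only in how
many comparisons and branches are executed, which is why the relative cost of a
floating-point compare matters for picking one.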

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we have not modeled the movcf2gr
instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.
I made up this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
    [(set_attr "type" "fcmp")
     (set_attr "mode" "FCC")])
+(define_insn "movcf2gr<GPR:mode>"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+       (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+                             (const_int 0))
+                         (const_int 1)
+                         (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+  [(set (match_operand:SI 0 "register_operand")
+       (match_operator:SI 1 "loongarch_fcmp_operator"
+         [(match_operand:ANYF 2 "register_operand")
+          (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+    rtx fcc = gen_reg_rtx (FCCmode);
+    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+                             operands[2], operands[3]);
+
+    emit_insn (gen_rtx_SET (fcc, cmp));
+    if (TARGET_64BIT)
+      {
+       rtx gpr = gen_reg_rtx (DImode);
+       emit_insn (gen_movcf2grdi (gpr, fcc));
+       emit_insn (gen_rtx_SET (operands[0],
+                               lowpart_subreg (SImode, gpr, DImode)));
+      }
+    else
+      emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+    DONE;
+  })
+
  ;;
  ;;  ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
  (define_predicate "loongarch_cstore_operator"
    (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))
+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
  (define_predicate "small_data_pattern"
    (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
         (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

        fld.s   $f1,$r4,0
        fld.s   $f0,$r4,4
        fld.s   $f3,$r4,8
        fld.s   $f2,$r4,12
        fcmp.slt.s      $fcc1,$f0,$f3
        fcmp.sgt.s      $fcc0,$f1,$f2
        movcf2gr        $r13,$fcc1
        movcf2gr        $r12,$fcc0
        or      $r12,$r12,$r13
        bnez    $r12,.L3
        fld.s   $f4,$r4,16
        fld.s   $f5,$r4,20
        or      $r4,$r0,$r0
        fcmp.sgt.s      $fcc1,$f1,$f5
        fcmp.slt.s      $fcc0,$f0,$f4
        movcf2gr        $r12,$fcc1
        movcf2gr        $r13,$fcc0
        or      $r12,$r12,$r13
        bnez    $r12,.L2
        fcmp.sgt.s      $fcc1,$f3,$f5
        fcmp.slt.s      $fcc0,$f2,$f4
        movcf2gr        $r4,$fcc1
        movcf2gr        $r12,$fcc0
        or      $r4,$r4,$r12
        xori    $r4,$r4,1
        slli.w  $r4,$r4,0
        jr      $r1
        .align  4
.L3:
        or      $r4,$r0,$r0
        .align  4
.L2:
        jr      $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).
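
The test function itself is not quoted in this message.  Reading the assembly
above backwards, it corresponds to C source of roughly the following shape.
This is a reconstruction: the function name and variable names are
hypothetical, and only the loads, comparisons, and return values are taken
from the listing.

```c
/* Reconstructed from the assembly listing; names are assumptions.
   Returns 0 as soon as any of the six comparisons holds, else 1.  */
int
short_circuit_test (const float *a)
{
  float t1x = a[0];   /* $f1 */
  float t2x = a[1];   /* $f0 */
  float t1y = a[2];   /* $f3 */
  float t2y = a[3];   /* $f2 */
  float t1z = a[4];   /* $f4 */
  float t2z = a[5];   /* $f5 */

  if (t1x > t2y || t2x < t1y      /* first fcmp pair, branch to .L3 */
      || t1x > t2z || t2x < t1z   /* second fcmp pair, branch to .L2 */
      || t1y > t2z || t2y < t1z)  /* third fcmp pair, folded into xori */
    return 0;

  return 1;
}
```

In the listing, each pair of comparisons is merged with a single "or" on the
movcf2gr results, so six source-level conditions cost only two conditional
branches.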

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle it via
the ext_dce pass [1] in the future.

[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html

This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying the
patch above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT = 1, bringing an 11% improvement to
526.blender_r. On top of that patch, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0
improved the performance of 526.blender_r by a further 3%.

The definition of LOGICAL_OP_NON_SHORT_CIRCUIT determines how GIMPLE is
generated, while the optimization you made determines how RTL is generated.
The two do not conflict, and combining them yields better results. So far I
have only tested 526.blender_r; I will continue testing the impact on the
entire SPEC CPU 2017 suite.
