Re: [PATCH v2] LoongArch: Define LOGICAL_OP_NON_SHORT_CIRCUIT.

Jiahao Xu Tue, 12 Dec 2023 22:17:36 -0800


On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:

On Tue, 2023-12-12 at 19:59 +0800, Jiahao Xu wrote:

I guess the problem here is that the floating-point compare instruction
is much more costly than other instructions, but this fact is not
correctly modeled yet.  Could you try
https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640012.html
where I've raised fp_add cost (which is used for estimating floating-
point compare cost) to 5 instructions and see if it solves your problem
without LOGICAL_OP_NON_SHORT_CIRCUIT?

I think this is not the same issue as the cost of floating-point
comparison instructions. The definition of LOGICAL_OP_NON_SHORT_CIRCUIT
affects how the short-circuit branch, such as (A AND-IF B), is executed,
and it is not directly related to the cost of floating-point comparison
instructions. I will try to test it using SPECCPU 2017.
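To make the distinction concrete, here is a minimal C sketch. The
function names and the range check are hypothetical and are not taken
from the patch or from the benchmark. In the short-circuit form the
second comparison is reached only when the first is false, which
generally means one conditional branch per comparison. In the
non-short-circuit form both comparisons are always evaluated and the
results are combined with a bitwise OR. The two forms give the same
answer here only because the operands have no side effects.

```c
#include <stdbool.h>

/* Short-circuit form: `x > hi` is evaluated only if `x < lo` is false. */
bool out_of_range_short_circuit(float x, float lo, float hi)
{
  return x < lo || x > hi;
}

/* Non-short-circuit form: both comparisons are always evaluated and
   their 0/1 results are combined with a bitwise OR, so no branch is
   needed between them.  */
bool out_of_range_non_short_circuit(float x, float lo, float hi)
{
  return (x < lo) | (x > hi);
}
```

Which form is faster depends on how expensive the comparisons are and
how predictable the branches are, which is the question under
discussion in this thread.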

The point is that if the cost of floating-point comparison is very high,
the middle end *should* short-circuit floating-point comparisons even if
LOGICAL_OP_NON_SHORT_CIRCUIT = 1.

I've created https://gcc.gnu.org/PR112985.

Another factor regressing the code is that we haven't modeled the
movcf2gr instruction yet, so we are not really eliding the branches as
LOGICAL_OP_NON_SHORT_CIRCUIT = 1 is supposed to do.

I made up this:

diff --git a/gcc/config/loongarch/loongarch.md b/gcc/config/loongarch/loongarch.md
index a5d0dcd65fe..84d828ebd0f 100644
--- a/gcc/config/loongarch/loongarch.md
+++ b/gcc/config/loongarch/loongarch.md
@@ -3169,6 +3169,42 @@ (define_insn "s<code>_<ANYF:mode>_using_FCCmode"
    [(set_attr "type" "fcmp")
     (set_attr "mode" "FCC")])

+(define_insn "movcf2gr<GPR:mode>"
+  [(set (match_operand:GPR 0 "register_operand" "=r")
+       (if_then_else:GPR (ne (match_operand:FCC 1 "register_operand" "z")
+                             (const_int 0))
+                         (const_int 1)
+                         (const_int 0)))]
+  "TARGET_HARD_FLOAT"
+  "movcf2gr\t%0,%1"
+  [(set_attr "type" "move")
+   (set_attr "mode" "FCC")])
+
+(define_expand "cstore<ANYF:mode>4"
+  [(set (match_operand:SI 0 "register_operand")
+       (match_operator:SI 1 "loongarch_fcmp_operator"
+         [(match_operand:ANYF 2 "register_operand")
+          (match_operand:ANYF 3 "register_operand")]))]
+  ""
+  {
+    rtx fcc = gen_reg_rtx (FCCmode);
+    rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), FCCmode,
+                             operands[2], operands[3]);
+
+    emit_insn (gen_rtx_SET (fcc, cmp));
+    if (TARGET_64BIT)
+      {
+       rtx gpr = gen_reg_rtx (DImode);
+       emit_insn (gen_movcf2grdi (gpr, fcc));
+       emit_insn (gen_rtx_SET (operands[0],
+                               lowpart_subreg (SImode, gpr, DImode)));
+      }
+    else
+      emit_insn (gen_movcf2grsi (operands[0], fcc));
+
+    DONE;
+  })
+

  ;;
  ;;  ....................
diff --git a/gcc/config/loongarch/predicates.md b/gcc/config/loongarch/predicates.md
index 9e9ce58cb53..83fea08315c 100644
--- a/gcc/config/loongarch/predicates.md
+++ b/gcc/config/loongarch/predicates.md
@@ -590,6 +590,10 @@ (define_predicate "order_operator"
  (define_predicate "loongarch_cstore_operator"
    (match_code "ne,eq,gt,gtu,ge,geu,lt,ltu,le,leu"))

+(define_predicate "loongarch_fcmp_operator"
+  (match_code
+    "unordered,uneq,unlt,unle,eq,lt,le,ordered,ltgt,ne,ge,gt,unge,ungt"))
+
  (define_predicate "small_data_pattern"
    (and (match_code "set,parallel,unspec,unspec_volatile,prefetch")
         (match_test "loongarch_small_data_pattern_p (op)")))

and now this function is compiled to (with LOGICAL_OP_NON_SHORT_CIRCUIT
= 1):

        fld.s   $f1,$r4,0
        fld.s   $f0,$r4,4
        fld.s   $f3,$r4,8
        fld.s   $f2,$r4,12
        fcmp.slt.s      $fcc1,$f0,$f3
        fcmp.sgt.s      $fcc0,$f1,$f2
        movcf2gr        $r13,$fcc1
        movcf2gr        $r12,$fcc0
        or      $r12,$r12,$r13
        bnez    $r12,.L3
        fld.s   $f4,$r4,16
        fld.s   $f5,$r4,20
        or      $r4,$r0,$r0
        fcmp.sgt.s      $fcc1,$f1,$f5
        fcmp.slt.s      $fcc0,$f0,$f4
        movcf2gr        $r12,$fcc1
        movcf2gr        $r13,$fcc0
        or      $r12,$r12,$r13
        bnez    $r12,.L2
        fcmp.sgt.s      $fcc1,$f3,$f5
        fcmp.slt.s      $fcc0,$f2,$f4
        movcf2gr        $r4,$fcc1
        movcf2gr        $r12,$fcc0
        or      $r4,$r4,$r12
        xori    $r4,$r4,1
        slli.w  $r4,$r4,0
        jr      $r1
        .align  4
.L3:
        or      $r4,$r0,$r0
        .align  4
.L2:
        jr      $r1

Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).

Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples).  We may be able to handle it via
the ext_dce pass [1] in the future.

[1]:https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
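As a rough illustration of why that slli.w is redundant, the tail of
the listing can be modeled in C. This is only a sketch: the helper
names sext_w and tail_result are hypothetical, and it assumes 64-bit
registers, where `slli.w rd, rj, 0` sign-extends the low 32 bits of
the source. Since each movcf2gr result is 0 or 1, the value after
`or` and `xori ..., 1` is also 0 or 1, and sign-extending it changes
nothing.

```c
#include <stdint.h>

/* Model of `slli.w rd, rj, 0` on a 64-bit register: keep the low
   32 bits and sign-extend them to 64 bits.  */
uint64_t sext_w(uint64_t x)
{
  uint64_t low = x & 0xffffffffu;
  return (low ^ 0x80000000u) - 0x80000000u;
}

/* Model of the `or` + `xori ..., 1` sequence applied to two
   movcf2gr results, each assumed to be 0 or 1.  */
uint64_t tail_result(uint64_t fcc1, uint64_t fcc0)
{
  return (fcc1 | fcc0) ^ 1;
}
```

For every combination of 0/1 inputs, sext_w (tail_result (a, b)) equals
tail_result (a, b), so the extension is a no-op in this sequence.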

This test was extracted from the hot functions of 526.blender_r. Setting
LOGICAL_OP_NON_SHORT_CIRCUIT to 0 resulted in a 26% decrease in dynamic
instruction count and a 13.4% performance improvement. After applying
the patch mentioned above, the assembly code looks much better with
LOGICAL_OP_NON_SHORT_CIRCUIT=1, bringing an 11% improvement to 526.
Based on this, setting LOGICAL_OP_NON_SHORT_CIRCUIT to 0 further
improved the performance of 526 by 3%. The definition of
LOGICAL_OP_NON_SHORT_CIRCUIT determines how gimple is generated, while
the optimizations you made determine how rtl is generated. They are not
conflicting and combining them would yield better results. Currently, I
have only tested it on 526, and I will continue testing its impact on
the entire SPEC 2017 suite.
