Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Richard Sandiford Thu, 25 Jul 2024 04:58:39 -0700

Jennifer Schmitz <[email protected]> writes:
> Thank you for the feedback. I added checks for SCALAR_INT_MODE_P for the reg 
> operands of the compare and if-then-else expressions. As it is not legal to 
> have different modes in the operand registers, I only added one check for 
> each of the expressions.
> The updated patch was bootstrapped and tested again.
> Best,
> Jennifer
>
> From 8da609be99fece8130cf1429bd938b2a26c6672b Mon Sep 17 00:00:00 2001
> From: Jennifer Schmitz <[email protected]>
> Date: Wed, 24 Jul 2024 06:13:59 -0700
> Subject: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2
>
> According to the Neoverse V2 Software Optimization Guide (section 4.14), the
> instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been
> implemented so far. This patch implements and tests the two fusion pairs.
>
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> There was also no non-noise impact on SPEC CPU2017 benchmark.
> OK for mainline?
>
> Signed-off-by: Jennifer Schmitz <[email protected]>
>
> gcc/
>
>       * config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement
>       fusion logic.
>       * config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry.
>       (cmp+cset): Likewise.
>       * config/aarch64/tuning_models/neoversev2.h: Enable logic in
>       field fusible_ops.
>
> gcc/testsuite/
>
>       * gcc.target/aarch64/fuse_cmp_csel.c: New test.
>       * gcc.target/aarch64/fuse_cmp_cset.c: Likewise.


Thanks for the update.

It looks from a quick scan like the main three instructions associated
with single-set integer COMPAREs are CMP, CMN and TST.  TST could be
distinguished from CMP and CMN based on get_attr_type (), although it
looks like:

(define_insn "*and<mode>_compare0"
  [(set (reg:CC_Z CC_REGNUM)
        (compare:CC_Z
         (match_operand:SHORT 0 "register_operand" "r")
         (const_int 0)))]
  ""
  "tst\\t%<w>0, <short_mask>"
  [(set_attr "type" "alus_imm")]
)

should use logics_imm instead of alus_imm.

Alternatively, we could add a new attribute for "compare_type" and use
that.  That would make the test in aarch_macro_fusion_pair_p slightly
simpler, since it could use get_attr_compare_type without having to
look at the pattern of prev_set.  But there's a danger that we'd
forget to add the new attribute for new comparison instructions.

I did wonder whether we could simply punt on CC_Zmode, but that's
not a reliable test.

But I suppose the counter-argument to my questions above is: how bad
would it be if we fused CMN and TST?  They are at least plausible
fusions, so it probably doesn't matter if we include them too.

So:

> ---
>  gcc/config/aarch64/aarch64-fusion-pairs.def   |  2 ++
>  gcc/config/aarch64/aarch64.cc                 | 22 +++++++++++++
>  gcc/config/aarch64/tuning_models/neoversev2.h |  5 ++-
>  .../gcc.target/aarch64/fuse_cmp_csel.c        | 33 +++++++++++++++++++
>  .../gcc.target/aarch64/fuse_cmp_cset.c        | 31 +++++++++++++++++
>  5 files changed, 92 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
>
> diff --git a/gcc/config/aarch64/aarch64-fusion-pairs.def 
> b/gcc/config/aarch64/aarch64-fusion-pairs.def
> index 9a43b0c8065..bf5e85ba8fe 100644
> --- a/gcc/config/aarch64/aarch64-fusion-pairs.def
> +++ b/gcc/config/aarch64/aarch64-fusion-pairs.def
> @@ -37,5 +37,7 @@ AARCH64_FUSION_PAIR ("aes+aesmc", AES_AESMC)
>  AARCH64_FUSION_PAIR ("alu+branch", ALU_BRANCH)
>  AARCH64_FUSION_PAIR ("alu+cbz", ALU_CBZ)
>  AARCH64_FUSION_PAIR ("addsub_2reg_const1", ADDSUB_2REG_CONST1)
> +AARCH64_FUSION_PAIR ("cmp+csel", CMP_CSEL)
> +AARCH64_FUSION_PAIR ("cmp+cset", CMP_CSET)
>  
>  #undef AARCH64_FUSION_PAIR
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 9e51236ce9f..7a0351f7dac 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -27348,6 +27348,28 @@ aarch_macro_fusion_pair_p (rtx_insn *prev, rtx_insn 
> *curr)
>        && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
>      return true;
>  
> +  /* Fuse CMP and CSEL.  */

How about adding something like:

  The test for CMP is not exact, and includes things like CMN and TST
  as well.  However, fusing with those instructions doesn't seem
  inherently bad.

> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSEL)
> +      && prev_set && curr_set
> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
> +      && GET_CODE (SET_SRC (curr_set)) == IF_THEN_ELSE
> +      && REG_P (XEXP (SET_SRC (curr_set), 1))
> +      && REG_P (XEXP (SET_SRC (curr_set), 2))

I think we could use aarch64_reg_or_zero instead of REG_P for the last
two lines, which would allow selects involving W/XZR as well.

Looks good to me otherwise, but I'd be interested to hear what others
think about the CMP match.

Thanks,
Richard

> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (curr_set), 1)))
> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
> +    return true;
> +
> +  /* Fuse CMP and CSET.  */
> +  if (aarch64_fusion_enabled_p (AARCH64_FUSE_CMP_CSET)
> +      && prev_set && curr_set
> +      && GET_CODE (SET_SRC (prev_set)) == COMPARE
> +      && GET_RTX_CLASS (GET_CODE (SET_SRC (curr_set))) == RTX_COMPARE
> +      && REG_P (SET_DEST (curr_set))
> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (SET_SRC (prev_set), 0)))
> +      && reg_referenced_p (SET_DEST (prev_set), PATTERN (curr)))
> +    return true;
> +
>    /* Fuse flag-setting ALU instructions and conditional branch.  */
>    if (aarch64_fusion_enabled_p (AARCH64_FUSE_ALU_BRANCH)
>        && any_condjump_p (curr))
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h 
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index f76e4ef358f..ae99fab22d8 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -221,7 +221,10 @@ static const struct tune_params neoversev2_tunings =
>      2 /* store_pred.  */
>    }, /* memmov_cost.  */
>    5, /* issue_rate  */
> -  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */
> +  (AARCH64_FUSE_AES_AESMC
> +   | AARCH64_FUSE_CMP_BRANCH
> +   | AARCH64_FUSE_CMP_CSEL
> +   | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */
>    "32:16",   /* function_align.  */
>    "4",               /* jump_align.  */
>    "32:16",   /* loop_align.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c 
> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
> new file mode 100644
> index 00000000000..85f302bab98
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_csel.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +/*
> +** f1:
> +**   ...
> +**   cmp     w[0-9]+, w[0-9]+
> +**   csel    w[0-9]+, w[0-9]+, w[0-9]+, le
> +**   ret
> +*/
> +int f1 (int a, int b, int c)
> +{
> +  int cmp = a > b;
> +  int add1 = c + 3;
> +  int add2 = c + 8;
> +  return cmp ? add1 : add2;
> +}
> +
> +/*
> +** f2:
> +**   ...
> +**   cmp     x[0-9]+, x[0-9]+
> +**   csel    x[0-9]+, x[0-9]+, x[0-9]+, le
> +**   ret
> +*/
> +long long f2 (long long a, long long b, long long c)
> +{
> + long long cmp = a > b;
> +  long long add1 = c + 3;
> +  long long add2 = c + 8;
> +  return cmp ? add1 : add2;
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c 
> b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
> new file mode 100644
> index 00000000000..04f1ce2773b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/fuse_cmp_cset.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -mcpu=neoverse-v2" } */
> +/* { dg-final { check-function-bodies "**" "" } } */
> +
> +/*
> +** f1:
> +**   cmp     w[0-9]+, w[0-9]+
> +**   cset    w[0-9]+, gt
> +**   ...
> +*/
> +int g;
> +int f1 (int a, int b)
> +{
> +  int cmp = a > b;
> +  g = cmp + 1;
> +  return cmp;
> +}
> +
> +/*
> +** f2:
> +**   cmp     x[0-9]+, x[0-9]+
> +**   cset    x[0-9]+, gt
> +**   ...
> +*/
> +long long h;
> +long long f2 (long long a, long long b)
> +{
> +  long long cmp = a > b;
> +  h = cmp + 1;
> +  return cmp;
> +}

Re: [PATCH] aarch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2

Reply via email to