Oops, I realized I forgot to fill in the test results; there were no issues 😊

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> bounces+tamar.christina=arm....@gcc.gnu.org> On Behalf Of Tamar
> Christina via Gcc-patches
> Sent: Thursday, February 9, 2023 5:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>;
> Marcus Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
> <kyrylo.tkac...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>
> Subject: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab
> instead of target hook [PR108583]
> 
> Hi All,
> 
> This replaces the custom division hook with an implementation through the
> add_highpart optab.  For NEON we implement add_highpart (addition followed by
> extraction of the high half of each element, kept in the same precision) as
> ADD + LSR.
> 
> This representation allows us to easily optimize the sequence using existing
> optimizations.  It already gets us a pretty decent sequence using SRA:
> 
>         umull   v1.8h, v0.8b, v3.8b
>         umull2  v0.8h, v0.16b, v3.16b
>         add     v5.8h, v1.8h, v2.8h
>         add     v4.8h, v0.8h, v2.8h
>         usra    v1.8h, v5.8h, 8
>         usra    v0.8h, v4.8h, 8
>         uzp2    v1.16b, v1.16b, v0.16b
> 
> To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
> is half the precision of the mode of the operation, into addhn + uaddw.  This
> is a generally good optimization on its own and gets us back to:
> 
> .L4:
>         ldr     q0, [x3]
>         umull   v1.8h, v0.8b, v5.8b
>         umull2  v0.8h, v0.16b, v5.16b
>         addhn   v3.8b, v1.8h, v4.8h
>         addhn   v2.8b, v0.8h, v4.8h
>         uaddw   v1.8h, v1.8h, v3.8b
>         uaddw   v0.8h, v0.8h, v2.8b
>         uzp2    v1.16b, v1.16b, v0.16b
>         str     q1, [x3], 16
>         cmp     x3, x4
>         bne     .L4
> 
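> As an illustrative aside (not part of the patch), the matched shape can be
> sketched in standard ACLE NEON intrinsics, assuming 16-bit lanes so that
> n == 8 (the function name below is made up for illustration); the narrowing
> add corresponds to ADDHN and the widening add to UADDW:
> 
>         #include <arm_neon.h>
> 
>         /* (a + ((b + c) >> 8)) on 16-bit lanes, purely for illustration.  */
>         uint16x8_t
>         add_shift_add (uint16x8_t a, uint16x8_t b, uint16x8_t c)
>         {
>           uint8x8_t hi = vaddhn_u16 (b, c); /* narrowing add: (b + c) >> 8.  */
>           return vaddw_u8 (a, hi);          /* widening add: a + hi.  */
>         }
> 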
> For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets
> us:
> 
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         add     z1.h, z0.h, z3.h
>         usra    z0.h, z1.h, #8
>         lsr     z0.h, z0.h, #8
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> .L1:
>         ret
> 
> and to get the optimal sequence I match (a + b) >> n (same constraint on n)
> to addhnb, which gets us to:
> 
> .L3:
>         ld1b    z0.h, p0/z, [x0, x3]
>         mul     z0.h, p1/m, z0.h, z2.h
>         addhnb  z1.b, z0.h, z3.h
>         addhnb  z0.b, z0.h, z1.h
>         st1b    z0.h, p0, [x0, x3]
>         inch    x3
>         whilelo p0.h, w3, w2
>         b.any   .L3
> 
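> As an aside, the identity both sequences ultimately implement, per the comment
> being removed from aarch64-simd.md, is x / 255 == (x + ((x + 257) >> 8)) >> 8
> with the arithmetic done in double the precision of x.  A throwaway scalar
> check along these lines (illustrative only, not part of the patch or its
> testsuite) confirms it exhaustively for 16-bit inputs:
> 
>         #include <assert.h>
>         #include <stdint.h>
> 
>         int
>         main (void)
>         {
>           /* Verify x / 255 == (x + ((x + 257) >> 8)) >> 8 for every 16-bit x,
>              computing in 32 bits so the inner addition cannot overflow.  */
>           for (uint32_t x = 0; x <= UINT16_MAX; x++)
>             assert (x / 255 == ((x + ((x + 257) >> 8)) >> 8));
>           return 0;
>         }
> 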
> Multiple RTL representations are possible for these optimizations.  I did not
> represent them using a zero_extend because the backend is quite inconsistent
> about this, and since they are unspecs we won't match them from vector ops
> anyway.  I figured maintainers would prefer this, but my maintainer ouija
> board is still out for repairs :)
> 
> There are no new tests, since correctness tests were added to the mid-end and
> codegen tests for this already exist.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-going> issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       PR target/108583
>       * config/aarch64/aarch64-simd.md
>       (@aarch64_bitmask_udiv<mode>3): Remove.
>       (<su>add<mode>3_highpart, *bitmask_shift_plus<mode>): New.
>       * config/aarch64/aarch64-sve2.md (<su>add<mode>3_highpart,
>       *bitmask_shift_plus<mode>): New.
>       (@aarch64_bitmask_udiv<mode>3): Remove.
>       * config/aarch64/aarch64.cc
>       (aarch64_vectorize_can_special_div_by_constant): Removed.
>       * config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
>       UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
>    }
>  )
> 
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations for
> +;; narrowing represented as shifts, so this representation will allow us to
> +;; further optimize this should the result require narrowing.  The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:VQN 0 "register_operand")
> +     (unspec:VQN [(match_operand:VQN 1 "register_operand")
> +                  (match_operand:VQN 2 "register_operand")]
> +                 ADD_HIGHPART))]
> +  "TARGET_SIMD"
> +{
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +                                                     shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], result, shift_vector));
> +  DONE;
> +})
> +
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:VQN 0 "register_operand" "=w")
> +     (plus:VQN
> +       (lshiftrt:VQN
> +         (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> +                   (match_operand:VQN 2 "register_operand" "w"))
> +         (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +       (match_operand:VQN 4 "register_operand" "w")))]
>    "TARGET_SIMD"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
> -  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
> -  rtx tmp2 = gen_reg_rtx (<MODE>mode);
> -  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
> +  rtx tmp = gen_reg_rtx (<VNARROWQ>mode);
> +  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
>    DONE;
> -})
> +}
> +  [(set_attr "type" "neon_add_halve<q>")]
> +)
> 
>  ;; pmul.
> 
> diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
> index 40c0728a7e6f00c395c360ce7625bc2e4a018809..ad01c1ddf9257cec951ed0c16558a3c4d856813b 100644
> --- a/gcc/config/aarch64/aarch64-sve2.md
> +++ b/gcc/config/aarch64/aarch64-sve2.md
> @@ -2317,39 +2317,51 @@ (define_insn "@aarch64_sve_<optab><mode>"
>  ;; ---- [INT] Misc optab implementations
>  ;; -------------------------------------------------------------------------
>  ;; Includes:
> -;; - aarch64_bitmask_udiv
> +;; - add_highpart
>  ;; -------------------------------------------------------------------------
> 
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; See aarch64-simd.md for bigger explanation.
> -(define_expand "@aarch64_bitmask_udiv<mode>3"
> -  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 1 "register_operand")
> -   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
> +;; Implement add_highpart as ADD + RSHIFT.  We have various optimizations for
> +;; narrowing represented as shifts, so this representation will allow us to
> +;; further optimize this should the result require narrowing.  The alternative
> +;; representation of ADDHN + UXTL is less efficient and harder to further
> +;; optimize.
> +(define_expand "<su>add<mode>3_highpart"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand")
> +     (unspec:SVE_FULL_HSDI
> +       [(match_operand:SVE_FULL_HSDI 1 "register_operand")
> +        (match_operand:SVE_FULL_HSDI 2 "register_operand")]
> +       ADD_HIGHPART))]
>    "TARGET_SVE2"
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> +  rtx result = gen_reg_rtx (<MODE>mode);
> +  int shift_amount = GET_MODE_UNIT_BITSIZE (<MODE>mode) / 2;
> +  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode,
> +                                                     shift_amount);
> +  emit_insn (gen_add<mode>3 (result, operands[1], operands[2]));
> +  emit_insn (gen_vlshr<mode>3 (operands[0], result, shift_vector));
> +  DONE;
> +})
> 
> -  rtx addend = gen_reg_rtx (<MODE>mode);
> +;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus<mode>"
> +  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
> +     (unspec:SVE_FULL_HSDI [
> +         (match_operand:<VPRED> 1 "register_operand" "Upl")
> +         (lshiftrt:SVE_FULL_HSDI
> +           (plus:SVE_FULL_HSDI
> +             (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
> +             (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
> +           (match_operand:SVE_FULL_HSDI 4 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> +        ] UNSPEC_PRED_X))]
> +  "TARGET_SVE2"
> +  "#"
> +  "&& !reload_completed"
> +  [(const_int 0)]
> +{
>    rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
> -  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
> -                              addend));
> -  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
> -                              lowpart_subreg (<MODE>mode, tmp1,
> -                                              <VNARROW>mode)));
> +  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[2],
> +                              operands[3]));
>    emit_move_insn (operands[0],
> -                  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
> +                  lowpart_subreg (<MODE>mode, tmp1, <VNARROW>mode));
>    DONE;
>  })
> 
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..8a04feb29e6bfb423a09dde2cd64853e69d0e1ba 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -24363,46 +24363,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
> 
>    return ret;
>  }
> -
> -/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
> -
> -bool
> -aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
> -                                            tree vectype, wide_int cst,
> -                                            rtx *output, rtx in0, rtx in1)
> -{
> -  if (code != TRUNC_DIV_EXPR
> -      || !TYPE_UNSIGNED (vectype))
> -    return false;
> -
> -  machine_mode mode = TYPE_MODE (vectype);
> -  unsigned int flags = aarch64_classify_vector_mode (mode);
> -  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
> -    return false;
> -
> -  int pow = wi::exact_log2 (cst + 1);
> -  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
> -  /* SVE actually has a div operator, we may have gotten here through
> -     that route.  */
> -  if (pow != (int) (element_precision (vectype) / 2)
> -      || insn_code == CODE_FOR_nothing)
> -    return false;
> -
> -  /* We can use the optimized pattern.  */
> -  if (in0 == NULL_RTX && in1 == NULL_RTX)
> -    return true;
> -
> -  gcc_assert (output);
> -
> -  expand_operand ops[3];
> -  create_output_operand (&ops[0], *output, mode);
> -  create_input_operand (&ops[1], in0, mode);
> -  create_fixed_operand (&ops[2], in1);
> -  expand_insn (insn_code, 3, ops);
> -  *output = ops[0].value;
> -  return true;
> -}
> -
>  /* Generate a byte permute mask for a register of mode MODE,
>     which has NUNITS units.  */
> 
> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
> index 6cbc97cc82c06a68259bdf4dec8a0eab230081e5..ae627ae56cbd1e8b882e596dba974e74ef396e0e 100644
> --- a/gcc/config/aarch64/iterators.md
> +++ b/gcc/config/aarch64/iterators.md
> @@ -750,6 +750,8 @@ (define_c_enum "unspec"
>      UNSPEC_REVH              ; Used in aarch64-sve.md.
>      UNSPEC_REVW              ; Used in aarch64-sve.md.
>      UNSPEC_REVBHW    ; Used in aarch64-sve.md.
> +    UNSPEC_SADD_HIGHPART ; Used in aarch64-sve.md.
> +    UNSPEC_UADD_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_SMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_UMUL_HIGHPART ; Used in aarch64-sve.md.
>      UNSPEC_FMLA              ; Used in aarch64-sve.md.
> @@ -2704,6 +2706,7 @@ (define_int_iterator UNPACK [UNSPEC_UNPACKSHI UNSPEC_UNPACKUHI
> 
>  (define_int_iterator UNPACK_UNSIGNED [UNSPEC_UNPACKULO UNSPEC_UNPACKUHI])
> 
> +(define_int_iterator ADD_HIGHPART [UNSPEC_SADD_HIGHPART UNSPEC_UADD_HIGHPART])
>  (define_int_iterator MUL_HIGHPART [UNSPEC_SMUL_HIGHPART UNSPEC_UMUL_HIGHPART])
> 
>  (define_int_iterator CLAST [UNSPEC_CLASTA UNSPEC_CLASTB])
> @@ -3342,6 +3345,8 @@ (define_int_attr su [(UNSPEC_SADDV "s")
>                    (UNSPEC_UNPACKUHI "u")
>                    (UNSPEC_UNPACKSLO "s")
>                    (UNSPEC_UNPACKULO "u")
> +                  (UNSPEC_SADD_HIGHPART "s")
> +                  (UNSPEC_UADD_HIGHPART "u")
>                    (UNSPEC_SMUL_HIGHPART "s")
>                    (UNSPEC_UMUL_HIGHPART "u")
>                    (UNSPEC_COND_FCVTZS "s")
> 
> 
> 
> 
> --
