Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-03-08 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
>> -----Original Message-----
>> > + (match_operand:VQN 4 "register_operand" "w")))]
>> >"TARGET_SIMD"
>> > +  "#"
>> > +  "&& true"
>> > +  [(const_int 0)]
>> >  {
>> > -  unsigned HOST_WIDE_INT size
>> > -    = (1ULL << GET_MODE_UNIT_BITSIZE (mode)) - 1;
>> > -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
>> > -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
>> > -    FAIL;
>> > -
>> > -  rtx addend = gen_reg_rtx (mode);
>> > -  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
>> > -  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
>> > -  rtx tmp1 = gen_reg_rtx (mode);
>> > -  rtx tmp2 = gen_reg_rtx (mode);
>> > -  emit_insn (gen_aarch64_addhn (tmp1, operands[1], addend));
>> > -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (mode);
>> > -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode, bitsize);
>> > -  emit_insn (gen_aarch64_uaddw (tmp2, operands[1], tmp1));
>> > -  emit_insn (gen_aarch64_simd_lshr (operands[0], tmp2, shift_vector));
>> > +  rtx tmp;
>> > +  if (can_create_pseudo_p ())
>> > +    tmp = gen_reg_rtx (mode);
>> > +  else
>> > +    tmp = gen_rtx_REG (mode, REGNO (operands[0]));
>> > +  emit_insn (gen_aarch64_addhn (tmp, operands[1], operands[2]));
>> > +  emit_insn (gen_aarch64_uaddw (operands[0], operands[4], tmp));
>> >DONE;
>> >  })
>> 
>> In the previous review, I said:
>> 
>>   However, IIUC, this pattern would only be formed from combining
>>   three distinct patterns.  Is that right?  If so, we should be able
>>   to handle it as a plain define_split, with no define_insn.
>>   That should make things simpler, so would be worth trying before
>>   the changes I mentioned above.
>> 
>> Did you try that?  I still think it'd be preferable to defining a new insn.
>
> Yes I did! Sorry, I forgot to mention that.  When I made it a split, for some
> reason it wasn't matching anymore.

I was hoping for a bit more detail than that :-)  But it seems that
the reason is that we match SRA first, so the final combination
is a 2-to-1 rather than 3-to-1.

So yeah, the patch is OK with the other changes mentioned in the review.

Thanks,
Richard


RE: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-03-08 Thread Tamar Christina via Gcc-patches
> -----Original Message-----
> From: Richard Sandiford 
> Sent: Wednesday, March 8, 2023 9:18 AM
> To: Tamar Christina 
> Cc: gcc-patches@gcc.gnu.org; nd ; Richard Earnshaw
> ; Marcus Shawcroft
> ; Kyrylo Tkachov 
> Subject: Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new
> optab instead of target hook [PR108583]
> 
> Tamar Christina  writes:
> > Ping,
> >
> > And updating the hook.
> >
> > There are no new tests, as new correctness tests were added to the
> > mid-end and codegen tests for this already exist.
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > PR target/108583
> > * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
> > (*bitmask_shift_plus): New.
> > * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus): New.
> > (@aarch64_bitmask_udiv3): Remove.
> > * config/aarch64/aarch64.cc
> > (aarch64_vectorize_can_special_div_by_constant,
> > TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
> > (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
> > aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
> >
> > --- inline copy of patch ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> > index 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -4867,60 +4867,27 @@ (define_expand "aarch64_hn2"
> >}
> >  )
> >
> > -;; div optimizations using narrowings
> > -;; we can do the division e.g. shorts by 255 faster by calculating it as
> > -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> > -;; double the precision of x.
> > -;;
> > -;; If we imagine a short as being composed of two blocks of bytes then
> > -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> > -;; adding 1 to each sub component:
> > -;;
> > -;;      short value of 16-bits
> > -;; ┌──────────────┬────────────────┐
> > -;; │              │                │
> > -;; └──────────────┴────────────────┘
> > -;;   8-bit part1 ▲  8-bit part2   ▲
> > -;;               │                │
> > -;;               │                │
> > -;;              +1               +1
> > -;;
> > -;; after the first addition, we have to shift right by 8, and narrow the
> > -;; results back to a byte.  Remember that the addition must be done in
> > -;; double the precision of the input.  Since 8 is half the size of a short
> > -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> > -;; does the addition in a wider precision and narrows back to a byte.  The
> > -;; shift itself is implicit in the operation as it writes back only the top
> > -;; half of the result. i.e. bits 2*esize-1:esize.
> > -;;
> > -;; Since we have narrowed the result of the first part back to a byte, for
> > -;; the second addition we can use a widening addition, uaddw.
> > -;;
> > -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> > -;;
> > -;; The shift is later optimized by combine to a uzp2 with movi #0.
> > -(define_expand "@aarch64_bitmask_udiv3"
> > -  [(match_operand:VQN 0 "register_operand")
> > -   (match_operand:VQN 1 "register_operand")
> > -   (match_operand:VQN 2 "immediate_operand")]
> > +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> > +(define_insn_and_split "*bitmask_shift_plus"
> > +  [(set (match_operand:VQN 0 "register_operand" "=")
> > +   (plus:VQN
> > + (lshiftrt:VQN
> > +   (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> > + (match_operand:VQN 2 "register_operand" "w"))
> > +   (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
> 
> I guess this is personal preference, sorry, but I think we should drop the
> constraint.  The predicate does the real check, and the operand is never
> reloaded, so "Dr" isn't any more helpful than an empty constraint, and IMO
> can be confusing.
> 
> > 

Re: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-03-08 Thread Richard Sandiford via Gcc-patches
Tamar Christina  writes:
> Ping,
>
> And updating the hook.
>
> There are no new tests, as new correctness tests were added to the mid-end and
> codegen tests for this already exist.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR target/108583
> * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): 
> Remove.
> (*bitmask_shift_plus): New.
> * config/aarch64/aarch64-sve2.md (*bitmask_shift_plus): New.
> (@aarch64_bitmask_udiv3): Remove.
> * config/aarch64/aarch64.cc
> (aarch64_vectorize_can_special_div_by_constant,
> TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
> (TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
> aarch64_vectorize_preferred_div_as_shifts_over_mult): New.
>
> --- inline copy of patch ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,60 +4867,27 @@ (define_expand "aarch64_hn2"
>}
>  )
>
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;      short value of 16-bits
> -;; ┌──────────────┬────────────────┐
> -;; │              │                │
> -;; └──────────────┴────────────────┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;               │                │
> -;;               │                │
> -;;              +1               +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction in AArch64, addhn which also
> -;; does the addition in a wider precision and narrows back to a byte.  The
> -;; shift itself is implicit in the operation as it writes back only the top
> -;; half of the result. i.e. bits 2*esize-1:esize.
> -;;
> -;; Since we have narrowed the result of the first part back to a byte, for
> -;; the second addition we can use a widening addition, uaddw.
> -;;
> -;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
> -;;
> -;; The shift is later optimized by combine to a uzp2 with movi #0.
> -(define_expand "@aarch64_bitmask_udiv3"
> -  [(match_operand:VQN 0 "register_operand")
> -   (match_operand:VQN 1 "register_operand")
> -   (match_operand:VQN 2 "immediate_operand")]
> +;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
> +(define_insn_and_split "*bitmask_shift_plus"
> +  [(set (match_operand:VQN 0 "register_operand" "=")
> +   (plus:VQN
> + (lshiftrt:VQN
> +   (plus:VQN (match_operand:VQN 1 "register_operand" "w")
> + (match_operand:VQN 2 "register_operand" "w"))
> +   (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))

I guess this is personal preference, sorry, but I think we should drop
the constraint.  The predicate does the real check, and the operand is
never reloaded, so "Dr" isn't any more helpful than an empty constraint,
and IMO can be confusing.

> + (match_operand:VQN 4 "register_operand" "w")))]
>"TARGET_SIMD"
> +  "#"
> +  "&& true"
> +  [(const_int 0)]
>  {
> -  unsigned HOST_WIDE_INT size
> -    = (1ULL << GET_MODE_UNIT_BITSIZE (mode)) - 1;
> -  rtx elt = unwrap_const_vec_duplicate (operands[2]);
> -  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
> -    FAIL;
> -
> -  rtx addend = gen_reg_rtx (mode);
> -  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
> -  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
> -  rtx tmp1 = gen_reg_rtx (mode);
> -  rtx tmp2 = gen_reg_rtx (mode);
> -  emit_insn (gen_aarch64_addhn (tmp1, operands[1], addend));
> -  unsigned bitsize = GET_MODE_UNIT_BITSIZE (mode);
> -  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode, bitsize);
> -  emit_insn (gen_aarch64_uaddw (tmp2, operands[1], tmp1));
> -  emit_insn (gen_aarch64_simd_lshr (operands[0], tmp2, shift_vector));
> +  rtx tmp;
> +  if (can_create_pseudo_p ())
> +    tmp = gen_reg_rtx (mode);
> +  else
> +    tmp = gen_rtx_REG (mode, REGNO (operands[0]));
> +  emit_insn (gen_aarch64_addhn (tmp, operands[1], operands[2]));
> +  emit_insn (gen_aarch64_uaddw (operands[0], operands[4], tmp));
>DONE;
>  })

In the previous review, I said:

  However, IIUC, this pattern would only be formed from combining
  

RE: [PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-03-06 Thread Tamar Christina via Gcc-patches
Ping,

And updating the hook.

There are no new tests, as new correctness tests were added to the mid-end and
codegen tests for this already exist.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/108583
* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
(*bitmask_shift_plus): New.
* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus): New.
(@aarch64_bitmask_udiv3): Remove.
* config/aarch64/aarch64.cc
(aarch64_vectorize_can_special_div_by_constant,
TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

--- inline copy of patch ---

diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,60 +4867,27 @@ (define_expand "aarch64_hn2"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_operand")
-   (match_operand:VQN 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus"
+  [(set (match_operand:VQN 0 "register_operand" "=")
+   (plus:VQN
+ (lshiftrt:VQN
+   (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+ (match_operand:VQN 2 "register_operand" "w"))
+   (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+ (match_operand:VQN 4 "register_operand" "w")))]
   "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (mode, 1);
-  emit_move_insn (addend, lowpart_subreg (mode, val, mode));
-  rtx tmp1 = gen_reg_rtx (mode);
-  rtx tmp2 = gen_reg_rtx (mode);
-  emit_insn (gen_aarch64_addhn (tmp1, operands[1], addend));
-  unsigned bitsize = GET_MODE_UNIT_BITSIZE (mode);
-  rtx shift_vector = aarch64_simd_gen_const_vector_dup (mode, bitsize);
-  emit_insn (gen_aarch64_uaddw (tmp2, operands[1], tmp1));
-  emit_insn (gen_aarch64_simd_lshr (operands[0], tmp2, shift_vector));
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (mode);
+  else
+    tmp = gen_rtx_REG (mode, REGNO (operands[0]));
+  emit_insn (gen_aarch64_addhn (tmp, operands[1], operands[2]));
+  emit_insn (gen_aarch64_uaddw (operands[0], operands[4], tmp));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/aarch64-sve2.md 
b/gcc/config/aarch64/aarch64-sve2.md
index 
40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d6873877386222d56144cc115e3953a61
 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,41 +2317,24 @@ (define_insn "@aarch64_sve_"
 ;;  [INT] Misc optab implementations
 ;; -
 ;; Includes:
-;; - aarch64_bitmask_udiv
+;; - bitmask_shift_plus
 ;; -
 
-;; div 

[PATCH 4/4]AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-02-27 Thread Tamar Christina via Gcc-patches
Hi All,

This replaces the custom division hook with an implementation through
add_highpart.  For NEON we implement add_highpart (an addition followed by
extraction of the upper half of each element, in the same precision) as ADD + LSR.

This representation lets us optimize the sequence easily using existing
patterns.  That already gives a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b
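
For readers who want to check the arithmetic behind this sequence, here is a minimal scalar sketch in C.  It assumes the addend held in v2 is 257, which this cover letter does not state; the value comes from the comment that the patch removes.  The function names are made up for illustration and are not part of GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: plain truncating division by 255.  */
static uint32_t div255_ref(uint32_t x)
{
    return x / 255u;
}

/* Shift form from the removed comment: (x + ((x + 257) >> 8)) >> 8.
   Evaluated in 32 bits so that no intermediate sum can wrap, i.e.
   "double the precision" of a 16-bit input.  */
static uint32_t div255_shift(uint32_t x)
{
    assert(x <= 0xFFFFu);            /* only 16-bit inputs are modelled */
    uint32_t t = (x + 257u) >> 8;    /* ADD followed by the shift in USRA */
    return (x + t) >> 8;             /* accumulate, then take the high byte (UZP2) */
}

/* Count the 16-bit inputs on which the two forms disagree.  */
static unsigned div255_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t x = 0; x <= 0xFFFFu; ++x)
        if (div255_shift(x) != div255_ref(x))
            ++bad;
    return bad;
}
```

Calling div255_mismatches () from a small driver should report no disagreements, since the identity holds for every 16-bit input when the sums are done in wider precision.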

To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
is half the precision of the mode of the operation, into addhn + uaddw.  This is
a generally good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4
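
The equivalence that the addhn + uaddw match relies on can be sketched per lane in C.  This is only a scalar model for 16-bit elements with n = 8; the helper names are hypothetical and the lane semantics are my reading of the two instructions, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of ADDHN: wrapping 16-bit add, keep only the top byte.  */
static uint8_t addhn_lane(uint16_t b, uint16_t c)
{
    return (uint8_t)((uint16_t)(b + c) >> 8);
}

/* One lane of UADDW: zero-extend the byte and add it to a 16-bit value.  */
static uint16_t uaddw_lane(uint16_t a, uint8_t t)
{
    return (uint16_t)(a + t);
}

/* The matched form: a + ((b + c) >> 8), entirely in 16 bits.  */
static uint16_t shift_plus_lane(uint16_t a, uint16_t b, uint16_t c)
{
    return (uint16_t)(a + (uint16_t)((uint16_t)(b + c) >> 8));
}

/* Compare both forms on a strided sample of inputs.  */
static unsigned addhn_uaddw_mismatches(void)
{
    static const uint16_t addends[] = { 0u, 257u, 0x8000u, 0xFFFFu };
    unsigned bad = 0;
    for (uint32_t a = 0; a <= 0xFFFFu; a += 251u)
        for (uint32_t b = 0; b <= 0xFFFFu; b += 127u)
            for (unsigned i = 0; i < 4; ++i) {
                uint16_t c = addends[i];
                uint16_t fused = uaddw_lane((uint16_t)a,
                                            addhn_lane((uint16_t)b, c));
                if (fused != shift_plus_lane((uint16_t)a, (uint16_t)b, c))
                    ++bad;
            }
    return bad;
}

/* Convenience wrapper for use from a driver.  */
static void addhn_uaddw_self_check(void)
{
    assert(addhn_uaddw_mismatches() == 0);
}
```

The two forms agree even when b + c wraps, because the logical shift by half the element width leaves a value that fits in the narrow element, which is exactly what addhn produces.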

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
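
A scalar model of the two back-to-back addhnb instructions, again as a hedged sketch: it assumes z3 holds 257 in every halfword (not stated above) and that addhnb leaves the narrowed result in the bottom byte of each halfword container with the top byte zeroed.  All names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* One halfword container of ADDHNB: the high byte of the wrapping 16-bit
   sum lands in the bottom byte and the top byte is zero.  */
static uint16_t addhnb_lane(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint16_t)(a + b) >> 8);
}

/* Model of one loop iteration for a single element: widen-load a byte,
   multiply by a byte-sized factor, then apply ADDHNB twice.  */
static uint8_t div255_addhnb(uint8_t p, uint8_t q)
{
    uint16_t x = (uint16_t)(p * q);       /* mul z0.h: at most 255 * 255 */
    uint16_t t = addhnb_lane(x, 257u);    /* addhnb z1.b, z0.h, z3.h */
    uint16_t r = addhnb_lane(x, t);       /* addhnb z0.b, z0.h, z1.h */
    return (uint8_t)r;                    /* st1b stores the low byte */
}

/* Count byte pairs whose result differs from (p * q) / 255.  */
static unsigned div255_addhnb_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned p = 0; p < 256; ++p)
        for (unsigned q = 0; q < 256; ++q)
            if (div255_addhnb((uint8_t)p, (uint8_t)q) != (p * q) / 255u)
                ++bad;
    return bad;
}

/* Convenience wrapper for use from a driver.  */
static void div255_addhnb_self_check(void)
{
    assert(div255_addhnb_mismatches() == 0);
}
```

Because the product of two bytes never exceeds 65025, neither 16-bit addition wraps, so the narrow arithmetic matches the wide-precision formula for this input range.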

There are multiple possible RTL representations for these optimizations.  I did
not represent them using a zero_extend because we seem very inconsistent about
this in the backend.  Since they are unspecs we won't match them from vector ops
anyway.  I figured maintainers would prefer this, but my maintainer ouija board
is still out for repairs :)

There are no new tests, as new correctness tests were added to the mid-end and
codegen tests for this already exist.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/108583
* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
(*bitmask_shift_plus): New.
* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus): New.
(@aarch64_bitmask_udiv3): Remove.
* config/aarch64/aarch64.cc
(aarch64_vectorize_can_special_div_by_constant,
TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,60 +4867,27 @@ (define_expand "aarch64_hn2"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv3"
-  [(match_operand:VQN 0 "register_operand")
-