Re: [PATCH 2/2] AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-02-10 Thread Richard Sandiford via Gcc-patches
I was asking in the 1/2 review whether we need the optab, but that
decision doesn't affect the other patterns, so:

Tamar Christina  writes:
> Hi All,
>
> This replaces the custom division hook with just an implementation through
> add_highpart.  For NEON we implement the add highpart (Addition + extraction 
> of
> the upper highpart of the register in the same precision) as ADD + LSR.
>
> This representation allows us to easily optimize the sequence using existing
> sequences. This gets us a pretty decent sequence using SRA:
>
> umull   v1.8h, v0.8b, v3.8b
> umull2  v0.8h, v0.16b, v3.16b
> add     v5.8h, v1.8h, v2.8h
> add     v4.8h, v0.8h, v2.8h
> usra    v1.8h, v5.8h, 8
> usra    v0.8h, v4.8h, 8
> uzp2    v1.16b, v1.16b, v0.16b
>
> To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
> where n is half the precision of the mode of the operation, to addhn + uaddw,
> which is a generally good optimization on its own and gets us back to:
>
> .L4:
> ldr     q0, [x3]
> umull   v1.8h, v0.8b, v5.8b
> umull2  v0.8h, v0.16b, v5.16b
> addhn   v3.8b, v1.8h, v4.8h
> addhn   v2.8b, v0.8h, v4.8h
> uaddw   v1.8h, v1.8h, v3.8b
> uaddw   v0.8h, v0.8h, v2.8b
> uzp2    v1.16b, v1.16b, v0.16b
> str     q1, [x3], 16
> cmp     x3, x4
> bne     .L4
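
(Per-element, the two instructions can be modelled in scalar C as below; the names are illustrative, not pattern names.  addhn adds two 16-bit elements and keeps the high byte, uaddw widens a byte and adds it to a 16-bit element, so uaddw (a, addhn (b, c)) is exactly the matched shape a + ((b + c) >> n) with n == 8, here spot-checked with the 257 addend from the division rewrite.)

#include <assert.h>
#include <stdint.h>

static uint8_t
addhn_model (uint16_t b, uint16_t c)
{
  /* Add in the full 16-bit precision, keep the high half.  */
  return (uint16_t) (b + c) >> 8;
}

static uint16_t
uaddw_model (uint16_t a, uint8_t t)
{
  /* Widen the byte and add.  */
  return (uint16_t) (a + t);
}

int
main (void)
{
  /* Spot-check the rewrite against the original expression.  */
  for (uint32_t a = 0; a < 65536; a += 97)
    for (uint32_t b = 0; b < 65536; b += 101)
      {
        uint16_t direct = (uint16_t) (a + ((uint16_t) (b + 257) >> 8));
        uint16_t rewrite = uaddw_model ((uint16_t) a,
                                        addhn_model ((uint16_t) b, 257));
        assert (direct == rewrite);
      }
  return 0;
}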
>
> For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:
>
> .L3:
> ld1b    z0.h, p0/z, [x0, x3]
> mul     z0.h, p1/m, z0.h, z2.h
> add     z1.h, z0.h, z3.h
> usra    z0.h, z1.h, #8
> lsr     z0.h, z0.h, #8
> st1b    z0.h, p0, [x0, x3]
> inch    x3
> whilelo p0.h, w3, w2
> b.any   .L3
> .L1:
> ret
>
> and to get the most optimal sequence I match (a + b) >> n (same constraint
> on n) to addhnb which gets us to:
>
> .L3:
> ld1b    z0.h, p0/z, [x0, x3]
> mul     z0.h, p1/m, z0.h, z2.h
> addhnb  z1.b, z0.h, z3.h
> addhnb  z0.b, z0.h, z1.h
> st1b    z0.h, p0, [x0, x3]
> inch    x3
> whilelo p0.h, w3, w2
> b.any   .L3
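
(Likewise, a scalar model of addhnb, illustration only and with a name of my own: both halves of (x + ((x + 257) >> 8)) >> 8 have the (a + b) >> n shape with n == 8, so the whole division becomes two addhnb operations, which is what the loop above does.)

#include <assert.h>
#include <stdint.h>

/* Scalar model of addhnb on one 16-bit element: add, keep the high byte.  */
static uint8_t
addhnb_model (uint16_t a, uint16_t b)
{
  return (uint16_t) (a + b) >> 8;
}

int
main (void)
{
  for (uint32_t a = 0; a < 256; a++)
    for (uint32_t b = 0; b < 256; b++)
      {
        uint16_t x = (uint16_t) (a * b);
        uint8_t t = addhnb_model (x, 257);  /* (x + 257) >> 8  */
        uint8_t q = addhnb_model (x, t);    /* (x + t) >> 8    */
        assert (q == x / 255);
      }
  return 0;
}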
>
> There are multiple possible RTL representations for these optimizations; I
> did not represent them using a zero_extend because we seem very inconsistent
> about this in the backend.  Since they are unspecs we won't match them from
> vector ops anyway.  I figured maintainers would prefer this, but my
> maintainer ouija board is still out for repairs :)

I agree this is the best approach as things stand.  Personally, I'd like
to have some way for the target to define simplification rules based on
unspecs, so that unspecs act more like target-specific rtl codes.  But I
know others disagree, and it wouldn't really apply to this case anyway.

> There are no new tests, as new correctness tests were added to the mid-end
> and the existing codegen tests for this already exist.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and  issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>   PR target/108583
>   * config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
>   (add3_highpart, *bitmask_shift_plus): New.
>   * config/aarch64/aarch64-sve2.md (add3_highpart,
>   *bitmask_shift_plus): New.
>   (@aarch64_bitmask_udiv3): Remove.
>   * config/aarch64/aarch64.cc
>   (aarch64_vectorize_can_special_div_by_constant): Removed.
>   * config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
>   UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
>
> --- inline copy of patch -- 
> diff --git a/gcc/config/aarch64/aarch64-simd.md 
> b/gcc/config/aarch64/aarch64-simd.md
> index 
> 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53
>  100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand "aarch64_hn2"
>}
>  )
>  
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;  short value of 16-bits
> -;; ┌──┬┐
> -;; │  ││
> -;; └──┴┘
> -;;   8-bit part1 ▲  8-bit part2   ▲
> -;;   ││
> -;;   ││
> -;;  +1   +1
> -;;
> -;; after the first addition, we have to shift right by 8, and narrow the
> -;; results back to a byte.  Remember that the addition must be done in
> -;; double the precision of the input.  Since 8 is half the size of a short
> -;; we can use a narrowing halfing instruction 

RE: [PATCH 2/2] AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-02-10 Thread Tamar Christina via Gcc-patches
Oops, I realize I forgot to fill in the test results; there were no issues 😊

> -Original Message-
> From: Gcc-patches  bounces+tamar.christina=arm@gcc.gnu.org> On Behalf Of Tamar
> Christina via Gcc-patches
> Sent: Thursday, February 9, 2023 5:22 PM
> To: gcc-patches@gcc.gnu.org
> Cc: nd ; Richard Earnshaw ;
> Marcus Shawcroft ; Kyrylo Tkachov
> ; Richard Sandiford
> 
> Subject: [PATCH 2/2]AArch64 Update div-bitmask to implement new optab
> instead of target hook [PR108583]
> 
> Hi All,
> 
> This replaces the custom division hook with just an implementation through
> add_highpart.  For NEON we implement the add highpart (addition + extraction
> of the upper half of the register in the same precision) as ADD + LSR.
>
> This representation allows us to easily optimize the sequence using existing
> patterns.  This gets us a pretty decent sequence using SRA:
> 
> umull   v1.8h, v0.8b, v3.8b
> umull2  v0.8h, v0.16b, v3.16b
> add     v5.8h, v1.8h, v2.8h
> add     v4.8h, v0.8h, v2.8h
> usra    v1.8h, v5.8h, 8
> usra    v0.8h, v4.8h, 8
> uzp2    v1.16b, v1.16b, v0.16b
> 
> To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
> where n is half the precision of the mode of the operation, to addhn + uaddw,
> which is a generally good optimization on its own and gets us back to:
> 
> .L4:
> ldr     q0, [x3]
> umull   v1.8h, v0.8b, v5.8b
> umull2  v0.8h, v0.16b, v5.16b
> addhn   v3.8b, v1.8h, v4.8h
> addhn   v2.8b, v0.8h, v4.8h
> uaddw   v1.8h, v1.8h, v3.8b
> uaddw   v0.8h, v0.8h, v2.8b
> uzp2    v1.16b, v1.16b, v0.16b
> str     q1, [x3], 16
> cmp     x3, x4
> bne     .L4
> 
> For SVE2 we optimize the initial sequence to the same ADD + LSR which gets
> us:
> 
> .L3:
> ld1b    z0.h, p0/z, [x0, x3]
> mul     z0.h, p1/m, z0.h, z2.h
> add     z1.h, z0.h, z3.h
> usra    z0.h, z1.h, #8
> lsr     z0.h, z0.h, #8
> st1b    z0.h, p0, [x0, x3]
> inch    x3
> whilelo p0.h, w3, w2
> b.any   .L3
> .L1:
> ret
> 
> and to get the most optimal sequence I match (a + b) >> n (same constraint
> on n) to addhnb which gets us to:
> 
> .L3:
> ld1b    z0.h, p0/z, [x0, x3]
> mul     z0.h, p1/m, z0.h, z2.h
> addhnb  z1.b, z0.h, z3.h
> addhnb  z0.b, z0.h, z1.h
> st1b    z0.h, p0, [x0, x3]
> inch    x3
> whilelo p0.h, w3, w2
> b.any   .L3
> 
> There are multiple possible RTL representations for these optimizations; I
> did not represent them using a zero_extend because we seem very inconsistent
> about this in the backend.  Since they are unspecs we won't match them from
> vector ops anyway.  I figured maintainers would prefer this, but my
> maintainer ouija board is still out for repairs :)
> 
> There are no new tests, as new correctness tests were added to the mid-end
> and the existing codegen tests for this already exist.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and  issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>   PR target/108583
>   * config/aarch64/aarch64-simd.md
> (@aarch64_bitmask_udiv3): Remove.
>   (add3_highpart, *bitmask_shift_plus): New.
>   * config/aarch64/aarch64-sve2.md (add3_highpart,
>   *bitmask_shift_plus): New.
>   (@aarch64_bitmask_udiv3): Remove.
>   * config/aarch64/aarch64.cc
>   (aarch64_vectorize_can_special_div_by_constant): Removed.
>   * config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
>   UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.
> 
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> index
> 7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce
> 68a8df0272c53 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -4867,62 +4867,48 @@ (define_expand
> "aarch64_hn2"
>}
>  )
> 
> -;; div optimizations using narrowings
> -;; we can do the division e.g. shorts by 255 faster by calculating it as
> -;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
> -;; double the precision of x.
> -;;
> -;; If we imagine a short as being composed of two blocks of bytes then
> -;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
> -;; adding 1 to each sub component:
> -;;
> -;;  short value of 16-bits
> -;; ┌──┬┐
>

[PATCH 2/2] AArch64 Update div-bitmask to implement new optab instead of target hook [PR108583]

2023-02-09 Thread Tamar Christina via Gcc-patches
Hi All,

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement the add highpart (addition + extraction
of the upper half of the register in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
patterns.  This gets us a pretty decent sequence using SRA:

umull   v1.8h, v0.8b, v3.8b
umull2  v0.8h, v0.16b, v3.16b
add     v5.8h, v1.8h, v2.8h
add     v4.8h, v0.8h, v2.8h
usra    v1.8h, v5.8h, 8
usra    v0.8h, v4.8h, 8
uzp2    v1.16b, v1.16b, v0.16b

To get the most optimal sequence, however, we match (a + ((b + c) >> n)), where
n is half the precision of the mode of the operation, to addhn + uaddw, which
is a generally good optimization on its own and gets us back to:

.L4:
ldr     q0, [x3]
umull   v1.8h, v0.8b, v5.8b
umull2  v0.8h, v0.16b, v5.16b
addhn   v3.8b, v1.8h, v4.8h
addhn   v2.8b, v0.8h, v4.8h
uaddw   v1.8h, v1.8h, v3.8b
uaddw   v0.8h, v0.8h, v2.8b
uzp2    v1.16b, v1.16b, v0.16b
str     q1, [x3], 16
cmp     x3, x4
bne     .L4
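
(For illustration, here is a rough little-endian NEON intrinsics rendering of the loop body above.  It assumes the kernel computes x[i] * y[i] / 255 on bytes and that the bias register holds 257; the function name is mine and this is a sketch, not code from the patch.)

#include <arm_neon.h>
#include <assert.h>
#include <stdint.h>

/* Divide each 16-bit product x[i] * y[i] by 255 and narrow back to bytes,
   using the addhn + uaddw + uzp2 shape shown above (little-endian lane
   order assumed).  */
static uint8x16_t
mul_div255_u8 (uint8x16_t x, uint8x16_t y)
{
  uint16x8_t bias = vdupq_n_u16 (257);
  uint16x8_t lo = vmull_u8 (vget_low_u8 (x), vget_low_u8 (y));
  uint16x8_t hi = vmull_high_u8 (x, y);
  /* addhn: (w + 257) >> 8, narrowed to bytes.  */
  uint8x8_t tlo = vaddhn_u16 (lo, bias);
  uint8x8_t thi = vaddhn_u16 (hi, bias);
  /* uaddw: w + t, still 16 bits per element.  */
  uint16x8_t slo = vaddw_u8 (lo, tlo);
  uint16x8_t shi = vaddw_u8 (hi, thi);
  /* uzp2 on the byte view keeps the high byte of each 16-bit element,
     i.e. the final >> 8 plus the narrowing.  */
  return vuzp2q_u8 (vreinterpretq_u8_u16 (slo), vreinterpretq_u8_u16 (shi));
}

int
main (void)
{
  uint8_t a[16], b[16], r[16];
  for (int i = 0; i < 16; i++)
    {
      a[i] = (uint8_t) (i * 17);
      b[i] = (uint8_t) (255 - i);
    }
  vst1q_u8 (r, mul_div255_u8 (vld1q_u8 (a), vld1q_u8 (b)));
  for (int i = 0; i < 16; i++)
    assert (r[i] == (a[i] * b[i]) / 255);
  return 0;
}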

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
ld1b    z0.h, p0/z, [x0, x3]
mul     z0.h, p1/m, z0.h, z2.h
add     z1.h, z0.h, z3.h
usra    z0.h, z1.h, #8
lsr     z0.h, z0.h, #8
st1b    z0.h, p0, [x0, x3]
inch    x3
whilelo p0.h, w3, w2
b.any   .L3
.L1:
ret
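
(For reference, a scalar sketch of how the three instructions implement the division in this intermediate form.  It assumes z3 holds the 257 bias; usra is shift-right-and-accumulate, i.e. x += t >> 8.  Illustration only.)

#include <assert.h>
#include <stdint.h>

static uint16_t
div255_add_usra_lsr (uint16_t x)
{
  uint16_t t = (uint16_t) (x + 257); /* add  z1.h, z0.h, z3.h  */
  x = (uint16_t) (x + (t >> 8));     /* usra z0.h, z1.h, #8    */
  return x >> 8;                     /* lsr  z0.h, z0.h, #8    */
}

int
main (void)
{
  /* Brute-force check over every product of two byte values.  */
  for (uint32_t a = 0; a < 256; a++)
    for (uint32_t b = 0; b < 256; b++)
      assert (div255_add_usra_lsr ((uint16_t) (a * b)) == (a * b) / 255);
  return 0;
}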

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
ld1b    z0.h, p0/z, [x0, x3]
mul     z0.h, p1/m, z0.h, z2.h
addhnb  z1.b, z0.h, z3.h
addhnb  z0.b, z0.h, z1.h
st1b    z0.h, p0, [x0, x3]
inch    x3
whilelo p0.h, w3, w2
b.any   .L3

There are multiple possible RTL representations for these optimizations; I did
not represent them using a zero_extend because we seem very inconsistent about
this in the backend.  Since they are unspecs we won't match them from vector
ops anyway.  I figured maintainers would prefer this, but my maintainer ouija
board is still out for repairs :)

There are no new tests, as new correctness tests were added to the mid-end and
the existing codegen tests for this already exist.

Bootstrapped Regtested on aarch64-none-linux-gnu and  issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

PR target/108583
* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv3): Remove.
(add3_highpart, *bitmask_shift_plus): New.
* config/aarch64/aarch64-sve2.md (add3_highpart,
*bitmask_shift_plus): New.
(@aarch64_bitmask_udiv3): Remove.
* config/aarch64/aarch64.cc
(aarch64_vectorize_can_special_div_by_constant): Removed.
* config/aarch64/iterators.md (UNSPEC_SADD_HIGHPART,
UNSPEC_UADD_HIGHPART, ADD_HIGHPART): New.

--- inline copy of patch -- 
diff --git a/gcc/config/aarch64/aarch64-simd.md 
b/gcc/config/aarch64/aarch64-simd.md
index 
7f212bf37cd2c120dceb7efa733c9fa76226f029..26871a56d1fdb134f0ad9d828ce68a8df0272c53
 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,62 +4867,48 @@ (define_expand "aarch64_hn2"
   }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;  short value of 16-bits
-;; ┌──┬┐
-;; │  ││
-;; └──┴┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;   ││
-;;   ││
-;;  +1   +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_op