Pengxuan Zheng <[email protected]> writes:
> This patch optimizes certain vector permute expansion with the FMOV
> instruction when one of the input vectors is a vector of all zeros and the
> result of the vector permute is as if the upper lane of the non-zero input
> vector is set to zero and the lower lane remains unchanged.
>
> Note that the patch also propagates zero_op0_p and zero_op1_p during re-encode
> now. They will be used by aarch64_evpc_fmov to check if the input vectors are
> valid candidates.
>
> PR target/100165
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-protos.h (aarch64_lane0_mask_p): New.
> * config/aarch64/aarch64-simd.md
> (@aarch64_simd_vec_set_zero_fmov<mode>):
> New define_insn.
> * config/aarch64/aarch64.cc (aarch64_lane0_mask_p): New.
> (aarch64_evpc_reencode): Copy zero_op0_p and zero_op1_p.
> (aarch64_evpc_fmov): New.
> (aarch64_expand_vec_perm_const_1): Add call to aarch64_evpc_fmov.
> * config/aarch64/iterators.md (VALL_F16_NO_QI): New mode iterator.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/vec-set-zero.c: Update test accordingly.
> * gcc.target/aarch64/fmov-1.c: New test.
> * gcc.target/aarch64/fmov-2.c: New test.
> * gcc.target/aarch64/fmov-3.c: New test.
> * gcc.target/aarch64/fmov-be-1.c: New test.
> * gcc.target/aarch64/fmov-be-2.c: New test.
> * gcc.target/aarch64/fmov-be-3.c: New test.
Sorry to be awkward, but looking at this again, and going back to my
previous comment:

  Part of me thinks that this should just be described as a plain old AND,
  but I suppose that doesn't work well for FP modes.  Still, handling ANDs
  might be an interesting follow-up :)

I wonder whether we should model this as an AND after all.  That is,
any permute that blends a vector with zero can be interpreted as an AND
with a mask.  We could even provide a target-independent routine for
detecting that case.
At present:
  v4hf
  f_v4hf (v4hf x)
  {
    return __builtin_shuffle (x, (v4hf){ 0, 0, 0, 0 }, (v4hi){ 4, 1, 6, 3 });
  }
generates:
  f_v4hf:
          uzp1    v0.2d, v0.2d, v0.2d
          adrp    x0, .LC0
          ldr     d31, [x0, #:lo12:.LC0]
          tbl     v0.8b, {v0.16b}, v31.8b
          ret
  .LC0:
          .byte   -1
          .byte   -1
          .byte   2
          .byte   3
          .byte   -1
          .byte   -1
          .byte   6
          .byte   7
whereas with SVE enabled it could just be:
  f_v4hf:
          and     z0.d, z0.d, #0xffff0000ffff0000
          ret
and even without SVE it would be:
  f_v4hf:
          mvni    v31.2s, 0xff, msl 8
          and     v0.8b, v0.8b, v31.8b
          ret
Then, using fmov would be an optimisation of AND.
I think this would also simplify the evpc detection, since the requirement
for using AND is the same for big-endian and little-endian, namely that
index I of the result must either come from index I of the nonzero
vector or from any element of the zero vector. (What differs between
big-endian and little-endian is which masks correspond to FMOV.)
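
To make the condition concrete, here is a rough sketch of what such a
check could look like, written as standalone C rather than against the
real vec_perm_indices interfaces.  The function name, its arguments and
the bool-per-lane output are all made up for illustration; a real
routine would presumably build the mask as a constant vector instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not GCC's actual interface.  SEL holds NELTS
   selector indices in the __builtin_shuffle convention: values in
   [0, NELTS) select from the first input and values in
   [NELTS, 2 * NELTS) select from the second.  ZERO_OP says which
   input (0 or 1) is known to be all zeros.

   Return true if the permute is equivalent to ANDing the nonzero
   input with a per-lane mask.  In that case KEEP[I] is true if lane I
   of the result keeps lane I of the nonzero input (mask lane all
   ones) and false if it becomes zero (mask lane all zeros).  */
static bool
permute_is_and_p (const unsigned *sel, size_t nelts, int zero_op,
                  bool *keep)
{
  assert (zero_op == 0 || zero_op == 1);
  for (size_t i = 0; i < nelts; i++)
    {
      if (sel[i] >= 2 * nelts)
        return false;

      int src_op = sel[i] >= nelts;
      size_t src_lane = sel[i] % nelts;

      if (src_op == zero_op)
        /* Any lane of the zero vector will do.  */
        keep[i] = false;
      else if (src_lane == i)
        /* Lane I of the nonzero vector stays in place.  */
        keep[i] = true;
      else
        /* A lane of the nonzero vector moves, so this is not an AND.  */
        return false;
    }
  return true;
}
```

For the f_v4hf selector { 4, 1, 6, 3 } with the second input being zero,
this reports keep = { false, true, false, true }.  Nothing in the check
depends on endianness; only the later step of turning KEEP into a
register-level mask (and recognising the FMOV cases) would.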
Sorry again for the run-around.
Richard