Hi,

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, September 30, 2024 6:33 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus Shawcroft <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 2/2]AArch64: support encoding integer immediates using floating point moves
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > This patch extends our immediate SIMD generation cases to support generating
> > integer immediates using floating point operation if the integer immediate maps
> > to an exact FP value.
> >
> > As an example:
> >
> > uint32x4_t f1() {
> >   return vdupq_n_u32(0x3f800000);
> > }
> >
> > currently generates:
> >
> > f1:
> >	adrp	x0, .LC0
> >	ldr	q0, [x0, #:lo12:.LC0]
> >	ret
> >
> > i.e. a load, but with this change:
> >
> > f1:
> >	fmov	v0.4s, 1.0e+0
> >	ret
> >
> > Such immediates are common in e.g. our Math routines in glibc because they are
> > created to extract or mark part of an FP immediate as masks.
>
> I agree this is a good thing to do.  The current code is too beholden
> to the original vector mode.  This patch relaxes it so that it isn't
> beholden to the original mode's class (integer vs. float), but it would
> still be beholden to the original mode's element size.
I've implemented this approach and it works, but I'm struggling with an
inconsistency in how zeros are created.  There are about 800 SVE ACLE tests,
like acge_f16.c, that check that a zero is created using a mov of the
same-sized register as the usage.  So I added an exception for zero to use
the original input element mode.  But there are about 400 other SVE ACLE
tests, like dup_128_s16_z, that instead check that zeros are created using
byte moves even though they're used as ints.  These two requirements are in
conflict.  Do you care which way I resolve it?  Since the value is zero it
shouldn't matter how it's created, but perhaps there's a reason some tests
check for the specific instruction?

Thanks,
Tamar

>
> It looks like an alternative would be to remove:
>
>   scalar_float_mode elt_float_mode;
>   if (n_elts == 1
>       && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
>     {
>       rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
>       if (aarch64_float_const_zero_rtx_p (elt)
>	  || aarch64_float_const_representable_p (elt))
>	{
>	  if (info)
>	    *info = simd_immediate_info (elt_float_mode, elt);
>	  return true;
>	}
>     }
>
> and instead insert code:
>
>   /* Get the repeating 8-byte value as an integer.  No endian correction
>      is needed here because bytes is already in lsb-first order.  */
>   unsigned HOST_WIDE_INT val64 = 0;
>   for (unsigned int i = 0; i < 8; i++)
>     val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
>	       << (i * BITS_PER_UNIT));
>
> ---> here
>
>   if (vec_flags & VEC_SVE_DATA)
>     return aarch64_sve_valid_immediate (val64, info);
>   else
>     return aarch64_advsimd_valid_immediate (val64, info, which);
>
> that tries to reduce val64 to the smallest repeating pattern,
> then tries to interpret that pattern as a float.
The reduction step
> could reuse the first part of aarch64_sve_valid_immediate, which
> calculates the narrowest repeating integer mode:
>
>   scalar_int_mode mode = DImode;
>   unsigned int val32 = val64 & 0xffffffff;
>   if (val32 == (val64 >> 32))
>     {
>       mode = SImode;
>       unsigned int val16 = val32 & 0xffff;
>       if (val16 == (val32 >> 16))
>	 {
>	   mode = HImode;
>	   unsigned int val8 = val16 & 0xff;
>	   if (val8 == (val16 >> 8))
>	     mode = QImode;
>	 }
>     }
>
> This would give us the candidate integer mode, to which we could
> apply float_mode_for_size (...).exists, as in the patch.
>
> In this case we would have the value as an integer, rather than
> as an rtx, so I think it would make sense to split out the part of
> aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
> aarch64_simd_valid_immediate could then use the patch's:
>
> > +  long int as_long_ints[2];
> > +  as_long_ints[0] = buf & 0xFFFFFFFF;
> > +  as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > [...]
> > +  real_from_target (&r, as_long_ints, fmode);
>
> with "buf" being "val64" in the code above, and "fmode" being the result
> of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
> would then pass "r" and "fmode" to the new, split-out variant of
> aarch64_float_const_representable_p.  (I haven't checked the endianness
> requirements for real_from_target.)
>
> The split-out variant would still perform the HFmode test in:
>
>   if (GET_MODE (x) == VOIDmode
>       || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
>     return false;
>
> The VOIDmode test is redundant and can be dropped.  AArch64 has always
> been a CONST_WIDE_INT target.
>
> If we do that, we should probably also pass the integer mode calculated
> by the code quoted above down to aarch64_sve_valid_immediate (where it
> came from) and aarch64_advsimd_valid_immediate, since both of them would
> find it useful.  E.g.:
>
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV
>       && val16 == (val32 >> 16)
>       && val8 == (val16 >> 8))
>     {
>       if (info)
>	 *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
>
> would become:
>
>   /* Try using a replicated byte.  */
>   if (which == AARCH64_CHECK_MOV && mode == QImode)
>     {
>       if (info)
>	 *info = simd_immediate_info (QImode, val8);
>       return true;
>     }
>
> I realise that's quite a bit different from the patch as posted, sorry,
> and I've made it sound more complicated than it actually is.  But I think
> it should be both more general (because it ignores the element size as
> well as the mode class) and a little simpler.
>
> The proposed split of aarch64_float_const_representable_p would be
> a replacement for patch 1 in the series.  The current rtx version
> of aarch64_float_const_representable_p would not need to take a mode,
> but the REAL_VALUE_TYPE interface would.
>
> Thanks,
> Richard
>
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and <on-goin> issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >	* config/aarch64/aarch64-protos.h (aarch64_float_const_representable_p):
> >	Add overload.
> >	* config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
> >	integer modes.
> >	(aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
> >	Check if integer value maps to an exact FP constant.
> >
> > gcc/testsuite/ChangeLog:
> >
> >	* gcc.target/aarch64/const_create_using_fmov.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> > index 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac38381ea6451fd55 100644
> > --- a/gcc/config/aarch64/aarch64-protos.h
> > +++ b/gcc/config/aarch64/aarch64-protos.h
> > @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
> >
> >  /* Check for a legitimate floating point constant for FMOV.  */
> >  bool aarch64_float_const_representable_p (rtx, machine_mode);
> > +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
> >
> >  extern int aarch64_epilogue_uses (int);
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb4c3f99c9f88e70ed 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
> >    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
> >       zr as our callers expect, so no need to check the actual
> >       value if X is of Decimal Floating Point type.  */
> > -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> > +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> > +      || !CONST_DOUBLE_P (x))
> >      return false;
> >
> >    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> > @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
> >    else
> >      return false;
> >
> > -  scalar_float_mode elt_float_mode;
> > -  if (n_elts == 1
> > -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> > +  if (n_elts == 1)
> >      {
> >        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> > +      rtx new_elt = NULL_RTX;
> >        if (aarch64_float_const_zero_rtx_p (elt)
> > -	  || aarch64_float_const_representable_p (elt, elt_mode))
> > -	{
> > -	  if (info)
> > -	    *info = simd_immediate_info (elt_float_mode, elt);
> > -	  return true;
> > +	  || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> > +	{
> > +	  scalar_float_mode elt_float_mode;
> > +	  auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> > +	  if (is_a <scalar_float_mode> (elt_mode))
> > +	    elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> > +	  else if (which == AARCH64_CHECK_MOV
> > +		   && new_elt
> > +		   && float_mode_for_size (bitsize).exists (&elt_float_mode))
> > +	    elt = new_elt;
> > +	  else
> > +	    elt = NULL_RTX;
> > +
> > +	  if (elt != NULL_RTX)
> > +	    {
> > +	      if (info)
> > +		*info = simd_immediate_info (elt_float_mode, elt);
> > +	      return true;
> > +	    }
> >	 }
> >      }
> >
> > @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
> >
> >  /* Return true iff X with mode MODE can be represented by a quarter-precision
> >     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> > +
> >  bool
> >  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> > +{
> > +  return aarch64_float_const_representable_p (NULL, x, mode);
> > +}
> > +
> > +
> > +/* Return true iff X with mode MODE can be represented by a quarter-precision
> > +   floating point immediate operand X.  Note, we cannot represent 0.0.
> > +   If the value is a CONST_INT that can be represented as an exact floating
> > +   point then OUT will contain the new floating point value to emit to generate
> > +   the integer constant.  */
> > +
> > +bool
> > +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
> >  {
> >    /* This represents our current view of how many bits
> >       make up the mantissa.  */
> > @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >
> >    x = unwrap_const_vec_duplicate (x);
> >    mode = GET_MODE_INNER (mode);
> > -  if (!CONST_DOUBLE_P (x))
> > +  if (!CONST_DOUBLE_P (x)
> > +      && !CONST_INT_P (x))
> >      return false;
> >
> >    if (mode == VOIDmode
> > -      || (mode == HFmode && !TARGET_FP_F16INST))
> > +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
> >      return false;
> >
> > -  r = *CONST_DOUBLE_REAL_VALUE (x);
> > +  /* If we have an integer bit pattern, decode it back into a real.
> > +     real_from_target requires the representation to be split into
> > +     32-bit values and then put into two host wide ints.  */
> > +  if (CONST_INT_P (x))
> > +    {
> > +      HOST_WIDE_INT buf = INTVAL (x);
> > +      long int as_long_ints[2];
> > +      as_long_ints[0] = buf & 0xFFFFFFFF;
> > +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> > +      machine_mode fmode;
> > +      switch (mode)
> > +	{
> > +	case HImode:
> > +	  fmode = HFmode;
> > +	  break;
> > +	case SImode:
> > +	  fmode = SFmode;
> > +	  break;
> > +	case DImode:
> > +	  fmode = DFmode;
> > +	  break;
> > +	default:
> > +	  return false;
> > +	}
> > +
> > +      real_from_target (&r, as_long_ints, fmode);
> > +      if (out)
> > +	*out = const_double_from_real_value (r, fmode);
> > +    }
> > +  else
> > +    r = *CONST_DOUBLE_REAL_VALUE (x);
> >
> >    /* We cannot represent infinities, NaNs or +/-zero.  We won't
> >       know if we have +zero until we analyse the mantissa, but we
> > @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >       the value.  */
> >    if (w.ulow () != 0)
> >      return false;
> > +
> >    /* We have rejected the lower HOST_WIDE_INT, so update our
> >       understanding of how many bits lie in the mantissa and
> >       look only at the high HOST_WIDE_INT.  */
> > @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
> >    return (exponent >= 0 && exponent <= 7);
> >  }
> >
> > -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> > -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether to
> > -   output MOVI/MVNI, ORR or BIC immediate.  */
> > +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC or
> > +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether
> > +   to output MOVI/MVNI, ORR or BIC immediate.  */
> >  char*
> >  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
> >				      enum simd_immediate_check which)
> > diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..e080afed8aa3578660027979335bfc859ca6bc91
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> > @@ -0,0 +1,87 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/*
> > +** g:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +float32x4_t g(){
> > +  return vdupq_n_f32(1);
> > +}
> > +
> > +/*
> > +** h:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint32x4_t h() {
> > +  return vreinterpretq_u32_f32(g());
> > +}
> > +
> > +/*
> > +** f1:
> > +**	fmov	v0\.4s, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f1() {
> > +  return vdupq_n_u32(0x3f800000);
> > +}
> > +
> > +/*
> > +** f2:
> > +**	fmov	v0\.4s, 1\.5e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f2() {
> > +  return vdupq_n_u32(0x3FC00000);
> > +}
> > +
> > +/*
> > +** f3:
> > +**	fmov	v0\.4s, 1\.25e\+0
> > +**	ret
> > +*/
> > +uint32x4_t f3() {
> > +  return vdupq_n_u32(0x3FA00000);
> > +}
> > +
> > +/*
> > +** f4:
> > +**	fmov	v0\.2d, 1\.0e\+0
> > +**	ret
> > +*/
> > +uint64x2_t f4() {
> > +  return vdupq_n_u64(0x3FF0000000000000);
> > +}
> > +
> > +/*
> > +** fn4:
> > +**	fmov	v0\.2d, -1\.0e\+0
> > +**	ret
> > +*/
> > +uint64x2_t fn4() {
> > +  return vdupq_n_u64(0xBFF0000000000000);
> > +}
> > +
> > +/*
> > +** f5:
> > +**	fmov	v0\.8h, 1\.5e\+0
> > +**	ret
> > +*/
> > +uint16x8_t f5() {
> > +  return vdupq_n_u16(0x3E00);
> > +}
> > +
> > +/*
> > +** f6:
> > +**	adrp	x0, \.LC0
> > +**	ldr	q0, \[x0, #:lo12:\.LC0\]
> > +**	ret
> > +*/
> > +uint32x4_t f6() {
> > +  return vdupq_n_u32(0x4f800000);
> > +}