Tamar Christina <tamar.christ...@arm.com> writes:
> Hi All,
>
> This patch extends our immediate SIMD generation cases to support generating
> integer immediates using floating-point operations if the integer immediate
> maps to an exact FP value.
>
> As an example:
>
> uint32x4_t f1() {
>     return vdupq_n_u32(0x3f800000);
> }
>
> currently generates:
>
> f1:
>         adrp    x0, .LC0
>         ldr     q0, [x0, #:lo12:.LC0]
>         ret
>
> i.e. a load, but with this change:
>
> f1:
>         fmov    v0.4s, 1.0e+0
>         ret
>
> Such immediates are common in e.g. our math routines in glibc, where they are
> used as masks to extract or mark parts of an FP value.

I agree this is a good thing to do.  The current code is too beholden
to the original vector mode.  This patch relaxes it so that it isn't
beholden to the original mode's class (integer vs. float), but it would
still be beholden to the original mode's element size.

It looks like an alternative would be to remove:

  scalar_float_mode elt_float_mode;
  if (n_elts == 1
      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
    {
      rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
      if (aarch64_float_const_zero_rtx_p (elt)
          || aarch64_float_const_representable_p (elt))
        {
          if (info)
            *info = simd_immediate_info (elt_float_mode, elt);
          return true;
        }
    }

and instead insert code:

  /* Get the repeating 8-byte value as an integer.  No endian correction
     is needed here because bytes is already in lsb-first order.  */
  unsigned HOST_WIDE_INT val64 = 0;
  for (unsigned int i = 0; i < 8; i++)
    val64 |= ((unsigned HOST_WIDE_INT) bytes[i % nbytes]
              << (i * BITS_PER_UNIT));

---> here

  if (vec_flags & VEC_SVE_DATA)
    return aarch64_sve_valid_immediate (val64, info);
  else
    return aarch64_advsimd_valid_immediate (val64, info, which);

that tries to reduce val64 to the smallest repeating pattern,
then tries to interpret that pattern as a float.  The reduction step
could reuse the first part of aarch64_sve_valid_immediate, which
calculates the narrowest repeating integer mode:

  scalar_int_mode mode = DImode;
  unsigned int val32 = val64 & 0xffffffff;
  if (val32 == (val64 >> 32))
    {
      mode = SImode;
      unsigned int val16 = val32 & 0xffff;
      if (val16 == (val32 >> 16))
        {
          mode = HImode;
          unsigned int val8 = val16 & 0xff;
          if (val8 == (val16 >> 8))
            mode = QImode;
        }
    }

This would give us the candidate integer mode, to which we could
apply float_mode_for_size (...).exists, as in the patch.
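
Very roughly, and only as a sketch ("mode" here is the repeating integer
mode calculated above, and the AARCH64_CHECK_MOV guard mirrors the one
in the patch):

  scalar_float_mode fmode;
  if (which == AARCH64_CHECK_MOV
      && float_mode_for_size (GET_MODE_BITSIZE (mode)).exists (&fmode))
    {
      /* Interpret the repeating pattern as a value of FMODE and test
         whether FMOV can materialise it (see below).  */
    }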

In this case we would have the value as an integer, rather than
as an rtx, so I think it would make sense to split out the part of
aarch64_float_const_representable_p that processes the REAL_VALUE_TYPE.
aarch64_simd_valid_immediate could then use the patch's:

> +      long int as_long_ints[2];
> +      as_long_ints[0] = buf & 0xFFFFFFFF;
> +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> [...]
> +      real_from_target (&r, as_long_ints, fmode);

with "buf" being "val64" in the code above, and "fmode" being the result
of float_mode_for_size (...).exists.  aarch64_simd_valid_immediate
would then pass "r" and and "fmode" to the new, split-out variant of
aarch64_float_const_representable_p.  (I haven't checked the endiannes
requirements for real_from_target.)
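
Concretely, something along these lines (again only a sketch;
aarch64_real_const_representable_p is just a placeholder name for the
split-out routine):

  long as_long_ints[2];
  as_long_ints[0] = val64 & 0xFFFFFFFF;
  as_long_ints[1] = (val64 >> 32) & 0xFFFFFFFF;
  REAL_VALUE_TYPE r;
  /* NB: the endianness requirements of real_from_target still need
     to be checked.  */
  real_from_target (&r, as_long_ints, fmode);
  if (aarch64_real_const_representable_p (r, fmode))
    {
      if (info)
        *info = simd_immediate_info (fmode,
                                     const_double_from_real_value (r, fmode));
      return true;
    }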

The split-out variant would still perform the HFmode test in:

  if (GET_MODE (x) == VOIDmode
      || (GET_MODE (x) == HFmode && !TARGET_FP_F16INST))
    return false;

The VOIDmode test is redundant and can be dropped.  AArch64 has always
been a CONST_WIDE_INT target.
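
With that, the mode check in the REAL_VALUE_TYPE variant would reduce
to something like:

  if (fmode == HFmode && !TARGET_FP_F16INST)
    return false;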

If we do that, we should probably also pass the integer mode calculated
by the code quoted above down to aarch64_sve_valid_immediate (where it
came from) and aarch64_advsimd_valid_immediate, since both of them would
find it useful.  E.g.:

      /* Try using a replicated byte.  */
      if (which == AARCH64_CHECK_MOV
          && val16 == (val32 >> 16)
          && val8 == (val16 >> 8))
        {
          if (info)
            *info = simd_immediate_info (QImode, val8);
          return true;
        }

would become:

  /* Try using a replicated byte.  */
  if (which == AARCH64_CHECK_MOV && mode == QImode)
    {
      if (info)
        *info = simd_immediate_info (QImode, val8);
      return true;
    }

I realise that's quite a bit different from the patch as posted, sorry,
and I've made it sound more complicated than it actually is.  But I think
it should be both more general (because it ignores the element size as
well as the mode class) and a little simpler.

The proposed split of aarch64_float_const_representable_p would be
a replacement for patch 1 in the series.  The current rtx version
of aarch64_float_const_representable_p would not need to take a mode,
but the REAL_VALUE_TYPE interface would.
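
In other words, the two interfaces could look something like this
(names purely illustrative):

  /* rtx entry point; no mode argument needed.  */
  bool aarch64_float_const_representable_p (rtx);

  /* Split-out variant that works on an already-decoded value.  */
  bool aarch64_real_const_representable_p (const REAL_VALUE_TYPE &value,
                                           scalar_float_mode fmode);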

Thanks,
Richard

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and <on-goin> issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       * config/aarch64/aarch64-protos.h (aarch64_float_const_representable_p):
>       Add overload.
>       * config/aarch64/aarch64.cc (aarch64_float_const_zero_rtx_p): Reject
>       integer modes.
>       (aarch64_simd_valid_immediate, aarch64_float_const_representable_p):
>       Check if integer value maps to an exact FP constant.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/const_create_using_fmov.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 7a84acc59569da0b50af2300615db561a5de460a..6c683ea2d93e1b733cfe49fac38381ea6451fd55 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -974,6 +974,7 @@ void aarch64_split_simd_move (rtx, rtx);
>  
>  /* Check for a legitimate floating point constant for FMOV.  */
>  bool aarch64_float_const_representable_p (rtx, machine_mode);
> +bool aarch64_float_const_representable_p (rtx *, rtx, machine_mode);
>  
>  extern int aarch64_epilogue_uses (int);
>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 1842f6ecf6330f11a64545d0903240c89b104ffc..2d44608d93b8e7542ea8d5eb4c3f99c9f88e70ed 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -10991,7 +10991,8 @@ aarch64_float_const_zero_rtx_p (rtx x)
>    /* 0.0 in Decimal Floating Point cannot be represented by #0 or
>       zr as our callers expect, so no need to check the actual
>       value if X is of Decimal Floating Point type.  */
> -  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT)
> +  if (GET_MODE_CLASS (GET_MODE (x)) == MODE_DECIMAL_FLOAT
> +      || !CONST_DOUBLE_P (x))
>      return false;
>  
>    if (REAL_VALUE_MINUS_ZERO (*CONST_DOUBLE_REAL_VALUE (x)))
> @@ -23026,17 +23027,30 @@ aarch64_simd_valid_immediate (rtx op, simd_immediate_info *info,
>    else
>      return false;
>  
> -  scalar_float_mode elt_float_mode;
> -  if (n_elts == 1
> -      && is_a <scalar_float_mode> (elt_mode, &elt_float_mode))
> +  if (n_elts == 1)
>      {
>        rtx elt = CONST_VECTOR_ENCODED_ELT (op, 0);
> +      rtx new_elt = NULL_RTX;
>        if (aarch64_float_const_zero_rtx_p (elt)
> -       || aarch64_float_const_representable_p (elt, elt_mode))
> -     {
> -       if (info)
> -         *info = simd_immediate_info (elt_float_mode, elt);
> -       return true;
> +       || aarch64_float_const_representable_p (&new_elt, elt, elt_mode))
> +     {
> +       scalar_float_mode elt_float_mode;
> +       auto bitsize = GET_MODE_UNIT_BITSIZE (elt_mode);
> +       if (is_a <scalar_float_mode> (elt_mode))
> +         elt_float_mode = as_a <scalar_float_mode> (elt_mode);
> +       else if (which == AARCH64_CHECK_MOV
> +                && new_elt
> +                && float_mode_for_size (bitsize).exists (&elt_float_mode))
> +         elt = new_elt;
> +       else
> +         elt = NULL_RTX;
> +
> +       if (elt != NULL_RTX)
> +         {
> +           if (info)
> +             *info = simd_immediate_info (elt_float_mode, elt);
> +           return true;
> +         }
>       }
>      }
>  
> @@ -25121,8 +25135,22 @@ aarch64_c_mode_for_suffix (char suffix)
>  
>  /* Return true iff X with mode MODE can be represented by a quarter-precision
>     floating point immediate operand X.  Note, we cannot represent 0.0.  */
> +
>  bool
>  aarch64_float_const_representable_p (rtx x, machine_mode mode)
> +{
> +  return aarch64_float_const_representable_p (NULL, x, mode);
> +}
> +
> +
> +/* Return true iff X with mode MODE can be represented by a quarter-precision
> +   floating point immediate operand X.  Note, we cannot represent 0.0.
> +   If the value is a CONST_INT that can be represented as an exact floating
> +   point then OUT will contain the new floating point value to emit to generate
> +   the integer constant.  */
> +
> +bool
> +aarch64_float_const_representable_p (rtx *out, rtx x, machine_mode mode)
>  {
>    /* This represents our current view of how many bits
>       make up the mantissa.  */
> @@ -25134,14 +25162,45 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>  
>    x = unwrap_const_vec_duplicate (x);
>    mode = GET_MODE_INNER (mode);
> -  if (!CONST_DOUBLE_P (x))
> +  if (!CONST_DOUBLE_P (x)
> +      && !CONST_INT_P (x))
>      return false;
>  
>    if (mode == VOIDmode
> -      || (mode == HFmode && !TARGET_FP_F16INST))
> +      || ((mode == HFmode || mode == HImode) && !TARGET_FP_F16INST))
>      return false;
>  
> -  r = *CONST_DOUBLE_REAL_VALUE (x);
> +  /* If we have an integer bit pattern, decode it back into a real.
> +     real_from_target requires the representation to be split into
> +     32-bit values and then put into two host wide ints.  */
> +  if (CONST_INT_P (x))
> +    {
> +      HOST_WIDE_INT buf = INTVAL (x);
> +      long int as_long_ints[2];
> +      as_long_ints[0] = buf & 0xFFFFFFFF;
> +      as_long_ints[1] = (buf >> 32) & 0xFFFFFFFF;
> +      machine_mode fmode;
> +      switch (mode)
> +      {
> +      case HImode:
> +     fmode = HFmode;
> +     break;
> +      case SImode:
> +     fmode = SFmode;
> +     break;
> +      case DImode:
> +     fmode = DFmode;
> +     break;
> +      default:
> +     return false;
> +      }
> +
> +      real_from_target (&r, as_long_ints, fmode);
> +      if (out)
> +     *out = const_double_from_real_value (r, fmode);
> +    }
> +  else
> +    r = *CONST_DOUBLE_REAL_VALUE (x);
>  
>    /* We cannot represent infinities, NaNs or +/-zero.  We won't
>       know if we have +zero until we analyse the mantissa, but we
> @@ -25170,6 +25229,7 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>       the value.  */
>    if (w.ulow () != 0)
>      return false;
> +
>    /* We have rejected the lower HOST_WIDE_INT, so update our
>       understanding of how many bits lie in the mantissa and
>       look only at the high HOST_WIDE_INT.  */
> @@ -25205,9 +25265,9 @@ aarch64_float_const_representable_p (rtx x, machine_mode mode)
>    return (exponent >= 0 && exponent <= 7);
>  }
>  
> -/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR or BIC
> -   immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether to
> -   output MOVI/MVNI, ORR or BIC immediate.  */
> +/* Returns the string with the instruction for AdvSIMD MOVI, MVNI, ORR, BIC or
> +   FMOV immediate with a CONST_VECTOR of MODE and WIDTH.  WHICH selects whether
> +   to output MOVI/MVNI, ORR or BIC immediate.  */
>  char*
>  aarch64_output_simd_mov_immediate (rtx const_vector, unsigned width,
>                                  enum simd_immediate_check which)
> diff --git a/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..e080afed8aa3578660027979335bfc859ca6bc91
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/const_create_using_fmov.c
> @@ -0,0 +1,87 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-march=armv9-a -Ofast" } */
> +/* { dg-final { check-function-bodies "**" "" "" } } */
> +
> +#include <arm_neon.h>
> +
> +/*
> +** g:
> +**   fmov    v0\.4s, 1\.0e\+0
> +**   ret
> +*/
> +float32x4_t g(){
> +    return vdupq_n_f32(1);
> +}
> +
> +/*
> +** h:
> +**   fmov    v0\.4s, 1\.0e\+0
> +**   ret
> +*/
> +uint32x4_t h() {
> +    return vreinterpretq_u32_f32(g());
> +}
> +
> +/*
> +** f1:
> +**   fmov    v0\.4s, 1\.0e\+0
> +**   ret
> +*/
> +uint32x4_t f1() {
> +    return vdupq_n_u32(0x3f800000);
> +}
> +
> +/*
> +** f2:
> +**   fmov    v0\.4s, 1\.5e\+0
> +**   ret
> +*/
> +uint32x4_t f2() {
> +    return vdupq_n_u32(0x3FC00000);
> +}
> +
> +/*
> +** f3:
> +**   fmov    v0\.4s, 1\.25e\+0
> +**   ret
> +*/
> +uint32x4_t f3() {
> +    return vdupq_n_u32(0x3FA00000);
> +}
> +
> +/*
> +** f4:
> +**   fmov    v0\.2d, 1\.0e\+0
> +**   ret
> +*/
> +uint64x2_t f4() {
> +    return vdupq_n_u64(0x3FF0000000000000);
> +}
> +
> +/*
> +** fn4:
> +**   fmov    v0\.2d, -1\.0e\+0
> +**   ret
> +*/
> +uint64x2_t fn4() {
> +    return vdupq_n_u64(0xBFF0000000000000);
> +}
> +
> +/*
> +** f5:
> +**   fmov    v0\.8h, 1\.5e\+0
> +**   ret
> +*/
> +uint16x8_t f5() {
> +    return vdupq_n_u16(0x3E00);
> +}
> +
> +/*
> +** f6:
> +**   adrp    x0, \.LC0
> +**   ldr     q0, \[x0, #:lo12:\.LC0\]
> +**   ret
> +*/
> +uint32x4_t f6() {
> +    return vdupq_n_u32(0x4f800000);
> +}
