Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Tuesday, August 31, 2021 4:14 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; Kyrylo Tkachov <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH 2/2]AArch64: Add better costing for vector constants
>> and operations
>> 
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> > @@ -13936,8 +13937,65 @@ cost_plus:
>> >                         mode, MULT, 1, speed);
>> >            return true;
>> >          }
>> > +  break;
>> > +    case PARALLEL:
>> > +      /* Fall through */
>> 
>> Which code paths lead to getting a PARALLEL here?
>
> Hi,
>
> Thanks for the review!
>
> I added it for completeness, because CSE treats a PARALLEL and a CONST_VECTOR
> as equivalent when each entry in the parallel defines a constant.

Could you test whether it ever triggers in practice though?
The code would be much simpler without it.

>> > +    case CONST_VECTOR:
>> > +  {
>> > +    rtx gen_insn = aarch64_simd_make_constant (x, true);
>> > +    /* Not a valid const vector.  */
>> > +    if (!gen_insn)
>> > +      break;
>> >
>> > -      /* Fall through.  */
>> > +    switch (GET_CODE (gen_insn))
>> > +    {
>> > +    case CONST_VECTOR:
>> > +      /* Load using MOVI/MVNI.  */
>> > +      if (aarch64_simd_valid_immediate (x, NULL))
>> > +        *cost += extra_cost->vect.movi;
>> > +      else /* Load using constant pool.  */
>> > +        *cost += extra_cost->ldst.load;
>> > +      break;
>> > +    /* Load using a DUP.  */
>> > +    case VEC_DUPLICATE:
>> > +      *cost += extra_cost->vect.dup;
>> > +      break;
>> 
>> Does this trigger in practice?  The new check==true path (rightly) stops the
>> duplicated element from being forced into a register, but then I would have
>> expected:
>> 
>> rtx
>> gen_vec_duplicate (machine_mode mode, rtx x) {
>>   if (valid_for_const_vector_p (mode, x))
>>     return gen_const_vec_duplicate (mode, x);
>>   return gen_rtx_VEC_DUPLICATE (mode, x); }
>> 
>> to generate the original CONST_VECTOR again.
>
> Yes, but CSE is trying to see whether using a DUP is cheaper than another
> instruction.  Normal code won't hit this, but CSE is just costing all the
> different ways one can semantically construct a vector; which RTL actually
> comes out of it depends on how it's folded, as you say.

But what I mean is, you call:

          rtx gen_insn = aarch64_simd_make_constant (x, true);
          /* Not a valid const vector.  */
          if (!gen_insn)
            break;

where aarch64_simd_make_constant does:

  if (CONST_VECTOR_P (vals))
    const_vec = vals;
  else if (GET_CODE (vals) == PARALLEL)
    {
      /* A CONST_VECTOR must contain only CONST_INTs and
         CONST_DOUBLEs, but CONSTANT_P allows more (e.g. SYMBOL_REF).
         Only store valid constants in a CONST_VECTOR.  */
      int n_elts = XVECLEN (vals, 0);
      for (i = 0; i < n_elts; ++i)
        {
          rtx x = XVECEXP (vals, 0, i);
          if (CONST_INT_P (x) || CONST_DOUBLE_P (x))
            n_const++;
        }
      if (n_const == n_elts)
        const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
    }
  else
    gcc_unreachable ();

  if (const_vec != NULL_RTX
      && aarch64_simd_valid_immediate (const_vec, NULL))
    /* Load using MOVI/MVNI.  */
    return const_vec;
  else if ((const_dup = aarch64_simd_dup_constant (vals, check)) != NULL_RTX)
    /* Loaded using DUP.  */
    return const_dup;

and aarch64_simd_dup_constant does:

  machine_mode mode = GET_MODE (vals);
  machine_mode inner_mode = GET_MODE_INNER (mode);
  rtx x;

  if (!const_vec_duplicate_p (vals, &x))
    return NULL_RTX;

  /* We can load this constant by using DUP and a constant in a
     single ARM register.  This will be cheaper than a vector
     load.  */
  if (!check)
    x = copy_to_mode_reg (inner_mode, x);
  return gen_vec_duplicate (mode, x);

For the “check” case, “x” will be a constant, and so gen_vec_duplicate
will call gen_const_vec_duplicate, which will return a CONST_VECTOR.
It didn't seem to be possible for gen_insn to be a VEC_DUPLICATE.

This would be much simpler if we could call aarch64_simd_valid_immediate
and aarch64_simd_dup_constant directly from the rtx cost code, hence the
question about whether the PARALLEL stuff was really needed in practice.
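For reference, the classification the costing hunk is aiming for can be modelled outside of RTL.  This is a hypothetical C sketch, not GCC code: `classify_const_vector` is an invented name, and "every lane fits in one byte" is only a stand-in approximation for what aarch64_simd_valid_immediate really checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How would a constant vector be materialised?  Mirrors the intended
   cost buckets in the patch: MOVI/MVNI, DUP from a GP register, or a
   load from the constant pool.  */
enum vec_cost_class { VEC_MOVI, VEC_DUP, VEC_LOAD };

static bool
all_lanes_equal (const long *lanes, size_t n)
{
  for (size_t i = 1; i < n; i++)
    if (lanes[i] != lanes[0])
      return false;
  return true;
}

static enum vec_cost_class
classify_const_vector (const long *lanes, size_t n)
{
  /* Stand-in for aarch64_simd_valid_immediate: treat a replicated
     byte-sized value as encodable with MOVI/MVNI.  */
  bool valid_imm = all_lanes_equal (lanes, n)
		   && lanes[0] >= 0 && lanes[0] <= 0xff;
  if (valid_imm)
    return VEC_MOVI;	/* costed as extra_cost->vect.movi */
  if (all_lanes_equal (lanes, n))
    return VEC_DUP;	/* costed as extra_cost->vect.dup */
  return VEC_LOAD;	/* costed as extra_cost->ldst.load */
}
```

As the walkthrough above shows, in the real check==true path a duplicated constant comes back from gen_vec_duplicate as a CONST_VECTOR again, so the DUP case here would only be reachable for a splat that is not a valid immediate.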

>> > +    default:
>> > +      *cost += extra_cost->ldst.load;
>> > +      break;
>> > +    }
>> > +    return true;
>> > +  }
>> > +    case VEC_CONCAT:
>> > +  /* depending on the operation, either DUP or INS.
>> > +     For now, keep default costing.  */
>> > +  break;
>> > +    case VEC_DUPLICATE:
>> > +  *cost += extra_cost->vect.dup;
>> > +  return true;
>> > +    case VEC_SELECT:
>> > +  {
>> > +    /* cost subreg of 0 as free, otherwise as DUP */
>> > +    rtx op1 = XEXP (x, 1);
>> > +    int nelts;
>> > +    if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
>> > +        || (BYTES_BIG_ENDIAN
>> > +            && GET_MODE_NUNITS (mode).is_constant(&nelts)
>> > +            && INTVAL (op1) == nelts - 1))
>> > +      ;
>> > +    else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
>> > +      ;
>> > +    else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
>> > +    /* Selecting the high part is not technically free, but we lack
>> > +       enough information to decide that here.  For instance selecting
>> > +       the high-part of a vec_dup *is* free or to feed into any _high
>> > +       instruction.   Both of which we can't really tell.  That said
>> > +       have a better chance to optimize an dup vs multiple constants.  */
>> > +      ;
>> 
>> Not sure about this.  We already try to detect the latter case (_high
>> instructions) via aarch64_strip_extend_vec_half.  We might be missing some
>> cases, but that still feels like the right way to go IMO.
>
> That's a different problem from what I understand.  What this is trying to
> say is that if you have a vector [x y a b] and you need vector [x y], you
> can use the top part of the original vector for this.
>
> This is an approximation, because something that can be created with a movi
> is probably cheaper to keep distinct if it's not going to be paired with a
> _high operation (since you will have a dup then).
>
> The problem is that the front end has already split the two vectors into
> [x y a b] and [x y].  There's nothing else that tries to consolidate them
> back up if both survive.
>
> As a consequence of this, the testcase test0 is not handled optimally.  It
> would instead create two vectors, both movi 0x3, just one being 64 bits and
> one being 128 bits.
>
> So if the cost of selecting it is cheaper than the movi, cse will not
> consolidate the vectors, and because movis are so cheap, the only cost that
> worked was 0.  But increasing the cost of movis requires the costs of
> everything to be increased (including loads).
>
> I preferred to zero out the cost, because the worst that can happen is a dup
> instead of a movi, and at best a dup instead of a load from the constant
> pool (if the constant is complicated).
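The consolidation opportunity being described can be illustrated with a small hypothetical model (plain C arrays standing in for vector registers; `narrow_is_low_half` is an invented helper, not anything in GCC): the 64-bit splat of 3 is bit-for-bit the low half of the 128-bit splat of 3, which is what a zero-cost low-part VEC_SELECT would let CSE notice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Model of the test0 situation: does the narrow constant coincide
   with the low lanes of the wide one?  If so, CSE could keep only
   the wide constant and select its low part for free.  */
static bool
narrow_is_low_half (const int *wide, const int *narrow, size_t nnarrow)
{
  return memcmp (wide, narrow, nnarrow * sizeof (int)) == 0;
}
```

Under this model, `{3, 3, 3, 3}` (movi v0.4s, 0x3) subsumes `{3, 3}` (movi v1.2s, 0x3), whereas an unrelated narrow constant would still need its own materialisation.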

Hmm, will need to look at this more tomorrow.

>> Selecting the high part of a vec_dup should get folded into another vec_dup.
>> 
>> The lowpart bits look OK, but which paths call this function without first
>> simplifying the select to a subreg?  The subreg is now the canonical form
>> (thanks to r12-2288).
>
> The simplification will happen during folding in cse or in combine.  This
> costing happens before the folding, when CSE is trying to decide whether to
> undo the front end's lowering of constants.
>
> To do so it models the constants and the semantic operation required to
> extract them.  E.g. to get 2 out of [0 2 4 5] it would need a VEC_SELECT of
> 1.  And I don't treat the first element/bottom part specially here;
> costing-wise they would be the same.

But which code path creates the VEC_SELECT?  We don't need any context
to know that the VEC_SELECT is non-canonical.  It's obvious from the
operands of the VEC_SELECT in isolation.

I'd just rather tackle this at source than try to get the cost code
to handle non-canonical rtl.
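The point that the non-canonical form is recognisable from the operands alone can be modelled without any RTL context.  This is a hypothetical sketch (`lowpart_select_p` is an invented name, loosely in the spirit of vec_series_lowpart_p): a VEC_SELECT whose selector picks lanes 0, 1, ..., n-1 in order carries no information beyond a lowpart subreg, so it can be flagged in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if selecting lanes sel[0..nsel) is just the low part
   of the source vector, i.e. the selector is 0, 1, ..., nsel-1.
   Such a select is canonically a subreg (r12-2288), so no costing
   context is needed to recognise it.  */
static bool
lowpart_select_p (const size_t *sel, size_t nsel)
{
  for (size_t i = 0; i < nsel; i++)
    if (sel[i] != i)
      return false;
  return true;
}
```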

Thanks,
Richard
