Tamar Christina <[email protected]> writes:
> @@ -13936,8 +13937,65 @@ cost_plus:
>                              mode, MULT, 1, speed);
>          return true;
>        }
> +      break;
> +    case PARALLEL:
> +      /* Fall through.  */
Which code paths lead to getting a PARALLEL here?
> +    case CONST_VECTOR:
> +      {
> +        rtx gen_insn = aarch64_simd_make_constant (x, true);
> +        /* Not a valid const vector.  */
> +        if (!gen_insn)
> +          break;
>
> -      /* Fall through.  */
> +        switch (GET_CODE (gen_insn))
> +          {
> +          case CONST_VECTOR:
> +            /* Load using MOVI/MVNI.  */
> +            if (aarch64_simd_valid_immediate (x, NULL))
> +              *cost += extra_cost->vect.movi;
> +            else /* Load using constant pool.  */
> +              *cost += extra_cost->ldst.load;
> +            break;
> +          case VEC_DUPLICATE:
> +            /* Load using a DUP.  */
> +            *cost += extra_cost->vect.dup;
> +            break;
Does this trigger in practice? The new check==true path (rightly) stops
the duplicated element from being forced into a register, but then
I would have expected:
rtx
gen_vec_duplicate (machine_mode mode, rtx x)
{
  if (valid_for_const_vector_p (mode, x))
    return gen_const_vec_duplicate (mode, x);
  return gen_rtx_VEC_DUPLICATE (mode, x);
}
to generate the original CONST_VECTOR again.
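A minimal illustration of that expectation (not from the patch; it
assumes a plain CONST_INT element, which valid_for_const_vector_p
accepts):

  rtx elt = GEN_INT (1);
  rtx dup = gen_vec_duplicate (V4SImode, elt);
  /* Expected: GET_CODE (dup) == CONST_VECTOR rather than VEC_DUPLICATE,
     so the VEC_DUPLICATE arm of the inner switch would not be reached
     for constant elements.  */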
> +          default:
> +            *cost += extra_cost->ldst.load;
> +            break;
> +          }
> +        return true;
> +      }
> +    case VEC_CONCAT:
> +      /* Depending on the operation, either a DUP or an INS.
> +         For now, keep default costing.  */
> +      break;
> +    case VEC_DUPLICATE:
> +      *cost += extra_cost->vect.dup;
> +      return true;
> +    case VEC_SELECT:
> +      {
> +        /* Cost a select of the low part (a subreg) as free, otherwise
> +           as a DUP.  */
> +        rtx op1 = XEXP (x, 1);
> +        int nelts;
> +        if ((op1 == const0_rtx && !BYTES_BIG_ENDIAN)
> +            || (BYTES_BIG_ENDIAN
> +                && GET_MODE_NUNITS (mode).is_constant (&nelts)
> +                && INTVAL (op1) == nelts - 1))
> +          ;
> +        else if (vec_series_lowpart_p (mode, GET_MODE (op1), op1))
> +          ;
> +        else if (vec_series_highpart_p (mode, GET_MODE (op1), op1))
> +          /* Selecting the high part is not technically free, but we lack
> +             enough information to decide that here.  For instance,
> +             selecting the high part of a vec_dup *is* free, as is
> +             selecting it to feed into a _high instruction, but we cannot
> +             tell either case apart here.  That said, this gives us a
> +             better chance to optimize a dup vs. multiple constants.  */
> +          ;
Not sure about this. We already try to detect the latter case
(_high instructions) via aarch64_strip_extend_vec_half. We might
be missing some cases, but that still feels like the right way
to go IMO.
Selecting the high part of a vec_dup should get folded into
another vec_dup.
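For reference, a sketch of the single-lane version of that fold, which
simplify-rtx should already perform (illustrative GCC-internals code,
not part of the patch):

  rtx x = gen_reg_rtx (SImode);
  rtx dup = gen_rtx_VEC_DUPLICATE (V4SImode, x);
  rtx sel = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (1, GEN_INT (2)));
  /* Selecting any single lane of a dup should fold back to x.  */
  rtx r = simplify_binary_operation (VEC_SELECT, SImode, dup, sel);

By the same principle, a multi-lane high-part select from a dup should
simplify to a vec_duplicate of the narrower mode.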
The lowpart bits look OK, but which paths call this function
without first simplifying the select to a subreg? The subreg
is now the canonical form (thanks to r12-2288).
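For example (illustrative, not from the patch), taking the low half via
gen_lowpart should already produce the canonical form:

  rtx v = gen_reg_rtx (V4SImode);
  /* On little-endian this is (subreg:V2SI (reg:V4SI ...) 0),
     not a vec_select of lanes 0 and 1.  */
  rtx lo = gen_lowpart (V2SImode, v);

which would make the lowpart arms above mostly dead.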
Thanks,
Richard