On Wed, Dec 18, 2024 at 6:30 PM Jennifer Schmitz <[email protected]> wrote:
>
>
>
> > On 17 Dec 2024, at 18:57, Richard Biener <[email protected]> wrote:
> >
> >
> >> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <[email protected]>:
> >>
> >>
> >>
> >>> On 14 Dec 2024, at 09:32, Richard Biener <[email protected]> wrote:
> >>>
> >>>
> >>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <[email protected]>:
> >>>>
> >>>>
> >>>>
> >>>>> On 13 Dec 2024, at 13:40, Richard Biener <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>
> >>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <[email protected]> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Jennifer Schmitz <[email protected]> writes:
> >>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote:
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford
> >>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Jennifer Schmitz <[email protected]> writes:
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of
> >>>>>>>>>>>>> the diff for strided_store_2.c), it seemed odd that
> >>>>>>>>>>>>> vec_to_scalar operations cost 0 now, instead of the previous
> >>>>>>>>>>>>> cost of 2:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation
> >>>>>>>>>>>>> ===
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost:
> >>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 .
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6
> >>>>>>>>>>>>> = _7;
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand
> >>>>>>>>>>>>> _3 + 1.0e+0, type of def: internal
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned
> >>>>>>>>>>>>> access.
> >>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128
> >>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234
> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost:
> >>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 .
> >>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body
> >>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
> >>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body
> >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body
> >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in
> >>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this
> >>>>>>>>>>>>> behavior is this one:
> >>>>>>>>>>>>> unsigned
> >>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count,
> >>>>>>>>>>>>> vect_cost_for_stmt kind,
> >>>>>>>>>>>>> stmt_vec_info stmt_info, slp_tree,
> >>>>>>>>>>>>> tree vectype, int misalign,
> >>>>>>>>>>>>> vect_cost_model_location where)
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO
> >>>>>>>>>>>>> instead
> >>>>>>>>>>>>> of just looking at KIND. */
> >>>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>>>> + if (stmt_info)
> >>>>>>>>>>>>> {
> >>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>>>>>>>> vec_to_scalar for each element. However, we can store the first
> >>>>>>>>>>>>> element using an FP store without a separate extract step. */
> >>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>>>> count -= 1;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>>>> stmt_info,
> >>>>>>>>>>>>> stmt_cost);
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if (vectype && m_vec_flags)
> >>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>>>> stmt_info,
> >>>>>>>>>>>>> vectype,
> >>>>>>>>>>>>> where,
> >>>>>>>>>>>>> stmt_cost);
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count *
> >>>>>>>>>>>>> stmt_cost).ceil ());
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of
> >>>>>>>>>>>>> 2 for a vec_to_scalar operation in the vect body. Now "if
> >>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction
> >>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to
> >>>>>>>>>>>>> 0 and leads to a return value of 0.
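> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Spelled out for a single element (just a sketch of the path above, not
> >>>>>>>>>>>>> the exact source), the arithmetic is:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   count = 1;  /* the vectorizer records one vec_to_scalar per element */
> >>>>>>>>>>>>>   if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>>>>     count -= 1;  /* count is now 0 */
> >>>>>>>>>>>>>   ...
> >>>>>>>>>>>>>   return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
> >>>>>>>>>>>>>   /* 0 * stmt_cost == 0, hence the zero-cost vec_to_scalar in the dump. */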
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the time the code was written, a scalarised store would be costed
> >>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count parameter
> >>>>>>>>>>>> set to the number of elements being stored. The "count -= 1" was
> >>>>>>>>>>>> supposed to lop off the leading element extraction, since we can store
> >>>>>>>>>>>> lane 0 as a normal FP store.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The target-independent costing was later reworked so that it costs
> >>>>>>>>>>>> each operation individually:
> >>>>>>>>>>>>
> >>>>>>>>>>>> for (i = 0; i < nstores; i++)
> >>>>>>>>>>>> {
> >>>>>>>>>>>> if (costing_p)
> >>>>>>>>>>>> {
> >>>>>>>>>>>> /* Only need vector extracting when there are more
> >>>>>>>>>>>> than one stores. */
> >>>>>>>>>>>> if (nstores > 1)
> >>>>>>>>>>>> inside_cost
> >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>>>>>>>>>>> stmt_info, 0, vect_body);
> >>>>>>>>>>>> /* Take a single lane vector type store as scalar
> >>>>>>>>>>>> store to avoid ICE like 110776. */
> >>>>>>>>>>>> if (VECTOR_TYPE_P (ltype)
> >>>>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>>>>>>>>>> n_adjacent_stores++;
> >>>>>>>>>>>> else
> >>>>>>>>>>>> inside_cost
> >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>>>>>>>>> stmt_info, 0, vect_body);
> >>>>>>>>>>>> continue;
> >>>>>>>>>>>> }
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular call
> >>>>>>>>>>>> is part of a group, and if so, which member of the group it is.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate
> >>>>>>>>>>>> and just disable the optimisation. Or we could restrict it to count > 1,
> >>>>>>>>>>>> since it might still be useful for gathers and scatters.
> >>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to
> >>>>>>>>>>> count > 1 and it seems to resolve the issue of costing
> >>>>>>>>>>> vec_to_scalar operations with 0 (see patch below).
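> >>>>>>>>>>> In essence it is just an extra guard in add_stmt_cost, along these
> >>>>>>>>>>> lines (a sketch, not the exact hunk):
> >>>>>>>>>>>
> >>>>>>>>>>> -      if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>> +      if (count > 1 && vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>>          count -= 1;
> >>>>>>>>>>>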
> >>>>>>>>>>> What are your thoughts on this?
> >>>>>>>>>>
> >>>>>>>>>> Why didn't you instead pursue moving the vec_to_scalar cost together
> >>>>>>>>>> with the n_adjacent_store handling?
> >>>>>>>>> When I continued working on this patch, we had already reached
> >>>>>>>>> stage 3 and I was hesitant to introduce changes to the middle-end
> >>>>>>>>> that were not previously covered by this patch. So I tried whether the
> >>>>>>>>> issue could be resolved by making a small change in the backend.
> >>>>>>>>> If you still advise using n_adjacent_store instead, I’m happy
> >>>>>>>>> to look into it again.
> >>>>>>>>
> >>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which
> >>>>>>>> it sounds like he is), then I agree that would be better. Otherwise
> >>>>>>>> we'd be creating technical debt to clean up for GCC 16. And it is a
> >>>>>>>> regression of sorts, so is stage 3 material from that POV.
> >>>>>>>>
> >>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a
> >>>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning
> >>>>>>>> for a new CPU late during the cycle. But of course, there were other
> >>>>>>>> priorities when stage 1 actually came around, so it never actually
> >>>>>>>> happened. Thanks again for being the one to sort this out.)
> >>>>>>> Thanks for your feedback. Then I will try to make it work in
> >>>>>>> vectorizable_store.
> >>>>>>> Best,
> >>>>>>> Jennifer
> >>>>>> Below is the updated patch with a suggestion for the changes in
> >>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar
> >>>>>> operations that were individually costed with 0.
> >>>>>> We already tested it on aarch64, no regression, but we are still doing
> >>>>>> performance testing.
> >>>>>> Can you give some feedback in the meantime on the patch itself?
> >>>>>> Thanks,
> >>>>>> Jennifer
> >>>>>>
> >>>>>>
> >>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
> >>>>>> and
> >>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and
> >>>>>> its uses
> >>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
> >>>>>> described in
> >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
> >>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
> >>>>>> are not costed individually, but as a group.
> >>>>>>
> >>>>>> Two tests were adjusted due to changes in codegen. In both cases, the
> >>>>>> old code performed loop unrolling once, but the new code does not:
> >>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>>>> -moverride=tune=none):
> >>>>>> f_int64_t_32:
> >>>>>> cbz w3, .L92
> >>>>>> mov x4, 0
> >>>>>> uxtw x3, w3
> >>>>>> + cntd x5
> >>>>>> + whilelo p7.d, xzr, x3
> >>>>>> + mov z29.s, w5
> >>>>>> mov z31.s, w2
> >>>>>> - whilelo p6.d, xzr, x3
> >>>>>> - mov x2, x3
> >>>>>> - index z30.s, #0, #1
> >>>>>> - uqdecd x2
> >>>>>> - ptrue p5.b, all
> >>>>>> - whilelo p7.d, xzr, x2
> >>>>>> + index z30.d, #0, #1
> >>>>>> + ptrue p6.b, all
> >>>>>> .p2align 3,,7
> >>>>>> .L94:
> >>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
> >>>>>> - ld1d z28.d, p6/z, [x0]
> >>>>>> - movprfx z29, z31
> >>>>>> - mul z29.s, p5/m, z29.s, z30.s
> >>>>>> - incw x4
> >>>>>> - uunpklo z0.d, z29.s
> >>>>>> - uunpkhi z29.d, z29.s
> >>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>>>> - add z25.d, z28.d, z25.d
> >>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
> >>>>>> + movprfx z28, z31
> >>>>>> + mul z28.s, p6/m, z28.s, z30.s
> >>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>>> add z26.d, z27.d, z26.d
> >>>>>> - st1d z26.d, p7, [x0, #1, mul vl]
> >>>>>> - whilelo p7.d, x4, x2
> >>>>>> - st1d z25.d, p6, [x0]
> >>>>>> - incw z30.s
> >>>>>> - incb x0, all, mul #2
> >>>>>> - whilelo p6.d, x4, x3
> >>>>>> + st1d z26.d, p7, [x0, x4, lsl 3]
> >>>>>> + add z30.s, z30.s, z29.s
> >>>>>> + incd x4
> >>>>>> + whilelo p7.d, x4, x3
> >>>>>> b.any .L94
> >>>>>> .L92:
> >>>>>> ret
> >>>>>>
> >>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>>>> -moverride=tune=none):
> >>>>>> f_int64_t_32:
> >>>>>> cbz w3, .L84
> >>>>>> - addvl x5, x1, #1
> >>>>>> mov x4, 0
> >>>>>> uxtw x3, w3
> >>>>>> - mov z31.s, w2
> >>>>>> + cntd x5
> >>>>>> whilelo p7.d, xzr, x3
> >>>>>> - mov x2, x3
> >>>>>> - index z30.s, #0, #1
> >>>>>> - uqdecd x2
> >>>>>> - ptrue p5.b, all
> >>>>>> - whilelo p6.d, xzr, x2
> >>>>>> + mov z29.s, w5
> >>>>>> + mov z31.s, w2
> >>>>>> + index z30.d, #0, #1
> >>>>>> + ptrue p6.b, all
> >>>>>> .p2align 3,,7
> >>>>>> .L86:
> >>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
> >>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
> >>>>>> - movprfx z29, z30
> >>>>>> - mul z29.s, p5/m, z29.s, z31.s
> >>>>>> - add z28.d, z28.d, #1
> >>>>>> - uunpklo z26.d, z29.s
> >>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
> >>>>>> - incw x4
> >>>>>> - uunpkhi z29.d, z29.s
> >>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
> >>>>>> + movprfx z28, z30
> >>>>>> + mul z28.s, p6/m, z28.s, z31.s
> >>>>>> add z27.d, z27.d, #1
> >>>>>> - whilelo p6.d, x4, x2
> >>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
> >>>>>> - incw z30.s
> >>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
> >>>>>> + incd x4
> >>>>>> + add z30.s, z30.s, z29.s
> >>>>>> whilelo p7.d, x4, x3
> >>>>>> b.any .L86
> >>>>>> .L84:
> >>>>>> ret
> >>>>>>
> >>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>>>> regression.
> >>>>>> OK for mainline?
> >>>>>>
> >>>>>> Signed-off-by: Jennifer Schmitz <[email protected]>
> >>>>>>
> >>>>>> gcc/
> >>>>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> >>>>>> n_adjacent_stores to also cover vec_to_scalar operations.
> >>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>>>> use_new_vector_costs as tuning option.
> >>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>>>> Remove.
> >>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>>>> aarch64_use_new_vector_costs_p.
> >>>>>> (aarch64_vector_costs::finish_cost): Remove use of
> >>>>>> aarch64_use_new_vector_costs_p.
> >>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>>>
> >>>>>> gcc/testsuite/
> >>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> >>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>>>> ---
> >>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 --
> >>>>>> gcc/config/aarch64/aarch64.cc | 20 +++----------
> >>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
> >>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
> >>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
> >>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
> >>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
> >>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
> >>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
> >>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
> >>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
> >>>>>> gcc/tree-vect-stmts.cc | 29 ++++++++++---------
> >>>>>> 16 files changed, 22 insertions(+), 44 deletions(-)
> >>>>>>
> >>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> index ffbff20e29c..1de633c739b 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> >>>>>> CHEAP_SHIFT_EXTEND)
> >>>>>>
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> >>>>>> CSE_SVE_VL_CONSTANTS)
> >>>>>>
> >>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs",
> >>>>>> USE_NEW_VECTOR_COSTS)
> >>>>>> -
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
> >>>>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>>>
> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma",
> >>>>>> AVOID_CROSS_LOOP_FMA)
> >>>>>> diff --git a/gcc/config/aarch64/aarch64.cc
> >>>>>> b/gcc/config/aarch64/aarch64.cc
> >>>>>> index 77a2a6bfa3a..71fba9cc63b 100644
> >>>>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info
> >>>>>> *vinfo, bool costing_for_scalar)
> >>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>>>> }
> >>>>>>
> >>>>>> -/* Return true if the current CPU should use the new costs defined
> >>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the
> >>>>>> - costs applying to all CPUs instead. */
> >>>>>> -static bool
> >>>>>> -aarch64_use_new_vector_costs_p ()
> >>>>>> -{
> >>>>>> - return (aarch64_tune_params.extra_tuning_flags
> >>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>>>> -}
> >>>>>> -
> >>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
> >>>>>> static const simd_vec_cost *
> >>>>>> aarch64_simd_vec_costs (tree vectype)
> >>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>>
> >>>>>> /* Do one-time initialization based on the vinfo. */
> >>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>>>> + if (!m_analyzed_vinfo)
> >>>>>> {
> >>>>>> if (loop_vinfo)
> >>>>>> analyze_loop_vinfo (loop_vinfo);
> >>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>>
> >>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>> of just looking at KIND. */
> >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>> + if (stmt_info)
> >>>>>> {
> >>>>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>>>> vec_to_scalar for each element. However, we can store the first
> >>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int
> >>>>>> count, vect_cost_for_stmt kind,
> >>>>>> else
> >>>>>> m_num_last_promote_demote = 0;
> >>>>>>
> >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>> + if (stmt_info)
> >>>>>> {
> >>>>>> /* Account for any extra "embedded" costs that apply additively
> >>>>>> to the base cost calculated above. */
> >>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const
> >>>>>> vector_costs *uncast_scalar_costs)
> >>>>>>
> >>>>>> auto *scalar_costs
> >>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>>>> - if (loop_vinfo
> >>>>>> - && m_vec_flags
> >>>>>> - && aarch64_use_new_vector_costs_p ())
> >>>>>> + if (loop_vinfo && m_vec_flags)
> >>>>>> {
> >>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>>> m_costs[vect_body]);
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> index b2ff716157a..0a8eff69307 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> index 2d704ecd110..a564528f43d 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>> @@ -55,7 +55,6 @@ static const struct tune_params
> >>>>>> fujitsu_monaka_tunings =
> >>>>>> 0, /* max_case_values. */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> index bdd309ab03d..f090d5cde50 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>> @@ -183,7 +183,6 @@ static const struct tune_params
> >>>>>> generic_armv8_a_tunings =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> index a05a9ab92a2..4c33c147444 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>> @@ -249,7 +249,6 @@ static const struct tune_params
> >>>>>> generic_armv9_a_tunings =
> >>>>>> 0, /* max_case_values. */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>>>> &generic_armv9a_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> index c407b89a22f..fe4f7c10f73 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>> @@ -156,7 +156,6 @@ static const struct tune_params
> >>>>>> neoverse512tvb_tunings =
> >>>>>> 0, /* max_case_values. */
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> index fd5f8f37370..0c74068da2c 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> index 8b156c2fe4d..9d4e1be171a 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> index 23c121d8652..85a78bb2bef 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> index 40af5f47f4f..1dd452beb8d 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> index d65d74bfecf..d0ba5b1aef6 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings
> >>>>>> =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> index 7b7fa0b4b08..a1572048503 100644
> >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>> neoversev3ae_tunings =
> >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>> (AARCH64_EXTRA_TUNE_BASE
> >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>> &generic_prefetch_tune,
> >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> index 762805ff54b..c334b7a6875 100644
> >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>> @@ -15,4 +15,4 @@
> >>>>>> so we vectorize the offset calculation. This means that the
> >>>>>> 64-bit version needs two copies. */
> >>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
> >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> index f0ea58e38e2..94cc63049bc 100644
> >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>> @@ -15,4 +15,4 @@
> >>>>>> so we vectorize the offset calculation. This means that the
> >>>>>> 64-bit version needs two copies. */
> >>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7],
> >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
> >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >>>>>> index be1139a423c..6d7d28c4702 100644
> >>>>>> --- a/gcc/tree-vect-stmts.cc
> >>>>>> +++ b/gcc/tree-vect-stmts.cc
> >>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo,
> >>>>>> {
> >>>>>> if (costing_p)
> >>>>>> {
> >>>>>> - /* Only need vector extracting when there are
> >>>>>> more
> >>>>>> - than one stores. */
> >>>>>> - if (nstores > 1)
> >>>>>> - inside_cost
> >>>>>> - += record_stmt_cost (cost_vec, 1,
> >>>>>> vec_to_scalar,
> >>>>>> - stmt_info, slp_node,
> >>>>>> - 0, vect_body);
> >>>>>> /* Take a single lane vector type store as scalar
> >>>>>> store to avoid ICE like 110776. */
> >>>>>> - if (VECTOR_TYPE_P (ltype)
> >>>>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype),
> >>>>>> 1U))
> >>>>>> + bool single_lane_vec_p =
> >>>>>> + VECTOR_TYPE_P (ltype)
> >>>>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U);
> >>>>>> + /* Only need vector extracting when there are
> >>>>>> more
> >>>>>> + than one stores. */
> >>>>>> + if (nstores > 1 || single_lane_vec_p)
> >>>>>> n_adjacent_stores++;
> >>>>>> - else
> >>>>>> + if (!single_lane_vec_p)
> >>>>>
> >>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p
> >>>>> correlate. In fact I think that we always record a store, just for
> >>>>> single-element vectors we record scalar stores. I suggest to always just
> >>>>> do n_adjacent_stores++ here and below ...
> >>>>>
> >>>>>> inside_cost
> >>>>>> += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>>>> stmt_info, 0, vect_body);
> >>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo,
> >>>>>> if (costing_p)
> >>>>>> {
> >>>>>> if (n_adjacent_stores > 0)
> >>>>>> - vect_get_store_cost (vinfo, stmt_info, slp_node,
> >>>>>> n_adjacent_stores,
> >>>>>> - alignment_support_scheme,
> >>>>>> misalignment,
> >>>>>> - &inside_cost, cost_vec);
> >>>>>> + {
> >>>>>> + vect_get_store_cost (vinfo, stmt_info, slp_node,
> >>>>>> n_adjacent_stores,
> >>>>>> + alignment_support_scheme,
> >>>>>> misalignment,
> >>>>>> + &inside_cost, cost_vec);
> >>>>>
> >>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and
> >>>>> record n_adjacent_stores vec_to_scalar if nstores > 1 (and else none).
> >>>>>
> >>>>> Richard.
> >>>> Thanks for the feedback, I’m glad it’s going in the right direction.
> >>>> Below is the updated patch, re-validated on aarch64.
> >>>> Thanks, Jennifer
> >>>>
> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
> >>>> and
> >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>> default. To that end, the function aarch64_use_new_vector_costs_p and
> >>>> its uses
> >>>> were removed. To prevent costing vec_to_scalar operations with 0, as
> >>>> described in
> >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores
> >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations
> >>>> are not costed individually, but as a group.
> >>>>
> >>>> Two tests were adjusted due to changes in codegen. In both cases, the
> >>>> old code performed loop unrolling once, but the new code does not:
> >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>> -moverride=tune=none):
> >>>> f_int64_t_32:
> >>>> cbz w3, .L92
> >>>> mov x4, 0
> >>>> uxtw x3, w3
> >>>> + cntd x5
> >>>> + whilelo p7.d, xzr, x3
> >>>> + mov z29.s, w5
> >>>> mov z31.s, w2
> >>>> - whilelo p6.d, xzr, x3
> >>>> - mov x2, x3
> >>>> - index z30.s, #0, #1
> >>>> - uqdecd x2
> >>>> - ptrue p5.b, all
> >>>> - whilelo p7.d, xzr, x2
> >>>> + index z30.d, #0, #1
> >>>> + ptrue p6.b, all
> >>>> .p2align 3,,7
> >>>> .L94:
> >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
> >>>> - ld1d z28.d, p6/z, [x0]
> >>>> - movprfx z29, z31
> >>>> - mul z29.s, p5/m, z29.s, z30.s
> >>>> - incw x4
> >>>> - uunpklo z0.d, z29.s
> >>>> - uunpkhi z29.d, z29.s
> >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>> - add z25.d, z28.d, z25.d
> >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
> >>>> + movprfx z28, z31
> >>>> + mul z28.s, p6/m, z28.s, z30.s
> >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>> add z26.d, z27.d, z26.d
> >>>> - st1d z26.d, p7, [x0, #1, mul vl]
> >>>> - whilelo p7.d, x4, x2
> >>>> - st1d z25.d, p6, [x0]
> >>>> - incw z30.s
> >>>> - incb x0, all, mul #2
> >>>> - whilelo p6.d, x4, x3
> >>>> + st1d z26.d, p7, [x0, x4, lsl 3]
> >>>> + add z30.s, z30.s, z29.s
> >>>> + incd x4
> >>>> + whilelo p7.d, x4, x3
> >>>> b.any .L94
> >>>> .L92:
> >>>> ret
> >>>>
> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>> -moverride=tune=none):
> >>>> f_int64_t_32:
> >>>> cbz w3, .L84
> >>>> - addvl x5, x1, #1
> >>>> mov x4, 0
> >>>> uxtw x3, w3
> >>>> - mov z31.s, w2
> >>>> + cntd x5
> >>>> whilelo p7.d, xzr, x3
> >>>> - mov x2, x3
> >>>> - index z30.s, #0, #1
> >>>> - uqdecd x2
> >>>> - ptrue p5.b, all
> >>>> - whilelo p6.d, xzr, x2
> >>>> + mov z29.s, w5
> >>>> + mov z31.s, w2
> >>>> + index z30.d, #0, #1
> >>>> + ptrue p6.b, all
> >>>> .p2align 3,,7
> >>>> .L86:
> >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
> >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
> >>>> - movprfx z29, z30
> >>>> - mul z29.s, p5/m, z29.s, z31.s
> >>>> - add z28.d, z28.d, #1
> >>>> - uunpklo z26.d, z29.s
> >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
> >>>> - incw x4
> >>>> - uunpkhi z29.d, z29.s
> >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
> >>>> + movprfx z28, z30
> >>>> + mul z28.s, p6/m, z28.s, z31.s
> >>>> add z27.d, z27.d, #1
> >>>> - whilelo p6.d, x4, x2
> >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
> >>>> - incw z30.s
> >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
> >>>> + incd x4
> >>>> + add z30.s, z30.s, z29.s
> >>>> whilelo p7.d, x4, x3
> >>>> b.any .L86
> >>>> .L84:
> >>>> ret
> >>>>
> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>> regression.
> >>>> OK for mainline?
> >>>>
> >>>> Signed-off-by: Jennifer Schmitz <[email protected]>
> >>>>
> >>>> gcc/
> >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> >>>> n_adjacent_stores to also cover vec_to_scalar operations.
> >>>> * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>> use_new_vector_costs as tuning option.
> >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>> Remove.
> >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>> aarch64_use_new_vector_costs_p.
> >>>> (aarch64_vector_costs::finish_cost): Remove use of
> >>>> aarch64_use_new_vector_costs_p.
> >>>> * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>
> >>>> gcc/testsuite/
> >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>> ---
> >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -
> >>>> gcc/config/aarch64/aarch64.cc | 20 ++--------
> >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
> >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
> >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
> >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
> >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
> >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
> >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
> >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
> >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
> >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
> >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
> >>>> gcc/tree-vect-stmts.cc | 37 +++++++++++--------
> >>>> 16 files changed, 27 insertions(+), 47 deletions(-)
> >>>>
> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> index ffbff20e29c..1de633c739b 100644
> >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> >>>> CHEAP_SHIFT_EXTEND)
> >>>>
> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> >>>> CSE_SVE_VL_CONSTANTS)
> >>>>
> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs",
> >>>> USE_NEW_VECTOR_COSTS)
> >>>> -
> >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
> >>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>
> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma",
> >>>> AVOID_CROSS_LOOP_FMA)
> >>>> diff --git a/gcc/config/aarch64/aarch64.cc
> >>>> b/gcc/config/aarch64/aarch64.cc
> >>>> index 77a2a6bfa3a..71fba9cc63b 100644
> >>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info
> >>>> *vinfo, bool costing_for_scalar)
> >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>> }
> >>>>
> >>>> -/* Return true if the current CPU should use the new costs defined
> >>>> - in GCC 11. This should be removed for GCC 12 and above, with the
> >>>> - costs applying to all CPUs instead. */
> >>>> -static bool
> >>>> -aarch64_use_new_vector_costs_p ()
> >>>> -{
> >>>> - return (aarch64_tune_params.extra_tuning_flags
> >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>> -}
> >>>> -
> >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
> >>>> static const simd_vec_cost *
> >>>> aarch64_simd_vec_costs (tree vectype)
> >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> >>>> vect_cost_for_stmt kind,
> >>>>
> >>>> /* Do one-time initialization based on the vinfo. */
> >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>> + if (!m_analyzed_vinfo)
> >>>> {
> >>>> if (loop_vinfo)
> >>>> analyze_loop_vinfo (loop_vinfo);
> >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> >>>> vect_cost_for_stmt kind,
> >>>>
> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>> of just looking at KIND. */
> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>> + if (stmt_info)
> >>>> {
> >>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>> vec_to_scalar for each element. However, we can store the first
> >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> >>>> vect_cost_for_stmt kind,
> >>>> else
> >>>> m_num_last_promote_demote = 0;
> >>>>
> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>> + if (stmt_info)
> >>>> {
> >>>> /* Account for any extra "embedded" costs that apply additively
> >>>> to the base cost calculated above. */
> >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const
> >>>> vector_costs *uncast_scalar_costs)
> >>>>
> >>>> auto *scalar_costs
> >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>> - if (loop_vinfo
> >>>> - && m_vec_flags
> >>>> - && aarch64_use_new_vector_costs_p ())
> >>>> + if (loop_vinfo && m_vec_flags)
> >>>> {
> >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>> m_costs[vect_body]);
> >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> index 5ebaf66e986..74772f3e15f 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> index 2d704ecd110..a564528f43d 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings
> >>>> =
> >>>> 0, /* max_case_values. */
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>> &generic_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> index bdd309ab03d..f090d5cde50 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>> @@ -183,7 +183,6 @@ static const struct tune_params
> >>>> generic_armv8_a_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>> &generic_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> index 785e00946bc..7b5821183bc 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>> @@ -251,7 +251,6 @@ static const struct tune_params
> >>>> generic_armv9_a_tunings =
> >>>> 0, /* max_case_values. */
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> index 007f987154c..f7457df59e5 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>> @@ -156,7 +156,6 @@ static const struct tune_params
> >>>> neoverse512tvb_tunings =
> >>>> 0, /* max_case_values. */
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> index 32560d2f5f8..541b61c8179 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> index 2010bc4645b..eff668132a8 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> index c3751e32696..d11472b6e1e 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> index 80dbe5c806c..ee77ffdd3bc 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> index efe09e16d1e..6ef143ef7d5 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> index 66849f30889..96bdbf971f1 100644
> >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings
> >>>> =
> >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>> (AARCH64_EXTRA_TUNE_BASE
> >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>> &generic_armv9a_prefetch_tune,
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> index 762805ff54b..c334b7a6875 100644
> >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>> @@ -15,4 +15,4 @@
> >>>> so we vectorize the offset calculation. This means that the
> >>>> 64-bit version needs two copies. */
> >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
> >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> index f0ea58e38e2..94cc63049bc 100644
> >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>> @@ -15,4 +15,4 @@
> >>>> so we vectorize the offset calculation. This means that the
> >>>> 64-bit version needs two copies. */
> >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7],
> >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
> >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> >>>> index be1139a423c..ab57163c243 100644
> >>>> --- a/gcc/tree-vect-stmts.cc
> >>>> +++ b/gcc/tree-vect-stmts.cc
> >>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo,
> >>>> {
> >>>> if (costing_p)
> >>>> {
> >>>> - /* Only need vector extracting when there are more
> >>>> - than one stores. */
> >>>> - if (nstores > 1)
> >>>> - inside_cost
> >>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> >>>> - stmt_info, slp_node,
> >>>> - 0, vect_body);
> >>>> - /* Take a single lane vector type store as scalar
> >>>> - store to avoid ICE like 110776. */
> >>>> - if (VECTOR_TYPE_P (ltype)
> >>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>> - n_adjacent_stores++;
> >>>> - else
> >>>> + n_adjacent_stores++;
> >>>> + if (!VECTOR_TYPE_P (ltype))
> >>>
> >>> This should be combined with the single-lane vector case below
> >>>
> >>>> inside_cost
> >>>> += record_stmt_cost (cost_vec, 1, scalar_store,
> >>>> stmt_info, 0, vect_body);
> >>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo,
> >>>> if (costing_p)
> >>>> {
> >>>> if (n_adjacent_stores > 0)
> >>>> - vect_get_store_cost (vinfo, stmt_info, slp_node,
> >>>> n_adjacent_stores,
> >>>> - alignment_support_scheme, misalignment,
> >>>> - &inside_cost, cost_vec);
> >>>> + {
> >>>> + /* Take a single lane vector type store as scalar
> >>>> + store to avoid ICE like 110776. */
> >>>> + if (VECTOR_TYPE_P (ltype)
> >>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> >>>> + inside_cost
> >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> >>>> + scalar_store, stmt_info, 0, vect_body);
> >>>> + /* Only need vector extracting when there are more
> >>>> + than one stores. */
> >>>> + if (nstores > 1)
> >>>> + inside_cost
> >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> >>>> + vec_to_scalar, stmt_info, slp_node,
> >>>> + 0, vect_body);
> >>>> + vect_get_store_cost (vinfo, stmt_info, slp_node,
> >>>
> >>> This should only be done for multi-lane vectors
> >> Thanks for the quick reply. As I am making the changes, I am wondering: Do
> >> we even need n_adjacent_stores anymore? It appears to always have the same
> >> value as nstores. Can we remove it and use nstores instead or does it
> >> still serve another purpose?
> >
> > It was a heuristic needed for powerpc(?). Can you confirm we’re not
> > combining stores from VF unrolling for strided SLP stores?
> Hi Richard,
> The reasoning behind my suggestion to replace n_adjacent_stores by nstores in
> this code section is that with my patch they will logically always have the
> same value.
>
> Having said that, I looked into why n_adjacent_stores was introduced in the
> first place: The patch [1] that introduced n_adjacent_stores fixed a
> regression on aarch64 by costing vector loads/stores together. The variables
> n_adjacent_stores and n_adjacent_loads were added in two code sections each
> in vectorizable_store and vectorizable_load. The connection to PowerPC you
> recalled is also mentioned in the PR, but I believe it refers to the enum
> dr_alignment_support alignment_support_scheme that is used in
>
> vect_get_store_cost (vinfo, stmt_info, slp_node,
> n_adjacent_stores, alignment_support_scheme,
> misalignment, &inside_cost, cost_vec);
>
> to which I made no changes other than refactoring the if-statement around it.
>
> So, taking into account that n_adjacent_stores has been introduced in multiple
> locations, I would actually leave n_adjacent_stores in the code
> section that I made changes to in order to keep vectorizable_store and
> vectorizable_load consistent.
>
> Regarding your question about not combining stores from loop unrolling for
> strided SLP stores: I'm not entirely sure what you mean, but could it be
> covered by the tests gcc.target/aarch64/ldp_stp_* that were also mentioned in
> [1]?
I'm referring to a case with variable stride
for (.. i += s)
{
a[4*i] = ..;
a[4*i + 1] = ...;
a[4*i + 2] = ...;
a[4*i + 3] = ...;
}
where we might choose to store to the V4SI destination using two
V2SI stores (adjacent); iff the VF ends up equal to two we'd have two
sets of a[] stores, thus four V2SI stores but only two of them would be
"adjacent". Note I don't know whether "adjacent" really was supposed
to be adjacent or rather "related".
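
As a purely illustrative sketch (not something GCC emits, and the names are
made up), the situation is roughly:

  typedef int v2si __attribute__ ((vector_size (8)));

  /* With a V2SI vector type and VF == 2, one vector iteration covers the
     scalar iterations i and i + s, so it performs four V2SI stores, but
     only the two stores belonging to the same scalar iteration are
     adjacent in memory.  */
  void
  sketch (int *a, int s, int n, v2si lo, v2si hi)
  {
    for (int i = 0; i + s < n; i += 2 * s)
      {
        *(v2si *) (a + 4 * i) = lo;            /* adjacent pair for i      */
        *(v2si *) (a + 4 * i + 2) = hi;
        *(v2si *) (a + 4 * (i + s)) = lo;      /* adjacent pair for i + s  */
        *(v2si *) (a + 4 * (i + s) + 2) = hi;
      }
  }
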
Anyway, the costing interface for loads and stores is likely to change
substantially for GCC 16.
> I added the changes you proposed in the updated patch below, but kept
> n_adjacent_stores. The patch was re-validated on aarch64.
> Thanks,
> Jennifer
>
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3
>
>
> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> default. To that end, the function aarch64_use_new_vector_costs_p and its uses
> were removed. To prevent costing vec_to_scalar operations with 0, as
> described in
> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> we adjusted vectorizable_store such that the variable n_adjacent_stores
> also covers vec_to_scalar operations. This way vec_to_scalar operations
> are not costed individually, but as a group.
>
> Two tests were adjusted due to changes in codegen. In both cases, the
> old code performed loop unrolling once, but the new code does not:
> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> -moverride=tune=none):
> f_int64_t_32:
> cbz w3, .L92
> mov x4, 0
> uxtw x3, w3
> + cntd x5
> + whilelo p7.d, xzr, x3
> + mov z29.s, w5
> mov z31.s, w2
> - whilelo p6.d, xzr, x3
> - mov x2, x3
> - index z30.s, #0, #1
> - uqdecd x2
> - ptrue p5.b, all
> - whilelo p7.d, xzr, x2
> + index z30.d, #0, #1
> + ptrue p6.b, all
> .p2align 3,,7
> .L94:
> - ld1d z27.d, p7/z, [x0, #1, mul vl]
> - ld1d z28.d, p6/z, [x0]
> - movprfx z29, z31
> - mul z29.s, p5/m, z29.s, z30.s
> - incw x4
> - uunpklo z0.d, z29.s
> - uunpkhi z29.d, z29.s
> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
> - add z25.d, z28.d, z25.d
> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
> + movprfx z28, z31
> + mul z28.s, p6/m, z28.s, z30.s
> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
> add z26.d, z27.d, z26.d
> - st1d z26.d, p7, [x0, #1, mul vl]
> - whilelo p7.d, x4, x2
> - st1d z25.d, p6, [x0]
> - incw z30.s
> - incb x0, all, mul #2
> - whilelo p6.d, x4, x3
> + st1d z26.d, p7, [x0, x4, lsl 3]
> + add z30.s, z30.s, z29.s
> + incd x4
> + whilelo p7.d, x4, x3
> b.any .L94
> .L92:
> ret
>
> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> -moverride=tune=none):
> f_int64_t_32:
> cbz w3, .L84
> - addvl x5, x1, #1
> mov x4, 0
> uxtw x3, w3
> - mov z31.s, w2
> + cntd x5
> whilelo p7.d, xzr, x3
> - mov x2, x3
> - index z30.s, #0, #1
> - uqdecd x2
> - ptrue p5.b, all
> - whilelo p6.d, xzr, x2
> + mov z29.s, w5
> + mov z31.s, w2
> + index z30.d, #0, #1
> + ptrue p6.b, all
> .p2align 3,,7
> .L86:
> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
> - movprfx z29, z30
> - mul z29.s, p5/m, z29.s, z31.s
> - add z28.d, z28.d, #1
> - uunpklo z26.d, z29.s
> - st1d z28.d, p7, [x0, z26.d, lsl 3]
> - incw x4
> - uunpkhi z29.d, z29.s
> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
> + movprfx z28, z30
> + mul z28.s, p6/m, z28.s, z31.s
> add z27.d, z27.d, #1
> - whilelo p6.d, x4, x2
> - st1d z27.d, p7, [x0, z29.d, lsl 3]
> - incw z30.s
> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
> + incd x4
> + add z30.s, z30.s, z29.s
> whilelo p7.d, x4, x3
> b.any .L86
> .L84:
> ret
>
> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> regression.
> OK for mainline?
LGTM.
Richard.
> Signed-off-by: Jennifer Schmitz <[email protected]>
>
> gcc/
> * tree-vect-stmts.cc (vectorizable_store): Extend the use of
> n_adjacent_stores to also cover vec_to_scalar operations.
> * config/aarch64/aarch64-tuning-flags.def: Remove
> use_new_vector_costs as tuning option.
> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> Remove.
> (aarch64_vector_costs::add_stmt_cost): Remove use of
> aarch64_use_new_vector_costs_p.
> (aarch64_vector_costs::finish_cost): Remove use of
> aarch64_use_new_vector_costs_p.
> * config/aarch64/tuning_models/cortexx925.h: Remove
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>
> gcc/testsuite/
> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> ---
> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -
> gcc/config/aarch64/aarch64.cc | 20 ++--------
> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
> .../aarch64/tuning_models/neoversev3ae.h | 1 -
> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
> gcc/tree-vect-stmts.cc | 40 ++++++++++---------
> 16 files changed, 27 insertions(+), 50 deletions(-)
>
> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> b/gcc/config/aarch64/aarch64-tuning-flags.def
> index ffbff20e29c..1de633c739b 100644
> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
> CHEAP_SHIFT_EXTEND)
>
> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
>
> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
> -
> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
> MATCHED_VECTOR_THROUGHPUT)
>
> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 77a2a6bfa3a..71fba9cc63b 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo,
> bool costing_for_scalar)
> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> }
>
> -/* Return true if the current CPU should use the new costs defined
> - in GCC 11. This should be removed for GCC 12 and above, with the
> - costs applying to all CPUs instead. */
> -static bool
> -aarch64_use_new_vector_costs_p ()
> -{
> - return (aarch64_tune_params.extra_tuning_flags
> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> -}
> -
> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */
> static const simd_vec_cost *
> aarch64_simd_vec_costs (tree vectype)
> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
>
> /* Do one-time initialization based on the vinfo. */
> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> + if (!m_analyzed_vinfo)
> {
> if (loop_vinfo)
> analyze_loop_vinfo (loop_vinfo);
> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
>
> /* Try to get a more accurate cost by looking at STMT_INFO instead
> of just looking at KIND. */
> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> + if (stmt_info)
> {
> /* If we scalarize a strided store, the vectorizer costs one
> vec_to_scalar for each element. However, we can store the first
> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> else
> m_num_last_promote_demote = 0;
>
> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> + if (stmt_info)
> {
> /* Account for any extra "embedded" costs that apply additively
> to the base cost calculated above. */
> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs
> *uncast_scalar_costs)
>
> auto *scalar_costs
> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> - if (loop_vinfo
> - && m_vec_flags
> - && aarch64_use_new_vector_costs_p ())
> + if (loop_vinfo && m_vec_flags)
> {
> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> m_costs[vect_body]);
> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
> b/gcc/config/aarch64/tuning_models/cortexx925.h
> index 5ebaf66e986..74772f3e15f 100644
> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> index 2d704ecd110..a564528f43d 100644
> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> index bdd309ab03d..f090d5cde50 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> @@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> index 785e00946bc..7b5821183bc 100644
> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> @@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> index 007f987154c..f7457df59e5 100644
> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
> 0, /* max_case_values. */
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
> b/gcc/config/aarch64/tuning_models/neoversen2.h
> index 32560d2f5f8..541b61c8179 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
> b/gcc/config/aarch64/tuning_models/neoversen3.h
> index 2010bc4645b..eff668132a8 100644
> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
> b/gcc/config/aarch64/tuning_models/neoversev1.h
> index c3751e32696..d11472b6e1e 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
> b/gcc/config/aarch64/tuning_models/neoversev2.h
> index 80dbe5c806c..ee77ffdd3bc 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
> b/gcc/config/aarch64/tuning_models/neoversev3.h
> index efe09e16d1e..6ef143ef7d5 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> index 66849f30889..96bdbf971f1 100644
> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> (AARCH64_EXTRA_TUNE_BASE
> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> &generic_armv9a_prefetch_tune,
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> index 762805ff54b..c334b7a6875 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> @@ -15,4 +15,4 @@
> so we vectorize the offset calculation. This means that the
> 64-bit version needs two copies. */
> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z,
> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> index f0ea58e38e2..94cc63049bc 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> @@ -15,4 +15,4 @@
> so we vectorize the offset calculation. This means that the
> 64-bit version needs two copies. */
> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+,
> z[0-9]+.s, uxtw 2\]\n} 3 } } */
> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+,
> z[0-9]+.d, lsl 3\]\n} 15 } } */
> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+,
> z[0-9]+.d, lsl 3\]\n} 9 } } */
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index be1139a423c..a14248193ca 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo,
> {
> if (costing_p)
> {
> - /* Only need vector extracting when there are more
> - than one stores. */
> - if (nstores > 1)
> - inside_cost
> - += record_stmt_cost (cost_vec, 1, vec_to_scalar,
> - stmt_info, slp_node,
> - 0, vect_body);
> - /* Take a single lane vector type store as scalar
> - store to avoid ICE like 110776. */
> - if (VECTOR_TYPE_P (ltype)
> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> - n_adjacent_stores++;
> - else
> - inside_cost
> - += record_stmt_cost (cost_vec, 1, scalar_store,
> - stmt_info, 0, vect_body);
> + n_adjacent_stores++;
> continue;
> }
> tree newref, newoff;
> @@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo,
> if (costing_p)
> {
> if (n_adjacent_stores > 0)
> - vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
> - alignment_support_scheme, misalignment,
> - &inside_cost, cost_vec);
> + {
> + /* Take a single lane vector type store as scalar
> + store to avoid ICE like 110776. */
> + if (VECTOR_TYPE_P (ltype)
> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
> + vect_get_store_cost (vinfo, stmt_info, slp_node,
> + n_adjacent_stores, alignment_support_scheme,
> + misalignment, &inside_cost, cost_vec);
> + else
> + inside_cost
> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> + scalar_store, stmt_info, 0, vect_body);
> + /* Only need vector extracting when there are more
> + than one stores. */
> + if (nstores > 1)
> + inside_cost
> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> + vec_to_scalar, stmt_info, slp_node,
> + 0, vect_body);
> + }
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> "vect_model_store_cost: inside_cost = %d, "
> --
> 2.44.0
> >
> >> Thanks, Jennifer
> >>>
> >>>> + n_adjacent_stores, alignment_support_scheme,
> >>>> + misalignment, &inside_cost, cost_vec);
> >>>> + }
> >>>> if (dump_enabled_p ())
> >>>> dump_printf_loc (MSG_NOTE, vect_location,
> >>>> "vect_model_store_cost: inside_cost = %d, "
> >>>> --
> >>>> 2.34.1
> >>>>>
> >>>>>> + inside_cost
> >>>>>> + += record_stmt_cost (cost_vec, n_adjacent_stores,
> >>>>>> vec_to_scalar,
> >>>>>> + stmt_info, slp_node,
> >>>>>> + 0, vect_body);
> >>>>>> + }
> >>>>>> if (dump_enabled_p ())
> >>>>>> dump_printf_loc (MSG_NOTE, vect_location,
> >>>>>> "vect_model_store_cost: inside_cost = %d, "
> >>>>>> --
> >>>>>> 2.44.0
> >>>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>> Richard
> >>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Jennifer
> >>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Jennifer
> >>>>>>>>>>>
> >>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> tunable and
> >>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes
> >>>>>>>>>>> the
> >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
> >>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p
> >>>>>>>>>>> and its uses
> >>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0,
> >>>>>>>>>>> as
> >>>>>>>>>>> described in
> >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
> >>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
> >>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
> >>>>>>>>>>>
> >>>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases,
> >>>>>>>>>>> the
> >>>>>>>>>>> old code performed loop unrolling once, but the new code does not:
> >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled
> >>>>>>>>>>> with
> >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>>>>>>>>> -moverride=tune=none):
> >>>>>>>>>>> f_int64_t_32:
> >>>>>>>>>>> cbz w3, .L92
> >>>>>>>>>>> mov x4, 0
> >>>>>>>>>>> uxtw x3, w3
> >>>>>>>>>>> + cntd x5
> >>>>>>>>>>> + whilelo p7.d, xzr, x3
> >>>>>>>>>>> + mov z29.s, w5
> >>>>>>>>>>> mov z31.s, w2
> >>>>>>>>>>> - whilelo p6.d, xzr, x3
> >>>>>>>>>>> - mov x2, x3
> >>>>>>>>>>> - index z30.s, #0, #1
> >>>>>>>>>>> - uqdecd x2
> >>>>>>>>>>> - ptrue p5.b, all
> >>>>>>>>>>> - whilelo p7.d, xzr, x2
> >>>>>>>>>>> + index z30.d, #0, #1
> >>>>>>>>>>> + ptrue p6.b, all
> >>>>>>>>>>> .p2align 3,,7
> >>>>>>>>>>> .L94:
> >>>>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
> >>>>>>>>>>> - ld1d z28.d, p6/z, [x0]
> >>>>>>>>>>> - movprfx z29, z31
> >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s
> >>>>>>>>>>> - incw x4
> >>>>>>>>>>> - uunpklo z0.d, z29.s
> >>>>>>>>>>> - uunpkhi z29.d, z29.s
> >>>>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
> >>>>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
> >>>>>>>>>>> - add z25.d, z28.d, z25.d
> >>>>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
> >>>>>>>>>>> + movprfx z28, z31
> >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s
> >>>>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
> >>>>>>>>>>> add z26.d, z27.d, z26.d
> >>>>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl]
> >>>>>>>>>>> - whilelo p7.d, x4, x2
> >>>>>>>>>>> - st1d z25.d, p6, [x0]
> >>>>>>>>>>> - incw z30.s
> >>>>>>>>>>> - incb x0, all, mul #2
> >>>>>>>>>>> - whilelo p6.d, x4, x3
> >>>>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3]
> >>>>>>>>>>> + add z30.s, z30.s, z29.s
> >>>>>>>>>>> + incd x4
> >>>>>>>>>>> + whilelo p7.d, x4, x3
> >>>>>>>>>>> b.any .L94
> >>>>>>>>>>> .L92:
> >>>>>>>>>>> ret
> >>>>>>>>>>>
> >>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled
> >>>>>>>>>>> with
> >>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
> >>>>>>>>>>> -moverride=tune=none):
> >>>>>>>>>>> f_int64_t_32:
> >>>>>>>>>>> cbz w3, .L84
> >>>>>>>>>>> - addvl x5, x1, #1
> >>>>>>>>>>> mov x4, 0
> >>>>>>>>>>> uxtw x3, w3
> >>>>>>>>>>> - mov z31.s, w2
> >>>>>>>>>>> + cntd x5
> >>>>>>>>>>> whilelo p7.d, xzr, x3
> >>>>>>>>>>> - mov x2, x3
> >>>>>>>>>>> - index z30.s, #0, #1
> >>>>>>>>>>> - uqdecd x2
> >>>>>>>>>>> - ptrue p5.b, all
> >>>>>>>>>>> - whilelo p6.d, xzr, x2
> >>>>>>>>>>> + mov z29.s, w5
> >>>>>>>>>>> + mov z31.s, w2
> >>>>>>>>>>> + index z30.d, #0, #1
> >>>>>>>>>>> + ptrue p6.b, all
> >>>>>>>>>>> .p2align 3,,7
> >>>>>>>>>>> .L86:
> >>>>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
> >>>>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
> >>>>>>>>>>> - movprfx z29, z30
> >>>>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s
> >>>>>>>>>>> - add z28.d, z28.d, #1
> >>>>>>>>>>> - uunpklo z26.d, z29.s
> >>>>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
> >>>>>>>>>>> - incw x4
> >>>>>>>>>>> - uunpkhi z29.d, z29.s
> >>>>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
> >>>>>>>>>>> + movprfx z28, z30
> >>>>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s
> >>>>>>>>>>> add z27.d, z27.d, #1
> >>>>>>>>>>> - whilelo p6.d, x4, x2
> >>>>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
> >>>>>>>>>>> - incw z30.s
> >>>>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
> >>>>>>>>>>> + incd x4
> >>>>>>>>>>> + add z30.s, z30.s, z29.s
> >>>>>>>>>>> whilelo p7.d, x4, x3
> >>>>>>>>>>> b.any .L86
> >>>>>>>>>>> .L84:
> >>>>>>>>>>> ret
> >>>>>>>>>>>
> >>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
> >>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace
> >>>>>>>>>>> machine and saw
> >>>>>>>>>>> no non-noise impact on performance. We would appreciate help with
> >>>>>>>>>>> wider
> >>>>>>>>>>> benchmarking on other platforms, if necessary.
> >>>>>>>>>>> OK for mainline?
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Jennifer Schmitz <[email protected]>
> >>>>>>>>>>>
> >>>>>>>>>>> gcc/
> >>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
> >>>>>>>>>>> use_new_vector_costs as tuning option.
> >>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
> >>>>>>>>>>> Remove.
> >>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
> >>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
> >>>>>>>>>>> vect_is_store_elt_extraction with count > 1.
> >>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
> >>>>>>>>>>> aarch64_use_new_vector_costs_p.
> >>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
> >>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
> >>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
> >>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
> >>>>>>>>>>>
> >>>>>>>>>>> gcc/testsuite/
> >>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected
> >>>>>>>>>>> outcome.
> >>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
> >>>>>>>>>>> ---
> >>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 --
> >>>>>>>>>>> gcc/config/aarch64/aarch64.cc | 22
> >>>>>>>>>>> +++++--------------
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
> >>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
> >>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
> >>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
> >>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
> >>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
> >>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
> >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
> >>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
> >>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> index 5939602576b..ed345b13ed3 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
> >>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION
> >>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
> >>>>>>>>>>>
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
> >>>>>>>>>>> CSE_SVE_VL_CONSTANTS)
> >>>>>>>>>>>
> >>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs",
> >>>>>>>>>>> USE_NEW_VECTOR_COSTS)
> >>>>>>>>>>> -
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
> >>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
> >>>>>>>>>>>
> >>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma",
> >>>>>>>>>>> AVOID_CROSS_LOOP_FMA)
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> index 43238aefef2..03806671c97 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
> >>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info
> >>>>>>>>>>> *vinfo, bool costing_for_scalar)
> >>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> -/* Return true if the current CPU should use the new costs
> >>>>>>>>>>> defined
> >>>>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with
> >>>>>>>>>>> the
> >>>>>>>>>>> - costs applying to all CPUs instead. */
> >>>>>>>>>>> -static bool
> >>>>>>>>>>> -aarch64_use_new_vector_costs_p ()
> >>>>>>>>>>> -{
> >>>>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags
> >>>>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
> >>>>>>>>>>> -}
> >>>>>>>>>>> -
> >>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.
> >>>>>>>>>>> */
> >>>>>>>>>>> static const simd_vec_cost *
> >>>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
> >>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int
> >>>>>>>>>>> count, vect_cost_for_stmt kind,
> >>>>>>>>>>>
> >>>>>>>>>>> /* Do one-time initialization based on the vinfo. */
> >>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >>>>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> + if (!m_analyzed_vinfo)
> >>>>>>>>>>> {
> >>>>>>>>>>> if (loop_vinfo)
> >>>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
> >>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost
> >>>>>>>>>>> (int count, vect_cost_for_stmt kind,
> >>>>>>>>>>>
> >>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
> >>>>>>>>>>> of just looking at KIND. */
> >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> + if (stmt_info)
> >>>>>>>>>>> {
> >>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
> >>>>>>>>>>> vec_to_scalar for each element. However, we can store the first
> >>>>>>>>>>> element using an FP store without a separate extract step. */
> >>>>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info))
> >>>>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) &&
> >>>>>>>>>>> count > 1)
> >>>>>>>>>>> count -= 1;
> >>>>>>>>>>>
> >>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
> >>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int
> >>>>>>>>>>> count, vect_cost_for_stmt kind,
> >>>>>>>>>>> else
> >>>>>>>>>>> m_num_last_promote_demote = 0;
> >>>>>>>>>>>
> >>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> + if (stmt_info)
> >>>>>>>>>>> {
> >>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively
> >>>>>>>>>>> to the base cost calculated above. */
> >>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const
> >>>>>>>>>>> vector_costs *uncast_scalar_costs)
> >>>>>>>>>>>
> >>>>>>>>>>> auto *scalar_costs
> >>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
> >>>>>>>>>>> - if (loop_vinfo
> >>>>>>>>>>> - && m_vec_flags
> >>>>>>>>>>> - && aarch64_use_new_vector_costs_p ())
> >>>>>>>>>>> + if (loop_vinfo && m_vec_flags)
> >>>>>>>>>>> {
> >>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >>>>>>>>>>> m_costs[vect_body]);
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> index eb9b89984b0..dafea96e924 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>>>>>>> cortexx925_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> index 6a098497759..ac001927959 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
> >>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params
> >>>>>>>>>>> fujitsu_monaka_tunings =
> >>>>>>>>>>> 0, /* max_case_values. */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
> >>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params
> >>>>>>>>>>> generic_armv8_a_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> index 48353a59939..562ef89c67b 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
> >>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params
> >>>>>>>>>>> generic_armv9_a_tunings =
> >>>>>>>>>>> 0, /* max_case_values. */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_armv9a_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
> >>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params
> >>>>>>>>>>> neoverse512tvb_tunings =
> >>>>>>>>>>> 0, /* max_case_values. */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> index 18199ac206c..56be77423cb 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversen2_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversen3_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
> >>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversev1_tunings =
> >>>>>>>>>>> 0, /* max_case_values. */
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> index 1369de73991..96f55940649 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
> >>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversev2_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags.
> >>>>>>>>>>> */
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> index d8c82255378..f62ae67d355 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversev3_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> index 7f050501ede..0233baf5e34 100644
> >>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
> >>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
> >>>>>>>>>>> neoversev3ae_tunings =
> >>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
> >>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> >>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> >>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
> >>>>>>>>>>> &generic_prefetch_tune,
> >>>>>>>>>>> diff --git
> >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> index 762805ff54b..c334b7a6875 100644
> >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
> >>>>>>>>>>> @@ -15,4 +15,4 @@
> >>>>>>>>>>> so we vectorize the offset calculation. This means that the
> >>>>>>>>>>> 64-bit version needs two copies. */
> >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s,
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,
> >>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>>>>>>> diff --git
> >>>>>>>>>>> a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
> >>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
> >>>>>>>>>>> @@ -15,4 +15,4 @@
> >>>>>>>>>>> so we vectorize the offset calculation. This means that the
> >>>>>>>>>>> 64-bit version needs two copies. */
> >>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7],
> >>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
> >>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d,
> >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
> >>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d,
> >>>>>>>>>>> p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Richard Biener <[email protected]>
> >>>>>>>>>> SUSE Software Solutions Germany GmbH,
> >>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
> >>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> >>>>>>>>>> Nuernberg)
>
>