> On 17 Dec 2024, at 18:57, Richard Biener <[email protected]> wrote: > > External email: Use caution opening links or attachments > > >> Am 16.12.2024 um 09:10 schrieb Jennifer Schmitz <[email protected]>: >> >> >> >>> On 14 Dec 2024, at 09:32, Richard Biener <[email protected]> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>>>> Am 13.12.2024 um 18:00 schrieb Jennifer Schmitz <[email protected]>: >>>> >>>> >>>> >>>>> On 13 Dec 2024, at 13:40, Richard Biener <[email protected]> >>>>> wrote: >>>>> >>>>> External email: Use caution opening links or attachments >>>>> >>>>> >>>>>> On Thu, Dec 12, 2024 at 5:27 PM Jennifer Schmitz <[email protected]> >>>>>> wrote: >>>>>> >>>>>> >>>>>> >>>>>>> On 6 Dec 2024, at 08:41, Jennifer Schmitz <[email protected]> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>>> On 5 Dec 2024, at 20:07, Richard Sandiford <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>> External email: Use caution opening links or attachments >>>>>>>> >>>>>>>> >>>>>>>> Jennifer Schmitz <[email protected]> writes: >>>>>>>>>> On 5 Dec 2024, at 11:44, Richard Biener <[email protected]> wrote: >>>>>>>>>> >>>>>>>>>> External email: Use caution opening links or attachments >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, 5 Dec 2024, Jennifer Schmitz wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> On 17 Oct 2024, at 19:23, Richard Sandiford >>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>> >>>>>>>>>>>> External email: Use caution opening links or attachments >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Jennifer Schmitz <[email protected]> writes: >>>>>>>>>>>>> [...] >>>>>>>>>>>>> Looking at the diff of the vect dumps (below is a section of the >>>>>>>>>>>>> diff for strided_store_2.c), it seemed odd that vec_to_scalar >>>>>>>>>>>>> operations cost 0 now, instead of the previous cost of 2: >>>>>>>>>>>>> >>>>>>>>>>>>> +strided_store_1.c:38:151: note: === vectorizable_operation === >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_simple_cost: >>>>>>>>>>>>> inside_cost = 1, prologue_cost = 0 . >>>>>>>>>>>>> +strided_store_1.c:38:151: note: ==> examining statement: *_6 = >>>>>>>>>>>>> _7; >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 >>>>>>>>>>>>> + 1.0e+0, type of def: internal >>>>>>>>>>>>> +strided_store_1.c:38:151: note: Vectorizing an unaligned >>>>>>>>>>>>> access. >>>>>>>>>>>>> +Applying pattern match.pd:236, generic-match-9.cc:4128 >>>>>>>>>>>>> +Applying pattern match.pd:5285, generic-match-10.cc:4234 >>>>>>>>>>>>> +strided_store_1.c:38:151: note: vect_model_store_cost: >>>>>>>>>>>>> inside_cost = 12, prologue_cost = 0 . 
>>>>>>>>>>>>> *_2 1 times unaligned_load (misalign -1) costs 1 in body >>>>>>>>>>>>> -_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue >>>>>>>>>>>>> _3 + 1.0e+0 1 times vector_stmt costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +<unknown> 1 times vector_load costs 1 in prologue >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> -_7 1 times vec_to_scalar costs 2 in body >>>>>>>>>>>>> +_7 1 times vec_to_scalar costs 0 in body >>>>>>>>>>>>> _7 1 times scalar_store costs 1 in body >>>>>>>>>>>>> >>>>>>>>>>>>> Although the aarch64_use_new_vector_costs_p flag was used in >>>>>>>>>>>>> multiple places in aarch64.cc, the location that causes this >>>>>>>>>>>>> behavior is this one: >>>>>>>>>>>>> unsigned >>>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost (int count, >>>>>>>>>>>>> vect_cost_for_stmt kind, >>>>>>>>>>>>> stmt_vec_info stmt_info, slp_tree, >>>>>>>>>>>>> tree vectype, int misalign, >>>>>>>>>>>>> vect_cost_model_location where) >>>>>>>>>>>>> { >>>>>>>>>>>>> [...] >>>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>>>>>>>>> of just looking at KIND. */ >>>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>>>>>>>>> + if (stmt_info) >>>>>>>>>>>>> { >>>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>>>>>>>>> vec_to_scalar for each element. However, we can store the first >>>>>>>>>>>>> element using an FP store without a separate extract step. */ >>>>>>>>>>>>> if (vect_is_store_elt_extraction (kind, stmt_info)) >>>>>>>>>>>>> count -= 1; >>>>>>>>>>>>> >>>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind, >>>>>>>>>>>>> stmt_info, >>>>>>>>>>>>> stmt_cost); >>>>>>>>>>>>> >>>>>>>>>>>>> if (vectype && m_vec_flags) >>>>>>>>>>>>> stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, >>>>>>>>>>>>> stmt_info, vectype, >>>>>>>>>>>>> where, stmt_cost); >>>>>>>>>>>>> } >>>>>>>>>>>>> [...] >>>>>>>>>>>>> return record_stmt_cost (stmt_info, where, (count * >>>>>>>>>>>>> stmt_cost).ceil ()); >>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> Previously, for mtune=generic, this function returned a cost of 2 >>>>>>>>>>>>> for a vec_to_scalar operation in the vect body. Now "if >>>>>>>>>>>>> (stmt_info)" is entered and "if (vect_is_store_elt_extraction >>>>>>>>>>>>> (kind, stmt_info))" evaluates to true, which sets the count to 0 >>>>>>>>>>>>> and leads to a return value of 0. >>>>>>>>>>>> >>>>>>>>>>>> At the time the code was written, a scalarised store would be >>>>>>>>>>>> costed >>>>>>>>>>>> using one vec_to_scalar call into the backend, with the count >>>>>>>>>>>> parameter >>>>>>>>>>>> set to the number of elements being stored. The "count -= 1" was >>>>>>>>>>>> supposed to lop off the leading element extraction, since we can >>>>>>>>>>>> store >>>>>>>>>>>> lane 0 as a normal FP store. 
>>>>>>>>>>>> >>>>>>>>>>>> The target-independent costing was later reworked so that it costs >>>>>>>>>>>> each operation individually: >>>>>>>>>>>> >>>>>>>>>>>> for (i = 0; i < nstores; i++) >>>>>>>>>>>> { >>>>>>>>>>>> if (costing_p) >>>>>>>>>>>> { >>>>>>>>>>>> /* Only need vector extracting when there are more >>>>>>>>>>>> than one stores. */ >>>>>>>>>>>> if (nstores > 1) >>>>>>>>>>>> inside_cost >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>>>> /* Take a single lane vector type store as scalar >>>>>>>>>>>> store to avoid ICE like 110776. */ >>>>>>>>>>>> if (VECTOR_TYPE_P (ltype) >>>>>>>>>>>> && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>>>>>>>>>> n_adjacent_stores++; >>>>>>>>>>>> else >>>>>>>>>>>> inside_cost >>>>>>>>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>>>>>>>>>> stmt_info, 0, vect_body); >>>>>>>>>>>> continue; >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> Unfortunately, there's no easy way of telling whether a particular >>>>>>>>>>>> call >>>>>>>>>>>> is part of a group, and if so, which member of the group it is. >>>>>>>>>>>> >>>>>>>>>>>> I suppose we could give up on the attempt to be (somewhat) accurate >>>>>>>>>>>> and just disable the optimisation. Or we could restrict it to >>>>>>>>>>>> count > 1, >>>>>>>>>>>> since it might still be useful for gathers and scatters. >>>>>>>>>>> I tried restricting the calls to vect_is_store_elt_extraction to >>>>>>>>>>> count > 1 and it seems to resolve the issue of costing >>>>>>>>>>> vec_to_scalar operations with 0 (see patch below). >>>>>>>>>>> What are your thoughts on this? >>>>>>>>>> >>>>>>>>>> Why didn't you pursue instead moving the vec_to_scalar cost together >>>>>>>>>> with the n_adjacent_store handling? >>>>>>>>> When I continued working on this patch, we had already reached stage >>>>>>>>> 3 and I was hesitant to introduce changes to the middle-end that were >>>>>>>>> not previously covered by this patch. So I tried if the issue could >>>>>>>>> not be resolved by making a small change in the backend. >>>>>>>>> If you still advise to use the n_adjacent_store instead, I’m happy to >>>>>>>>> look into it again. >>>>>>>> >>>>>>>> If Richard's ok with adjusting vectorizable_store for GCC 15 (which it >>>>>>>> sounds like he is), then I agree that would be better. Otherwise we'd >>>>>>>> be creating technical debt to clean up for GCC 16. And it is a >>>>>>>> regression >>>>>>>> of sorts, so is stage 3 material from that POV. >>>>>>>> >>>>>>>> (Incidentally, AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS was itself a >>>>>>>> "let's clean this up next stage 1" thing, since we needed to add tuning >>>>>>>> for a new CPU late during the cycle. But of course, there were other >>>>>>>> priorities when stage 1 actually came around, so it never actually >>>>>>>> happened. Thanks again for being the one to sort this out.) >>>>>>> Thanks for your feedback. Then I will try to make it work in >>>>>>> vectorizable_store. >>>>>>> Best, >>>>>>> Jennifer >>>>>> Below is the updated patch with a suggestion for the changes in >>>>>> vectorizable_store. It resolves the issue with the vec_to_scalar >>>>>> operations that were individually costed with 0. >>>>>> We already tested it on aarch64, no regression, but we are still doing >>>>>> performance testing. >>>>>> Can you give some feedback in the meantime on the patch itself? 
>>>>>> Thanks, >>>>>> Jennifer >>>>>> >>>>>> >>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable >>>>>> and >>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>>>> default. To that end, the function aarch64_use_new_vector_costs_p and >>>>>> its uses >>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>>>> described in >>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>>>> we adjusted vectorizable_store such that the variable n_adjacent_stores >>>>>> also covers vec_to_scalar operations. This way vec_to_scalar operations >>>>>> are not costed individually, but as a group. >>>>>> >>>>>> Two tests were adjusted due to changes in codegen. In both cases, the >>>>>> old code performed loop unrolling once, but the new code does not: >>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>> -moverride=tune=none): >>>>>> f_int64_t_32: >>>>>> cbz w3, .L92 >>>>>> mov x4, 0 >>>>>> uxtw x3, w3 >>>>>> + cntd x5 >>>>>> + whilelo p7.d, xzr, x3 >>>>>> + mov z29.s, w5 >>>>>> mov z31.s, w2 >>>>>> - whilelo p6.d, xzr, x3 >>>>>> - mov x2, x3 >>>>>> - index z30.s, #0, #1 >>>>>> - uqdecd x2 >>>>>> - ptrue p5.b, all >>>>>> - whilelo p7.d, xzr, x2 >>>>>> + index z30.d, #0, #1 >>>>>> + ptrue p6.b, all >>>>>> .p2align 3,,7 >>>>>> .L94: >>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>>>> - ld1d z28.d, p6/z, [x0] >>>>>> - movprfx z29, z31 >>>>>> - mul z29.s, p5/m, z29.s, z30.s >>>>>> - incw x4 >>>>>> - uunpklo z0.d, z29.s >>>>>> - uunpkhi z29.d, z29.s >>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>>>> - add z25.d, z28.d, z25.d >>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>>>> + movprfx z28, z31 >>>>>> + mul z28.s, p6/m, z28.s, z30.s >>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>>>> add z26.d, z27.d, z26.d >>>>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>>>> - whilelo p7.d, x4, x2 >>>>>> - st1d z25.d, p6, [x0] >>>>>> - incw z30.s >>>>>> - incb x0, all, mul #2 >>>>>> - whilelo p6.d, x4, x3 >>>>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>>>> + add z30.s, z30.s, z29.s >>>>>> + incd x4 >>>>>> + whilelo p7.d, x4, x3 >>>>>> b.any .L94 >>>>>> .L92: >>>>>> ret >>>>>> >>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>>>> -moverride=tune=none): >>>>>> f_int64_t_32: >>>>>> cbz w3, .L84 >>>>>> - addvl x5, x1, #1 >>>>>> mov x4, 0 >>>>>> uxtw x3, w3 >>>>>> - mov z31.s, w2 >>>>>> + cntd x5 >>>>>> whilelo p7.d, xzr, x3 >>>>>> - mov x2, x3 >>>>>> - index z30.s, #0, #1 >>>>>> - uqdecd x2 >>>>>> - ptrue p5.b, all >>>>>> - whilelo p6.d, xzr, x2 >>>>>> + mov z29.s, w5 >>>>>> + mov z31.s, w2 >>>>>> + index z30.d, #0, #1 >>>>>> + ptrue p6.b, all >>>>>> .p2align 3,,7 >>>>>> .L86: >>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>>>> - movprfx z29, z30 >>>>>> - mul z29.s, p5/m, z29.s, z31.s >>>>>> - add z28.d, z28.d, #1 >>>>>> - uunpklo z26.d, z29.s >>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>>>> - incw x4 >>>>>> - uunpkhi z29.d, z29.s >>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>>>> + movprfx z28, z30 >>>>>> + mul z28.s, p6/m, z28.s, z31.s >>>>>> add z27.d, z27.d, #1 >>>>>> - whilelo p6.d, x4, x2 >>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>>>> - incw z30.s >>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] 
>>>>>> + incd x4 >>>>>> + add z30.s, z30.s, z29.s >>>>>> whilelo p7.d, x4, x3 >>>>>> b.any .L86 >>>>>> .L84: >>>>>> ret >>>>>> >>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>>>> regression. >>>>>> OK for mainline? >>>>>> >>>>>> Signed-off-by: Jennifer Schmitz <[email protected]> >>>>>> >>>>>> gcc/ >>>>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >>>>>> n_adjacent_stores to also cover vec_to_scalar operations. >>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>>>> use_new_vector_costs as tuning option. >>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>>>> Remove. >>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>>>> aarch64_use_new_vector_costs_p. >>>>>> (aarch64_vector_costs::finish_cost): Remove use of >>>>>> aarch64_use_new_vector_costs_p. >>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. >>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>>>> >>>>>> gcc/testsuite/ >>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>>>> --- >>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 -- >>>>>> gcc/config/aarch64/aarch64.cc | 20 +++---------- >>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>>>> gcc/tree-vect-stmts.cc | 29 ++++++++++--------- >>>>>> 16 files changed, 22 insertions(+), 44 deletions(-) >>>>>> >>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> index ffbff20e29c..1de633c739b 100644 >>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>>>> CHEAP_SHIFT_EXTEND) >>>>>> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", >>>>>> CSE_SVE_VL_CONSTANTS) >>>>>> >>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", >>>>>> USE_NEW_VECTOR_COSTS) >>>>>> - >>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>>>> MATCHED_VECTOR_THROUGHPUT) >>>>>> >>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>>>> AVOID_CROSS_LOOP_FMA) >>>>>> diff --git 
a/gcc/config/aarch64/aarch64.cc >>>>>> b/gcc/config/aarch64/aarch64.cc >>>>>> index 77a2a6bfa3a..71fba9cc63b 100644 >>>>>> --- a/gcc/config/aarch64/aarch64.cc >>>>>> +++ b/gcc/config/aarch64/aarch64.cc >>>>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info >>>>>> *vinfo, bool costing_for_scalar) >>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>>>> } >>>>>> >>>>>> -/* Return true if the current CPU should use the new costs defined >>>>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>>>> - costs applying to all CPUs instead. */ >>>>>> -static bool >>>>>> -aarch64_use_new_vector_costs_p () >>>>>> -{ >>>>>> - return (aarch64_tune_params.extra_tuning_flags >>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>>>> -} >>>>>> - >>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>>>> static const simd_vec_cost * >>>>>> aarch64_simd_vec_costs (tree vectype) >>>>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> >>>>>> /* Do one-time initialization based on the vinfo. */ >>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>>>> + if (!m_analyzed_vinfo) >>>>>> { >>>>>> if (loop_vinfo) >>>>>> analyze_loop_vinfo (loop_vinfo); >>>>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> >>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>>>> of just looking at KIND. */ >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>> + if (stmt_info) >>>>>> { >>>>>> /* If we scalarize a strided store, the vectorizer costs one >>>>>> vec_to_scalar for each element. However, we can store the first >>>>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>>>> vect_cost_for_stmt kind, >>>>>> else >>>>>> m_num_last_promote_demote = 0; >>>>>> >>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>>>> + if (stmt_info) >>>>>> { >>>>>> /* Account for any extra "embedded" costs that apply additively >>>>>> to the base cost calculated above. */ >>>>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >>>>>> vector_costs *uncast_scalar_costs) >>>>>> >>>>>> auto *scalar_costs >>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>>>> - if (loop_vinfo >>>>>> - && m_vec_flags >>>>>> - && aarch64_use_new_vector_costs_p ()) >>>>>> + if (loop_vinfo && m_vec_flags) >>>>>> { >>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>>>> m_costs[vect_body]); >>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> index b2ff716157a..0a8eff69307 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params cortexx925_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> index 2d704ecd110..a564528f43d 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings >>>>>> = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> index bdd309ab03d..f090d5cde50 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>>>> generic_armv8_a_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> index a05a9ab92a2..4c33c147444 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>>>> @@ -249,7 +249,6 @@ static const struct tune_params >>>>>> generic_armv9_a_tunings = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_armv9a_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> index c407b89a22f..fe4f7c10f73 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>>>> @@ -156,7 +156,6 @@ static const struct tune_params >>>>>> neoverse512tvb_tunings = >>>>>> 0, /* max_case_values. */ >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> index fd5f8f37370..0c74068da2c 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> index 8b156c2fe4d..9d4e1be171a 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> index 23c121d8652..85a78bb2bef 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> index 40af5f47f4f..1dd452beb8d 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>>>> @@ -232,7 +232,6 @@ static const struct tune_params neoversev2_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> index d65d74bfecf..d0ba5b1aef6 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. 
*/ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> index 7b7fa0b4b08..a1572048503 100644 >>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings >>>>>> = >>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>>>> (AARCH64_EXTRA_TUNE_BASE >>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>>>> &generic_prefetch_tune, >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> index 762805ff54b..c334b7a6875 100644 >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>>>> @@ -15,4 +15,4 @@ >>>>>> so we vectorize the offset calculation. This means that the >>>>>> 64-bit version needs two copies. */ >>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> index f0ea58e38e2..94cc63049bc 100644 >>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>>>> @@ -15,4 +15,4 @@ >>>>>> so we vectorize the offset calculation. This means that the >>>>>> 64-bit version needs two copies. */ >>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >>>>>> index be1139a423c..6d7d28c4702 100644 >>>>>> --- a/gcc/tree-vect-stmts.cc >>>>>> +++ b/gcc/tree-vect-stmts.cc >>>>>> @@ -8834,19 +8834,16 @@ vectorizable_store (vec_info *vinfo, >>>>>> { >>>>>> if (costing_p) >>>>>> { >>>>>> - /* Only need vector extracting when there are more >>>>>> - than one stores. */ >>>>>> - if (nstores > 1) >>>>>> - inside_cost >>>>>> - += record_stmt_cost (cost_vec, 1, >>>>>> vec_to_scalar, >>>>>> - stmt_info, slp_node, >>>>>> - 0, vect_body); >>>>>> /* Take a single lane vector type store as scalar >>>>>> store to avoid ICE like 110776. */ >>>>>> - if (VECTOR_TYPE_P (ltype) >>>>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>>>> + bool single_lane_vec_p = >>>>>> + VECTOR_TYPE_P (ltype) >>>>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U); >>>>>> + /* Only need vector extracting when there are more >>>>>> + than one stores. 
*/ >>>>>> + if (nstores > 1 || single_lane_vec_p) >>>>>> n_adjacent_stores++; >>>>>> - else >>>>>> + if (!single_lane_vec_p) >>>>> >>>>> I think it's somewhat non-obvious that nstores > 1 and single_lane_vec_p >>>>> correlate. In fact I think that we always record a store, just for >>>>> single-element >>>>> vectors we record scalar stores. I suggest to here always to just >>>>> n_adjacent_stores++ >>>>> and below ... >>>>> >>>>>> inside_cost >>>>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>>>> stmt_info, 0, vect_body); >>>>>> @@ -8905,9 +8902,15 @@ vectorizable_store (vec_info *vinfo, >>>>>> if (costing_p) >>>>>> { >>>>>> if (n_adjacent_stores > 0) >>>>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, >>>>>> n_adjacent_stores, >>>>>> - alignment_support_scheme, misalignment, >>>>>> - &inside_cost, cost_vec); >>>>>> + { >>>>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, >>>>>> n_adjacent_stores, >>>>>> + alignment_support_scheme, >>>>>> misalignment, >>>>>> + &inside_cost, cost_vec); >>>>> >>>>> ... record n_adjacent_stores scalar_store when ltype is single-lane and >>>>> record >>>>> n_adjacent_stores vect_to_scalar if nstores > 1 (and else none). >>>>> >>>>> Richard. >>>> Thanks for the feedback, I’m glad it’s going in the right direction. Below >>>> is the updated patch, re-validated on aarch64. >>>> Thanks, Jennifer >>>> >>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and >>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the >>>> default. To that end, the function aarch64_use_new_vector_costs_p and its >>>> uses >>>> were removed. To prevent costing vec_to_scalar operations with 0, as >>>> described in >>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html, >>>> we adjusted vectorizable_store such that the variable n_adjacent_stores >>>> also covers vec_to_scalar operations. This way vec_to_scalar operations >>>> are not costed individually, but as a group. >>>> >>>> Two tests were adjusted due to changes in codegen. 
In both cases, the >>>> old code performed loop unrolling once, but the new code does not: >>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L92 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> + cntd x5 >>>> + whilelo p7.d, xzr, x3 >>>> + mov z29.s, w5 >>>> mov z31.s, w2 >>>> - whilelo p6.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p7.d, xzr, x2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L94: >>>> - ld1d z27.d, p7/z, [x0, #1, mul vl] >>>> - ld1d z28.d, p6/z, [x0] >>>> - movprfx z29, z31 >>>> - mul z29.s, p5/m, z29.s, z30.s >>>> - incw x4 >>>> - uunpklo z0.d, z29.s >>>> - uunpkhi z29.d, z29.s >>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >>>> - add z25.d, z28.d, z25.d >>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >>>> + movprfx z28, z31 >>>> + mul z28.s, p6/m, z28.s, z30.s >>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >>>> add z26.d, z27.d, z26.d >>>> - st1d z26.d, p7, [x0, #1, mul vl] >>>> - whilelo p7.d, x4, x2 >>>> - st1d z25.d, p6, [x0] >>>> - incw z30.s >>>> - incb x0, all, mul #2 >>>> - whilelo p6.d, x4, x3 >>>> + st1d z26.d, p7, [x0, x4, lsl 3] >>>> + add z30.s, z30.s, z29.s >>>> + incd x4 >>>> + whilelo p7.d, x4, x3 >>>> b.any .L94 >>>> .L92: >>>> ret >>>> >>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with >>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic >>>> -moverride=tune=none): >>>> f_int64_t_32: >>>> cbz w3, .L84 >>>> - addvl x5, x1, #1 >>>> mov x4, 0 >>>> uxtw x3, w3 >>>> - mov z31.s, w2 >>>> + cntd x5 >>>> whilelo p7.d, xzr, x3 >>>> - mov x2, x3 >>>> - index z30.s, #0, #1 >>>> - uqdecd x2 >>>> - ptrue p5.b, all >>>> - whilelo p6.d, xzr, x2 >>>> + mov z29.s, w5 >>>> + mov z31.s, w2 >>>> + index z30.d, #0, #1 >>>> + ptrue p6.b, all >>>> .p2align 3,,7 >>>> .L86: >>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3] >>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3] >>>> - movprfx z29, z30 >>>> - mul z29.s, p5/m, z29.s, z31.s >>>> - add z28.d, z28.d, #1 >>>> - uunpklo z26.d, z29.s >>>> - st1d z28.d, p7, [x0, z26.d, lsl 3] >>>> - incw x4 >>>> - uunpkhi z29.d, z29.s >>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >>>> + movprfx z28, z30 >>>> + mul z28.s, p6/m, z28.s, z31.s >>>> add z27.d, z27.d, #1 >>>> - whilelo p6.d, x4, x2 >>>> - st1d z27.d, p7, [x0, z29.d, lsl 3] >>>> - incw z30.s >>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >>>> + incd x4 >>>> + add z30.s, z30.s, z29.s >>>> whilelo p7.d, x4, x3 >>>> b.any .L86 >>>> .L84: >>>> ret >>>> >>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Jennifer Schmitz <[email protected]> >>>> >>>> gcc/ >>>> * tree-vect-stmts.cc (vectorizable_store): Extend the use of >>>> n_adjacent_stores to also cover vec_to_scalar operations. >>>> * config/aarch64/aarch64-tuning-flags.def: Remove >>>> use_new_vector_costs as tuning option. >>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p): >>>> Remove. >>>> (aarch64_vector_costs::add_stmt_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> (aarch64_vector_costs::finish_cost): Remove use of >>>> aarch64_use_new_vector_costs_p. >>>> * config/aarch64/tuning_models/cortexx925.h: Remove >>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS. >>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise. 
>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise. >>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise. >>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversen3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev1.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev2.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3.h: Likewise. >>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise. >>>> >>>> gcc/testsuite/ >>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome. >>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise. >>>> --- >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 - >>>> gcc/config/aarch64/aarch64.cc | 20 ++-------- >>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 - >>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 - >>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 - >>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 - >>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 - >>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 - >>>> .../aarch64/tuning_models/neoversev3ae.h | 1 - >>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +- >>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +- >>>> gcc/tree-vect-stmts.cc | 37 +++++++++++-------- >>>> 16 files changed, 27 insertions(+), 47 deletions(-) >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> index ffbff20e29c..1de633c739b 100644 >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", >>>> CHEAP_SHIFT_EXTEND) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS) >>>> >>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS) >>>> - >>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput", >>>> MATCHED_VECTOR_THROUGHPUT) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA) >>>> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc >>>> index 77a2a6bfa3a..71fba9cc63b 100644 >>>> --- a/gcc/config/aarch64/aarch64.cc >>>> +++ b/gcc/config/aarch64/aarch64.cc >>>> @@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, >>>> bool costing_for_scalar) >>>> return new aarch64_vector_costs (vinfo, costing_for_scalar); >>>> } >>>> >>>> -/* Return true if the current CPU should use the new costs defined >>>> - in GCC 11. This should be removed for GCC 12 and above, with the >>>> - costs applying to all CPUs instead. */ >>>> -static bool >>>> -aarch64_use_new_vector_costs_p () >>>> -{ >>>> - return (aarch64_tune_params.extra_tuning_flags >>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS); >>>> -} >>>> - >>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE. */ >>>> static const simd_vec_cost * >>>> aarch64_simd_vec_costs (tree vectype) >>>> @@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Do one-time initialization based on the vinfo. 
*/ >>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); >>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ()) >>>> + if (!m_analyzed_vinfo) >>>> { >>>> if (loop_vinfo) >>>> analyze_loop_vinfo (loop_vinfo); >>>> @@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> >>>> /* Try to get a more accurate cost by looking at STMT_INFO instead >>>> of just looking at KIND. */ >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* If we scalarize a strided store, the vectorizer costs one >>>> vec_to_scalar for each element. However, we can store the first >>>> @@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count, >>>> vect_cost_for_stmt kind, >>>> else >>>> m_num_last_promote_demote = 0; >>>> >>>> - if (stmt_info && aarch64_use_new_vector_costs_p ()) >>>> + if (stmt_info) >>>> { >>>> /* Account for any extra "embedded" costs that apply additively >>>> to the base cost calculated above. */ >>>> @@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const >>>> vector_costs *uncast_scalar_costs) >>>> >>>> auto *scalar_costs >>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs); >>>> - if (loop_vinfo >>>> - && m_vec_flags >>>> - && aarch64_use_new_vector_costs_p ()) >>>> + if (loop_vinfo && m_vec_flags) >>>> { >>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, >>>> m_costs[vect_body]); >>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> index 5ebaf66e986..74772f3e15f 100644 >>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>>> @@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> index 2d704ecd110..a564528f43d 100644 >>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h >>>> @@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> index bdd309ab03d..f090d5cde50 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h >>>> @@ -183,7 +183,6 @@ static const struct tune_params >>>> generic_armv8_a_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. 
*/ >>>> &generic_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> index 785e00946bc..7b5821183bc 100644 >>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h >>>> @@ -251,7 +251,6 @@ static const struct tune_params >>>> generic_armv9_a_tunings = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> index 007f987154c..f7457df59e5 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h >>>> @@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings >>>> = >>>> 0, /* max_case_values. */ >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> index 32560d2f5f8..541b61c8179 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> index 2010bc4645b..eff668132a8 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> index c3751e32696..d11472b6e1e 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h >>>> @@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. 
*/ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> index 80dbe5c806c..ee77ffdd3bc 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW >>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */ >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> index efe09e16d1e..6ef143ef7d5 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> index 66849f30889..96bdbf971f1 100644 >>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h >>>> @@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings = >>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ >>>> (AARCH64_EXTRA_TUNE_BASE >>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS >>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT >>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */ >>>> &generic_armv9a_prefetch_tune, >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> index 762805ff54b..c334b7a6875 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. 
*/ >>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> index f0ea58e38e2..94cc63049bc 100644 >>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c >>>> @@ -15,4 +15,4 @@ >>>> so we vectorize the offset calculation. This means that the >>>> 64-bit version needs two copies. */ >>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], >>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */ >>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */ >>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], >>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */ >>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc >>>> index be1139a423c..ab57163c243 100644 >>>> --- a/gcc/tree-vect-stmts.cc >>>> +++ b/gcc/tree-vect-stmts.cc >>>> @@ -8834,19 +8834,8 @@ vectorizable_store (vec_info *vinfo, >>>> { >>>> if (costing_p) >>>> { >>>> - /* Only need vector extracting when there are more >>>> - than one stores. */ >>>> - if (nstores > 1) >>>> - inside_cost >>>> - += record_stmt_cost (cost_vec, 1, vec_to_scalar, >>>> - stmt_info, slp_node, >>>> - 0, vect_body); >>>> - /* Take a single lane vector type store as scalar >>>> - store to avoid ICE like 110776. */ >>>> - if (VECTOR_TYPE_P (ltype) >>>> - && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>> - n_adjacent_stores++; >>>> - else >>>> + n_adjacent_stores++; >>>> + if (!VECTOR_TYPE_P (ltype)) >>> >>> This should be combined with the Single lane Vector case belle >>> >>>> inside_cost >>>> += record_stmt_cost (cost_vec, 1, scalar_store, >>>> stmt_info, 0, vect_body); >>>> @@ -8905,9 +8894,25 @@ vectorizable_store (vec_info *vinfo, >>>> if (costing_p) >>>> { >>>> if (n_adjacent_stores > 0) >>>> - vect_get_store_cost (vinfo, stmt_info, slp_node, >>>> n_adjacent_stores, >>>> - alignment_support_scheme, misalignment, >>>> - &inside_cost, cost_vec); >>>> + { >>>> + /* Take a single lane vector type store as scalar >>>> + store to avoid ICE like 110776. */ >>>> + if (VECTOR_TYPE_P (ltype) >>>> + && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U)) >>>> + inside_cost >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>> + scalar_store, stmt_info, 0, vect_body); >>>> + /* Only need vector extracting when there are more >>>> + than one stores. */ >>>> + if (nstores > 1) >>>> + inside_cost >>>> + += record_stmt_cost (cost_vec, n_adjacent_stores, >>>> + vec_to_scalar, stmt_info, slp_node, >>>> + 0, vect_body); >>>> + vect_get_store_cost (vinfo, stmt_info, slp_node, >>> >>> This should be Inlay done for Multi-lane vectors >> Thanks for the quick reply. As I am making the changes, I am wondering: Do >> we even need n_adjacent_stores anymore? It appears to always have the same >> value as nstores. Can we remove it and use nstores instead or does it still >> serve another purpose? > > It was a heuristic needed for powerpc(?), can you confirm we’re not combining > stores from VF unrolling for strided SLP stores? 
Hi Richard,
The reasoning behind my suggestion to replace n_adjacent_stores with nstores in this code section is that, with my patch, they will logically always have the same value.
Having said that, I looked into why n_adjacent_stores was introduced in the
first place: The patch [1] that introduced n_adjacent_stores fixed a regression
on aarch64 by costing vector loads/stores together. The variables
n_adjacent_stores and n_adjacent_loads were added in two code sections each in
vectorizable_store and vectorizable_load. The connection to PowerPC you
recalled is also mentioned in the PR, but I believe it refers to the enum
dr_alignment_support alignment_support_scheme that is used in
vect_get_store_cost (vinfo, stmt_info, slp_node,
n_adjacent_stores, alignment_support_scheme,
misalignment, &inside_cost, cost_vec);
to which I made no changes other than refactoring the if-statement around it.
So, taking into account that n_adjacent_stores was introduced in multiple
locations, I would actually leave n_adjacent_stores in the code section that I
changed, in order to keep vectorizable_store and vectorizable_load consistent.
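For illustration, the grouped costing this keeps is roughly of the following
shape; this is only an outline pieced together from the hunks quoted earlier in
the thread, not the exact final hunk (the scalar-store vs. vector-store split in
the updated patch follows your review comments):

  /* Inside the per-element store loop: only count the pieces.  */
  if (costing_p)
    {
      n_adjacent_stores++;
      ...
      continue;
    }

  /* After the loop: cost the whole group at once instead of costing
     each element individually.  */
  if (costing_p && n_adjacent_stores > 0)
    {
      /* Only need vector extracting when one vector feeds more than
	 one store.  */
      if (nstores > 1)
	inside_cost
	  += record_stmt_cost (cost_vec, n_adjacent_stores, vec_to_scalar,
			       stmt_info, slp_node, 0, vect_body);
      vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
			   alignment_support_scheme, misalignment,
			   &inside_cost, cost_vec);
    }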
Regarding your question about whether we combine stores from VF unrolling for
strided SLP stores: I'm not entirely sure what you mean, but could it be
covered by the gcc.target/aarch64/ldp_stp_* tests that were also mentioned in
[1]?
I added the changes you proposed in the updated patch below, but kept
n_adjacent_stores. The patch was re-validated on aarch64.
Thanks,
Jennifer
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111784#c3
This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
default. To that end, the function aarch64_use_new_vector_costs_p and its uses
were removed. To prevent vec_to_scalar operations from being costed as 0, as
described in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
we adjusted vectorizable_store such that the variable n_adjacent_stores
also covers vec_to_scalar operations. This way vec_to_scalar operations
are not costed individually, but as a group.
Two tests were adjusted due to changes in codegen. In both cases, the
old code unrolled the loop once, whereas the new code does not:
Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
cbz w3, .L92
mov x4, 0
uxtw x3, w3
+ cntd x5
+ whilelo p7.d, xzr, x3
+ mov z29.s, w5
mov z31.s, w2
- whilelo p6.d, xzr, x3
- mov x2, x3
- index z30.s, #0, #1
- uqdecd x2
- ptrue p5.b, all
- whilelo p7.d, xzr, x2
+ index z30.d, #0, #1
+ ptrue p6.b, all
.p2align 3,,7
.L94:
- ld1d z27.d, p7/z, [x0, #1, mul vl]
- ld1d z28.d, p6/z, [x0]
- movprfx z29, z31
- mul z29.s, p5/m, z29.s, z30.s
- incw x4
- uunpklo z0.d, z29.s
- uunpkhi z29.d, z29.s
- ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
- ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
- add z25.d, z28.d, z25.d
+ ld1d z27.d, p7/z, [x0, x4, lsl 3]
+ movprfx z28, z31
+ mul z28.s, p6/m, z28.s, z30.s
+ ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
add z26.d, z27.d, z26.d
- st1d z26.d, p7, [x0, #1, mul vl]
- whilelo p7.d, x4, x2
- st1d z25.d, p6, [x0]
- incw z30.s
- incb x0, all, mul #2
- whilelo p6.d, x4, x3
+ st1d z26.d, p7, [x0, x4, lsl 3]
+ add z30.s, z30.s, z29.s
+ incd x4
+ whilelo p7.d, x4, x3
b.any .L94
.L92:
ret
Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
-O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none):
f_int64_t_32:
cbz w3, .L84
- addvl x5, x1, #1
mov x4, 0
uxtw x3, w3
- mov z31.s, w2
+ cntd x5
whilelo p7.d, xzr, x3
- mov x2, x3
- index z30.s, #0, #1
- uqdecd x2
- ptrue p5.b, all
- whilelo p6.d, xzr, x2
+ mov z29.s, w5
+ mov z31.s, w2
+ index z30.d, #0, #1
+ ptrue p6.b, all
.p2align 3,,7
.L86:
- ld1d z28.d, p7/z, [x1, x4, lsl 3]
- ld1d z27.d, p6/z, [x5, x4, lsl 3]
- movprfx z29, z30
- mul z29.s, p5/m, z29.s, z31.s
- add z28.d, z28.d, #1
- uunpklo z26.d, z29.s
- st1d z28.d, p7, [x0, z26.d, lsl 3]
- incw x4
- uunpkhi z29.d, z29.s
+ ld1d z27.d, p7/z, [x1, x4, lsl 3]
+ movprfx z28, z30
+ mul z28.s, p6/m, z28.s, z31.s
add z27.d, z27.d, #1
- whilelo p6.d, x4, x2
- st1d z27.d, p7, [x0, z29.d, lsl 3]
- incw z30.s
+ st1d z27.d, p7, [x0, z28.d, uxtw 3]
+ incd x4
+ add z30.s, z30.s, z29.s
whilelo p7.d, x4, x3
b.any .L86
.L84:
ret
The patch was bootstrapped and tested on aarch64-linux-gnu, no
regression.
OK for mainline?
Signed-off-by: Jennifer Schmitz <[email protected]>
gcc/
* tree-vect-stmts.cc (vectorizable_store): Extend the use of
n_adjacent_stores to also cover vec_to_scalar operations.
* config/aarch64/aarch64-tuning-flags.def: Remove
use_new_vector_costs as tuning option.
* config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
Remove.
(aarch64_vector_costs::add_stmt_cost): Remove use of
aarch64_use_new_vector_costs_p.
(aarch64_vector_costs::finish_cost): Remove use of
aarch64_use_new_vector_costs_p.
* config/aarch64/tuning_models/cortexx925.h: Remove
AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
* config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
* config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
* config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
* config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
* config/aarch64/tuning_models/neoversen2.h: Likewise.
* config/aarch64/tuning_models/neoversen3.h: Likewise.
* config/aarch64/tuning_models/neoversev1.h: Likewise.
* config/aarch64/tuning_models/neoversev2.h: Likewise.
* config/aarch64/tuning_models/neoversev3.h: Likewise.
* config/aarch64/tuning_models/neoversev3ae.h: Likewise.
gcc/testsuite/
* gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
* gcc.target/aarch64/sve/strided_store_2.c: Likewise.
---
gcc/config/aarch64/aarch64-tuning-flags.def | 2 -
gcc/config/aarch64/aarch64.cc | 20 ++--------
gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
.../aarch64/tuning_models/fujitsu_monaka.h | 1 -
.../aarch64/tuning_models/generic_armv8_a.h | 1 -
.../aarch64/tuning_models/generic_armv9_a.h | 1 -
.../aarch64/tuning_models/neoverse512tvb.h | 1 -
gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
.../aarch64/tuning_models/neoversev3ae.h | 1 -
.../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
.../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
gcc/tree-vect-stmts.cc | 40 ++++++++++---------
16 files changed, 27 insertions(+), 50 deletions(-)
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
b/gcc/config/aarch64/aarch64-tuning-flags.def
index ffbff20e29c..1de633c739b 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend",
CHEAP_SHIFT_EXTEND)
AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants", CSE_SVE_VL_CONSTANTS)
-AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)
-
AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
MATCHED_VECTOR_THROUGHPUT)
AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", AVOID_CROSS_LOOP_FMA)
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 77a2a6bfa3a..71fba9cc63b 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16627,16 +16627,6 @@ aarch64_vectorize_create_costs (vec_info *vinfo, bool
costing_for_scalar)
return new aarch64_vector_costs (vinfo, costing_for_scalar);
}
-/* Return true if the current CPU should use the new costs defined
- in GCC 11. This should be removed for GCC 12 and above, with the
- costs applying to all CPUs instead. */
-static bool
-aarch64_use_new_vector_costs_p ()
-{
- return (aarch64_tune_params.extra_tuning_flags
- & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
-}
-
/* Return the appropriate SIMD costs for vectors of type VECTYPE. */
static const simd_vec_cost *
aarch64_simd_vec_costs (tree vectype)
@@ -17555,7 +17545,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
vect_cost_for_stmt kind,
/* Do one-time initialization based on the vinfo. */
loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
- if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
+ if (!m_analyzed_vinfo)
{
if (loop_vinfo)
analyze_loop_vinfo (loop_vinfo);
@@ -17573,7 +17563,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
vect_cost_for_stmt kind,
/* Try to get a more accurate cost by looking at STMT_INFO instead
of just looking at KIND. */
- if (stmt_info && aarch64_use_new_vector_costs_p ())
+ if (stmt_info)
{
/* If we scalarize a strided store, the vectorizer costs one
vec_to_scalar for each element. However, we can store the first
@@ -17638,7 +17628,7 @@ aarch64_vector_costs::add_stmt_cost (int count,
vect_cost_for_stmt kind,
else
m_num_last_promote_demote = 0;
- if (stmt_info && aarch64_use_new_vector_costs_p ())
+ if (stmt_info)
{
/* Account for any extra "embedded" costs that apply additively
to the base cost calculated above. */
@@ -17999,9 +17989,7 @@ aarch64_vector_costs::finish_cost (const vector_costs
*uncast_scalar_costs)
auto *scalar_costs
= static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
- if (loop_vinfo
- && m_vec_flags
- && aarch64_use_new_vector_costs_p ())
+ if (loop_vinfo && m_vec_flags)
{
m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
m_costs[vect_body]);
diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
b/gcc/config/aarch64/tuning_models/cortexx925.h
index 5ebaf66e986..74772f3e15f 100644
--- a/gcc/config/aarch64/tuning_models/cortexx925.h
+++ b/gcc/config/aarch64/tuning_models/cortexx925.h
@@ -221,7 +221,6 @@ static const struct tune_params cortexx925_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
&generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
index 2d704ecd110..a564528f43d 100644
--- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
+++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
@@ -55,7 +55,6 @@ static const struct tune_params fujitsu_monaka_tunings =
0, /* max_case_values. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
&generic_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index bdd309ab03d..f090d5cde50 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -183,7 +183,6 @@ static const struct tune_params generic_armv8_a_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
&generic_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index 785e00946bc..7b5821183bc 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -251,7 +251,6 @@ static const struct tune_params generic_armv9_a_tunings =
0, /* max_case_values. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
&generic_armv9a_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
index 007f987154c..f7457df59e5 100644
--- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
+++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
@@ -156,7 +156,6 @@ static const struct tune_params neoverse512tvb_tunings =
0, /* max_case_values. */
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
&generic_armv9a_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
b/gcc/config/aarch64/tuning_models/neoversen2.h
index 32560d2f5f8..541b61c8179 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen2_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
&generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
b/gcc/config/aarch64/tuning_models/neoversen3.h
index 2010bc4645b..eff668132a8 100644
--- a/gcc/config/aarch64/tuning_models/neoversen3.h
+++ b/gcc/config/aarch64/tuning_models/neoversen3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversen3_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
&generic_armv9a_prefetch_tune,
AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
b/gcc/config/aarch64/tuning_models/neoversev1.h
index c3751e32696..d11472b6e1e 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -228,7 +228,6 @@ static const struct tune_params neoversev1_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
&generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
b/gcc/config/aarch64/tuning_models/neoversev2.h
index 80dbe5c806c..ee77ffdd3bc 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev2_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
| AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
b/gcc/config/aarch64/tuning_models/neoversev3.h
index efe09e16d1e..6ef143ef7d5 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
&generic_armv9a_prefetch_tune,
diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
b/gcc/config/aarch64/tuning_models/neoversev3ae.h
index 66849f30889..96bdbf971f1 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
@@ -219,7 +219,6 @@ static const struct tune_params neoversev3ae_tunings =
tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
(AARCH64_EXTRA_TUNE_BASE
| AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
- | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
| AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
| AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
&generic_armv9a_prefetch_tune,
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
index 762805ff54b..c334b7a6875 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
@@ -15,4 +15,4 @@
so we vectorize the offset calculation. This means that the
64-bit version needs two copies. */
/* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z, \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d, p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
index f0ea58e38e2..94cc63049bc 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
@@ -15,4 +15,4 @@
so we vectorize the offset calculation. This means that the
64-bit version needs two copies. */
/* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7], \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
-/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
+/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7], \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index be1139a423c..a14248193ca 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -8834,22 +8834,7 @@ vectorizable_store (vec_info *vinfo,
{
if (costing_p)
{
- /* Only need vector extracting when there are more
- than one stores. */
- if (nstores > 1)
- inside_cost
- += record_stmt_cost (cost_vec, 1, vec_to_scalar,
- stmt_info, slp_node,
- 0, vect_body);
- /* Take a single lane vector type store as scalar
- store to avoid ICE like 110776. */
- if (VECTOR_TYPE_P (ltype)
- && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
- n_adjacent_stores++;
- else
- inside_cost
- += record_stmt_cost (cost_vec, 1, scalar_store,
- stmt_info, 0, vect_body);
+ n_adjacent_stores++;
continue;
}
tree newref, newoff;
@@ -8905,9 +8890,26 @@ vectorizable_store (vec_info *vinfo,
if (costing_p)
{
if (n_adjacent_stores > 0)
- vect_get_store_cost (vinfo, stmt_info, slp_node, n_adjacent_stores,
- alignment_support_scheme, misalignment,
- &inside_cost, cost_vec);
+ {
+ /* Take a single lane vector type store as scalar
+ store to avoid ICE like 110776. */
+ if (VECTOR_TYPE_P (ltype)
+ && known_ne (TYPE_VECTOR_SUBPARTS (ltype), 1U))
+ vect_get_store_cost (vinfo, stmt_info, slp_node,
+ n_adjacent_stores, alignment_support_scheme,
+ misalignment, &inside_cost, cost_vec);
+ else
+ inside_cost
+ += record_stmt_cost (cost_vec, n_adjacent_stores,
+ scalar_store, stmt_info, 0, vect_body);
+ /* Only need vector extracting when there are more
+ than one stores. */
+ if (nstores > 1)
+ inside_cost
+ += record_stmt_cost (cost_vec, n_adjacent_stores,
+ vec_to_scalar, stmt_info, slp_node,
+ 0, vect_body);
+ }
if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
"vect_model_store_cost: inside_cost = %d, "
--
2.44.0
>
>> Thanks, Jennifer
>>>
>>>> + n_adjacent_stores, alignment_support_scheme,
>>>> + misalignment, &inside_cost, cost_vec);
>>>> + }
>>>> if (dump_enabled_p ())
>>>> dump_printf_loc (MSG_NOTE, vect_location,
>>>> "vect_model_store_cost: inside_cost = %d, "
>>>> --
>>>> 2.34.1
>>>>>
>>>>>> + inside_cost
>>>>>> + += record_stmt_cost (cost_vec, n_adjacent_stores,
>>>>>> vec_to_scalar,
>>>>>> + stmt_info, slp_node,
>>>>>> + 0, vect_body);
>>>>>> + }
>>>>>> if (dump_enabled_p ())
>>>>>> dump_printf_loc (MSG_NOTE, vect_location,
>>>>>> "vect_model_store_cost: inside_cost = %d, "
>>>>>> --
>>>>>> 2.44.0
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>> Richard
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jennifer
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Jennifer
>>>>>>>>>>>
>>>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> tunable and
>>>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>>>>>> default. To that end, the function aarch64_use_new_vector_costs_p
>>>>>>>>>>> and its uses
>>>>>>>>>>> were removed. To prevent costing vec_to_scalar operations with 0, as
>>>>>>>>>>> described in
>>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665481.html,
>>>>>>>>>>> we guarded the call to vect_is_store_elt_extraction in
>>>>>>>>>>> aarch64_vector_costs::add_stmt_cost by count > 1.
>>>>>>>>>>>
>>>>>>>>>>> Two tests were adjusted due to changes in codegen. In both cases,
>>>>>>>>>>> the
>>>>>>>>>>> old code performed loop unrolling once, but the new code does not:
>>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_load_2.c (compiled with
>>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>> cbz w3, .L92
>>>>>>>>>>> mov x4, 0
>>>>>>>>>>> uxtw x3, w3
>>>>>>>>>>> + cntd x5
>>>>>>>>>>> + whilelo p7.d, xzr, x3
>>>>>>>>>>> + mov z29.s, w5
>>>>>>>>>>> mov z31.s, w2
>>>>>>>>>>> - whilelo p6.d, xzr, x3
>>>>>>>>>>> - mov x2, x3
>>>>>>>>>>> - index z30.s, #0, #1
>>>>>>>>>>> - uqdecd x2
>>>>>>>>>>> - ptrue p5.b, all
>>>>>>>>>>> - whilelo p7.d, xzr, x2
>>>>>>>>>>> + index z30.d, #0, #1
>>>>>>>>>>> + ptrue p6.b, all
>>>>>>>>>>> .p2align 3,,7
>>>>>>>>>>> .L94:
>>>>>>>>>>> - ld1d z27.d, p7/z, [x0, #1, mul vl]
>>>>>>>>>>> - ld1d z28.d, p6/z, [x0]
>>>>>>>>>>> - movprfx z29, z31
>>>>>>>>>>> - mul z29.s, p5/m, z29.s, z30.s
>>>>>>>>>>> - incw x4
>>>>>>>>>>> - uunpklo z0.d, z29.s
>>>>>>>>>>> - uunpkhi z29.d, z29.s
>>>>>>>>>>> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3]
>>>>>>>>>>> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3]
>>>>>>>>>>> - add z25.d, z28.d, z25.d
>>>>>>>>>>> + ld1d z27.d, p7/z, [x0, x4, lsl 3]
>>>>>>>>>>> + movprfx z28, z31
>>>>>>>>>>> + mul z28.s, p6/m, z28.s, z30.s
>>>>>>>>>>> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3]
>>>>>>>>>>> add z26.d, z27.d, z26.d
>>>>>>>>>>> - st1d z26.d, p7, [x0, #1, mul vl]
>>>>>>>>>>> - whilelo p7.d, x4, x2
>>>>>>>>>>> - st1d z25.d, p6, [x0]
>>>>>>>>>>> - incw z30.s
>>>>>>>>>>> - incb x0, all, mul #2
>>>>>>>>>>> - whilelo p6.d, x4, x3
>>>>>>>>>>> + st1d z26.d, p7, [x0, x4, lsl 3]
>>>>>>>>>>> + add z30.s, z30.s, z29.s
>>>>>>>>>>> + incd x4
>>>>>>>>>>> + whilelo p7.d, x4, x3
>>>>>>>>>>> b.any .L94
>>>>>>>>>>> .L92:
>>>>>>>>>>> ret
>>>>>>>>>>>
>>>>>>>>>>> Example from gcc.target/aarch64/sve/strided_store_2.c (compiled with
>>>>>>>>>>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic
>>>>>>>>>>> -moverride=tune=none):
>>>>>>>>>>> f_int64_t_32:
>>>>>>>>>>> cbz w3, .L84
>>>>>>>>>>> - addvl x5, x1, #1
>>>>>>>>>>> mov x4, 0
>>>>>>>>>>> uxtw x3, w3
>>>>>>>>>>> - mov z31.s, w2
>>>>>>>>>>> + cntd x5
>>>>>>>>>>> whilelo p7.d, xzr, x3
>>>>>>>>>>> - mov x2, x3
>>>>>>>>>>> - index z30.s, #0, #1
>>>>>>>>>>> - uqdecd x2
>>>>>>>>>>> - ptrue p5.b, all
>>>>>>>>>>> - whilelo p6.d, xzr, x2
>>>>>>>>>>> + mov z29.s, w5
>>>>>>>>>>> + mov z31.s, w2
>>>>>>>>>>> + index z30.d, #0, #1
>>>>>>>>>>> + ptrue p6.b, all
>>>>>>>>>>> .p2align 3,,7
>>>>>>>>>>> .L86:
>>>>>>>>>>> - ld1d z28.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>>> - ld1d z27.d, p6/z, [x5, x4, lsl 3]
>>>>>>>>>>> - movprfx z29, z30
>>>>>>>>>>> - mul z29.s, p5/m, z29.s, z31.s
>>>>>>>>>>> - add z28.d, z28.d, #1
>>>>>>>>>>> - uunpklo z26.d, z29.s
>>>>>>>>>>> - st1d z28.d, p7, [x0, z26.d, lsl 3]
>>>>>>>>>>> - incw x4
>>>>>>>>>>> - uunpkhi z29.d, z29.s
>>>>>>>>>>> + ld1d z27.d, p7/z, [x1, x4, lsl 3]
>>>>>>>>>>> + movprfx z28, z30
>>>>>>>>>>> + mul z28.s, p6/m, z28.s, z31.s
>>>>>>>>>>> add z27.d, z27.d, #1
>>>>>>>>>>> - whilelo p6.d, x4, x2
>>>>>>>>>>> - st1d z27.d, p7, [x0, z29.d, lsl 3]
>>>>>>>>>>> - incw z30.s
>>>>>>>>>>> + st1d z27.d, p7, [x0, z28.d, uxtw 3]
>>>>>>>>>>> + incd x4
>>>>>>>>>>> + add z30.s, z30.s, z29.s
>>>>>>>>>>> whilelo p7.d, x4, x3
>>>>>>>>>>> b.any .L86
>>>>>>>>>>> .L84:
>>>>>>>>>>> ret
>>>>>>>>>>>
>>>>>>>>>>> The patch was bootstrapped and tested on aarch64-linux-gnu, no
>>>>>>>>>>> regression. We also ran SPEC2017 with -mcpu=generic on a Grace
>>>>>>>>>>> machine and saw
>>>>>>>>>>> no non-noise impact on performance. We would appreciate help with
>>>>>>>>>>> wider
>>>>>>>>>>> benchmarking on other platforms, if necessary.
>>>>>>>>>>> OK for mainline?
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Jennifer Schmitz <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>> gcc/
>>>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def: Remove
>>>>>>>>>>> use_new_vector_costs as tuning option.
>>>>>>>>>>> * config/aarch64/aarch64.cc (aarch64_use_new_vector_costs_p):
>>>>>>>>>>> Remove.
>>>>>>>>>>> (aarch64_vector_costs::add_stmt_cost): Remove use of
>>>>>>>>>>> aarch64_use_new_vector_costs_p and guard call to
>>>>>>>>>>> vect_is_store_elt_extraction with count > 1.
>>>>>>>>>>> (aarch64_vector_costs::finish_cost): Remove use of
>>>>>>>>>>> aarch64_use_new_vector_costs_p.
>>>>>>>>>>> * config/aarch64/tuning_models/cortexx925.h: Remove
>>>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS.
>>>>>>>>>>> * config/aarch64/tuning_models/fujitsu_monaka.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/generic_armv8_a.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/generic_armv9_a.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoverse512tvb.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversen2.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversen3.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev1.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev2.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev3.h: Likewise.
>>>>>>>>>>> * config/aarch64/tuning_models/neoversev3ae.h: Likewise.
>>>>>>>>>>>
>>>>>>>>>>> gcc/testsuite/
>>>>>>>>>>> * gcc.target/aarch64/sve/strided_load_2.c: Adjust expected outcome.
>>>>>>>>>>> * gcc.target/aarch64/sve/strided_store_2.c: Likewise.
>>>>>>>>>>> ---
>>>>>>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 --
>>>>>>>>>>> gcc/config/aarch64/aarch64.cc | 22
>>>>>>>>>>> +++++--------------
>>>>>>>>>>> gcc/config/aarch64/tuning_models/cortexx925.h | 1 -
>>>>>>>>>>> .../aarch64/tuning_models/fujitsu_monaka.h | 1 -
>>>>>>>>>>> .../aarch64/tuning_models/generic_armv8_a.h | 1 -
>>>>>>>>>>> .../aarch64/tuning_models/generic_armv9_a.h | 1 -
>>>>>>>>>>> .../aarch64/tuning_models/neoverse512tvb.h | 1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen2.h | 1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversen3.h | 1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev1.h | 1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev2.h | 1 -
>>>>>>>>>>> gcc/config/aarch64/tuning_models/neoversev3.h | 1 -
>>>>>>>>>>> .../aarch64/tuning_models/neoversev3ae.h | 1 -
>>>>>>>>>>> .../gcc.target/aarch64/sve/strided_load_2.c | 2 +-
>>>>>>>>>>> .../gcc.target/aarch64/sve/strided_store_2.c | 2 +-
>>>>>>>>>>> 15 files changed, 7 insertions(+), 32 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> index 5939602576b..ed345b13ed3 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>>>>>>> @@ -38,8 +38,6 @@ AARCH64_EXTRA_TUNING_OPTION
>>>>>>>>>>> ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
>>>>>>>>>>>
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("cse_sve_vl_constants",
>>>>>>>>>>> CSE_SVE_VL_CONSTANTS)
>>>>>>>>>>>
>>>>>>>>>>> -AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs",
>>>>>>>>>>> USE_NEW_VECTOR_COSTS)
>>>>>>>>>>> -
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("matched_vector_throughput",
>>>>>>>>>>> MATCHED_VECTOR_THROUGHPUT)
>>>>>>>>>>>
>>>>>>>>>>> AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma",
>>>>>>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>>>>>>> diff --git a/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> index 43238aefef2..03806671c97 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> +++ b/gcc/config/aarch64/aarch64.cc
>>>>>>>>>>> @@ -16566,16 +16566,6 @@ aarch64_vectorize_create_costs (vec_info
>>>>>>>>>>> *vinfo, bool costing_for_scalar)
>>>>>>>>>>> return new aarch64_vector_costs (vinfo, costing_for_scalar);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> -/* Return true if the current CPU should use the new costs defined
>>>>>>>>>>> - in GCC 11. This should be removed for GCC 12 and above, with
>>>>>>>>>>> the
>>>>>>>>>>> - costs applying to all CPUs instead. */
>>>>>>>>>>> -static bool
>>>>>>>>>>> -aarch64_use_new_vector_costs_p ()
>>>>>>>>>>> -{
>>>>>>>>>>> - return (aarch64_tune_params.extra_tuning_flags
>>>>>>>>>>> - & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
>>>>>>>>>>> -}
>>>>>>>>>>> -
>>>>>>>>>>> /* Return the appropriate SIMD costs for vectors of type VECTYPE.
>>>>>>>>>>> */
>>>>>>>>>>> static const simd_vec_cost *
>>>>>>>>>>> aarch64_simd_vec_costs (tree vectype)
>>>>>>>>>>> @@ -17494,7 +17484,7 @@ aarch64_vector_costs::add_stmt_cost (int
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>>
>>>>>>>>>>> /* Do one-time initialization based on the vinfo. */
>>>>>>>>>>> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>>>>>>>>>>> - if (!m_analyzed_vinfo && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> + if (!m_analyzed_vinfo)
>>>>>>>>>>> {
>>>>>>>>>>> if (loop_vinfo)
>>>>>>>>>>> analyze_loop_vinfo (loop_vinfo);
>>>>>>>>>>> @@ -17512,12 +17502,12 @@ aarch64_vector_costs::add_stmt_cost (int
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>>
>>>>>>>>>>> /* Try to get a more accurate cost by looking at STMT_INFO instead
>>>>>>>>>>> of just looking at KIND. */
>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> + if (stmt_info)
>>>>>>>>>>> {
>>>>>>>>>>> /* If we scalarize a strided store, the vectorizer costs one
>>>>>>>>>>> vec_to_scalar for each element. However, we can store the first
>>>>>>>>>>> element using an FP store without a separate extract step. */
>>>>>>>>>>> - if (vect_is_store_elt_extraction (kind, stmt_info))
>>>>>>>>>>> + if (vect_is_store_elt_extraction (kind, stmt_info) && count
>>>>>>>>>>> > 1)
>>>>>>>>>>> count -= 1;
>>>>>>>>>>>
>>>>>>>>>>> stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
>>>>>>>>>>> @@ -17577,7 +17567,7 @@ aarch64_vector_costs::add_stmt_cost (int
>>>>>>>>>>> count, vect_cost_for_stmt kind,
>>>>>>>>>>> else
>>>>>>>>>>> m_num_last_promote_demote = 0;
>>>>>>>>>>>
>>>>>>>>>>> - if (stmt_info && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> + if (stmt_info)
>>>>>>>>>>> {
>>>>>>>>>>> /* Account for any extra "embedded" costs that apply additively
>>>>>>>>>>> to the base cost calculated above. */
>>>>>>>>>>> @@ -17938,9 +17928,7 @@ aarch64_vector_costs::finish_cost (const
>>>>>>>>>>> vector_costs *uncast_scalar_costs)
>>>>>>>>>>>
>>>>>>>>>>> auto *scalar_costs
>>>>>>>>>>> = static_cast<const aarch64_vector_costs *> (uncast_scalar_costs);
>>>>>>>>>>> - if (loop_vinfo
>>>>>>>>>>> - && m_vec_flags
>>>>>>>>>>> - && aarch64_use_new_vector_costs_p ())
>>>>>>>>>>> + if (loop_vinfo && m_vec_flags)
>>>>>>>>>>> {
>>>>>>>>>>> m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>>>>>>>>> m_costs[vect_body]);
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> index eb9b89984b0..dafea96e924 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
>>>>>>>>>>> cortexx925_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> index 6a098497759..ac001927959 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/fujitsu_monaka.h
>>>>>>>>>>> @@ -55,7 +55,6 @@ static const struct tune_params
>>>>>>>>>>> fujitsu_monaka_tunings =
>>>>>>>>>>> 0, /* max_case_values. */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> index 9b1cbfc5bd2..7b534831340 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
>>>>>>>>>>> @@ -183,7 +183,6 @@ static const struct tune_params
>>>>>>>>>>> generic_armv8_a_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> index 48353a59939..562ef89c67b 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
>>>>>>>>>>> @@ -249,7 +249,6 @@ static const struct tune_params
>>>>>>>>>>> generic_armv9_a_tunings =
>>>>>>>>>>> 0, /* max_case_values. */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>>>>> &generic_armv9a_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> index c407b89a22f..fe4f7c10f73 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
>>>>>>>>>>> @@ -156,7 +156,6 @@ static const struct tune_params
>>>>>>>>>>> neoverse512tvb_tunings =
>>>>>>>>>>> 0, /* max_case_values. */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> index 18199ac206c..56be77423cb 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen2.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
>>>>>>>>>>> neoversen2_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> index 4da85cfac0d..254ad5e27f8 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversen3.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
>>>>>>>>>>> neoversen3_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> AARCH64_LDP_STP_POLICY_ALWAYS, /* ldp_policy_model. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> index dd9120eee48..c7241cf23d7 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev1.h
>>>>>>>>>>> @@ -227,7 +227,6 @@ static const struct tune_params
>>>>>>>>>>> neoversev1_tunings =
>>>>>>>>>>> 0, /* max_case_values. */
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> index 1369de73991..96f55940649 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
>>>>>>>>>>> @@ -232,7 +232,6 @@ static const struct tune_params
>>>>>>>>>>> neoversev2_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags. */
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> index d8c82255378..f62ae67d355 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
>>>>>>>>>>> neoversev3_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> index 7f050501ede..0233baf5e34 100644
>>>>>>>>>>> --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
>>>>>>>>>>> @@ -219,7 +219,6 @@ static const struct tune_params
>>>>>>>>>>> neoversev3ae_tunings =
>>>>>>>>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */
>>>>>>>>>>> (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
>>>>>>>>>>> - | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
>>>>>>>>>>> | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW), /* tune_flags. */
>>>>>>>>>>> &generic_prefetch_tune,
>>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> index 762805ff54b..c334b7a6875 100644
>>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_load_2.c
>>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>>> so we vectorize the offset calculation. This means that the
>>>>>>>>>>> 64-bit version needs two copies. */
>>>>>>>>>>> /* { dg-final { scan-assembler-times {\tld1w\tz[0-9]+\.s, p[0-7]/z,
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,
>>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tld1d\tz[0-9]+\.d,
>>>>>>>>>>> p[0-7]/z, \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>>> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> index f0ea58e38e2..94cc63049bc 100644
>>>>>>>>>>> --- a/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> +++ b/gcc/testsuite/gcc.target/aarch64/sve/strided_store_2.c
>>>>>>>>>>> @@ -15,4 +15,4 @@
>>>>>>>>>>> so we vectorize the offset calculation. This means that the
>>>>>>>>>>> 64-bit version needs two copies. */
>>>>>>>>>>> /* { dg-final { scan-assembler-times {\tst1w\tz[0-9]+\.s, p[0-7],
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.s, uxtw 2\]\n} 3 } } */
>>>>>>>>>>> -/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 15 } } */
>>>>>>>>>>> +/* { dg-final { scan-assembler-times {\tst1d\tz[0-9]+\.d, p[0-7],
>>>>>>>>>>> \[x[0-9]+, z[0-9]+.d, lsl 3\]\n} 9 } } */
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Richard Biener <[email protected]>
>>>>>>>>>> SUSE Software Solutions Germany GmbH,
>>>>>>>>>> Frankenstrasse 146, 90461 Nuernberg, Germany;
>>>>>>>>>> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
>>>>>>>>>> Nuernberg)