Hi Tamar,

> On 21 Jul 2025, at 11:12, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> Hi Kyrill,
> 
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Friday, July 18, 2025 10:40 AM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
>> <alex.cop...@arm.com>
>> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
>> instructions
>> 
>> Hi all,
>> 
>> For inserting zero into a vector lane we usually use an instruction like:
>> ins     v0.h[2], wzr
>> 
>> This, however, has not-so-great performance on some CPUs.
>> On Grace, for example, it has a latency of 5 and a throughput of 1.
>> The alternative sequence:
>> movi    v31.8b, #0
>> ins     v0.h[2], v31.h[0]
>> is preferable because the MOVI-0 is often a zero-latency operation that is
>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
>> and a throughput of 4.
>> We can avoid the merging of the two instructions into the
>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
>> cost field ins_gp is introduced to describe this operation.
>> According to a similar LLVM PR:
>> https://github.com/llvm/llvm-project/pull/146538
>> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to benefit
>> from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
>> respectable latency of 3 on Cortex-A510).
>> 
> 
> I'm not really sure about the Cortex-A510; the entries are a bit suspect.  I
> think they haven't added an entry for INS from GPR, which is an alias for
> MOV (element, GPR), and the throughput for the INS they did add (element to
> element) seems suspect.
> 
> If I assume the same latency as an FMOV for INS from WZR, then yeah, the MOVI
> is slower (at least on paper), but I'd expect the throughput to matter more in
> actual code.
> 
> So I think we should just do this for every core and by default.  I have a 
> couple of
> arguments for this:
> 
> 1. We already do this for every constant but 0, since we prefer the throughput
>    advantage here.  See https://godbolt.org/z/9PKhEax8G: if you change your
>    testcase to anything but 0 you get the movi, even though strictly speaking
>    the movi is more expensive than the mov.
> 
> 2. I've run this sequence change over various cores, and they all showed
>    either an improvement or very little difference, e.g.:
> 
> cortex-a510   │ little │         -0.1%        
> cortex-a520   │ little │        -0.09%
> cortex-a710   │  mid   │        -0.03%
> 
> etc.  This is comparing using the movi sequence over the INS from WZR.

Thanks for the broader benchmarking!
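
For point 1, the kind of variant I understand you mean is something like the
following (my sketch, call it foo1; any non-zero constant behaves the same
way):

__attribute__((noipa))
uint16x8_t foo1 (uint16x8_t a)
{
  /* Insert a non-zero constant instead of 0.  */
  a[2] = 1;
  return a;
}

which already gets the movi + lane-to-lane ins form today, so 0 is currently
the odd one out.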


> 
> However, the advantage grows significantly when there is more than one insert
> of the same constant, e.g.:
> 
> __attribute__((noipa))
> uint16x8_t foo2(uint16x8_t a) {
>  a[2] = 0;
>  a[5] = 0;
>  return a;
> }
> 
> Now the movi pays for itself, and the INS-from-WZR sequence is serialized on
> most cores.
> 
> Benchmarking this gives you, as expected for the in-order cores:
> 
> cortex-a510   │ little │         -0.1%
> cortex-a520   │ little │        -0.05%
> but then for the out-of-order cores:
> 
> cortex-a710   │  mid   │      ✔  -49.49%
> neoverse v2   │  big   │      ✔  -49.57%
> 
> etc.
> 
> So again, I think we should do this by default.

Yeah that’s fine with me.
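
For the foo2 case above, the contrast would be roughly (my sketch; register
choice aside):

        // current: two INS from WZR, each a GPR-to-vector transfer with the
        // higher latency discussed above
        ins     v0.h[2], wzr
        ins     v0.h[5], wzr

        // proposed: one MOVI (often zero latency / eliminated at rename)
        // shared by two cheap lane-to-lane INS
        movi    v31.8b, #0
        ins     v0.h[2], v31.h[0]
        ins     v0.h[5], v31.h[0]

which matches the big win you see on the out-of-order cores.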


> 
> As a side note, the current patch only seems to work outside loops, e.g.:
> 
> __attribute__((noipa))
> uint16x8_t foos(uint16x8_t a, int n) {
>  for (int i = 0; i < n; i++)
>    a[2] = 0;
>  return a;
> }
> 
> This does not seem to produce the movi, and I think that's because, after
> rejecting the movi in the vec_merge and floating it into the loop preheader,
> combine now tries to force it back down and ultimately succeeds.
> 
> Given the above, what do you think about instead just disabling the
> aarch64_simd_vec_set_zero pattern when not compiling for size?
> 

Yes, that would make for a simpler patch and wouldn't rely on the rtx-cost
users doing the right thing.
I’ll respin accordingly.
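Something along these lines, I imagine (just a sketch; the pattern's operands
stay as they are and only the insn condition changes, assuming its current
condition is plain TARGET_SIMD):

 (define_insn "aarch64_simd_vec_set_zero<mode>"
   [...]
-  "TARGET_SIMD"
+  "TARGET_SIMD && optimize_function_for_size_p (cfun)"
   ...)

so that the single INS-from-WZR form is only kept when optimizing for size.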
Thanks,
Kyrill

> That should do what we want in all cases.
> 
> Thanks,
> Tamar
> 
>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
>> split into two instructions, while values lower than that retain the INS-WZR
>> form.
>> cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
>> sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles: 3 is the
>> number of cycles above the cheapest SIMD instruction on such cores (which
>> takes 2 cycles).
>> 
>> cortexa53_extra_costs and all the other cost tables set ins_gp to
>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
>> increase it for generic tuning.
>> 
>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
>> always generated.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>> Thanks,
>> Kyrill
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/
>> 
>> * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>> field.  Add comments to other vector cost fields.
>> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
>> case.
>> * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>> thunderx_extra_costs, thunderx2t99_extra_costs,
>> thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>> ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>> Specify ins_gp cost.
>> * config/arm/aarch-cost-tables.h (generic_extra_costs,
>> cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>> exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>> 
>> gcc/testsuite/
>> 
>> * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>> test_setq_lane4): Relax allowed assembly.

