> On 21 Jul 2025, at 11:43, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> 
> Hi Tamar,
> 
>> On 21 Jul 2025, at 11:12, Tamar Christina <tamar.christ...@arm.com> wrote:
>> 
>> Hi Kyrill,
>> 
>>> -----Original Message-----
>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> Sent: Friday, July 18, 2025 10:40 AM
>>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>>> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
>>> <alex.cop...@arm.com>
>>> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR instructions
>>> 
>>> Hi all,
>>> 
>>> For inserting zero into a vector lane we usually use an instruction like:
>>> ins     v0.h[2], wzr
>>> 
>>> This, however, has not-so-great performance on some CPUs.
>>> On Grace, for example, it has a latency of 5 and a throughput of 1.
>>> The alternative sequence:
>>> movi    v31.8b, #0
>>> ins     v0.h[2], v31.h[0]
>>> is preferable because the MOVI-0 is often a zero-latency operation that is
>>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
>>> and a throughput of 4.
>>> We can avoid the merging of the two instructions into the
>>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
>>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
>>> cost field ins_gp is introduced to describe this operation.
>>> According to a similar LLVM PR
>>> (https://github.com/llvm/llvm-project/pull/146538) and looking at some Arm
>>> SWOGs, I expect the Neoverse-derived cores to benefit from this, whereas
>>> little cores like Cortex-A510 won't (INS-WZR has a respectable latency of 3
>>> on Cortex-A510).
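>>> 
>>> For reference, a minimal reproducer for the lane-zero insert discussed
>>> above (a sketch; the function name is illustrative):
>>> 
>>> #include <arm_neon.h>
>>> 
>>> /* Set lane 2 of a to zero; currently compiles to "ins v0.h[2], wzr"
>>>    at -O2.  */
>>> uint16x8_t
>>> set_lane2_zero (uint16x8_t a)
>>> {
>>>   return vsetq_lane_u16 (0, a, 2);
>>> }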
>>> 
>> 
>> I'm not really sure about the Cortex-A510; the entries are a bit suspect.
>> I think they haven't added an entry for INS from GPR, which is an alias for
>> MOV element, GPR.  And the throughput for the INS they did add (element to
>> element) seems suspect.
>> 
>> If I assume the same latency as an FMOV for INS GPR Wzr, then yeah, the
>> MOVI is slower (at least on paper), but I'd expect the throughput to matter
>> more in actual code.
>> 
>> So I think we should just do this for every core and by default.  I have a
>> couple of arguments for this:
>> 
>> 1. We already do this for every constant but 0, since we prefer the
>>    throughput advantage there.  See https://godbolt.org/z/9PKhEax8G where,
>>    if you change your testcase to anything but 0, you get the movi
>>    (sketched after this list), even though strictly speaking the movi is
>>    more expensive than the mov.
>> 
>> 2. I've run this sequence change over various cores, and they all showed
>>    either an improvement or very little difference, e.g.:
>> 
>> cortex-a510   │ little │         -0.1%
>> cortex-a520   │ little │        -0.09%
>> cortex-a710   │  mid   │        -0.03%
>> 
>> etc.  This is comparing using movi over the INS, wzr.
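>> 
>> Point 1's variant, sketched (the function name is illustrative; the
>> godbolt link above has the live version):
>> 
>> #include <arm_neon.h>
>> 
>> __attribute__((noipa))
>> uint16x8_t foo1(uint16x8_t a) {
>>   a[2] = 1;  /* any constant but 0: we already get movi + ins here */
>>   return a;
>> }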
> 
> Thanks for the broader benchmarking!
> 
> 
>> 
>> However, the advantage grows significantly when there is more than one
>> insert of the same constant, e.g.:
>> 
>> __attribute__((noipa))
>> uint16x8_t foo2(uint16x8_t a) {
>>   a[2] = 0;
>>   a[5] = 0;
>>   return a;
>> }
>> 
>> Now the movi pays for itself, and the INS-wzr sequence is serial on most
>> cores.
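>> 
>> Roughly, the two sequences compare like this (a sketch; register
>> allocation may differ):
>> 
>>     // INS-WZR form: both inserts write v0 back-to-back, so they
>>     // serialise, and each INS from GPR has latency 5 on e.g. Grace.
>>     ins     v0.h[2], wzr
>>     ins     v0.h[5], wzr
>> 
>>     // MOVI form: the zero is materialised once (often eliminated in the
>>     // frontend) and each lane-to-lane INS has latency 2.
>>     movi    v31.8b, #0
>>     ins     v0.h[2], v31.h[0]
>>     ins     v0.h[5], v31.h[0]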
>> 
>> Benchmarking this gives, as expected for in-order cores:
>> 
>> cortex-a510   │ little │         -0.1%
>> cortex-a520   │ little │        -0.05%
>> 
>> but for out-of-order cores:
>> 
>> cortex-a710   │  mid   │      ✔  -49.49%
>> neoverse v2   │  big   │      ✔  -49.57%
>> 
>> etc.
>> 
>> This shows why I think we should do this by default.
> 
> Yeah that’s fine with me.
> 
> 
>> 
>> As a side note, the current patch only seems to work outside loops, e.g.:
>> 
>> __attribute__((noipa))
>> uint16x8_t foos(uint16x8_t a, int n) {
>>   for (int i = 0; i < n; i++)
>>     a[2] = 0;
>>   return a;
>> }
>> 
>> This does not seem to generate the movi, and I think that's because, after
>> the movi is rejected in the vec_merge and floated into the loop preheader,
>> combine tries to push it back down and ultimately succeeds.
>> 
>> Given the above, what do you think about instead just disabling the
>> aarch64_simd_vec_set_zero pattern when not compiling for size?
>> 
> 
> Yes, that would make for a simpler patch and wouldn’t rely on the rtx cost
> users doing the right thing.
> I’ll respin accordingly.

This patch passes bootstrap and test on aarch64-none-linux-gnu.
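
The essence of the change (a sketch of its shape only; the attached patch is
authoritative, and the condition spelling here is an assumption):

  ;; config/aarch64/aarch64-simd.md: keep the single-insn INS-WZR pattern
  ;; only when optimizing for size, so that otherwise the zero insert is
  ;; expanded as a MOVI + lane-to-lane INS pair.
  (define_insn "aarch64_simd_vec_set_zero<mode>"
    [... existing RTL template, unchanged ...]
    "TARGET_SIMD && optimize_function_for_size_p (cfun)"
    [... existing "ins ... zr" output, unchanged ...]
  )
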
Thanks,
Kyrill


> Thanks,
> Kyrill
> 
>> That should do what we want in all cases.
>> 
>> Thanks,
>> Tamar
>> 
>>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
>>> split into two instructions; values lower than that retain the INS-WZR
>>> form.  cortexa76_extra_costs, from which Grace and many other Neoverse
>>> cores derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5
>>> cycles: 3 is the number of cycles above the cheapest SIMD instruction on
>>> such cores, which takes 2 cycles (2 + 3 = 5).
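>>> 
>>> For reference, the shape of the vector_cost_table change (a sketch; the
>>> pre-existing field names are from aarch-common-protos.h, and the comments
>>> paraphrase the ones the patch adds):
>>> 
>>> struct vector_cost_table
>>> {
>>>   const int alu;      /* Cost of a vector ALU operation.  */
>>>   const int mult;     /* Cost of a vector multiply.  */
>>>   const int movi;     /* Cost of materialising a constant with MOVI.  */
>>>   const int dup;      /* Cost of a DUP from a GPR.  */
>>>   const int extract;  /* Cost of a lane extract.  */
>>>   const int ins_gp;   /* New: cost of an INS of a GPR into a lane.  */
>>> };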
>>> 
>>> cortexa53_extra_costs and all the other cost tables set ins_gp to
>>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
>>> increase it for generic tuning.
>>> 
>>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
>>> always generated.
>>> 
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>> Ok for trunk?
>>> Thanks,
>>> Kyrill
>>> 
>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> 
>>> gcc/
>>> 
>>> * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>>> field.  Add comments to other vector cost fields.
>>> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
>>> case.
>>> * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>>> thunderx_extra_costs, thunderx2t99_extra_costs,
>>> thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>>> ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>>> Specify ins_gp cost.
>>> * config/arm/aarch-cost-tables.h (generic_extra_costs,
>>> cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>>> exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>>> 
>>> gcc/testsuite/
>>> 
>>> * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>>> test_setq_lane4): Relax allowed assembly.


Attachment: 0001-aarch64-Avoid-INS-W-X-ZR-instructions-when-optimisin.patch
