Hi Jennifer,

> On 18 Jul 2025, at 17:08, Jennifer Schmitz <jschm...@nvidia.com> wrote:
> 
>> On 18 Jul 2025, at 11:39, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> Hi all,
>> 
>> For inserting zero into a vector lane we usually use an instruction like:
>>       ins     v0.h[2], wzr
>> 
>> This, however, has not-so-great performance on some CPUs.
>> On Grace, for example, it has a latency of 5 cycles and a throughput of 1.
>> The alternative sequence:
>>       movi    v31.8b, #0
>>       ins     v0.h[2], v31.h[0]
>> is preferable because the MOVI-0 is often a zero-latency operation that is
>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
>> and a throughput of 4.
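>> 
>> For reference, a minimal C example that exercises this pattern (using an
>> ACLE intrinsic; a sketch, the function name is made up):
>> 
>>       #include <arm_neon.h>
>> 
>>       int16x8_t
>>       zero_lane2 (int16x8_t v)
>>       {
>>         /* Set lane 2 of v to zero; today this emits "ins v0.h[2], wzr".  */
>>         return vsetq_lane_s16 (0, v, 2);
>>       }
>> 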
>> We can avoid the merging of the two instructions into the
>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
>> cost field ins_gp is introduced to describe this operation.
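>> 
>> Roughly, the form to cost is the lane-set-to-zero VEC_MERGE (a sketch of
>> the RTL shape from memory; the exact operand order follows the insn
>> pattern in aarch64-simd.md):
>> 
>>       (set (reg:V8HI v0)
>>            (vec_merge:V8HI
>>              (const_vector:V8HI [0 0 0 0 0 0 0 0])
>>              (reg:V8HI v0)
>>              (const_int 4)))  ;; mask bit 2 set: lane 2 comes from the zeros
>> 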
>> According to a similar LLVM PR
>> (https://github.com/llvm/llvm-project/pull/146538) and some Arm SWOGs,
>> I expect the Neoverse-derived cores to benefit from this, whereas little
>> cores like Cortex-A510 won't (INS-WZR has a respectable latency of 3 on
>> Cortex-A510).
>> 
>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
>> split into two instructions; values lower than that retain the INS-WZR form.
>> cortexa76_extra_costs, from which Grace and many other Neoverse cores
>> derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles:
>> 3 is the number of cycles above the cheapest SIMD instruction on such
>> cores, which takes 2 cycles.
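>> 
>> For illustration, the vector entry in cortexa76_extra_costs then looks
>> roughly like this (a sketch; field names per the vector_cost_table in
>> aarch-common-protos.h, the other values are indicative only):
>> 
>>       /* Vector */
>>       {
>>         COSTS_N_INSNS (1),  /* alu.  */
>>         COSTS_N_INSNS (4),  /* mult.  */
>>         COSTS_N_INSNS (1),  /* movi.  */
>>         COSTS_N_INSNS (2),  /* dup.  */
>>         COSTS_N_INSNS (2),  /* extract.  */
>>         COSTS_N_INSNS (3)   /* ins_gp: 3 cycles above the 2-cycle base.  */
>>       }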
>> 
>> cortexa53_extra_costs and all the other cost tables set ins_gp to
>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
>> increase it for generic tuning.
>> 
>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
>> always generated.
>> 
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>> Thanks,
>> Kyrill
> Hi Kyrill,
> I had a look at your patch series (including 1/2) and they look good to me.
> Two thoughts:
> - Would it make sense from a staging point of view to move the comments on 
> the vector cost fields in aarch-common-protos.h to the first patch?

I don’t think it matters, to be honest; it wouldn’t make the patches independent
in terms of merge conflicts, and the end result would be the same.

> - In addition to the relaxation of the test case, should we add a 
> core-specific test that requires the production of the new instruction 
> sequence?

Yes, that is a good idea.
Added one in the attached version.
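Roughly, it looks like this (a sketch from memory; the exact file name,
dg directives, and scan patterns are in the attached patch):

      /* { dg-do compile } */
      /* { dg-options "-O2 -mcpu=neoverse-v2" } */

      #include <arm_neon.h>

      uint16x8_t
      foo (uint16x8_t a)
      {
        return vsetq_lane_u16 (0, a, 3);
      }

      /* { dg-final { scan-assembler {\tmovi\t} } } */
      /* { dg-final { scan-assembler-not {\tins\t[^\n]*wzr} } } */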
Thanks,
Kyrill

> Thanks,
> Jennifer
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/
>> 
>>       * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>>       field.  Add comments to other vector cost fields.
>>       * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE case.
>>       * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>>       thunderx_extra_costs, thunderx2t99_extra_costs,
>>       thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>>       ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>>       Specify ins_gp cost.
>>       * config/arm/aarch-cost-tables.h (generic_extra_costs,
>>       cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>>       exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>> 
>> gcc/testsuite/
>> 
>>       * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>>       test_setq_lane4): Relax allowed assembly.
>> 
>> <0002-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch>


Attachment: 0001-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch