Hi all,

For inserting zero into a vector lane we usually use an instruction like:
        ins     v0.h[2], wzr

This, however, has not-so-great performance on some CPUs.
On Grace, for example, it has a latency of 5 cycles and a throughput of 1.
The alternative sequence:
        movi    v31.8b, #0
        ins     v0.h[2], v31.h[0]
is preferable because the MOVI #0 is often a zero-latency operation that is
eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
and a throughput of 4.
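
For reference, a minimal C example that exercises this pattern (my own
illustration, not taken from the patch or the testsuite):

#include <arm_neon.h>

/* Zero out lane 2 of a 16-bit vector.  At -O2 this becomes either
   "ins v0.h[2], wzr" or the MOVI + lane-to-lane INS sequence,
   depending on the tuning in effect.  */
int16x8_t
zero_lane2 (int16x8_t v)
{
  return vsetq_lane_s16 (0, v, 2);
}
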
We can avoid merging the two instructions into the
aarch64_simd_vec_set_zero<mode> insn by using rtx costs.  We just need to
handle the right VEC_MERGE form in aarch64_rtx_costs.  A new CPU-specific
cost field, ins_gp, is introduced to describe this operation.
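
To illustrate the shape of that handling, here is a rough sketch only (not the
actual hunk from the patch; extra_cost, speed and mode are the usual locals
and parameters of aarch64_rtx_costs):

    case VEC_MERGE:
      /* Inserting zero into a lane shows up as a VEC_MERGE of a
         vec_duplicate of zero into an existing vector, which
         aarch64_simd_vec_set_zero<mode> implements as an INS from WZR/XZR.
         Charging the CPU-specific ins_gp cost here lets a tuning make that
         form expensive enough that the MOVI + INS sequence is kept.  */
      if (GET_CODE (XEXP (x, 0)) == VEC_DUPLICATE
          && XEXP (XEXP (x, 0), 0) == CONST0_RTX (GET_MODE_INNER (mode)))
        {
          if (speed)
            *cost += extra_cost->vect.ins_gp;
          return true;
        }
      break;
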
Judging by a similar LLVM PR (https://github.com/llvm/llvm-project/pull/146538)
and by some Arm SWOGs, I expect the Neoverse-derived cores to benefit from
this, whereas little cores like Cortex-A510 won't (INS-WZR has a respectable
latency of 3 on Cortex-A510).

Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
split into two instructions, while lower values retain the INS-WZR form.
cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles: 3 is the
number of cycles above the cheapest SIMD instructions on such cores, which
take 2 cycles.

cortexa53_extra_costs and all other cost tables set ins_gp to
COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
increase it for generic tuning.

For -Os we don't add any extra cost, so the shorter INS-WZR form is still
always generated.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for trunk?
Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

        * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
        field.  Add comments to other vector cost fields.
        * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE case.
        * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
        thunderx_extra_costs, thunderx2t99_extra_costs,
        thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
        ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
        Specify ins_gp cost.
        * config/arm/aarch-cost-tables.h (generic_extra_costs,
        cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
        exynosm1_extra_costs, xgene1_extra_costs): Likewise.

gcc/testsuite/

        * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
        test_setq_lane4): Relax allowed assembly.

Attachment: 0002-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch
