Re: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR instructions

Remi Machet Fri, 18 Jul 2025 06:18:36 -0700

On 7/18/25 05:39, Kyrylo Tkachov wrote:
> Hi all,
>
> For inserting zero into a vector lane we usually use an instruction like:
>          ins     v0.h[2], wzr
>
> This, however, performs poorly on some CPUs.
> On Grace, for example, it has a latency of 5 and a throughput of 1.
> The alternative sequence:
>          movi    v31.8b, #0
>          ins     v0.h[2], v31.h[0]
> is preferable because the MOVI-0 is often a zero-latency operation that is
> eliminated by the CPU frontend and the lane-to-lane INS has a latency of 2 and
> throughput of 4.
> We can avoid the merging of the two instructions into the
> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
> cost field ins_gp is introduced to describe this operation.
> According to a similar LLVM PR:
> https://github.com/llvm/llvm-project/pull/146538
> and looking at some Arm SWOGs I expect the Neoverse-derived cores to benefit
> from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
> respectable latency of 3 on Cortex-A510).
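
(Illustration only, not part of the patch.)  A minimal C sketch of why the two
sequences are interchangeable and why the split form can win on Grace.  The
type v4h and all function names here are hypothetical; the latency figures are
the ones quoted above, and the MOVI latency of 0 assumes the frontend does
eliminate it, which the mail only says is often the case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a 64-bit vector viewed as four 16-bit lanes.  */
typedef struct { uint16_t h[4]; } v4h;

/* Models "ins v0.h[lane], wzr": zero comes from the general zero register.  */
static v4h ins_from_wzr (v4h v, int lane)
{
  v.h[lane] = 0;
  return v;
}

/* Models "movi v31.8b, #0" followed by "ins v0.h[lane], v31.h[0]".  */
static v4h ins_from_zero_vector (v4h v, int lane)
{
  v4h zero;
  memset (&zero, 0, sizeof zero);   /* movi v31.8b, #0 */
  v.h[lane] = zero.h[0];            /* ins  v0.h[lane], v31.h[0] */
  return v;
}

/* Returns nonzero when both sequences leave the vector in the same state.  */
static int same_result (v4h v, int lane)
{
  v4h a = ins_from_wzr (v, lane);
  v4h b = ins_from_zero_vector (v, lane);
  return memcmp (&a, &b, sizeof a) == 0;
}

/* Latencies quoted in the mail for Grace.  MOVI_ZERO_LATENCY is an
   assumption: it is 0 only when the frontend eliminates the MOVI.  */
enum { INS_WZR_LATENCY = 5, MOVI_ZERO_LATENCY = 0, INS_LANE_LATENCY = 2 };

/* Critical-path latency of the two-instruction form under that assumption.  */
static int split_latency (void)
{
  return MOVI_ZERO_LATENCY + INS_LANE_LATENCY;
}
```

Under these assumptions split_latency () is 2 cycles against 5 for the
single INS-WZR, at the price of one extra instruction and a scratch register.
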
>
> Practically, a value of COSTS_N_INSNS (2) and higher for ins_gp causes the
> split into two instructions; values lower than that retain the INS-WZR form.
> cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
> sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.  3 is the
> number of cycles above the cheapest SIMD instruction on such cores (which
> takes 2 cycles).
>
> cortexa53_extra_costs and all other costs set ins_gp to COSTS_N_INSNS (1) to
> preserve the current codegen, though I'd be happy to increase it for generic
> tuning.
>
> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
> always generated.
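
(Illustration only, not the code in the patch.)  The cost rules described
above can be sketched in a few lines of C.  COSTS_N_INSNS is defined the same
way as in GCC's rtl.h; ins_gp_from_latency and prefers_split are hypothetical
helpers that only restate the thresholds given in the mail and do not mirror
the actual aarch64_rtx_costs logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Same definition as in GCC's rtl.h: one instruction is 4 cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

static_assert (COSTS_N_INSNS (1) == 4, "one insn is four cost units");

/* Hypothetical helper: express ins_gp as the number of cycles above the
   cheapest SIMD instruction on the core, as described for cortexa76.  */
static int ins_gp_from_latency (int ins_latency, int cheapest_simd_latency)
{
  return COSTS_N_INSNS (ins_latency - cheapest_simd_latency);
}

/* Hypothetical helper: true when the MOVI + lane-to-lane INS pair is
   expected instead of INS-WZR.  When optimizing for size no extra cost is
   added, so the single-instruction form is kept.  */
static bool prefers_split (int ins_gp, bool optimize_for_size)
{
  if (optimize_for_size)
    return false;
  return ins_gp >= COSTS_N_INSNS (2);
}
```

With the numbers from the mail, ins_gp_from_latency (5, 2) gives
COSTS_N_INSNS (3), which is above the threshold, while the COSTS_N_INSNS (1)
used by cortexa53_extra_costs and the other tables stays below it.
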
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill


Minor nit: one line in gcc/config/aarch64/aarch64.cc is past 80 characters.

Looks good to me otherwise (but I cannot approve).

Remi

>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
>          * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>          field.  Add comments to other vector cost fields.
>          * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
>          case.
>          * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>          thunderx_extra_costs, thunderx2t99_extra_costs,
>          thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>          ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>          Specify ins_gp cost.
>          * config/arm/aarch-cost-tables.h (generic_extra_costs,
>          cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>          exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>
> gcc/testsuite/
>
>          * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>          test_setq_lane4): Relax allowed assembly.
>
