On 7/18/25 05:39, Kyrylo Tkachov wrote:
> Hi all,
>
> For inserting zero into a vector lane we usually use an instruction like:
>         ins     v0.h[2], wzr
>
> This, however, has not-so-great performance on some CPUs.
> On Grace, for example, it has a latency of 5 and a throughput of 1.
> The alternative sequence:
>         movi    v31.8b, #0
>         ins     v0.h[2], v31.h[0]
> is preferable because the MOVI-0 is often a zero-latency operation that is
> eliminated by the CPU frontend and the lane-to-lane INS has a latency of 2 and
> a throughput of 4.
> We can avoid the merging of the two instructions into the
> aarch64_simd_vec_set_zero<mode> insn through rtx costs. We just need to handle
> the right VEC_MERGE form in aarch64_rtx_costs. The new CPU-specific cost field
> ins_gp is introduced to describe this operation.
> According to a similar LLVM PR:
> https://github.com/llvm/llvm-project/pull/146538
> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to benefit
> from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
> respectable latency of 3 on Cortex-A510).
>
> Practically, a value of COSTS_N_INSNS (2) and higher for ins_gp causes the
> split into two instructions; values lower than that retain the INS-WZR form.
> cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
> sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles. 3 is the
> number of cycles above the normal cheapest SIMD instruction on such cores
> (which takes 2 cycles).
>
> cortexa53_extra_costs and all other costs set ins_gp to COSTS_N_INSNS (1) to
> preserve the current codegen, though I'd be happy to increase it for generic
> tuning.
>
> For -Os we don't add any extra cost so the shorter INS-WZR form is still
> always generated.
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
> Ok for trunk?
> Thanks,
> Kyrill

Minor nit: one line in gcc/config/aarch64/aarch64.cc is past 80 characters.
Looks good to me otherwise (but I cannot approve).

Remi

> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
> 	* config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
> 	field. Add comments to other vector cost fields.
> 	* config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
> 	case.
> 	* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
> 	thunderx_extra_costs, thunderx2t99_extra_costs,
> 	thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
> 	ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
> 	Specify ins_gp cost.
> 	* config/arm/aarch-cost-tables.h (generic_extra_costs,
> 	cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
> 	exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>
> gcc/testsuite/
>
> 	* gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
> 	test_setq_lane4): Relax allowed assembly.
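
For readers following along, the ins_gp threshold described in the cover
letter can be modelled with a minimal sketch. This is an illustration only,
not the code in aarch64_rtx_costs: the function zero_lane_insert_is_split and
its parameters are hypothetical names. COSTS_N_INSNS is written here the way
GCC's rtl.h defines it.

```c
#include <stdbool.h>

/* Same definition as in GCC's rtl.h: one instruction is 4 cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical model of the behaviour described in the cover letter.
   Return true if the zero-lane insert is expected to stay as the
   two-instruction MOVI + lane-to-lane INS sequence, false if it is
   expected to be merged into the single INS-WZR form.

   INS_GP is the per-CPU ins_gp cost.  OPTIMIZE_FOR_SIZE models -Os,
   where no extra cost is added and INS-WZR is always kept.  */
static bool
zero_lane_insert_is_split (int ins_gp, bool optimize_for_size)
{
  if (optimize_for_size)
    return false;

  /* COSTS_N_INSNS (2) and higher causes the split; lower values
     retain INS-WZR.  */
  return ins_gp >= COSTS_N_INSNS (2);
}
```

Under this model, the cortexa76_extra_costs value of COSTS_N_INSNS (3) gives
the MOVI + INS pair, while the COSTS_N_INSNS (1) used by cortexa53_extra_costs
and the other tables, and any -Os compile, keep INS-WZR.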