Hi Jennifer,

> On 18 Jul 2025, at 17:08, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>
>> On 18 Jul 2025, at 11:39, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>
>> Hi all,
>>
>> For inserting zero into a vector lane we usually use an instruction like:
>> ins v0.h[2], wzr
>>
>> This, however, has not-so-great performance on some CPUs.
>> On Grace, for example, it has a latency of 5 and a throughput of 1.
>> The alternative sequence:
>> movi v31.8b, #0
>> ins v0.h[2], v31.h[0]
>> is preferable because the MOVI-0 is often a zero-latency operation that is
>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
>> and a throughput of 4.
>> We can avoid merging the two instructions into the
>> aarch64_simd_vec_set_zero<mode> insn through rtx costs. We just need to
>> handle the right VEC_MERGE form in aarch64_rtx_costs. The new CPU-specific
>> cost field ins_gp is introduced to describe this operation.
>> According to a similar LLVM PR:
>> https://github.com/llvm/llvm-project/pull/146538
>> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to
>> benefit from this, whereas little cores like Cortex-A510 won't (INS-WZR
>> has a respectable latency of 3 on Cortex-A510).
>>
>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
>> split into two instructions; values lower than that retain the INS-WZR
>> form. cortexa76_extra_costs, from which Grace and many other Neoverse
>> cores derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5
>> cycles. 3 is the number of cycles above the cheapest SIMD instruction on
>> such cores (which takes 2 cycles).
>>
>> cortexa53_extra_costs and all other cost tables set ins_gp to
>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
>> increase it for generic tuning.
>>
>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
>> always generated.
>>
>> Bootstrapped and tested on aarch64-none-linux-gnu.
>> Ok for trunk?
>> Thanks,
>> Kyrill
>
> Hi Kyrill,
> I had a look at your patch series (including 1/2) and they look good to me.
> Two thoughts:
> - Would it make sense from a staging point of view to move the comments on
> the vector cost fields in aarch-common-protos.h to the first patch?

I don't think it matters, to be honest; it wouldn't make the patches
independent in terms of merge conflicts, and the end result would be the same.

> - In addition to the relaxation of the test case, should we add a
> core-specific test that requires the production of the new instruction
> sequence?

Yes, that is a good idea. Added one in the attached version.
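It is shaped roughly like this (an illustrative sketch only; the function
name here is made up, and the -mcpu selection and exact scan-assembler
patterns are the ones in the attached patch):

/* { dg-do compile } */
/* { dg-options "-O2 -mcpu=cortex-a76" } */

#include <arm_neon.h>

int16x8_t
foo (int16x8_t a)
{
  /* On cores whose cost tables set ins_gp high enough, this should
     become MOVI + lane-to-lane INS rather than an INS from WZR.  */
  return vsetq_lane_s16 (0, a, 2);
}

/* { dg-final { scan-assembler {\tmovi\t} } } */
/* { dg-final { scan-assembler-not {ins\tv[0-9]+\.h\[[0-9]+\], w} } } */

The idea is to compile for a core whose cost table sets ins_gp to
COSTS_N_INSNS (3) and check that the zero-lane insert is emitted as the
two-instruction sequence.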
Thanks,
Kyrill

> Thanks,
> Jennifer
>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/
>>
>> * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
>> field. Add comments to other vector cost fields.
>> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE case.
>> * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
>> thunderx_extra_costs, thunderx2t99_extra_costs,
>> thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
>> ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
>> Specify ins_gp cost.
>> * config/arm/aarch-cost-tables.h (generic_extra_costs,
>> cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
>> exynosm1_extra_costs, xgene1_extra_costs): Likewise.
>>
>> gcc/testsuite/
>>
>> * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
>> test_setq_lane4): Relax allowed assembly.
>>
>> <0002-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch>
<0001-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch>