Hi all,

For inserting zero into a vector lane we usually use an instruction like:

	ins	v0.h[2], wzr
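
To make the case concrete, here is a minimal C sketch of source that I would expect to hit this pattern. It uses the GNU vector extensions rather than arm_neon.h so that it compiles with GCC or Clang on any target. The names v4hi and zero_lane2 are made up for illustration and do not come from the patch or the testsuite.

```c
#include <stdint.h>

/* Four 16-bit lanes in a 64-bit vector (GNU vector extension).
   The typedef name is an assumption made for this example.  */
typedef int16_t v4hi __attribute__ ((vector_size (8)));

/* Hypothetical example function: write zero into lane 2 and
   leave the other three lanes untouched.  */
v4hi
zero_lane2 (v4hi v)
{
  v[2] = 0;
  return v;
}
```

Compiled at -O2 for AArch64, the lane store would normally be expected to become the INS-from-WZR instruction shown above.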

This, however, has not-so-great performance on some CPUs.
On Grace, for example, it has a latency of 5 and a throughput of 1.
The alternative sequence:

	movi	v31.8b, #0
	ins	v0.h[2], v31.h[0]

is preferable because the MOVI-0 is often a zero-latency operation that is
eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
and a throughput of 4.

We can avoid the merging of the two instructions into the
aarch64_simd_vec_set_zero<mode> insn through rtx costs. We just need to
handle the right VEC_MERGE form in aarch64_rtx_costs. The new CPU-specific
cost field ins_gp is introduced to describe this operation.

According to a similar LLVM PR:
https://github.com/llvm/llvm-project/pull/146538
and looking at some Arm SWOGs, I expect the Neoverse-derived cores to benefit
from this, whereas little cores like Cortex-A510 won't (INS-WZR has a
respectable latency of 3 in Cortex-A510).

Practically, a value of COSTS_N_INSNS (2) and higher for ins_gp causes the
split into two instructions; values lower than that retain the INS-WZR form.

cortexa76_extra_costs, from which Grace and many other Neoverse cores derive,
sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles. 3 is the
number of cycles above the cheapest SIMD instruction on such cores (which
takes 2 cycles).

cortexa53_extra_costs and all other cost tables set ins_gp to
COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
increase it for generic tuning.

For -Os we don't add any extra cost, so the shorter INS-WZR form is always
generated.

Bootstrapped and tested on aarch64-none-linux-gnu.
Ok for trunk?
Thanks,
Kyrill

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

	* config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
	field.  Add comments to other vector cost fields.
	* config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
	case.
	* config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
	thunderx_extra_costs, thunderx2t99_extra_costs,
	thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
	ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
	Specify ins_gp cost.
	* config/arm/aarch-cost-tables.h (generic_extra_costs,
	cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
	exynosm1_extra_costs, xgene1_extra_costs): Likewise.

gcc/testsuite/

	* gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
	test_setq_lane4): Relax allowed assembly.
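
As a rough illustration of the ins_gp threshold described in the message body, the following toy model in C may help. It is only a sketch and not the code in the patch: both helper functions are hypothetical names invented here. The COSTS_N_INSNS definition matches the one in GCC's rtl.h, and the 5-cycle and 2-cycle figures are the ones quoted above for cortexa76_extra_costs.

```c
#include <stdbool.h>

/* Same definition as in GCC's rtl.h: one insn is 4 cost units.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical helper: express an extra cost as the number of cycles
   above the cheapest SIMD instruction on the core.  */
int
ins_gp_from_latency (int latency, int cheapest_simd_latency)
{
  return COSTS_N_INSNS (latency - cheapest_simd_latency);
}

/* Hypothetical helper: return true if the MOVI + lane-to-lane INS pair
   would be chosen over the single INS-WZR form.  This only models the
   observable threshold, not how the rtx cost comparison is implemented.  */
bool
split_zero_insert_p (int ins_gp, bool optimize_for_size)
{
  /* For -Os no extra cost is added, so INS-WZR is kept.  */
  if (optimize_for_size)
    return false;
  return ins_gp >= COSTS_N_INSNS (2);
}
```

With the Grace numbers, ins_gp_from_latency (5, 2) gives COSTS_N_INSNS (3), which is above the threshold and so selects the two-instruction sequence. COSTS_N_INSNS (1), as used by the other cost tables, stays below it and keeps INS-WZR.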
0002-aarch64-Allow-CPU-tuning-to-avoid-INS-W-X-ZR-instruc.patch