On Mon, Jul 21, 2025 at 4:13 AM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > -----Original Message-----
> > From: Kyrylo Tkachov <ktkac...@nvidia.com>
> > Sent: Monday, July 21, 2025 11:36 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> > <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>;
> > Alex Coplan <alex.cop...@arm.com>; Remi Machet <rmac...@nvidia.com>;
> > Jennifer Schmitz <jschm...@nvidia.com>
> > Subject: Re: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> > instructions
> >
> > > On 21 Jul 2025, at 11:43, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> > >
> > > Hi Tamar,
> > >
> > >> On 21 Jul 2025, at 11:12, Tamar Christina <tamar.christ...@arm.com> wrote:
> > >>
> > >> Hi Kyrill,
> > >>
> > >>> -----Original Message-----
> > >>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> > >>> Sent: Friday, July 18, 2025 10:40 AM
> > >>> To: GCC Patches <gcc-patches@gcc.gnu.org>
> > >>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> > >>> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>;
> > >>> Alex Coplan <alex.cop...@arm.com>
> > >>> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> > >>> instructions
> > >>>
> > >>> Hi all,
> > >>>
> > >>> For inserting zero into a vector lane we usually use an instruction like:
> > >>> ins v0.h[2], wzr
> > >>>
> > >>> This, however, has not-so-great performance on some CPUs.
> > >>> On Grace, for example, it has a latency of 5 and a throughput of 1.
> > >>> The alternative sequence:
> > >>> movi v31.8b, #0
> > >>> ins v0.h[2], v31.h[0]
> > >>> is preferable because the MOVI-0 is often a zero-latency operation that is
> > >>> eliminated by the CPU frontend, and the lane-to-lane INS has a latency of 2
> > >>> and a throughput of 4.
> > >>> We can avoid the merging of the two instructions into the
> > >>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
> > >>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new CPU-specific
> > >>> cost field ins_gp is introduced to describe this operation.
> > >>> According to a similar LLVM PR:
> > >>> https://github.com/llvm/llvm-project/pull/146538
> > >>> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to
> > >>> benefit from this, whereas little cores like Cortex-A510 won't (INS-WZR
> > >>> has a respectable latency of 3 on Cortex-A510).
> > >>>
> > >>
> > >> I'm not really sure about the Cortex-A510; the entries are a bit suspect.
> > >> I think they haven't added an entry for INS from GPR, which is an alias
> > >> for MOV element, GPR.  And the throughput for the INS they did add
> > >> (element to element) seems suspect.
> > >>
> > >> If I assume the same latency as an FMOV for INS GPR Wzr, then yeah, the
> > >> MOVI is slower (at least on paper), but I'd expect the throughput to
> > >> matter more in actual code.
> > >>
> > >> So I think we should just do this for every core and by default.  I have
> > >> a couple of arguments for this:
> > >>
> > >> 1. We already do this for every constant but 0, since we prefer the
> > >> throughput advantage here.  See https://godbolt.org/z/9PKhEax8G where, if
> > >> you change your testcase to anything but 0, you get the movi, even though
> > >> strictly speaking the movi is more expensive than the mov.
> > >>
> > >> 2. I've run this sequence change over various cores, and they all showed
> > >> either an improvement or very little difference.  For example, Cortex-A510 shows:
> > >>
> > >> cortex-a510 │ little │ -0.1%
> > >> cortex-a520 │ little │ -0.09%
> > >> cortex-a710 │ mid    │ -0.03%
> > >>
> > >> etc.  This is comparing using the movi over the INS, wzr.
> > >
> > > Thanks for the broader benchmarking!
> > >
> > >> However, the advantage grows significantly when there is more than one of
> > >> the same constant to insert, e.g.
> > >>
> > >> __attribute__((noipa))
> > >> uint16x8_t foo2(uint16x8_t a) {
> > >>   a[2] = 0;
> > >>   a[5] = 0;
> > >>   return a;
> > >> }
> > >>
> > >> Now the movi pays for itself, and the INS-WZR sequence is serial on most cores.
> > >>
> > >> Benchmarking this gives, as expected for in-order cores:
> > >>
> > >> cortex-a510 │ little │ -0.1%
> > >> cortex-a520 │ little │ -0.05%
> > >>
> > >> but then for out-of-order cores:
> > >>
> > >> cortex-a710 │ mid │ ✔ -49.49%
> > >> neoverse v2 │ big │ ✔ -49.57%
> > >>
> > >> etc.
> > >>
> > >> This is why I think we should do this by default.
> > >
> > > Yeah, that's fine with me.
> > >
> > >> As a side note, the current patch only seems to work outside loops, e.g.
> > >>
> > >> __attribute__((noipa))
> > >> uint16x8_t foos(uint16x8_t a, int n) {
> > >>   for (int i = 0; i < n; i++)
> > >>     a[2] = 0;
> > >>   return a;
> > >> }
> > >>
> > >> does not seem to produce the movi, and I think that's because, after
> > >> rejecting the movi in the vec_merge and floating it into the loop
> > >> preheader, combine now tries to force it back down and ultimately succeeds.
> > >>
> > >> Given the above, what do you think about instead just disabling the
> > >> aarch64_simd_vec_set_zero pattern when not compiling for size?
> > >>
> > >
> > > Yes, that would make for a simpler patch and wouldn't rely on the rtx cost
> > > users doing the right thing.
> > > I'll respin accordingly.
> >
> > This patch passes bootstrap and test on aarch64-none-linux-gnu.
>
> Thanks!  Ok with me unless someone disagrees.
This is also OK with me.
>
> Cheers,
> Tamar
>
> > Thanks,
> > Kyrill
> >
> > > Thanks,
> > > Kyrill
> > >
> > >> That should do what we want in all cases.
> > >>
> > >> Thanks,
> > >> Tamar
> > >>
> > >>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
> > >>> split into two instructions; values lower than that retain the INS-WZR form.
> > >>> cortexa76_extra_costs, from which Grace and many other Neoverse cores
> > >>> derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of 5 cycles.
> > >>> 3 is the number of cycles above the cheapest SIMD instruction on such
> > >>> cores (which takes 2 cycles).
> > >>>
> > >>> cortexa53_extra_costs and all other cost tables set ins_gp to
> > >>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
> > >>> increase it for generic tuning.
> > >>>
> > >>> For -Os we don't add any extra cost, so the shorter INS-WZR form is still
> > >>> always generated.
> > >>>
> > >>> Bootstrapped and tested on aarch64-none-linux-gnu.
> > >>> Ok for trunk?
> > >>> Thanks,
> > >>> Kyrill
> > >>>
> > >>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> > >>>
> > >>> gcc/
> > >>>
> > >>> * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
> > >>> field.  Add comments to other vector cost fields.
> > >>> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE case.
> > >>> * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
> > >>> thunderx_extra_costs, thunderx2t99_extra_costs,
> > >>> thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
> > >>> ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
> > >>> Specify ins_gp cost.
> > >>> * config/arm/aarch-cost-tables.h (generic_extra_costs,
> > >>> cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
> > >>> exynosm1_extra_costs, xgene1_extra_costs): Likewise.
> > >>>
> > >>> gcc/testsuite/
> > >>>
> > >>> * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
> > >>> test_setq_lane4): Relax allowed assembly.
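
[Editorial note: for readers outside the thread, below is a minimal, self-contained sketch of the single-lane case that the two assembly sequences quoted above come from. The function name set_lane2_zero and the compile line are made up for illustration; the lane-write syntax is the same GNU vector subscripting used in Tamar's foo2/foos examples, and the sequences in the comments are the two alternatives discussed in the thread, not compiler output verified here.]

#include <arm_neon.h>

/* Zero one lane of a vector.  Depending on the tuning decision discussed
   in the thread, GCC may emit either the single GPR-source insert:
       ins  v0.h[2], wzr
   or the split form preferred on Grace/Neoverse-class cores:
       movi v31.8b, #0
       ins  v0.h[2], v31.h[0]  */
__attribute__((noipa))
uint16x8_t set_lane2_zero (uint16x8_t a)
{
  a[2] = 0;   /* GNU vector subscripting on a NEON vector type.  */
  return a;
}

Compiling this with something like "gcc -O2 -S" for an aarch64 target should show which of the two forms the selected tuning produces.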