On Mon, Jul 21, 2025 at 4:13 AM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> > -----Original Message-----
> > From: Kyrylo Tkachov <ktkac...@nvidia.com>
> > Sent: Monday, July 21, 2025 11:36 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> > <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex Coplan
> > <alex.cop...@arm.com>; Remi Machet <rmac...@nvidia.com>; Jennifer Schmitz
> > <jschm...@nvidia.com>
> > Subject: Re: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> > instructions
> >
> >
> >
> > > On 21 Jul 2025, at 11:43, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> > >
> > > Hi Tamar,
> > >
> > >> On 21 Jul 2025, at 11:12, Tamar Christina <tamar.christ...@arm.com> 
> > >> wrote:
> > >>
> > >> Hi Kyrill,
> > >>
> > >>> -----Original Message-----
> > >>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> > >>> Sent: Friday, July 18, 2025 10:40 AM
> > >>> To: GCC Patches <gcc-patches@gcc.gnu.org>
> > >>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> > >>> <richard.sandif...@arm.com>; Andrew Pinski <pins...@gmail.com>; Alex
> > Coplan
> > >>> <alex.cop...@arm.com>
> > >>> Subject: [PATCH 2/2] aarch64: Allow CPU tuning to avoid INS-(W|X)ZR
> > >>> instructions
> > >>>
> > >>> Hi all,
> > >>>
> > >>> For inserting zero into a vector lane we usually use an instruction 
> > >>> like:
> > >>> ins     v0.h[2], wzr
> > >>>
> > >>> This, however, has not-so-great performance on some CPUs.
> > >>> On Grace, for example, it has a latency of 5 and a throughput of 1.
> > >>> The alternative sequence:
> > >>> movi    v31.8b, #0
> > >>> ins     v0.h[2], v31.h[0]
> > >>> is preferable because the MOVI-0 is often a zero-latency operation that
> > >>> is eliminated by the CPU frontend, and the lane-to-lane INS has a latency
> > >>> of 2 and a throughput of 4.
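> > >>>
> > >>> For illustration (a testcase sketched here, not one taken from the
> > >>> patch itself), the kind of source that exercises this choice is:
> > >>>
> > >>> #include <arm_neon.h>
> > >>>
> > >>> uint16x8_t
> > >>> set_lane2_zero (uint16x8_t a)
> > >>> {
> > >>>   /* Insert the scalar 0 into lane 2; this is where GCC chooses between
> > >>>      the INS-WZR form and the MOVI + lane-to-lane INS sequence.  */
> > >>>   return vsetq_lane_u16 (0, a, 2);
> > >>> }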
> > >>> We can avoid the merging of the two instructions into the
> > >>> aarch64_simd_vec_set_zero<mode> insn through rtx costs.  We just need to
> > >>> handle the right VEC_MERGE form in aarch64_rtx_costs.  The new
> > >>> CPU-specific cost field ins_gp is introduced to describe this operation.
> > >>> According to a similar LLVM PR (https://github.com/llvm/llvm-project/pull/146538)
> > >>> and looking at some Arm SWOGs, I expect the Neoverse-derived cores to
> > >>> benefit from this, whereas little cores like the Cortex-A510 won't
> > >>> (INS-WZR has a respectable latency of 3 on Cortex-A510).
> > >>>
> > >>
> > >> I'm not really sure about the Cortex-A510; the entries are a bit
> > >> suspect.  I think they haven't added an entry for INS from GPR, which is
> > >> an alias for MOV element, GPR.  And the throughput for the INS they did
> > >> add (element to element) seems suspect.
> > >>
> > >> If I assume the same latency as an FMOV for INS GPR, WZR, then yeah the
> > >> MOVI is slower (at least on paper), but I'd expect the throughput to
> > >> matter more in actual code.
> > >>
> > >> So I think we should just do this for every core and by default.  I have
> > >> a couple of arguments for this:
> > >>
> > >> 1. We already do this for every constant but 0, since we prefer the
> > >>    throughput advantage here.  See https://godbolt.org/z/9PKhEax8G where,
> > >>    if you change your testcase to anything but 0, you get the movi, even
> > >>    though strictly speaking the movi is more expensive than the mov.
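> > >>
> > >>    For instance (sketching the output from memory, so the exact asm is
> > >>    approximate), a variant of the testcase such as:
> > >>
> > >>    __attribute__((noipa))
> > >>    uint16x8_t foo1(uint16x8_t a) {
> > >>      a[2] = 1;
> > >>      return a;
> > >>    }
> > >>
> > >>    already gets something along the lines of
> > >>      movi    v31.8h, 0x1
> > >>      ins     v0.h[2], v31.h[0]
> > >>    rather than an INS from a GP register.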
> > >>
> > >> 2. I've run this sequence change over various cores, and they all showed
> > >>    either an improvement or very little difference, e.g. Cortex-A510 shows:
> > >>
> > >> cortex-a510   │ little │         -0.1%
> > >> cortex-a520   │ little │        -0.09%
> > >> cortex-a710   │  mid   │        -0.03%
> > >>
> > >> etc.  This is comparing the movi sequence against the INS-WZR form.
> > >
> > > Thanks for the broader benchmarking!
> > >
> > >
> > >>
> > >> However, the advantage grows significantly when there is more than one
> > >> of the same constant to insert, e.g.
> > >> e.g.
> > >>
> > >> __attribute__((noipa))
> > >> uint16x8_t foo2(uint16x8_t a) {
> > >>   a[2] = 0;
> > >>   a[5] = 0;
> > >>   return a;
> > >> }
> > >>
> > >> Now the movi pays for itself and the INS wzr is serial over most cores.
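> > >>
> > >> Roughly (sequences sketched by hand, exact register choices may differ),
> > >> the two candidate expansions of foo2 are:
> > >>
> > >>   ins     v0.h[2], wzr        // each INS-WZR has the long latency
> > >>   ins     v0.h[5], wzr        // and they are dependent through v0
> > >>
> > >> versus
> > >>
> > >>   movi    v31.8b, #0          // single zero, often eliminated early
> > >>   ins     v0.h[2], v31.h[0]   // lane-to-lane INS is much cheaper
> > >>   ins     v0.h[5], v31.h[0]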
> > >>
> > >> Benchmarking this gives you, as expected for in-order cores:
> > >>
> > >> cortex-a510   │ little │         -0.1%
> > >> cortex-a520   │ little │        -0.05%
> > >> but then for out-of-order cores:
> > >>
> > >> cortex-a710   │  mid   │      ✔  -49.49%
> > >> neoverse v2   │  big   │      ✔  -49.57%
> > >>
> > >> etc.
> > >>
> > >> This shows, I think, that we should do this by default.
> > >
> > > Yeah that’s fine with me.
> > >
> > >
> > >>
> > >> As a side note, the current patch only seems to work outside of loops, e.g.
> > >>
> > >> __attribute__((noipa))
> > >> uint16x8_t foos(uint16x8_t a, int n) {
> > >>   for (int i = 0; i < n; i++)
> > >>     a[2] = 0;
> > >>   return a;
> > >> }
> > >>
> > >> Does not seem to generate the movi, and I think that's because, after
> > >> rejecting the movi in the vec_merge and floating it into the loop
> > >> preheader, combine then tries to force it back down and ultimately
> > >> succeeds.
> > >>
> > >> Given the above, what do you think about instead just disabling the
> > >> aarch64_simd_vec_set_zero pattern when not compiling for size?
> > >>
> > >
> > > Yes, that would make for a simpler patch and wouldn’t rely on the rtx
> > > cost users doing the right thing.
> > > I’ll respin accordingly.
> >
> > This patch passes bootstrap and test on aarch64-none-linux-gnu.
>
> Thanks! Ok with me unless someone disagrees.

This is also OK with me.

>
> Cheers,
> Tamar
>
> > Thanks,
> > Kyrill
> >
> >
> > > Thanks,
> > > Kyrill
> > >
> > >> That should do what we want in all cases.
> > >>
> > >> Thanks,
> > >> Tamar
> > >>
> > >>> Practically, a value of COSTS_N_INSNS (2) or higher for ins_gp causes the
> > >>> split into two instructions; values lower than that retain the INS-WZR
> > >>> form.  cortexa76_extra_costs, from which Grace and many other Neoverse
> > >>> cores derive, sets ins_gp to COSTS_N_INSNS (3) to reflect a latency of
> > >>> 5 cycles: 3 is the number of cycles above the cheapest normal SIMD
> > >>> instruction on such cores, which takes 2 cycles.
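> > >>>
> > >>> As a rough sketch of the shape of the change (not copied verbatim from
> > >>> the patch), the new field and the Grace/Neoverse value look roughly like:
> > >>>
> > >>>   /* In struct vector_cost_table (config/arm/aarch-common-protos.h).  */
> > >>>   const int ins_gp;           /* Cost of INS of a GP register into a
> > >>>                                  vector lane, e.g. INS Vd.H[x], Wn.  */
> > >>>
> > >>>   /* In cortexa76_extra_costs.  */
> > >>>   COSTS_N_INSNS (3),          /* ins_gp: latency 5, i.e. 3 cycles above
> > >>>                                  the cheapest 2-cycle SIMD op.  */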
> > >>>
> > >>> cortexa53_extra_costs and all the other cost tables set ins_gp to
> > >>> COSTS_N_INSNS (1) to preserve the current codegen, though I'd be happy to
> > >>> increase it for generic tuning.
> > >>>
> > >>> For -Os we don't add any extra cost, so the shorter INS-WZR form is
> > >>> still always generated.
> > >>>
> > >>> Bootstrapped and tested on aarch64-none-linux-gnu.
> > >>> Ok for trunk?
> > >>> Thanks,
> > >>> Kyrill
> > >>>
> > >>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> > >>>
> > >>> gcc/
> > >>>
> > >>> * config/arm/aarch-common-protos.h (vector_cost_table): Add ins_gp
> > >>> field.  Add comments to other vector cost fields.
> > >>> * config/aarch64/aarch64.cc (aarch64_rtx_costs): Handle VEC_MERGE
> > >>> case.
> > >>> * config/aarch64/aarch64-cost-tables.h (qdf24xx_extra_costs,
> > >>> thunderx_extra_costs, thunderx2t99_extra_costs,
> > >>> thunderx3t110_extra_costs, tsv110_extra_costs, a64fx_extra_costs,
> > >>> ampere1_extra_costs, ampere1a_extra_costs, ampere1b_extra_costs):
> > >>> Specify ins_gp cost.
> > >>> * config/arm/aarch-cost-tables.h (generic_extra_costs,
> > >>> cortexa53_extra_costs, cortexa57_extra_costs, cortexa76_extra_costs,
> > >>> exynosm1_extra_costs, xgene1_extra_costs): Likewise.
> > >>>
> > >>> gcc/testsuite/
> > >>>
> > >>> * gcc.target/aarch64/simd/mf8_data_1.c (test_set_lane4,
> > >>> test_setq_lane4): Relax allowed assembly.
> >
>
