Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Thursday, February 1, 2024 4:42 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Andrew Pinski <pins...@gmail.com>; gcc-patches@gcc.gnu.org; nd
>> <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus
>> Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> 
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> Sent: Thursday, February 1, 2024 2:24 PM
>> >> To: Andrew Pinski <pins...@gmail.com>
>> >> Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; nd
>> >> <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus
>> >> Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> >> <kyrylo.tkac...@arm.com>
>> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> >>
>> >> Andrew Pinski <pins...@gmail.com> writes:
>> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christ...@arm.com>
>> >> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> In the vget_set_lane_1.c test the following entries now generate a
>> >> >> zip1 instead of an INS:
>> >> >>
>> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> >> BUILD_TEST (int32x2_t,   int32x2_t,   , , s32, 1, 0)
>> >> >> BUILD_TEST (uint32x2_t,  uint32x2_t,  , , u32, 1, 0)
>> >> >>
>> >> >> This is because the non-Q variants for indices 0 and 1 are just
>> >> >> shuffling values.  There is no perf difference between SIMD-to-SIMD
>> >> >> INS and ZIP, so just update the test file.
>> >> > Hmm, is this true on all cores? I suspect there is a core out there
>> >> > where INS is implemented with a much lower latency than ZIP.
>> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> >> > while ZIP is 6 cycles (3/7 for q versions).
>> >> > Now I don't have any invested interest in that core any more but I
>> >> > just wanted to point out that is not exactly true for all cores.
>> >>
>> >> Thanks for the pointer.  In that case, perhaps we should prefer
>> >> aarch64_evpc_ins over aarch64_evpc_zip in
>> >> aarch64_expand_vec_perm_const_1?  That's enough to fix this failure,
>> >> but it'll probably require other tests to be adjusted...
>> >
>> > I think, given that ThunderX is a ten-year-old micro-architecture with
>> > several cases where commonly used instructions have very high latencies,
>> > generic codegen should not be blocked from progressing because of it.
>> >
>> > We use ZIPs in many places, and if ThunderX codegen really is that
>> > important then I think the old codegen should be gated behind
>> > -mcpu=thunderx rather than preventing generic changes.
>> 
>> But you said there was no perf difference between INS and ZIP, so it
>> sounds like for all known cases, using INS rather than ZIP is either
>> neutral or better.
>> 
>> There's also the possible secondary benefit that the INS patterns use
>> standard RTL operations whereas the ZIP patterns use unspecs.
>> 
>> Keeping ZIP seems OK if there's a specific reason to prefer it over INS
>> for more modern cores though.
>
> Ok, that's a fair point.  Doing some due diligence, the Neoverse-E1 and
> Cortex-A65 SWoGs seem to imply that their ZIPs have better throughput
> than their INSs.  However, the entries are inconsistent and I can't
> measure the difference, so I believe this to be a documentation bug.
>
> That said, switching the operands seems to show one issue: preferring
> INS degrades the code in cases where we are inserting the top bits of
> the first parameter into the bottom of the second parameter and
> returning.
>
> ZIP, being a three-operand instruction, allows us to put the result into
> the final destination register with one operation, whereas INS requires
> an extra fmov:
>
> foo_uzp1_s32:
>         ins     v0.s[1], v1.s[0]
>         fmov    d0, d0
>         ret
> foo_uzp2_s32:
>         ins     v1.s[0], v0.s[1]
>         fmov    d0, d1
>         ret

Ah, yeah, I should have thought about that.

In that case, the original patch is OK, thanks.

Richard
