Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Thursday, February 1, 2024 4:42 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Andrew Pinski <pins...@gmail.com>; gcc-patches@gcc.gnu.org; nd
>> <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus
>> Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> <kyrylo.tkac...@arm.com>
>> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> -----Original Message-----
>> >> From: Richard Sandiford <richard.sandif...@arm.com>
>> >> Sent: Thursday, February 1, 2024 2:24 PM
>> >> To: Andrew Pinski <pins...@gmail.com>
>> >> Cc: Tamar Christina <tamar.christ...@arm.com>; gcc-patches@gcc.gnu.org; nd
>> >> <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus
>> >> Shawcroft <marcus.shawcr...@arm.com>; Kyrylo Tkachov
>> >> <kyrylo.tkac...@arm.com>
>> >> Subject: Re: [PATCH]AArch64: update vget_set_lane_1.c test output
>> >>
>> >> Andrew Pinski <pins...@gmail.com> writes:
>> >> > On Thu, Feb 1, 2024 at 1:26 AM Tamar Christina <tamar.christ...@arm.com> wrote:
>> >> >>
>> >> >> Hi All,
>> >> >>
>> >> >> In the vget_set_lane_1.c test the following entries now generate a zip1 instead of an INS:
>> >> >>
>> >> >> BUILD_TEST (float32x2_t, float32x2_t, , , f32, 1, 0)
>> >> >> BUILD_TEST (int32x2_t, int32x2_t, , , s32, 1, 0)
>> >> >> BUILD_TEST (uint32x2_t, uint32x2_t, , , u32, 1, 0)
>> >> >>
>> >> >> This is because the non-Q variant for indices 0 and 1 is just shuffling values.
>> >> >> There is no perf difference between INS SIMD-to-SIMD and ZIP, so just
>> >> >> update the test file.
>> >> > Hmm, is this true on all cores? I suspect there is a core out there
>> >> > where INS is implemented with a much lower latency than ZIP.
>> >> > If we look at config/aarch64/thunderx.md, we can see INS is 2 cycles
>> >> > while ZIP is 6 cycles (3/7 for the Q versions).
>> >> > Now, I don't have any vested interest in that core any more, but I
>> >> > just wanted to point out that this is not exactly true for all cores.
>> >>
>> >> Thanks for the pointer. In that case, perhaps we should prefer
>> >> aarch64_evpc_ins over aarch64_evpc_zip in aarch64_expand_vec_perm_const_1?
>> >> That's enough to fix this failure, but it'll probably require other
>> >> tests to be adjusted...
>> >
>> > I think given that ThunderX is a 10-year-old micro-architecture with several
>> > cases where often-used instructions have very high latencies, generic
>> > codegen should not be blocked from progressing because of it.
>> >
>> > We use ZIPs in many things, and if ThunderX codegen is really of that much
>> > importance then I think the old codegen should be gated behind
>> > -mcpu=thunderx rather than preventing generic changes.
>>
>> But you said there was no perf difference between INS and ZIP, so it
>> sounds like for all known cases, using INS rather than ZIP is either
>> neutral or better.
>>
>> There's also the possible secondary benefit that the INS patterns use
>> standard RTL operations whereas the ZIP patterns use unspecs.
>>
>> Keeping ZIP seems OK if there's a specific reason to prefer it over INS
>> for more modern cores, though.
>
> Ok, that's a fair point. Doing some due diligence, the Neoverse-E1 and
> Cortex-A65 SWOGs seem to imply that their ZIPs have better throughput
> than their INSs. However, the entries are inconsistent and I can't measure
> the difference, so I believe this to be a documentation bug.
>
> That said, switching the operands seems to show one issue, in that preferring
> INS degenerates code in cases where we are inserting the top bits of the first
> parameter into the bottom of the second parameter and returning.
>
> ZIP, being a three-operand instruction, allows us to put the result into the
> final destination register with one operation, whereas INS requires an fmov:
>
> foo_uzp1_s32:
>         ins     v0.s[1], v1.s[0]
>         fmov    d0, d0
>         ret
> foo_uzp2_s32:
>         ins     v1.s[0], v0.s[1]
>         fmov    d0, d1
>         ret
Ah, yeah, I should have thought about that.  In that case, the original
patch is OK, thanks.

Richard