[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Feng Xue changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED

--- Comment #7 from Feng Xue ---
Partial permutation (especially extract-low/high) can incur inefficient
splicing in some situations, even when the vector mode with the lowest cost
is used.

int test(unsigned array[4][4]);

int foo(unsigned short *a, unsigned long n)
{
  unsigned array[4][4];

  for (unsigned i = 0; i < 4; i++, a += n)
    {
      array[i][0] = (a[0] << 3) - (a[4] << 6);
      array[i][1] = (a[1] << 3) - (a[5] << 6);
      array[i][2] = (a[2] << 3) - (a[6] << 6);
      array[i][3] = (a[3] << 3) - (a[7] << 6);
    }

  return test(array);
}

// After vect-compare-cost fix
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x3, x2, x1, lsl 1
        lsl     x1, x1, 1
        mov     x29, sp
        add     x4, x3, x1
        add     x0, sp, 16
        ldr     q5, [x2]
        ldr     q31, [x4, x1]
        ldr     q0, [x3, x1]
        ldr     q1, [x2, x1]
        movi    v28.4s, 0                       //
        zip1    v29.2d, v0.2d, v31.2d           //
        zip1    v2.2d, v5.2d, v1.2d             //
        zip2    v31.2d, v0.2d, v31.2d           //
        zip2    v1.2d, v5.2d, v1.2d             //
        zip1    v30.8h, v29.8h, v28.8h          //
        zip1    v4.8h, v2.8h, v28.8h            // superfluous
        zip1    v27.8h, v31.8h, v28.8h          //
        zip1    v3.8h, v1.8h, v28.8h            //
        zip2    v29.8h, v29.8h, v28.8h          //
        zip2    v31.8h, v31.8h, v28.8h          //
        zip2    v2.8h, v2.8h, v28.8h            //
        zip2    v1.8h, v1.8h, v28.8h            //
        shl     v30.4s, v30.4s, 3
        shl     v29.4s, v29.4s, 3
        shl     v4.4s, v4.4s, 3
        shl     v2.4s, v2.4s, 3
        shl     v27.4s, v27.4s, 6
        shl     v31.4s, v31.4s, 6
        shl     v3.4s, v3.4s, 6
        shl     v1.4s, v1.4s, 6
        sub     v27.4s, v30.4s, v27.4s
        sub     v31.4s, v29.4s, v31.4s
        sub     v3.4s, v4.4s, v3.4s
        sub     v1.4s, v2.4s, v1.4s
        stp     q27, q31, [sp, 48]
        stp     q3, q1, [sp, 16]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

// Expect it to be optimized as:
        lsl     x3, x1, 1
        mov     x2, x0
        stp     x29, x30, [sp, -80]!
        add     x1, x2, x1, lsl 1
        add     x4, x1, x3
        mov     x29, sp
        add     x0, sp, 16
        ldr     q30, [x2, x3]
        ldr     q0, [x2]
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        sub     v30.4s, v31.4s, v30.4s
        sub     v0.4s, v29.4s, v0.4s
        str     q0, [sp, 16]
        ldr     q0, [x1, x3]
        str     q30, [sp, 32]
        ldr     q30, [x4, x3]
        ushll   v29.4s, v0.4h, 3
        ushll2  v0.4s, v0.8h, 6
        ushll   v31.4s, v30.4h, 3
        ushll2  v30.4s, v30.8h, 6
        sub     v0.4s, v29.4s, v0.4s
        sub     v30.4s, v31.4s, v30.4s
        stp     q0, q30, [sp, 48]
        bl      test
        ldp     x29, x30, [sp], 80
        ret

Given the cost arising from splicing, we still need a way to select the most
profitable vector mode per SLP node, rather than the single global vector
mode specified in vinfo.
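(A reproduction sketch: the listings above appear to be aarch64 output for
the testcase at -O3 on a trunk compiler that already contains r14-6961 from
comment #5.  The exact options are not stated in the comment, but something
like

        gcc -O3 -S test.c

with the testcase saved as test.c (a placeholder name) should show the first
sequence.)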
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #6 from Richard Sandiford ---
Fixed.  Thanks for the report.
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #5 from GCC Commits ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:7328faf89e9b4953baaff10e18262c70fbd3e578

commit r14-6961-g7328faf89e9b4953baaff10e18262c70fbd3e578
Author: Richard Sandiford
Date:   Fri Jan 5 16:25:16 2024 +

    aarch64: Extend VECT_COMPARE_COSTS to !SVE [PR113104]

    When SVE is enabled, we try vectorising with multiple different SVE and
    Advanced SIMD approaches and use the cost model to pick the best one.
    Until now, we've not done that for Advanced SIMD, since "the first mode
    that works should always be the best".

    The testcase is a counterexample.  Each iteration of the scalar loop
    vectorises naturally with 64-bit input vectors and 128-bit output
    vectors.  We do try that for SVE, and choose it as the best approach.
    But the first approach we try is instead to use:

    - a vectorisation factor of 2
    - 1 128-bit vector for the inputs
    - 2 128-bit vectors for the outputs

    But since the stride is variable, the cost of marshalling the input
    vector from two iterations outweighs the benefit of doing two
    iterations at once.

    This patch therefore generalises aarch64-sve-compare-costs to
    aarch64-vect-compare-costs and applies it to non-SVE compilations.

    gcc/
            PR target/113104
            * doc/invoke.texi (aarch64-sve-compare-costs): Replace with...
            (aarch64-vect-compare-costs): ...this.
            * config/aarch64/aarch64.opt (-param=aarch64-sve-compare-costs=):
            Replace with...
            (-param=aarch64-vect-compare-costs=): ...this new param.
            * config/aarch64/aarch64.cc (aarch64_override_options_internal):
            Don't disable it when vectorizing for Advanced SIMD only.
            (aarch64_autovectorize_vector_modes): Apply VECT_COMPARE_COSTS
            whenever aarch64_vect_compare_costs is true.

    gcc/testsuite/
            PR target/113104
            * gcc.target/aarch64/pr113104.c: New test.
            * gcc.target/aarch64/sve/cond_arith_1.c: Update for new parameter
            names.
            * gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
            * gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_2.c: Likewise.
            * gcc.target/aarch64/sve/load_const_offset_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
            * gcc.target/aarch64/sve/mask_gather_load_7.c: Likewise.
            * gcc.target/aarch64/sve/mask_load_slp_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
            * gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
            * gcc.target/aarch64/sve/pack_1.c: Likewise.
            * gcc.target/aarch64/sve/reduc_4.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_6.c: Likewise.
            * gcc.target/aarch64/sve/scatter_store_7.c: Likewise.
            * gcc.target/aarch64/sve/strided_load_3.c: Likewise.
            * gcc.target/aarch64/sve/strided_store_3.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_signed_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1.c: Likewise.
            * gcc.target/aarch64/sve/unpack_unsigned_1_run.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11.c: Likewise.
            * gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
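(A hedged usage sketch for the renamed parameter above, assuming it keeps
the 0/1 semantics of the old aarch64-sve-compare-costs parameter and now
defaults to on for Advanced SIMD as well; the file name is a placeholder:

        gcc -O3 -S test.c
        gcc -O3 --param=aarch64-vect-compare-costs=0 -S test.c

The first invocation compares the costs of the candidate vector modes; the
second restores the old behaviour of taking the first mode that works.)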
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                   |ASSIGNED
   Last reconfirmed|                              |2023-12-30
     Ever confirmed|0                             |1
                 CC|                              |rsandifo at gcc dot gnu.org
           Assignee|unassigned at gcc dot gnu.org |rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford ---
FWIW, we do get the desired code with -march=armv8-a+sve (even though the
test doesn't use SVE).  This is because of:

  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
     can compare SVE against Advanced SIMD and so that we can compare
     multiple SVE vectorization approaches against each other.  There's
     not really any point doing this for Advanced SIMD only, since the
     first mode that works should always be the best.  */
  if (TARGET_SVE && aarch64_sve_compare_costs)
    flags |= VECT_COMPARE_COSTS;

The testcase in this PR is a counterexample to the claim in the final
sentence.  I think the comment might predate significant support for
mixed-sized Advanced SIMD vectorisation.

If we enable SVE (or comment out the "if" line), the costs are 13 units per
vector iteration for 128-bit vectors and 4 units per vector iteration for
64-bit vectors (so 8 units per 128 bits on a parity basis).  The 64-bit
version is therefore seen as significantly cheaper and is chosen ahead of
the 128-bit version.

I think this PR is enough proof that we should enable VECT_COMPARE_COSTS
even without SVE.  Assigning to myself for that.
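For reference, a sketch of the generalised condition after the fix, inferred
from the r14-6961 ChangeLog in comment #5; the exact code in aarch64.cc may
differ:

  /* Sketch only: the TARGET_SVE guard is dropped, so the renamed
     parameter alone decides whether the costs of multiple vector
     modes are compared, for both SVE and Advanced SIMD.  */
  if (aarch64_vect_compare_costs)
    flags |= VECT_COMPARE_COSTS;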
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #3 from rguenther at suse dot de ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
>
> --- Comment #2 from Feng Xue ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking
> > to a single vector size; I think this is another example showing this.
> > If you use BB level vectorization by disabling loop vectorization but
> > not SLP vectorization, the code should improve?
>
> Yes, the loop is fully unrolled, and BB SLP would improve it.

I suspect that even when the loop isn't unrolled (just increase the
iteration count) the code would improve.

> I could not find the proposal, would you share a link?  Thanks

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #2 from Feng Xue ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking
> to a single vector size; I think this is another example showing this.
> If you use BB level vectorization by disabling loop vectorization but
> not SLP vectorization, the code should improve?

Yes, the loop is fully unrolled, and BB SLP would improve it.

I could not find the proposal, would you share a link?  Thanks
[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener ---
See my proposal on the mailing list to lift the restriction of sticking to
a single vector size; I think this is another example showing this.  If you
use BB level vectorization by disabling loop vectorization but not SLP
vectorization, the code should improve?
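(A concrete way to try this suggestion, as a sketch: assuming stock GCC
option behaviour, where both the loop vectorizer and SLP vectorizer are
enabled at -O3, and with test.c as a placeholder file name:

        gcc -O3 -fno-tree-loop-vectorize -S test.c

-fno-tree-loop-vectorize disables the loop vectorizer while leaving
-ftree-slp-vectorize, and hence BB-level SLP, enabled.)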