[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2024-01-09 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Feng Xue  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #7 from Feng Xue  ---
Partial permutation (especially extract-low/high) would incur inefficient
splicing in some situations, even when the vector mode with the lowest cost is
used.

  int test(unsigned array[4][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[4][4];

    for (unsigned i = 0; i < 4; i++, a += n)
      {
        array[i][0] = (a[0] << 3) - (a[4] << 6);
        array[i][1] = (a[1] << 3) - (a[5] << 6);
        array[i][2] = (a[2] << 3) - (a[6] << 6);
        array[i][3] = (a[3] << 3) - (a[7] << 6);
      }

    return test(array);
  }

// After the aarch64-vect-compare-costs fix
mov x2, x0
stp x29, x30, [sp, -80]!
add x3, x2, x1, lsl 1
lsl x1, x1, 1
mov x29, sp
add x4, x3, x1
add x0, sp, 16
ldr q5, [x2]
ldr q31, [x4, x1]
ldr q0, [x3, x1]
ldr q1, [x2, x1]
movi v28.4s, 0               //
zip1 v29.2d, v0.2d, v31.2d   //
zip1 v2.2d, v5.2d, v1.2d     //
zip2 v31.2d, v0.2d, v31.2d   //
zip2 v1.2d, v5.2d, v1.2d     //
zip1 v30.8h, v29.8h, v28.8h  //
zip1 v4.8h, v2.8h, v28.8h    //  superfluous
zip1 v27.8h, v31.8h, v28.8h  //
zip1 v3.8h, v1.8h, v28.8h    //
zip2 v29.8h, v29.8h, v28.8h  //
zip2 v31.8h, v31.8h, v28.8h  //
zip2 v2.8h, v2.8h, v28.8h    //
zip2 v1.8h, v1.8h, v28.8h    //
shl v30.4s, v30.4s, 3
shl v29.4s, v29.4s, 3
shl v4.4s, v4.4s, 3
shl v2.4s, v2.4s, 3
shl v27.4s, v27.4s, 6
shl v31.4s, v31.4s, 6
shl v3.4s, v3.4s, 6
shl v1.4s, v1.4s, 6
sub v27.4s, v30.4s, v27.4s
sub v31.4s, v29.4s, v31.4s
sub v3.4s, v4.4s, v3.4s
sub v1.4s, v2.4s, v1.4s
stp q27, q31, [sp, 48]
stp q3, q1, [sp, 16]
bl  test
ldp x29, x30, [sp], 80
ret


// Expect it to be optimized as:
lsl x3, x1, 1
mov x2, x0
stp x29, x30, [sp, -80]!
add x1, x2, x1, lsl 1
add x4, x1, x3
mov x29, sp
add x0, sp, 16
ldr q30, [x2, x3]
ldr q0, [x2]
ushll   v31.4s, v30.4h, 3
ushll2  v30.4s, v30.8h, 6
ushll   v29.4s, v0.4h, 3
ushll2  v0.4s, v0.8h, 6
sub v30.4s, v31.4s, v30.4s
sub v0.4s, v29.4s, v0.4s
str q0, [sp, 16]
ldr q0, [x1, x3]
str q30, [sp, 32]
ldr q30, [x4, x3]
ushll   v29.4s, v0.4h, 3
ushll2  v0.4s, v0.8h, 6
ushll   v31.4s, v30.4h, 3
ushll2  v30.4s, v30.8h, 6
sub v0.4s, v29.4s, v0.4s
sub v30.4s, v31.4s, v30.4s
stp q0, q30, [sp, 48]
bl  test
ldp x29, x30, [sp], 80
ret

Based on the cost arising from splicing, we still need a way to select the
most profitable vector mode per SLP node, rather than using the single global
vector mode specified in the vinfo.
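
A standalone sketch of that idea (not GCC internals; all names below are
hypothetical): normalize each candidate mode's cost to a common amount of
work (cost per 128 bits of data, the "parity basis" used in comment #4) and
keep the cheapest candidate for the node.

  #include <stdio.h>

  struct mode_candidate
  {
    const char *name;  /* e.g. "V8HI" (128-bit) or "V4HI" (64-bit).  */
    unsigned bits;     /* Vector width in bits.  */
    unsigned cost;     /* Modeled cost per vector iteration.  */
  };

  /* Cost normalized to 128 bits of processed data, so that modes of
     different widths are comparable.  */
  static double
  cost_per_128_bits (const struct mode_candidate *c)
  {
    return c->cost * (128.0 / c->bits);
  }

  /* Return the candidate with the lowest normalized cost.  */
  static const struct mode_candidate *
  pick_mode (const struct mode_candidate *cands, int n)
  {
    const struct mode_candidate *best = &cands[0];
    for (int i = 1; i < n; i++)
      if (cost_per_128_bits (&cands[i]) < cost_per_128_bits (best))
        best = &cands[i];
    return best;
  }

  int
  main (void)
  {
    /* Numbers from comment #4: 13 units per vector iteration at 128 bits,
       4 units at 64 bits (so 8 units per 128 bits on a parity basis).  */
    struct mode_candidate cands[] = { { "V8HI", 128, 13 },
                                      { "V4HI", 64, 4 } };
    const struct mode_candidate *best = pick_mode (cands, 2);
    printf ("best mode: %s (%.1f units per 128 bits)\n",
            best->name, cost_per_128_bits (best));
    return 0;
  }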

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2024-01-05 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #6 from Richard Sandiford  ---
Fixed.  Thanks for the report.

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2024-01-05 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #5 from GCC Commits  ---
The trunk branch has been updated by Richard Sandiford :

https://gcc.gnu.org/g:7328faf89e9b4953baaff10e18262c70fbd3e578

commit r14-6961-g7328faf89e9b4953baaff10e18262c70fbd3e578
Author: Richard Sandiford 
Date:   Fri Jan 5 16:25:16 2024 +

aarch64: Extend VECT_COMPARE_COSTS to !SVE [PR113104]

When SVE is enabled, we try vectorising with multiple different SVE and
Advanced SIMD approaches and use the cost model to pick the best one.
Until now, we've not done that for Advanced SIMD, since "the first mode
that works should always be the best".

The testcase is a counterexample.  Each iteration of the scalar loop
vectorises naturally with 64-bit input vectors and 128-bit output
vectors.  We do try that for SVE, and choose it as the best approach.
But the first approach we try is instead to use:

- a vectorisation factor of 2
- 1 128-bit vector for the inputs
- 2 128-bit vectors for the outputs

But since the stride is variable, the cost of marshalling the input
vector from two iterations outweighs the benefit of doing two iterations
at once.

This patch therefore generalises aarch64-sve-compare-costs to
aarch64-vect-compare-costs and applies it to non-SVE compilations.

gcc/
PR target/113104
* doc/invoke.texi (aarch64-sve-compare-costs): Replace with...
(aarch64-vect-compare-costs): ...this.
* config/aarch64/aarch64.opt (-param=aarch64-sve-compare-costs=):
Replace with...
(-param=aarch64-vect-compare-costs=): ...this new param.
* config/aarch64/aarch64.cc (aarch64_override_options_internal):
Don't disable it when vectorizing for Advanced SIMD only.
(aarch64_autovectorize_vector_modes): Apply VECT_COMPARE_COSTS
whenever aarch64_vect_compare_costs is true.

gcc/testsuite/
PR target/113104
* gcc.target/aarch64/pr113104.c: New test.
* gcc.target/aarch64/sve/cond_arith_1.c: Update for new parameter
names.
* gcc.target/aarch64/sve/cond_arith_1_run.c: Likewise.
* gcc.target/aarch64/sve/cond_arith_3.c: Likewise.
* gcc.target/aarch64/sve/cond_arith_3_run.c: Likewise.
* gcc.target/aarch64/sve/gather_load_6.c: Likewise.
* gcc.target/aarch64/sve/gather_load_7.c: Likewise.
* gcc.target/aarch64/sve/load_const_offset_2.c: Likewise.
* gcc.target/aarch64/sve/load_const_offset_3.c: Likewise.
* gcc.target/aarch64/sve/mask_gather_load_6.c: Likewise.
* gcc.target/aarch64/sve/mask_gather_load_7.c: Likewise.
* gcc.target/aarch64/sve/mask_load_slp_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
* gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
* gcc.target/aarch64/sve/pack_1.c: Likewise.
* gcc.target/aarch64/sve/reduc_4.c: Likewise.
* gcc.target/aarch64/sve/scatter_store_6.c: Likewise.
* gcc.target/aarch64/sve/scatter_store_7.c: Likewise.
* gcc.target/aarch64/sve/strided_load_3.c: Likewise.
* gcc.target/aarch64/sve/strided_store_3.c: Likewise.
* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_signed_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_unsigned_1_run.c: Likewise.
* gcc.target/aarch64/sve/vcond_11.c: Likewise.
* gcc.target/aarch64/sve/vcond_11_run.c: Likewise.
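
(The testsuite updates above amount to renaming the param in each test's
dg-options; a representative post-rename line, assumed rather than quoted
from the tests, would look like:

  /* { dg-options "-O2 --param aarch64-vect-compare-costs=0" } */

with the old spelling aarch64-sve-compare-costs replaced throughout.)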

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-30 Thread rsandifo at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Sandiford  changed:

   What|Removed |Added

 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2023-12-30
 Ever confirmed|0   |1
 CC||rsandifo at gcc dot gnu.org
   Assignee|unassigned at gcc dot gnu.org  |rsandifo at gcc dot gnu.org

--- Comment #4 from Richard Sandiford  ---
FWIW, we do get the desired code with -march=armv8-a+sve (even though the test
doesn't use SVE).  This is because of:

  /* Consider enabling VECT_COMPARE_COSTS for SVE, both so that we
 can compare SVE against Advanced SIMD and so that we can compare
 multiple SVE vectorization approaches against each other.  There's
 not really any point doing this for Advanced SIMD only, since the
 first mode that works should always be the best.  */
  if (TARGET_SVE && aarch64_sve_compare_costs)
flags |= VECT_COMPARE_COSTS;

The testcase in this PR is a counterexample to the claim in the final sentence.
I think the comment might predate significant support for mixed-sized Advanced
SIMD vectorisation.

If we enable SVE (or uncomment the "if" line), the costs are 13 units per
vector iteration for 128-bit vectors and 4 units per vector iteration for
64-bit vectors (so 8 units per 128 bits on a parity basis).  The 64-bit version
is therefore seen as significantly cheaper and is chosen ahead of the 128-bit
version.

I think this PR is enough proof that we should enable VECT_COMPARE_COSTS even
without SVE.  Assigning to myself for that.
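
For reference, a minimal sketch of the enabling condition without the SVE
guard, following the rename in the eventual fix (comment #5); the exact
post-patch code is an assumption:

  if (aarch64_vect_compare_costs)
    flags |= VECT_COMPARE_COSTS;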

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #3 from rguenther at suse dot de  ---
On Thu, 21 Dec 2023, fxue at os dot amperecomputing.com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104
> 
> --- Comment #2 from Feng Xue  ---
> (In reply to Richard Biener from comment #1)
> > See my proposal on the mailing list to lift the restriction of sticking to
> > a single vector size; I think this is another example showing the need for
> > it.  If you use BB-level vectorization (disable loop vectorization but not
> > SLP vectorization), does the code improve?
> 
> Yes, the loop is fully unrolled, and BB SLP would improve the code.

I suspect the code would improve even when the loop isn't unrolled (just
increase the iteration count).

> I could not find the proposal; would you share a link?  Thanks

https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640476.html

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

--- Comment #2 from Feng Xue  ---
(In reply to Richard Biener from comment #1)
> See my proposal on the mailing list to lift the restriction of sticking to a
> single vector size; I think this is another example showing the need for it.
> If you use BB-level vectorization (disable loop vectorization but not SLP
> vectorization), does the code improve?

Yes, the loop is fully unrolled, and BB SLP would improve the code.

I could not find the proposal; would you share a link?  Thanks

[Bug tree-optimization/113104] Suboptimal loop-based slp node splicing across iterations

2023-12-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113104

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener  ---
See my proposal on the mailing list to lift the restriction of sticking to a
single vector size; I think this is another example showing the need for it.
If you use BB-level vectorization (disable loop vectorization but not SLP
vectorization), does the code improve?
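
For reference, one way to try this, using a larger trip count per comment #3
so the loop is not fully unrolled (the variant below is illustrative;
-fno-tree-loop-vectorize is the flag that disables the loop vectorizer while
leaving BB/SLP vectorization enabled):

  /* Compile with: gcc -O3 -fno-tree-loop-vectorize -S variant.c  */
  int test(unsigned array[64][4]);

  int foo(unsigned short *a, unsigned long n)
  {
    unsigned array[64][4];

    for (unsigned i = 0; i < 64; i++, a += n)
      {
        array[i][0] = (a[0] << 3) - (a[4] << 6);
        array[i][1] = (a[1] << 3) - (a[5] << 6);
        array[i][2] = (a[2] << 3) - (a[6] << 6);
        array[i][3] = (a[3] << 3) - (a[7] << 6);
      }

    return test(array);
  }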