[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-11 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #7 from Feng Xue  ---
(In reply to Richard Biener from comment #6)
> (In reply to Andrew Pinski from comment #5)
> > One more thing:
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
> >  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
> >  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> > ```
> > vs
> > ```
> >  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
> >  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
> >  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
> >  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> > ```
> > 
> > the first has fully independent operations, while in the second case the
> > later shifts all depend on the first one (though they are independent of
> > each other).
> > 
> > On cores which have many vector units, the first one might be faster than
> > the second one.  So this needs a cost model too.
> 
> Note the vectorizer keeps the shift values dependent as well (across
> iterations); we just constant-propagate after unrolling here.
> 
> Note this is basically asking for "strength-reduction" of expensive
> constants, which could be more generally useful and not only for this
> specific shift case.  Consider the same example but with an add instead
> of a shift: the exact same set of constants will appear.

It is. But so far I have only found that vector shift gets this special
treatment of constant operands based on their numerical pattern; I'm not sure
any other operator would.

BTW, here is a scalar version of strength reduction for shift:

  int a = value >> n;
  int b = value >> (n + 6);

  ==>

  int a = value >> n;
  int b = a >> 6;  // (n + 6) is not needed

But this is not covered by the current scalar strength-reduction pass.
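
For illustration, a minimal self-contained sketch of that scalar rewrite (the
function and variable names are hypothetical; the rewrite assumes n + 6 stays
below the bit width of the type, since (value >> n) >> 6 only equals
value >> (n + 6) under that condition):

```
unsigned int
shift_pair (unsigned int value, unsigned int n)
{
  unsigned int a = value >> n;
  unsigned int b = a >> 6;   /* equals value >> (n + 6); computing n + 6 is not needed */
  return a + b;
}
```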

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #6 from Richard Biener  ---
(In reply to Andrew Pinski from comment #5)
> One more thing:
> ```
>  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
>  vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
>  vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
>  vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
> ```
> vs
> ```
>  vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
>  vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
>  vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
>  vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
> ```
> 
> the first has fully independent operations, while in the second case the
> later shifts all depend on the first one (though they are independent of
> each other).
> 
> On cores which have many vector units, the first one might be faster than
> the second one.  So this needs a cost model too.

Note the vectorizer keeps the shift values dependent as well (across
iterations); we just constant-propagate after unrolling here.

Note this is basically asking for "strength-reduction" of expensive
constants, which could be more generally useful and not only for this
specific shift case.  Consider the same example but with an add instead
of a shift: the exact same set of constants will appear.
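
For illustration (a sketch using made-up vect_add_* names, mirroring the shift
example above), the unrolled add form would be:

```
 vect_add_0 = vect_value + { 0, 1, 2, 3 };
 vect_add_1 = vect_value + { 4, 5, 6, 7 };
 vect_add_2 = vect_value + { 8, 9, 10, 11 };
 vect_add_3 = vect_value + { 12, 13, 14, 15 };
```

and the same strength reduction would rewrite it as:

```
 vect_add_0 = vect_value + { 0, 1, 2, 3 };
 vect_add_1 = vect_add_0 + { 4, 4, 4, 4 };
 vect_add_2 = vect_add_0 + { 8, 8, 8, 8 };
 vect_add_3 = vect_add_0 + { 12, 12, 12, 12 };
```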

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-11 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #5 from Andrew Pinski  ---
One more thing:
```
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
 vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
 vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
```
vs
```
 vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
 vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
 vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
 vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
```

the first has fully independent operations, while in the second case the later
shifts all depend on the first one (though they are independent of each
other).

On cores which have many vector units, the first one might be faster than the
second one.  So this needs a cost model too.

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #4 from Andrew Pinski  ---
(In reply to Feng Xue from comment #3)
> (In reply to Andrew Pinski from comment #1)
> > Note on aarch64 with SVE, you should be able to generate those constants
> > without a load, using the index instruction.
> OK, thanks for the note. This still requires an extra instruction, while the
> constant delta could be encoded into the shift instruction as an immediate
> operand.
> 
> > Basically this requires an "un-shift" pass and most likely should be done
> > at the RTL level, though that might be too late.
> > Maybe isel?
> I'm thinking of adding this processing to pass_lower_vector_ssa, which also
> contains other peephole vector SSA optimizations, not just lowering.

It should be in isel, like the other vector instruction selection that goes on
there.

pass_lower_vector_ssa is only for lowering generic vectors.
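
For a concrete idea of the matching involved, here is a rough, hypothetical
sketch (not actual GCC code; uniform_shift_delta_p is an invented helper, and
a real isel patch would also need the cost considerations discussed above):

```
/* Given the VECTOR_CSTs used as shift counts in
     lhs1 = x >> vcst1;
     lhs2 = x >> vcst2;
   return true if vcst2 - vcst1 is a splat, so lhs2 could be rewritten
   as lhs1 >> delta.  Assumes fixed-length vectors.  */

static bool
uniform_shift_delta_p (tree vcst1, tree vcst2, wide_int *delta)
{
  if (TREE_CODE (vcst1) != VECTOR_CST || TREE_CODE (vcst2) != VECTOR_CST)
    return false;
  unsigned int n = VECTOR_CST_NELTS (vcst1).to_constant ();
  wide_int d = wi::to_wide (VECTOR_CST_ELT (vcst2, 0))
               - wi::to_wide (VECTOR_CST_ELT (vcst1, 0));
  for (unsigned int i = 1; i < n; i++)
    if (wi::to_wide (VECTOR_CST_ELT (vcst2, i))
        - wi::to_wide (VECTOR_CST_ELT (vcst1, i)) != d)
      return false;
  *delta = d;
  return true;
}
```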

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread fxue at os dot amperecomputing.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #3 from Feng Xue  ---
(In reply to Andrew Pinski from comment #1)
> Note on aarch64 with SVE, you should be able to generate those constants
> without a load, using the index instruction.
OK, thanks for the note. This still requires an extra instruction, while the
constant delta could be encoded into the shift instruction as an immediate
operand.
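
To illustrate the point (a sketch; the exact instruction is target-dependent),
a splat shift count can be selected as a shift-by-immediate, so the delta
needs no register at all:

```
 /* before: the count vector { 4, 5, 6, 7 } must be materialized */
 vect_shift_1 = vect_value >> { 4, 5, 6, 7 };

 /* after: the splat delta can become a shift-by-immediate,
    e.g. "ushr v1.4s, v0.4s, #4" on AArch64 */
 vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
```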

> Basically this requires an "un-shift" pass and most likely should be done at
> the RTL level, though that might be too late.
> Maybe isel?
I'm thinking of adding this processing to pass_lower_vector_ssa, which also
contains other peephole vector SSA optimizations, not just lowering.

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #2 from Andrew Pinski  ---
Basically this requires an "un-shift" pass and most likely should be done at
the RTL level, though that might be too late.
Maybe isel?

[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

2024-01-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

Andrew Pinski  changed:

           What    |Removed             |Added
----------------------------------------------------------------------------
           Keywords|                    |missed-optimization
          Component|tree-optimization   |target
           Severity|normal              |enhancement
   Last reconfirmed|                    |2024-01-11
     Ever confirmed|0                   |1
             Status|UNCONFIRMED         |NEW

--- Comment #1 from Andrew Pinski  ---
Note that on aarch64 with SVE, you should be able to generate those constants
without a load, using the index instruction, though aarch64 does not currently
generate them.  So this is definitely target specific.
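
For reference, a minimal sketch of that idea through the ACLE intrinsics (the
function name is made up for illustration):

```
#include <arm_sve.h>

/* Build the per-lane shift counts { 0, 1, 2, 3, ... } without a
   constant-pool load; svindex_s32 maps to the SVE "index"
   instruction.  */
svint32_t
shift_counts (void)
{
  return svindex_s32 (0, 1);
}
```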

Plus, this simplification happens after the vectorizer, which produces:
```
  <bb 3> [local count: 252544065]:
  # i_13 = PHI <i_10(5), 0(2)>
  # ivtmp_3 = PHI <ivtmp_2(5), 16(2)>
  # vect_vec_iv_.4_11 = PHI <_16(5), { 0, 1, 2, 3 }(2)>
  # vectp_array.6_19 = PHI <vectp_array.6_20(5), &array(2)>
  # ivtmp_22 = PHI <ivtmp_23(5), 0(2)>
  _16 = vect_vec_iv_.4_11 + { 4, 4, 4, 4 };
  vect__1.5_18 = vect_cst__17 >> vect_vec_iv_.4_11;
  _1 = value_8(D) >> i_13;
  MEM <vector(4) int> [(int *)vectp_array.6_19] = vect__1.5_18;
  i_10 = i_13 + 1;
  ivtmp_2 = ivtmp_3 - 1;
  vectp_array.6_20 = vectp_array.6_19 + 16;
  ivtmp_23 = ivtmp_22 + 1;
  if (ivtmp_23 < 4)
    goto <bb 5>; [75.00%]
  else
    goto <bb 4>; [25.00%]
```
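
For reference, a source loop of roughly this shape reproduces such a dump (a
reconstruction for illustration, not necessarily the exact testcase from the
PR; four 4-lane vector iterations correspond to 16 scalar iterations):

```
int array[16];

void
test (int value)
{
  for (int i = 0; i < 16; i++)
    array[i] = value >> i;
}
```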