[Bug target/113326] Optimize vector shift with constant delta on shifting-count operand

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113326

--- Comment #7 from Feng Xue ---

(In reply to Richard Biener from comment #6)
> Note this is basically asking for "strength-reduction" of expensive
> constants which could be more generally useful and not only for this
> specific shift case. Consider the same example but with an add instead
> of a shift for example, the same exact set of constants will appear.

It is. But so far I have only found that vector shift gets this special
treatment of constant operands, based on their numerical pattern; I am not
sure whether any other operator would.

BTW, here is a scalar-version strength reduction for shift, like:

    int a = value >> n;
    int b = value >> (n + 6);

==>

    int a = value >> n;
    int b = a >> 6;   // (n + 6) is not needed

But this is not covered by the current scalar strength-reduction pass.
--- Comment #6 from Richard Biener ---

(In reply to Andrew Pinski from comment #5)
> On cores which have many vector units the first one might be faster than
> the second one. So this needs a cost model too.

Note the vectorizer has the shift values dependent as well (across
iterations); we just constant propagate after unrolling here.

Note this is basically asking for "strength-reduction" of expensive
constants which could be more generally useful and not only for this
specific shift case. Consider the same example but with an add instead
of a shift for example, the same exact set of constants will appear.
--- Comment #5 from Andrew Pinski ---

One more thing:

```
vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
vect_shift_1 = vect_value >> { 4, 5, 6, 7 };
vect_shift_2 = vect_value >> { 8, 9, 10, 11 };
vect_shift_3 = vect_value >> { 12, 13, 14, 15 };
```

vs

```
vect_shift_0 = vect_value >> { 0, 1, 2, 3 };
vect_shift_1 = vect_shift_0 >> { 4, 4, 4, 4 };
vect_shift_2 = vect_shift_0 >> { 8, 8, 8, 8 };
vect_shift_3 = vect_shift_0 >> { 12, 12, 12, 12 };
```

The first has fully independent operations, while in the second case the
last three shifts each depend on the first one, though they are independent
of each other.

On cores which have many vector units the first one might be faster than the
second one. So this needs a cost model too.
--- Comment #4 from Andrew Pinski ---

(In reply to Feng Xue from comment #3)
> I'm thinking of adding the processing in pass_lower_vector_ssa, which also
> contains other peephole vector ssa optimizations, not just lowering.

It should be in isel, like the other vector instruction selection that goes
on there. pass_lower_vector_ssa is only for lowering generic vectors.
--- Comment #3 from Feng Xue ---

(In reply to Andrew Pinski from comment #1)
> Note on aarch64 with SVE, you should be able to generate those constants
> without a load, using the index instruction.

Ok. Thanks for the note. This still requires an extra instruction, while the
constant delta could be encoded in the shift instruction as an immediate
operand.

> Basically this requires an "un-shift" pass and most likely should be done
> at the RTL level though that might be too late.
> Maybe isel?

I'm thinking of adding the processing in pass_lower_vector_ssa, which also
contains other peephole vector ssa optimizations, not just lowering.
--- Comment #2 from Andrew Pinski ---

Basically this requires an "un-shift" pass and most likely should be done at
the RTL level, though that might be too late. Maybe isel?
--- Comment #1 from Andrew Pinski ---

Andrew Pinski changed:

           What    |Removed             |Added
----------------------------------------------------------------
           Keywords|                    |missed-optimization
          Component|tree-optimization   |target
           Severity|normal              |enhancement
   Last reconfirmed|                    |2024-01-11
     Ever confirmed|0                   |1
             Status|UNCONFIRMED         |NEW

Note on aarch64 with SVE, you should be able to generate those constants
without a load, using the index instruction. Though aarch64 does not
generate them currently. So this is definitely target specific and all.

Plus this simplification happens after the vectorizer, which produces:

```
  [local count: 252544065]:
  # i_13 = PHI
  # ivtmp_3 = PHI
  # vect_vec_iv_.4_11 = PHI <_16(5), { 0, 1, 2, 3 }(2)>
  # vectp_array.6_19 = PHI
  # ivtmp_22 = PHI
  _16 = vect_vec_iv_.4_11 + { 4, 4, 4, 4 };
  vect__1.5_18 = vect_cst__17 >> vect_vec_iv_.4_11;
  _1 = value_8(D) >> i_13;
  MEM [(int *)vectp_array.6_19] = vect__1.5_18;
  i_10 = i_13 + 1;
  ivtmp_2 = ivtmp_3 - 1;
  vectp_array.6_20 = vectp_array.6_19 + 16;
  ivtmp_23 = ivtmp_22 + 1;
  if (ivtmp_23 < 4)
    goto ;   [75.00%]
  else
    goto ;   [25.00%]
```