[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #7 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #4)
> (In reply to Hongtao Liu from comment #3)
> > Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.
> 
> Oh, ix86_vect_estimate_reg_pressure is only for loops; the BB vectorizer only
> uses ix86_builtin_vectorization_cost, not add_stmt_cost/finish_cost.

Oh, the CTOR comes from the source code, not from the vectorizer.
Then why aren't the loads from offset moved just before their consumers (the
loads from array), so that the live ranges of those values can be shortened?
(The loads from array themselves are moved just before the CTOR insns.)
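
To make the ordering point concrete, here is a purely illustrative scalar
sketch (hand-written source, not GCC internals or output): in the first form
all zero-extended offsets are live at once until their uses, while in the
second each offset load sits directly before its consumer, which is the
schedule the RTL for the CTOR should ideally end up with.

```

/* Illustrative only: the same loads written with long and short live
   ranges for the zero-extended offsets.  With 16 or 32 elements the
   first form keeps every offset live simultaneously and spills.  */
char long_live_ranges (const char *array, const unsigned short *offset)
{
  unsigned o0 = offset[0];    /* all offsets loaded and zero-extended first */
  unsigned o1 = offset[1];
  char b0 = array[o0];        /* ...and used only later */
  char b1 = array[o1];
  return (char) (b0 + b1);
}

char short_live_ranges (const char *array, const unsigned short *offset)
{
  char b0 = array[offset[0]]; /* each offset dies right after its use */
  char b1 = array[offset[1]];
  return (char) (b0 + b1);
}
```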

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #6 from Richard Biener  ---
That's ix86_expand_vector_init_interleave, which for a QI inner_mode extends
to SImode, likely because it tries to work with just SSE2?
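
For a rough sense of why an SSE2-only strategy pushes byte elements through
full-width integer registers, here is a hand-written intrinsics sketch (not
the expander's actual sequence): SSE2 has no byte-insert instruction, so
bytes are assembled into 16-bit words in GPRs and inserted with PINSRW,
whereas SSE4.1's PINSRB can insert each byte directly.

```

#include <emmintrin.h>   /* SSE2 */
#ifdef __SSE4_1__
#include <smmintrin.h>   /* SSE4.1 */
#endif

/* Sketch only, showing the first few elements; not GCC's expander code.  */
__m128i
first_bytes_sse2 (const unsigned char *b)
{
  __m128i v = _mm_setzero_si128 ();
  /* No PINSRB with plain SSE2: each byte is zero-extended into an integer
     register, paired into a 16-bit word, and inserted via PINSRW.  */
  v = _mm_insert_epi16 (v, b[0] | (b[1] << 8), 0);
  v = _mm_insert_epi16 (v, b[2] | (b[3] << 8), 1);
  return v;
}

#ifdef __SSE4_1__
__m128i
first_bytes_sse4_1 (const unsigned char *b)
{
  __m128i v = _mm_setzero_si128 ();
  /* PINSRB inserts bytes directly, no word assembly in GPRs needed.  */
  v = _mm_insert_epi8 (v, b[0], 0);
  v = _mm_insert_epi8 (v, b[1], 1);
  return v;
}
#endif
```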

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #5 from Richard Biener  ---
We do not BB vectorize gathers, I think (ISTR some "loop" uses in the
infrastructure; not too difficult to fix, I guess).

In the end the problem is RTL expansion of the CTOR and then lack of
combine?

Look at how we RTL expand

typedef char __v32qi __attribute__((vector_size(32)));

__v32qi
_mm256_set_epi8  (char __q31, char __q30, char __q29, char __q28,
  char __q27, char __q26, char __q25, char __q24,
  char __q23, char __q22, char __q21, char __q20,
  char __q19, char __q18, char __q17, char __q16,
  char __q15, char __q14, char __q13, char __q12,
  char __q11, char __q10, char __q09, char __q08,
  char __q07, char __q06, char __q05, char __q04,
  char __q03, char __q02, char __q01, char __q00)
{
  return __extension__ (__v32qi){
__q00, __q01, __q02, __q03, __q04, __q05, __q06, __q07,
__q08, __q09, __q10, __q11, __q12, __q13, __q14, __q15,
__q16, __q17, __q18, __q19, __q20, __q21, __q22, __q23,
__q24, __q25, __q26, __q27, __q28, __q29, __q30, __q31
  };
}
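
(If it helps to follow along, the expansion can be inspected by compiling the
function above with the testcase's target options, e.g. something like
-O2 -mavx2, plus -fdump-rtl-expand, and reading the resulting .expand dump.)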

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #4 from Hongtao Liu  ---
(In reply to Hongtao Liu from comment #3)
> Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.

Oh, ix86_vect_estimate_reg_pressure is only for loops; the BB vectorizer only
uses ix86_builtin_vectorization_cost, not add_stmt_cost/finish_cost.

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-17 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Hongtao Liu  changed:

 What            |Removed     |Added
 ------------------------------------------------------
 CC              |            |liuhongt at gcc dot gnu.org

--- Comment #3 from Hongtao Liu  ---
Looks like ix86_vect_estimate_reg_pressure doesn't work here, taking a look.

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

--- Comment #2 from Andrew Pinski  ---
Note you can reproduce the same issue with SSE2 (and not just AVX):
```
#define vect16 __attribute__((vector_size(16)))

vect16 char gather(char *array, unsigned short *offset) {
  return (vect16 char){
    array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],
    array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
    array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]],
    array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]]};
}
```
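
For reference, the 32-byte form from the bug title can be written in the same
style (a sketch using the vector extension directly rather than the
_mm256_set_epi8 intrinsic; it should hit the same expansion path once AVX is
enabled):
```
#define vect32 __attribute__((vector_size(32)))

vect32 char gather32(char *array, unsigned short *offset) {
  return (vect32 char){
    array[offset[0]],  array[offset[1]],  array[offset[2]],  array[offset[3]],
    array[offset[4]],  array[offset[5]],  array[offset[6]],  array[offset[7]],
    array[offset[8]],  array[offset[9]],  array[offset[10]], array[offset[11]],
    array[offset[12]], array[offset[13]], array[offset[14]], array[offset[15]],
    array[offset[16]], array[offset[17]], array[offset[18]], array[offset[19]],
    array[offset[20]], array[offset[21]], array[offset[22]], array[offset[23]],
    array[offset[24]], array[offset[25]], array[offset[26]], array[offset[27]],
    array[offset[28]], array[offset[29]], array[offset[30]], array[offset[31]]};
}
```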

[Bug target/82731] _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2024-04-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82731

Andrew Pinski  changed:

 What            |Removed     |Added
 ------------------------------------------------------
 Last reconfirmed|            |2024-04-15
 Severity        |normal      |enhancement
 Status          |UNCONFIRMED |NEW
 Ever confirmed  |0           |1

--- Comment #1 from Andrew Pinski  ---
Confirmed. This comes down to having a scheduler that reduces live ranges much
more aggressively.

Adding -fschedule-insns helps slightly but not enough in this case.