https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125189
Bug ID: 125189
Summary: [missed optimization] VPERM instruction not used to
vectorize gather (lookup) in AVX512
Product: gcc
Version: 17.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: vincenzo.innocente at cern dot ch
Target Milestone: ---
According to Example 1 in
http://www.ecs.umass.edu/arith-2018/pdf/arith25_18.pdf
VPERM instructions can be used on AVX512 to vectorize lookups in small tables.
I made a simple example (see https://godbolt.org/z/q8K563nns):
extern float y[16];
void kernel1(float const * a, float * x) {
  for (int i=0; i<2048; ++i) {
    int j = int(a[i])%16; // just an example to compute an index in range 0-15
    x[i] = y[j]*a[i];
  }
}
I think this should vectorize like kernel2 below (well, modulo the
idiosyncrasy of VPERM indexing).
I cannot exclude that this is a cost-evaluation issue.
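#include <x86intrin.h>
void kernel2(float const * a, float * x) {
  for (int i=0; i<2048; i+=16) {
    int j[16];
    for (int k=0; k<16; ++k) j[k] = int(a[i+k])%16;
    // VPERMPS (cross-lane permute): the index vector comes first, then the table.
    // _mm512_permutevar_ps is VPERMILPS, which only permutes within 128-bit lanes.
    // Unaligned loads/stores: no 64-byte alignment is guaranteed for y, j, a or x.
    auto yy = _mm512_permutexvar_ps(_mm512_loadu_si512(j), _mm512_loadu_ps(y));
    _mm512_storeu_ps(x+i, _mm512_mul_ps(yy, _mm512_loadu_ps(a+i)));
  }
}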
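
One possible reading of the "idiosyncrasy of VPERM indexing" is that a 16-lane
permute only looks at the low four bits of each index, whereas the scalar code
computes int(a[i])%16. The portable sketch below is a scalar model of that
difference, not what GCC emits and not part of the original test case. The names
index_mod, index_mask, vperm_lookup_model and check_model are hypothetical helpers
introduced here for illustration.

```cpp
#include <cassert>

// Index as computed in kernel1: truncating conversion, then remainder.
constexpr int index_mod(float v) { return int(v) % 16; }

// Index as a 16-lane permute would effectively see it: low four bits only.
constexpr int index_mask(float v) { return int(v) & 15; }

// For non-negative inputs the two agree; for negative inputs they do not.
static_assert(index_mod(37.5f) == 5 && index_mask(37.5f) == 5, "");
static_assert(index_mod(-3.0f) == -3 && index_mask(-3.0f) == 13, "");

// Hypothetical scalar model of a 16-lane table lookup (assumption: each lane
// selects table[idx & 15]; this is not a GCC or Intel API).
inline void vperm_lookup_model(const float (&table)[16],
                               const int (&idx)[16],
                               float (&out)[16]) {
  for (int k = 0; k < 16; ++k) out[k] = table[idx[k] & 15];
}

// Checks that the model matches the plain y[j] lookup when every index is
// already in the range 0-15, as in kernel1 with non-negative input.
inline void check_model() {
  float table[16], out[16];
  int idx[16];
  for (int k = 0; k < 16; ++k) {
    table[k] = 0.5f * k;
    idx[k] = index_mod(3.0f * k);
  }
  vperm_lookup_model(table, idx, out);
  for (int k = 0; k < 16; ++k) assert(out[k] == table[idx[k]]);
}
```

Under this model the scalar and permute-based lookups agree whenever a[i] is
non-negative; for negative a[i] kernel1 would index y out of bounds anyway, so
the example only makes sense for non-negative input.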