https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125189

            Bug ID: 125189
           Summary: [missed optimization] VPERM instruction not used to
                    vectorize gather (lookup) in AVX512
           Product: gcc
           Version: 17.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

According to Example 1 in
http://www.ecs.umass.edu/arith-2018/pdf/arith25_18.pdf
VPERM instructions can be used on AVX512 to vectorize lookups in small tables.

I made a simple example (see https://godbolt.org/z/q8K563nns):

extern float y[16];

void kernel1(float const * a, float * x) {
  for (int i=0;i<2048; ++i) {
    int j = int(a[i])%16; // just an example to compute an index in range 0-15
    x[i]=y[j]*a[i];
  }
}

that I think should vectorize as kernel2 below (well, modulo the idiosyncrasy
of VPERM indexing).
I cannot exclude that this is a cost-evaluation issue.

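#include <x86intrin.h>
void kernel2(float const * a, float * x) {
   for (int i=0;i<2048; i+=16) {
    int j[16];
    for (int k=0; k<16; ++k) j[k] = int(a[i+k])%16;
    // VPERMPS (index first, table second): lane k = y[j[k] & 15].
    // _mm512_permutevar_ps (VPERMILPS) only selects within 128-bit lanes.
    // Unaligned loads/stores: kernel1 makes no alignment assumption either.
    auto yy = _mm512_permutexvar_ps(_mm512_loadu_si512(j), _mm512_loadu_ps(y));
    _mm512_storeu_ps(x+i,_mm512_mul_ps(yy,_mm512_loadu_ps(a+i)));
   }
}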
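
For reference, here is a portable scalar sketch of what the permute-based lookup
computes for one block of 16 elements. It is only a model, not the requested
code generation. It assumes the 512-bit form of VPERMPS, which uses the low 4
bits of each 32-bit index. The names permute16 and lookup_block_permute are
made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a 16-lane float permute (512-bit VPERMPS):
// result lane k is table[idx[k] & 15]; the upper index bits are ignored.
std::array<float, 16> permute16(const std::array<float, 16>& table,
                                const std::array<std::int32_t, 16>& idx) {
  std::array<float, 16> out{};
  for (std::size_t k = 0; k < 16; ++k)
    out[k] = table[static_cast<std::size_t>(idx[k] & 15)];
  return out;
}

// One 16-element block of the kernel: x[k] = table[int(a[k]) % 16] * a[k],
// with the table lookup expressed through permute16 (hypothetical helper).
void lookup_block_permute(const std::array<float, 16>& table,
                          const float* a, float* x) {
  std::array<std::int32_t, 16> idx{};
  for (std::size_t k = 0; k < 16; ++k)
    idx[k] = static_cast<std::int32_t>(a[k]) % 16;
  const std::array<float, 16> yy = permute16(table, idx);
  for (std::size_t k = 0; k < 16; ++k)
    x[k] = yy[k] * a[k];
}
```

For non-negative inputs this matches the scalar gather in kernel1 element by
element. For negative a[i], int(a[i])%16 is negative in C++, so kernel1 would
index y out of bounds, while the permute wraps the index into 0-15. This is one
instance of the indexing idiosyncrasy mentioned above.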
