https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88531

            Bug ID: 88531
           Summary: Index data types when targeting AVX-512 vectorization
                    with gather/scatter
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: florian.schornbaum at siemens dot com
  Target Milestone: ---

Hi,

I noticed that GCC fails to vectorize simple loops containing indirect loads
(or stores) unless the index used for the indirect access has one of a very
small set of integer data types. I'm targeting AVX-512. This is the
MWE (an indirect load combined with a direct store):

==============================
#include <cstdint>

using loop_t = uint32_t;
using idx_t = uint32_t;

void loop(double * const __restrict__ dst,
          double const * const __restrict__ src,
          idx_t const * const __restrict__ idx,
          loop_t const begin,
          loop_t const end)
{
    for (loop_t i = begin; i < end; ++i)
    {
        dst[i] = 42.0 * src[idx[i]];
    }
}
==============================
See: https://godbolt.org/z/Ps-sOv

This only vectorizes if idx_t is int32_t, int64_t, or uint64_t.

My suspicion is that this goes back to the AVX-512 gather/scatter instructions,
which come in two flavors: with signed 32-bit and signed 64-bit integers for the
indices. Unsigned 64-bit indices probably work (on a 64-bit architecture)
because they appear to be treated simply as signed 64-bit values, which is
probably due to the following (from the documentation):
"... The scaled index may require more bits to represent than the address bits
used by the processor (e.g., in 32-bit mode, if the scale is greater than one).
In this case, the most significant bits beyond the number of address bits are
ignored. ..."
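
The arithmetic behind this suspicion can be sketched in scalar code. This is
only a model under the assumption that the address is formed as
base + index * scale and truncated to 64 bits; effective_address and as_signed
are hypothetical names used for illustration, not part of any API:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 64-bit gather address: base + index * scale,
// keeping only the low 64 bits. Unsigned arithmetic wraps modulo 2^64,
// which models "bits beyond the number of address bits are ignored".
std::uint64_t effective_address(std::uint64_t base, std::int64_t index,
                                std::uint64_t scale)
{
    // Assumption: the scale is one of the usual addressing scales.
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + static_cast<std::uint64_t>(index) * scale;
}

// Reinterpret the bits of an unsigned 64-bit index as a signed one.
std::int64_t as_signed(std::uint64_t v)
{
    std::int64_t s;
    std::memcpy(&s, &v, sizeof s);
    return s;
}
```

With base = 0x1000, scale = 8, and the index 0xFFFFFFFFFFFFFFFF, the unsigned
computation and the signed reinterpretation (-1) yield the same 64-bit address,
0xFF8, which is why uint64_t and int64_t indices would be interchangeable here.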

Unfortunately, this does not vectorize for int16_t, uint16_t, and uint32_t,
although the 32-bit version of gather/scatter could be used for int16_t and
uint16_t if the indices are first widened (sign-extended for int16_t,
zero-extended for uint16_t). Likewise, the 64-bit version could be used with
zero-extended indices of type uint32_t.
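
The widening suggested above can be sketched in scalar code. widen_index and
loop_widened are hypothetical names introduced only for this illustration. The
sketch shows that every 16-bit index and every uint32_t index has a
value-preserving representation in the signed 32-bit or 64-bit type (signed
types are sign-extended, unsigned types are zero-extended). Whether writing the
conversion explicitly changes GCC's vectorization decision has not been
verified:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: convert a narrow index to the signed 32- or 64-bit
// type that the two gather/scatter flavors use for their indices.
template <typename Idx>
constexpr auto widen_index(Idx i)
{
    static_assert(std::is_integral<Idx>::value, "index must be an integer");
    static_assert(sizeof(Idx) <= 4, "sketch only covers indices up to 32 bits");
    // Types narrower than 32 bits and int32_t itself fit into int32_t;
    // uint32_t needs int64_t to keep values above INT32_MAX intact.
    using wide_t = typename std::conditional<
        (sizeof(Idx) < 4) || std::is_signed<Idx>::value,
        std::int32_t, std::int64_t>::type;
    return static_cast<wide_t>(i);
}

// The loop from the MWE with the index widened explicitly.
void loop_widened(double * const __restrict__ dst,
                  double const * const __restrict__ src,
                  std::uint32_t const * const __restrict__ idx,
                  std::uint32_t const begin,
                  std::uint32_t const end)
{
    assert(begin <= end);
    for (std::uint32_t i = begin; i < end; ++i)
    {
        dst[i] = 42.0 * src[widen_index(idx[i])];
    }
}
```

For example, widen_index(std::uint16_t{65535}) yields the int32_t value 65535,
widen_index(std::int16_t{-1}) yields the int32_t value -1, and a uint32_t
index is returned as an int64_t holding the same value.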

Although the code example only uses idx[i] for loading, it appears to be the
exact same issue when using idx[i] for storing (meaning: when scatter would be
required).

Are there any plans to get this working?
Or did I maybe miss something and this should already work?

Many thanks in advance

Florian
