https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88531
Bug ID: 88531
Summary: Index data types when targeting AVX-512 vectorization with gather/scatter
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: florian.schornbaum at siemens dot com
Target Milestone: ---

Hi,

I realized that GCC fails to vectorize simple loops if there are indirect loads (or stores) and the type of the index used for the indirect access is not one of a very small set of integer data types. I'm targeting AVX-512.

This is the MWE (only an indirect load, but a direct store):

==============================
#include <cstdint>

using loop_t = uint32_t;
using idx_t = uint32_t;

void loop(double * const __restrict__ dst,
          double const * const __restrict__ src,
          idx_t const * const __restrict__ idx,
          loop_t const begin,
          loop_t const end)
{
    for (loop_t i = begin; i < end; ++i)
    {
        dst[i] = 42.0 * src[idx[i]];
    }
}
==============================

See: https://godbolt.org/z/Ps-sOv

This only vectorizes if idx_t is int32_t, int64_t, or uint64_t.

My suspicion is that this goes back to the gather/scatter instructions of AVX-512, which come in two flavors: with 32-bit and with 64-bit signed integers for the indices. Unsigned 64-bit probably works (on a 64-bit architecture) because it looks like it's just treated as a signed 64-bit value, which probably is due to (from the documentation):

"... The scaled index may require more bits to represent than the address bits used by the processor (e.g., in 32-bit mode, if the scale is greater than one). In this case, the most significant bits beyond the number of address bits are ignored. ..."

Unfortunately, for int16_t, uint16_t, and uint32_t, this does not vectorize, although the 32-bit version of gather/scatter could be used for int16_t and uint16_t if the indices were first widened (sign extension for int16_t, zero extension for uint16_t). Likewise, the 64-bit version could be used with indices of type uint32_t after zero extension.

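To make the widening idea concrete, here is a minimal hand-written sketch of what a vectorized version of the loop could look like for uint32_t indices. It is only an illustration under stated assumptions, not what GCC generates or would have to generate: the function name gather_scale_u32 is made up for this example, the vector path is only compiled when __AVX512F__ is defined (e.g. with -march=skylake-avx512), and a plain scalar loop handles the remainder and serves as the fallback otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of the report):
// computes dst[i] = 42.0 * src[idx[i]] for i in [0, n).
void gather_scale_u32(double* dst, const double* src,
                      const std::uint32_t* idx, std::size_t n)
{
    std::size_t i = 0;
#if defined(__AVX512F__)
    const __m512d factor = _mm512_set1_pd(42.0);
    for (; i + 8 <= n; i += 8) {
        // Load eight 32-bit indices (unaligned load).
        const __m256i idx32 =
            _mm256_loadu_si256(reinterpret_cast<const __m256i*>(idx + i));
        // Zero-extend them to eight 64-bit lanes, so values >= 2^31
        // stay non-negative when interpreted as signed 64-bit indices.
        const __m512i idx64 = _mm512_cvtepu32_epi64(idx32);
        // Gather eight doubles using 64-bit indices; scale 8 == sizeof(double).
        const __m512d v = _mm512_i64gather_pd(idx64, src, 8);
        _mm512_storeu_pd(dst + i, _mm512_mul_pd(v, factor));
    }
#endif
    // Scalar remainder (and complete fallback without AVX-512).
    for (; i < n; ++i) {
        dst[i] = 42.0 * src[idx[i]];
    }
}
```

The zero extension matters because reinterpreting a uint32_t index of 2^31 or larger as a signed 32-bit value would yield a negative offset, which is why the 32-bit gather cannot simply be used for uint32_t.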
Although the code example only uses idx[i] for loading, it appears to be the exact same issue when using idx[i] for storing (meaning: when scatter would be required).

Are there any plans to get this working? Or did I maybe miss something and this should already work?

Many thanks in advance
Florian
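For reference, the scatter case mentioned above is not shown in the MWE. A sketch of what it might look like follows; it simply mirrors the original loop with the indirection moved to the store side. The name loop_scatter is chosen for this example only, and the caller is assumed to pass indices that are in bounds for dst.

```cpp
#include <cstdint>

using loop_t = std::uint32_t;
using idx_t  = std::uint32_t;

// Scatter counterpart of the MWE: direct load, indirect store.
// Illustrative only; the name loop_scatter does not appear in the report.
void loop_scatter(double * const __restrict__ dst,
                  double const * const __restrict__ src,
                  idx_t const * const __restrict__ idx,
                  loop_t const begin,
                  loop_t const end)
{
    for (loop_t i = begin; i < end; ++i)
    {
        // A vectorized version of this store would need a scatter instruction.
        dst[idx[i]] = 42.0 * src[i];
    }
}
```

Compiling with something like g++ -O3 -march=skylake-avx512 -fopt-info-vec should show whether the loop was vectorized for a given choice of idx_t.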