https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79014
Bug ID: 79014
Summary: Absent vectorization with memory loads
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: jkb at sanger dot ac.uk
Target Milestone: ---

I have some code which vectorises using icc to some success, but fails to vectorise in gcc (and clang too). The noddy example is as follows:

#define NX 32

typedef unsigned int uint32_t;
typedef unsigned short uint16_t;
typedef unsigned char uint8_t;

int bar(uint32_t *s3, uint32_t *in, int in_size, uint8_t *out, int out_size) {
    const uint32_t mask = (1u << 12)-1;
    int i;
    uint32_t R[NX] = {0};
    for (i = 0; i < out_size; i+= NX) {
        int z;
#pragma omp simd aligned(in, out: 32)
        for (z = 0; z < NX; z++) {
            R[z] *= in[i+z];
            out[i+z] = s3[R[z] & mask];
        }
    }
    return 0;
}

This is called from a second file containing a main function to test it:

#include <stdlib.h>

typedef unsigned int uint32_t;
typedef unsigned short uint16_t;
typedef unsigned char uint8_t;

#define BS (1013*1047)

uint32_t s3[1<<12];
uint32_t in[BS];
uint8_t out[BS];

extern int bar(uint32_t *s3, uint32_t *in, int in_size, uint8_t *out, int out_size);

int main(void) {
    int i;
    for (i = 0; i < (1<<12); i++)
        s3[i] = (rand() << 16) ^ rand();
    for (i = 0; i < BS; i++)
        in[i] = (rand() << 16) ^ rand();
    for (i = 0; i < 10000; i++)
        bar(s3, in, BS, out, BS);
    return 0;
}

(The omp pragma there is purely for icc; it seems to make little difference to gcc.)

Some benchmarks:

gcc-6.1.0:    33743262303 cycles, 97919071795 instructions
icc-15.0.0:   21561762894 cycles, 58128430729 instructions
clang-3.8.0:  36884730520 cycles, 92597500495 instructions

Compilation is via "gcc -g -O3 -fopenmp -march=native a.c b.c" and the host is a Xeon E5-2660 (@2.2GHz) with AVX but not AVX2 support (so realistically this can be vectorised using SSE4 only, as it's integer maths). This also means it isn't using the gather instructions, but loading from memory to SSE4 registers.
icc -S confirms this, with a lot of xmm register usage.

icc with -qopt-report=5 dumps out:

LOOP BEGIN at a.c(14,14)
   remark #15389: vectorization support: reference R has unaligned access   [ a.c(15,13) ]
   remark #15389: vectorization support: reference R has unaligned access   [ a.c(15,13) ]
   remark #15389: vectorization support: reference in has unaligned access   [ a.c(15,13) ]
   remark #15389: vectorization support: reference out has unaligned access   [ a.c(16,13) ]
   remark #15389: vectorization support: reference R has unaligned access   [ a.c(16,13) ]
   remark #15381: vectorization support: unaligned access used inside loop body   [ a.c(16,13) ]
   remark #15399: vectorization support: unroll factor set to 2
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 2
   remark #15458: masked indexed (or gather) loads: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 17
   remark #15477: vector loop cost: 4.500
   remark #15478: estimated potential speedup: 3.770
   remark #15479: lightweight vector operations: 12
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=2
LOOP END

I've no idea why it thinks those are unaligned, but it vectorised it anyway.

Gcc -fopt-info-vec-missed reports:

a.c:11:5: note: failed: evolution of base is not affine.
a.c:11:5: note: bad data references.
a.c:15:18: note: not vectorized: not suitable for gather load _28 = *_27;
a.c:15:18: note: bad data references.
a.c:11:5: note: not vectorized: no vectype for stmt: R = {}; scalar_type: uint32_t[32]
a.c:11:5: note: not vectorized: not enough data-refs in basic block.
a.c:19:12: note: not vectorized: not enough data-refs in basic block.
a.c:13:9: note: not vectorized: not enough data-refs in basic block.
a.c:16:22: note: not consecutive access _28 = *_27;
a.c:16:22: note: not vectorized: no grouped stores in basic block.
a.c:11:5: note: not vectorized: not enough data-refs in basic block.

Adding __attribute__((aligned(32))) to the R[] definition didn't help.

I've explored up to gcc 7 via the online tool at https://godbolt.org/ and can confirm the lack of vectorisation still, although that seems to show clang gains vectorisation too.
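
For anyone wanting to experiment, here is a minimal sketch of one way to restructure the inner loop by hand. It is not part of the original test case. It moves the s3[] lookup (the access gcc reports as "not suitable for gather load") into its own loop, so the multiply and mask step only touches contiguous memory. The names bar_ref, bar_split and idx are made up for illustration, <stdint.h> replaces the local typedefs, and the omp pragma is left out. Whether any given gcc version actually vectorises the first loop has not been checked here; the sketch only shows the transformation and that it preserves the results.

```c
#include <stdint.h>

#define NX 32

/* Reference: the loop from the report, unchanged apart from const and casts. */
int bar_ref(const uint32_t *s3, const uint32_t *in, int in_size,
            uint8_t *out, int out_size) {
    const uint32_t mask = (1u << 12) - 1;
    uint32_t R[NX] = {0};
    int i, z;
    (void)in_size; /* unused, kept to match the original signature */
    for (i = 0; i < out_size; i += NX) {
        for (z = 0; z < NX; z++) {
            R[z] *= in[i + z];
            out[i + z] = (uint8_t)s3[R[z] & mask];
        }
    }
    return 0;
}

/* Hypothetical variant: same results, inner loop split into two phases. */
int bar_split(const uint32_t *s3, const uint32_t *in, int in_size,
              uint8_t *out, int out_size) {
    const uint32_t mask = (1u << 12) - 1;
    uint32_t R[NX] = {0};
    uint32_t idx[NX]; /* illustrative scratch buffer for the table indices */
    int i, z;
    (void)in_size;
    for (i = 0; i < out_size; i += NX) {
        /* Phase 1: contiguous loads, multiply, mask. No indexed access. */
        for (z = 0; z < NX; z++) {
            R[z] *= in[i + z];
            idx[z] = R[z] & mask;
        }
        /* Phase 2: the indexed lookups, kept scalar and isolated. */
        for (z = 0; z < NX; z++)
            out[i + z] = (uint8_t)s3[idx[z]];
    }
    return 0;
}
```

Both functions take the same arguments as bar(), so either can be dropped into the b.c driver above. As in the original, out_size is assumed to be a multiple of NX or the buffers padded accordingly.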