https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79014

            Bug ID: 79014
           Summary: Absent vectorization with memory loads
           Product: gcc
           Version: 6.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jkb at sanger dot ac.uk
  Target Milestone: ---

I have some code which icc vectorises with some success, but which gcc (and
clang too) fails to vectorise.  The noddy example is as follows:

#define NX 32

typedef unsigned int uint32_t;
typedef unsigned short uint16_t;
typedef unsigned char uint8_t;

int bar(uint32_t *s3, uint32_t *in, int in_size, uint8_t *out, int out_size) {
    const uint32_t mask = (1u << 12)-1;
    int i;
    uint32_t R[NX] = {0};
    for (i = 0; i < out_size; i+= NX) {
        int z;
#pragma omp simd aligned(in, out: 32)
        for (z = 0; z < NX; z++) {
            R[z] *= in[i+z];
            out[i+z] = s3[R[z] & mask];
        }
    }
    return 0;
}

This is called from a second file containing a main function to test it:

#include <stdlib.h>

typedef unsigned int uint32_t;
typedef unsigned short uint16_t;
typedef unsigned char uint8_t;

#define BS (1013*1047)

uint32_t s3[1<<12];
uint32_t in[BS];
uint8_t out[BS];

extern int bar(uint32_t *s3, uint32_t *in, int in_size, uint8_t *out,
               int out_size);

int main(void) {
    int i;

    for (i = 0; i < (1<<12); i++)
        s3[i] = (rand() << 16) ^ rand();

    for (i = 0; i < BS; i++)
        in[i] = (rand() << 16) ^ rand();

    for (i = 0; i < 10000; i++)
        bar(s3, in, BS, out, BS);

    return 0;
}

(The omp pragma there is purely for icc; it seems to make little difference to
gcc.)  Some benchmarks:

gcc-6.1.0:   33743262303 cycles, 97919071795 instructions
icc-15.0.0:  21561762894 cycles, 58128430729 instructions
clang-3.8.0: 36884730520 cycles, 92597500495 instructions

Compilation is via "gcc -g -O3 -fopenmp -march=native a.c b.c" and the host is
a Xeon E5-2660 (@2.2GHz) with AVX but not AVX2 support (so realistically this
can be vectorised using SSE4 only, as it's integer maths).  This also means icc
isn't using the AVX2 gather instructions, but is loading from memory into SSE
registers.

icc -S confirms this, with a lot of xmm register usage.
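
As an illustration of that strategy (this is not icc's actual output), the
following is a minimal sketch of one NX-wide block of the inner loop split into
two passes: a unit-stride multiply-and-mask pass that a vectoriser can handle
without gather support, and a separate table-lookup pass done as ordinary
indexed loads.  The names split_block, fused_block and idx are hypothetical and
not part of the test case; fused_block is just the original loop body, kept for
comparison.

```c
#include <stdint.h>

#define NX 32

/* Hypothetical helper: one NX-wide block of the inner loop from bar(),
 * split into two passes.  R, in and out point at the current block. */
static void split_block(uint32_t R[NX], const uint32_t *s3,
                        const uint32_t *in, uint8_t *out) {
    const uint32_t mask = (1u << 12) - 1;
    uint32_t idx[NX];   /* masked table indices, one per lane */
    int z;

    /* Pass 1: contiguous loads, multiply and mask.  No indexed access,
     * so this part does not need a gather instruction. */
    for (z = 0; z < NX; z++) {
        R[z] *= in[z];
        idx[z] = R[z] & mask;
    }

    /* Pass 2: the table lookup, done as plain indexed loads
     * (an emulated gather) and narrowed to bytes on store. */
    for (z = 0; z < NX; z++)
        out[z] = (uint8_t)s3[idx[z]];
}

/* The original fused loop body, for comparison. */
static void fused_block(uint32_t R[NX], const uint32_t *s3,
                        const uint32_t *in, uint8_t *out) {
    const uint32_t mask = (1u << 12) - 1;
    int z;

    for (z = 0; z < NX; z++) {
        R[z] *= in[z];
        out[z] = (uint8_t)s3[R[z] & mask];
    }
}
```

Both functions compute the same result; the split form only makes the
separation between the contiguous part and the indexed part explicit.  Whether
gcc then vectorises the first pass is an assumption that would need checking
with -fopt-info-vec, not a measured result.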

icc with -qopt-report=5 dumps out:

   LOOP BEGIN at a.c(14,14)
      remark #15389: vectorization support: reference R has unaligned access   [ a.c(15,13) ]
      remark #15389: vectorization support: reference R has unaligned access   [ a.c(15,13) ]
      remark #15389: vectorization support: reference in has unaligned access   [ a.c(15,13) ]
      remark #15389: vectorization support: reference out has unaligned access   [ a.c(16,13) ]
      remark #15389: vectorization support: reference R has unaligned access   [ a.c(16,13) ]
      remark #15381: vectorization support: unaligned access used inside loop body   [ a.c(16,13) ]
      remark #15399: vectorization support: unroll factor set to 2
      remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 2 
      remark #15449: unmasked aligned unit stride stores: 2 
      remark #15458: masked indexed (or gather) loads: 1 
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 17 
      remark #15477: vector loop cost: 4.500 
      remark #15478: estimated potential speedup: 3.770 
      remark #15479: lightweight vector operations: 12 
      remark #15488: --- end vector loop cost summary ---
      remark #25015: Estimate of max trip count of loop=2
   LOOP END

I've no idea why it thinks those are unaligned, but it vectorised it anyway.

gcc with -fopt-info-vec-missed reports:

a.c:11:5: note: failed: evolution of base is not affine.
a.c:11:5: note: bad data references.
a.c:15:18: note: not vectorized: not suitable for gather load _28 = *_27;

a.c:15:18: note: bad data references.
a.c:11:5: note: not vectorized: no vectype for stmt: R = {};
 scalar_type: uint32_t[32]
a.c:11:5: note: not vectorized: not enough data-refs in basic block.
a.c:19:12: note: not vectorized: not enough data-refs in basic block.
a.c:13:9: note: not vectorized: not enough data-refs in basic block.
a.c:16:22: note: not consecutive access _28 = *_27;
a.c:16:22: note: not vectorized: no grouped stores in basic block.
a.c:11:5: note: not vectorized: not enough data-refs in basic block.

Adding __attribute__((aligned(32))) to the R[] definition didn't help.

I've explored up to gcc 7 via the online tool at https://godbolt.org/ and can
confirm that gcc still does not vectorise this, although the same tool seems to
show that clang has since gained vectorisation of it.
