https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050
            Bug ID: 85050
           Summary: Vectorized function - suboptimal gather
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I compile the following function with gcc 7.2 and 8.0.1, using

    -march=broadwell -O3 -ftree-vectorize -ffast-math -fopenmp

#pragma omp declare simd notinbranch
double testfun(double arg)
{
  static const double A[11] = {5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};
  int iidx = int(arg);
  return A[iidx];
}

The example is also available here: https://godbolt.org/g/wo7pdv

For some reason, GCC's 4-wide vectorized function contains asm that performs
two 2-wide gathers (on xmm registers) instead of a single 4-wide gather:

_ZGVdN4v__Z7testfund:
[...]
        vmovapd      %ymm0, -32(%rsp)
        vmovapd      .LC0(%rip), %xmm2
        vinsertf128  $0x1, -16(%rsp), %ymm0, %ymm0
        vmovapd      %xmm2, %xmm4
        vcvttpd2dqy  %ymm0, %xmm0
        vgatherdpd   %xmm4, (%rax,%xmm0,8), %xmm3
        vpshufd      $238, %xmm0, %xmm0
        vgatherdpd   %xmm2, (%rax,%xmm0,8), %xmm1
        vmovaps      %xmm3, -64(%rsp)
        vmovaps      %xmm1, -48(%rsp)
        vmovapd      -64(%rsp), %ymm0
[...]

The code generated by ICC looks as expected, with a single 4-wide gather:

_ZGVYN4v__Z7testfund:
        vcvttpd2dq   xmm1, ymm0                               #11.18
        vpcmpeqd     ymm2, ymm2, ymm2                         #12.10
        vxorpd       ymm0, ymm0, ymm0                         #12.10
        vgatherdpd   ymm0, QWORD PTR [A.5.0.1+xmm1*8], ymm2   #12.10

I don't see anything wrong with my compiler options. Is this behaviour in GCC
expected, and is it the result of a different vectorization cost model?
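
For context, a minimal sketch of how the simd clones discussed above might be
reached from user code. The caller apply_testfun is hypothetical and is not
part of the original test case; testfun is copied unchanged from the report.
When built with -fopenmp (or -fopenmp-simd), the compiler may vectorize the
loop and call a clone such as _ZGVdN4v__Z7testfund, though it is also free to
inline testfun instead when both are in the same translation unit.

```cpp
#include <cstddef>

// Function from the report, unchanged. Inputs are assumed to truncate
// to an index in [0, 10]; no bounds check is performed.
#pragma omp declare simd notinbranch
double testfun(double arg)
{
  static const double A[11] = {5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};
  int iidx = int(arg);
  return A[iidx];
}

// Hypothetical caller (an assumption, not in the report): a simd loop
// whose body is a call to testfun, so a vector variant can be used.
void apply_testfun(const double *in, double *out, std::size_t n)
{
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i)
    out[i] = testfun(in[i]);
}
```

Without -fopenmp the pragmas are ignored and the loop runs the scalar
function, which gives the same results and is convenient for checking the
table lookup itself.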