[Bug tree-optimization/85050] New: Vectorized function - suboptimal gather

marcin.krotkiewski at gmail dot com Fri, 23 Mar 2018 05:44:50 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050


            Bug ID: 85050
           Summary: Vectorized function - suboptimal gather
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: marcin.krotkiewski at gmail dot com
  Target Milestone: ---

I compiled the following function with GCC 7.2 and 8.0.1, using
-march=broadwell -O3 -ftree-vectorize -ffast-math -fopenmp:

#pragma omp declare simd notinbranch
double testfun(double arg)
{
  static const double A[11] = {5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};
  int iidx = int(arg);
  return A[iidx];
}

Also here https://godbolt.org/g/wo7pdv
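
For context, a minimal sketch of the kind of caller that would use the SIMD
clones of this function. The name apply and its parameters are hypothetical and
not part of the original test case. In a real build the definition of testfun
would typically sit in a separate translation unit, so that the loop calls a
clone such as _ZGVdN4v__Z7testfund instead of inlining the scalar body.

```cpp
#include <cstddef>

// Function from the report, unchanged apart from formatting.
#pragma omp declare simd notinbranch
double testfun(double arg)
{
  static const double A[11] = {5.53,5.49,5.46,5.43,5.40,5.25,5.00,4.69,4.48,4.16,3.85};
  int iidx = int(arg);
  return A[iidx];
}

// Hypothetical caller: with -fopenmp (or -fopenmp-simd) the compiler may
// vectorize this loop and call a SIMD clone of testfun per group of lanes.
// Without OpenMP support the pragmas are ignored and the loop runs scalar.
void apply(const double *in, double *out, std::size_t n)
{
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i)
    out[i] = testfun(in[i]);
}
```

Every input must truncate to an index in the range 0..10, since the function
itself does no bounds checking.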

For some reason, the 4-wide vectorized clone that GCC generates performs two
2-wide (xmm) gathers instead of a single 4-wide (ymm) gather:

_ZGVdN4v__Z7testfund:
[...]
        vmovapd %ymm0, -32(%rsp)
        vmovapd .LC0(%rip), %xmm2
        vinsertf128     $0x1, -16(%rsp), %ymm0, %ymm0
        vmovapd %xmm2, %xmm4
        vcvttpd2dqy     %ymm0, %xmm0
        vgatherdpd      %xmm4, (%rax,%xmm0,8), %xmm3
        vpshufd $238, %xmm0, %xmm0
        vgatherdpd      %xmm2, (%rax,%xmm0,8), %xmm1
        vmovaps %xmm3, -64(%rsp)
        vmovaps %xmm1, -48(%rsp)
        vmovapd -64(%rsp), %ymm0
[...]

The code generated by ICC looks as expected:

_ZGVYN4v__Z7testfund:
  vcvttpd2dq xmm1, ymm0 #11.18
  vpcmpeqd ymm2, ymm2, ymm2 #12.10
  vxorpd ymm0, ymm0, ymm0 #12.10
  vgatherdpd ymm0, QWORD PTR [A.5.0.1+xmm1*8], ymm2 #12.10
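
The single-gather shape of the ICC listing can be sketched by hand with AVX2
intrinsics. This is only an illustration of the expected code shape, not what
either compiler does internally. The names testfun4, testfun4_avx2 and
testfun4_scalar are hypothetical, and the runtime check relies on the GCC/Clang
builtin __builtin_cpu_supports, which is an assumption about the toolchain.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

// Same table as in the report.
static const double A[11] = {5.53, 5.49, 5.46, 5.43, 5.40, 5.25,
                             5.00, 4.69, 4.48, 4.16, 3.85};

// Scalar reference: what each of the four lanes computes.
static void testfun4_scalar(const double *in, double *out)
{
  for (int i = 0; i < 4; ++i) {
    int idx = static_cast<int>(in[i]);
    assert(idx >= 0 && idx < 11);  // check added for the sketch only
    out[i] = A[idx];
  }
}

#if defined(__x86_64__) || defined(__i386__)
// Hypothetical hand-written 4-wide version using one 256-bit gather.
__attribute__((target("avx2")))
static void testfun4_avx2(const double *in, double *out)
{
  __m256d x   = _mm256_loadu_pd(in);             // four doubles
  __m128i idx = _mm256_cvttpd_epi32(x);          // truncate to four int32 (vcvttpd2dq)
  __m256d r   = _mm256_i32gather_pd(A, idx, 8);  // one 4-wide vgatherdpd, scale 8
  _mm256_storeu_pd(out, r);
}
#endif

// Dispatch: use the AVX2 path when the CPU supports it, else fall back.
void testfun4(const double *in, double *out)
{
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("avx2")) {
    testfun4_avx2(in, out);
    return;
  }
#endif
  testfun4_scalar(in, out);
}
```

The unmasked gather intrinsic loads all four lanes, which corresponds to the
all-ones mask that ICC sets up with vpcmpeqd.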

I don't see anything wrong with my compiler options. Is this behaviour expected
in GCC, for example as a result of a different vectorization cost model?
