https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105062

            Bug ID: 105062
           Summary: Suboptimal vectorization for reduction with several
                    elements
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org
  Target Milestone: ---

The testcase is essentially the same as in PR105053, but this report is about
performance, not correctness.

#include <vector>
#include <tuple>
#include <chrono>
#include <algorithm>
#include <random>
#include <iostream>

int main(){
  const long n = 100000000;
  std::vector<std::tuple<int,int,double>> vec;
  vec.reserve(n);
  std::random_device rd;
  std::default_random_engine re(rd());
  std::uniform_int_distribution<int> rand_int;
  std::uniform_real_distribution<double> rand_dbl;
  for(int i=0;i<n;++i)
    vec.emplace_back(rand_int(re), rand_int(re), rand_dbl(re));
  auto start = std::chrono::system_clock::now();
#ifdef SLOW
  {
    int sup = 0;
    for(int i=0;i<n;++i)
      sup=std::max(sup,std::max(std::get<0>(vec[i]),std::get<1>(vec[i])));
    volatile int noopt0 = sup;
  }
#else
  {
    int sup = 0;
    for(int i=0;i<n;++i)
      sup=std::max(std::max(sup,std::get<0>(vec[i])),std::get<1>(vec[i]));
    volatile int noopt1 = sup;
  }
#endif
  auto finish = std::chrono::system_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(
      finish - start).count() << '\n';
}


I compile with -O3 -march=skylake (originally noticed with -march=native on an
i7-10875H CPU).
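
For reference, the two variants can be built as follows (assuming the testcase
is saved as test.cc; the file name and output names are arbitrary):

  g++ -O3 -march=skylake test.cc -o fast          # second loop (fast)
  g++ -O3 -march=skylake -DSLOW test.cc -o slow   # first loop (slow)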

The second loop runs in about 60ms, while the first (compiled with -DSLOW)
runs in 80ms. The generated asm also looks very different. For the fast code,
the core loop is

.L64:
        vmovdqu (%rax), %ymm3
        addq    $64, %rax
        vpunpckldq      -32(%rax), %ymm3, %ymm0
        vpermd  %ymm0, %ymm2, %ymm0
        vpmaxsd %ymm0, %ymm1, %ymm1
        cmpq    %rdx, %rax
        jne     .L64

which looks nice and compact (well, I think we could do without the vpermd, but
it is already great). Now for the slow code, we have

.L64:
        vmovdqu (%rax), %ymm0
        vmovdqu 32(%rax), %ymm10
        vmovdqu 64(%rax), %ymm2
        vmovdqu 96(%rax), %ymm9
        vpermd  %ymm10, %ymm6, %ymm8
        vpermd  %ymm0, %ymm7, %ymm1
        vpblendd        $240, %ymm8, %ymm1, %ymm1
        vpermd  %ymm9, %ymm6, %ymm11
        vpermd  %ymm2, %ymm7, %ymm8
        vpermd  %ymm0, %ymm4, %ymm0
        vpermd  %ymm10, %ymm3, %ymm10
        vpermd  %ymm2, %ymm4, %ymm2
        vpermd  %ymm9, %ymm3, %ymm9
        vpblendd        $240, %ymm11, %ymm8, %ymm8
        vpblendd        $240, %ymm10, %ymm0, %ymm0
        vpblendd        $240, %ymm9, %ymm2, %ymm2
        vpermd  %ymm1, %ymm4, %ymm1
        vpermd  %ymm8, %ymm3, %ymm8
        vpermd  %ymm0, %ymm4, %ymm0
        vpermd  %ymm2, %ymm3, %ymm2
        vpblendd        $240, %ymm8, %ymm1, %ymm1
        vpblendd        $240, %ymm2, %ymm0, %ymm0
        vpmaxsd %ymm0, %ymm1, %ymm1
        subq    $-128, %rax
        vpmaxsd %ymm1, %ymm5, %ymm5
        cmpq    %rdx, %rax
        jne     .L64

It is unrolled one extra time compared to the fast code and contains an
excessive amount of shuffling. If I understand correctly, the vectorizer
recognizes a reduction with MAX_EXPR on "sup" but does not consider the
operation max(get<0>,get<1>) as part of this reduction, so it generates code
that would make sense if I had used two different operations, e.g.

  sup=std::max(sup,std::get<0>(vec[i])+std::get<1>(vec[i]))

instead of both being the same MAX_EXPR. Maybe, when we discover a reduction,
we could check whether the elements are themselves computed with the same
operation as the reduction and, in that case, try to make it a "bigger"
reduction?
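
To illustrate what such a "bigger" reduction would mean, here is a minimal
scalar sketch of an equivalent rewrite of the slow block (my own illustration,
not something GCC currently produces): since MAX_EXPR is associative and
commutative, the per-element max can be folded into the reduction by keeping
one partial maximum per tuple member and combining them once after the loop:

  {
    // Hypothetical rewrite for illustration; not GCC output.
    int sup0 = 0, sup1 = 0;            // one partial maximum per member
    for(int i=0;i<n;++i){
      sup0 = std::max(sup0, std::get<0>(vec[i]));
      sup1 = std::max(sup1, std::get<1>(vec[i]));
    }
    int sup = std::max(sup0, sup1);    // combine partials once, after the loop
    volatile int noopt0 = sup;
  }

In this form each tuple member participates directly in a single MAX
reduction, so the vectorizer should need no more per-iteration shuffling than
in the fast variant.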
