https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105062
            Bug ID: 105062
           Summary: Suboptimal vectorization for reduction with several
                    elements
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: glisse at gcc dot gnu.org
  Target Milestone: ---

The testcase is essentially the same as in PR105053, but here it is about
performance, not correctness.

#include <vector>
#include <tuple>
#include <chrono>
#include <algorithm>
#include <random>
#include <iostream>

int main(){
  const long n = 100000000;
  std::vector<std::tuple<int,int,double>> vec;
  vec.reserve(n);
  std::random_device rd;
  std::default_random_engine re(rd());
  std::uniform_int_distribution<int> rand_int;
  std::uniform_real_distribution<double> rand_dbl;
  for(int i=0;i<n;++i)
    vec.emplace_back(rand_int(re), rand_int(re), rand_dbl(re));
  auto start = std::chrono::system_clock::now();
#ifdef SLOW
  {
    int sup = 0;
    for(int i=0;i<n;++i)
      sup=std::max(sup,std::max(std::get<0>(vec[i]),std::get<1>(vec[i])));
    volatile int noopt0 = sup;
  }
#else
  {
    int sup = 0;
    for(int i=0;i<n;++i)
      sup=std::max(std::max(sup,std::get<0>(vec[i])),std::get<1>(vec[i]));
    volatile int noopt1 = sup;
  }
#endif
  auto finish = std::chrono::system_clock::now();
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(finish - start).count() << '\n';
}

I compile with -O3 -march=skylake (originally noticed with -march=native on an
i7-10875H CPU). The second loop runs in about 60ms, while the first (compiled
with -DSLOW) runs in 80ms. The generated asm also looks very different. For the
fast code, the core loop is

.L64:
        vmovdqu (%rax), %ymm3
        addq    $64, %rax
        vpunpckldq      -32(%rax), %ymm3, %ymm0
        vpermd  %ymm0, %ymm2, %ymm0
        vpmaxsd %ymm0, %ymm1, %ymm1
        cmpq    %rdx, %rax
        jne     .L64

which looks nice and compact (well, I think we could do without the vpermd, but
it is already great).
Now for the slow code, we have

.L64:
        vmovdqu (%rax), %ymm0
        vmovdqu 32(%rax), %ymm10
        vmovdqu 64(%rax), %ymm2
        vmovdqu 96(%rax), %ymm9
        vpermd  %ymm10, %ymm6, %ymm8
        vpermd  %ymm0, %ymm7, %ymm1
        vpblendd        $240, %ymm8, %ymm1, %ymm1
        vpermd  %ymm9, %ymm6, %ymm11
        vpermd  %ymm2, %ymm7, %ymm8
        vpermd  %ymm0, %ymm4, %ymm0
        vpermd  %ymm10, %ymm3, %ymm10
        vpermd  %ymm2, %ymm4, %ymm2
        vpermd  %ymm9, %ymm3, %ymm9
        vpblendd        $240, %ymm11, %ymm8, %ymm8
        vpblendd        $240, %ymm10, %ymm0, %ymm0
        vpblendd        $240, %ymm9, %ymm2, %ymm2
        vpermd  %ymm1, %ymm4, %ymm1
        vpermd  %ymm8, %ymm3, %ymm8
        vpermd  %ymm0, %ymm4, %ymm0
        vpermd  %ymm2, %ymm3, %ymm2
        vpblendd        $240, %ymm8, %ymm1, %ymm1
        vpblendd        $240, %ymm2, %ymm0, %ymm0
        vpmaxsd %ymm0, %ymm1, %ymm1
        subq    $-128, %rax
        vpmaxsd %ymm1, %ymm5, %ymm5
        cmpq    %rdx, %rax
        jne     .L64

It is unrolled once more than the fast code and contains an excessive amount of
shuffling. If I understand correctly, it vectorizes a reduction with MAX_EXPR
on "sup" but does not consider the operation max(get<0>,get<1>) as being part
of this reduction, so it generates code that would make sense if I had used 2
different operations, like
sup=std::max(sup,std::get<0>(vec[i])+std::get<1>(vec[i])), instead of both
being the same MAX_EXPR. Maybe, when we discover a reduction, we could check
whether the elements are themselves computed with the same operation as the
reduction, and in that case try to make it a "bigger" reduction?