https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055
            Bug ID: 93055
           Summary: accumulation loops in stepanov_vector benchmark use more
                    instruction level parallelism
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

The stepanov_vector benchmark from
https://gitlab.com/chriscox/CppPerformanceBenchmarks gets poor codegen on
TestOneType<double>.

Built with -march=bdver1 -O3 (but the regression happens on core too).

Clang compiles the accumulation loops for testOneType<int> as follows:

       │       vpxor  %xmm0,%xmm0,%xmm0
       │       vpxor  %xmm1,%xmm1,%xmm1
       │       vpxor  %xmm2,%xmm2,%xmm2
  0.05 │       vpxor  %xmm3,%xmm3,%xmm3
       │       data16 nopw %cs:0x0(%rax,%rax,1)
  6.95 │ 300:┌─→vpaddd 0x5f0(%rsp,%rcx,4),%xmm0,%xmm0
  0.05 │     │  vpaddd 0x600(%rsp,%rcx,4),%xmm1,%xmm1
  7.13 │     │  vpaddd 0x610(%rsp,%rcx,4),%xmm2,%xmm2
  0.16 │     │  vpaddd 0x620(%rsp,%rcx,4),%xmm3,%xmm3
       │     │  add    $0x10,%rcx
       │     │  cmp    $0x7dc,%rcx
  7.04 │     └──jne    300
  0.07 │       vpaddd %xmm0,%xmm1,%xmm0
  1.61 │       vpaddd %xmm0,%xmm2,%xmm0
       │       vpaddd %xmm0,%xmm3,%xmm0
       │       vpshuf $0x4e,%xmm0,%xmm1
  0.07 │       vpaddd %xmm1,%xmm0,%xmm0
  0.02 │       vpshuf $0xe5,%xmm0,%xmm1

while GCC10 does:

       │ 1c0:   vxorps %xmm0,%xmm0,%xmm0
       │        mov    %rbx,%rax
       │        nop
  2.25 │ 1d0:┌─→vpaddd (%rax),%xmm0,%xmm0
  0.01 │     │  lea    0x2100(%rsp),%rdi
  0.95 │     │  add    $0x10,%rax
  1.04 │     │  cmp    %rax,%rdi
  2.24 │     └──jne    1d0

Which runs slower:

test description                                         absolute   operations   ratio with
number                                                   time       per second   test0

 0 "int32_t accumulate pointer verify2"                  1.06 sec   12440.17 M   1.00
 1 "int32_t accumulate vector iterator"                  1.06 sec   12458.15 M   1.00
 2 "int32_t accumulate pointer reverse reverse"          1.06 sec   12440.34 M   1.00
 3 "int32_t accumulate vector reverse_iterator reverse"  1.05 sec   12602.74 M   0.99
 4 "int32_t accumulate vector iterator reverse reverse"  1.04 sec   12749.27 M   0.98
 5 "int32_t accumulate array Riterator reverse reverse"  1.06 sec   12486.26 M   1.00

Total absolute time for int32_t Vector Accumulate: 6.32 sec

int32_t Vector Accumulate Penalty: 0.99
compared to:

test description                                         absolute   operations   ratio with
number                                                   time       per second   test0

 0 "int32_t accumulate pointer verify2"                  2.29 sec    5773.60 M   1.00
 1 "int32_t accumulate vector iterator"                  2.27 sec    5806.96 M   0.99
 2 "int32_t accumulate pointer reverse reverse"          2.26 sec    5830.72 M   0.99
 3 "int32_t accumulate vector reverse_iterator reverse"  2.27 sec    5827.45 M   0.99
 4 "int32_t accumulate vector iterator reverse reverse"  2.27 sec    5821.29 M   0.99
 5 "int32_t accumulate array Riterator reverse reverse"  2.27 sec    5826.58 M   0.99

Total absolute time for int32_t Vector Accumulate: 13.62 sec

int32_t Vector Accumulate Penalty: 0.99
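
For illustration only, here is a source-level sketch of the difference between the two loops. The four-register Clang loop keeps four independent partial sums and combines them after the loop, while the single-register loop makes each add wait for the previous one. The function names below are hypothetical and are not taken from the benchmark; the sketch works on scalars, whereas the compilers above do the same thing on vector registers. For int the reordering is safe as long as no partial sum overflows; for double it changes rounding, so a compiler may only do it under something like -ffast-math or -fassociative-math.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// One accumulator: every add depends on the result of the previous add,
// so the loop is bound by the latency of a single dependency chain.
std::int32_t sum_single(const std::vector<std::int32_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::int32_t{0});
}

// Hypothetical hand-written variant (not from the benchmark): four
// independent partial sums that are combined once after the loop,
// so up to four adds can be in flight at the same time.
std::int32_t sum_four_accumulators(const std::vector<std::int32_t>& v) {
    std::int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    // Combine the partial sums, then handle the up-to-three leftover elements.
    std::int32_t sum = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i) {
        sum += v[i];
    }
    return sum;
}
```

Both functions return the same value for any input whose partial sums fit in int32_t; whether the second form is actually faster depends on the target and on what the vectorizer does with it.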