[Bug tree-optimization/93055] New: accumulation loops in stepanov_vector benchmark use more instruction level parallelism

hubicka at gcc dot gnu.org Mon, 23 Dec 2019 11:29:13 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93055


            Bug ID: 93055
           Summary: accumulation loops in stepanov_vector benchmark use
                    more instruction level parallelism
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hubicka at gcc dot gnu.org
  Target Milestone: ---

The stepanov_vector benchmark from
https://gitlab.com/chriscox/CppPerformanceBenchmarks gets poor codegen on
TestOneType<double>

Built with -march=bdver1 -O3 (but the regression happens on core too)

Clang compiles accumulation loops for testOneType<int> as follows:

       │        vpxor  %xmm0,%xmm0,%xmm0
       │        vpxor  %xmm1,%xmm1,%xmm1 
       │        vpxor  %xmm2,%xmm2,%xmm2
  0.05 │        vpxor  %xmm3,%xmm3,%xmm3
       │        data16 nopw %cs:0x0(%rax,%rax,1)
  6.95 │ 300:┌─→vpaddd 0x5f0(%rsp,%rcx,4),%xmm0,%xmm0 
  0.05 │     │  vpaddd 0x600(%rsp,%rcx,4),%xmm1,%xmm1
  7.13 │     │  vpaddd 0x610(%rsp,%rcx,4),%xmm2,%xmm2
  0.16 │     │  vpaddd 0x620(%rsp,%rcx,4),%xmm3,%xmm3
       │     │  add    $0x10,%rcx
       │     │  cmp    $0x7dc,%rcx
  7.04 │     └──jne    300
  0.07 │        vpaddd %xmm0,%xmm1,%xmm0
  1.61 │        vpaddd %xmm0,%xmm2,%xmm0
       │        vpaddd %xmm0,%xmm3,%xmm0
       │        vpshuf $0x4e,%xmm0,%xmm1
  0.07 │        vpaddd %xmm1,%xmm0,%xmm0 
  0.02 │        vpshuf $0xe5,%xmm0,%xmm1

while GCC10 does:

       │ 1c0:   vxorps %xmm0,%xmm0,%xmm0 
       │        mov    %rbx,%rax
       │        nop
  2.25 │ 1d0:┌─→vpaddd (%rax),%xmm0,%xmm0 
  0.01 │     │  lea    0x2100(%rsp),%rdi
  0.95 │     │  add    $0x10,%rax
  1.04 │     │  cmp    %rax,%rdi
  2.24 │     └──jne    1d0  
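
The difference between the two listings can be sketched in C++ as below. This is only an illustration, not code from the benchmark: the names sum_single and sum_four are hypothetical, the sketch is scalar where both compilers actually emit vector code, and it assumes the sum does not overflow int32_t. The single-accumulator loop forms one dependency chain through its accumulator, as in the GCC loop; the four-accumulator loop keeps four independent chains that are combined only after the loop, as in the Clang loop.

```cpp
#include <cstddef>
#include <cstdint>

// One accumulator: each add has to wait for the previous add to finish,
// so the loop is limited by the latency of a single dependency chain.
std::int32_t sum_single(const std::int32_t* first, const std::int32_t* last) {
    std::int32_t acc = 0;
    for (; first != last; ++first)
        acc += *first;
    return acc;
}

// Four accumulators: the four adds in one iteration do not depend on each
// other, so the CPU can overlap them. The partial sums are combined once
// after the loop. Assumes no signed overflow in any partial sum.
std::int32_t sum_four(const std::int32_t* first, const std::int32_t* last) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    std::int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += first[i];
        a1 += first[i + 1];
        a2 += first[i + 2];
        a3 += first[i + 3];
    }
    std::int32_t acc = (a0 + a1) + (a2 + a3);
    // Handle the 0..3 elements left over when n is not a multiple of 4.
    for (; i < n; ++i)
        acc += first[i];
    return acc;
}
```

Both functions return the same value for the same input as long as no partial sum overflows; only the shape of the dependency chain differs.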

Which runs slower:

test                                        description   absolute   operations   ratio with
number                                                    time       per second   test0

 0                 "int32_t accumulate pointer verify2"   1.06 sec   12440.17 M   1.00
 1                 "int32_t accumulate vector iterator"   1.06 sec   12458.15 M   1.00
 2         "int32_t accumulate pointer reverse reverse"   1.06 sec   12440.34 M   1.00
 3 "int32_t accumulate vector reverse_iterator reverse"   1.05 sec   12602.74 M   0.99
 4 "int32_t accumulate vector iterator reverse reverse"   1.04 sec   12749.27 M   0.98
 5 "int32_t accumulate array Riterator reverse reverse"   1.06 sec   12486.26 M   1.00

Total absolute time for int32_t Vector Accumulate: 6.32 sec                     

int32_t Vector Accumulate Penalty: 0.99                                         

compared to:
test                                        description   absolute   operations   ratio with
number                                                    time       per second   test0

 0                 "int32_t accumulate pointer verify2"   2.29 sec   5773.60 M    1.00
 1                 "int32_t accumulate vector iterator"   2.27 sec   5806.96 M    0.99
 2         "int32_t accumulate pointer reverse reverse"   2.26 sec   5830.72 M    0.99
 3 "int32_t accumulate vector reverse_iterator reverse"   2.27 sec   5827.45 M    0.99
 4 "int32_t accumulate vector iterator reverse reverse"   2.27 sec   5821.29 M    0.99
 5 "int32_t accumulate array Riterator reverse reverse"   2.27 sec   5826.58 M    0.99

Total absolute time for int32_t Vector Accumulate: 13.62 sec                    

int32_t Vector Accumulate Penalty: 0.99
