Re: LTO slows down calculix by more than 10% on aarch64

Prathamesh Kulkarni via Gcc Mon, 21 Sep 2020 02:50:46 -0700

On Fri, 4 Sep 2020 at 17:08, Alexander Monakov <[email protected]> wrote:
>
> > I obtained perf stat results for following benchmark runs:
> >
> > -O2:
> >
> >     7856832.692380      task-clock (msec)         #    1.000 CPUs utilized
> >               3758      context-switches          #    0.000 K/sec
> >                 40      cpu-migrations            #    0.000 K/sec
> >              40847      page-faults               #    0.005 K/sec
> >      7856782413676      cycles                    #    1.000 GHz
> >      6034510093417      instructions              #    0.77  insn per cycle
> >       363937274287      branches                  #   46.321 M/sec
> >        48557110132      branch-misses             #   13.34% of all branches
>
> (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> enough for this kind of code)
>
> > -O2 with orthonl inlined:
> >
> >     8319643.114380      task-clock (msec)         #    1.000 CPUs utilized
> >               4285      context-switches          #    0.001 K/sec
> >                 28      cpu-migrations            #    0.000 K/sec
> >              40843      page-faults               #    0.005 K/sec
> >      8319591038295      cycles                    #    1.000 GHz
> >      6276338800377      instructions              #    0.75  insn per cycle
> >       467400726106      branches                  #   56.180 M/sec
> >        45986364011      branch-misses             #    9.84% of all branches
>
> So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably
> implying that extra instructions are appearing in this loop nest, but not
> in the innermost loop. As a reminder for others, the innermost loop has
> only 3 iterations.
>
> > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> >
> >    8207331.088040      task-clock (msec)   #    1.000 CPUs utilized
> >               2266               context-switches    #    0.000 K/sec
> >                 32                 cpu-migrations       #    0.000 K/sec
> >              40846              page-faults             #    0.005 K/sec
> >      8207292032467      cycles                     #   1.000 GHz
> >      6035724436440      instructions             #    0.74  insn per cycle
> >       364415440156       branches                 #   44.401 M/sec
> >        53138327276        branch-misses         #   14.58% of all branches
>
> This seems to match baseline in terms of instruction count, but without PRE
> the loop nest may be carrying some dependencies over memory. I would simply
> check the assembly for the entire 6-level loop nest in question, I hope it's
> not very complicated (though Fortran array addressing...).
>
> > -O2 with orthonl inlined and hoisting disabled:
> >
> >     7797265.206850      task-clock (msec)         #    1.000 CPUs utilized
> >               3139      context-switches          #    0.000 K/sec
> >                 20      cpu-migrations            #    0.000 K/sec
> >              40846      page-faults               #    0.005 K/sec
> >      7797221351467      cycles                    #    1.000 GHz
> >      6187348757324      instructions              #    0.79  insn per cycle
> >       461840800061      branches                  #   59.231 M/sec
> >        26920311761      branch-misses             #    5.83% of all branches
>
> There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle
> count. I don't think the former fully covers the latter (there's also a
> 90e9 reduction in insn count).
>
> Given that the inner loop iterates only 3 times, my main suggestion is to
> consider how the profile for the entire loop nest looks like (it's 6 loops
> deep, each iterating only 3 times).
>
> > Perf profiles for
> > -O2 -fno-code-hoisting and inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> >      3196866 │1f04:   ldur   d1, [x1, #-248]
> > 216348301800 │        add    w0, w0, #0x1
> >       985098 │        add    x2, x2, #0x18
> > 216215999206 │        add    x1, x1, #0x48
> > 215630376504 │        fmul   d1, d5, d1
> > 863829148015 │        fmul   d1, d1, d6
> > 864228353526 │        fmul   d0, d1, d0
> > 864568163014 │        fmadd  d2, d0, d16, d2
> >              │        cmp    w0, #0x4
> > 216125427594 │      ↓ b.eq   1f34
> >     15010377 │        ldur   d0, [x2, #-8]
> > 143753737468 │      ↑ b      1f04
> >
> > -O2 with inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> > 359871503840 │1ef8:   ldur   d15, [x1, #-248]
> > 144055883055 │        add    w0, w0, #0x1
> >  72262104254 │        add    x2, x2, #0x18
> > 143991169721 │        add    x1, x1, #0x48
> > 288648917780 │        fmul   d15, d17, d15
> > 864665644756 │        fmul   d15, d15, d18
> > 863868426387 │        fmul   d14, d15, d14
> > 865228159813 │        fmadd  d16, d14, d31, d16
> >       245967 │        cmp    w0, #0x4
> > 215396760545 │      ↓ b.eq   1f28
> >    704732365 │        ldur   d14, [x2, #-8]
> > 143775979620 │      ↑ b      1ef8
>
> This indicates that the loop only covers about 46-48% of overall time.
>
> High count on the initial ldur instruction could be explained if the loop
> is not entered by "fallthru" from the preceding block, or if its backedge
> is mispredicted. Sampling mispredictions should be possible with perf record,
> and you may be able to check if loop entry is fallthrough by inspecting
> assembly.
>
> It may also be possible to check if code alignment matters, by compiling with
> -falign-loops=32.
Hi,
Thanks a lot for the detailed feedback, and I am sorry for the late response.


The hoisting region is:
if(mattyp.eq.1) then
  4 loops
elseif(mattyp.eq.2) then
  {
     orthonl inlined into basic block;
     loads w[0] .. w[8]
  }
else
   6 loops  // load anisox

followed by basic block:

 senergy=
     &                    (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &                    +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &                    +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
                     s(ii1,jj1)=s(ii1,jj1)+senergy
                     s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
                     s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy

Code hoisting moves the loads of w[0] .. w[8] from the orthonl block and
the senergy block up into block 181, which is:
if (mattyp.eq.2) goto <bb 182> else goto <bb 193>

The loads are then hoisted further, into block 173:
if (mattyp.eq.1) goto <bb 392> else goto <bb 181>

From block 181, we have two paths towards the senergy block (bb 194):
bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
AND
bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
which has a path length of around 18 blocks.
(bb 194 post-dominates bb 181 and bb 173).
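
The post-dominance claim can be checked on a toy model of this region. The sketch below is only an illustration under stated assumptions: each multi-block region is collapsed to a single node, only bb173, bb181, bb182 and bb194 are block numbers from this mail, and the names "loops4", "loops6" and "exit" are made up. It uses the textbook iterative dataflow formulation, not anything taken from GCC.

```python
# Simplified, hypothetical CFG of the hoisting region.
# "loops4", "loops6" and "exit" are invented names for collapsed regions.
succs = {
    "bb173": ["loops4", "bb181"],  # if (mattyp.eq.1)
    "bb181": ["bb182", "loops6"],  # if (mattyp.eq.2)
    "loops4": ["bb194"],           # 4-loop case, collapsed
    "bb182": ["bb194"],            # inlined orthonl
    "loops6": ["bb194"],           # 6-loop case, collapsed (~18 blocks in reality)
    "bb194": ["exit"],             # senergy block
    "exit": [],
}


def post_dominators(succs, exit_node):
    """pdom(n) = {n} | intersection of pdom(s) over all successors s of n."""
    nodes = set(succs)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = set(nodes)
            for s in succs[n]:
                new &= pdom[s]
            new |= {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


pdom = post_dominators(succs, "exit")
print("bb194 post-dominates bb181:", "bb194" in pdom["bb181"])
print("bb194 post-dominates bb173:", "bb194" in pdom["bb173"])
print("bb182 post-dominates bb181:", "bb182" in pdom["bb181"])
```

In this model every path leaving bb173 or bb181 reaches bb194, while bb182 lies on only one of them. That is presumably why the loads of 'w' in the senergy block look hoistable from the pass's point of view, even though the long 6-loop path then has to carry the loaded values across the whole loop nest.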

Disabling only load hoisting within blocks 173 and 181
(by simply not inserting a pre_expr when pre_expr->kind == REFERENCE)
avoids hoisting the 'w' array and brings back most of the performance.
Does that verify that it is the hoisting of the 'w' array
(w[0] ... w[8]) that is causing the slowdown?

I obtained perf profiles of the 6 loops with full hoisting and with
hoisting of the 'w' array disabled, and the most drastic difference was
on the ldur instruction:

With full hoisting:
359871503840│ 1ef8:   ldur   d15, [x1, #-248]

With hoisting of the 'w' array disabled:
3441224 │1edc:   ldur   d1, [x1, #-248]

(The loop entry seems to be fall-through in both cases. I have attached
profiles for both cases.)
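
To put the ldur difference in context, the per-instruction counts of the innermost loop can be summed for both profiles. The sketch below only does arithmetic on the numbers in the two attached profiles (loop at 1ef8 with full hoisting, loop at 1edc with hoisting of 'w' disabled). A missing count is recorded as 0, and the variable names are made up for illustration.

```python
# Sample counts of the innermost loop, copied from the attached profiles.
# Order: ldur, add w0, add x2, add x1, fmul, fmul, fmul, fmadd,
#        cmp, b.eq, ldur (second load), b
full_hoisting = [
    359871503840, 144055883055, 72262104254, 143991169721,
    288648917780, 864665644756, 863868426387, 865228159813,
    245967, 215396760545, 704732365, 143775979620,
]
w_hoisting_disabled = [
    3441224, 216111094536, 1473566, 215873683406,
    216166335905, 864007322335, 863815029515, 864900327399,
    0, 216329679631, 22872044, 143941131893,
]

total_full = sum(full_hoisting)
total_disabled = sum(w_hoisting_disabled)
loop_delta = total_full - total_disabled

# Difference on the first ldur alone.
ldur_delta = full_hoisting[0] - w_hoisting_disabled[0]

print("inner loop, full hoisting:       ", total_full)
print("inner loop, 'w' hoisting disabled:", total_disabled)
print("difference over the whole loop:   ", loop_delta)
print("difference on the first ldur:     ", ldur_delta)
print("ldur share of the loop difference: %.2f" % (ldur_delta / loop_delta))
```

If the counts are transcribed correctly, the gap over the whole innermost loop is of the same order as the gap on the first ldur alone, so the extra samples are not simply shifted between neighbouring instructions. Sampling skid can still attribute a stall to a later instruction than the one causing it, so the ldur may not be the true culprit.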

IIUC, the instruction is loading the first element of the anisox array,
which makes me wonder whether the issue is a data-cache miss in the slower
version. I ran perf script on perf data for L1-dcache-load-misses with
period = 1 million: it reported two cache misses on the ldur instruction
in the full hoisting case and zero in the case with load hoisting disabled.
So I wonder if the slowdown happens because hoisting the 'w' array
possibly evicts anisox, causing a cache miss inside the inner loop
and making the load slower?

Hoisting also seems to reduce the overall number of cache misses, though.
With hoisting of the 'w' array disabled there were a total of 463
cache misses, while with full hoisting there were 357
(with period = 1 million).
Does that happen because hoisting reduces cache misses along
the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
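
As a rough sanity check of the cache-miss hypothesis, the sample counts can be scaled by the sampling period. The sketch below assumes that each sample stands for roughly one period's worth of events and that the counts in the annotated profiles are cycles. The miss penalty of 200 cycles is a purely hypothetical figure chosen for scale, not a measured value for this machine.

```python
PERIOD = 1_000_000  # sampling period used for L1-dcache-load-misses

# Number of reported samples: total, and on the ldur instruction.
samples = {
    "full hoisting": {"total": 357, "on_ldur": 2},
    "'w' hoisting disabled": {"total": 463, "on_ldur": 0},
}


def estimated_events(n_samples, period=PERIOD):
    """With a fixed period, each sample represents about `period` events."""
    return n_samples * period


for name, s in samples.items():
    print(name,
          "total misses ~", estimated_events(s["total"]),
          "misses on ldur ~", estimated_events(s["on_ldur"]))

# Hypothetical cost of one miss in cycles (an assumption, not a measurement).
ASSUMED_MISS_PENALTY = 200
ldur_miss_cycles = estimated_events(2) * ASSUMED_MISS_PENALTY

# Extra samples on the ldur between the two annotated profiles.
ldur_sample_delta = 359871503840 - 3441224

print("estimated miss cycles on ldur:", ldur_miss_cycles)
print("fraction of the ldur difference: %.4f"
      % (ldur_miss_cycles / ldur_sample_delta))
```

Under those assumptions the two sampled misses would account for only a small fraction of the extra samples on the ldur, and two samples are too few to be statistically meaningful. It may therefore be worth recording with a smaller period and checking other explanations as well.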

Thanks,
Prathamesh
>
> Alexander

   884982389 │1e40:   ldr    x0, [sp, #448]
             │        fmov   d19, d6
   871517886 │        ldr    x1, [sp, #808]
             │        add    x16, sp, #0x720
   904652642 │        ldr    x13, [sp, #784]
             │        sub    x15, x26, #0x1
   892180199 │        mov    x24, x27
             │        add    x28, x27, #0xf8
   881362543 │        add    x22, x1, x0, lsl #3
             │        mov    x12, #0x9                       // #9
   906876972 │        mov    x23, #0x1                       // #1
  5342906864 │1e6c:   fmov   d17, d1
  2622786801 │        mov    x14, #0x1778                    // #6008
             │        mov    x20, x28
  2680397945 │        add    x19, sp, x14
             │        mov    x18, x24
  2629152729 │        mov    x21, x30
             │        ldr    d16, [x22]
  4571598336 │        mov    x17, #0x1e                      // #30
 15904018941 │1e8c:   mov    x11, x19
  8106237022 │        mov    x10, x20
             │        mov    x14, x21
  7958740225 │        mov    x9, x18
             │        mov    x8, #0x1b                       // #27
 41353477432 │1ea0:   ldr    d14, [x9]
  1220553185 │        fmov   d18, d22
 22852558475 │        fmov   d20, d19
  1199867833 │        mov    x3, x11
 22706386191 │        mov    x7, x16
  1177543527 │        mov    x6, x10
 22767111709 │        fmul   d14, d17, d14
  1195454897 │        mov    x5, #0x1                        // #1
 94868835951 │        fmadd  d16, d14, d31, d16
 48021203056 │1ec4:   ldur   d15, [x6, #-248]
 30707657072 │        sub    x4, x3, #0x140
 41301831015 │        fmov   d14, d19
 32467499777 │        mov    x2, x13
 39498561992 │        mov    x1, x3
 32503985332 │        mov    w0, #0x1                        // #1
 39636367978 │        fmul   d15, d17, d15
 56642417403 │        ldr    d21, [x4, x12, lsl #3]
215900325343 │        fmul   d21, d17, d21
 49939836468 │        fmul   d15, d15, d20
238451679574 │        fmul   d20, d21, d18
 49692127013 │        fmadd  d15, d15, d31, d16
287649913912 │        fmadd  d16, d20, d31, d15
359871503840 │1ef8:   ldur   d15, [x1, #-248]
144055883055 │        add    w0, w0, #0x1
 72262104254 │        add    x2, x2, #0x18
143991169721 │        add    x1, x1, #0x48
288648917780 │        fmul   d15, d17, d15
864665644756 │        fmul   d15, d15, d18
863868426387 │        fmul   d14, d15, d14
865228159813 │        fmadd  d16, d14, d31, d16
      245967 │        cmp    w0, #0x4
215396760545 │      ↓ b.eq   1f28
   704732365 │        ldur   d14, [x2, #-8]
143775979620 │      ↑ b      1ef8
  2623253706 │1f28:   add    x5, x5, #0x1
 71700007726 │        add    x6, x6, #0x48
   291326727 │        add    x3, x3, #0x8
 41539387956 │        cmp    x5, #0x4
   291327452 │      ↓ b.eq   1f4c
152721910227 │        ldr    d18, [x7, x15, lsl #3]
  8561615599 │        add    x7, x7, #0x18
 96142935717 │        ldur   d20, [x7, #-24]
  8495464096 │      ↑ b      1ec4
201164546300 │1f4c:   add    x8, x8, #0x1b
 22086088222 │        add    x9, x9, #0xd8
  1882100212 │        add    x14, x14, #0x18
 22119311849 │        add    x10, x10, #0xd8
  1892034271 │        add    x11, x11, #0xd8
 13413581701 │        cmp    x8, #0x6c
  1191551884 │      ↓ b.eq   1f70
 26310755425 │        ldur   d17, [x14, #-8]
  1210506566 │      ↑ b      1ea0
 71960439728 │1f70:   add    x17, x17, #0x3
             │        add    x18, x18, #0x18
  8069920125 │        add    x20, x20, #0x18
             │        add    x19, x19, #0x18
  4645045210 │        cmp    x17, #0x27
             │      ↓ b.eq   1f90
 10962695888 │        ldr    d17, [x21], #8
             │      ↑ b      1e8c
 23927242012 │1f90:   add    x23, x23, #0x1
             │        str    d16, [x22]
  2672842806 │        add    x16, x16, #0x8
             │        add    x12, x12, #0x9
  2653094829 │        sub    x15, x15, #0x1
             │        add    x24, x24, #0x48
  2692030697 │        add    x22, x22, #0x1e0
             │        cmp    x23, #0x4
  1721216607 │      ↓ b.eq   1fbc
   448331273 │        ldr    d19, [x13], #8
  1778236919 │      ↑ b      1e6c
  7971009272 │1fbc:   ldr    x0, [sp, #448]
   911313572 │        add    x26, x26, #0x1
             │        add    x27, x27, #0x8
   902215785 │        add    x0, x0, #0x1
             │        str    x0, [sp, #448]
   478032817 │        cmp    x26, #0x4
             │      ↓ b.eq   1fe8
  1475545769 │        add    x0, sp, #0x708
             │        add    x0, x0, x26, lsl #3
  1806982272 │        ldur   d22, [x0, #-8]
             │      ↑ b      1e40

   589937229 │1e30:   mov    x15, #0x1760                    // #5984
   904297989 │        add    x0, sp, x15
   870649879 │        add    x22, x0, x22
             │        fmov   d7, d24
   891274869 │        ldr    x0, [sp, #448]
             │        add    x14, sp, #0x710
   909978719 │        ldr    x12, [sp, #728]
             │        sub    x13, x27, #0x1
   882715766 │        add    x18, x28, #0x8
             │        sub    x19, x0, x28
   885884552 │        mov    x9, #0x9                        // #9
             │        mov    x20, #0x1                       // #1
  6279074827 │1e60:   mov    x17, x22
             │        mov    x16, x30
  2666213304 │        ldr    d2, [x19]
             │        mov    x15, #0x3                       // #3
 18990367400 │1e70:   mov    x8, x17
             │        mov    x11, x16
  8057495884 │        mov    x10, #0x1b                      // #27
 14947123246 │1e7c:   sub    x0, x8, #0x140
 22985623052 │        ldur   d5, [x11, #-8]
  1060364445 │        fmov   d6, d8
 23956420799 │        fmov   d3, d7
             │        add    x3, x18, x8
 24065319873 │        mov    x7, x14
             │        ldr    d0, [x0, x9, lsl #3]
 24187025828 │        mov    x6, x8
             │        mov    x5, #0x1                        // #1
 48132474841 │        fmul   d0, d5, d0
 96001335773 │        fmadd  d2, d0, d16, d2
 61067761742 │1ea8:   ldur   d4, [x6, #-248]
 14089308947 │        sub    x4, x3, #0x140
 58091146403 │        fmov   d0, d7
 14028168886 │        mov    x2, x12
 57897209384 │        mov    x1, x3
 13994185270 │        mov    w0, #0x1                        // #1
 67891460180 │        fmul   d4, d5, d4
 28006688701 │        ldr    d1, [x4, x9, lsl #3]
215655048826 │        fmul   d1, d5, d1
 57701202743 │        fmul   d3, d4, d3
230116393416 │        fmul   d1, d1, d6
 57977229144 │        fmadd  d2, d3, d16, d2
301775181164 │        fmadd  d2, d1, d16, d2
     3441224 │1edc:   ldur   d1, [x1, #-248]
216111094536 │        add    w0, w0, #0x1
     1473566 │        add    x2, x2, #0x18
215873683406 │        add    x1, x1, #0x48
216166335905 │        fmul   d1, d5, d1
864007322335 │        fmul   d1, d1, d6
863815029515 │        fmul   d0, d1, d0
864900327399 │        fmadd  d2, d0, d16, d2
             │        cmp    w0, #0x4
216329679631 │      ↓ b.eq   1f0c
    22872044 │        ldur   d0, [x2, #-8]
143941131893 │      ↑ b      1edc
   277804663 │1f0c:   add    x5, x5, #0x1
 72179847520 │        add    x6, x6, #0x48
             │        add    x3, x3, #0x8
 65738463940 │        cmp    x5, #0x4
             │      ↓ b.eq   1f30
123097375558 │        ldr    d6, [x7, x13, lsl #3]
             │        add    x7, x7, #0x18
 96061189670 │        ldur   d3, [x7, #-24]
             │      ↑ b      1ea8
 42647845407 │1f30:   add    x10, x10, #0x1b
             │        add    x11, x11, #0x18
 24141022972 │        add    x8, x8, #0xd8
             │        cmp    x10, #0x6c
 14573046432 │      ↑ b.ne   1e7c
 72139544087 │        add    x15, x15, #0x3
  8028370830 │        add    x16, x16, #0x8
             │        add    x17, x17, #0x18
  4860057143 │        cmp    x15, #0xc
             │      ↑ b.ne   1e70
 23912996709 │        add    x20, x20, #0x1
             │        str    d2, [x19]
  2670529487 │        add    x14, x14, #0x8
             │        add    x9, x9, #0x9
  2659625346 │        sub    x13, x13, #0x1
             │        add    x19, x19, #0x1e0
  1606030574 │        cmp    x20, #0x4
             │      ↓ b.eq   1f80
  3096553445 │        ldr    d7, [x12], #8
             │      ↑ b      1e60
  7964390214 │1f80:   add    x27, x27, #0x1
             │        sub    x28, x28, #0x8
   529029469 │        cmp    x27, #0x4
             │      ↓ b.eq   2028
  1176126379 │        lsl    x22, x27, #3
             │        add    x0, sp, #0x6f8
   593893747 │        add    x0, x0, x22
  1798781807 │        ldur   d8, [x0, #-8]
   580872685 │      ↑ b      1e30
