[Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211

amker.cheng at gmail dot com Tue, 19 Aug 2014 03:52:31 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62178


bin.cheng <amker.cheng at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker.cheng at gmail dot com

--- Comment #3 from bin.cheng <amker.cheng at gmail dot com> ---
I think it's a flaw in iv candidates choosing algorithm revealed by my patch.
Though r211211 does change cost of addressing modes, it doesn't change the cost
of optimal candidate set.  The problem with iv candidates choosing algorithm is
it's a heuristic one and would fail to find the optimal set for this specific
case.

In details, the only cost differences between r211210/r211211 is like below 
***************
*** 1,8 ****
  Use 1:
    cand    cost    compl.    depends on
!   1    16    1    1
    9    1    0    
!   10    4    1    1
    11    1    0    
!   12    5    0    
!   14    8    1    1
--- 1,8 ----
  Use 1:
    cand    cost    compl.    depends on
!   1    13    1    1
    9    1    0    
!   10    1    1    1
    11    1    0    
!   12    1    1    
!   14    5    1    1

The final candidates set choosed by r211210 is like below.

Initial set of candidates:
  cost: 19 (complexity 2)
  cand_cost: 10
  cand_use_cost: 5 (complexity 2)
  candidates: 11, 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

Improved to:
  cost: 15 (complexity 0)
  cand_cost: 10
  cand_use_cost: 2 (complexity 0)
  candidates: 11, 13
   use:0 --> iv_cand:13, cost=(1,0)
   use:1 --> iv_cand:11, cost=(1,0)
   use:2 --> iv_cand:11, cost=(0,0)
  invariants 1

The final candidates set choosed by r211211 is like below.

Initial set of candidates:
  cost: 17 (complexity 3)
  cand_cost: 5
  cand_use_cost: 9 (complexity 3)
  candidates: 14
   use:0 --> iv_cand:14, cost=(4,2)
   use:1 --> iv_cand:14, cost=(5,1)
   use:2 --> iv_cand:14, cost=(0,0)

It is clear, r211211 doesn't change the optimal candidates set (which is
13/11).  It is the algorithm that can't find out the optimal set because it's
heuristic and would fail on this one.

With manual changes of candidates set, the diff of assembly code is like below.
*** 6,46 ****
      .type    Intmm, %function
  Intmm:
      movi    v3.2s, 0
      adrp    x6, a+128
!     adrp    x8, r+128
!     adrp    x10, r+3848
!     adrp    x9, b+128
!     adrp    x7, b+248
      add    x6, x6, :lo12:a+128
!     add    x8, x8, :lo12:r+128
!     add    x10, x10, :lo12:r+3848
!     add    x9, x9, :lo12:b+128
!     add    x7, x7, :lo12:b+248
  .L2:
!     mov    x5, x8
!     mov    x4, x8
!     mov    x3, x9
  .L4:
!     str    d3, [x4]
!     add    x2, x3, 3720
      movi    v0.2s, 0
      mov    x1, x6
!     mov    x0, x3
  .L3:
!     ldr    d1, [x0]
!     add    x0, x0, 124
      ld1r    {v2.2s}, [x1], 4
!     cmp    x0, x2
      mla    v0.2s, v2.2s, v1.2s
      bne    .L3
!     str    d0, [x5], 8
      add    x3, x3, 8
!     cmp    x3, x7
!     add    x4, x4, 8
      bne    .L4
!     add    x8, x8, 124
      add    x6, x6, 124
!     cmp    x8, x10
      bne    .L2
      ret
      .size    Intmm, .-Intmm
--- 6,42 ----
      .type    Intmm, %function
  Intmm:
      movi    v3.2s, 0
+     adrp    x4, r+128
      adrp    x6, a+128
!     adrp    x7, r+3848
!     adrp    x5, b
!     add    x4, x4, :lo12:r+128
      add    x6, x6, :lo12:a+128
!     add    x7, x7, :lo12:r+3848
!     add    x5, x5, :lo12:b
  .L2:
!     mov    x3, 0
  .L4:
!     str    d3, [x4, x3]
!     add    x0, x3, 128
      movi    v0.2s, 0
+     add    x2, x3, 3848
+     add    x2, x2, x5
      mov    x1, x6
!     add    x0, x5, x0
  .L3:
!     ldr    d1, [x0], 124
      ld1r    {v2.2s}, [x1], 4
!     cmp    x2, x0
      mla    v0.2s, v2.2s, v1.2s
      bne    .L3
!     str    d0, [x4, x3]
      add    x3, x3, 8
!     cmp    x3, 120
      bne    .L4
!     add    x4, x4, 124
      add    x6, x6, 124
!     cmp    x4, x7
      bne    .L2
      ret
      .size    Intmm, .-Intmm

You can see the inner most loop is back to optimized.  The additinal
instruction in the second loop is caused by addressing mode change, but I think
it can be fixed by enabling auto-increment addressing mode for IVOPT on
aarch64.  YES, we hasn't done that yet.


I will see how the IVOPT candidates choosing algorithm should be improved.

[Bug target/62178] [AArch64] Performance regression on matrix matrix multiply due to r211211

Reply via email to