On Fri, 4 Sep 2020 at 17:08, Alexander Monakov wrote:
>
> > I obtained perf stat results for following benchmark runs:
> >
> > -O2:
> >
> >     7856832.692380  task-clock (msec)   #    1.000 CPUs utilized
> >               3758  context-switches    #    0.000 K/sec
> >                 40  cpu-migrations      #    0.000 K/sec
> >              40847  page-faults         #    0.005 K/sec
> >      7856782413676  cycles              #    1.000 GHz
> >      6034510093417  instructions        #    0.77  insn per cycle
> >       363937274287  branches            #   46.321 M/sec
> >        48557110132  branch-misses       #   13.34% of all branches
>
> (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> enough for this kind of code)
>
> > -O2 with orthonl inlined:
> >
> >     8319643.114380  task-clock (msec)   #    1.000 CPUs utilized
> >               4285  context-switches    #    0.001 K/sec
> >                 28  cpu-migrations      #    0.000 K/sec
> >              40843  page-faults         #    0.005 K/sec
> >      8319591038295  cycles              #    1.000 GHz
> >      6276338800377  instructions        #    0.75  insn per cycle
> >       467400726106  branches            #   56.180 M/sec
> >        45986364011  branch-misses       #    9.84% of all branches
>
> So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably
> implying that extra instructions are appearing in this loop nest, but not
> in the innermost loop. As a reminder for others, the innermost loop has
> only 3 iterations.
>
> > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> >
> >     8207331.088040  task-clock (msec)   #    1.000 CPUs utilized
> >               2266  context-switches    #    0.000 K/sec
> >                 32  cpu-migrations      #    0.000 K/sec
> >              40846  page-faults         #    0.005 K/sec
> >      8207292032467  cycles              #    1.000 GHz
> >      6035724436440  instructions        #    0.74  insn per cycle
> >       364415440156  branches            #   44.401 M/sec
> >        53138327276  branch-misses       #   14.58% of all branches
>
> This seems to match baseline in terms of instruction count, but without PRE
> the loop nest may be carrying some dependencies over memory. I would simply
> check the assembly for the entire 6-level loop nest in question, I hope it's
> not very complicated (though Fortran array addressing...).
>
> > -O2 with orthonl inlined and hoisting disabled:
> >
> >     7797265.206850  task-clock (msec)   #    1.000 CPUs utilized
> >               3139  context-switches    #    0.000 K/sec
> >                 20  cpu-migrations      #    0.000 K/sec
> >              40846  page-faults         #    0.005 K/sec
> >      7797221351467  cycles              #    1.000 GHz
> >      6187348757324  instructions        #    0.79  insn per cycle
> >       461840800061  branches            #   59.231 M/sec
> >        26920311761  branch-misses       #    5.83% of all branches
>
> There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle
> count. I don't think the former fully covers the latter (there's also a
> 90e9 reduction in insn count).
>
> Given that the inner loop iterates only 3 times, my main suggestion is to
> consider how the profile for the entire loop nest looks like (it's 6 loops
> deep, each iterating only 3 times).
>
> > Perf profiles for
> > -O2 -fno-code-hoisting and inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> >      3196866 │1f04:  ldur d1, [x1, #-248]
> > 216348301800 │       add w0, w0, #0x1
> >       985098 │       add x2, x2, #0x18
> > 216215999206 │       add x1, x1, #0x48
> > 215630376504 │       fmul d1, d5, d1
> > 863829148015 │       fmul d1, d1, d6
> > 864228353526 │       fmul d0, d1, d0
> > 864568163014 │       fmadd d2, d0, d16, d2
> >              │       cmp w0, #0x4
> > 216125427594 │     ↓ b.eq 1f34
> >     15010377 │       ldur d0, [x2, #-8]
> > 143753737468 │     ↑ b 1f04
> >
> > -O2 with inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> > 359871503840 │1ef8:  ldur d15, [x1, #-248]
> > 144055883055 │       add w0, w0, #0x1
> >  72262104254 │       add x2, x2, #0x18
> > 143991169721 │       add x1, x1, #0x48
> > 288648917780 │       fmul d15, d17, d15
> > 864665644756 │       fmul d15, d15, d18
> > 863868426387 │       fmul d14, d15, d14
> > 865228159813 │       fmadd d16, d14, d31, d16
> >       245967 │       cmp w0, #0x4
> > 215396760545 │     ↓ b.eq 1f28
> >    704732365 │       ldur d14, [x2, #-8]
> > 143775979620 │     ↑ b 1ef8
>
> This indicates that the loop only covers about 46-48% of overall time.
>
> High count on the initial ldur instruction could be explained if the loop
> is not entered by "fallthru" from the preceding block, or if its backedge
> is mispredicted. Sampling mispredictions should be possible with perf record,
> and you may be able to check if loop entry is fallthrough by inspecting
> assembly.
>
> It may also be possible to check if code alignment matters, by compiling with
> -falign-loops=32.

Hi,
Thanks a lot for the detailed feedback, and I am sorry for the late response.

The hoisting region is:
if(mattyp.eq.1) then
  4 loops
elseif(mattyp.eq.2) then
  {
     orthonl inlined into basic block;
     loads w[0] .. w[8]
  }
else
  6 loops  // load anisox

followed by basic block:

      senergy=
     &      (s11*w(1,1)+s12*(w(1,2)+w(2,1))
     &      +s13*(w(1,3)+w(3,1))+s22*w(2,2)
     &      +s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
      s(ii1,jj1)=s(ii1,jj1)+senergy
      s(ii1+1,jj1+1)=s(ii1+1,jj1+1)+senergy
      s(ii1+2,jj1+2)=s(ii1+2,jj1+2)+senergy

Code hoisting moves the loads of w[0] .. w[8] from the orthonl and senergy
blocks up into block 181, which is:
if (mattyp.eq.2) goto <bb 182> else goto <bb 193>
and these loads are then hoisted further, into block 173:
if (mattyp.eq.1) goto <bb 392> else goto <bb 181>
From block 181, we have two paths towards the senergy block (bb 194):
bb 181 -> bb 182 (orthonl block) -> bb 194 (senergy block)
AND
bb 181 -> bb 392 <6 loops pre-header> ... -> bb 194
which has a path length of around 18 blocks.
(bb 194 post-dominates bb 181 and bb 173).
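
To make the post-dominance claim concrete, here is a minimal Python sketch
of the CFG described above. It is only an illustration under assumptions:
the two loop nests are collapsed into single hypothetical nodes
("four_loops", "six_loops"), the "exit" node is made up, and the real CFG
has many more blocks (the six-loop path alone is around 18 blocks long).

```python
from collections import deque

# Simplified CFG. Loop nests are collapsed into single hypothetical nodes,
# and "exit" is an invented sink that stands for everything after bb 194.
CFG = {
    "bb173": ["four_loops", "bb181"],   # if (mattyp.eq.1)
    "bb181": ["bb182", "six_loops"],    # if (mattyp.eq.2)
    "bb182": ["bb194"],                 # orthonl block
    "six_loops": ["bb194"],             # loads anisox
    "four_loops": ["bb194"],
    "bb194": ["exit"],                  # senergy block
    "exit": [],
}


def post_dominates(cfg, pdom, node, exit_node="exit"):
    """Return True if every path from node to exit_node goes through pdom.

    Implemented as a reachability check: walk forward from node while
    refusing to step onto pdom; if exit_node is still reachable, some
    path avoids pdom.
    """
    if node == pdom:
        return True
    seen = {node}
    work = deque([node])
    while work:
        cur = work.popleft()
        if cur == exit_node:
            return False
        for succ in cfg[cur]:
            if succ != pdom and succ not in seen:
                seen.add(succ)
                work.append(succ)
    return True


for block in ("bb181", "bb173"):
    print(block, "post-dominated by bb194:",
          post_dominates(CFG, "bb194", block))
print("bb181 post-dominated by bb182:",
      post_dominates(CFG, "bb182", "bb181"))
```

In this simplified model bb 194 post-dominates both bb 181 and bb 173,
while bb 182 does not post-dominate bb 181, which is why loads needed in
bb 194 are candidates for hoisting even though the six-loop path is long.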
Disabling only load hoisting within blocks 173 and 181
(simply avoid inserting pre_expr if pre_expr->kind == REFERENCE)
avoids hoisting of the 'w' array and brings back most of the performance.
Does that verify that it is the hoisting of the 'w' array
(w[0] ... w[8]) that is causing the slowdown?
I obtained perf profiles for the 6 loops with full hoisting and with
hoisting of the 'w' array disabled, and the most drastic difference was
in the ldur instruction:
With full hoisting:
359871503840│ 1ef8: ldur d15, [x1, #-248]
Without full hoisting:
3441224 │1edc: ldur d1, [x1, #-248]
(The loop entry seems to be fall thru in both cases. I have attached
profiles for both cases).
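
As a rough cross-check, the per-instruction counts of the innermost loop
in the two attached profiles can be summed and compared. The sketch below
is only an aid for reading the numbers: the counts are copied from the
profiles (loop at 1ef8 and loop at 1edc), a missing count is recorded
as 0, and the names FULL_HOISTING, W_HOISTING_DISABLED and total() are
made up for this example.

```python
# Per-instruction sample counts for the innermost loop, copied from the
# two attached profiles. An instruction with no count is recorded as 0.
FULL_HOISTING = [
    ("ldur", 359871503840),
    ("add", 144055883055),
    ("add", 72262104254),
    ("add", 143991169721),
    ("fmul", 288648917780),
    ("fmul", 864665644756),
    ("fmul", 863868426387),
    ("fmadd", 865228159813),
    ("cmp", 245967),
    ("b.eq", 215396760545),
    ("ldur", 704732365),
    ("b", 143775979620),
]

W_HOISTING_DISABLED = [
    ("ldur", 3441224),
    ("add", 216111094536),
    ("add", 1473566),
    ("add", 215873683406),
    ("fmul", 216166335905),
    ("fmul", 864007322335),
    ("fmul", 863815029515),
    ("fmadd", 864900327399),
    ("cmp", 0),
    ("b.eq", 216329679631),
    ("ldur", 22872044),
    ("b", 143941131893),
]


def total(loop, first_n=None):
    """Sum the counts of the whole loop, or of its first first_n rows."""
    rows = loop if first_n is None else loop[:first_n]
    return sum(count for _, count in rows)


# Difference over the whole loop body, and over the first five
# instructions only (first ldur, three adds, first fmul).
loop_diff = total(FULL_HOISTING) - total(W_HOISTING_DISABLED)
head_diff = total(FULL_HOISTING, 5) - total(W_HOISTING_DISABLED, 5)

print(f"inner loop, full hoisting:       {total(FULL_HOISTING):.4g}")
print(f"inner loop, 'w' hoisting off:    {total(W_HOISTING_DISABLED):.4g}")
print(f"difference over whole loop:      {loop_diff:.4g}")
print(f"difference in first 5 insns:     {head_diff:.4g}")
print(f"share of difference in the head: {head_diff / loop_diff:.1%}")
```

With these numbers, almost all of the extra count in the full hoisting
case lands on the first five instructions of the loop rather than on the
fmul/fmadd chain. Since samples can be attributed to neighbouring
instructions, I would read that as "around the loop head" and not
necessarily as the cost of the ldur itself.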
IIUC, this instruction loads the first element of the anisox array,
which makes me wonder whether the slower version suffers from a
data-cache miss there.
I ran perf script on perf data for L1-dcache-load-misses with
period = 1 million; it reported two cache misses on the ldur instruction
in the full hoisting case, and zero in the case with load hoisting disabled.
So I wonder if the slowdown happens because hoisting the loads of the 'w'
array possibly results in eviction of anisox, thus causing a cache miss
inside the inner loop and making the load slower?
Hoisting also seems to reduce the overall number of cache misses, though.
With hoisting of the 'w' array disabled, there were a total of 463
cache misses, while with full hoisting there were 357 cache misses
(with period = 1 million).
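
For scale, here is a small Python sketch that converts these sample
counts into approximate event counts. It assumes that each record printed
by perf script is one sample standing for roughly one period (1 million)
of L1-dcache-load-misses events; the helper names are made up for this
example.

```python
# Sample counts reported above (perf script, L1-dcache-load-misses).
PERIOD = 1_000_000  # events per sample, as used when recording

SAMPLES = {
    "w hoisting disabled": 463,
    "full hoisting": 357,
}
LDUR_SAMPLES_FULL_HOISTING = 2  # samples on the inner-loop ldur


def estimated_events(n_samples, period=PERIOD):
    """Approximate event count, assuming one sample per `period` events."""
    return n_samples * period


def relative_change(old, new):
    """Relative change from old to new, e.g. -0.25 means 25% fewer."""
    return (new - old) / old


for name, n in SAMPLES.items():
    print(f"{name}: about {estimated_events(n):.3g} misses")

change = relative_change(SAMPLES["w hoisting disabled"],
                         SAMPLES["full hoisting"])
ldur_share = LDUR_SAMPLES_FULL_HOISTING / SAMPLES["full hoisting"]

print(f"full hoisting vs 'w' hoisting disabled: {change:+.1%}")
print(f"ldur share of miss samples (full hoisting): {ldur_share:.2%}")
```

Under that assumption, the two samples on the ldur are well under 1% of
all miss samples in the full hoisting case, so they may be too few to
confirm or rule out the eviction theory on their own.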
Does that happen because hoisting reduces cache misses along
the orthonl path (bb 173 -> bb 181 -> bb 182 -> bb 194)?
Thanks,
Prathamesh
>
> Alexander

Profile with full hoisting:

   884982389 │1e40:  ldr x0, [sp, #448]
             │       fmov d19, d6
   871517886 │       ldr x1, [sp, #808]
             │       add x16, sp, #0x720
   904652642 │       ldr x13, [sp, #784]
             │       sub x15, x26, #0x1
   892180199 │       mov x24, x27
             │       add x28, x27, #0xf8
   881362543 │       add x22, x1, x0, lsl #3
             │       mov x12, #0x9                  // #9
   906876972 │       mov x23, #0x1                  // #1
  5342906864 │1e6c:  fmov d17, d1
  2622786801 │       mov x14, #0x1778               // #6008
             │       mov x20, x28
  2680397945 │       add x19, sp, x14
             │       mov x18, x24
  2629152729 │       mov x21, x30
             │       ldr d16, [x22]
  4571598336 │       mov x17, #0x1e                 // #30
 15904018941 │1e8c:  mov x11, x19
  8106237022 │       mov x10, x20
             │       mov x14, x21
  7958740225 │       mov x9, x18
             │       mov x8, #0x1b                  // #27
 41353477432 │1ea0:  ldr d14, [x9]
  1220553185 │       fmov d18, d22
 22852558475 │       fmov d20, d19
  1199867833 │       mov x3, x11
 22706386191 │       mov x7, x16
  1177543527 │       mov x6, x10
 22767111709 │       fmul d14, d17, d14
  1195454897 │       mov x5, #0x1                   // #1
 94868835951 │       fmadd d16, d14, d31, d16
 48021203056 │1ec4:  ldur d15, [x6, #-248]
 30707657072 │       sub x4, x3, #0x140
 41301831015 │       fmov d14, d19
 32467499777 │       mov x2, x13
 39498561992 │       mov x1, x3
 32503985332 │       mov w0, #0x1                   // #1
 39636367978 │       fmul d15, d17, d15
 56642417403 │       ldr d21, [x4, x12, lsl #3]
215900325343 │       fmul d21, d17, d21
 49939836468 │       fmul d15, d15, d20
238451679574 │       fmul d20, d21, d18
 49692127013 │       fmadd d15, d15, d31, d16
287649913912 │       fmadd d16, d20, d31, d15
359871503840 │1ef8:  ldur d15, [x1, #-248]
144055883055 │       add w0, w0, #0x1
 72262104254 │       add x2, x2, #0x18
143991169721 │       add x1, x1, #0x48
288648917780 │       fmul d15, d17, d15
864665644756 │       fmul d15, d15, d18
863868426387 │       fmul d14, d15, d14
865228159813 │       fmadd d16, d14, d31, d16
      245967 │       cmp w0, #0x4
215396760545 │     ↓ b.eq 1f28
   704732365 │       ldur d14, [x2, #-8]
143775979620 │     ↑ b 1ef8
  2623253706 │1f28:  add x5, x5, #0x1
 71700007726 │       add x6, x6, #0x48
   291326727 │       add x3, x3, #0x8
 41539387956 │       cmp x5, #0x4
   291327452 │     ↓ b.eq 1f4c
152721910227 │       ldr d18, [x7, x15, lsl #3]
  8561615599 │       add x7, x7, #0x18
 96142935717 │       ldur d20, [x7, #-24]
  8495464096 │     ↑ b 1ec4
201164546300 │1f4c:  add x8, x8, #0x1b
 22086088222 │       add x9, x9, #0xd8
  1882100212 │       add x14, x14, #0x18
 22119311849 │       add x10, x10, #0xd8
  1892034271 │       add x11, x11, #0xd8
 13413581701 │       cmp x8, #0x6c
  1191551884 │     ↓ b.eq 1f70
 26310755425 │       ldur d17, [x14, #-8]
  1210506566 │     ↑ b 1ea0
 71960439728 │1f70:  add x17, x17, #0x3
             │       add x18, x18, #0x18
  8069920125 │       add x20, x20, #0x18
             │       add x19, x19, #0x18
  4645045210 │       cmp x17, #0x27
             │     ↓ b.eq 1f90
 10962695888 │       ldr d17, [x21], #8
             │     ↑ b 1e8c
 23927242012 │1f90:  add x23, x23, #0x1
             │       str d16, [x22]
  2672842806 │       add x16, x16, #0x8
             │       add x12, x12, #0x9
  2653094829 │       sub x15, x15, #0x1
             │       add x24, x24, #0x48
  2692030697 │       add x22, x22, #0x1e0
             │       cmp x23, #0x4
  1721216607 │     ↓ b.eq 1fbc
   448331273 │       ldr d19, [x13], #8
  1778236919 │     ↑ b 1e6c
  7971009272 │1fbc:  ldr x0, [sp, #448]
   911313572 │       add x26, x26, #0x1
             │       add x27, x27, #0x8
   902215785 │       add x0, x0, #0x1
             │       str x0, [sp, #448]
   478032817 │       cmp x26, #0x4
             │     ↓ b.eq 1fe8
  1475545769 │       add x0, sp, #0x708
             │       add x0, x0, x26, lsl #3
  1806982272 │       ldur d22, [x0, #-8]
             │     ↑ b 1e40

Profile with hoisting of the 'w' array disabled:

   589937229 │1e30:  mov x15, #0x1760               // #5984
   904297989 │       add x0, sp, x15
   870649879 │       add x22, x0, x22
             │       fmov d7, d24
   891274869 │       ldr x0, [sp, #448]
             │       add x14, sp, #0x710
   909978719 │       ldr x12, [sp, #728]
             │       sub x13, x27, #0x1
   882715766 │       add x18, x28, #0x8
             │       sub x19, x0, x28
   885884552 │       mov x9, #0x9                   // #9
             │       mov x20, #0x1                  // #1
  6279074827 │1e60:  mov x17, x22
             │       mov x16, x30
  2666213304 │       ldr d2, [x19]
             │       mov x15, #0x3                  // #3
 18990367400 │1e70:  mov x8, x17
             │       mov x11, x16
  8057495884 │       mov x10, #0x1b                 // #27
 14947123246 │1e7c:  sub x0, x8, #0x140
 22985623052 │       ldur d5, [x11, #-8]
  1060364445 │       fmov d6, d8
 23956420799 │       fmov d3, d7
             │       add x3, x18, x8
 24065319873 │       mov x7, x14
             │       ldr d0, [x0, x9, lsl #3]
 24187025828 │       mov x6, x8
             │       mov x5, #0x1                   // #1
 48132474841 │       fmul d0, d5, d0
 96001335773 │       fmadd d2, d0, d16, d2
 61067761742 │1ea8:  ldur d4, [x6, #-248]
 14089308947 │       sub x4, x3, #0x140
 58091146403 │       fmov d0, d7
 14028168886 │       mov x2, x12
 57897209384 │       mov x1, x3
 13994185270 │       mov w0, #0x1                   // #1
 67891460180 │       fmul d4, d5, d4
 28006688701 │       ldr d1, [x4, x9, lsl #3]
215655048826 │       fmul d1, d5, d1
 57701202743 │       fmul d3, d4, d3
230116393416 │       fmul d1, d1, d6
 57977229144 │       fmadd d2, d3, d16, d2
301775181164 │       fmadd d2, d1, d16, d2
     3441224 │1edc:  ldur d1, [x1, #-248]
216111094536 │       add w0, w0, #0x1
     1473566 │       add x2, x2, #0x18
215873683406 │       add x1, x1, #0x48
216166335905 │       fmul d1, d5, d1
864007322335 │       fmul d1, d1, d6
863815029515 │       fmul d0, d1, d0
864900327399 │       fmadd d2, d0, d16, d2
             │       cmp w0, #0x4
216329679631 │     ↓ b.eq 1f0c
    22872044 │       ldur d0, [x2, #-8]
143941131893 │     ↑ b 1edc
   277804663 │1f0c:  add x5, x5, #0x1
 72179847520 │       add x6, x6, #0x48
             │       add x3, x3, #0x8
 65738463940 │       cmp x5, #0x4
             │     ↓ b.eq 1f30
123097375558 │       ldr d6, [x7, x13, lsl #3]
             │       add x7, x7, #0x18
 96061189670 │       ldur d3, [x7, #-24]
             │     ↑ b 1ea8
 42647845407 │1f30:  add x10, x10, #0x1b
             │       add x11, x11, #0x18
 24141022972 │       add x8, x8, #0xd8
             │       cmp x10, #0x6c
 14573046432 │     ↑ b.ne 1e7c
 72139544087 │       add x15, x15, #0x3
  8028370830 │       add x16, x16, #0x8
             │       add x17, x17, #0x18
  4860057143 │       cmp x15, #0xc
             │     ↑ b.ne 1e70
 23912996709 │       add x20, x20, #0x1
             │       str d2, [x19]
  2670529487 │       add x14, x14, #0x8
             │       add x9, x9, #0x9
  2659625346 │       sub x13, x13, #0x1
             │       add x19, x19, #0x1e0
  1606030574 │       cmp x20, #0x4
             │     ↓ b.eq 1f80
  3096553445 │       ldr d7, [x12], #8
             │     ↑ b 1e60
  7964390214 │1f80:  add x27, x27, #0x1
             │       sub x28, x28, #0x8
   529029469 │       cmp x27, #0x4
             │     ↓ b.eq 2028
  1176126379 │       lsl x22, x27, #3
             │       add x0, sp, #0x6f8
   593893747 │       add x0, x0, x22
  1798781807 │       ldur d8, [x0, #-8]
   580872685 │     ↑ b 1e30
