On 08/12/2016 05:20 PM, Denys Vlasenko wrote:
Yes, I know all that. Fetching is one thing. Loop cache is for instance
another (more important) thing. Not aligning the loop head increases
chance of the whole loop being split over more cache lines than necessary.
Jump predictors also don't necessarily decode/remember the whole
instruction address. And so on.
Aligning to 8 bytes within a cacheline does not speed things up. It
simply wastes bytes without speeding up anything.
It's not that easy, which is why I have asked if you have _measured_ the
correctness of your theory of it not mattering? All the alignment
adjustments in GCC were included after measurements. In particular the
align-by-8-always (for loop heads) was included after some large
regressions on cpu2000, in 2007 (core2 duo at that time).
So, I'm never much thrilled about listing reasons for why performance
can't possibly be affected, especially when we know that it once _was_
affected, when there's an easy way to show that it's not affected.
z.S:
#compile with: gcc -nostartfiles -nostdlib
_start: .globl _start
.p2align 8
mov $4000*1000*1000, %eax # 5-byte insn
nop # 6
nop # 7
nop # 8
loop: dec %eax
lea (%ebx), %ebx
jnz loop
push $0
ret # SEGV
This program loops 4 billion times, then exits (by crashing).
...
Looks like loop alignment to 8 bytes does not matter (in this particular
example).
I looked into it more. I read Agner Fog's microarchitecture manual:
http://www.agner.org/optimize/microarchitecture.pdf
Since Nehalem, Intel CPUs have a loopback buffer,
implemented differently in different CPUs.
I used the following code, a 4-billion-iteration loop
preceded by a varying number of padding NOPs:
0000000000400100 <_start>:
400100: b8 00 28 6b ee mov $0xee6b2800,%eax
400105: 90 nop
400106: 90 nop
0000000000400107 <loop>:
400107: ff c8 dec %eax
400109: 8d 88 d2 04 00 00 lea 0x4d2(%rax),%ecx
40010f: 75 f6 jne 400107 <loop>
400111: b8 e7 00 00 00 mov $0xe7,%eax
400116: 0f 05 syscall
On Skylake, the loop slows down if its body crosses a 16-byte boundary
(as in the listing above, where the final JNE insn straddles the
boundary at 0x400110).
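The boundary-crossing claim is plain address arithmetic, so it can be
checked mechanically. The following is a minimal sketch, not part of the
original measurement: the helper name crosses_block is hypothetical, and
the 16-byte block size is an assumption taken from the Skylake observation
in this message rather than from any documented CPU parameter. The
instruction lengths come from the disassembly above.

```python
BLOCK = 16  # assumed block size, taken from the Skylake observation above


def crosses_block(start: int, length: int, block: int = BLOCK) -> bool:
    """Return True if bytes [start, start+length) span two aligned blocks."""
    last = start + length - 1
    return start // block != last // block


# Loop body from the disassembly: dec (2 bytes) + lea (6 bytes) + jne (2 bytes)
LOOP_LEN = 2 + 6 + 2

if __name__ == "__main__":
    # The three loop start addresses measured in this message (z6, z7, z8)
    for start in (0x400106, 0x400107, 0x400108):
        state = "crosses" if crosses_block(start, LOOP_LEN) else "fits in"
        print(f"loop at {start:#x}: {state} one 16-byte block")
```

Only the loop starting at 0x400106 fits in a single block under this model.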
With loop starting at 0000000000400106 and fitting into an aligned 16-byte
block:
Performance counter stats for './z6' (10 runs):
   1209.051244 task-clock (msec) #    0.999 CPUs utilized   ( +-  0.99% )
             5 context-switches  #    0.004 K/sec           ( +- 11.11% )
             2 page-faults       #    0.002 K/sec           ( +-  4.76% )
 4,101,694,215 cycles            #    3.392 GHz             ( +-  0.51% )
12,027,931,896 instructions      #     2.93 insn per cycle  ( +-  0.00% )
 4,005,295,446 branches          # 3312.759 M/sec           ( +-  0.00% )
        15,828 branch-misses     #    0.00% of all branches ( +-  4.49% )
   1.209910890 seconds time elapsed                         ( +-  0.99% )
With loop starting at 0000000000400107:
Performance counter stats for './z7' (10 runs):
   1408.362422 task-clock (msec) #    0.999 CPUs utilized   ( +-  1.23% )
             5 context-switches  #    0.004 K/sec           ( +- 15.59% )
             2 page-faults       #    0.001 K/sec           ( +-  4.76% )
 4,749,031,319 cycles            #    3.372 GHz             ( +-  0.34% )
12,032,488,082 instructions      #     2.53 insn per cycle  ( +-  0.00% )
 4,006,159,536 branches          # 2844.552 M/sec           ( +-  0.00% )
         6,946 branch-misses     #    0.00% of all branches ( +-  3.88% )
   1.409459099 seconds time elapsed                         ( +-  1.23% )
With loop starting at 0000000000400108:
Performance counter stats for './z8' (10 runs):
   1407.127953 task-clock (msec) #    0.999 CPUs utilized   ( +-  1.09% )
             6 context-switches  #    0.004 K/sec           ( +- 15.70% )
             2 page-faults       #    0.002 K/sec           ( +-  6.64% )
 4,747,410,967 cycles            #    3.374 GHz             ( +-  0.39% )
12,032,462,223 instructions      #     2.53 insn per cycle  ( +-  0.00% )
 4,006,154,637 branches          # 2847.044 M/sec           ( +-  0.00% )
         7,324 branch-misses     #    0.00% of all branches ( +-  3.40% )
   1.408205377 seconds time elapsed                         ( +-  1.08% )
The difference is significant and reproducible.
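To put a number on "significant", the cycle counts can be normalized by
the iteration count. This is a small sketch using only the figures quoted
above; the function names are hypothetical, and the total cycle counts
include a small amount of process startup overhead, so the per-iteration
values are approximate.

```python
ITERATIONS = 4_000_000_000  # loop count, from the mov immediate 0xee6b2800

# Cycle counts copied from the three perf stat runs above
cycles = {
    "z6": 4_101_694_215,  # loop at 0x400106, fits in one 16-byte block
    "z7": 4_749_031_319,  # loop at 0x400107, crosses the boundary
    "z8": 4_747_410_967,  # loop at 0x400108, crosses the boundary
}


def cycles_per_iteration(total_cycles: int, iterations: int = ITERATIONS) -> float:
    """Average cycles spent per loop iteration."""
    return total_cycles / iterations


def slowdown_percent(slow: int, fast: int) -> float:
    """How many percent more cycles the slow run took than the fast run."""
    return (slow / fast - 1.0) * 100.0


if __name__ == "__main__":
    for name, count in cycles.items():
        print(f"{name}: {cycles_per_iteration(count):.3f} cycles/iteration")
    for name in ("z7", "z8"):
        pct = slowdown_percent(cycles[name], cycles["z6"])
        print(f"{name} vs z6: {pct:.1f}% more cycles")
```

That is roughly 1.03 cycles per iteration for z6 against roughly 1.19 for
z7 and z8.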
Thus, for this CPU, alignment of loops to 8 bytes is wrong: it helps if it
happens to align a loop to 16 bytes, but it may in fact hurt performance if
it happens to align a loop to 16+8 bytes and this pushes the end of the
loop's body over the next 16-byte boundary, as happens in the above example.
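The same reasoning can be enumerated for every possible starting offset of
this 10-byte loop. The sketch below is illustrative only: align_up and
crosses_block are hypothetical helpers, and the "one 16-byte block" model
is an assumption based on the Skylake measurements above, not a general
description of how any CPU fetches or decodes.

```python
def crosses_block(start: int, length: int, block: int = 16) -> bool:
    """Return True if bytes [start, start+length) span two aligned blocks."""
    return start // block != (start + length - 1) // block


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


LOOP_LEN = 10  # dec + lea + jne from the example above

# Offsets (mod 16) where padding to 8 bytes makes a fitting loop cross
hurt = [o for o in range(16)
        if not crosses_block(o, LOOP_LEN)
        and crosses_block(align_up(o, 8), LOOP_LEN)]

# Offsets where padding to 8 bytes turns a crossing loop into a fitting one
helped = [o for o in range(16)
          if crosses_block(o, LOOP_LEN)
          and not crosses_block(align_up(o, 8), LOOP_LEN)]

if __name__ == "__main__":
    print("8-byte alignment hurts at offsets:", hurt)
    print("8-byte alignment helps at offsets:", helped)
```

Under this model, the offsets that benefit are exactly those where padding
to 8 bytes happens to land on a 16-byte boundary.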
I suspect something similar was seen some time ago on a different, earlier
CPU, and on _that_ CPU the decoder/loop buffer idiosyncrasies are such that
it likes 8-byte alignment.
It's not true that such alignment is always a win.