On 08/12/2016 05:20 PM, Denys Vlasenko wrote:
Yes, I know all that.  Fetching is one thing.  The loop cache, for instance,
is another (more important) thing.  Not aligning the loop head increases the
chance of the whole loop being split over more cache lines than necessary.
Jump predictors also don't necessarily decode/remember the whole
instruction address.  And so on.

Aligning to 8 bytes within a cacheline does not speed things up. It
simply wastes bytes without speeding up anything.

It's not that easy, which is why I asked whether you have _measured_ your
theory that it doesn't matter.  All the alignment adjustments in GCC were
included after measurements.  In particular, the always-align-by-8 rule
(for loop heads) was included after some large regressions on CPU2000,
in 2007 (Core 2 Duo at that time).

So I'm never much thrilled by lists of reasons why performance can't
possibly be affected, especially when we know that it once _was_ affected
and when there's an easy way to show that it isn't.

z.S:

# compile with: gcc -nostartfiles -nostdlib
_start:         .globl _start
                .p2align 8
                mov     $4000*1000*1000, %eax # 5-byte insn
                nop     # 6
                nop     # 7
                nop     # 8
loop:           dec     %eax
                lea     (%ebx), %ebx
                jnz     loop
                push    $0
                ret     # SEGV

This program loops 4 billion times, then exits (by crashing).
...
Looks like loop alignment to 8 bytes does not matter (in this particular example).


I looked into it some more.  I read Agner Fog's microarchitecture manual:
http://www.agner.org/optimize/microarchitecture.pdf

Since Nehalem, Intel CPUs have had a loopback buffer,
implemented differently in different CPUs.

I used the following code, a 4-billion-iteration loop,
with various numbers of padding NOPs:

0000000000400100 <_start>:
  400100:       b8 00 28 6b ee          mov    $0xee6b2800,%eax
  400105:       90                      nop
  400106:       90                      nop
0000000000400107 <loop>:
  400107:       ff c8                   dec    %eax
  400109:       8d 88 d2 04 00 00       lea    0x4d2(%rax),%ecx
  40010f:       75 f6                   jne    400107 <loop>

  400111:       b8 e7 00 00 00          mov    $0xe7,%eax
  400116:       0f 05                   syscall
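The padded variants were presumably assembled by hand; here is a sketch of how one might generate them (the file names, template layout, and helper name are mine, reconstructed from the disassembly above, not the author's actual workflow):

```python
# Hypothetical generator for the z6/z7/z8 test variants: the same loop,
# preceded by a varying number of one-byte NOPs.  Instruction sequence and
# the exit_group(231) exit match the disassembly above.
TEMPLATE = """\
_start:         .globl _start
                .p2align 8
                mov     $4000*1000*1000, %eax   # 5 bytes
{nops}\
loop:           dec     %eax
                lea     1234(%rax), %ecx
                jnz     loop
                mov     $231, %eax              # __NR_exit_group
                syscall
"""

def make_variant(n_nops):
    """Return assembly source with n_nops one-byte padding NOPs before `loop`."""
    nops = "".join("                nop\n" for _ in range(n_nops))
    return TEMPLATE.format(nops=nops)

# mov is 5 bytes, so n padding NOPs put `loop` at offset 5 + n within the
# 256-byte-aligned block: 1, 2, 3 NOPs -> 0x...06, 0x...07, 0x...08.
for n in (1, 2, 3):
    with open(f"z{5 + n}.S", "w") as f:
        f.write(make_variant(n))
```

Each generated file can then be built the same way as z.S above (gcc -nostartfiles -nostdlib).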

On Skylake, the loop slows down if its body crosses a 16-byte boundary
(as shown above: the last insn, JNE, doesn't fit).
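The boundary arithmetic can be checked directly (a minimal sketch, not a measurement; the helper name is mine, and the 10-byte body size is taken from the disassembly above: 2-byte DEC + 6-byte LEA + 2-byte JNE):

```python
# Does the byte range [start, start+size) lie within one aligned 16-byte window?
def fits_in_aligned_block(start, size, align=16):
    return start // align == (start + size - 1) // align

BODY = 2 + 6 + 2   # dec (2) + lea (6) + jne (2) = 10 bytes

print(fits_in_aligned_block(0x400106, BODY))  # z6: True  (body fits -> fast)
print(fits_in_aligned_block(0x400107, BODY))  # z7: False (crosses -> slow)
print(fits_in_aligned_block(0x400108, BODY))  # z8: False (crosses -> slow)
```

This matches the three measurements below: only the 0x400106 placement keeps the whole body inside one 16-byte block.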

With loop starting at 0000000000400106 and fitting into an aligned 16-byte 
block:

 Performance counter stats for './z6' (10 runs):

       1209.051244      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.99% )
                 5      context-switches          #    0.004 K/sec                    ( +- 11.11% )
                 2      page-faults               #    0.002 K/sec                    ( +-  4.76% )
     4,101,694,215      cycles                    #    3.392 GHz                      ( +-  0.51% )
    12,027,931,896      instructions              #    2.93  insn per cycle           ( +-  0.00% )
     4,005,295,446      branches                  # 3312.759 M/sec                    ( +-  0.00% )
            15,828      branch-misses             #    0.00% of all branches          ( +-  4.49% )

       1.209910890 seconds time elapsed                                               ( +-  0.99% )

With loop starting at 0000000000400107:

 Performance counter stats for './z7' (10 runs):

       1408.362422      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.23% )
                 5      context-switches          #    0.004 K/sec                    ( +- 15.59% )
                 2      page-faults               #    0.001 K/sec                    ( +-  4.76% )
     4,749,031,319      cycles                    #    3.372 GHz                      ( +-  0.34% )
    12,032,488,082      instructions              #    2.53  insn per cycle           ( +-  0.00% )
     4,006,159,536      branches                  # 2844.552 M/sec                    ( +-  0.00% )
             6,946      branch-misses             #    0.00% of all branches          ( +-  3.88% )

       1.409459099 seconds time elapsed                                               ( +-  1.23% )

With loop starting at 0000000000400108:

 Performance counter stats for './z8' (10 runs):

       1407.127953      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.09% )
                 6      context-switches          #    0.004 K/sec                    ( +- 15.70% )
                 2      page-faults               #    0.002 K/sec                    ( +-  6.64% )
     4,747,410,967      cycles                    #    3.374 GHz                      ( +-  0.39% )
    12,032,462,223      instructions              #    2.53  insn per cycle           ( +-  0.00% )
     4,006,154,637      branches                  # 2847.044 M/sec                    ( +-  0.00% )
             7,324      branch-misses             #    0.00% of all branches          ( +-  3.40% )

       1.408205377 seconds time elapsed                                               ( +-  1.08% )

The difference is significant and reproducible.

Thus, for this CPU, aligning loops to 8 bytes is wrong: it helps if it
happens to align a loop to 16 bytes, but it can actually hurt performance
if it aligns a loop to 16n+8 bytes and that pushes the end of the loop's
body over the next 16-byte boundary, as happens in the example above.

I suspect something similar was seen some time ago on a different, earlier CPU,
and on _that_ CPU the decoder/loop-buffer idiosyncrasies are such that it
likes 8-byte alignment.

It's not true that such alignment is always a win.
