https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568

            Bug ID: 90568
           Summary: stack protector should use cmp or sub, not xor, to
                    allow macro-fusion on x86
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs
that support macro-fusion of compare-and-branch.  Most CPUs support cmp/jne
fusion (all mainstream Intel and AMD families, though not the low-power cores),
but none support xor/jne fusion.

void foo() {
    volatile int buf[4];
    buf[1] = 2;
}
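
(Presumably built with the stack protector enabled, e.g. gcc -O2
-fstack-protector-strong; the exact flags are my assumption, not stated above.)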

gcc trunk on Godbolt; the same code-gen all the way back to gcc 4.9:

foo:
        subq    $40, %rsp
        movq    %fs:40, %rax
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
        movl    $2, 4(%rsp)
        movq    24(%rsp), %rax
        xorq    %fs:40, %rax              ## This insn should be CMP
        jne     .L5
        addq    $40, %rsp
        ret
.L5:
        call    __stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to
__stack_chk_fail: gcc sometimes uses a different register for it, so it can't
be part of the calling convention.

Therefore we don't need it, and can use any other way to check for equality.

If we need to avoid "leaking" the canary value in a register, we can use SUB;
otherwise CMP is even better and can macro-fuse on more CPUs.
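
For illustration, the tail of the epilogue above could look like either of the
following (a sketch of the suggested code-gen, not actual gcc output):

        movq    24(%rsp), %rax
        subq    %fs:40, %rax              ## RAX becomes 0 on success, so no canary value is left in a register
        jne     .L5

or, if leaving the copy of the canary in RAX at ret is acceptable:

        movq    24(%rsp), %rax
        cmpq    %fs:40, %rax              ## flags only; CMP/JNE can macro-fuse on more CPUs
        jne     .L5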

Only Sandybridge-family can fuse SUB/JCC.  (And yes, it can fuse even with a
memory-source and a segment override prefix.  SUB %fs:40(%rsp), %rax / JNE  is
a single uop on Skylake; I checked this with perf counters in an asm loop.)

AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad
as XOR for AMD).  See Agner Fog's microarch PDF.

----

Linux test program (NASM) that runs  sub (mem), %reg with an FS prefix to prove
that it does macro-fuse and stays micro-fused as a single uop:


default rel
%use smartalign
alignmode p6, 64

global _start
_start:

cookie equ 12345
    mov  eax, 158       ; __NR_arch_prctl
    mov  edi, 0x1002    ; ARCH_SET_FS
    lea  rsi, [buf]
    syscall
   ;  wrfsbase   rsi    ; not enabled by the kernel
    mov  qword [fs: 0x28], cookie

    mov     ebp, 1000000000

align 64
.loop:
    mov   eax, cookie
    sub   rax, [fs: 0x28]   ; rax -= canary => 0, so ZF is set
    jne   _start            ; never taken: the values always match
    and   ecx, edx          ; filler uop, independent of the sub/jne

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group
    syscall       ; sys_exit_group(0)


section .bss
align 4096
buf:    resb 4096



nasm -felf64  branch-fuse-mem.asm &&
ld -o branch-fuse-mem  branch-fuse-mem.o
to make a static executable

taskset -c 3 perf stat
-etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u
-r2 ./branch-fuse-mem

On my i7-6700k

 Performance counter stats for './branch-fuse-mem' (2 runs):

            240.78 msec task-clock:u              #    0.999 CPUs utilized      ( +-  0.23% )
                 2      context-switches          #    0.010 K/sec              ( +- 20.00% )
                 0      cpu-migrations            #    0.000 K/sec
                 3      page-faults               #    0.012 K/sec
     1,000,764,258      cycles:u                  #    4.156 GHz                ( +-  0.00% )
     2,000,000,076      branches:u                # 8306.384 M/sec              ( +-  0.00% )
     6,000,000,088      instructions:u            #    6.00  insn per cycle     ( +-  0.00% )
     4,000,109,615      uops_issued.any:u         # 16613.222 M/sec             ( +-  0.00% )
     5,000,098,334      uops_executed.thread:u    # 20766.367 M/sec             ( +-  0.00% )

          0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any,
i.e. 4 uops per loop iteration.

(The 5 uops *executed* per iteration are because one of those front-end uops
carries a micro-fused load.)
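
For reference, the per-iteration fused-domain accounting (assuming SUB/JNE and
DEC/JNZ both macro-fuse, which is consistent with the counters above):

    mov   eax, cookie        ; 1 fused-domain uop
    sub   rax, [fs: 0x28]    ; 1 fused-domain uop together with the jne
    jne   _start             ;   (load + sub/branch = 2 uops executed)
    and   ecx, edx           ; 1 fused-domain uop
    dec   ebp                ; 1 fused-domain uop together with the jnz
    jnz   .loop

i.e. 4 uops issued and 5 uops executed per iteration, matching the counters.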

Changing SUB to CMP has no effect.

With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the
front-end issues 5 uops per iteration.  Other counters are the same.

Skylake's pipeline is 4-wide, like all Intel since Core 2, so the extra
front-end uop creates the bottleneck: 5 uops / 4 per clock = 1.25 cycles per
iteration.

------

On Intel before Haswell, the decoders make at most 1 fusion per decode group,
so you may need to make the loop larger to still get fusion.  Or use this pair
as the loop branch, e.g. with a 1 stored at the probed location:

   sub  rax, [fs: 0x28]
   jnz  .loop

Or with a 0 in memory, sub or cmp or xor will all set flags according to the
register (the loop counter) being non-zero.  But sub or xor write the register,
introducing an extra cycle of latency on the critical path for the loop
counter, while cmp does not.
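
A minimal sketch of the first variant (hypothetical, not part of the test
program above): keep the iteration count in RAX and store a 1 at the probed
location, so the fused pair is itself the loop branch:

    mov   qword [fs: 0x28], 1
    mov   rax, 1000000000
.loop:
    sub   rax, [fs: 0x28]    ; rax -= 1; ZF set when the count reaches 0
    jnz   .loop              ; should macro-fuse with the sub on SnB-family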
