https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568
            Bug ID: 90568
           Summary: stack protector should use cmp or sub, not xor, to
                    allow macro-fusion on x86
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

cmp/jne is always at least as efficient as xor/jne, and more efficient on
CPUs that support macro-fusion of compare and branch.  Most CPUs support
cmp/jne fusion (all mainstream Intel and AMD, though not the low-power
cores), but none support xor/jne fusion.

void foo() {
    volatile int buf[4];
    buf[1] = 2;
}

This is gcc trunk on Godbolt, but the same code-gen goes all the way back to
gcc 4.9:

foo:
        subq    $40, %rsp
        movq    %fs:40, %rax
        movq    %rax, 24(%rsp)
        xorl    %eax, %eax
        movl    $2, 4(%rsp)
        movq    24(%rsp), %rax
        xorq    %fs:40, %rax       ## This insn should be CMP
        jne     .L5
        addq    $40, %rsp
        ret
.L5:
        call    __stack_chk_fail

As far as I can tell, the actual XOR result value in RAX is not an input to
__stack_chk_fail (gcc sometimes uses a different register for it).  Therefore
we don't need the XOR result, and can use any other way to check for
equality.  If we need to avoid "leaking" the canary value in a register, we
can use SUB; otherwise CMP is even better and can macro-fuse on more CPUs.
Only Sandybridge-family can fuse SUB/JCC.  (And yes, it can fuse even with a
memory source and a segment-override prefix: SUB %fs:40, %rax / JNE is a
single uop on Skylake; I checked this with perf counters in an asm loop.)
AMD can fuse any TEST/JCC or CMP/JCC, but only those instructions (so SUB is
as bad as XOR for AMD).  See Agner Fog's microarch PDF.  (A concrete sketch
of the suggested epilogue follows the perf results below.)

----

Linux test program (NASM) that runs sub (mem), %reg with an FS prefix to
prove that it does macro-fuse and stays micro-fused as a single uop:

default rel
%use smartalign
alignmode p6, 64

global _start
_start:
cookie equ 12345
    mov eax, 158                ; __NR_arch_prctl
    mov edi, 0x1002             ; ARCH_SET_FS
    lea rsi, [buf]
    syscall
;    wrfsbase rsi               ; not enabled by the kernel

    mov qword [fs: 0x28], cookie

    mov ebp, 1000000000

align 64
.loop:
    mov eax, cookie
    sub rax, [fs: 0x28]
    jne _start

    and ecx, edx

    dec ebp
    jnz .loop
.end:

    xor edi, edi
    mov eax, 231                ; __NR_exit_group
    syscall                     ; sys_exit_group(0)

section .bss
align 4096
buf: resb 4096

Assemble and link with
  nasm -felf64 branch-fuse-mem.asm && ld -o branch-fuse-mem branch-fuse-mem.o
to make a static executable, then run

  taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u -r2 ./branch-fuse-mem

On my i7-6700k:

 Performance counter stats for './branch-fuse-mem' (2 runs):

            240.78 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.23% )
                 2      context-switches          #    0.010 K/sec                    ( +- 20.00% )
                 0      cpu-migrations            #    0.000 K/sec
                 3      page-faults               #    0.012 K/sec
     1,000,764,258      cycles:u                  #    4.156 GHz                      ( +-  0.00% )
     2,000,000,076      branches:u                # 8306.384 M/sec                    ( +-  0.00% )
     6,000,000,088      instructions:u            #    6.00  insn per cycle           ( +-  0.00% )
     4,000,109,615      uops_issued.any:u         # 16613.222 M/sec                   ( +-  0.00% )
     5,000,098,334      uops_executed.thread:u    # 20766.367 M/sec                   ( +-  0.00% )

          0.240935 +- 0.000546 seconds time elapsed  ( +-  0.23% )

Note 1.0 billion cycles (1 per iteration), and 4B fused-domain
uops_issued.any, i.e. 4 uops per loop iteration.  (5 uops *executed* is
because one of those front-end uops has a micro-fused load.)

Changing SUB to CMP has no effect.  With SUB changed to XOR, the loop takes
1.25 cycles per iteration, and the front-end issues 5 uops per iteration.
Other counters are the same.  Skylake's pipeline is 4-wide, like all Intel
since Core 2, so the extra front-end uop from the un-fused XOR creates a
bottleneck.
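To make the request concrete, here is a sketch of what the tail of foo()
could look like with the XOR check replaced by CMP.  This is the suggested
code-gen, not anything gcc emits today; SUB would be the drop-in alternative
if leaving the canary copy live in RAX at ret is a concern:

        movq    24(%rsp), %rax
        cmpq    %fs:40, %rax       ## or subq, to avoid leaving the canary value in RAX
        jne     .L5                ## cmp/jne fuses on mainstream Intel and AMD;
                                   ## sub/jne only on Sandybridge-family
        addq    $40, %rsp
        ret
.L5:
        call    __stack_chk_fail

The only difference from the current code-gen is the opcode of the
flag-setting instruction; register allocation and the __stack_chk_fail call
are untouched.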
------

On Intel CPUs before Haswell, the decoders will make at most 1 fusion per
decode group, so you may need to make the loop larger to still get fusion.
Or use the flag-setting instruction as the loop branch, e.g. with a 1 in
memory (a fuller sketch of this variant is below):

    sub rax, [fs: 0x28]
    jnz .loop

Or with a 0 in memory, sub or cmp or xor will all set flags according to the
register being non-zero.  But sub or xor would introduce an extra cycle of
latency on the critical path for the loop counter; cmp would not, since it
doesn't write the register.
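For reference, a minimal sketch of that 1-in-memory variant of the test loop,
reusing the FS setup from the program above (untested on pre-Haswell
hardware; the iteration count and the and ecx, edx filler are just carried
over from the original loop):

    mov qword [fs: 0x28], 1     ; store a 1 instead of the cookie
    mov rax, 1000000000         ; the loop counter lives in RAX now

align 64
.loop:
    and ecx, edx                ; independent filler work, as in the original loop
    sub rax, [fs: 0x28]         ; rax -= 1: memory source + FS override prefix
    jnz .loop                   ; sub/jnz is also the loop branch, so it is the
                                ; only fusion candidate for the decoders to find

Here the loop-carried dependency through RAX should just be the 1-cycle SUB,
since the load address doesn't depend on RAX, so no separate counter update
is needed.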