https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82142
Bug ID: 82142
Summary: struct zeroing should use wide stores instead of avoiding
         overwriting padding
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

(Not sure if tree-optimization is the right "component", please check.)

Zeroing a struct (by assignment from an all-zero struct) is inefficient for
structs that have padding but are too small to be worth an inlined rep stos or
a call to memset.  Store coalescing happens (yay, huge improvement over
gcc6!), but only if it won't write any padding.  I assume assignment from
non-zero struct constants behaves the same way.

Note also that gcc zeroes two registers for two 16-bit stores instead of
reusing a single zeroed reg.  Also note that Haswell doesn't have LCP stalls
for a `mov` with a 16-bit immediate, only for ALU instructions, so there's no
need to avoid it.  But if you *are* going to zero a register, use it as the
source for all the integer mov instructions to save code size (and avoid
filling up space in the uop cache with immediates).  Sandybridge-family no
longer has the register-read-port stalls of P6 / Core2 / Nehalem, and even
there a recently-written register is fine to read for several cycles.

https://godbolt.org/g/AT7yfs

typedef struct {
    int  a, b, c;
    char j, k, k1;      // followed by 1B of padding
    int  l, m, n[8];
    char c1, c2;        // followed by 2B of padding
} foo;

int sf = sizeof(foo);
        sf:     .long   60              # sizeof(foo) == 60 bytes

void assignzero(foo *p) {
    foo tmp = {};
    *p = tmp;
}

(GCC-Explorer-Build) 8.0.0 20170907, -xc -std=c11 -O3 -march=haswell:

assignzero:
        pushq   %rbp
        xorl    %eax, %eax              # zero a reg to avoid mov $imm16, (mem)
        vpxor   %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx              # and another one, instead of reusing eax??
        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movq    %rsp, %rbp              # make a stack frame because of the ymm?
                                        # At least it doesn't do something with %r10 like gcc7.2
        movw    %ax, 12(%rdi)
        movb    $0, 14(%rdi)            # avoiding a store to 1B of padding
        vmovdqu %ymm0, 16(%rdi)         # bunch of int members
        movq    $0, 48(%rdi)
        movw    %dx, 56(%rdi)
        vzeroupper
        popq    %rbp
        ret

So 1B of padding costs us four narrow stores instead of one 16B store at the
front of the struct.  This also uses AVX for a single 256b store, which is
very likely not worth it.

// what we would like to see, more or less
void charzero(char *p) {
    __builtin_memset(p, 0, sizeof(foo));
}

charzero:
        vpxor   %xmm0, %xmm0, %xmm0
        movq    $0, 48(%rdi)
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        movl    $0, 56(%rdi)
        ret

This chooses not to use 256b AVX, so there's no vzeroupper, no slow mode while
the upper half of the execution units / data bus warms up (Agner's description
sounds like it applies to stores: http://agner.org/optimize/blog/read.php?i=415),
and no risk of triggering a lower turbo threshold.  All of this is probably
good for a tiny function.

We could save some code size by replacing the integer mov $0 stores with
vmovq %xmm0, 48(%rdi).  That's probably not good for Bulldozer-family (the
vector unit is shared between two integer cores), so maybe only enable it for
-march=znver1 or Sandybridge-family CPUs.  (Xor-zeroing takes zero cycles on
Intel SnB-family, so xmm0 is ready as a store source in the same cycle the
vpxor issues.  I guess it's possible that an integer store can execute one
cycle sooner than an AVX 8-byte store, so the integer store makes sense if
there weren't already any stores waiting to execute and store-port throughput
was about to become a bottleneck in later code.  Maybe only do it for the last
store, using a 4-byte vmovd %xmm0, 56(%rdi).)
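As an aside, the store offsets gcc picks above (12 and 14 but not 15; 56-57
but not 58-59) line up exactly with the struct's padding bytes.  A standalone
C11 sketch that pins this down (foo is re-declared here so it compiles on its
own; this check is illustrative and not part of the testcase above):

    /* Layout check: confirms the 1-byte hole at offset 15 and the two
       tail-padding bytes at offsets 58-59 that gcc steps around above. */
    #include <stddef.h>

    typedef struct {
        int  a, b, c;
        char j, k, k1;
        int  l, m, n[8];
        char c1, c2;
    } foo;

    _Static_assert(sizeof(foo) == 60,       "matches 'sf' above");
    _Static_assert(offsetof(foo, k1) == 14, "byte 15 is padding");
    _Static_assert(offsetof(foo, l)  == 16, "the ints resume at offset 16");
    _Static_assert(offsetof(foo, c1) == 56, "bytes 58-59 are tail padding");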
Or better than any of the integer / vmovq variants above (on Haswell and
especially Skylake): use overlapping vector stores, like clang has been doing
since 3.7, and like glibc memcpy does for small copies:

# clang (trunk) -O3 -march=haswell
assignzero:                             # same code for clang's memset
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 28(%rdi)         # last 32 bytes
        vmovups %ymm0, (%rdi)           # first 32 bytes, overlapping by 4
        vzeroupper
        retq

Having to use vzeroupper here is questionable.  Four xmm stores (with the last
one overlapping) might be a better choice here, but I haven't tried to measure
the effect of scattered uses of AVX.  When tuning for Sandybridge, clang only
uses 128b stores here.
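For reference, a minimal sketch of the overlapping-store idea in portable C
(the function name and the fixed-size memset calls are illustrative, not what
gcc should literally emit).  With -O3 -march=haswell a compiler can lower each
constant 32-byte memset to a single unaligned store of a zeroed ymm, giving
the same two overlapping stores clang emits:

    #include <string.h>

    typedef struct {
        int  a, b, c;
        char j, k, k1;          /* 1 byte of padding follows */
        int  l, m, n[8];
        char c1, c2;            /* 2 bytes of tail padding */
    } foo;                      /* sizeof(foo) == 60 */

    static void zero_overlapping(foo *p)     /* hypothetical name */
    {
        char *d = (char *)p;
        memset(d, 0, 32);                    /* bytes 0..31 */
        memset(d + sizeof(foo) - 32, 0, 32); /* bytes 28..59, overlaps 28..31 */
    }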