https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82142

            Bug ID: 82142
           Summary: struct zeroing should use wide stores instead of
                    avoiding overwriting padding
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

(not sure if tree-optimization is the right "component", please check.)

Zeroing a struct (by assignment from an all-zero struct) is inefficient for
structs that contain padding and are too small to inline rep stos or memset.
Store coalescing happens (yay, huge improvement over gcc6!), but only if it
won't write any padding.  I assume assignment from non-zero struct constants
behaves the same way.

Note also that gcc zeros two registers for two 16-bit stores, instead of
reusing a zeroed reg.

Also note that Haswell doesn't have LCP stalls for a `mov` with a 16-bit
immediate, only for ALU instructions, so there's no need to avoid it.  But if
you *are* going to zero a register, you should use it as the source for all the
integer mov instructions, to save code-size and to avoid filling up space in
the uop cache with immediates.  Sandybridge-family no longer has the
register-read-port stalls that P6 / Core2 / Nehalem had, and even on those CPUs
a recently-written register is fine for several cycles.

https://godbolt.org/g/AT7yfs

typedef struct {
    int a, b, c;
    char j, k, k1;
        // 1B of padding
    int l, m, n[8];
    char c1, c2;
        // 2B of padding
} foo;
int sf = sizeof(foo);

sf:
        .long   60   # bytes long

void assignzero(foo *p) {
    foo tmp = {};
    *p = tmp;
}

(GCC-Explorer-Build) 8.0.0 20170907  -xc -std=c11 -O3 -march=haswell


assignzero:
        pushq   %rbp
        xorl    %eax, %eax            # zero a reg to avoid mov $imm16, (mem)
        vpxor   %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx            # and another one, instead of reusing eax??
        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movq    %rsp, %rbp            # make a stack frame because of the ymm?
                                      # At least it doesn't do something with %r10 like gcc7.2
        movw    %ax, 12(%rdi)
        movb    $0, 14(%rdi)
                           # avoiding a store to 1B of padding
        vmovdqu %ymm0, 16(%rdi)       # bunch of int members
        movq    $0, 48(%rdi)
        movw    %dx, 56(%rdi)
        vzeroupper
        popq    %rbp
        ret

So 1B of padding costs us 4 narrow stores instead of one 16B store at the front
of the struct.  The generated code also uses AVX for a single 256b store, which
is very likely not worth it.
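
For comparison, here is a hand-written sketch (not compiler output; the label
name is made up) of the same padding-respecting store sequence, but sourcing
every integer store from one zeroed register as suggested above:

assignzero_one_zero_reg:              # sketch, not compiler output
        xorl    %eax, %eax            # one zeroed integer reg, reused everywhere
        vpxor   %xmm0, %xmm0, %xmm0
        movq    %rax, (%rdi)          # a, b
        movl    %eax, 8(%rdi)         # c
        movw    %ax, 12(%rdi)         # j, k
        movb    %al, 14(%rdi)         # k1; still skips the 1B of padding at offset 15
        vmovdqu %ymm0, 16(%rdi)       # l, m, n[0..5]
        movq    %rax, 48(%rdi)        # n[6], n[7]
        movw    %ax, 56(%rdi)         # c1, c2; trailing 2B of padding untouched
        vzeroupper
        ret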


// what we would like to see, more or less
void charzero(char *p) {
    __builtin_memset(p, 0, sizeof(foo));
}
charzero:
        vpxor   %xmm0, %xmm0, %xmm0
        movq    $0, 48(%rdi)
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        movl    $0, 56(%rdi)
        ret

This chooses not to use 256b AVX, so there's no vzeroupper and no slow-mode
period while warming up the upper half of the execution units / data bus
(Agner's description sounds like it applies to stores:
http://agner.org/optimize/blog/read.php?i=415).  It also avoids triggering a
lower turbo threshold.  All of this is probably good for a tiny function.

We could save some code-size by replacing the integer mov $0 stores with vmovq
%xmm0, 48(%rdi).  That's probably not good for Bulldozer-family (shared vector
unit between two integer cores), so maybe only enable it when tuning for
-march=znver1 or a Sandybridge-family CPU.  (xor-zeroing takes zero cycles on
Intel SnB-family, so xmm0 is ready as a source for the store in the same cycle
that the vpxor issues.  I guess it's possible that an integer store can execute
1 cycle sooner than an AVX 8-byte store, so if there weren't already any stores
waiting to execute, and store-port throughput was about to become a bottleneck
in later code, keeping the integer store makes sense.  Maybe only do the
replacement for the last store, using a 4-byte vmovd %xmm0, 56(%rdi).)
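
For concreteness, the suggested variant might look something like this (a
hand-written sketch, not compiler output; the label name is made up):

charzero_xmm_tail:                    # sketch: all stores sourced from xmm0
        vpxor   %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rdi)         # bytes 0..15
        vmovups %xmm0, 16(%rdi)       # bytes 16..31
        vmovups %xmm0, 32(%rdi)       # bytes 32..47
        vmovq   %xmm0, 48(%rdi)       # 8-byte store, replaces movq $0, 48(%rdi)
        vmovd   %xmm0, 56(%rdi)       # 4-byte store, replaces movl $0, 56(%rdi)
        ret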

Or better (on Haswell and especially Skylake): use overlapping vector stores
like clang has been doing since 3.7, and like glibc memcpy does for small
copies:

   # clang (trunk) -O3 -march=haswell
assignzero:                             # same code for clang's memset
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 28(%rdi)         # last 32 bytes
        vmovups %ymm0, (%rdi)           # first 32 bytes, overlapping by 4
        vzeroupper
        retq

Having to use vzeroupper here is questionable.

4 xmm stores (with the last one overlapping) might be a better choice here, but
I haven't tried to measure the effect of scattered uses of AVX.  When tuning
for Sandybridge, clang only uses 128b stores here.
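
For the 60-byte struct, the 4-xmm version would look something like this
(again a hand-written sketch, not output from any compiler; no vzeroupper is
needed because only xmm registers are used):

assignzero_4xmm:                      # sketch: 4 xmm stores, last one overlapping
        vpxor   %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rdi)         # bytes 0..15
        vmovups %xmm0, 16(%rdi)       # bytes 16..31
        vmovups %xmm0, 32(%rdi)       # bytes 32..47
        vmovups %xmm0, 44(%rdi)       # bytes 44..59, overlapping the previous store by 4
        ret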
