https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891
            Bug ID: 97891
           Summary: [x86] Consider using registers on large initializations
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following example code:

    struct A
    {
        long a;
        short b;
        int c;
        char d;
        long x;
        bool y;
        int z;
        char* p;

        A() :
            a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
        {}
    };

    void test(A* p, unsigned int count)
    {
        for (unsigned int i = 0; i < count; ++i)
        {
            p[i] = A();
        }
    }

When compiled with "-O3 -march=nehalem" the generated code is:

    test(A*, unsigned int):
            testl   %esi, %esi
            je      .L1
            leal    -1(%rsi), %eax
            leaq    (%rax,%rax,2), %rax
            salq    $4, %rax
            leaq    48(%rdi,%rax), %rax
    .L3:
            xorl    %edx, %edx
            movq    $0, (%rdi)
            addq    $48, %rdi
            movw    %dx, -40(%rdi)
            movl    $0, -36(%rdi)
            movb    $0, -32(%rdi)
            movq    $0, -24(%rdi)
            movb    $0, -16(%rdi)
            movl    $0, -12(%rdi)
            movq    $0, -8(%rdi)
            cmpq    %rax, %rdi
            jne     .L3
    .L1:
            ret

https://gcc.godbolt.org/z/TrfWYr

Here, the main loop body between .L3 and .L1 is 60 bytes large, with a
significant amount of space wasted on the $0 constants encoded in mov
instructions. It would be more efficient to use a single zero register in all
member initializations, especially given that %edx is already used like that.
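To put a rough number on the space taken by the $0 constants, the sketch below adds up the immediate fields of the seven stores in the loop above that use an immediate operand. It relies on the standard x86-64 encoding of "mov $imm, mem": 1 immediate byte for movb, 2 for movw, 4 for movl, and 4 for movq (a sign-extended 32-bit immediate). The names Store, immediate_bytes, kImmediateStores and total_immediate_bytes are hypothetical helpers introduced only for this illustration; they are not part of GCC.

```cpp
#include <cassert>
#include <cstddef>

// Operand sizes of a "mov $imm, mem" store.
enum class Store { Byte, Word, Dword, Qword };

// Number of bytes the immediate field occupies in the encoding.
// A 64-bit store carries only a sign-extended 32-bit immediate.
constexpr std::size_t immediate_bytes(Store s) {
    switch (s) {
        case Store::Byte:  return 1;  // movb $imm8,  mem
        case Store::Word:  return 2;  // movw $imm16, mem
        case Store::Dword: return 4;  // movl $imm32, mem
        case Store::Qword: return 4;  // movq $imm32, mem (sign-extended)
    }
    return 0;
}

// The immediate stores in the loop body of the first listing, in source
// order: a, c, d, x, y, z, p. Member b is skipped because it is already
// stored from %dx rather than from an immediate.
constexpr Store kImmediateStores[] = {
    Store::Qword, Store::Dword, Store::Byte, Store::Qword,
    Store::Byte,  Store::Dword, Store::Qword};

// Sum of the immediate fields over all stores listed above.
constexpr std::size_t total_immediate_bytes() {
    std::size_t total = 0;
    for (Store s : kImmediateStores) {
        total += immediate_bytes(s);
    }
    return total;
}

// 3 * 4 (movq) + 2 * 4 (movl) + 2 * 1 (movb) = 22 bytes of encoded zeros.
static_assert(total_immediate_bytes() == 22, "immediate byte count");
```

Under these assumptions, 22 of the 60 bytes in the loop body are zero immediates, which is the part a shared zero register would remove.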
A loop rewritten like this:

        for (unsigned int i = 0; i < count; ++i)
        {
            __asm__
            (
                "movq %q1, (%0)\n\t"
                "movw %w1, 8(%0)\n\t"
                "movl %1, 12(%0)\n\t"
                "movb %b1, 16(%0)\n\t"
                "movq %q1, 24(%0)\n\t"
                "movb %b1, 32(%0)\n\t"
                "movl %1, 36(%0)\n\t"
                "movq %q1, 40(%0)\n\t"
                : : "r" (p + i), "q" (0)
            );
        }

compiles to:

    test(A*, unsigned int):
            testl   %esi, %esi
            je      .L1
            leal    -1(%rsi), %eax
            leaq    (%rax,%rax,2), %rax
            salq    $4, %rax
            leaq    48(%rdi,%rax), %rdx
            xorl    %eax, %eax
    .L3:
            movq    %rax, (%rdi)
            movw    %ax, 8(%rdi)
            movl    %eax, 12(%rdi)
            movb    %al, 16(%rdi)
            movq    %rax, 24(%rdi)
            movb    %al, 32(%rdi)
            movl    %eax, 36(%rdi)
            movq    %rax, 40(%rdi)
            addq    $48, %rdi
            cmpq    %rdx, %rdi
            jne     .L3
    .L1:
            ret

Here, the loop between .L3 and .L1 only takes 34 bytes, which is nearly half
the original size.

Constant (for example, zero) initialization is a frequently used pattern to
initialize structures, so sequences like the above are quite widespread.
Converting cases like this to use registers could save some code size and
reduce cache pressure.
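The displacements hard-coded in the inline asm (0, 8, 12, 16, 24, 32, 36, 40) and the 48-byte stride depend on how struct A is laid out. The sketch below checks that layout at compile time. It assumes an LP64 target such as the x86-64 System V ABI, where long and pointers are 8 bytes; on other ABIs the assertions would be expected to fail. The helper name matches_asm_displacements is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Same structure as in the report.
struct A
{
    long a;
    short b;
    int c;
    char d;
    long x;
    bool y;
    int z;
    char* p;

    A() :
        a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
    {}
};

// Assumption: LP64 layout (x86-64 System V ABI). Each member offset is
// compared with the displacement used for it in the inline asm.
constexpr bool matches_asm_displacements() {
    return offsetof(A, a) == 0  &&
           offsetof(A, b) == 8  &&
           offsetof(A, c) == 12 &&
           offsetof(A, d) == 16 &&
           offsetof(A, x) == 24 &&
           offsetof(A, y) == 32 &&
           offsetof(A, z) == 36 &&
           offsetof(A, p) == 40;
}

static_assert(matches_asm_displacements(), "member offsets differ from the asm");
static_assert(sizeof(A) == 48, "loop stride in the asm assumes 48-byte elements");
```

Checks like these make it visible when a hand-written store sequence no longer matches the structure, for example after a member is added or reordered.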