https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97891

            Bug ID: 97891
           Summary: [x86] Consider using registers on large
                    initializations
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Consider the following example code:

struct A
{
    long a;
    short b;
    int c;
    char d;
    long x;
    bool y;
    int z;
    char* p;

    A() :
        a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
    {}
};

void test(A* p, unsigned int count)
{
    for (unsigned int i = 0; i < count; ++i)
    {
        p[i] = A();
    }
}

When compiled with "-O3 -march=nehalem" the generated code is:

test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rax
.L3:
        xorl    %edx, %edx
        movq    $0, (%rdi)
        addq    $48, %rdi
        movw    %dx, -40(%rdi)
        movl    $0, -36(%rdi)
        movb    $0, -32(%rdi)
        movq    $0, -24(%rdi)
        movb    $0, -16(%rdi)
        movl    $0, -12(%rdi)
        movq    $0, -8(%rdi)
        cmpq    %rax, %rdi
        jne     .L3
.L1:
        ret

https://gcc.godbolt.org/z/TrfWYr
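
A note on reading the listing: addq $48, %rdi is scheduled before most of the
stores, so those stores address the members relative to the already advanced
pointer (displacement = member offset - 48). The following is a minimal sketch
that checks this mapping, assuming the LP64 x86-64 ABI (8-byte long and
pointers); the helper displacement_after_increment is a hypothetical name
introduced for illustration and is not part of the test case.

```cpp
#include <cstddef>

// Same members as struct A in the report.
struct A
{
    long a;
    short b;
    int c;
    char d;
    long x;
    bool y;
    int z;
    char* p;

    A() :
        a(0), b(0), c(0), d(0), x(0), y(false), z(0), p(0)
    {}
};

// Assumption: LP64 x86-64 layout. Other ABIs (e.g. 4-byte long) will differ.
static_assert(sizeof(A) == 48, "loop stride, matches addq $48");
static_assert(offsetof(A, a) == 0, "movq store");
static_assert(offsetof(A, b) == 8, "movw store");
static_assert(offsetof(A, c) == 12, "movl store");
static_assert(offsetof(A, d) == 16, "movb store");
static_assert(offsetof(A, x) == 24, "movq store");
static_assert(offsetof(A, y) == 32, "movb store");
static_assert(offsetof(A, z) == 36, "movl store");
static_assert(offsetof(A, p) == 40, "movq store");

// Hypothetical helper: displacement of a member once the pointer has
// already been advanced to the next array element.
constexpr long displacement_after_increment(std::size_t member_offset)
{
    return static_cast<long>(member_offset) - static_cast<long>(sizeof(A));
}

static_assert(displacement_after_increment(offsetof(A, b)) == -40, "movw %dx, -40(%rdi)");
static_assert(displacement_after_increment(offsetof(A, p)) == -8, "movq $0, -8(%rdi)");
```

The positive offsets are the same ones that appear as 8(%0), 12(%0), etc. in
the hand-written loop further down.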

Here, the main loop body between .L3 and .L1 is 60 bytes long, with a
significant amount of space wasted on the $0 constants encoded in the mov
instructions. It would be more efficient to use a single zeroed register for
all member initializations, especially given that %edx is already used that
way for the 16-bit store.
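
To make the size argument concrete, here is a rough tally of the loop body.
The per-instruction lengths are my own reading of the x86-64 encoding rules
(REX prefix, opcode, ModRM, disp8, immediate), not assembler output, so treat
them as an assumption; the Insn table and the helper names are hypothetical.
Under that assumption the total matches the 60 bytes quoted above, and 22 of
those bytes are zero immediates. The sketch needs C++14 for the constexpr
loops.

```cpp
// One row per instruction between .L3 and .L1 in the first listing.
struct Insn
{
    const char* text;   // instruction as printed in the listing
    unsigned length;    // assumed encoded length in bytes
    unsigned zero_imm;  // bytes of that length spent on the $0 immediate
};

constexpr Insn original_loop[] =
{
    { "xorl %edx, %edx",     2, 0 },
    { "movq $0, (%rdi)",     7, 4 },
    { "addq $48, %rdi",      4, 0 },
    { "movw %dx, -40(%rdi)", 4, 0 },
    { "movl $0, -36(%rdi)",  7, 4 },
    { "movb $0, -32(%rdi)",  4, 1 },
    { "movq $0, -24(%rdi)",  8, 4 },
    { "movb $0, -16(%rdi)",  4, 1 },
    { "movl $0, -12(%rdi)",  7, 4 },
    { "movq $0, -8(%rdi)",   8, 4 },
    { "cmpq %rax, %rdi",     3, 0 },
    { "jne .L3",             2, 0 },
};

// Sum of the assumed instruction lengths.
constexpr unsigned total_length()
{
    unsigned sum = 0;
    for (const Insn& insn : original_loop)
        sum += insn.length;
    return sum;
}

// Sum of the bytes that only encode a zero constant.
constexpr unsigned zero_immediate_bytes()
{
    unsigned sum = 0;
    for (const Insn& insn : original_loop)
        sum += insn.zero_imm;
    return sum;
}

static_assert(total_length() == 60, "matches the size quoted in the report");
static_assert(zero_immediate_bytes() == 22, "bytes spent on $0 immediates");
```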

A loop rewritten like this:

    for (unsigned int i = 0; i < count; ++i)
    {
        __asm__
        (
            "movq    %q1, (%0)\n\t"
            "movw    %w1, 8(%0)\n\t"
            "movl    %1, 12(%0)\n\t"
            "movb    %b1, 16(%0)\n\t"
            "movq    %q1, 24(%0)\n\t"
            "movb    %b1, 32(%0)\n\t"
            "movl    %1, 36(%0)\n\t"
            "movq    %q1, 40(%0)\n\t"
            : : "r" (p + i), "q" (0)
        );
    }

compiles to:

test(A*, unsigned int):
        testl   %esi, %esi
        je      .L1
        leal    -1(%rsi), %eax
        leaq    (%rax,%rax,2), %rax
        salq    $4, %rax
        leaq    48(%rdi,%rax), %rdx
        xorl    %eax, %eax
.L3:
        movq    %rax, (%rdi)
        movw    %ax, 8(%rdi)
        movl    %eax, 12(%rdi)
        movb    %al, 16(%rdi)
        movq    %rax, 24(%rdi)
        movb    %al, 32(%rdi)
        movl    %eax, 36(%rdi)
        movq    %rax, 40(%rdi)

        addq    $48, %rdi
        cmpq    %rdx, %rdi
        jne     .L3
.L1:
        ret

Here, the loop between .L3 and .L1 only takes 34 bytes, which is nearly half
the original size.

Constant (for example, zero) initialization is a frequently used pattern for
initializing structures, so sequences like the one above are quite widespread.
Converting cases like this to use registers could save some code size and
reduce cache pressure.
