https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81456

            Bug ID: 81456
           Summary: x86-64 optimizer makes wrong decision when optimizing
                    for size
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: x86_64

Consider the following code (it doesn't matter whether you compile it as C or C++):

#include <stdlib.h>
int Bounce(int a, int b)
{
    int mod = abs(a % b);            /* magnitude of the remainder */
    return (a/b % 2) ? b-mod : mod;  /* reflect on odd periods     */
}
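For nonnegative inputs this computes a triangle ("ping-pong") wave, reflecting
a back and forth within [0, b]. A quick driver, added here purely for
illustration (it assumes the Bounce() definition above is in the same file):

#include <stdio.h>

/* Illustrative driver only: with b = 3, prints 0 1 2 3 2 1 0
   as a sweeps from 0 to 6. */
int main(void)
{
    for (int a = 0; a <= 6; ++a)
        printf("%d ", Bounce(a, 3));
    putchar('\n');
    return 0;
}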

When optimizing for speed (whether -O1, -O2, or -O3), this compiles to the
following x86-64 assembly:

Bounce(int, int):
        movl    %edi, %eax

        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx

        subl    %edx, %esi
        testb   $1, %al
        cmovne  %esi, %edx
        movl    %edx, %eax
        ret
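
As an aside, the SARL/XORL/SUBL triple after the division is the standard
branchless absolute-value idiom. A minimal C sketch of what those three
instructions compute (assuming a 32-bit int and an arithmetic right shift of
negative values, which GCC provides on this target):

static int branchless_abs(int x)
{
    int sign = x >> 31;        /* sarl: 0 if x >= 0, -1 if x < 0      */
    return (x ^ sign) - sign;  /* xorl + subl: negate iff sign is -1  */
}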

That output is observed as far back as (at least) GCC 4.8, and all the way up
to the current 8.0 preview I have (8.0.0 20170716).

However, when optimizing for size (-Os), the same function produces this
output:

Bounce:
        movl    %edi, %eax

        cltd
        idivl   %esi
        movl    %edx, %ecx
        sarl    $31, %ecx
        xorl    %ecx, %edx
        subl    %ecx, %edx

        testb   $1, %al
        je      .L1
        subl    %edx, %esi
        movl    %esi, %edx
.L1:
        movl    %edx, %eax
        ret
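
At the source level, the difference between the two tails amounts to the
following sketch (mine, not compiler output; quot and mod stand for the
quotient in %eax and the absolute remainder in %edx):

/* Speed version: compute both candidates, select without branching. */
int select_cmov(int quot, int b, int mod)
{
    int alt = b - mod;              /* subl: always executed      */
    return (quot & 1) ? alt : mod;  /* testb + cmovne: branchless */
}

/* Size (-Os) version: branch around the computation instead. */
int select_branch(int quot, int b, int mod)
{
    if (quot & 1)                   /* testb + je                   */
        mod = b - mod;              /* subl + movl, only when taken */
    return mod;                     /* movl                         */
}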

This defies expectations because it is actually *larger* (more bytes) than the
version optimized for speed. The JE instruction in this version is 2 bytes, as
is each of the two MOVL instructions, making the selection sequence 6 bytes
total. However, in the version optimized for speed, the CMOVNE instruction is
3 bytes, plus a 2-byte MOVL, for 5 bytes total. (The SUBL instruction is
required either way.)
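
For reference, the relevant register-register encodings (per the Intel SDM):

        je     rel8             74 cb        2 bytes
        movl   %reg, %reg       89 /r        2 bytes
        subl   %reg, %reg       29 /r        2 bytes
        cmovne %reg, %reg       0F 45 /r     3 bytes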

Now, one byte obviously isn't a big deal in terms of total size, except that
the CMOV version is also faster, so even if the two versions were exactly the
same size, it should be preferred over the branching version! (The optimizer
has no reason to expect the quotient of the division to be predictably odd or
even, so a non-branching conditional move is the right choice for the best
worst-case performance.)

I notice that this is a regression introduced after GCC 6.3; that is, GCC 6.3
generates the same code for -Os as for -O1/-O2/-O3. I don't have GCC 7.0, so
GCC 7.1 is the earliest version in which I can reproduce the described
behavior, and, as noted above, it is still present in GCC 8.

This is also *not* observed when targeting 32-bit x86: there you get
conditional moves whenever the target architecture supports them (P6 and
later). So this affects only x86-64, where conditional moves are *always*
available.
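
For completeness, the behavior above can be reproduced with command lines
along these lines (the file name bounce.c is a placeholder of mine):

gcc -O2 -S bounce.c                     # x86-64: cmovne sequence
gcc -Os -S bounce.c                     # x86-64: testb/je sequence
gcc -m32 -march=i686 -Os -S bounce.c    # 32-bit: still uses cmov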
