https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81456
            Bug ID: 81456
           Summary: x86-64 optimizer makes wrong decision when optimizing
                    for size
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cody at codygray dot com
  Target Milestone: ---
            Target: x86_64

Consider the following code (it doesn't matter whether you compile it as C or C++):

    #include <stdlib.h>

    int Bounce(int a, int b)
    {
        int mod = abs(a % b);
        return (a/b % 2) ? b-mod : mod;
    }

When optimizing for speed (whether -O1, -O2, or -O3), this is compiled to the following x86-64 machine code:

    Bounce(int, int):
            movl    %edi, %eax
            cltd
            idivl   %esi
            movl    %edx, %ecx
            sarl    $31, %ecx
            xorl    %ecx, %edx
            subl    %ecx, %edx
            subl    %edx, %esi
            testb   $1, %al
            cmovne  %esi, %edx
            movl    %edx, %eax
            ret

That output is observed as far back as (at least) GCC 4.8, and all the way up to the current 8.0 preview I have (8.0.0 20170716).

However, when optimizing for size (-Os), the same function produces this output:

    Bounce:
            movl    %edi, %eax
            cltd
            idivl   %esi
            movl    %edx, %ecx
            sarl    $31, %ecx
            xorl    %ecx, %edx
            subl    %ecx, %edx
            testb   $1, %al
            je      .L1
            subl    %edx, %esi
            movl    %esi, %edx
    .L1:
            movl    %edx, %eax
            ret

This defies expectations because it is actually *larger* (more bytes) than the version optimized for speed. The JE instruction in this version is 2 bytes, as is each MOVL instruction, making that section 6 bytes total. In the version optimized for speed, however, the CMOVNE instruction is 3 bytes, plus a 2-byte MOVL, for 5 bytes total. (The SUBL instruction is required either way.)

Now, one byte obviously isn't a big deal in terms of total size, except that the CMOV version is also more performant, so even if these two versions were exactly the same size, it should be used in preference to the branching version!
(The optimizer has no reason to suspect that the quotient of the division will be predictably odd or even, so a non-branching conditional move is the most appropriate choice for the best worst-case performance.)

I notice that this is a regression post-GCC 6.3. In other words, GCC 6.3 generates the same code at -Os as at -O1/-O2/-O3. I don't have GCC 7.0 available, so GCC 7.1 is the earliest version I have that reproduces the described behavior, and it continues to be there, as I said, in GCC 8.

This also is *not* observed when targeting 32-bit x86: there you get conditional moves whenever the target architecture supports them (P6 and later). So this affects only x86-64, where conditional moves are *always* available.