https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123560

            Bug ID: 123560
           Summary: Ternary introduces jump too easily, unnecessarily
                    complex assembly
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tobi at gcc dot gnu.org
  Target Milestone: ---

Here's another case where gcc pessimizes a relatively simple ternary.
=======================================

#include <stdint.h>

int bla(int last_digit, uint32_t len) {
    return (last_digit != 0) * 17 + (last_digit == 0) * (len - (len == 1));
}

int bla_ternary(int last_digit, uint32_t len) {
    return (last_digit != 0) ? 17 : (len - (len == 1));
}

uint32_t clamp_len(uint32_t len) {
    return len - (len == 1);
}
===================================
gives https://godbolt.org/z/nnGcP59ca :
==================================
bla(int, unsigned int):
        mov     eax, edi
        neg     eax
        sbb     eax, eax
        xor     edx, edx
        and     eax, 17
        cmp     esi, 1
        sete    dl
        sub     esi, edx
        xor     edx, edx
        test    edi, edi
        cmovne  esi, edx
        add     eax, esi
        ret
bla_ternary(int, unsigned int):
        mov     eax, 17
        test    edi, edi
        jne     .L6
        xor     edx, edx
        cmp     esi, 1
        mov     eax, esi
        sete    dl
        sub     eax, edx
.L6:
        ret
clamp_len(unsigned int):
        xor     edx, edx
        cmp     edi, 1
        mov     eax, edi
        sete    dl
        sub     eax, edx
        ret
===========================

bla_ternary introduces a jump where a conditional move would do, which has a
measurable performance impact.  Note also that the "multiply by bool" trick
used in bla doesn't yield optimal code.  My understanding is that gcc doesn't
convert a ternary to a conditional move if one of the branches contains a
non-trivial amount of computation, which is why I extracted one branch of the
ternary into clamp_len.  Indeed, it turns out that the assembly for clamp_len
alone is overly complicated.  In general, looking at the assembly for this kind of
branchless programming, I find that gcc often chooses "set[cond] + add/sub"
where "cmov[cond]" would lead to shorter code.  While the former may be faster
on some microarchitectures, the added complexity then pessimizes other
optimization passes.
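
For comparison, here is a minimal sketch of the same computation written as
plain selects, which is the source-level shape that maps most directly onto
"cmov[cond]".  The names clamp_len_select, bla_select and count_mismatches are
hypothetical and not part of the original test case.  The sketch relies only
on the fact that len - (len == 1) equals 0 when len is 1 and len otherwise.
It makes no claim about what gcc actually emits for these variants; that would
have to be checked separately.

```c
#include <stddef.h>
#include <stdint.h>

/* Functions from the report, unchanged. */
int bla(int last_digit, uint32_t len) {
    return (last_digit != 0) * 17 + (last_digit == 0) * (len - (len == 1));
}

int bla_ternary(int last_digit, uint32_t len) {
    return (last_digit != 0) ? 17 : (len - (len == 1));
}

uint32_t clamp_len(uint32_t len) {
    return len - (len == 1);
}

/* Hypothetical variant: clamp_len expressed as a select instead of
 * "set[cond] + sub".  Same result for every len. */
uint32_t clamp_len_select(uint32_t len) {
    return (len == 1) ? 0 : len;
}

/* Hypothetical variant: compute the cheap candidate unconditionally,
 * then pick between it and the constant 17. */
int bla_select(int last_digit, uint32_t len) {
    uint32_t clamped = (len == 1) ? 0 : len;
    return (last_digit != 0) ? 17 : (int)clamped;
}

/* Hypothetical helper: counts inputs on which the variants disagree
 * with the functions from the report.  Returns 0 if all agree. */
int count_mismatches(void) {
    static const uint32_t lens[] = {
        0u, 1u, 2u, 3u, 17u, 0x7fffffffu, 0x80000000u, 0xffffffffu
    };
    static const int digits[] = {0, 1, -1, 9};
    int mismatches = 0;

    for (size_t i = 0; i < sizeof lens / sizeof lens[0]; i++) {
        if (clamp_len_select(lens[i]) != clamp_len(lens[i]))
            mismatches++;
        for (size_t j = 0; j < sizeof digits / sizeof digits[0]; j++) {
            int expected = bla_ternary(digits[j], lens[i]);
            if (bla(digits[j], lens[i]) != expected)
                mismatches++;
            if (bla_select(digits[j], lens[i]) != expected)
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small driver is a quick way to confirm that
all variants are interchangeable before comparing their assembly.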
