https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96906

            Bug ID: 96906
           Summary: Failure to optimize __builtin_ia32_psubusw128 compared
                    to 0 to __builtin_ia32_pminuw128 compared to operand
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: gabravier at gmail dot com
  Target Milestone: ---

#include <stdint.h>

typedef int16_t v8i16 __attribute__((vector_size(16)));

v8i16 cmple_epu16(v8i16 x, v8i16 y)
{
        return __builtin_ia32_psubusw128(x, y) == 0;
}

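With -msse4.1, this can be optimized to `return __builtin_ia32_pminuw128(x, y)
== x;`. Both forms test unsigned x <= y in each lane: the saturating
subtraction yields 0 exactly when x <= y, and the unsigned minimum equals x in
exactly the same case. LLVM performs this transformation, but GCC does not.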
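
The equivalence can be checked with a scalar model of a single 16-bit lane.
This is only a sketch: `sat_sub_u16`, `min_u16`, `forms_agree_for_x` and
`self_check` are hypothetical helper names, not GCC builtins. They model what
`psubusw` and `pminuw` compute per lane rather than calling the instructions,
and the comparison results are 1/0 here where the vector comparison produces
-1/0 per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one psubusw lane: unsigned subtract, saturating at 0. */
static uint16_t sat_sub_u16(uint16_t x, uint16_t y)
{
    return x > y ? (uint16_t)(x - y) : 0;
}

/* Scalar model of one pminuw lane: unsigned minimum. */
static uint16_t min_u16(uint16_t x, uint16_t y)
{
    return x < y ? x : y;
}

/* Returns 1 if "sat_sub(x, y) == 0" and "min(x, y) == x" agree for
   every 16-bit y at the given x, and 0 otherwise. */
static int forms_agree_for_x(uint16_t x)
{
    for (uint32_t y = 0; y <= UINT16_MAX; y++) {
        int via_sub = sat_sub_u16(x, (uint16_t)y) == 0;
        int via_min = min_u16(x, (uint16_t)y) == x;
        if (via_sub != via_min)
            return 0;
    }
    return 1;
}

/* Spot-check a few boundary values of x against all values of y. */
static void self_check(void)
{
    assert(forms_agree_for_x(0));
    assert(forms_agree_for_x(1));
    assert(forms_agree_for_x(0x7fff));
    assert(forms_agree_for_x(0x8000));
    assert(forms_agree_for_x(UINT16_MAX));
}
```

Calling `self_check()` from a test program exercises the boundary cases.
Looping `forms_agree_for_x` over every x would cover all 2^32 input pairs, at
the cost of a longer run.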

PS: I'm not 100% sure this is faster, but it logically should be, since the
`pminuw` version doesn't need to zero an SSE register to compare against.
