https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104394

Andrew Pinski <pinskia at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
           Severity|normal                      |enhancement
   Last reconfirmed|                            |2022-02-05
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
We are missing a lot of vector vs scalar optimizations.
There are two optimizations we are missing here.
The first is:
#define bit (1u<<3)
typedef int32_t v4i32 __attribute__((vector_size(16)));

v4i32 get_cmpmask(v4i32 mask) {
    v4i32 signmask{(int32_t)bit, (int32_t)bit, (int32_t)bit, (int32_t)bit};
    return ((signmask & mask) == signmask);
}

This is not optimized to -((mask>>log2(bit))&1), which is similar to:
        pslld   xmm0, 28
        psrad   xmm0, 31
on x86_64.

Here is a full example which shows what needs to be done:
#include <stdint.h>
#define bit 3
typedef int32_t v4i32 __attribute__((vector_size(16)));

v4i32 get_cmpmask(v4i32 mask) {
    v4i32 signmask{(int32_t)1u<<bit, (int32_t)1u<<bit, (int32_t)1u<<bit,
                   (int32_t)1u<<bit};
    return ((signmask & mask) == signmask);
}

v4i32 get_cmpmask1(v4i32 mask) {
    mask >>= bit;
    mask &= 1;
    mask = -mask;
    return mask;
}

v4i32 get_cmpmask2(v4i32 mask) {
    mask <<= 31-bit;
    mask >>= 31;
    return mask;
}
---- CUT ---
Note that clang does not optimize get_cmpmask1 either.

GCC does get the scalar version correct though (but not at the gimple level):
int get_cmpmasks(int mask) {
    mask >>= bit;
    mask &= 1;
    mask = -mask;
    return mask;
}
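
The equivalence the comment relies on can be checked lane by lane. The following is only a sketch, assuming a compiler with the GCC vector extension (g++ or clang++). The names kBit, v4u32, cmpmask_compare, cmpmask_negate, cmpmask_shifts, forms_agree and self_check are hypothetical and do not appear in the report. The left shift goes through an unsigned vector type, which is an assumption made here to avoid signed overflow, not something the report does.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang vector extension types; v4u32 is an addition for this sketch.
typedef int32_t  v4i32 __attribute__((vector_size(16)));
typedef uint32_t v4u32 __attribute__((vector_size(16)));

constexpr int kBit = 3;  // hypothetical name; the report uses the macro `bit`

// Form 1: AND with the single-bit mask, then compare against it.
// Vector comparison yields -1 in lanes where it holds and 0 elsewhere.
inline v4i32 cmpmask_compare(v4i32 mask) {
    const int32_t b = int32_t(1u << kBit);
    v4i32 signmask = {b, b, b, b};
    return (signmask & mask) == signmask;
}

// Form 2: move the tested bit down to bit 0, isolate it, negate.
inline v4i32 cmpmask_negate(v4i32 mask) {
    mask >>= kBit;
    mask &= 1;
    return -mask;
}

// Form 3: move the tested bit into the sign position, then use an
// arithmetic right shift to broadcast it across the lane.
inline v4i32 cmpmask_shifts(v4i32 mask) {
    v4u32 u = (v4u32)mask << (31 - kBit);  // unsigned shift: no signed overflow
    return (v4i32)u >> 31;                 // signed lanes: arithmetic shift
}

// Returns true if all three forms agree in every lane of `mask`.
inline bool forms_agree(v4i32 mask) {
    v4i32 a = cmpmask_compare(mask);
    v4i32 b = cmpmask_negate(mask);
    v4i32 c = cmpmask_shifts(mask);
    for (int i = 0; i < 4; ++i)
        if (a[i] != b[i] || a[i] != c[i]) return false;
    return true;
}

// Small spot check: bit 3 is set in 8 and -1, clear in 0 and 7.
inline void self_check() {
    v4i32 sample = {0, 8, -1, 7};
    assert(forms_agree(sample));
}
```

Calling self_check() from a test driver exercises one sample. Comparing the assembly of the three functions at -O2 shows whether the compiler in use already canonicalizes them to the same shift pair.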