https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152

            Bug ID: 88152
           Summary: optimize SSE & AVX char compares with subsequent
                    movmskb
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Testcase (https://godbolt.org/z/YNPZyf):

#include <x86intrin.h>
#include <cstddef>   // std::size_t

template <typename T, std::size_t N>
using V [[gnu::vector_size(N)]] = T;

// the following should be optimized to:
// vpxor %xmm1, %xmm1, %xmm1
// vpcmpgtb %[xy]mm0, %[xy]mm1, %[xy]mm0
// ret

auto cmp0(V<unsigned char, 16> a) { return a > 0x7f; }
auto cmp0(V<unsigned char, 32> a) { return a > 0x7f; }
auto cmp1(V<unsigned char, 16> a) { return a >= 0x80; }
auto cmp1(V<unsigned char, 32> a) { return a >= 0x80; }
auto cmp0(V<  signed char, 16> a) { return a < 0; }
auto cmp0(V<  signed char, 32> a) { return a < 0; }
auto cmp1(V<  signed char, 16> a) { return a <= -1; }
auto cmp1(V<  signed char, 32> a) { return a <= -1; }
auto cmp0(V<         char, 16> a) { return a < 0; }
auto cmp0(V<         char, 32> a) { return a < 0; }
auto cmp1(V<         char, 16> a) { return a <= -1; }
auto cmp1(V<         char, 32> a) { return a <= -1; }

// the following should be optimized to:
// vpmovmskb %[xy]mm0, %eax
// ret

int f0(V<unsigned char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >  0x7f));
}
int f0(V<unsigned char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >  0x7f));
}
int f1(V<unsigned char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >= 0x80));
}
int f1(V<unsigned char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >= 0x80));
}
int f0(V<  signed char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0));
}
int f0(V<  signed char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0));
}
int f1(V<  signed char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1));
}
int f1(V<  signed char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1));
}
int f0(V<         char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0));
}
int f0(V<         char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0));
}
int f1(V<         char, 16> a) {
  return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1));
}
int f1(V<         char, 32> a) {
  return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1));
}

Compile with `-O2 -mavx2` (the same issue shows up with -msse2 alone if you
remove the 32-byte (AVX) overloads).
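
For reference, the generated assembly can be inspected with (assuming the
testcase above is saved as testcase.cpp):

g++ -O2 -mavx2 -S -o - testcase.cpp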

Motivation:
This pattern is relevant for vectorized UTF-8 decoding, where all bytes with
MSB == 0 (i.e. ASCII) can simply be zero-extended to UTF-16/32. Such code
could skip the compare and call movemask directly on `a`. However,
std::experimental::simd does not (and no general-purpose SIMD abstraction
should) expose a "munch the sign bits into a bitmask integer" function; such a
function is too ISA-specific. In the interest of making the code readable (and
thus maintainable) I strongly believe it should read
`n_ascii_chars = find_first_set(a > 0x7f)` while still getting the
optimization.
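
For concreteness, here is a minimal sketch (my own example, not part of the
testcase) of how such UTF-8 code would count the ASCII prefix of one SIMD
chunk using the Parallelism TS v2 std::experimental::simd interface; the names
`chunk` and `leading_ascii` are made up for this sketch. The `a > 0x7f`
compare is exactly the pattern from the testcase above, and find_first_set
needs the movemask that this report asks GCC to emit directly:

#include <experimental/simd>
namespace stdx = std::experimental;

using chunk = stdx::native_simd<unsigned char>;

// Number of leading bytes with MSB == 0, i.e. bytes that can be
// zero-extended to UTF-16/32 without further decoding.
int leading_ascii(const chunk& a) {
  const auto non_ascii = a > 0x7f;         // per-byte mask of the sign bits
  if (stdx::none_of(non_ascii))
    return chunk::size();                  // the whole chunk is ASCII
  return stdx::find_first_set(non_ascii);  // index of the first non-ASCII byte
}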

Similar test cases can be constructed for movmskp[sd] after 32/64-bit integer
compares.
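
For example, a sketch of the 32/64-bit variants (the names g0, g1 and W are
made up here; the expectation would be a single (v)movmskps / (v)movmskpd):

#include <x86intrin.h>
#include <cstddef>

template <typename T, std::size_t N>
using W [[gnu::vector_size(N)]] = T;

// should become: vmovmskps %xmm0, %eax ; ret
int g0(W<int, 16> a) {
  return _mm_movemask_ps(_mm_castsi128_ps(reinterpret_cast<__m128i>(a < 0)));
}

// should become: vmovmskpd %xmm0, %eax ; ret
int g1(W<long long, 16> a) {
  return _mm_movemask_pd(_mm_castsi128_pd(reinterpret_cast<__m128i>(a < 0)));
}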
