https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88152
Bug ID: 88152
Summary: optimize SSE & AVX char compares with subsequent movmskb
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: kretz at kde dot org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Testcase (https://godbolt.org/z/YNPZyf):

#include <x86intrin.h>

template <typename T, size_t N>
using V [[gnu::vector_size(N)]] = T;

// the following should be optimized to:
// vpxor    %xmm1, %xmm1, %xmm1
// vpcmpgtb %[xy]mm0, %[xy]mm1, %[xy]mm0
// ret
auto cmp0(V<unsigned char, 16> a) { return a >  0x7f; }
auto cmp0(V<unsigned char, 32> a) { return a >  0x7f; }
auto cmp1(V<unsigned char, 16> a) { return a >= 0x80; }
auto cmp1(V<unsigned char, 32> a) { return a >= 0x80; }
auto cmp0(V<  signed char, 16> a) { return a <  0; }
auto cmp0(V<  signed char, 32> a) { return a <  0; }
auto cmp1(V<  signed char, 16> a) { return a <= -1; }
auto cmp1(V<  signed char, 32> a) { return a <= -1; }
auto cmp0(V<         char, 16> a) { return a <  0; }
auto cmp0(V<         char, 32> a) { return a <  0; }
auto cmp1(V<         char, 16> a) { return a <= -1; }
auto cmp1(V<         char, 32> a) { return a <= -1; }

// the following should be optimized to:
// vpmovmskb %[xy]mm0, %eax
// ret
int f0(V<unsigned char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >  0x7f)); }
int f0(V<unsigned char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >  0x7f)); }
int f1(V<unsigned char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a >= 0x80)); }
int f1(V<unsigned char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a >= 0x80)); }
int f0(V<  signed char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0)); }
int f0(V<  signed char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0)); }
int f1(V<  signed char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1)); }
int f1(V<  signed char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1)); }
int f0(V<         char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <  0)); }
int f0(V<         char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <  0)); }
int f1(V<         char, 16> a) { return _mm_movemask_epi8   (reinterpret_cast<__m128i>(a <= -1)); }
int f1(V<         char, 32> a) { return _mm256_movemask_epi8(reinterpret_cast<__m256i>(a <= -1)); }

Compile with `-O2 -mavx2` (the issue is the same with -msse2 if you remove the AVX overloads).

Motivation: this pattern is relevant for vectorized UTF-8 decoding, where all bytes with MSB == 0 can simply be zero-extended to UTF-16/32. Such code could just skip the compare and call movemask directly on `a` (see the sketch at the end of this report). However, std::experimental::simd doesn't expose a "munch sign bits into a bitmask integer" function, and no other general-purpose SIMD abstraction should: such a function is too ISA-specific. In the interest of making code readable (and thus maintainable) I strongly believe it should read `n_ascii_chars = find_first_set(a > 0x7f)` while still getting the optimization.

Similar test cases can be constructed for movmskp[sd] after 32/64-bit integer compares.
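Such a movmskp[sd] test case might look like the following. This is a sketch reusing the V alias from the testcase above; the names g0/g1 are placeholders and this variant has not been verified on godbolt like the testcase:

// should become a single movmskps %xmm0, %eax
int g0(V<         int, 16> a) { return _mm_movemask_ps(reinterpret_cast<__m128>(a < 0)); }
int g0(V<unsigned int, 16> a) { return _mm_movemask_ps(reinterpret_cast<__m128>(a > 0x7fffffffu)); }

// should become a single movmskpd %xmm0, %eax
int g1(V<   long long, 16> a) { return _mm_movemask_pd(reinterpret_cast<__m128d>(a < 0)); }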
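To make the UTF-8 motivation concrete, here is a minimal sketch of the ASCII fast path (again reusing the V alias from the testcase; the function names are invented for this report, and `__builtin_ctz` stands in for the `find_first_set` mentioned in the motivation). With the requested optimization, both functions should compile to the same movemask-based code:

// Readable form: the intent (count the leading ASCII bytes) is explicit.
int ascii_prefix_readable(V<unsigned char, 16> a)
{
    const int non_ascii = _mm_movemask_epi8(reinterpret_cast<__m128i>(a > 0x7f));
    return non_ascii == 0 ? 16 : __builtin_ctz(non_ascii);
}

// Hand-tuned form: movemask only reads the sign bits, so the compare
// against 0x7f is redundant and can be skipped.
int ascii_prefix_direct(V<unsigned char, 16> a)
{
    const int non_ascii = _mm_movemask_epi8(reinterpret_cast<__m128i>(a));
    return non_ascii == 0 ? 16 : __builtin_ctz(non_ascii);
}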