https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90483

            Bug ID: 90483
           Summary: input to ptest not optimized
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

The (V)PTEST instruction of SSE4.1/AVX produces ZF = `(a & b) == 0` and CF =
`(~a & b) == 0`. Generic usage of PTEST simply sets `b = ~__m128i()` (or
`~__m256i()`), i.e. tests `a` and `~a` for having only zero bits. (cf.
_mm_test_all_ones)
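
To make the flag semantics concrete, here is a scalar model of the two flags. It is only a sketch: `Vec128`, `PtestFlags` and `ptest_model` are made-up names, and two 64-bit halves stand in for an `__m128i`.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector register: two 64-bit halves.
using Vec128 = std::array<std::uint64_t, 2>;

// Hypothetical result type holding the two flags PTEST writes.
struct PtestFlags {
  bool zf;  // set when (a & b) has no bit set
  bool cf;  // set when (~a & b) has no bit set
};

// Scalar model of "PTEST a, b" as described above (not the hardware instruction).
inline PtestFlags ptest_model(const Vec128& a, const Vec128& b) {
  const bool zf = ((a[0] & b[0]) | (a[1] & b[1])) == 0;
  const bool cf = ((~a[0] & b[0]) | (~a[1] & b[1])) == 0;
  return PtestFlags{zf, cf};
}
```

With `b` all ones, `zf` says that `a` is all zeros and `cf` says that `a` is all ones.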

Consequently, if `a` is the result of a vector comparison which only depends on
a bitmask, the compare instruction can be elided and the `~__m128i()` mask
replaced with the corresponding bitmask.

Examples:

// test sign bit
bool bad(__v16qu x) {
  return __builtin_ia32_ptestz128(~__v16qu(), x > 0x7f);
}

Since x > 0x7f can be rewritten as a test for the sign bit, we can optimize to
(with 0x808080... at LC0):
        vptest .LC0(%rip), %xmm0
        sete %al
        ret
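
Until the compiler does this itself, the same thing can be spelled by hand with intrinsics. A sketch, assuming GCC or Clang on x86 (the `SSE41` macro and both function names are made up here; the signed compare `x < 0` stands in for the unsigned `x > 0x7f`):

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang target attribute, so no -msse4.1 is needed on the command line.
#define SSE41 __attribute__((target("sse4.1")))

// As written in the report: build the compare mask, then PTEST it against all-ones.
SSE41 bool no_sign_bit_naive(__m128i x) {
  // For bytes, signed x < 0 is the same predicate as unsigned x > 0x7f.
  const __m128i mask = _mm_cmplt_epi8(x, _mm_setzero_si128());
  return _mm_testz_si128(_mm_set1_epi8(-1), mask) != 0;
}

// Hand-optimized: PTEST x directly against 0x80 in every byte, no compare.
SSE41 bool no_sign_bit_direct(__m128i x) {
  return _mm_testz_si128(_mm_set1_epi8(-128), x) != 0;
}
```

Both return true exactly when no byte of `x` has its sign bit set.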

// test for zero
bool bad2(__v16qu x) {
  return __builtin_ia32_ptestc128(x == 0, ~__v16qu());
}

This is equivalent to testing scalars for 0, i.e. we can optimize to:
        vptest %xmm0, %xmm0
        sete %al
        ret
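
A hand-written version for comparison. This is a sketch assuming GCC or Clang on x86 and assuming the intended predicate is "every element of x is zero", which is what `vptest %xmm0, %xmm0; sete` computes. The mask form therefore has to check the compare result for all ones (CF), not for all zeros. The `SSE41` macro and the function names are made up.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang target attribute, so no -msse4.1 is needed on the command line.
#define SSE41 __attribute__((target("sse4.1")))

// Mask form: compare every byte with 0, then require the mask to be all ones (CF).
SSE41 bool all_zero_naive(__m128i x) {
  const __m128i mask = _mm_cmpeq_epi8(x, _mm_setzero_si128());
  return _mm_testc_si128(mask, _mm_set1_epi8(-1)) != 0;
}

// Hand-optimized: PTEST x, x sets ZF exactly when x has no bit set.
SSE41 bool all_zero_direct(__m128i x) {
  return _mm_testz_si128(x, x) != 0;
}
```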

// test for certain bits
bool bad3(__v16qu x, __v16qu k) {
  return __builtin_ia32_ptestc128((x & k) == 0, ~__v16qu());
}

With the above transformation we already get PTEST(x&k, x&k) which can
consequently be reduced to PTEST(x, k):
        vptest %xmm0, %xmm1
        sete %al
        ret
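
The same reduction by hand, as a sketch assuming GCC or Clang on x86 and assuming the intended predicate is "x and k share no set bit", which is what `vptest %xmm0, %xmm1; sete` computes. The `SSE41` macro and the function names are made up.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang target attribute, so no -msse4.1 is needed on the command line.
#define SSE41 __attribute__((target("sse4.1")))

// Mask form: AND, compare every byte with 0, then require an all-ones mask (CF).
SSE41 bool disjoint_naive(__m128i x, __m128i k) {
  const __m128i mask = _mm_cmpeq_epi8(_mm_and_si128(x, k), _mm_setzero_si128());
  return _mm_testc_si128(mask, _mm_set1_epi8(-1)) != 0;
}

// Hand-optimized: PTEST already ANDs its operands, so ZF answers the question.
SSE41 bool disjoint_direct(__m128i x, __m128i k) {
  return _mm_testz_si128(x, k) != 0;
}
```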

Further optimization of e.g. `(x & ~k) == 0` using CF instead of ZF might also
be interesting.
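
A sketch of what the CF variant could look like when written by hand, assuming GCC or Clang on x86. Since CF = `(~a & b) == 0`, passing `k` as the first operand and `x` as the second tests `(x & ~k) == 0` directly. The `SSE41` macro and the function names are made up.

```cpp
#include <immintrin.h>

// Assumption: GCC/Clang target attribute, so no -msse4.1 is needed on the command line.
#define SSE41 __attribute__((target("sse4.1")))

// Mask form: ANDN, compare every byte with 0, then require an all-ones mask.
SSE41 bool subset_naive(__m128i x, __m128i k) {
  const __m128i outside = _mm_andnot_si128(k, x);  // ~k & x
  const __m128i mask = _mm_cmpeq_epi8(outside, _mm_setzero_si128());
  return _mm_testc_si128(mask, _mm_set1_epi8(-1)) != 0;
}

// Hand-optimized: CF of PTEST k, x is set exactly when (~k & x) has no bit set.
SSE41 bool subset_direct(__m128i x, __m128i k) {
  return _mm_testc_si128(k, x) != 0;
}
```

Both return true exactly when every set bit of `x` is also set in `k`.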

And of course, these transformations apply to all vector types, not just
__v16qu.
