https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
It looks like r15-2820-gab18785840d7b8 has made the case in #c1 vectorize, nice!
But CPUBench uses unsigned types in HADAMARD4:
#if BIT_DEPTH > 8
typedef uint32_t sum_t;
typedef uint64_t sum2_t;
#else
typedef uint16_t sum_t;
typedef uint32_t sum2_t;
#endif
#define BITS_PER_SUM (8 * sizeof(sum_t))
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
sum2_t t0 = s0 + s1;\
sum2_t t1 = s0 - s1;\
sum2_t t2 = s2 + s3;\
sum2_t t3 = s2 - s3;\
d0 = t0 + t2;\
d2 = t0 - t2;\
d1 = t1 + t3;\
d3 = t1 - t3;\
}
GCC still fails to vectorize it if we change the type of t0, t1, t2 and t3 to
unsigned int.
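
For reference, here is a minimal sketch of the unsigned variant described
above. It is not the testcase from #c1: it assumes BIT_DEPTH == 8 (so sum_t is
uint16_t), and the names HADAMARD4_UINT and hadamard4_rows are made up for
illustration. I have not verified that this exact reduction reproduces the
missed vectorization.

```c
#include <stdint.h>

/* Assumption: BIT_DEPTH == 8, so sum_t is uint16_t as in the snippet above. */
typedef uint16_t sum_t;

/* Hypothetical variant of HADAMARD4 whose temporaries are declared as
   unsigned int instead of sum2_t. */
#define HADAMARD4_UINT(d0, d1, d2, d3, s0, s1, s2, s3) {\
    unsigned int t0 = s0 + s1;\
    unsigned int t1 = s0 - s1;\
    unsigned int t2 = s2 + s3;\
    unsigned int t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

/* Hypothetical driver (not from this report): apply the butterfly to each
   row of a 4x4 block. Negative differences wrap modulo 2^32 because the
   temporaries and the outputs are unsigned. */
void hadamard4_rows(unsigned int out[4][4], sum_t in[4][4])
{
    for (int i = 0; i < 4; i++)
        HADAMARD4_UINT(out[i][0], out[i][1], out[i][2], out[i][3],
                       in[i][0], in[i][1], in[i][2], in[i][3]);
}
```

Building something like this with -O3 -fopt-info-vec-missed is one way to see
what the vectorizer reports for the loop.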