https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #15 from Kewen Lin <linkw at gcc dot gnu.org> ---
It looks like r15-2820-gab18785840d7b8 has made the case in #c1 vectorized,
nice! But CPUBench uses unsigned types in HADAMARD4:

#if BIT_DEPTH > 8
typedef uint32_t sum_t;
typedef uint64_t sum2_t;
#else
typedef uint16_t sum_t;
typedef uint32_t sum2_t;
#endif
#define BITS_PER_SUM (8 * sizeof(sum_t))

#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    sum2_t t0 = s0 + s1;\
    sum2_t t1 = s0 - s1;\
    sum2_t t2 = s2 + s3;\
    sum2_t t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

GCC still fails to vectorize it if we change the type of t0, t1, t2 and t3 to
unsigned int.
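
For anyone who wants to try this locally, here is a minimal sketch of a standalone test built around the macro above. It assumes the BIT_DEPTH == 8 branch, so sum2_t is uint32_t (unsigned int on common targets). The driver function hadamard4_rows and its 4x4 array layout are hypothetical, made up only for illustration; they are not taken from CPUBench or from the test case in #c1, so the vectorizer may treat them differently from the real kernel.

```c
#include <stdint.h>

/* Types as in the BIT_DEPTH <= 8 branch quoted above. */
typedef uint16_t sum_t;
typedef uint32_t sum2_t;

/* Macro copied from the comment; the temporaries are unsigned. */
#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
    sum2_t t0 = s0 + s1;\
    sum2_t t1 = s0 - s1;\
    sum2_t t2 = s2 + s3;\
    sum2_t t3 = s2 - s3;\
    d0 = t0 + t2;\
    d2 = t0 - t2;\
    d1 = t1 + t3;\
    d3 = t1 - t3;\
}

/* Hypothetical driver (not from CPUBench): applies HADAMARD4 to each
   row of a 4x4 block.  Unsigned subtraction wraps modulo 2^32. */
void hadamard4_rows(sum2_t out[4][4], sum2_t in[4][4])
{
    for (int i = 0; i < 4; i++) {
        HADAMARD4(out[i][0], out[i][1], out[i][2], out[i][3],
                  in[i][0], in[i][1], in[i][2], in[i][3]);
    }
}
```

Compiling this with something like "gcc -O3 -fopt-info-vec-missed -c" should show what the vectorizer reports for the loop; the exact output depends on the GCC revision and the target.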