https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91201
--- Comment #6 from Joel Yliluoma <bisqwit at iki dot fi> ---

Maybe "horizontal checksum" is a somewhat obscure term. An 8-bit checksum is
what is being accomplished, nonetheless. Yes, there are simpler ways to do it…
But I tried a number of different approaches in order to get maximum
performance SIMD code out of GCC, and I came upon this curious case that I
filed this bug report about.

To another compiler, I reported a related bug concerning code that looks like
this:

  #include <string.h>

  unsigned char calculate_checksum(const void* ptr)
  {
      unsigned char bytes[16], result = 0;
      memcpy(bytes, ptr, 16);
      // The memcpy is there because, to my knowledge, it is
      // the only _safe_ way permitted by the standard to do
      // conversions between representations.
      // Union, pointer casting, etc. are not safe.
      for(unsigned n=0; n<16; ++n) result += bytes[n];
      return result;
  }

After my report, their compiler now generates:

        vmovdqu xmm0, xmmword ptr [rdi]
        vpshufd xmm1, xmm0, 78          # xmm1 = xmm0[2,3,0,1]
        vpaddb  xmm0, xmm0, xmm1
        vpxor   xmm1, xmm1, xmm1
        vpsadbw xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

This is what GCC generates for the same code:

        vmovdqu xmm0, XMMWORD PTR [rdi]
        vpsrldq xmm1, xmm0, 8
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 4
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 2
        vpaddb  xmm0, xmm0, xmm1
        vpsrldq xmm1, xmm0, 1
        vpaddb  xmm0, xmm0, xmm1
        vpextrb eax, xmm0, 0
        ret

So the bottom line is, (v)psadbw reductions should be added, as M. Glisse
correctly indicated.
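
For reference, the (v)psadbw reduction can also be written by hand with SSE2 intrinsics. The following is only a minimal sketch of the idea, not what either compiler emits: the function name checksum_psadbw is made up for illustration, and it applies psadbw first and folds the two halves afterwards, whereas the other compiler's output above folds with vpshufd/vpaddb before the vpsadbw. Both orders give the same 8-bit result, since the truncation to a byte happens at the end.

```c
#include <emmintrin.h>  /* SSE2: _mm_sad_epu8 and friends */

/* Hypothetical helper (name is an assumption): sums 16 bytes modulo 256
   using a psadbw-based horizontal reduction. */
unsigned char checksum_psadbw(const void* ptr)
{
    /* Unaligned 16-byte load; no alignment requirement on ptr. */
    __m128i v = _mm_loadu_si128((const __m128i*)ptr);

    /* psadbw against zero yields two partial sums:
       bytes 0..7 in bits 0..15, bytes 8..15 in bits 64..79. */
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());

    /* Move the upper partial sum down and add it to the lower one.
       The total is at most 16*255 = 4080, so 16 bits are enough. */
    __m128i hi  = _mm_srli_si128(sad, 8);
    __m128i sum = _mm_add_epi16(sad, hi);

    /* Keep only the low 8 bits, matching the unsigned char accumulator. */
    return (unsigned char)_mm_cvtsi128_si32(sum);
}
```

On x86-64 this should compile with plain gcc -O2, since SSE2 is part of the baseline there; for any 16-byte buffer it is expected to return the same value as calculate_checksum() above.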