https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
--- Comment #11 from Peter Cordes <peter at cordes dot ca> ---
Also, horizontal byte sums are generally best done with VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
        vpxor   xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80): flip to unsigned 0..255 range
        vpxor   xmm1, xmm1, xmm1       # zero vector (VEX encoding needs all 3 operands)
        vpsadbw xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
        vmovd   eax, xmm0              # we only wanted the low half anyway
        sub     eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
        ret

This is so much shorter that we'd still be ahead even if we generated the
vector constant on the fly instead of loading it.  (3 instructions:
vpcmpeqd same,same / vpabsb / vpslld by 7.  Or pcmpeqd / psllw 8 /
packsswb same,same to saturate to -128.)

If we had wanted a 128-bit (16-byte) vector sum, we'd need:

        ...
        vpsadbw ...
        vpshufd xmm1, xmm0, 0xfe       # shuffle the upper 64 bits to the bottom
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        sub     eax, 16 * 128

That works efficiently with only SSE2.  Actually with AVX, we should unpack
the top half with VUNPCKHQDQ instead, to save a byte (no immediate operand):
the non-destructive VEX encoding means we don't need PSHUFD as a
copy-and-shuffle.  Or movd / pextrw / scalar add, but that's more uops:
pextrw is 2 on its own.