https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #11 from Peter Cordes <peter at cordes dot ca> ---
Also, horizontal byte sums are generally best done with  VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
 vpxor    xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80) flip to unsigned 0..255 range
 vpxor    xmm1, xmm1
 vpsadbw  xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
 vmovd    eax, xmm0              # we only wanted the low half anyway
 sub      eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
 ret
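
Roughly the same thing in intrinsics, as a sketch (this is not SIMDe's actual
source, and the helper name is invented):

  #include <emmintrin.h>
  #include <stdint.h>

  /* Sum the low 8 signed bytes of v, the part vaddlv_s8 cares about. */
  static int16_t hsum_low8_s8(__m128i v)
  {
      __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80)); /* to 0..255 */
      __m128i sads    = _mm_sad_epu8(flipped, _mm_setzero_si128());  /* psadbw vs 0 */
      return (int16_t)(_mm_cvtsi128_si32(sads) - 8 * 128);           /* undo the bias */
  }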

This is so much shorter that we'd still be ahead if we generated the vector
constant on the fly instead of loading it.  (3 instructions: vpcmpeqd same,same
/ vpabsb / vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb same,same to saturate
to -128.)
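
In intrinsics, the SSE2 variant of that constant generation would look
something like this (a sketch; compilers already emit pcmpeqd same,same for
_mm_set1_epi32(-1)):

  #include <emmintrin.h>

  /* Materialize set1_epi8(0x80) without a memory load, SSE2 only. */
  static __m128i make_set1_0x80(void)
  {
      __m128i ones = _mm_set1_epi32(-1);       /* pcmpeqd same,same: all-ones */
      __m128i ff00 = _mm_slli_epi16(ones, 8);  /* psllw 8: 0xFF00 in each word */
      return _mm_packs_epi16(ff00, ff00);      /* packsswb: -256 saturates to -128 */
  }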

If we had wanted a full 128-bit (16-byte) vector sum, we'd need

  ...
  vpsadbw ...

  vpshufd  xmm1, xmm0, 0xfe     # shuffle upper 64 bits to the bottom
  vpaddd   xmm0, xmm0, xmm1
  vmovd    eax, xmm0
  sub      eax, 16 * 128

This works efficiently with only SSE2.  Actually with AVX, we should unpack the
top half with VUNPCKHQDQ to save a byte (no immediate operand), since the
non-destructive 3-operand encoding means we don't need PSHUFD's
copy-and-shuffle.
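
As a self-contained SSE2 sketch of that 16-byte sum (hypothetical helper name,
not proposed GCC output):

  #include <emmintrin.h>

  /* Sum all 16 signed bytes of v, SSE2 only. */
  static int hsum_s8x16(__m128i v)
  {
      __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
      __m128i sads    = _mm_sad_epu8(flipped, _mm_setzero_si128());
      __m128i high    = _mm_shuffle_epi32(sads, 0xfe);  /* upper qword sum to bottom */
      /* with AVX, _mm_unpackhi_epi64(sads, sads) gets the shorter vpunpckhqdq */
      __m128i total   = _mm_add_epi32(sads, high);
      return _mm_cvtsi128_si32(total) - 16 * 128;       /* undo the bias */
  }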

Or movd / pextrw / scalar add, but that's more uops: pextrw alone is 2.
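
That alternative, as a sketch (taking the psadbw result computed as above):

  #include <emmintrin.h>

  /* Combine the two psadbw qword sums with scalar ops instead of a shuffle. */
  static int combine_scalar(__m128i sads)
  {
      int lo = _mm_cvtsi128_si32(sads);      /* movd: low-half byte sum */
      int hi = _mm_extract_epi16(sads, 4);   /* pextrw (2 uops): high-half sum */
      return lo + hi - 16 * 128;             /* undo the unsigned-flip bias */
  }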
