On Thu, Aug 11, 2022 at 11:10:34AM +0700, John Naylor wrote:
>> I wonder if reusing a zero vector (instead of creating a new one every
>> time) has any noticeable effect on performance.
>
> Creating a zeroed register is just FOO PXOR FOO, which should get
> hoisted out of the (unrolled in this case) loop, and which a recent
> CPU will just map to a hard-coded zero in the register file, in which
> case the execution latency is 0 cycles. :-)

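For concreteness, the two variants being compared might be sketched roughly
as below.  This is only an illustration, not the code from the patch: the
function names are hypothetical, and it assumes an x86 target with SSE2 and
a buffer length that is a multiple of 16.

```c
#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch: returns true if buf contains no zero bytes and no
 * bytes with the high bit set.  Assumes len is a multiple of 16.
 * The zero vector is created inside the loop on every iteration.
 */
static bool
is_ascii_zero_in_loop(const unsigned char *buf, size_t len)
{
	__m128i		acc = _mm_setzero_si128();	/* accumulator */

	for (size_t i = 0; i < len; i += 16)
	{
		const __m128i zeros = _mm_setzero_si128();
		const __m128i chunk = _mm_loadu_si128((const __m128i *) (buf + i));

		/* a zero byte compares equal and yields 0xFF in that lane */
		acc = _mm_or_si128(acc, _mm_cmpeq_epi8(chunk, zeros));
		/* OR in the data itself so any set high bit is remembered */
		acc = _mm_or_si128(acc, chunk);
	}

	/* movemask gathers the high bit of each byte; nonzero means failure */
	return _mm_movemask_epi8(acc) == 0;
}

/* Same check, with the zero vector created once before the loop. */
static bool
is_ascii_zero_hoisted(const unsigned char *buf, size_t len)
{
	const __m128i zeros = _mm_setzero_si128();
	__m128i		acc = _mm_setzero_si128();

	for (size_t i = 0; i < len; i += 16)
	{
		const __m128i chunk = _mm_loadu_si128((const __m128i *) (buf + i));

		acc = _mm_or_si128(acc, _mm_cmpeq_epi8(chunk, zeros));
		acc = _mm_or_si128(acc, chunk);
	}

	return _mm_movemask_epi8(acc) == 0;
}
```

Both functions should return the same result for any input; whether they
also compile to the same instructions depends on the compiler and the
optimization level.
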
Ah, indeed.  At -O2, my compiler seems to zero out two registers before
the loop with either approach:

	pxor    %xmm0, %xmm0    ; accumulator
	pxor    %xmm2, %xmm2    ; always zeros

And within the loop, I see the following:

	movdqu  (%rdi), %xmm1
	movdqu  (%rdi), %xmm3
	addq    $16, %rdi
	pcmpeqb %xmm2, %xmm1    ; check for zeros
	por     %xmm3, %xmm0    ; OR data into accumulator
	por     %xmm1, %xmm0    ; OR zero check results into accumulator
	cmpq    %rdi, %rsi

So the call to _mm_setzero_si128() within the loop is fine.  Apologies for
the noise.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com