On Fri, 5 Apr 2024 at 07:15, Nathan Bossart <nathandboss...@gmail.com> wrote: > Here is an updated patch set. IMHO this is in decent shape and is > approaching committable.
I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and clang-5. The main issue I saw was that clang was able to peel off the first iteration of the loop and then eliminate the mask assignment and replace masked load with a memory operand for vpopcnt. I was not able to convince gcc to do that regardless of optimization options. Generated code for the inner loop: clang: <L2>: 50: add rdx, 64 54: cmp rdx, rdi 57: jae <L1> 59: vpopcntq zmm1, zmmword ptr [rdx] 5f: vpaddq zmm0, zmm1, zmm0 65: jmp <L2> gcc: <L1>: 38: kmovq k1, rdx 3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax] 43: add rax, 64 47: mov rdx, -1 4e: vpopcntq zmm0, zmm0 54: vpaddq zmm0, zmm0, zmm1 5a: vmovdqa64 zmm1, zmm0 60: cmp rax, rsi 63: jb <L1> I'm not sure how much that matters in practice. Attached is a patch to do this manually giving essentially the same result in gcc. As most distro packages are built using gcc I think it would make sense to have the extra code if it gives a noticeable benefit for large cases. The visibility map patch has the same issue, otherwise looks good. Regards, Ants Aasma
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c index dacc7553d29..f6e718b86e9 100644 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_avx512.c @@ -52,13 +52,21 @@ pg_popcount_avx512(const char *buf, int bytes) * Iterate through all but the final iteration. Starting from second * iteration, the start index mask is ignored. */ - for (; buf < final; buf += sizeof(__m512i)) + if (buf < final) { val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); cnt = _mm512_popcnt_epi64(val); accum = _mm512_add_epi64(accum, cnt); + buf += sizeof(__m512i); mask = ~UINT64CONST(0); + + for (; buf < final; buf += sizeof(__m512i)) + { + val = _mm512_load_si512((const __m512i *) buf); + cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + } } /* Final iteration needs to ignore bytes that are not within the length */