On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
> The main issue I saw was that clang was able to peel off the first
> iteration of the loop and then eliminate the mask assignment and
> replace the masked load with a memory operand for vpopcnt. I was not
> able to convince gcc to do that regardless of optimization options.
> Generated code for the inner loop:
>
> clang:
> <L2>:
>   50: add       rdx, 64
>   54: cmp       rdx, rdi
>   57: jae       <L1>
>   59: vpopcntq  zmm1, zmmword ptr [rdx]
>   5f: vpaddq    zmm0, zmm1, zmm0
>   65: jmp       <L2>
>
> gcc:
> <L1>:
>   38: kmovq     k1, rdx
>   3d: vmovdqu8  zmm0 {k1} {z}, zmmword ptr [rax]
>   43: add       rax, 64
>   47: mov       rdx, -1
>   4e: vpopcntq  zmm0, zmm0
>   54: vpaddq    zmm0, zmm0, zmm1
>   5a: vmovdqa64 zmm1, zmm0
>   60: cmp       rax, rsi
>   63: jb        <L1>
>
> I'm not sure how much that matters in practice. Attached is a patch to
> do this manually, giving essentially the same result in gcc. As most
> distro packages are built using gcc, I think it would make sense to
> have the extra code if it gives a noticeable benefit for large cases.
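For anyone following along, the transformation being discussed is roughly this (a portable scalar sketch of the peeling idea, not the actual AVX-512 patch; the function and chunk width here are illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of "peel the first iteration": handle the
 * partial leading chunk once up front, so the hot loop needs no
 * masking.  In the real AVX-512 code the peeled step is the single
 * masked load (vmovdqu8 with k1); here it is just a short scalar loop.
 */
static uint64_t
popcount_words(const uint64_t *buf, size_t nwords)
{
	uint64_t	cnt = 0;
	size_t		i = 0;
	size_t		head = nwords % 4;

	/* peeled first step: the only place a partial chunk is handled */
	for (; i < head; i++)
		cnt += (uint64_t) __builtin_popcountll(buf[i]);

	/*
	 * hot loop: fixed-width steps with no mask, analogous to clang's
	 * inner loop of just vpopcntq + vpaddq
	 */
	for (; i < nwords; i += 4)
		cnt += (uint64_t) __builtin_popcountll(buf[i]) +
			(uint64_t) __builtin_popcountll(buf[i + 1]) +
			(uint64_t) __builtin_popcountll(buf[i + 2]) +
			(uint64_t) __builtin_popcountll(buf[i + 3]);

	return cnt;
}
```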
Yeah, I did see this, but I also wasn't sure if it was worth further
complicating the code.  I can test with and without your fix and see if it
makes any difference in the benchmarks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com