On 8/1/23 00:47, Robin Dapp via Gcc-patches wrote:
  I'm not against continuing with the more well-known approach for now
  but we should keep in mind that there might still be potential for
  improvement.

No. I don't think it's faster.

I did a quick check on my x86 laptop and it's roughly 25% faster there.
That's consistent with the literature.  RISC-V qemu only shows 5-10%
improvement, though.

I have no idea. I saw ARM SVE generate:
POP_COUNT
POP_COUNT
VEC_PACK_TRUNC.

I'd strongly suspect this happens because it's converting to int.
If you change dst to uint64_t there won't be any vec_pack_trunc.
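To illustrate the point about the destination type, here is a minimal
sketch. It is not the test from the patch: the function names are
hypothetical, and it assumes the source elements are 64-bit. With an int
destination the vectorizer has to narrow the 64-bit popcount results,
which is where a vec_pack_trunc would be expected; with a uint64_t
destination the element width stays the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 64-bit popcount stored into a 32-bit int.
   The result must be narrowed from 64 to 32 bits per element, so a
   vectorized loop needs a truncating pack step.  */
void
popcount_to_int (int *restrict dst, const uint64_t *restrict src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = __builtin_popcountll (src[i]);
}

/* Hypothetical example: same loop, but dst has the same element width
   as src, so no narrowing of the result is required.  */
void
popcount_to_u64 (uint64_t *restrict dst, const uint64_t *restrict src,
                 size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = __builtin_popcountll (src[i]);
}
```

Comparing the vectorizer dumps for the two functions (for example with
-O3 -fdump-tree-vect-details) should show whether the pack step
disappears on a given target.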

I am gonna drop this patch since it's meaningless.

But why?  It can still help even if we can improve on the sequence.
IMHO you can go ahead with it and just change int -> uint64_t in the
tests.

It'd also be interesting to see if those popcounts in deepsjeng are
vectorizable.  We got a major boost in deepsjeng at a prior employer,
but I can't remember if it was from getting the popcounts vectorized or
just not doing stupid stuff with them on the scalar side.


jeff
