On Fri, Dec 15, 2023 at 09:51:10PM -0800, Andrew Pinski wrote:
> I was looking into improving __builtin_popcountg for __int128 on
> aarch64 (when CSSC is not implemented which right now is almost all
> cores) but this patch forces __builtin_popcountg to expand into 2
> __builtin_popcountll (and add) before it could optimize into an
> internal function for the popcount and give the backend a chance to
> implement something better.
> This is due to the code in fold_builtin_bit_query, what might be the
> best way of disabling that for this case?
> 
> Basically right now popcount is implemented using the SIMD instruction
> cnt which can be used either 8x1 or 16x1 wide. Using the 16x1 improves
> both the code size and performance (on almost all cores I know of). So
> instead of 2 cnt instructions, we only would need one.

The reason for lowering those 2 * wordsize cases early is that there
is no __builtin_{clz,ctz,clrsb,ffs,parity,popcount}* for those cases (so we
can't expect expansion to fall back to, say, libgcc routines) and
IFN_{CLZ,CTZ,CLRSB,FFS,PARITY,POPCOUNT} are still direct optab ifns
(now with the extension that large/huge _BitInt is ok for those as operands
because we are guaranteed to lower that during bitint lowering).
Anything else won't make it through the direct optab checks and won't be
guaranteed to expand.
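
To illustrate, the early lowering of the 2 * wordsize popcount case is
roughly equivalent to the following. This is only a sketch, not the actual
fold_builtin_bit_query output; the function name popcount128_lowered is made
up for the example, and it assumes a 64-bit target where unsigned long long
is 64 bits wide and unsigned __int128 is available.

```c
#include <stdint.h>

/* Hypothetical helper, not GCC source: split a 128-bit value into two
   64-bit halves and sum their popcounts, the way the early lowering
   turns one double-word popcount into 2 __builtin_popcountll and an add. */
static int
popcount128_lowered (unsigned __int128 x)
{
  uint64_t lo = (uint64_t) x;          /* low 64 bits */
  uint64_t hi = (uint64_t) (x >> 64);  /* high 64 bits */
  return __builtin_popcountll (hi) + __builtin_popcountll (lo);
}
```

Each half is a word-sized operation, so it either has a direct optab or can
fall back to a library routine, which is what makes expansion guaranteed.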

You can always define optabs for those and handle them in md files if it
results in better code.

        Jakub
