http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> --- Created attachment 30382 --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30382&action=edit gcc49-pr36041.patch Untested libgcc2.c implementation (no hw support). HW support is IMHO best dealt on the compiler side doing something, already a PLT call is fairly expensive, but it depends if __builtin_popcount* is used in a hot loop or in cold code (in the latter case it really doesn't matter). I've looked at code generated for: int foo (unsigned long long i) { i = i - ((i >> 1) & 0x5555555555555555); i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333); i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F; return (i * 0x101010101010101) >> 56; } int bar (unsigned long long i) { unsigned int i1 = i, i2 = i >> 32; i1 = i1 - ((i1 >> 1) & 0x55555555); i2 = i2 - ((i2 >> 1) & 0x55555555); i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333); i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333); i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F; i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F; return ((i1 + i2) * 0x1010101) >> 24; } int baz (unsigned long long i) { i = i - ((i >> 1) & 0x5555555555555555); i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333); i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F; return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24; } on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype implementation, I guess that is likely the case on most targets. Note that the patch still doesn't attempt to figure out if UWtype multiplication is expensive or not, perhaps a useful test would be whether we emit __umul<mode>3 for that mode inside of libgcc - we'd need to define some macro and if multiplication is expensive use the shifts + additions alternative instead.