http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041

--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 30382
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30382&action=edit
gcc49-pr36041.patch

Untested libgcc2.c implementation (no hw support).  HW support is IMHO best
dealt on the compiler side doing something, already a PLT call is fairly
expensive, but it depends if __builtin_popcount* is used in a hot loop or in
cold code (in the latter case it really doesn't matter).

I've looked at code generated for:
int
foo (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return (i * 0x101010101010101) >> 56;
}

int
bar (unsigned long long i)
{
  unsigned int i1 = i, i2 = i >> 32;
  i1 = i1 - ((i1 >> 1) & 0x55555555);
  i2 = i2 - ((i2 >> 1) & 0x55555555);
  i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333);
  i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333);
  i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F;
  i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F;
  return ((i1 + i2) * 0x1010101) >> 24;
}

int
baz (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24;
}
on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype
implementation, I guess that is likely the case on most targets.  Note that the
patch still doesn't attempt to figure out if UWtype multiplication is expensive
or not, perhaps a useful test would be whether we emit __umul<mode>3 for that
mode inside of libgcc - we'd need to define some macro and if multiplication is
expensive use the shifts + additions alternative instead.

Reply via email to