[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2023-07-08 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #5 from rsaxvc at gmail dot com ---
Related ticket requested with Clang:
https://github.com/llvm/llvm-project/issues/63731

latest umulh() function is a little shorter:

static uint64_t umulh(uint64_t a, uint64_t b)
{
const uint32_t a_lo = a;
const uint32_t a_hi = a >> 32;
const uint32_t b_lo = b;
const uint32_t b_hi = b >> 32;

/* FOIL method of multiplication
See https://en.wikipedia.org/wiki/FOIL_method,
but instead of binomials with constants a,b,c,d variables x,y: (ax+b) * (cy +
d),
consider it with variables a,b,c,d, constants x,y = 1<<32
Results in one UMULL or UMLAL(when merged with accumulate below) per multiply*/
const uint64_t acc0 = (uint64_t)a_lo * b_lo;
const uint64_t acc1 = (uint64_t)a_hi * b_lo;
const uint64_t acc2 = (uint64_t)a_lo * b_hi;
const uint64_t acc3 = (uint64_t)a_hi * b_hi;

/* Accumulate the results, keeping only the top 64-bits of the 128-bit result*/
uint64_t acc = acc0;
acc >>= 32;
acc += acc1;
acc += acc2;
acc >>= 32;
acc += acc3;

return acc;
}

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2023-05-04 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #4 from rsaxvc at gmail dot com ---
Benchmarking shows the speedup to be highly variable depending on CPU core as
well as __aeabi_uldivmod() implementation, and somewhat on numerator.

The best __aeabi_uldivmod()s I've seen do use 32bit division instructions when
available, and umulh() based approach is only 2-3x faster when division
instructions are available.

When umull(32x32 with 64bit result) is available and udiv is not available or
libc doesn't use it, the umulh() based approach proposed here completes 28-38x
faster, on Cortex-M4, measured via GPIO and oscilloscope. The wide variation in
relative speed is due to variable execution time of __aeabi_uldivmod(). Similar
on ARM11.

There's a partial list of some contemporary cores have udiv here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer
it does look like things are headed towards more cores having udiv available.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-08-01 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #3 from rsaxvc at gmail dot com ---
> 32-bit division by constant relies on a 32x32->64 bit multiply.

I believe only the upper half of the bits are required for the product to
represent the integer component of the quotient. This makes parameter passing
much easier in the uint64_t/constant case.

> We don't have an instruction for that on arm nor does gcc support 128-bit 
> integer types on a 32-bit compiler.

I ran into that as well. The attached uint64_t umulh( uint64_t a, uint64_t b )
implements a 64x64->64MSBsOf128BitProduct using only 32x32->64 bit
multiplication, 64-bit addition, and register-aligned shifting.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-08-01 Thread rearnsha at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #2 from Richard Earnshaw  ---
32-bit division by constant relies on a 32x32->64 bit multiply.  So it seems
reasonable that a 64-bit division would need a 64x64->128 bit multiply.  We
don't have an instruction for that on arm nor does gcc support 128-bit integer
types on a 32-bit compiler.

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-07-30 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

Andrew Pinski  changed:

   What|Removed |Added

   Severity|normal  |enhancement

[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32

2022-07-30 Thread rsaxvc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

rsaxvc at gmail dot com changed:

   What|Removed |Added

 CC||rsaxvc at gmail dot com

--- Comment #1 from rsaxvc at gmail dot com ---
Created attachment 53388
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53388&action=edit
example umulh implementation

C implementation of umulh which can be used to optimize uint64_t/constant
division on 32-bit processors without a umulh instruction but with a
32x32=64bit multiplier.