[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 --- Comment #5 from rsaxvc at gmail dot com --- Related ticket requested with Clang: https://github.com/llvm/llvm-project/issues/63731 latest umulh() function is a little shorter: static uint64_t umulh(uint64_t a, uint64_t b) { const uint32_t a_lo = a; const uint32_t a_hi = a >> 32; const uint32_t b_lo = b; const uint32_t b_hi = b >> 32; /* FOIL method of multiplication See https://en.wikipedia.org/wiki/FOIL_method, but instead of binomials with constants a,b,c,d variables x,y: (ax+b) * (cy + d), consider it with variables a,b,c,d, constants x,y = 1<<32 Results in one UMULL or UMLAL(when merged with accumulate below) per multiply*/ const uint64_t acc0 = (uint64_t)a_lo * b_lo; const uint64_t acc1 = (uint64_t)a_hi * b_lo; const uint64_t acc2 = (uint64_t)a_lo * b_hi; const uint64_t acc3 = (uint64_t)a_hi * b_hi; /* Accumulate the results, keeping only the top 64-bits of the 128-bit result*/ uint64_t acc = acc0; acc >>= 32; acc += acc1; acc += acc2; acc >>= 32; acc += acc3; return acc; }
[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 --- Comment #4 from rsaxvc at gmail dot com --- Benchmarking shows the speedup to be highly variable depending on CPU core as well as __aeabi_uldivmod() implementation, and somewhat on numerator. The best __aeabi_uldivmod()s I've seen do use 32bit division instructions when available, and umulh() based approach is only 2-3x faster when division instructions are available. When umull(32x32 with 64bit result) is available and udiv is not available or libc doesn't use it, the umulh() based approach proposed here completes 28-38x faster, on Cortex-M4, measured via GPIO and oscilloscope. The wide variation in relative speed is due to variable execution time of __aeabi_uldivmod(). Similar on ARM11. There's a partial list of some contemporary cores have udiv here: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer it does look like things are headed towards more cores having udiv available.
[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 --- Comment #3 from rsaxvc at gmail dot com --- > 32-bit division by constant relies on a 32x32->64 bit multiply. I believe only the upper half of the bits are required for the product to represent the integer component of the quotient. This makes parameter passing much easier in the uint64_t/constant case. > We don't have an instruction for that on arm nor does gcc support 128-bit > integer types on a 32-bit compiler. I ran into that as well. The attached uint64_t umulh( uint64_t a, uint64_t b ) implements a 64x64->64MSBsOf128BitProduct using only 32x32->64 bit multiplication, 64-bit addition, and register-aligned shifting.
[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 --- Comment #2 from Richard Earnshaw --- 32-bit division by constant relies on a 32x32->64 bit multiply. So it seems reasonable that a 64-bit division would need a 64x64->128 bit multiply. We don't have an instruction for that on arm nor does gcc support 128-bit integer types on a 32-bit compiler.
[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 Andrew Pinski changed: What|Removed |Added Severity|normal |enhancement
[Bug target/106484] Failure to optimize uint64_t/constant division on ARM32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484 rsaxvc at gmail dot com changed: What|Removed |Added CC||rsaxvc at gmail dot com --- Comment #1 from rsaxvc at gmail dot com --- Created attachment 53388 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53388&action=edit example umulh implementation C implementation of umulh which can be used to optimize uint64_t/constant division on 32-bit processors without a umulh instruction but with a 32x32=64bit multiplier.