https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484
--- Comment #4 from rsaxvc at gmail dot com --- Benchmarking shows the speedup to be highly variable, depending on the CPU core and the __aeabi_uldivmod() implementation, and to a lesser extent on the numerator. The best __aeabi_uldivmod() implementations I've seen use 32-bit division instructions when available, and in that case the umulh()-based approach is only 2-3x faster. When umull (32x32 multiply with a 64-bit result) is available and udiv is either unavailable or not used by libc, the umulh()-based approach proposed here completes 28-38x faster on Cortex-M4, measured via GPIO and oscilloscope. The wide variation in relative speed is due to the variable execution time of __aeabi_uldivmod(). Results are similar on ARM11. There's a partial list of contemporary cores that have udiv here: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer It does look like things are headed towards more cores having udiv available.
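
For readers unfamiliar with the technique, here is a minimal sketch of what a umulh()-based 64-bit division by a constant can look like when only 32x32->64 multiplies (umull) are available. It is not taken from the proposed patch: the function names umulh64(), div10_umulh() and check_div10() are hypothetical, and the divisor 10 with its reciprocal constant 0xCCCCCCCCCCCCCCCD and shift of 3 is just one commonly used example.

```c
#include <assert.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b, built from four
 * 32x32->64 multiplies (each maps to one umull on 32-bit ARM).
 * Hypothetical helper name, for illustration only. */
static uint64_t umulh64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Sum the middle column; the worst case is exactly 2^64 - 1,
     * so this addition cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* n / 10 without a division instruction or a libc call:
 * multiply by ceil(2^67 / 10) and keep bits 67 and up. */
static uint64_t div10_umulh(uint64_t n)
{
    return umulh64(n, 0xCCCCCCCCCCCCCCCDull) >> 3;
}

/* Compare against the compiler's own 64-bit division, which on
 * 32-bit ARM typically ends up in __aeabi_uldivmod(). */
static void check_div10(uint64_t n)
{
    assert(div10_umulh(n) == n / 10);
}
```

Calling check_div10() over a range of numerators, including values near UINT64_MAX, is a cheap sanity check before timing the two paths against each other.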