https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

--- Comment #4 from rsaxvc at gmail dot com ---
Benchmarking shows the speedup to be highly variable, depending on the CPU
core and the __aeabi_uldivmod() implementation, and to a lesser extent on the
numerator.

The best __aeabi_uldivmod() implementations I've seen use 32-bit division
instructions when available, and in that case the umulh()-based approach is
only 2-3x faster.

When umull (32x32 with a 64-bit result) is available and udiv is either not
available or not used by libc, the umulh()-based approach proposed here
completes 28-38x faster on Cortex-M4, measured via GPIO and oscilloscope. The
wide variation in relative speed is due to the variable execution time of
__aeabi_uldivmod(). Results on ARM11 are similar.
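
For anyone reading this comment on its own, here is a rough sketch of what a
umulh()-based division by a constant can look like in C. The helper names
umulh64() and udiv10() and the divisor 10 are illustrative assumptions, not the
code proposed in this bug. The magic constant 0xCCCCCCCCCCCCCCCD with a shift
of 3 is the well-known reciprocal for unsigned 64-bit division by 10. Each
32x32 -> 64-bit multiply below is the kind of operation a compiler would be
expected to map to a single umull.

```c
#include <stdint.h>

/* Hypothetical helper: high 64 bits of the 128-bit product a * b,
   built from four 32x32 -> 64-bit unsigned multiplies. */
static uint64_t umulh64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Middle column plus the carry out of the low column.
       The largest possible sum is exactly 2^64 - 1, so this cannot wrap. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Hypothetical example: n / 10 without calling __aeabi_uldivmod().
   0xCCCCCCCCCCCCCCCD is ceil(2^67 / 10), so the quotient is the
   high half of the product shifted right by 3 more bits. */
static uint64_t udiv10(uint64_t n)
{
    return umulh64(n, 0xCCCCCCCCCCCCCCCDULL) >> 3;
}
```

With this shape, the cost is a fixed handful of multiplies and adds regardless
of the numerator, which is consistent with the variation in speedup coming
from __aeabi_uldivmod() rather than from the multiply path.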

There's a partial list of contemporary cores that have udiv here:
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/divide-and-conquer
It does look like things are headed towards more cores having udiv available.
