https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106484

            Bug ID: 106484
           Summary: Failure to optimize uint64_t/constant division on ARM32
           Product: gcc
           Version: 12.1.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsaxvc at gmail dot com
  Target Milestone: ---
            Target: arm

The following test function compiles into a call to __aeabi_uldivmod, even
though the divisor is a constant. Here's an example function:

    #include <stdint.h>

    uint64_t ns_to_s( uint64_t ns64 )
    {
        return ns64 / 1000000000ULL;
    }

Cortex-M4 (-O3 -Wall -Wextra -mcpu=cortex-m4) assembly:

    ns_to_s(unsigned long long):
            push    {r3, lr}
            adr     r3, .L4
            ldrd    r2, [r3]
            bl      __aeabi_uldivmod
            pop     {r3, pc}
    .L4:
            .word   1000000000
            .word   0

Interestingly, gcc 12.1 for aarch64 compiles the above C function by
implementing division by a constant with scaled multiplication by the inverse
using the umulh instruction (not present on 32-bit ARM).

aarch64 (-O3 -Wall -Wextra) assembly:

    ns_to_s(unsigned long):
            mov     x1, 23123
            lsr     x0, x0, 9
            movk    x1, 0xa09b, lsl 16
            movk    x1, 0xb82f, lsl 32
            movk    x1, 0x44, lsl 48
            umulh   x0, x0, x1
            lsr     x0, x0, 11
            ret

Instead, if something like __umulh could be added to libgcc, then GCC could
use the constants generated in the aarch64 logic to implement
uint64_t/constant division. Example umulh approach is attached.

On Cortex-M4, the umulh-based approach is significantly faster, although this
depends on the specific libc __aeabi_uldivmod linked against as well as the
numerator.
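
The attachment is not reproduced in this message. As a rough illustration only, and not the attached code, the idea might be sketched in portable C as below. The names umulh64 and ns_to_s_umulh are hypothetical and are not libgcc functions. The multiplier 0x0044B82FA09B5A53 and the shifts by 9 and 11 are read directly from the mov/movk/lsr sequence in the aarch64 output above. The high half of the 64x64 product is assembled from four 32x32->64 multiplies, which a 32-bit ARM compiler would be expected to map to umull-style instructions.

```c
#include <stdint.h>

/* Hypothetical helper (not part of libgcc): returns the upper 64 bits of
 * the 128-bit product a * b, using only 32x32->64 multiplies. */
static uint64_t umulh64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;
    uint64_t hi_lo = (uint64_t)a_hi * b_lo;
    uint64_t lo_hi = (uint64_t)a_lo * b_hi;
    uint64_t hi_hi = (uint64_t)a_hi * b_hi;

    /* Sum the three terms that land in bits 32..63 of the product.
     * Each term is below 2^32, so the sum cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + (uint32_t)lo_hi;

    /* Carry out of the middle column plus the upper partial products. */
    return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (cross >> 32);
}

/* Hypothetical equivalent of ns_to_s() without a runtime division.
 * Constants are taken from the aarch64 assembly in this report:
 *   lsr 9, multiply-high by 0x0044B82FA09B5A53, lsr 11. */
uint64_t ns_to_s_umulh(uint64_t ns64)
{
    return umulh64(ns64 >> 9, 0x0044B82FA09B5A53ULL) >> 11;
}
```

The pre-shift by 9 works because 1000000000 = 2^9 * 1953125, so the multiply only has to divide a 55-bit value by 1953125. Whether this beats __aeabi_uldivmod on a given target would still need measuring, as noted above.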