https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123763
            Bug ID: 123763
           Summary: Suboptimal code for some 64-bit loads on 32-bit ARM
                    Cortex M
           Product: gcc
           Version: 15.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: david at westcontrol dot com
  Target Milestone: ---
godbolt link: <https://godbolt.org/z/6jKfT4v4j>
I have been experimenting a little with 64-bit types on small 32-bit ARM Cortex-M
microcontrollers. uint64_t is 8-byte aligned in the ABI, which is unnecessary on
these chips and can lead to wasted space from padding in structs. (If you only
have a few KB of RAM, 4 bytes of padding is significant.)
It is no problem to create a type that works like uint64_t but has 4-byte
alignment:
typedef __attribute__((aligned(4))) uint64_t uint64_a4;
But accessing data of this type is sometimes less efficient than with a normal
uint64_t (or its underlying type, unsigned long long int). On these devices
there is no hardware difference between 4-byte and 8-byte alignment, so there
is no reason for a difference in the generated code.
uint64_t foo8(const uint64_t * p) {
    return *p;
}
uint64_t u8;
uint64_t bar8() { return u8; }
gives optimal Cortex-M0+ code:
foo8:
        ldmia   r0, {r0, r1}
        bx      lr
bar8:
        ldr     r3, .L6
        ldmia   r3!, {r0, r1}
        bx      lr
But using a 4-byte aligned type does not:
uint64_t foo4(const uint64_a4 * p) {
    return *p;
}
uint64_a4 u4;
uint64_t bar4() { return u4; }
foo4:
        movs    r3, r0
        ldmia   r3!, {r0, r1}
        bx      lr
bar4:
        ldr     r3, .L9
        ldr     r0, [r3, #8]
        ldr     r1, [r3, #12]
        bx      lr
On the Cortex-M4 (and other "bigger" Cortex-M devices), the compiler can use
the "ldrd" double-register load instruction for optimal code, even for the
4-byte aligned type. If I make a 2-byte aligned type for testing purposes, the
compiler must use two separate "ldr" loads on the Cortex-M4, since "ldrd"
requires 4-byte alignment. (On the Cortex-M0+, even "ldr" requires 4-byte
alignment, so the code there must use 16-bit loads.) But again, the 4-byte
loads are done with an unnecessary extra register:
typedef __attribute__((aligned(2))) uint64_t uint64_a2;
uint64_t foo2(const uint64_a2 * p) {
    return *p;
}
uint64_a2 u2;
uint64_t bar2() { return u2; }
On the Cortex-M4, this gives:
foo8:
        ldrd    r0, [r0]
        bx      lr
foo4:
        ldrd    r0, r1, [r0]
        bx      lr
foo2:
        mov     r3, r0
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r3, #4]    @ unaligned
        bx      lr
bar8:
        ldr     r3, .L6
        ldrd    r0, [r3]
        bx      lr
bar4:
        ldr     r3, .L9
        ldrd    r0, r1, [r3, #8]
        bx      lr
bar2:
        ldr     r3, .L12
        ldr     r0, [r3, #16]   @ unaligned
        ldr     r1, [r3, #20]   @ unaligned
        bx      lr
All the code is correct (that is always the most important thing), but such
inefficiencies add up. It is not hard to find other examples where an
additional pointer register is used unnecessarily when dealing with data
bigger than 32 bits, such as with structs - the "aligned" attributes are not
needed to show the problem, but they gave clear and simple examples here.
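A sketch of the struct case (my own example, not taken from the report; the
assembly in the comment illustrates the pattern rather than being verified
compiler output):

```c
#include <stdint.h>

/* A plain struct with no "aligned" attributes: the 64-bit member lands
   at offset 8.  On the Cortex-M0+ a read of it needs two word loads,
   and the same extra-register pattern can appear, e.g.:
       movs  r3, r0
       ldr   r0, [r3, #8]
       ldr   r1, [r3, #12]
   even though swapping the order of the two loads would make the
   register move unnecessary. */
struct sample {
    uint32_t id;
    uint32_t flags;
    uint64_t counter;
};

uint64_t get_counter(const struct sample *s) {
    return s->counter;
}
```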