https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928
            Bug ID: 105928
           Summary: [AArch64] 64-bit constants with same high/low halves can
                    use ADD lsl 32 (-Os at least)
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

cleverly compiles to (https://godbolt.org/z/b3oqao5Kz):

        mov     w1, 48879
        movk    w1, 0xdead, lsl 16
        stp     w1, w1, [x0]
        ret

But producing the value in a register currently takes 4 instructions
(MOV + 3x MOVK):

unsigned long constant() {
    return 0xdeadbeefdeadbeef;
}

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        movk    x0, 0xbeef, lsl 32
        movk    x0, 0xdead, lsl 48
        ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could use a
shifted ADD or ORR to broadcast a zero-extended 32-bit value to 64 bits,
saving one instruction:

        mov     x0, 48879
        movk    x0, 0xdead, lsl 16
        add     x0, x0, x0, lsl 32

Some CPUs may fuse sequences of MOVK, and shifted operands for ALU ops may
take extra time on some CPUs, so this might not actually be optimal for
performance, but it is smaller for -Os and -Oz.

We should also be using that trick for stores to _Atomic or volatile long*,
where we currently do MOV + 3x MOVK, then an STR, even with ARMv8.4-a which
guarantees STP atomicity. (See the example at the end of this report.)

---

ARMv8.4-a and later guarantee atomicity for ldp/stp within an aligned 16-byte
chunk, so we should use MOV/MOVK/STP there even for volatile or
__ATOMIC_RELAXED stores. But presumably that's a different part of GCC's
internals, so I'll report that separately.
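
For concreteness, here's the kind of source I mean for the _Atomic / volatile
case, plus a hand-written sketch of the output I'd hope for with the ADD
lsl 32 trick (the function names are just illustrative, and the asm is what
I'd want to see, not what GCC currently emits):

#include <stdatomic.h>

void store_atomic(_Atomic unsigned long *p) {
    atomic_store_explicit(p, 0xdeadbeefdeadbeef, memory_order_relaxed);
}

void store_volatile(volatile unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

Hoped-for output, 4 instructions instead of MOV + 3x MOVK + STR:

        mov     x1, 48879
        movk    x1, 0xdead, lsl 16
        add     x1, x1, x1, lsl 32   // broadcast low 32 bits to the high half
        str     x1, [x0]             // single aligned 8-byte store is atomic
        ret

With -march=armv8.4-a, the even shorter MOV / MOVK / STP w1,w1 sequence from
foo() above would also be legal here, since STP within an aligned 16-byte
chunk is guaranteed atomic on ARMv8.4-a and later.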