https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502
--- Comment #5 from Lasse Collin ---
If I understood correctly, PR 50417 is about wishing that GCC would infer that
a pointer given to memcpy has alignment higher than one. In my examples the
alignment of the uint8_t *b argument is one and thus byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, determined
from -mtune and -mno-strict-align).
My report is about the instruction sequence used for the byte-by-byte access.
Omitting the stack pointer manipulation and return instruction, this is
bytes16:
lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or a0,a5,a0
And copy16:
lbu a4,0(a0)
lbu a5,1(a0)
sb a4,14(sp)
sb a5,15(sp)
lhu a0,14(sp)
Is the latter as good code as the former? If yes, then this report might be
invalid and I apologize for the noise.
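For reference, the two functions compared above could be sketched in C as follows. This is a reconstruction from the assembly, not a quote from the original report: the names bytes16 and copy16 come from the listings, while the exact signatures and return types are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Assumed shape: build a little-endian 16-bit value from two byte loads.
   This corresponds to the lbu/lbu/slli/or sequence. */
uint32_t bytes16(const uint8_t *b)
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8);
}

/* Assumed shape: copy two bytes into an aligned temporary and read it back.
   The result is in host byte order, so it matches bytes16 only on a
   little-endian target. */
uint32_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}
```

On a little-endian target both functions return the same value, so the question is only which instruction sequence the compiler should prefer when b may be unaligned.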
PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message. Calling
memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with
-Os -mstrict-align the call doesn't save space:
bytes32:
ldrb w1, [x0]
ldrb w2, [x0, 1]
orr x2, x1, x2, lsl 8
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
orr x1, x2, x1, lsl 16
orr w0, w1, w0, lsl 24
ret
copy32:
stp x29, x30, [sp, -32]!
mov x1, x0
mov x2, 4
mov x29, sp
add x0, sp, 28
bl memcpy
ldr w0, [sp, 28]
ldp x29, x30, [sp], 32
ret
And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:
bytes32:
ldrb w4, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w3, [x0, 3]
orr x2, x4, x2, lsl 8
orr x0, x2, x1, lsl 16
orr w0, w0, w3, lsl 24
ret
copy32:
sub sp, sp, #16
ldrb w3, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
strb w3, [sp, 12]
strb w2, [sp, 13]
strb w1, [sp, 14]
strb w0, [sp, 15]
ldr w0, [sp, 12]
ldr w0, [sp, 12]
add sp, sp, 16
ret
ARM64 with -mstrict-align might be a contrived example in practice though.
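The 32-bit variants behind the ARM64 listings could look like the sketch below. As with the 16-bit case, this is reconstructed from the assembly: the names bytes32 and copy32 come from the listings, and the signatures are assumptions.

```c
#include <stdint.h>
#include <string.h>

/* Assumed shape: assemble a little-endian 32-bit value from four byte loads.
   This corresponds to the ldrb/orr sequence. */
uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
}

/* Assumed shape: memcpy into an aligned temporary. With -mstrict-align this
   becomes either a call to memcpy (-Os) or byte stores to the stack followed
   by a word load (-O2). The result is in host byte order. */
uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
}
```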