[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-21 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization

--- Comment #7 from Richard Biener  ---
You need to trace RTL expansion to where it decides to use a temporary stack
slot; there's likely some instruction pattern missing to compose the larger words.

[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread andrew at sifive dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #6 from Andrew Waterman  ---
Ack, I misunderstood your earlier message.  You're of course right that the
load/load/shift/or sequence is preferable to the load/load/store/store/load
sequence on just about any practical implementation.  That the memcpy version
is compiled less optimally does seem to be disjoint from the issue Andrew
mentioned.

[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #5 from Lasse Collin  ---
If I understood correctly, PR 50417 is about wishing that GCC would infer that
a pointer given to memcpy has alignment higher than one. In my examples the
alignment of the uint8_t *b argument is one and thus byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, determined
from -mtune and -mno-strict-align).

My report is about the instruction sequence used for the byte-by-byte access.
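For reference, the two variants compared below are presumably along these
lines; this is a sketch reconstructed from the assembly, not the original test
case (the function names match the listings that follow):

```c
#include <stdint.h>
#include <string.h>

/* Compose a 16-bit value from two byte loads plus shift/or
   (little-endian byte order, matching the RISC-V listing). */
uint16_t bytes16(const uint8_t *b)
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Same value via memcpy into a local; on strict-align targets
   GCC may expand this through a temporary stack slot. */
uint16_t copy16(const uint8_t *b)
{
    uint16_t v;
    memcpy(&v, b, sizeof v);
    return v;
}
```

On a little-endian target the two functions return the same value, so the
compiler is free to pick either instruction sequence.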

Omitting the stack pointer manipulation and return instruction, this is
bytes16:

lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or  a0,a5,a0

And copy16:

lbu a4,0(a0)
lbu a5,1(a0)
sb  a4,14(sp)
sb  a5,15(sp)
lhu a0,14(sp)

Is the latter as good as the former? If so, this report might be invalid and
I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message. Calling
memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with
-Os -mstrict-align the call doesn't save space:

bytes32:
ldrb w1, [x0]
ldrb w2, [x0, 1]
orr  x2, x1, x2, lsl 8
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
orr  x1, x2, x1, lsl 16
orr  w0, w1, w0, lsl 24
ret

copy32:
stp x29, x30, [sp, -32]!
mov x1, x0
mov x2, 4
mov x29, sp
add x0, sp, 28
bl  memcpy
ldr w0, [sp, 28]
ldp x29, x30, [sp], 32
ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
ldrb w4, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w3, [x0, 3]
orr  x2, x4, x2, lsl 8
orr  x0, x2, x1, lsl 16
orr  w0, w0, w3, lsl 24
ret

copy32:
sub sp, sp, #16
ldrb w3, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
strb w3, [sp, 12]
strb w2, [sp, 13]
strb w1, [sp, 14]
strb w0, [sp, 15]
ldr w0, [sp, 12]
add sp, sp, 16
ret

ARM64 with -mstrict-align might be a contrived example in practice though.
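The bytes32/copy32 sources behind the ARM64 listings are presumably the 4-byte
variants of the same two idioms; again a sketch reconstructed from the
assembly, not the original test case:

```c
#include <stdint.h>
#include <string.h>

/* Four byte loads composed with shifts and ors
   (little-endian byte order, matching the listings). */
uint32_t bytes32(const uint8_t *b)
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Same value via memcpy; on strict-align targets this may
   expand to four byte stores plus one word load from the stack,
   or to an actual call to memcpy at -Os. */
uint32_t copy32(const uint8_t *b)
{
    uint32_t v;
    memcpy(&v, b, sizeof v);
    return v;
}
```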

[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Andrew Pinski  changed:

   What|Removed |Added

 Depends on||50417

--- Comment #4 from Andrew Pinski  ---
See PR 50417.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50417
[Bug 50417] [11/12/13/14 regression]: memcpy with known alignment

[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Andrew Pinski  changed:

   What|Removed |Added

  Component|target  |middle-end

--- Comment #3 from Andrew Pinski  ---
This is a dup of that bug. Basically the memcpy is not changed into an
unaligned load ..