[Bug tree-optimization/111502] New: Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

Bug ID: 111502
   Summary: Suboptimal unaligned 2/4-byte memcpy on strict-align
targets
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lasse.collin at tukaani dot org
  Target Milestone: ---

I was playing with RISC-V GCC 12.2.0 from Arch Linux. I noticed
inefficient-looking assembly output in code that uses memcpy to access 32-bit
unaligned integers. I tried Godbolt with 16/32-bit integers, and it seems that
the same weirdness happens on RV32 & RV64 with GCC 13.2.0 and trunk, and also on
a few other targets. (Clang's output looks OK.)

For a little endian target:

#include <stdint.h>
#include <string.h>

uint32_t bytes16(const uint8_t *b)
{
return (uint32_t)b[0]
| ((uint32_t)b[1] << 8);
}

uint32_t copy16(const uint8_t *b)
{
uint16_t v;
memcpy(&v, b, sizeof(v));
return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2 -mtune=size

bytes16:
lhu a0,0(a0)
ret

copy16:
lhu a0,0(a0)
ret

That looks good because -mno-strict-align is the default.

After omitting -mtune=size, unaligned access isn't used (the output is the same
as with -mstrict-align):

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes16:
lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or  a0,a5,a0
ret

copy16:
lbu a4,0(a0)
lbu a5,1(a0)
addi sp,sp,-16
sb  a4,14(sp)
sb  a5,15(sp)
lhu a0,14(sp)
addi sp,sp,16
jr  ra

bytes16 looks good but copy16 is weird: the bytes are copied to an aligned
location on the stack and then loaded back.

On Godbolt it happens with GCC 13.2.0 on RV32, RV64, ARM64 (but only if using
-mstrict-align), MIPS64EL, and SPARC & SPARC64 (the comparison needs a big-endian
bytes16). For ARM64 and MIPS64EL the oldest GCC on Godbolt is GCC 5.4 and the
same thing happens with that too.

32-bit reads with -O2 behave similarly. With -Os a call to memcpy is emitted
for copy32 but not for bytes32.

#include <stdint.h>
#include <string.h>

uint32_t bytes32(const uint8_t *b)
{
return (uint32_t)b[0]
| ((uint32_t)b[1] << 8)
| ((uint32_t)b[2] << 16)
| ((uint32_t)b[3] << 24);
}

uint32_t copy32(const uint8_t *b)
{
uint32_t v;
memcpy(&v, b, sizeof(v));
return v;
}

riscv64-linux-gnu-gcc -march=rv64gc -O2

bytes32:
lbu a4,1(a0)
lbu a3,0(a0)
lbu a5,2(a0)
lbu a0,3(a0)
slli a4,a4,8
or  a4,a4,a3
slli a5,a5,16
or  a5,a5,a4
slli a0,a0,24
or  a0,a0,a5
sext.w  a0,a0
ret

copy32:
lbu a2,0(a0)
lbu a3,1(a0)
lbu a4,2(a0)
lbu a5,3(a0)
addi sp,sp,-16
sb  a2,12(sp)
sb  a3,13(sp)
sb  a4,14(sp)
sb  a5,15(sp)
lw  a0,12(sp)
addi sp,sp,16
jr  ra

riscv64-linux-gnu-gcc -march=rv64gc -Os

bytes32:
lbu a4,1(a0)
lbu a5,0(a0)
slli a4,a4,8
or  a4,a4,a5
lbu a5,2(a0)
lbu a0,3(a0)
slli a5,a5,16
or  a5,a5,a4
slli a0,a0,24
or  a0,a0,a5
sext.w  a0,a0
ret

copy32:
addi sp,sp,-32
mv  a1,a0
li  a2,4
addi a0,sp,12
sd  ra,24(sp)
call memcpy@plt
ld  ra,24(sp)
lw  a0,12(sp)
addi sp,sp,32
jr  ra

I probably cannot test any proposed fixes but I hope this report is still
useful. Thanks!

[Bug target/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #2 from Lasse Collin <lasse.collin at tukaani dot org> ---
Byte access by default is good when the compiler doesn't know if unaligned
access is fast on the target processor. There is no disagreement here.

What I suspect is a bug is the instruction sequence used for byte access in
the copy16 and copy32 cases. copy16 uses 2 * lbu + 2 * sb + 1 * lhu, that is,
five memory operations to load an unaligned 16-bit integer. copy32 uses
4 * lbu + 4 * sb + 1 * lw, that is, nine memory operations to load a 32-bit
integer.

bytes16 needs two memory operations and bytes32 needs four. Clang generates
this kind of code from both bytesxx and copyxx.

[Bug middle-end/111502] Suboptimal unaligned 2/4-byte memcpy on strict-align targets

2023-09-20 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111502

--- Comment #5 from Lasse Collin <lasse.collin at tukaani dot org> ---
If I understood correctly, PR 50417 is about wishing that GCC would infer that
a pointer given to memcpy has alignment higher than one. In my examples the
alignment of the uint8_t *b argument is one and thus byte-by-byte access is
needed (if the target processor doesn't have fast unaligned access, determined
from -mtune and -mno-strict-align).
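
For contrast, here is a minimal hypothetical sketch of the PR 50417 kind of
case (not taken from that report): the source object is known to be 4-byte
aligned, so inferring that alignment would allow a single word load even on a
strict-align target, while in copy16/copy32 above there is no alignment to
infer.

#include <stdint.h>
#include <string.h>

uint32_t copy_aligned32(const uint32_t *p)
{
    uint32_t v;
    /* *p has alignment 4, so a plain lw (RISC-V) / ldr (AArch64) would be
       valid here even with -mstrict-align. */
    memcpy(&v, p, sizeof(v));
    return v;
}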

My report is about the instruction sequence used for the byte-by-byte access.

Omitting the stack pointer manipulation and return instruction, this is
bytes16:

lbu a5,1(a0)
lbu a0,0(a0)
slli a5,a5,8
or  a0,a5,a0

And copy16:

lbu a4,0(a0)
lbu a5,1(a0)
sb  a4,14(sp)
sb  a5,15(sp)
lhu a0,14(sp)

Is the latter as good as the former? If so, then this report might be invalid
and I apologize for the noise.

PR 50417 includes a case where a memcpy(a, b, 4) generates an actual call to
memcpy, so that is the same detail as the -Os case in my first message. Calling
memcpy instead of expanding it inline saves six bytes in RV64C. On ARM64 with
-Os -mstrict-align the call doesn't save space:

bytes32:
ldrb w1, [x0]
ldrb w2, [x0, 1]
orr x2, x1, x2, lsl 8
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
orr x1, x2, x1, lsl 16
orr w0, w1, w0, lsl 24
ret

copy32:
stp x29, x30, [sp, -32]!
mov x1, x0
mov x2, 4
mov x29, sp
add x0, sp, 28
bl  memcpy
ldr w0, [sp, 28]
ldp x29, x30, [sp], 32
ret

And on ARM64 with -O2 -mstrict-align, shuffling via the stack is longer too:

bytes32:
ldrb w4, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w3, [x0, 3]
orr x2, x4, x2, lsl 8
orr x0, x2, x1, lsl 16
orr w0, w0, w3, lsl 24
ret

copy32:
sub sp, sp, #16
ldrb w3, [x0]
ldrb w2, [x0, 1]
ldrb w1, [x0, 2]
ldrb w0, [x0, 3]
strb w3, [sp, 12]
strb w2, [sp, 13]
strb w1, [sp, 14]
strb w0, [sp, 15]
ldr w0, [sp, 12]
add sp, sp, 16
ret

ARM64 with -mstrict-align might be a contrived example in practice though.

[Bug target/111555] New: [AArch64] __ARM_FEATURE_UNALIGNED should be undefined with -mstrict-align

2023-09-23 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111555

Bug ID: 111555
   Summary: [AArch64] __ARM_FEATURE_UNALIGNED should be undefined
with -mstrict-align
   Product: gcc
   Version: 13.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lasse.collin at tukaani dot org
  Target Milestone: ---

On 32-bit ARM, the macro __ARM_FEATURE_UNALIGNED is defined when using
-munaligned-access and not defined when using -mno-unaligned-access. On AArch64
the macro is always defined with both -mno-strict-align and -mstrict-align. I
think the macro shouldn't be defined with -mstrict-align on AArch64.

For comparison, with Clang on AArch64 the definition of __ARM_FEATURE_UNALIGNED
is omitted with -mstrict-align and -mno-unaligned-access.
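
To illustrate why this matters, here is a hypothetical sketch of how portable
code tends to use the macro (not taken from any particular project, and
assuming a little-endian target): if __ARM_FEATURE_UNALIGNED stays defined
under -mstrict-align, such code keeps taking the memcpy path, which the
compiler then has to expand with byte accesses anyway.

#include <stdint.h>
#include <string.h>

static inline uint32_t read32le(const uint8_t *b)
{
#ifdef __ARM_FEATURE_UNALIGNED
    /* Path intended for targets with fast unaligned access. */
    uint32_t v;
    memcpy(&v, b, sizeof(v));
    return v;
#else
    /* Path intended for strict-alignment targets. */
    return (uint32_t)b[0]
        | ((uint32_t)b[1] << 8)
        | ((uint32_t)b[2] << 16)
        | ((uint32_t)b[3] << 24);
#endif
}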

[Bug target/111557] New: [RISC-V] The macro __riscv_unaligned_fast should be __riscv_misaligned_fast

2023-09-23 Thread lasse.collin at tukaani dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111557

Bug ID: 111557
   Summary: [RISC-V] The macro __riscv_unaligned_fast should be
__riscv_misaligned_fast
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: lasse.collin at tukaani dot org
  Target Milestone: ---

The RISC-V C API Specification[1] has __riscv_misaligned_fast,
__riscv_misaligned_slow, and __riscv_misaligned_avoid. The commit 6e23440b [2]
used "unaligned" instead of "misaligned" though. The spelling
__riscv_unaligned_* was mentioned in [3] but in [4] it was changed to
__riscv_misaligned_*.

Clang doesn't have these macros yet but there is a recent pull request[5] that
uses the __riscv_misaligned_* spelling.
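
A small hypothetical sketch of why the spelling matters: code written against
the C API specification tests the "misaligned" names, so with GCC's current
spelling none of the branches match and such code falls back to its
conservative default.

/* Hypothetical feature test following the RISC-V C API spec. */
#if defined(__riscv_misaligned_fast)
#  define MISALIGNED_OK 1
#elif defined(__riscv_misaligned_slow) || defined(__riscv_misaligned_avoid)
#  define MISALIGNED_OK 0
#else
#  define MISALIGNED_OK 0  /* unknown compiler: assume the safe default */
#endif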

[1] https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md
[2]
https://gcc.gnu.org/git/?p=gcc.git;a=commitdiff;h=6e23440b5df4011bbe1dbee74d47641125dd7d16
[3] https://github.com/riscv-non-isa/riscv-c-api-doc/issues/32
[4] https://github.com/riscv-non-isa/riscv-c-api-doc/pull/40
[5] https://github.com/llvm/llvm-project/pull/65756