https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366
Bug ID: 67366
Summary: Poor assembly generation for unaligned memory accesses on ARM v6 & v7 cpus
Product: gcc
Version: 4.8.2
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yann.collet.73 at gmail dot com
Target Milestone: ---

Accessing unaligned memory positions used to be forbidden on ARM cpus. But since ARMv6 (quite many years by now), this operation is supported.

However, GCC 4.5 - 4.6 - 4.7 - 4.8 seem to generate sub-optimal code on these targets.

In theory, it's illegal to issue a direct statement such as :

    u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned.

There are 2 workarounds that I know of. The first is to use the `packed` attribute, which is not portable (compiler specific).

The second and better one is to use memcpy() :

    u32 read32(const void* ptr) { u32 v; memcpy(&v, ptr, sizeof(v)); return v; }

This version is portable and safe. It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64, where it is reduced to an optimal assembly sequence (a single instruction).

Unfortunately, GCC 4.5 - 4.6 - 4.7 - 4.8 generate suboptimal assembly for this function on ARMv6 or ARMv7 :

    read32(void const*):
        ldr r0, [r0] @ unaligned
        sub sp, sp, #8
        str r0, [sp, #4] @ unaligned
        ldr r0, [sp, #4]
        add sp, sp, #8
        bx lr

This is in stark contrast with clang, which generates much more efficient assembly :

    read32(void const*): @ @read32(void const*)
        ldr r0, [r0]
        bx lr

(assembly can be generated and displayed using a simple tool : https://goo.gl/7FWDB8)

It's not that gcc is unaware of the cpu's unaligned memory access capability, since it does use it : `ldr r0, [r0]`. But it then loses a lot of time on useless operations on a discardable temporary variable, storing the data onto the stack just to read it back again.

Inlining does not save the day. -O3 helps reduce the impact, but it's still large.
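For anyone wanting to reproduce this, a minimal self-contained test case might look like the sketch below. It puts the two workarounds described above side by side. The `u32` typedef, the `unalign32` struct, and the `read32_packed` name are only illustrative assumptions, not taken from any existing code base.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;   /* assumed definition of the u32 type used above */

/* Workaround 2: portable version, copy the bytes into a local variable. */
u32 read32(const void* ptr)
{
    u32 v;
    memcpy(&v, ptr, sizeof(v));
    return v;
}

/* Workaround 1: compiler-specific version using the packed attribute
 * (GCC / clang syntax). The struct name is hypothetical. */
typedef struct { u32 v; } __attribute__((packed)) unalign32;

u32 read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;
}
```

Compiling this with something like `-O2 -march=armv7-a -S` on an ARM cross toolchain (the exact invocation depends on the toolchain, so treat it as an assumption) should make it possible to compare the assembly emitted for both functions.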
In a recent exercise comparing efficient vs inefficient memory access on ARMv6 and ARMv7, the measured difference was very large : up to 6x faster at -O2 settings.
See : http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html

The difference is definitely too large to be ignored.

As a consequence, to preserve performance, source code must try a bunch of possibilities depending on target and compiler, if not compiler version. In some circumstances (gcc with ARMv6, or gcc <= 4.5), it's even necessary to write illegal code (see the 1st version above) to reach optimal performance on the target.

This looks like a waste of energy, and a recipe for bugs, especially compared to clang, which generates clean code in all circumstances for all targets.

Considering the huge performance difference such an improvement could make, is that something the gcc team would like to look into ?

Regards