https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67366

            Bug ID: 67366
           Summary: Poor assembly generation for unaligned memory accesses
                    on ARM v6 & v7 cpus
           Product: gcc
           Version: 4.8.2
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yann.collet.73 at gmail dot com
  Target Milestone: ---

Accessing unaligned memory positions used to be forbidden on ARM cpus. But
since ARMv6 (many years ago by now), this operation is supported.

However, GCC 4.5 - 4.6 - 4.7 - 4.8 seem to generate sub-optimal code on these
targets.

In theory, it's illegal to issue a direct statement such as :

u32 read32(const void* ptr) { return *(const u32*)ptr; }

if ptr is not properly aligned.

There are 2 workarounds that I know of.
The first is to use the `packed` attribute, which is not portable (compiler
specific).
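
For illustration, here is a minimal sketch of that first workaround, assuming
the GCC/Clang `__attribute__((packed))` syntax. The names `unalign32` and
`read32_packed` are made up for this example, and other compilers need a
different spelling (or do not support it at all) :

```c
#include <stdint.h>

typedef uint32_t u32;

/* GCC/Clang extension: a packed struct has an alignment of 1, so the
   compiler may not assume that the field is 4-byte aligned. */
typedef struct { u32 v; } __attribute__((packed)) unalign32;

/* Hypothetical helper name: reads 32 bits from a possibly unaligned address. */
static u32 read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;
}
```

Calling `read32_packed(buf + 1)` on a byte buffer returns the 4 bytes starting
at offset 1, in the byte order of the target.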

The second and better one is to use memcpy() :

u32 read32(const void* ptr) { u32 v; memcpy(&v, ptr, sizeof(v)); return v; }

This version is portable and safe.
It also works very well on multiple platforms, such as x86/x64, PPC, or ARM64,
being reduced to an optimal assembly sequence (single instruction).

Unfortunately, GCC 4.5 - 4.6 - 4.7 - 4.8 generate suboptimal assembly for this
function on ARMv6 or ARMv7 :

read32(void const*):
        ldr     r0, [r0]        @ unaligned
        sub     sp, sp, #8
        str     r0, [sp, #4]    @ unaligned
        ldr     r0, [sp, #4]
        add     sp, sp, #8
        bx      lr

This is in stark contrast with clang, which generates much more efficient
assembly :

read32(void const*):                           @ @read32(void const*)
        ldr     r0, [r0]
        bx      lr

(assembly can be generated and displayed using a simple tool :
https://goo.gl/7FWDB8)

It's not that gcc is unaware of the cpu's unaligned memory access capability,
since it does use it : `ldr r0, [r0]`
but it then loses a lot of time on useless operations on a discardable
temporary variable, storing data onto the stack just to read it again.


Inlining does not save the day. -O3 helps reduce the impact, but it's still
large.

In a recent exercise comparing efficient vs inefficient memory access on ARMv6
and ARMv7,
the measured difference was very large : up to 6x faster at -O2 settings.
See :
http://fastcompression.blogspot.com/2015/08/accessing-unaligned-memory.html

The difference is definitely too large to be ignored.
As a consequence, to preserve performance, source code must try a bunch of
possibilities depending on target and compiler, if not version.
In some circumstances (gcc with ARMv6, or gcc <= 4.5), it's even necessary to
write illegal code (see 1st version above) to reach optimal performance on
targets.
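
To give an idea of what this selection looks like in practice, here is a rough
sketch. `READ32_METHOD` is a hypothetical macro invented for this example, not
an existing GCC or library switch, and the right value per target and compiler
version still has to be found by measurement :

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;

/* READ32_METHOD is a hypothetical knob for this sketch:
   0 = memcpy (portable and safe, the default)
   1 = packed struct (GCC/Clang extension)
   2 = direct cast (illegal if ptr is not properly aligned) */
#ifndef READ32_METHOD
#  define READ32_METHOD 0
#endif

#if READ32_METHOD == 2
static u32 read32(const void* ptr) { return *(const u32*)ptr; }
#elif READ32_METHOD == 1 && (defined(__GNUC__) || defined(__clang__))
typedef struct { u32 v; } __attribute__((packed)) unalign32;
static u32 read32(const void* ptr) { return ((const unalign32*)ptr)->v; }
#else
static u32 read32(const void* ptr) { u32 v; memcpy(&v, ptr, sizeof(v)); return v; }
#endif
```

All three variants are meant to return the same value. Only the generated
assembly differs, which is exactly why having to pick one by hand is
unfortunate.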

This looks like a waste of energy, and a recipe for bugs, especially compared
to clang, which generates clean code in all circumstances for all targets.


Considering the huge performance difference such an improvement could make, is
that something the gcc team would like to look into ?


Regards
