https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440

            Bug ID: 88440
           Summary: size optimization of memcpy-like code
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: hoganmeier at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/RTji7B

void foo(char* restrict dst, const char* buf) {
    for (int i=0; i<8; ++i)
        *dst++ = *buf++;
}

$ gcc -Os
$ gcc -O2
.L2:
        mov     dl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], dl
        inc     rax
        cmp     rax, 8
        jne     .L2

$ gcc -O3
        mov     rax, QWORD PTR [rsi]
        mov     QWORD PTR [rdi], rax

$ arm-none-eabi-gcc -O3 -mthumb -mcpu=cortex-m4
        ldr     r3, [r1]  @ unaligned
        ldr     r2, [r1, #4]      @ unaligned
        str     r2, [r0, #4]      @ unaligned
        str     r3, [r0]  @ unaligned

The -O3 code is both smaller and faster on both ARM and x86-64. At -O3 the
optimization report shows the loop being replaced by a memcpy-like library call:
"note: Loop 1 distributed: split to 0 loops and 1 library calls."

This transformation should be considered for -O2 and -Os as well.
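
For illustration, here is a minimal sketch of what the "library call" in the
note amounts to at the source level: the byte loop behaves like a fixed-size
memcpy of 8 bytes, which a compiler can then expand into a couple of wide
loads and stores. The names foo_loop, foo_memcpy and copies_match are
hypothetical and are not part of the report. This only shows the two forms
copying the same bytes; it does not reproduce GCC's internal transformation.

```c
#include <assert.h>
#include <string.h>

/* The loop from the report, unchanged in behavior. */
void foo_loop(char *restrict dst, const char *buf) {
    for (int i = 0; i < 8; ++i)
        *dst++ = *buf++;
}

/* Hypothetical hand-written equivalent: a fixed-size 8-byte memcpy. */
void foo_memcpy(char *restrict dst, const char *buf) {
    memcpy(dst, buf, 8);
}

/* Returns 1 if both variants copy the same 8 bytes from src. */
int copies_match(void) {
    const char src[8] = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
    char out_loop[9] = {0};
    char out_memcpy[9] = {0};

    foo_loop(out_loop, src);
    foo_memcpy(out_memcpy, src);

    /* The ninth byte is a guard: neither variant may write past 8 bytes. */
    assert(out_loop[8] == 0 && out_memcpy[8] == 0);

    return memcmp(out_loop, src, 8) == 0 &&
           memcmp(out_memcpy, src, 8) == 0;
}
```

Comparing the -Os output of foo_loop and foo_memcpy on godbolt would be one
way to check whether the memcpy form already gets the short code sequence
without the loop transformation.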
