https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
Bug ID: 88440
Summary: size optimization of memcpy-like code
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: hoganmeier at gmail dot com
Target Milestone: ---

https://godbolt.org/z/RTji7B

void foo(char* restrict dst, const char* buf)
{
    for (int i=0; i<8; ++i)
        *dst++ = *buf++;
}

$ gcc -Os
$ gcc -O2
.L2:
        mov     dl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], dl
        inc     rax
        cmp     rax, 8
        jne     .L2

$ gcc -O3
        mov     rax, QWORD PTR [rsi]
        mov     QWORD PTR [rdi], rax

$ arm-none-eabi-gcc -O3 -mthumb -mcpu=cortex-m4
        ldr     r3, [r1]        @ unaligned
        ldr     r2, [r1, #4]    @ unaligned
        str     r2, [r0, #4]    @ unaligned
        str     r3, [r0]        @ unaligned

The -O3 code is both faster and smaller for both ARM and x64:
"note: Loop 1 distributed: split to 0 loops and 1 library calls."

Should be considered for -O2 and -Os as well.
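
For reference, the quoted note presumably comes from GCC's loop distribution pass recognizing the loop as a memcpy pattern (related to -ftree-loop-distribute-patterns). In source terms, the -O3 result corresponds roughly to the sketch below. foo_memcpy and copies_match are hypothetical names added for illustration and are not part of the original test case.

```c
#include <assert.h>
#include <string.h>

/* The loop from the report: copies 8 bytes one at a time. */
void foo(char *restrict dst, const char *buf)
{
    for (int i = 0; i < 8; ++i)
        *dst++ = *buf++;
}

/* Hypothetical hand-written equivalent: a fixed-size memcpy, which
   the compiler is free to expand inline into wide loads and stores. */
void foo_memcpy(char *restrict dst, const char *buf)
{
    memcpy(dst, buf, 8);
}

/* Hypothetical helper: returns 1 if both versions produce the same bytes. */
int copies_match(void)
{
    const char src[8] = { 'g', 'c', 'c', '-', '8', '.', '2', '!' };
    char a[8] = { 0 };
    char b[8] = { 0 };

    foo(a, src);
    foo_memcpy(b, src);

    /* The byte loop itself must have copied the source unchanged. */
    assert(memcmp(a, src, sizeof src) == 0);
    return memcmp(a, b, sizeof a) == 0;
}
```

Compiling both functions with -Os and comparing the assembly should show whether the explicit memcpy form already yields the shorter sequence at that level; the exact output depends on the GCC version and target.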