https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78963

Bug ID: 78963
Summary: Missed optimization opportunity in copies of small unaligned data
Product: gcc
Version: 6.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: eyalroz at technion dot ac.il
Target Milestone: ---

Preliminary notes:

* This bug report stems from a StackOverflow question I asked: http://stackoverflow.com/q/41407257/1593077
* This bug concerns the x86_64 architecture, but may apply elsewhere.
* This bug concerns -O3 optimizations.
* Everything described here is about the same for GCC 6.3 and 7 - whatever version of it GodBolt uses.
* The entire bug is demonstrated here: https://godbolt.org/g/lDJSRm plus here: https://godbolt.org/g/9Y2ebd

Consider the task of copying 3-byte values from one place to another. If both those places are in memory, it seems reasonable to do four moves, and indeed GCC compiles this:

    #include <string.h>

    typedef struct { unsigned char data[3]; } uint24_t;

    void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src)
    {
        memcpy(dest,src,3);
    }

into this (omitting the return instruction):

    f(uint24_t*, uint24_t*):
            movzx   eax, WORD PTR [rsi]
            mov     WORD PTR [rdi], ax
            movzx   eax, BYTE PTR [rsi+2]
            mov     BYTE PTR [rdi+2], al

If the source or the destination is a register, two mov's should suffice - either the first two or the second two of the above.
However, if I write this (perhaps contrived, but likely demonstrative of what could happen in larger programs, especially with multiple translation units, or when the OS gives you a pointer to work with, etc.):

    #include <string.h>

    typedef struct { unsigned char data[3]; } uint24_t;

    void f(uint24_t* __restrict__ dest, uint24_t* __restrict__ src)
    {
        memcpy(dest,src,3);
    }

    int main()
    {
        uint24_t* p = (uint24_t*) 48;
        unsigned x;
        f((uint24_t*) &x,p);
        x += 1;
        f(p,(uint24_t*) &x);
        return 0;
    }

the 3-byte value is "constructed" on the stack rather than in a register (first four mov's), and then one cannot avoid using four more mov's to copy it to the destination:

            movzx   eax, WORD PTR ds:48
            mov     WORD PTR [rsp-4], ax
            movzx   eax, BYTE PTR ds:50
            mov     BYTE PTR [rsp-2], al
            add     DWORD PTR [rsp-4], 1
            movzx   eax, WORD PTR [rsp-4]
            mov     WORD PTR ds:48, ax
            movzx   eax, BYTE PTR [rsp-2]
            mov     BYTE PTR ds:50, al

If we do this with 4-byte values, i.e. replace uint24_t with uint32_t, it's a single mov both ways, and in fact it gets further optimized, so that this:

    #include <string.h>
    #include <stdint.h>

    void f(uint32_t* __restrict__ dest, uint32_t* __restrict__ src)
    {
        memcpy(dest,src,4);
    }

    int main()
    {
        uint32_t* p = (uint32_t*) 48;
        uint32_t x;
        f(&x,p);
        x += 1;
        f(p,&x);
        return 0;
    }

is compiled into just this:

            add     DWORD PTR ds:48, 1

Now obviously you can't expect to optimize out _that_ much with a 3-byte value, but 2 mov's in and 2 mov's out should be enough. Indeed, clang (since at least 3.4.1 or so) emits this for the uint24_t code:

            movzx   eax, byte ptr [50]
            shl     eax, 16
            movzx   ecx, word ptr [48]
            lea     eax, [rcx + rax + 1]
            mov     word ptr [48], ax
            shr     eax, 16
            mov     byte ptr [50], al

which has just four mov's.
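
For illustration, the register-based sequence that clang emits can be written out by hand in C. This is only a sketch of the desired code shape, not a proposed fix: the helper names load24, store24 and inc24 are hypothetical (they do not appear in the test case above), a little-endian target such as x86_64 is assumed, and I have not verified which GCC versions turn this form into the shorter sequence.

```c
#include <stdint.h>
#include <string.h>

/* Same 3-byte type as in the test case above. */
typedef struct { unsigned char data[3]; } uint24_t;

/* Hypothetical helper: bring a 3-byte value into a register-sized integer
 * using one 2-byte load and one 1-byte load, combined with a shift.
 * Assumes little-endian byte order (as on x86_64). */
static inline uint32_t load24(const uint24_t *src)
{
    uint16_t lo;
    memcpy(&lo, src->data, sizeof lo);              /* 2-byte load */
    return (uint32_t)lo | ((uint32_t)src->data[2] << 16); /* 1-byte load */
}

/* Hypothetical helper: write the low 24 bits of v back using one 2-byte
 * store and one 1-byte store. Bits 24..31 of v are discarded. */
static inline void store24(uint24_t *dest, uint32_t v)
{
    uint16_t lo = (uint16_t)v;
    memcpy(dest->data, &lo, sizeof lo);             /* 2-byte store */
    dest->data[2] = (unsigned char)(v >> 16);       /* 1-byte store */
}

/* Hypothetical helper: the "copy in, add 1, copy out" pattern from main()
 * above, with the intermediate value never touching the stack explicitly. */
static inline void inc24(uint24_t *p)
{
    store24(p, load24(p) + 1);
}
```

With these helpers, the body of main() in the test case corresponds to a single call such as inc24(p). The low three bytes written back are the same as in the original, since the addition only carries upward and the fourth byte is dropped on the way out.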