https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108255
            Bug ID: 108255
           Summary: Repeated address-of (lea) not optimized for size.
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/q5sx9e49j

void f(int *);

int g(int of) {
    int x = 13;
    f(&x);
    f(&x);
    f(&x);
    f(&x);
    f(&x);
    f(&x);
    f(&x);
    f(&x);
    return 0;
}

Got:

g(int):
        sub     rsp, 24
        lea     rdi, [rsp+12]
        mov     DWORD PTR [rsp+12], 13
        call    f(int*)
        lea     rdi, [rsp+12]    # compute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]    # recompute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]    # recompute, 5 bytes
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        lea     rdi, [rsp+12]
        call    f(int*)
        xor     eax, eax
        add     rsp, 24
        ret

But note that each lea is 5 bytes.

Expected (generated by clang 3.0 - 15.0):

g(int):                                 # @g(int)
        push    rbx                     # extra, but just 1 byte
        sub     rsp, 16
        mov     dword ptr [rsp + 12], 13
        # CSE temp
        lea     rbx, [rsp + 12]
        mov     rdi, rbx                # use
        call    f(int*)@PLT
        mov     rdi, rbx                # reuse, 3 bytes
        call    f(int*)@PLT
        mov     rdi, rbx                # reuse, 3 bytes
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        mov     rdi, rbx
        call    f(int*)@PLT
        xor     eax, eax
        add     rsp, 16
        pop     rbx                     # extra, but just 1 byte
        ret

Technically this is more instructions, but mov rdi, rbx is 3 bytes, which
is shorter than the 5 bytes of lea. This comes at the minor expense of
needing to save and restore rbx.

PS. The same happens when using a temporary `int *const y = &x;`.

The same also happens when optimizing for size (`-Os`).

It looks like gcc 4.8.5 produced the expected code, but gcc 4.9.0 does not.

It is possible that the code produced by gcc 4.9.0 is faster, but it is
also likely that it contributes quite a bit to binary size.

clang applies CSE even when there are just two uses of `&x` in the above
example.
It is likely that a slightly higher threshold (3 or 4 uses) is actually
optimal; it can be calculated from the encoding sizes.

Weirdly though, gcc -m32 does this:

g():
        push    ebp
        mov     ebp, esp
        push    ebx
        lea     ebx, [ebp-12]
        sub     esp, 32
        mov     DWORD PTR [ebp-12], 13
        push    ebx
        call    f(int*)
        mov     DWORD PTR [esp], ebx
        call    f(int*)
        mov     DWORD PTR [esp], ebx
        call    f(int*)
        mov     ebx, DWORD PTR [ebp-4]
        xor     eax, eax
        leave
        ret

Here it does compute the address once and store it in a temporary, but it
does so on the stack instead of in a register (my guess is that there is no
free register to hold it, so it is spilled). In fact, lea would likely be
faster here: mov DWORD PTR [esp], ebx requires a memory/cache access,
whereas lea is 5 bytes but does not require a memory access.