https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
            Bug ID: 82339
           Summary: Inefficient movabs instruction
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

At least on i7-5960X, in the following testcase:

__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

__attribute__((noinline, noclone)) unsigned long long int
baz (int x)
{
  unsigned long long int y = 1;
  asm volatile ("" : "+r" (y) : : "memory");
  return (y << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (foo (13)));
  else if (argc == 2)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (bar (13)));
  else if (argc == 3)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (baz (13)));
  return 0;
}

baz is the fastest as well as the shortest.  So I think we should consider
emitting movl $cst, %edx; shlq $shift, %rdx instead of
movabsq $(cst << shift), %rdx (see the sketch at the end of this comment).

Unfortunately I can't find MOVABS in Agner Fog's instruction tables, and for
MOV r64,i64 there is too little information, so it is unclear on which CPUs
the replacement is beneficial.

For -Os, if the destination is one of the %rax-%rsp registers, the
two-instruction sequence is one byte shorter (5 + 4 = 9 bytes vs. 10); for
%r8-%r15 the extra REX prefix on the 32-bit move makes it the same size
(6 + 4 = 10 bytes vs. 10; byte-level breakdown below).  For speed
optimization, the obvious disadvantage is that the shift clobbers the flags
register.

Peter, do you have any information on what the latency/throughput of
MOV r64,i64 is on various CPUs vs. MOV r32,i32; SHL r64,i8?
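For concreteness, a sketch of the two candidate sequences for the constant
1ULL << 63 from bar above (the destination register %rdx is just an
illustrative choice):

        # current: a single 10-byte instruction, no flags clobber
        movabsq $0x8000000000000000, %rdx

        # proposed: 9 bytes total, but the shift writes the flags
        movl    $1, %edx        # writes the low 32 bits, zero-extends %rdx
        shlq    $63, %rdx       # shift the set bit into place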
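And the byte-level encodings behind the -Os size claim, using %rdx and %r8
as representative low/high destination registers:

  movl $1, %edx        ba 01 00 00 00              5 bytes
  shlq $63, %rdx       48 c1 e2 3f                 4 bytes   (9 total)
  movabsq $imm, %rdx   48 ba <8-byte immediate>   10 bytes

  movl $1, %r8d        41 b8 01 00 00 00           6 bytes
  shlq $63, %r8        49 c1 e0 3f                 4 bytes   (10 total)
  movabsq $imm, %r8    49 b8 <8-byte immediate>   10 bytes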