https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339

            Bug ID: 82339
           Summary: Inefficient movabs instruction
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

At least on an i7-5960X, in the following testcase:
/* 1 shifted left by a variable count.  */
__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

/* The constant 1ULL << 63 shifted right by a variable count.  */
__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

/* Like bar, but the "+r" asm forces the 1 into a register first, so no
   64-bit immediate is needed.  */
__attribute__((noinline, noclone)) unsigned long long int
baz (int x)
{
  unsigned long long int y = 1;
  asm volatile ("" : "+r" (y) : : "memory");
  return (y << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (foo (13)));
  else if (argc == 2)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (bar (13)));
  else if (argc == 3)
    for (i = 0; i < 1000000000; i++)
      asm volatile ("" : : "r" (baz (13)));
  return 0;
}

baz is both the fastest and the shortest.
So I think we should consider emitting movl $cst, %edx; shlq $shift, %rdx
instead of movabsq $(cst << shift), %rdx.
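
E.g. (a hand-written sketch only, using 1ULL << 63 as the constant; the
%k0/%q0 operand modifiers name the 32-bit and 64-bit forms of whatever
register the compiler picks):

static inline unsigned long long int
via_movabs (void)
{
  unsigned long long int r;
  /* 10-byte encoding: REX.W + B8+rd + imm64.  */
  asm ("movabsq $0x8000000000000000, %0" : "=r" (r));
  return r;
}

static inline unsigned long long int
via_mov_shl (void)
{
  unsigned long long int r;
  /* 5-byte movl plus 4-byte shlq on %rax..%rsp; note the shift clobbers
     the flags.  */
  asm ("movl $1, %k0\n\tshlq $63, %q0" : "=r" (r) : : "cc");
  return r;
}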

Unfortunately Agner Fog's tables have no entry for MOVABS, and for MOV
r64,i64 there is too little information, so it is unclear on which CPUs
this would be beneficial.
For -Os, if the destination is one of %rax through %rsp, the mov+shl
sequence is one byte shorter (5+4 vs. 10 bytes); for %r8 through %r15 it is
the same size (the extra REX prefix makes it 6+4 vs. 10).
For speed optimization, the obvious disadvantage is that the shift clobbers
the flags register.
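
E.g. (a contrived hand-written example, not current GCC output): if the
constant has to be materialized while flags from an earlier compare are
still live, only the movabsq form can be placed there without resequencing:

unsigned long long int
pick (unsigned long long int a, int sel)
{
  unsigned long long int r;
  asm ("cmpl $0, %k2\n\t"                    /* sets the flags           */
       "movabsq $0x8000000000000000, %0\n\t" /* does not touch the flags */
       "cmovneq %1, %0"                      /* still sees the cmp result */
       : "=&r" (r) : "r" (a), "r" (sel) : "cc");
  return r;
}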

Peter, do you have any information on the latency/throughput of MOV r64,i64
vs. MOV r32,i32; SHL r64,i8 on various CPUs?
