8 regression] 20% slowdown of linux kernel AES cipher

rguenth at gcc dot gnu.org Fri, 19 Jan 2018 02:52:49 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651


--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
So somehow without code hoisting we don't find a single PRE opportunity -
that's odd.  Ah, so it goes

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  if (cond1)
    {
      x = op1 << 8;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

when looking at simple PRE then op << 8 is not detected as partially redundant
because while GVN PRE (PHI translation VN) ends up value-numbering op << 8 on
the !cond1 path the same as op1 << 8:

[changed] ANTIC_IN[6] := { op1_5(D) (0006), {lshift_expr,op1_5(D),8} (0002) }
[changed] ANTIC_IN[3] := { op1_5(D) (0006), cond2_8(D) (0009),
{lshift_expr,op1_5(D),8} (0002) }

it doesn't consider op << 8 available on any path -- there isn't really
any redundant computation on any path in the above code.

Now comes code hoisting, seeing op1 << 8 computed twice and hoists it
before if (cond1).

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  int tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

Note how it didn't end up removing the redundancy on the !cond1 path
on its own!  It relies on PRE to clean up after itself here.

After this the PRE algorithm now finds its
"available on one path" -- namely on the cond1 path where op << 8 is
now computed twice -- and inserts op2 << 8 and op3 << 8 in the other
predecessor (note we have a PHI with three args here, if we'd split
that this might also change code generation for the better).  And we end up
with

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        tem = op2 << 8;
      else
        tem = op3 << 8;
    }
  return tem;
}

Note with the proposed patch the handling of the two FAILing testcases
becomes less efficient -- doing hoisting first exposes full redundancies
and thus avoids useless PRE insertions.  For both FAILing testcases
code generation in the end doesn't change.

Now we have to wrap our brains around the above testcase and transform
and decide if the order of events are good and expected or not.

[Bug tree-optimization/83651] [7/8 regression] 20% slowdown of linux kernel AES cipher

Reply via email to