[Bug tree-optimization/109154] [13 regression] jump threading de-optimizes nested floating point comparisons

jakub at gcc dot gnu.org via Gcc-bugs Thu, 13 Apr 2023 09:54:24 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154


--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, would
void
foo (float *f, float d, float e)
{
  if (e >= 2.0f && e <= 4.0f)
    ;
  else
    __builtin_unreachable ();
  for (int i = 0; i < 1024; i++)
    {
      float a = f[i];
      f[i] = (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
}
be a better reduction of what's going on?
From the frange/threading POV, when e is in the [2.0f, 4.0f] range, if a < 0.0f, we
know that a < e is also true, so there is no point in testing that at runtime.
So I think what threadfull1 does is right and desirable if the final code
actually performs those comparisons and uses conditional jumps.
The only thing is that it is harmful for vectorization and maybe for predicated
code.
Therefore, for scalar code, at least without massive ARM-style conditional
execution, the above is better emitted as
  if (a < 0.0f)
    tmp = 1.0f;
  else
    {
      tmp = (1.0f - a * d) * (a < e ? 1.0f : 0.0f);
    }
or even
  if (a < 0.0f)
    tmp = 1.0f;
  else if (a < e)
    tmp = 1.0f - a * d;
  else
    tmp = 0.0f;
  f[i] = tmp;
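
As a sanity check on the claim above, here is a minimal sketch that compares the
original expression with the three-way form on sampled inputs. The helper names
(original_form, threaded_form, count_scalar_mismatches) and the sample grid are
made up for illustration and are not from the bug report. It assumes finite
inputs only (with a NaN the two forms differ), picks a and d so that a * d is
exactly representable, and compares with ==, so -0.0f and 0.0f count as equal.

```c
/* Original form from the reduction: two independent comparisons. */
static float original_form(float a, float d, float e)
{
    return (a < 0.0f ? 1.0f : 1.0f - a * d) * (a < e ? 1.0f : 0.0f);
}

/* Threaded form: with e >= 2.0f, a < 0.0f already implies a < e,
   so the second comparison is only needed on the else path. */
static float threaded_form(float a, float d, float e)
{
    float tmp;
    if (a < 0.0f)
        tmp = 1.0f;
    else if (a < e)
        tmp = 1.0f - a * d;
    else
        tmp = 0.0f;
    return tmp;
}

/* Hypothetical helper: count sampled inputs where the two forms disagree.
   Only finite values are sampled; NaN is deliberately excluded. */
int count_scalar_mismatches(void)
{
    int mismatches = 0;
    const float d = 0.5f;
    for (int ei = 0; ei <= 8; ei++) {
        float e = 2.0f + 0.25f * (float)ei;   /* e in [2.0f, 4.0f] */
        for (int ai = -32; ai <= 32; ai++) {
            float a = 0.25f * (float)ai;      /* a in [-8.0f, 8.0f] */
            if (original_form(a, d, e) != threaded_form(a, d, e))
                mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_scalar_mismatches () from a small main should report no
disagreement on this grid; it is a spot check, not a proof.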
Thus, could we effectively try to undo it at ifcvt time on loops for
vectorization only, or during vectorization or something similar?
As ifcvt then turns the IMHO desirable
  if (a_16 >= 0.0)
    goto <bb 5>; [59.00%]
  else
    goto <bb 11>; [41.00%]

  <bb 11> [local count: 435831803]:
  goto <bb 7>; [100.00%]

  <bb 5> [local count: 627172605]:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  if (e_13(D) > a_16)
    goto <bb 12>; [20.00%]
  else
    goto <bb 6>; [80.00%]

  <bb 12> [local count: 125434523]:
  goto <bb 7>; [100.00%]

  <bb 6> [local count: 501738082]:

  <bb 7> [local count: 1063004410]:
  # prephitmp_26 = PHI <iftmp.0_18(12), 0.0(6), 1.0e+0(11)>
(ok, the 2 empty forwarders are unlikely to be useful) into:
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _21 = a_16 >= 0.0;
  _10 = e_13(D) > a_16;
  _9 = _10 & _21;
  _27 = e_13(D) <= a_16;
  _28 = _21 & _27;
  _ifc__43 = _9 ? iftmp.0_18 : 0.0;
  _ifc__44 = _28 ? 0.0 : _ifc__43;
  _45 = a_16 < 0.0;
  prephitmp_26 = _45 ? 1.0e+0 : _ifc__44;
Now, perhaps if ifcvt used ranger, it could figure out that a_16 < 0.0 implies
e_13(D) > a_16 and do something smarter with it.
Or maybe just try to do a smarter ifcvt based only on the original CFG.
The pre-ifcvt code was
  a_16 < 0.0f ? 1.0f : a_16 < e_13 ? 1.0f - a_16 * d_17 : 0.0f
so when ifcvt puts everything together, it could make it
  _7 = a_16 * d_17(D);
  iftmp.0_18 = 1.0e+0 - _7;
  _27 = e_13(D) > a_16;
  _28 = a_16 < 0.0;
  _ifc__43 = _27 ? iftmp.0_18 : 0.0f;
  prephitmp_26 = _28 ? 1.0f : _ifc__43;
?
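
To illustrate the question, here is a rough sketch that transcribes both select
chains into C and compares them on sampled inputs. The function names and the
sample grid are hypothetical, the variable names merely echo the SSA names in
the dumps, and this only models the selects; it says nothing about how ifcvt
would actually derive the shorter chain.

```c
#include <stdbool.h>

/* Select chain as in the current ifcvt output quoted above. */
static float ifcvt_current(float a, float d, float e)
{
    float t7 = a * d;
    float iftmp = 1.0f - t7;
    bool b21 = a >= 0.0f;
    bool b10 = e > a;
    bool b9 = b10 & b21;
    bool b27 = e <= a;
    bool b28 = b21 & b27;
    float ifc43 = b9 ? iftmp : 0.0f;
    float ifc44 = b28 ? 0.0f : ifc43;
    bool b45 = a < 0.0f;
    return b45 ? 1.0f : ifc44;
}

/* Shorter chain suggested above: two comparisons, two selects. */
static float ifcvt_proposed(float a, float d, float e)
{
    float t7 = a * d;
    float iftmp = 1.0f - t7;
    bool b27 = e > a;
    bool b28 = a < 0.0f;
    float ifc43 = b27 ? iftmp : 0.0f;
    return b28 ? 1.0f : ifc43;
}

/* Hypothetical helper: count sampled finite inputs where the chains differ. */
int count_select_mismatches(void)
{
    int mismatches = 0;
    const float d = 0.5f;
    for (int ei = 0; ei <= 8; ei++) {
        float e = 2.0f + 0.25f * (float)ei;   /* e in [2.0f, 4.0f] */
        for (int ai = -32; ai <= 32; ai++) {
            float a = 0.25f * (float)ai;      /* a in [-8.0f, 8.0f] */
            if (ifcvt_current(a, d, e) != ifcvt_proposed(a, d, e))
                mismatches++;
        }
    }
    return mismatches;
}
```

On this grid the two chains should agree everywhere, which is consistent with
the shorter form being a valid replacement for the longer one.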
