https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109154

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Confirmed. It looks like the extra range information from
g:4fbe3e6aa74dae5c75a73c46ae6683fdecd1a75d is leading jump threading down the
wrong path.

Reduced testcase:
---

int etot_0, fasten_main_natpro_chrg_init;

void fasten_main_natpro() {
  float elcdst = 1;
  for (int l; l < 1; l++) {
    int zone1 = l < 0.0f,
        chrg_e = fasten_main_natpro_chrg_init * (zone1 ?: 1)
                 * (l < elcdst ? 1 : 0.0f);
    etot_0 += chrg_e;
  }
}

---

and compile with `-O1`. The issue also affects all targets, not just AArch64:
https://godbolt.org/z/qes4K4oTz. Using `-fno-thread-jumps` is confirmed to
"fix" it.

With this testcase, jump threading seems to duplicate the edges on the
`l < 0.0f` check.
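
For reference, the shape of the CFG before threading is roughly the following
(a hand-written C sketch, not the actual GIMPLE; `v`, `a`, `b` are made-up
names that only illustrate the block structure):

    /* Sketch of the pre-threading CFG of the loop body.      */
    float before_threading(float v, float a, float b) {
      float t1, t2;
      /* BB3: first range check, guessed 41% / 59%.           */
      if (v < 0.0f)
        t1 = a;          /* BB4, falls through to BB5         */
      else
        t1 = b;
      /* BB5: second range check, reached from BB3 and BB4.   */
      if (v < 1.0f)
        t2 = 1.0f;       /* edge to BB7                       */
      else
        t2 = 0.0f;       /* BB6, falls through to BB7         */
      return t1 * t2;    /* BB7: merge, two-way PHI for t2    */
    }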

The dump says:

"Jump threading proved probability of edge 5->7 too small (it is 41.0%
(guessed) should be 69.5% (guessed))"

In BB 3 the branch probabilities are guessed as:

    if (_1 < 0.0)
      goto <bb 4>; [41.00%]
    else
      goto <bb 5>; [59.00%]

and in BB 5:

    if (_1 < 1.0e+0)
      goto <bb 7>; [41.00%]
    else
      goto <bb 6>; [59.00%]

and so it concludes that the chance of _1 >= 0.0 && _1 < 1.0 is very small:

    if (_1 < 1.0e+0)
      goto <bb 7>; [14.80%]
    else
      goto <bb 6>; [85.20%]

The problem is that BB 4 falls through to BB 5, and BB 6 falls through to
BB 7.

Jump threading optimizes BB 5 by splitting the work to be done in BB 5 for the
fall-through from BB 4 back into BB 4.
It then threads the additional edge to BB 7, where the final calculation is
now much more expensive than before (a three-way PHI node).
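
Roughly, the threaded shape corresponds to (same hand-written sketch, not the
actual GIMPLE):

    /* Sketch of the post-threading CFG: v < 0.0f implies
       v < 1.0f, so that path bypasses the second check.      */
    float after_threading(float v, float a, float b) {
      float t1, t2;
      if (v < 0.0f) {
        t1 = a;
        t2 = 1.0f;       /* threaded edge straight to BB7     */
      } else {
        t1 = b;
        if (v < 1.0f)    /* BB5 now only sees v >= 0.0f       */
          t2 = 1.0f;
        else
          t2 = 0.0f;     /* BB6, falls through to BB7         */
      }
      return t1 * t2;    /* BB7: now a three-way PHI          */
    }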

But because the hot path through BB 6 also falls into BB 7, the overall result
is that all paths become slower; in particular, the hot path gained an
additional comparison.

This is why the code slows down: for each instance of this pattern (and in
the example provided by microbude it happens often) we get an additional
branch on a few paths.

This has a bigger slowdown on SVE (vs. the scalar slowdown) because it then
creates a longer dependency chain for producing the predicate for the BB.

It looks like this threading shouldn't be done when both the hot and cold
branches end up in the same place?
