[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-04-20 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #10 from James Greenhalgh  ---
The most striking improvement was in libquantum, for which we saw a 15%
performance improvement on Cortex-A72 (3% on cortex-A57) directly attributable
to basic block ordering after this patch.

Otherwise, I don't have a direct before/after comparison for just Honza's patch
across a wider set of benchmarks, but our nightly runs show general
improvements in benchmarks from Spec which are sensitive to block reordering
after the day of the patch. I don't see any large regressions in this time.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-04-19 Thread brzycki at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #9 from Brian Rzycki  ---
Hello James,

If this is working as designed this may just be a case of having to live with
the regression. Do we know if anyone analyzed the net-benefit of Honza's patch?
If more benchmarks/tests win then I think we have our answer.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-04-19 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #8 from James Greenhalgh  ---
In the case before Honza's patch, corrupt profile information leads to a branch
being marked as 100% taken. After Honza's patch, the branch is instead seen
with 95.6% taken:

(jump_insn 1916 1915 1922 309 (set (pc)
(if_then_else (ne (reg:CC 66 cc)
(const_int 0 [0]))
(label_ref 1905)
(pc))) "foo.cpp":59 9 {condjump}
 (expr_list:REG_DEAD (reg:CC 66 cc)
(int_list:REG_BR_PROB 1 (nil)))
 -> 1905)
;;  succ:   227 [95.6%] 
;;  226 [4.4%]  (FALLTHRU)

That's enough for GCC to consider the branch unpredictable, which in turn
causes GCC to use the "unpredictable" number for BRANCH_COST when setting the
maximum , which when tuning for Cortex-A57 is 1 for predictable branches (not
high enough to trigger the transform) and 3 for unpredictable branches (high
enough to trigger the transform). That explains why we don't see the
performance difference for -mcpu=generic, where BRANCH_COST always returns 2 -
which is always high enough to trigger this if-conversion.

The cost model looks reasonable, this is clearly a borderline case for the
heuristic. The only thing I found surprising in my analysis of this regression
is that GCC considers a 95.6% taken branch as unpredictable.

I'm not sure what the correct course for fixing this is - nothing in the
compiler seems to be broken, we're just on an unlucky side of the static
prediction engine and the ifcvt heuristics.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-31 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P2
 CC||law at redhat dot com

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-30 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #7 from James Greenhalgh  ---
I'm not sure there are any bugs here to fix, though I can still reproduce the
performance differences.

First up, basic block reordering causes an issue across all microarchitectures
on which I've looked at this. Basic block reordering is kicking in because the
static estimates of the execution profile make it look like a good idea. I'm
struggling to understand exactly what the execution profile of the testcase is
intended to be, as I'm finding both the source and the generated assembly/perf
reports hard to follow. Because I'm struggling to follow it, I can't tell if
the basic block reorganisation is sensible, but it doesn't look buggy.

Turning basic block reordering off (with -fno-reorder-blocks) removes the
performance difference for me, with that off both before and after r245151 have
similar performance on Cortex-A53 and Cortex-A72.

However, Cortex-A57 still shows a performance regression, which I believe is
related to an extra conditional branch in the code after r245151. I tried to
find which pass previously removed this branch and narrowed it down to jump2,
but I haven't figured out why there is such a change in jump2.

I'm on vacation now, so won't be able to look at this in the next week if
anyone else wants to dig.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-29 Thread brzycki at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #6 from Brian Rzycki  ---
James, my apologies if it wasn't clear enough what the compile options were.
The test platform in this case is a Juno A57 running Ubuntu.

I actually never turned off -mcpu=cortex-a57 during my testing. I'll keep this
in mind too in the future.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-29 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

James Greenhalgh  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-03-29
 Ever confirmed|0   |1

--- Comment #5 from James Greenhalgh  ---
It is always helpful to provide the full command line and configuration used,
along with rough details of the platform you are using.

Compiling with:

  gcc -O3 -mcpu=cortex-a57

I can see the performance difference reported here when running on Cortex-A57,
Cortex-A72 and Cortex-A53 based systems, while with:

  gcc -O3

(i.e. using the generic tuning structures) I see no meaningful performance
differences across multiple microarchitectures.

So, confirmed. Though this is more likely to be a deficiency in the heuristics
in the back-end ifcvt cost model than a regression directly introduced by
Honza's patch.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-28 Thread brzycki at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

--- Comment #4 from Brian Rzycki  ---
Hi James, thank you for taking the time to test this. Unfortunately I can't
post the original code because of its license. The problem is with the
weighting changing enough that at least one pair of if conditionals is no
longer converted into a boolean logical if.

If it helps I can talk to Sebastian to see if we have time to refine the
reduced test case. Alternatively I may be able to post the diff of the previous
and current passes on the real code showing the branch no longer selected for
conversion.

[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151

2017-03-28 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534

James Greenhalgh  changed:

   What|Removed |Added

 CC||jgreenhalgh at gcc dot gnu.org

--- Comment #3 from James Greenhalgh  ---
I'm unable to reproduce the performance difference on the systems I have access
to, comparing r245150 against r245151.