[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #10 from James Greenhalgh --- The most striking improvement was in libquantum, for which we saw a 15% performance improvement on Cortex-A72 (3% on cortex-A57) directly attributable to basic block ordering after this patch. Otherwise, I don't have a direct before/after comparison for just Honza's patch across a wider set of benchmarks, but our nightly runs show general improvements in benchmarks from Spec which are sensitive to block reordering after the day of the patch. I don't see any large regressions in this time.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #9 from Brian Rzycki --- Hello James, If this is working as designed this may just be a case of having to live with the regression. Do we know if anyone analyzed the net-benefit of Honza's patch? If more benchmarks/tests win then I think we have our answer.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #8 from James Greenhalgh --- In the case before Honza's patch, corrupt profile information leads to a branch being marked as 100% taken. After Honza's patch, the branch is instead seen with 95.6% taken: (jump_insn 1916 1915 1922 309 (set (pc) (if_then_else (ne (reg:CC 66 cc) (const_int 0 [0])) (label_ref 1905) (pc))) "foo.cpp":59 9 {condjump} (expr_list:REG_DEAD (reg:CC 66 cc) (int_list:REG_BR_PROB 1 (nil))) -> 1905) ;; succ: 227 [95.6%] ;; 226 [4.4%] (FALLTHRU) That's enough for GCC to consider the branch unpredictable, which in turn causes GCC to use the "unpredictable" number for BRANCH_COST when setting the maximum , which when tuning for Cortex-A57 is 1 for predictable branches (not high enough to trigger the transform) and 3 for unpredictable branches (high enough to trigger the transform). That explains why we don't see the performance difference for -mcpu=generic, where BRANCH_COST always returns 2 - which is always high enough to trigger this if-conversion. The cost model looks reasonable, this is clearly a borderline case for the heuristic. The only thing I found surprising in my analysis of this regression is that GCC considers a 95.6% taken branch as unpredictable. I'm not sure what the correct course for fixing this is - nothing in the compiler seems to be broken, we're just on an unlucky side of the static prediction engine and the ifcvt heuristics.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 Jeffrey A. Law changed: What|Removed |Added Priority|P3 |P2 CC||law at redhat dot com
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #7 from James Greenhalgh --- I'm not sure there are any bugs here to fix, though I can still reproduce the performance differences. First up, basic block reordering causes an issue across all microarchitectures on which I've looked at this. Basic block reordering is kicking in because the static estimates of the execution profile make it look like a good idea. I'm struggling to understand exactly what the execution profile of the testcase is intended to be, as I'm finding both the source and the generated assembly/perf reports hard to follow. Because I'm struggling to follow it, I can't tell if the basic block reorganisation is sensible, but it doesn't look buggy. Turning basic block reordering off (with -fno-reorder-blocks) removes the performance difference for me, with that off both before and after r245151 have similar performance on Cortex-A53 and Cortex-A72. However, Cortex-A57 still shows a performance regression, which I believe is related to an extra conditional branch in the code after r245151. I tried to find which pass previously removed this branch and narrowed it down to jump2, but I haven't figured out why there is such a change in jump2. I'm on vacation now, so won't be able to look at this in the next week if anyone else wants to dig.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #6 from Brian Rzycki --- James, my apologies if it wasn't clear enough what the compile options were. The test platform in this case is a Juno A57 running Ubuntu. I actually never turned off -mcpu=cortex-a57 during my testing. I'll keep this in mind too in the future.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 James Greenhalgh changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-03-29 Ever confirmed|0 |1 --- Comment #5 from James Greenhalgh --- It is always helpful to provide the full command line and configuration used, along with rough details of the platform you are using. Compiling with: gcc -O3 -mcpu=cortex-a57 I can see the performance difference reported here when running on Cortex-A57, Cortex-A72 and Cortex-A53 based systems, while with: gcc -O3 (i.e. using the generic tuning structures) I see no meaningful performance differences across multiple microarchitectures. So, confirmed. Though this is more likely to be a deficiency in the heuristics in the back-end ifcvt cost model than a regression directly introduced by Honza's patch.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 --- Comment #4 from Brian Rzycki --- Hi James, thank you for taking the time to test this. Unfortunately I can't post the original code because of its license. The problem is with the weighting changing enough that at least one pair of if conditionals is no longer converted into a boolean logical if. If it helps I can talk to Sebastian to see if we have time to refine the reduced test case. Alternatively I may be able to post the diff of the previous and current passes on the real code showing the branch no longer selected for conversion.
[Bug tree-optimization/79534] [7 Regression] tree-ifcombine aarch64 performance regression with trunk@245151
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79534 James Greenhalgh changed: What|Removed |Added CC||jgreenhalgh at gcc dot gnu.org --- Comment #3 from James Greenhalgh --- I'm unable to reproduce the performance difference on the systems I have access to, comparing r245150 against r245151.