[Bug tree-optimization/103409] [12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

Jan Hubicka changed:

           What    |Removed     |Added
           Status  |NEW         |RESOLVED
         Resolution|---         |FIXED

--- Comment #15 from Jan Hubicka ---
The first graph seems to be back to normal, and I think the second is within the noise range. If not, I will try to figure out what happens here (one is -O2 and the other -Ofast, so there may be something in that).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #14 from Aldy Hernandez ---
(In reply to hubicka from comment #13)
> > I've fixed the threading slowdown. Can someone verify and close this PR if
> > all the slowdown has been accounted for? If not, then someone needs to
> > explore any slowdown unrelated to the threader.
> The plots linked from the PR are live, so they should come back to
> original speed (so far they did not).
>
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=226.548.8
> and
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8

There's now a big drop for the first graph, and a small drop for the second one.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #13 from hubicka at kam dot mff.cuni.cz ---
> I've fixed the threading slowdown. Can someone verify and close this PR if
> all the slowdown has been accounted for? If not, then someone needs to
> explore any slowdown unrelated to the threader.

The plots linked from the PR are live, so they should come back to original speed (so far they did not).

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=226.548.8
and
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.548.8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #12 from Aldy Hernandez ---
I've fixed the threading slowdown. Can someone verify and close this PR if all the slowdown has been accounted for? If not, then someone needs to explore any slowdown unrelated to the threader.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #11 from CVS Commits ---
The master branch has been updated by Aldy Hernandez:

https://gcc.gnu.org/g:54ebec35abec09a24b47b997172622ca0d8e2318

commit r12-5694-g54ebec35abec09a24b47b997172622ca0d8e2318
Author: Aldy Hernandez
Date:   Mon Nov 29 14:49:59 2021 +0100

    path solver: Use only one ssa_global_cache.

    We're using a temporary range cache while computing ranges for PHIs
    to make sure the real cache doesn't get set until all PHIs are
    computed. With the ltrans beast in LTO mode this causes undue
    overhead.

    Since we already have a bitmap to indicate whether there's a cache
    entry, we can avoid the extra cache object by clearing it while PHIs
    are being calculated.

    gcc/ChangeLog:

            PR tree-optimization/103409
            * gimple-range-path.cc (path_range_query::compute_ranges_in_phis):
            Do all the work with just one ssa_global_cache.
            * gimple-range-path.h: Remove m_tmp_phi_cache.
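To illustrate the idea in the commit message, here is a minimal, self-contained C++ model. It is a sketch, not GCC code: `Range`, `Phi`, `PathCache`, `set`, `get` and `compute_phis` are hypothetical names, ranges are plain [lo, hi] pairs, and a name without a visible cache entry is assumed to fall back to a conservative "fallback" range (standing in for whatever the real path solver does). PHI results are written straight into the single cache, but their has-entry bits stay cleared until every PHI in the block has been evaluated, so no PHI observes a sibling's new value and no second, temporary cache has to be built.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of the "one cache + has-entry bitmap" idea; not GCC code.
struct Range {
  long lo;
  long hi;
};

struct Phi {
  std::size_t result;  // SSA version defined by the PHI
  std::size_t arg;     // SSA version flowing in along the path's incoming edge
};

class PathCache {
 public:
  PathCache(std::size_t num_names, Range fallback)
      : values_(num_names, fallback),
        has_entry_(num_names, false),
        fallback_(fallback) {}

  void set(std::size_t name, Range r) {
    assert(name < values_.size());
    values_[name] = r;
    has_entry_[name] = true;
  }

  // Only names whose bit is set count as cached; anything else gets the
  // conservative fallback range.
  Range get(std::size_t name) const {
    assert(name < values_.size());
    return has_entry_[name] ? values_[name] : fallback_;
  }

  bool has_entry(std::size_t name) const {
    assert(name < has_entry_.size());
    return has_entry_[name];
  }

  // PHIs at the start of a block take effect simultaneously, so no PHI may
  // see a result computed by another PHI of the same block.
  void compute_phis(const std::vector<Phi>& phis) {
    std::vector<std::size_t> pending;  // results computed but not yet visible
    for (const Phi& phi : phis) {
      assert(phi.result < values_.size());
      Range r = get(phi.arg);
      values_[phi.result] = r;         // store in the single shared cache...
      has_entry_[phi.result] = false;  // ...but keep it hidden for now
      pending.push_back(phi.result);
    }
    for (std::size_t name : pending) {
      has_entry_[name] = true;  // publish all PHI results at once
    }
  }

 private:
  std::vector<Range> values_;
  std::vector<bool> has_entry_;
  Range fallback_;
};
```

For example, with name 1 cached as [5, 5] and the PHIs `2 = PHI<1>` and `3 = PHI<2>`, name 2 ends up as [5, 5] while name 3 gets the fallback range, because 2's new value was still hidden when 3 was evaluated. In this model, a PHI that reads a sibling result whose bit is hidden sees the fallback range rather than any earlier cached value.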
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #10 from Aldy Hernandez ---
Created attachment 51896 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=51896=edit
untested patch

The threading slowdown here is due to the ssa_global_cache temporary. It doesn't look like ssa_global_cache was meant to be a lightweight temporary cache ;-). We can avoid the temporary altogether by using the bitmap already used to determine if a cache entry is available.

With this (untested) patch the ltrans42 unit is back to:

tree VRP                 :   13.70 (  3%)   0.04 (  2%)   13.71 (  3%)    45M (  4%)
backwards jump threading :   13.22 (  3%)   0.01 (  0%)   13.26 (  3%)  3609k (  0%)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

--- Comment #9 from Aldy Hernandez ---
There's definitely something in the threader, but I'm not sure it's the cause of all the regression.

For the record, I've reproduced on ppc64le with a spec .cfg file having:

OPTIMIZE= -O2 -flto=100 -save-temps -ftime-report -v -fno-checking

The slow wrf_r.ltransNN.o files that dominate the compilation and are taking more than 2-3 seconds are 42, 76, and 24. I've distilled -ftime-report for VRP and jump threading, which usually go hand in hand now that VRP2 runs with ranger:

dumping.42: tree VRP                 :   13.70 (  3%)   0.08 (  2%)   13.73 (  3%)    45M (  4%)
dumping.42: backwards jump threading :   26.68 (  5%)   0.00 (  0%)   26.72 (  5%)  3609k (  0%)
dumping.42: TOTAL                    :  524.00          3.31         527.30         1277M
dumping.76: tree VRP                 :   38.30 ( 13%)   0.03 (  2%)   38.31 ( 13%)    19M (  2%)
dumping.76: backwards jump threading :   47.38 ( 17%)   0.01 (  1%)   47.37 ( 16%)  1671k (  0%)
dumping.76: TOTAL                    :  286.03          1.79         287.82         1173M
dumping.24: tree VRP                 :   87.43 (  8%)   0.07 (  2%)   87.53 (  8%)    58M (  3%)
dumping.24: backwards jump threading :  129.81 ( 12%)   0.00 (  0%)  129.81 ( 12%)  8986k (  0%)
dumping.24: TOTAL                    : 1042.37          3.58        1045.93         2325M

Threading is usually more expensive than VRP because it tries candidates over and over, but it's not meant to be orders of magnitude slower.
Prior to the bisected patch in r12-5228, we had:

dumping.42: tree VRP                 :   14.58 (  3%)   0.07 (  2%)   14.62 (  3%)    45M (  4%)
dumping.42: backwards jump threading :   13.88 (  3%)   0.00 (  0%)   13.89 (  3%)  3609k (  0%)
dumping.42: TOTAL                    :  484.12          3.06         487.18         1277M
dumping.76: tree VRP                 :   37.68 ( 13%)   0.04 (  2%)   37.79 ( 13%)    19M (  2%)
dumping.76: backwards jump threading :   45.50 ( 15%)   0.03 (  2%)   45.52 ( 15%)  1671k (  0%)
dumping.76: TOTAL                    :  293.74          1.81         295.55         1173M
dumping.24: tree VRP                 :   94.27 (  9%)   0.11 (  3%)   94.39 (  9%)    58M (  3%)
dumping.24: backwards jump threading :  102.63 ( 10%)   0.02 (  0%)  102.67 ( 10%)  8986k (  0%)
dumping.24: TOTAL                    : 1021.66          4.28        1025.92         2325M

So at least for ltrans42, there's a big slowdown with this patch. Before, threading was 4.80% faster than VRP, whereas now it's 94.7% slower.

I have a patch for the above slowdown, but I wouldn't characterize the above difference as a "compile hog". When I add up the 3 ltrans unit totals (which are basically the entire compilation), the difference is a 3% slowdown. If this PR is about a slowdown larger than 3-4%, I think we should look elsewhere. I could be wrong though ;-).
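The percentages in comment #9 follow from the user-time (first) column of the -ftime-report excerpts. A short C++ sketch to recompute them; the figures are copied from the excerpts, while `kTotalBefore`, `kTotalAfter`, `pct_change`, `sum` and `matches_quoted` are hypothetical names introduced only for this check.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <numeric>

// User time in seconds (first -ftime-report column), copied from comment #9.
// Entry order: ltrans42, ltrans76, ltrans24.
const std::array<double, 3> kTotalBefore = {484.12, 293.74, 1021.66};  // before r12-5228
const std::array<double, 3> kTotalAfter = {524.00, 286.03, 1042.37};   // after r12-5228

// Relative change of `now` with respect to `base`, in percent.
double pct_change(double base, double now) {
  assert(base != 0.0);
  return (now - base) / base * 100.0;
}

// Sum of the per-ltrans totals.
double sum(const std::array<double, 3>& xs) {
  return std::accumulate(xs.begin(), xs.end(), 0.0);
}

// True if a recomputed value agrees with a figure quoted in the bug to within `tol`.
bool matches_quoted(double value, double quoted, double tol) {
  return std::fabs(value - quoted) <= tol;
}
```

Evaluated, `-pct_change(14.58, 13.88)` is about 4.80 and `pct_change(13.70, 26.68)` about 94.7 for ltrans42, and the three totals go from 1799.52 s to 1852.40 s, a change of about 2.9%, in line with the "3% slowdown" quoted above.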
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

Jan Hubicka changed:

           What    |Removed |Added
           See Also|        |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102943

--- Comment #8 from Jan Hubicka ---
Thanks for bisecting! So not modref, but jump threading this time. Linking with the other PR on WRF and threading (perhaps those are different issues).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103409

Martin Liška changed:

           What    |Removed |Added
            Summary|[12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-3903-g0288527f47cec669 |[12 Regression] 18% SPEC2017 WRF compile-time regression with -O2 -flto since r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d

--- Comment #7 from Martin Liška ---
Ok, then it started with r12-5228-gb7a23949b0dcc4205fcc2be6b84b91441faa384d.