[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #9 from Rama Malladi --- (In reply to Martin Liška from comment #3) > Can you please share perf-profile before and after the revision? > > Note I can't see it for Altra aarch64 CPU: > https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot. > 1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605. > 477.0&plot.6=759.477.0&plot.7=584.477.0& > > However, there are huge changes in between GCC 6/7 and a newer releases. > Note the benchmark is pretty small and very sensitive to instruction caches. Hi, I got IPC data for baseline version of compiler and with this patch reverted. This is on Graviton3 processor machine, executing 1-copy rate run of 519.lbm_r. Baseline: Compiler commit ID: f896c13489d22b30d01257bc8316ab97b3359d1c Cycles:148,489,372,938 Instructions: 382,748,379,257 IPC: 2.58 Baseline with code change in a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 reverted. $ git diff gcc/tree-cfg.cc diff --git a/gcc/tree-cfg.cc b/gcc/tree-cfg.cc index d982988048f..736432713fe 100644 --- a/gcc/tree-cfg.cc +++ b/gcc/tree-cfg.cc @@ -9984,7 +9984,7 @@ execute_fixup_cfg (void) } if (scale) { - update_max_bb_count (); +// update_max_bb_count (); compute_function_frequency (); } Cycles:140,937,228,769 Instructions: 368,881,714,982 IPC: 2.62 >From the above, I do see the instructions executed are higher for the baseline compiler code-gen vs. the one with patch reverted. Can you please look into the code-gen and let me know if you find some perf opportunity with this patch revert? Thank you.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 Martin Liška changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2022-12-01 Ever confirmed|0 |1 --- Comment #10 from Martin Liška --- @Honza ?
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #11 from Rama Malladi --- (In reply to Martin Liška from comment #10) > @Honza ? Just checking if this can be fixed/ implemented. Thanks.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #12 from Rama Malladi --- I found difference in dumps at various stages of the compilation for the mainline GCC and with update_max_bb_count() commented. Here are the details: Mainline: Commit ID: 63a42ffc0833553fbcb84b50cf0fd2d867b8a92f There was difference in the dumps for these 2 stages: "einline" and "earlydebug" Since we use LTO for this build of 519.lbm_r build, I found these differences in these stages of the link-time optimizer: "vect", "slp1", "ivopts", "earlydebug", "debug" Also, this perf drop of 5%-6% with update_max_bb_count() code was observed only on ARM64 instances (Graviton3) and not on x86_64 instances (Intel Xeon). I ran the other SPEC cpu2017_fprate benchmarks on ARM64 with this code commented on GCC mainline and I haven't observed any perf regression. So, maybe worth a fix. Thank you.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #13 from Martin Liška --- Note the mentioned revision is a fix and yes, sometimes these revisions can end up with a regression as profile estimation is a complex guess.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #14 from Rama Malladi --- (In reply to Martin Liška from comment #13) > Note the mentioned revision is a fix and yes, sometimes these revisions can > end up with a regression as profile estimation is a complex guess. Yes, possibly. So, from my understanding, the update_max_bb_count() tracks the max basic block count and takes a decision to inline or not in this case/ run. That is likely why we see a larger instruction count w this function executed/ enabled.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #22 from Rama Malladi --- I will close this issue as we were unable to reproduce the perf drop going from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #23 from Rama Malladi --- (In reply to Rama Malladi from comment #22) > I will close this issue as we were unable to reproduce the perf drop going > from gcc-7 to gcc-8 on a Graviton2 based instance. The performance of > 519.lbm_r built with gcc-7.4 was same as that with gcc-8.5. Can someone from the GCC dev/ regression team close this issue as I am unable to find an option for the same? Thanks
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 Martin Liška changed: What|Removed |Added Resolution|--- |INVALID Status|NEW |RESOLVED --- Comment #24 from Martin Liška --- Sure, let's close it.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #15 from Rama Malladi --- Hi, Can we review this issue and suggest next steps/ action please? Thanks.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #16 from Martin Liška --- @Honza: ???
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 Martin Jambor changed: What|Removed |Added CC||jamborm at gcc dot gnu.org --- Comment #17 from Martin Jambor --- (In reply to Martin Liška from comment #3) > Note I can't see it for Altra aarch64 CPU: I think LNT can see it very well and it has appeared around the reported time: https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=759.477.0&plot.1=584.477.0&; But I don't think the commit given in the bug summary is the culprit, this clearly happened in GCC 13 development cycle. Rama, what is your reasoning to suggest reverting this particular commit? If you bisected the issue, can you double check you arrived at the correct one?
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 Sebastian Pop changed: What|Removed |Added CC||spop at gcc dot gnu.org --- Comment #18 from Sebastian Pop --- A new 5% regression happened in gcc-trunk more recently and may be due to another patch. Rama was bisecting a 15% perf regression on lbm when updating gcc-7 to gcc-10. The regression can be seen on the LNT graph link from comment#3 https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=633.477.0&plot.1=683.477.0&plot.2=664.477.0&plot.3=648.477.0&plot.4=618.477.0&plot.5=605.477.0&plot.6=759.477.0&plot.7=584.477.0 gcc-6 has execution time of 213 seconds gcc-7 is at 215 seconds gcc-8 is at 266 gcc-9 at 259 gcc-10 at 260 Honza's patch seems to be unrelated as it was committed to trunk before gcc-10 release on May 7, 2020: commit a9a4edf0e71bbac9f1b5dcecdcf9250111d16889 Author: Jan Hubicka Date: Sat Nov 30 22:25:24 2019 +0100 Update max_bb_count in execute_fixup_cfg We need to git-bisect between gcc-7 and gcc-8.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #19 from Rama Malladi --- Thanks @Sebastian and @Martin J. I will get another bisect between GCC 7-and-8.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #20 from Rama Malladi --- @Martin J and @Sebastian P, Let me walk you through the perf data and my triage. First, my triage has been on Graviton 3 (neoverse-v1) processor based instances. Next, I was looking for perf delta going from gcc-7 to gcc-10. I found 2 issues: One was reported in 107413 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107413) and fixed (the perf delta between gcc-7 and gcc-8 -- 215s vs. 266s); Another one is the issue reported in here. I did another triage and landed at the same commit that I reported earlier. # first bad commit: [a9a4edf0e71bbac9f1b5dcecdcf9250111d16889] Update max_bb_count in execute_fixup_cfg Please let me know any further info/ studies you would like to see on this report. Thank you.
[Bug tree-optimization/107409] Perf loss ~5% on 519.lbm_r SPEC cpu2017 benchmark with r10-5090-ga9a4edf0e71bba
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107409 --- Comment #21 from Rama Malladi --- I did another triage for perf loss on Graviton 2 processor (neoverse-n1) based instance and found this commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` to be the reason. As I had indicated in my earlier reply, I was doing a triage of perf loss going from gcc-7 to gcc-10. The perf of 519.libm_r 1-copy run improved 1.08x with the revert of commit: `a9a4edf0e71bbac9f1b5dcecdcf9250111d16889` on gcc-mainline ( `2f1691be517fcdcabae9cd671ab511eb0e08b1d5`). I am guessing that we don't see it on LNT/ Altra CPUs. So, please look into this issue fix. Let me know if you have any queries. Thanks.