[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #28 from krzysio.kurek at wp dot pl ---
I mean the relative performance is worse, but profiled GCC 6 is still faster than GCC 7 or GCC 8.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What     |Removed                    |Added
  Assignee |marxin at gcc dot gnu.org  |unassigned at gcc dot gnu.org
--- Comment #27 from Martin Liška ---
I'm sorry for not having enough time. If GCC 7 and 8 are fine, then I won't spend much time investigating version 6.
Thanks for understanding.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
krzysio.kurek at wp dot pl changed:
  What                           |Removed |Added
  Attachment #44025 is obsolete  |0       |1
--- Comment #26 from krzysio.kurek at wp dot pl ---
Created attachment 44040
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44040&action=edit
Annotates and report logs from perf
I don't have an FTP, I'm just doing this stuff locally by hand.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #25 from Martin Liška ---
(In reply to krzysio.kurek from comment #24)
> Created attachment 44025 [details]
> Performance logs from perf
>
> Alright so I've generated 4 profiles with the following flags:
> "pure6": -O3 -DNDEBUG -flto
> "profiled6": -O3 -DNDEBUG -flto + profile run
> "pure7": -O3 -DNDEBUG -flto
> "profiled7": -O3 -DNDEBUG -flto + profile run
>
> On GCC6 the profile run impacted performance of the application negatively
> compared to "pure6".
It would be best if you attached the perf report and perf annotate output. The perf.data files are not useful on their own, as one needs the built binaries to read them. Can you please attach ftp which you have for these configurations?
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
krzysio.kurek at wp dot pl changed:
  What                           |Removed |Added
  Attachment #42199 is obsolete  |0       |1
--- Comment #24 from krzysio.kurek at wp dot pl ---
Created attachment 44025
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44025&action=edit
Performance logs from perf
Alright so I've generated 4 profiles with the following flags:
"pure6": -O3 -DNDEBUG -flto
"profiled6": -O3 -DNDEBUG -flto + profile run
"pure7": -O3 -DNDEBUG -flto
"profiled7": -O3 -DNDEBUG -flto + profile run
On GCC6 the profile run impacted performance of the application negatively compared to "pure6".
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #23 from Martin Liška ---
(In reply to krzysio.kurek from comment #22)
> I'm sorry but profiling doesn't seem to be well documented, what tools
> should I use to generate a profiling log?
$ gcc -fprofile-generate x.c
$ ./a.out # or make check, which will train the compiler
$ gcc -fprofile-use x.c
The profile will then be loaded and used automatically.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #22 from krzysio.kurek at wp dot pl ---
I'm sorry, but profiling doesn't seem to be well documented. What tools should I use to generate a profiling log?
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What    |Removed   |Added
  Status  |REOPENED  |WAITING
--- Comment #21 from Martin Liška ---
Ok, so you can use compiler hints like the always_inline attribute or the inline keyword. Apart from that, you can use PGO (-fprofile-generate/-fprofile-use) to provide even more hints.
For the analysis, please attach perf report log files for the slow and the fast version. That should help us.
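As an illustration of the inlining hint mentioned in comment #21, here is a minimal sketch. The function names `nextInt` and `sumDraws` are hypothetical and do not come from the project under discussion; the attribute spelling `[[gnu::always_inline]]` is GCC-specific, and whether forcing inlining actually helps here would need to be measured.

```cpp
#include <random>

// Hypothetical helper (not from the project): asks GCC to inline this
// function into every caller, regardless of its usual inlining heuristics.
[[gnu::always_inline]] inline int nextInt(std::mt19937& engine, int low, int high) {
    std::uniform_int_distribution<int> dist(low, high);
    return dist(engine);
}

// Hypothetical hot loop that draws `count` numbers in the range [1, 6].
int sumDraws(std::mt19937& engine, int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += nextInt(engine, 1, 6);
    }
    return sum;
}
```

The attribute only removes the inliner's freedom to say no; it does not guarantee that the resulting code is faster, which is why the perf logs are still requested.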
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
krzysio.kurek at wp dot pl changed:
  What        |Removed   |Added
  Status      |RESOLVED  |REOPENED
  Resolution  |INVALID   |---
--- Comment #20 from krzysio.kurek at wp dot pl ---
Reopening.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #19 from krzysio.kurek at wp dot pl ---
I'm sorry if it's not clear, English isn't my native language. So to reiterate.
The normal, unoptimized path is:
World::update() simulates the world, which uses Random::get() to generate a random number.
Random::get() calls std::uniform_int_distribution, which then calls std::mersenne_twister_engine().
And this is what GCC7 uses.
Now GCC6 with LTO manages to optimize the path by:
- Calling Random::get() only once per frame.
- Omitting std::uniform_int_distribution, calling std::mersenne_twister_engine() instead every time a random number is needed.
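To make the two call paths from comment #19 concrete, here is a minimal sketch. Both function names are hypothetical, and this is only an illustration of the difference between going through the distribution and calling the engine directly, not the project's code or the code GCC generates.

```cpp
#include <cstdint>
#include <random>

// Layered path: every draw constructs and goes through
// std::uniform_int_distribution, which in turn calls the engine.
int drawViaDistribution(std::mt19937& engine, int low, int high) {
    std::uniform_int_distribution<int> dist(low, high);
    return dist(engine);
}

// Direct path: a single raw call to the Mersenne Twister engine,
// with no distribution layer in between.
std::uint32_t drawDirect(std::mt19937& engine) {
    return static_cast<std::uint32_t>(engine());
}
```

If the compiler can see through all layers, as cross-module inlining with LTO may allow, the layered path can collapse into something close to the direct one.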
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #18 from krzysio.kurek at wp dot pl ---
I'm sorry if it's not clear, English isn't my native language. So to reiterate.
The normal, unoptimized path is:
World::update() simulates world which uses Random::get() on every person.
Random::get() calls
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
krzysio.kurek at wp dot pl changed:
  What     |Removed |Added
  Version  |7.1.0   |7.2.0
--- Comment #17 from krzysio.kurek at wp dot pl ---
I've been trying to create a minimal testcase but I've not been able to. Seemingly random actions, like changing out of the build directory and back into it, prevent the bug from manifesting.
Looking at the results of Valgrind's Callgrind tool, I've found that the hog is Random::get() calls not being optimized as aggressively on GCC7 as on GCC6.
On GCC6, std::mersenne_twister_engine is called directly from the main world simulation loop, whereas on GCC7, Random::get() is called every time calculations on random numbers are made. Random::get() in turn calls std::uniform_int_distribution, which then calls the Mersenne engine.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What        |Removed   |Added
  Status      |ASSIGNED  |RESOLVED
  Resolution  |---       |INVALID
--- Comment #16 from Martin Liška ---
I'm closing this because analysis is needed on the program's side. Feel free to reopen once you have analyzed what the runtime hog is.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #15 from Martin Liška ---
I would recommend using the perf tool to identify and understand what happens in the program. I can return to this later, when we tune GCC 8.x.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #14 from Martin Liška ---
So for GCC 7 the drop is caused by r237791:
SVN revision: 237791
Author: hubicka
    * gcc.dg/predict-12.c: New testcase.
    * predict.c: Include gimple-pretty-print.h
    (predicted_by_loop_heuristics_p): Check also PRED_LOOP_EXIT_WITH_RECURSION
    (predict_loops): Find self recursive calls and use special purpose predictors for them; dump log about decisions.
    (pass_profile::execute): Dump info about #of iterations.
    * predict.def (PRED_LOOP_EXIT_WITH_RECURSION,
    (PRED_LOOP_GUARD_WITH_RECURSION): New predictors.
and the performance for current trunk is similar to GCC 6.x, starting from r248334:
Author: hubicka
    * ipa-inline.c (edge_badness): Use inlined_time instead of inline_summaries->get.
That said, it's sensitive to inlining.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #13 from Martin Liška ---
One more observation: using PGO (-fprofile-generate and -fprofile-use), GCC 6 and GCC 7 have similar performance: 34 fps, while GCC 6 without PGO has 43 fps on my machine.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What      |Removed                        |Added
  Status    |WAITING                        |ASSIGNED
  Assignee  |unassigned at gcc dot gnu.org  |marxin at gcc dot gnu.org
--- Comment #12 from Martin Liška ---
(In reply to krzysio.kurek from comment #11)
> Done, cmake will now default to Release config with altered compiler flags
> that include -flto.
Thanks. I can confirm that I can reproduce it on my laptop. Tomorrow, I'll bisect that.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #11 from krzysio.kurek at wp dot pl ---
Done, cmake will now default to Release config with altered compiler flags that include -flto.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What    |Removed |Added
  Status  |NEW     |WAITING
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Martin Liška changed:
  What    |Removed  |Added
  Status  |WAITING  |NEW
--- Comment #10 from Martin Liška ---
I can confirm that I'm able to build it and see the FPS on the console. Please modify the branch so that it uses -flto and the optimization flags you use.
Thanks,
Martin
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #9 from Andrew Pinski ---
Does -fno-threadsafe-statics help here? Though that should not make a difference here as GCC 6 defaults to the same as GCC 7 in this area.
Also does moving "static Random r;" outside of Random::get() help?
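The two placements of the static object that comment #9 asks about can be sketched as follows. This is a hypothetical reconstruction: the class layout, the `intInRange` member, the seed value, and the accessor names are assumptions for illustration and may not match the project's actual `Random` class.

```cpp
#include <random>

// Hypothetical stand-in for the project's Random class.
class Random {
public:
    explicit Random(unsigned seed) : engine_(seed) {}

    int intInRange(int low, int high) {
        std::uniform_int_distribution<int> dist(low, high);
        return dist(engine_);
    }

private:
    std::mt19937 engine_;
};

// Variant 1: function-local static. The object is constructed on first use,
// so the compiler emits an "already initialized?" guard check on every call;
// -fno-threadsafe-statics only drops the locking around that first
// initialization.
Random& getLocalStatic() {
    static Random r(12345u);
    return r;
}

// Variant 2: namespace-scope object, constructed before main() runs, so the
// accessor is a plain reference return without a guard check.
namespace {
Random g_random(12345u);
}

Random& getGlobal() {
    return g_random;
}
```

Both accessors return an equivalent generator; the difference is only in how much work each call does before the generator is reached, which may affect how readily the call is inlined.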
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #8 from krzysio.kurek at wp dot pl ---
Okay, I'm terribly sorry for the number of messages, but I'm failing over and over again to grasp the issue.
The seed is only static for the first request, but changes later on, becoming random once more.
After I made a few (what seemed at the time) minor cosmetic changes, GCC6 optimized out at least one additional function completely.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #7 from krzysio.kurek at wp dot pl ---
I've created a fork with a "devel" branch here: https://github.com/kiroma/Empire/tree/devel
I've tweaked some things, so now the random seed is set, the simulation exits after 250 iterations, and on exit it prints the average fps for the whole run to the console.
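A benchmark mode like the one described in comment #7 (fixed seed, fixed number of iterations, average fps at the end) could look roughly like the sketch below. All names here (`BenchResult`, `simulateFrame`, `runBenchmark`) and the per-frame workload are hypothetical stand-ins, not the code from the devel branch.

```cpp
#include <chrono>
#include <cstdint>
#include <random>

struct BenchResult {
    double averageFps;       // frames divided by elapsed wall-clock seconds
    std::uint64_t checksum;  // keeps the simulated work from being optimized away
};

// Hypothetical stand-in for one frame of World::update().
std::uint64_t simulateFrame(std::mt19937& engine, int drawsPerFrame) {
    std::uniform_int_distribution<int> dist(0, 9);
    std::uint64_t sum = 0;
    for (int i = 0; i < drawsPerFrame; ++i) {
        sum += static_cast<std::uint64_t>(dist(engine));
    }
    return sum;
}

BenchResult runBenchmark(int frames, unsigned seed, int drawsPerFrame) {
    std::mt19937 engine(seed);  // fixed seed: every run sees the same sequence
    std::uint64_t checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int frame = 0; frame < frames; ++frame) {
        checksum += simulateFrame(engine, drawsPerFrame);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    const double seconds = elapsed.count();
    // Guard against a zero-length measurement on very short runs.
    return BenchResult{seconds > 0.0 ? frames / seconds : 0.0, checksum};
}
```

Calling `runBenchmark(250, 42u, 100000)` and printing `averageFps` would mirror the 250-iteration run; the fixed seed makes the checksum identical across compilers, so only the timing differs.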
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #6 from krzysio.kurek at wp dot pl ---
Oh, two comments about the callgrind output:
- Random::get() is being called 14 times less often on GCC6.
- There is a function created from a template in GCC7 that is completely absent on GCC6.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #5 from krzysio.kurek at wp dot pl ---
Created attachment 42199
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42199&action=edit
callgrind output
I can make the program output to console average fps after 20 iterations with 10 warmup cycles. The program will still run with GUI, but will output just FPS to the console and exit. Will this be sufficient?
I'm a complete newbie to the whole language, but I got some profiling done. I'm attaching callgrind output files, they're named after which compiler was used.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
--- Comment #4 from Martin Liška ---
Moreover, if there's an automated way to get the FPS without GUI interaction, I can bisect to the revision that's responsible for this.
[Bug lto/82229] GCC7's LTO underperforms compared to GCC6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229
Richard Biener changed:
  What              |Removed      |Added
  Keywords          |             |lto, missed-optimization
  Status            |UNCONFIRMED  |WAITING
  Last reconfirmed  |             |2017-09-18
  Ever confirmed    |0            |1
--- Comment #3 from Richard Biener ---
I suspect changes in cross-module inlining causing the issue. Without some analysis from your side it's really hard to do anything about this. A good start is to do some profiling.