[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-05-17 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #28 from krzysio.kurek at wp dot pl ---
I mean the relative performance is worse, but profiled GCC6 is still faster
than GCC 7 or GCC 8.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-05-11 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

   Assignee|marxin at gcc dot gnu.org  |unassigned at gcc dot gnu.org

--- Comment #27 from Martin Liška  ---
I'm sorry for not having enough time. If GCC 7 and 8 are fine, then I won't
spend much time investigating version 6. Thanks for understanding.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-29 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

krzysio.kurek at wp dot pl changed:

   What|Removed |Added

  Attachment #44025 is obsolete|0   |1

--- Comment #26 from krzysio.kurek at wp dot pl ---
Created attachment 44040
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44040&action=edit
Annotates and report logs from perf

I don't have an FTP, I'm just doing this stuff locally by hand.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-26 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #25 from Martin Liška  ---
(In reply to krzysio.kurek from comment #24)
> Created attachment 44025 [details]
> Performance logs from perf
> 
> Alright so I've generated 4 profiles with the following flags:
> "pure6": -O3 -DNDEBUG -flto
> "profiled6": -O3 -DNDEBUG -flto + profile run
> "pure7": -O3 -DNDEBUG -flto
> "profiled7": -O3 -DNDEBUG -flto + profile run
> 
> On GCC6 the profile run impacted performance of the application negatively
> compared to "pure6".

It would be best if you attached the output of perf report and perf annotate.
The raw perf.data files are of little use on their own, as one needs the built
binaries to read them.

Can you please attach ftp which you have for these configurations?

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-26 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

krzysio.kurek at wp dot pl changed:

   What|Removed |Added

  Attachment #42199 is obsolete|0   |1

--- Comment #24 from krzysio.kurek at wp dot pl ---
Created attachment 44025
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44025&action=edit
Performance logs from perf

Alright so I've generated 4 profiles with the following flags:
"pure6": -O3 -DNDEBUG -flto
"profiled6": -O3 -DNDEBUG -flto + profile run
"pure7": -O3 -DNDEBUG -flto
"profiled7": -O3 -DNDEBUG -flto + profile run

On GCC6 the profile run impacted performance of the application negatively
compared to "pure6".

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-24 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #23 from Martin Liška  ---
(In reply to krzysio.kurek from comment #22)
> I'm sorry but profiling doesn't seem to be well documented, what tools
> should I use to generate a profiling log?

$ gcc -fprofile-generate x.c
$ ./a.out # or make check, which will train the compiler
$ gcc -fprofile-use x.c

The profile will be then automatically loaded and used.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-17 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #22 from krzysio.kurek at wp dot pl ---
I'm sorry but profiling doesn't seem to be well documented, what tools should I
use to generate a profiling log?

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-16 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

 Status|REOPENED|WAITING

--- Comment #21 from Martin Liška  ---
Ok, so you can use compiler hints like the always_inline attribute or the
inline keyword. Apart from that, you can use PGO
(-fprofile-generate/-fprofile-use) to provide even more hints.

For the analysis, please attach perf report log files for slow and fast
version. That should help us.
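
As a rough illustration of the inlining hints mentioned above, the sketch
below marks a small hot function with GCC's always_inline attribute. The names
next_random and sum_of_draws are hypothetical and not taken from the program
under discussion; a linear congruential step stands in for the real generator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical hot helper: one step of a linear congruential generator.
// The attribute asks GCC to inline it at every call site, regardless of
// the inliner's own cost heuristics.
__attribute__((always_inline)) inline std::uint32_t next_random(std::uint32_t &state) {
    state = state * 1664525u + 1013904223u;
    return state;
}

// Hypothetical caller standing in for a simulation loop.
std::uint32_t sum_of_draws(std::uint32_t seed, int count) {
    assert(count >= 0);
    std::uint32_t state = seed;
    std::uint32_t sum = 0;
    for (int i = 0; i < count; ++i) {
        sum += next_random(state) & 0xFFu;  // keep only the low byte of each draw
    }
    return sum;
}
```

Whether forcing the inline helps is something only a measurement can show;
the attribute merely removes the decision from the inliner.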

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-15 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

krzysio.kurek at wp dot pl changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|INVALID |---

--- Comment #20 from krzysio.kurek at wp dot pl ---
Reopening.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-15 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #19 from krzysio.kurek at wp dot pl ---
I'm sorry if it's not clear, English isn't my native language.

So to reiterate.
The normal, unoptimized path is:
World::update() simulates world which uses Random::get() to generate a random
number. Random::get() calls std::uniform_int_distribution, which then calls
std::mersenne_twister_engine().
And this is what GCC7 uses.

Now GCC6 with LTO manages to optimize the path by:
Calling Random::get() only once per frame.
Omitting std::uniform_int_distribution, calling std::mersenne_twister_engine()
instead every time a random number is needed.
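
To make the two call shapes described above concrete, here is a minimal
sketch. The function names drawViaDistribution and drawDirect are made up for
illustration, and std::mt19937 is assumed as the concrete Mersenne Twister
type. It only mirrors what the profile shows and is not a claim about what
either compiler actually emits.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Call shape reported for GCC7: every draw goes through the distribution
// object, which then pulls bits from the engine.
int drawViaDistribution(std::mt19937 &engine, int low, int high) {
    assert(low <= high);
    std::uniform_int_distribution<int> dist(low, high);
    return dist(engine);
}

// Call shape reported for GCC6: the caller reaches the engine's operator()
// directly, with no wrapper call in between.
std::uint32_t drawDirect(std::mt19937 &engine) {
    return static_cast<std::uint32_t>(engine());
}
```

In source code the first form is still the right one to write when a bounded
value is needed; the second form is roughly what remains visible in a profile
once the wrapper layers have been inlined.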

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-15 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #18 from krzysio.kurek at wp dot pl ---
I'm sorry if it's not clear, English isn't my native language.

So to reiterate.
The normal, unoptimized path is:
World::update() simulates world which uses Random::get() on every person.
Random::get() calls

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-04-14 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

krzysio.kurek at wp dot pl changed:

   What|Removed |Added

Version|7.1.0   |7.2.0

--- Comment #17 from krzysio.kurek at wp dot pl ---
I've been trying to create a minimal testcase but I've not been able to.
Seemingly random actions, like changing out of the build directory and back
into it, prevent the bug from manifesting.
Looking at results of Valgrind's tool, Callgrind, I've managed to find the hog
being Random::get() calls not being optimized as aggressively on GCC7 as on
GCC6.
On GCC6, std::mersenne_twister_engine is being called directly from the main
world simulation loop, whereas on GCC7, Random::get() is called every time
calculations on random numbers are made. Random::get() in turn calls
std::uniform_int_distribution which then calls the mersenne engine.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2018-02-16 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |INVALID

--- Comment #16 from Martin Liška  ---
I'm closing this, as analysis is needed on the side of the program. Feel free
to reopen once you have analyzed what the runtime hog is.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-20 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #15 from Martin Liška  ---
I would recommend using the perf tool to identify and understand what happens
in the program. I can return to this later, when we tune GCC 8.x.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #14 from Martin Liška  ---
So for GCC 7 the drop is caused by r237791:

SVN revision: 237791
Author: hubicka

* gcc.dg/predict-12.c: New testcase.

* predict.c: Include gimple-pretty-print.h
(predicted_by_loop_heuristics_p): Check also
PRED_LOOP_EXIT_WITH_RECURSION
(predict_loops): Find self recursive calls and use special purpose
predictors for them; dump log about decisions.
(pass_profile::execute): Dump info about #of iterations.
* predict.def (PRED_LOOP_EXIT_WITH_RECURSION,
(PRED_LOOP_GUARD_WITH_RECURSION): New predictors.

and the performance for current trunk is similar to GCC 6.x, starting from
r248334:

Author: hubicka
* ipa-inline.c (edge_badness): Use inlined_time instead of
inline_summaries->get.

That said it's sensitive to inlining.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #13 from Martin Liška  ---
One more observation: using PGO (-fprofile-generate and -fprofile-use), both
GCC 6 and GCC 7 have similar performance: 34 fps, whereas GCC 6 without PGO
reaches 43 fps on my machine.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

 Status|WAITING |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |marxin at gcc dot gnu.org

--- Comment #12 from Martin Liška  ---
(In reply to krzysio.kurek from comment #11)
> Done, cmake will now default to Release config with altered compiler flags
> that include -flto.

Thanks. I can confirm I can reproduce it on my laptop. Tomorrow, I'll bisect
that.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #11 from krzysio.kurek at wp dot pl ---
Done, cmake will now default to Release config with altered compiler flags that
include -flto.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

 Status|NEW |WAITING

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-19 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Martin Liška  changed:

   What|Removed |Added

 Status|WAITING |NEW

--- Comment #10 from Martin Liška  ---
I can confirm that I'm able to build it and to see the FPS on the console.
Please modify the branch so that it uses -flto and the optimization flags you
use.

Thanks,
Martin

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #9 from Andrew Pinski  ---
Does -fno-threadsafe-statics help here?

Though that should not make a difference here as GCC 6 defaults to the same as
GCC 7 in this area.

Also does moving "static Random r;" outside of Random::get() help?
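
For reference, the two placements of the static object could look roughly
like this. The Random struct, its next member and the fixed seed are
hypothetical stand-ins; only the "static Random r;" line comes from the
question above.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for the project's Random class.
struct Random {
    std::mt19937 engine{12345u};  // fixed seed, assumed for the sketch
    int next(int low, int high) {
        assert(low <= high);
        return std::uniform_int_distribution<int>(low, high)(engine);
    }
};

// Variant 1: function-local static. The compiler emits a guard check on
// every call (thread-safe unless -fno-threadsafe-statics is given).
Random &getLocalStatic() {
    static Random r;
    return r;
}

// Variant 2: the object lives at namespace scope and is initialized before
// main() runs, so the accessor needs no initialization guard.
static Random g_random;
Random &getNamespaceScope() {
    return g_random;
}
```

The second variant trades the lazy initialization of the first for a cheaper
accessor, at the cost of the usual initialization-order caveats for
namespace-scope objects.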

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #8 from krzysio.kurek at wp dot pl ---
Okay, I'm terribly sorry for the amount of messages but I'm failing over and
over again trying to grasp the issue.
The seed is only static for the first request, but changes later on becoming
random once more.
After making a few changes that seemed, at the time, minor and cosmetic, GCC6
optimized out at least one additional function completely.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #7 from krzysio.kurek at wp dot pl ---
I've created a fork with "devel" branch here:
https://github.com/kiroma/Empire/tree/devel
I've tweaked some things: the random seed is now fixed, the simulation exits
after 250 iterations, and on exit it prints the average fps for the whole run
to the console.
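
A benchmark mode of this kind can be sketched as follows. This is not the code
from the devel branch: simulateFrame, averageFps and the amount of work per
frame are hypothetical, chosen only to show a fixed seed, a fixed frame count
and an average computed over the whole run.

```cpp
#include <cassert>
#include <chrono>
#include <random>

// Hypothetical stand-in for one simulation frame; the real project would
// run its world update here.
inline unsigned simulateFrame(std::mt19937 &engine) {
    unsigned acc = 0;
    for (int i = 0; i < 1000; ++i) {
        acc += static_cast<unsigned>(engine() & 0xFFu);
    }
    return acc;
}

// Runs a fixed number of frames from a fixed seed and returns the average
// frames per second over the whole run.
double averageFps(unsigned seed, int frames) {
    assert(frames > 0);
    std::mt19937 engine(seed);
    volatile unsigned sink = 0;  // keeps the work from being optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int f = 0; f < frames; ++f) {
        sink = sink + simulateFrame(engine);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return frames / elapsed.count();
}
```

Calling averageFps(42u, 250) and printing the result on exit would mirror the
250-iteration run described above. Because the seed and the frame count are
fixed, two compilers execute the same work and only the timing differs.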

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #6 from krzysio.kurek at wp dot pl ---
Oh, two comments about the callgrind output.
Random::get() is called 14 times less often on GCC6.
GCC7 also contains a function instantiated from a template that is completely
absent on GCC6.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread krzysio.kurek at wp dot pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #5 from krzysio.kurek at wp dot pl ---
Created attachment 42199
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42199&action=edit
callgrind output

I can make the program output to console average fps after 20 iterations with
10 warmup cycles. The program will still run with GUI, but will output just FPS
to the console and exit. Will this be sufficient?
I'm a complete newbie to the whole language, but I got some profiling done. I'm
attaching callgrind output files, they're named after which compiler was used.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

--- Comment #4 from Martin Liška  ---
Moreover, if there's an automated way to get the FPS without GUI interaction,
I can bisect to the revision that's responsible for this.

[Bug lto/82229] GCC7's LTO underperforms compared to GCC6

2017-09-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82229

Richard Biener  changed:

   What|Removed |Added

   Keywords||lto, missed-optimization
 Status|UNCONFIRMED |WAITING
   Last reconfirmed||2017-09-18
 Ever confirmed|0   |1

--- Comment #3 from Richard Biener  ---
I suspect changes in cross-module inlining causing the issue.  Without some
analysis from your side it's really hard to do anything about this.  A good
start is to do some profiling.