> On 05/25/2017 01:22 PM, Markus Trippelsdorf wrote:
> > On 2017.05.25 at 11:55 +0200, Martin Liška wrote:
> >> Hi.
> >>
> >> As I spoke about the PGO with Honza and Richi, current 3-stage is not
> >> ideal for following 2 reasons:
> >>
> >> 1) stageprofile compiler is trained just on libraries that are built
> >>    during stage2
> >> 2) apart from that, as the compiler is also used to build the final
> >>    compiler, profile is being updated during the build. So the stage2
> >>    compiler is making different decisions.
> >>
> >> Both problems can be resolved by adding another step in between current
> >> stage2 and stage3 where we train stage2 compiler by building compiler
> >> with default options.
> >>
> >> I'm going to do some measurements.
> >
> > I did some measurements on gcc67 (trunk with --enable-checking=release).
> > The apparent speedup is in the noise.
>
> Hello.
>
> Thanks for measurements:
>
> I can see difference for GCC 7.1:
>
> g++-7 tramp3d-v4.ii -O2 && time for i in `seq 1 10` ; do g++-7 tramp3d-v4.ii -O2 ; done
>
> before: 2m25.133s
> after: real 2m25.133s
>
> which is 99.09124426480228%. It's probably within a noise level.
>
> And apparently file size of binary is bigger:
>
> before (using bloaty):
>
>    VM SIZE                         FILE SIZE
> --------------                   --------------
>  59.0%  15.1Mi .text             15.1Mi   62.3%
>  21.3%  5.45Mi .rodata           5.45Mi   22.5%
>   6.6%  1.69Mi .eh_frame         1.69Mi    6.9%
>   5.4%  1.38Mi .bss                   0    0.0%
>   3.3%   874Ki .dynstr            874Ki    3.5%
>   1.8%   480Ki .dynsym            480Ki    1.9%
>   1.1%   285Ki .eh_frame_hdr      285Ki    1.1%
>   0.6%   158Ki .gnu.hash          158Ki    0.6%
>   0.5%   144Ki .hash              144Ki    0.6%
>   0.2%  44.4Ki .data             44.4Ki    0.2%
>   0.2%  40.0Ki .gnu.version      40.0Ki    0.2%
>   0.0%  11.1Ki .rela.plt         11.1Ki    0.0%
>   0.0%  7.44Ki .plt              7.44Ki    0.0%
>   0.0%  4.56Ki .data.rel.ro      4.56Ki    0.0%
>   0.0%  3.73Ki .got.plt          3.73Ki    0.0%
>   0.0%      38 [Unmapped]        2.75Ki    0.0%
>   0.0%     624 [ELF Headers]     2.55Ki    0.0%
>   0.0%     848 [Other]           1.13Ki    0.0%
>   0.0%     917 .gcc_except_table    917    0.0%
>   0.0%     608 .dynamic             608    0.0%
>   0.0%      16 [None]                 0    0.0%
> 100.0%  25.7Mi TOTAL             24.3Mi  100.0%
>
> after:
>
>    VM SIZE                         FILE SIZE
> --------------                   --------------
>  58.3%  14.6Mi .text             14.6Mi   54.2%
>  21.6%  5.41Mi .rodata           5.41Mi   20.1%
>   0.0%       0 .strtab           2.13Mi    7.9%
>   6.7%  1.67Mi .eh_frame         1.67Mi    6.2%
>   5.5%  1.38Mi .bss                   0    0.0%
>   0.0%       0 .symtab           1.11Mi    4.1%
>   3.4%   876Ki .dynstr            876Ki    3.2%
>   1.9%   480Ki .dynsym            480Ki    1.7%
>   1.1%   280Ki .eh_frame_hdr      280Ki    1.0%
>   0.6%   158Ki .gnu.hash          158Ki    0.6%
>   0.6%   144Ki .hash              144Ki    0.5%
>   0.2%  44.4Ki .data             44.4Ki    0.2%
>   0.2%  40.1Ki .gnu.version      40.1Ki    0.1%
>   0.0%  11.1Ki .rela.plt         11.1Ki    0.0%
>   0.0%  7.44Ki .plt              7.44Ki    0.0%
>   0.0%  4.56Ki .data.rel.ro      4.56Ki    0.0%
>   0.0%  3.73Ki .got.plt          3.73Ki    0.0%
>   0.0%      58 [Unmapped]        3.11Ki    0.0%
>   0.0%     624 [ELF Headers]     2.61Ki    0.0%
>   0.0%  2.32Ki [Other]           2.60Ki    0.0%
>   0.0%      16 [None]                 0    0.0%
> 100.0%  25.1Mi TOTAL             26.9Mi  100.0%
>
> As I had chat with Honza, we still have problem in GCC that using current
> working sets, get_hot_bb_threshold () is very close to number of runs,
> which is effectively 1 for a single run. That's a mistake and that should
> be fixed.

Yep, with LTO+PGO bootstrap I think we also hit the problem that the PGO
inliner was never seriously tuned (we basically use the very first badness
metric I introduced and we never experimented with parameters). The reason
is that hot/cold partitioning, even when it is very coarse, works
reasonably well for the per-file compilation model. With LTO we are facing
very many inline decisions and there is probably a lot of low-hanging
fruit.

GCC is currently in transition to new profile counter code. I will push
out the initial patch retiring gcov_type soon (once I finish updating it to
the current tree - it is very annoying) and that will let us track hotness
more conservatively and fix the old problem that a count becomes
unrealistically low through broken profile updates and thus becomes cold.

This should make it possible to increase the threshold and start
re-tuning (hopefully this or next week).

Honza
>
> Martin
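
For readers following the thread, here is a toy sketch of how a threshold
derived from a working set can collapse to the run count. It is not GCC's
implementation of get_hot_bb_threshold (); the name hot_threshold, the
counter values, and the 999 permille cutoff are all assumptions chosen
only for illustration.

```python
def hot_threshold(counters, permille=999):
    """Return the smallest counter in the hottest set of counters that
    together cover `permille`/1000 of the total execution count.

    Hypothetical helper for illustration; not taken from GCC.
    """
    total = sum(counters)
    if total == 0:
        return 0
    covered = 0
    # Walk counters from hottest to coldest until the target share is covered.
    for count in sorted(counters, reverse=True):
        covered += count
        # Integer comparison avoids floating-point rounding.
        if covered * 1000 >= total * permille:
            return count
    return 0


# A profile dominated by two hot blocks: the threshold stays high.
skewed = [1_000_000, 500_000, 1000, 1, 1, 1, 1]
print(hot_threshold(skewed))   # 500000

# A flat profile from a single run: most of the total comes from blocks
# executed once, so the threshold drops to 1, i.e. the number of runs.
flat = [100, 50] + [1] * 1000
print(hot_threshold(flat))     # 1
```

In the second case every block executed at least once counts as hot,
which is the kind of degenerate threshold the quoted paragraph describes.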