On October 18, 2018 11:05:32 PM GMT+02:00, Richard Sandiford <richard.sandif...@arm.com> wrote: >Richard Biener <rguent...@suse.de> writes: >> On Thu, 18 Oct 2018, Richard Sandiford wrote: >> >>> Richard Biener <rguent...@suse.de> writes: >>> > PR63155 made me pick up this old work from Steven, it turns our >>> > linked-list implementation to a two-mode one with one being a >>> > splay tree featuring O(log N) complexity for find/remove. >>> > >>> > Over Stevens original patch I added a bitmap_tree_to_vec helper >>> > that I use from the debug/print methods to avoid changing view >>> > there. In theory the bitmap iterator could get a "stack" >>> > as well and we could at least support EXECUTE_IF_SET_IN_BITMAP. >>> > >>> > This can be used to fix the two biggest bottlenecks in the PRs >>> > testcase, namely SSA propagator worklist handling and out-of-SSA >>> > coalesce list building. perf shows the following data, first >>> > unpatched, second patched - also watch the thrid coulumn (samples) >>> > when comparing percentages. >>> > >>> > -O0 >>> > - 18.19% 17.35% 407 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 8.77% create_coalesce_list_for_region > ▒ >>> > + 4.21% calculate_live_ranges > ▒ >>> > + 2.02% build_ssa_conflict_graph > ▒ >>> > + 1.66% insert_phi_nodes_for > ▒ >>> > + 0.86% coalesce_ssa_name >>> > patched: >>> > - 12.39% 10.48% 129 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 5.27% calculate_live_ranges > ▒ >>> > + 2.76% insert_phi_nodes_for > ▒ >>> > + 1.90% create_coalesce_list_for_region > ▒ >>> > + 1.63% build_ssa_conflict_graph > ▒ >>> > + 0.35% coalesce_ssa_name >>> > >>> > -O1 >>> > - 17.53% 17.53% 842 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 12.39% add_ssa_edge > ▒ >>> > + 1.48% create_coalesce_list_for_region > ▒ >>> > + 0.82% solve_constraints > ▒ >>> > + 0.71% calculate_live_ranges > ▒ >>> > + 0.64% add_implicit_graph_edge > ▒ >>> > + 0.41% insert_phi_nodes_for > ▒ >>> > + 0.34% build_ssa_conflict_graph >>> > patched: >>> > - 5.79% 5.00% 167 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 1.41% add_ssa_edge > ▒ >>> > + 0.88% calculate_live_ranges > ▒ >>> > + 0.75% add_implicit_graph_edge > ▒ >>> > + 0.68% solve_constraints > ▒ >>> > + 0.48% insert_phi_nodes_for > ▒ >>> > + 0.45% build_ssa_conflict_graph >>> > >>> > -O3 >>> > - 12.37% 12.34% 1145 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 9.14% add_ssa_edge > ▒ >>> > + 0.80% create_coalesce_list_for_region > ▒ >>> > + 0.69% add_implicit_graph_edge > ▒ >>> > + 0.54% solve_constraints > ▒ >>> > + 0.34% calculate_live_ranges > ▒ >>> > + 0.27% insert_phi_nodes_for > ▒ >>> > + 0.21% build_ssa_conflict_graph >>> > - 4.36% 3.86% 227 cc1 cc1 [.] >bitmap_set_b▒ >>> > - bitmap_set_bit > ▒ >>> > + 0.98% add_ssa_edge > ▒ >>> > + 0.86% add_implicit_graph_edge > ▒ >>> > + 0.64% solve_constraints > ▒ >>> > + 0.57% calculate_live_ranges > ▒ >>> > + 0.32% build_ssa_conflict_graph > ▒ >>> > + 0.29% mark_all_vars_used_1 > ▒ >>> > + 0.20% insert_phi_nodes_for > ▒ >>> > + 0.16% create_coalesce_list_for_region >>> > >>> > >>> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. >>> >>> What's the performance like for more reasonable testcases? E.g. is >there >>> a measurable change either way in --enable-checking=release for some >gcc >>> .iis at -g or -O2 -g? >> >> I did a quick check on my set of cc1files (still .i, .ii ones tend to >> be unusable quite quickly...). Most of them compile too quickly >> to make any difference appear other than noise. Multi-second >differences >> like for PR63155 should be the exception but our O(n) bitmap >> implementation really makes some parts of GCC quadratic where it >> doesn't appear so. >> >> Is there a reason you expect it to be ever slower? > >During recent stage3s I've tried to look at profiles of cc1plus >to see whether there was something easy we could do to improve >compile times. And bitmap operations always showed up near the >top of the profile. There were always some pathological queries >in which the linear search really hurt, but whenever I tried "simple" >ways to avoid the obvious cases, they made those queries quicker >but slowed things down overall. It seemed that adding any extra logic >to the queries hurted because even a small slowdown in common lookups >overwhelmed a big saving in less common lookups.
Yeah. I also noticed some 'obvious' shortcomings in the heuristics... I guess in the end well predicted branches in the out of line code are important... > >But there again I was looking to speed up more "normal" cases, not ones >like the PR. > >FWIW I've tried it on a local x86_64 box and it slows down current >optabs.ii at -O2 -g by ~0.35% (reproducable). I see similar slowdowns >for the other files I tried. But that's hardly a huge amount, and >probably a price worth paying for the speed-up in the PR. Just to make sure what to reproduce - this is with checking disabled? And with or without the hunks actually making use of the splay tree path? Richard. >Thanks, >Richard