[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #62 from Steven Bosscher steven at gcc dot gnu.org 2012-10-15 21:32:04 UTC --- Author: steven Date: Mon Oct 15 21:31:57 2012 New Revision: 192476 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=192476 Log: Backport from trunk (r190222): PR tree-optimization/54146 * ifcvt.c: Include pointer-set.h. (cond_move_process_if_block): Change type of then_regs and else_regs from alloca'd array to pointer_sets. (check_cond_move_block): Update for this change. (cond_move_convert_if_block): Likewise. * Makefile.in: Fix dependencies for ifcvt.o. Modified: branches/gcc-4_7-branch/gcc/ChangeLog branches/gcc-4_7-branch/gcc/Makefile.in branches/gcc-4_7-branch/gcc/ifcvt.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #61 from Jan Hubicka hubicka at ucw dot cz 2012-09-10 16:04:38 UTC --- I think it would be good to deprecate the flatten attribute... It still can be useful I think, if only for creating testcases with arbitrary large functions ;) I think it is generally useful and some of the HPC extensions do contain flatten equivalent. I do not see anything bad in compiler blowing up your machine when you explicitely ask him to do so via an attribute ;) Honza
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #60 from Steven Bosscher steven at gcc dot gnu.org 2012-09-04 10:50:47 UTC --- Another improvement that can be done for the test case, is to assign pseudo-register numbers such that the density of the DF_LR and DF_LIVE bitmaps increases. For the density I used the following definition here: size = (# list elts) * (# bits per elt) cardinality = (# bits set) density = cardinality / size The average density we achieve for the DF_LR_IN sets of the largest function is only 12%, and that number is misleading because there are a few elements with almost all bits set and many elements with only a few bits set. For instance there is one bitmap with the following statistics: size = 50 cardinality = 940 density = 14.7% but the per-element statistics show: bits/eltcount 120 219 32 42 1247 so excluding the 7 dense elements (124 bits set out of 128), the density is only 1.3%! The dense bits are for a bunch of SRA variables, the sparse ones for ivtmp variables. The ivtmp variables are live in mostly the same blocks as the SRA variables (10 blocks where the variables appear in DF_LR_IN of the block) so the packing is extremely inefficient. This is the cause for most of the slowness of DF on the test case: the bitmap chains are large and extremely sparse.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #59 from Steven Bosscher steven at gcc dot gnu.org 2012-09-02 22:54:34 UTC --- FWIW Martin: SRA blows up this test case's register pressure. Compiling with SRA enabled takes ~900s, but with -fno-tree-sra compile time almost halves. There are extremely long live ranges for SA.* variables created by SRA.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #56 from rguenther at suse dot de rguenther at suse dot de 2012-08-21 07:55:14 UTC --- On Mon, 20 Aug 2012, steven at gcc dot gnu.org wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution||WONTFIX --- Comment #55 from Steven Bosscher steven at gcc dot gnu.org 2012-08-20 19:30:34 UTC --- The remaining problem areas are all liveness calculation routines that are essentially inherent quadratic problems: DF liveness, IRA liveness, and out-of-SSA liveness. I think it would be good to deprecate the flatten attribute... It still can be useful I think, if only for creating testcases with arbitrary large functions ;)
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #57 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 13:34:28 UTC --- Author: rguenth Date: Tue Aug 21 13:34:19 2012 New Revision: 190562 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190562 Log: 2012-08-21 Richard Guenther rguent...@suse.de Backport from mainline 2012-08-16 Richard Guenther rguent...@suse.de PR middle-end/54146 * tree-ssa-loop-niter.c (find_loop_niter_by_eval): Free the exit vector. * ipa-pure-const.c (analyze_function): Use FOR_EACH_LOOP_BREAK. * cfgloop.h (FOR_EACH_LOOP_BREAK): Fix. * tree-ssa-structalias.c (handle_lhs_call): Properly free rhsc. * tree-ssa-loop-im.c (analyze_memory_references): Adjust. (tree_ssa_lim_finalize): Free all mem_refs. * tree-ssa-sccvn.c (extract_and_process_scc_for_name): Free scc when bailing out. * modulo-sched.c (sms_schedule): Use FOR_EACH_LOOP_BREAK. * ira-build.c (loop_with_complex_edge_p): Free loop exit vector. * graphite-sese-to-poly.c (scop_ivs_can_be_represented): Use FOR_EACH_LOOP_BREAK. 2012-08-17 Richard Guenther rguent...@suse.de * tree-sra.c (modify_function): Free redirect_callers vector. * ipa-split.c (split_function): Free args_to_pass vector. * tree-vect-stmts.c (vectorizable_operation): Do not pre-allocate vec_oprnds. (new_stmt_vec_info): Do not pre-allocate STMT_VINFO_SAME_ALIGN_REFS. * tree-vect-slp.c (vect_free_slp_instance): Free the instance. (vect_analyze_slp_instance): Free everything. (destroy_bb_vec_info): Free the SLP instances. 2012-08-17 Richard Guenther rguent...@suse.de * params.def (integer-share-limit): Decrease from 256 to 251, add rationale. 2012-08-21 Richard Guenther rguent...@suse.de * tree-ssa-loop-im.c (tree_ssa_lim_finalize): Properly free the affine expansion cache. Modified: branches/gcc-4_7-branch/gcc/ChangeLog branches/gcc-4_7-branch/gcc/cfgloop.h branches/gcc-4_7-branch/gcc/graphite-sese-to-poly.c branches/gcc-4_7-branch/gcc/ipa-pure-const.c branches/gcc-4_7-branch/gcc/ipa-split.c branches/gcc-4_7-branch/gcc/ira-build.c branches/gcc-4_7-branch/gcc/modulo-sched.c branches/gcc-4_7-branch/gcc/params.def branches/gcc-4_7-branch/gcc/tree-sra.c branches/gcc-4_7-branch/gcc/tree-ssa-loop-im.c branches/gcc-4_7-branch/gcc/tree-ssa-loop-niter.c branches/gcc-4_7-branch/gcc/tree-ssa-sccvn.c branches/gcc-4_7-branch/gcc/tree-ssa-structalias.c branches/gcc-4_7-branch/gcc/tree-vect-slp.c branches/gcc-4_7-branch/gcc/tree-vect-stmts.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #58 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot com 2012-08-21 13:56:27 UTC --- FWIW, I think all patches addressing parts of this bug are candidates for back-porting to release branches. They are all almost trivial.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Status|NEW |RESOLVED Resolution||WONTFIX --- Comment #55 from Steven Bosscher steven at gcc dot gnu.org 2012-08-20 19:30:34 UTC --- The remaining problem areas are all liveness calculation routines that are essentially inherent quadratic problems: DF liveness, IRA liveness, and out-of-SSA liveness. I think it would be good to deprecate the flatten attribute...
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #54 from Steven Bosscher steven at gcc dot gnu.org 2012-08-17 09:42:15 UTC --- Author: steven Date: Fri Aug 17 09:42:06 2012 New Revision: 190475 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190475 Log: PR middle-end/54146 * tree-ssa-loop-im.c (lim_bitmap_obstack): New bitmap_obstack. (memref_free): Don't free the bitmaps individually here. (mem_ref_alloc): Allocate the bitmaps on the new bitmap obstack. (analyze_memory_references): Likewise. (tree_ssa_lim_initialize): Initialize the new bitmap obstack. (tree_ssa_lim_finalize): Release it. * dse.c (dse_bitmap_obstack): New bitmap obstack. (dse_obstack): New obstack. (get_group_info): Allocate the bitmaps on the new bitmap obstack. (dse_step0): Allocate the scratch bitmap on reg_obstack. Initialize the new bitmap obstack and normal obstack. Use XNEWVEC for bb_table. (record_store): Allocate regs_set on reg_obstack. (dse_step1): Allocate regs_live on reg_obstack. (dse_step2_init): Allocate offset_map_n and offset_map_p on the new obstack. (dse_step3_scan): Allocate bitmaps on the new bitmap obstack. (dse_step3): Likewise. (dse_confluence_0): Likewise. (dse_confluence_n): Likewise. (dse_transfer_function): Likewise. (dse_step7): Destroy the new obstacks, and everything allocated on them, in one big sweep. (rest_of_handle_dse): Update. * cfgexpand.c (stack_var_bitmap_obstack): New bitmap obstack. (add_stack_var_conflict): Allocate bitmaps on it. (add_scope_conflicts_1): Likewise. (add_scope_conflicts): Likewise. (update_alias_info_with_stack_vars): Likewise. (init_vars_expansion): Move TREE_USED fiddling expand_used_vars. Initialize the new bitmap obstack. (fini_vars_expansion): Release it. (estimated_stack_frame_size): Use init_vars_expansion to set things up and always clean up at the end. (expand_used_vars): Do the TREE_USED trickery here. Always call fini_vars_expansion. * tree-ssa-live.h (struct tree_live_info_d): Make livein and liveout arrays of bitmap_head to avoid one indirection per bitmap access. (live_on_entry, live_on_exit, live_var_map, live_merge_and_clear, make_live_on_entry): Update. * tree-ssa-live.c (partition_view_bitmap): Don't double-free 'used'. (liveness_bitmap_obstack): New bitmap obstack. (remove_unused_locals): Use it to allocate all bitmaps on. Update for livein/liveout changes in tree-ssa-live.h. (delete_tree_live_info): Release the bitmap obstack. (loe_visit_block, live_worklist, set_var_live_on_entry, calculate_live_on_exit, dump_live_info): Update. (calculate_live_ranges): Initialize the bitmap. * tree-ssa-ter.c (ter_bitmap_obstack): New bitmap obstack. (new_temp_expr_table): Allocate bitmap on it. (make_dependent_on_partition, add_to_partition_kill_list, add_dependence, process_replaceable): Likewise. (find_replaceable_exprs): Initialize and release the new obstack here. * df-problems.c (df_lr_add_problem): Allocate persistent bitmap for out_of_date_transfer_functions on df_bitmap_obstack. (df_live_add_problem): Likewise. (df_chain_add_problem): Likewise. (df_word_lr_add_problem): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/cfgexpand.c trunk/gcc/df-problems.c trunk/gcc/dse.c trunk/gcc/tree-ssa-live.c trunk/gcc/tree-ssa-live.h trunk/gcc/tree-ssa-loop-im.c trunk/gcc/tree-ssa-loop-manip.c trunk/gcc/tree-ssa-ter.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #48 from Steven Bosscher steven at gcc dot gnu.org 2012-08-16 10:52:20 UTC --- Author: steven Date: Thu Aug 16 10:52:14 2012 New Revision: 190442 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190442 Log: PR middle-end/54146 * tree-flow.h (compute_global_livein): Remove prototype. * tree-into-ssa.c (compute_global_livein): Remove function. * tree-ssa-loop-manip.c: Include gimple-pretty-print.h. (find_sibling_superloop): New function. (compute_live_loop_exits): New function. (add_exit_phis_edge): Rename to add_exit_phi. Do not allow inserting a PHI in a block that is not a loop exit for VAR. Add dumping if TDF_DETAILS. (add_exit_phis_var): Rewrite. (add_exit_phis): Update. (get_loops_exits): Rewrite to return an array of per-loop exits rather than one bitmap with all loop exits. (find_uses_to_rename_bb): Ignore virtual PHI nodes. (rewrite_into_loop_closed_ssa): Update. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-flow.h trunk/gcc/tree-into-ssa.c trunk/gcc/tree-ssa-loop-manip.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Richard Guenther rguenth at gcc dot gnu.org changed: What|Removed |Added Keywords||memory-hog --- Comment #49 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-16 12:10:56 UTC --- We still use very much memory (full testcase doesn't fit in 4GB ram). With ... checkCGAL::Gmpfr(); //checkCGAL::Gmpfi(); //checkCGAL::QuotientCGAL::Gmpz (); //checkCGAL::Lazy_exact_ntCGAL::Gmpq (); //checkCORE::BigInt(); //checkCORE::BigRat(); //checkCORE::BigFloat(); //checkCORE::Expr(); I see (--enable-gather-detailed-mem-stats): Kind Nodes Bytes --- decls1116380 181227400 types 535841 90021288 blocks222522 17801760 stmts 457492068872 refs 854577 43178392 exprs2147881 95117920 constants 1401324176820 identifiers737106486480 vecs 2654190 131466648 binfos 481624873432 ssa names 339369 27149520 constructors 26426 634224 random kinds 1885177 73894984 lang_decl kinds 352117 13918872 lang_type kinds482187329136 omp clauses0 0 --- Total10490451 699345748 a lot of memory in TREE_VECs for some reason. GIMPLE statements Kind Stmts Bytes --- assignments 316019 30602376 phi nodes 54994 15777472 conditionals 260902504640 everything else 237509 23110772 --- Total 634612 71995260 gimple is lean, so is RTL ;) Alloc-pool Kind Elt size Pools Allocated (elts)Peak (elts)Leak (elts) -- live ranges40513 19250760(481269) 10800800( 270020) 0( 0) df_scan ref base 56 1026 331010008( 5910893) 14059808( 251068) 0( 0) df_scan ref artificial 64 1026 20113600(314275)4239872( 66248) 0( 0) df_scan ref regular64 1026 66557568( 1039962)4543872( 70998) 0( 0) are by far the biggest alloc-pool users. bitmap stats are confusing because they show leaks for bitmaps we free by releasing their obstack. DF and PTA bitmaps are largest. We leak some VECs ... c-family/c-pragma.c:619 (push_visibility)24: 0.0% 24 1: 0.0% cp/pt.c:471 (maybe_begin_member_template_process 24: 0.0% 24 1: 0.0% function.c:4513 (push_struct_function) 40: 0.0% 40 1: 0.0% vec.c:307 (vec_stack_p_reserve_exact_1) 40: 0.0% 40 1: 0.0% tree-ssa-loop-ivopts.c:3070 (multiplier_allowed_328: 0.0%608 3: 0.0% tree-ssa-loop-ivopts.c:3153 (get_address_cost) 328: 0.0%608 3: 0.0% tree-ssa-sccvn.c:745 (copy_reference_ops_from_re392: 0.0% 806232 102098: 4.6% cfgloop.h:583 (fel_init)480: 0.0%860 106: 0.0% c-family/c-pragma.c:1246 (c_register_pragma_1) 584: 0.0%696 4: 0.0% function.c:156 (push_function_context) 976: 0.0% 1200 8: 0.0% ira.c:3699 (find_moveable_pseudos) 1240: 0.0% 221128 513: 0.0% passes.c:2188 (execute_one_pass) 4360: 0.1% 655320 16466: 0.7% tree-ssa-structalias.c:3870 (handle_lhs_call) 9576: 0.2% 18360 133: 0.0% ipa-ref.c:55 (ipa_record_reference) 60184: 1.1% 327640 5813: 0.3% cfgloop.c:1143 (get_loop_exit_edges) 73184: 1.3% 157888 62221: 2.8% tree-into-ssa.c:940 (mark_phi_for_rewrite) 153360: 2.7% 164096 17: 0.0% cfgloop.c:1134 (get_loop_exit_edges) 166592: 3.0% 238712 11639: 0.5% ipa-reference.c:184 (set_reference_optimization_ 180248: 3.2% 248064 47: 0.0% tree-into-ssa.c:321 (get_ssa_name_ann) 627448:11.2% 716496 14: 0.0% tree-ssa-sccvn.c:3657 (extract_and_process_scc_f1246864:22.3%1291960 105903: 4.7% tree-ssa-loop-im.c:1562 (record_mem_ref_loc)1292560:23.1%1392576 55465: 2.5% tree-ssa-loop-im.c:1551 (record_mem_ref_loc)1771800:31.7%3141360 52717: 2.4%
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #50 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot com 2012-08-16 13:55:40 UTC --- On Thu, Aug 16, 2012 at 2:10 PM, rguenth at gcc dot gnu.org gcc-bugzi...@gcc.gnu.org wrote: bitmap stats are confusing because they show leaks for bitmaps we free by releasing their obstack. DF and PTA bitmaps are largest. The bitmap obstack stats are not very reliable anyway, I think. I've been using my own stats for this (with a size_t version of obstack_memory_used: +static size_t +obstack_memory_used2 (struct obstack *h) +{ + struct _obstack_chunk* lp; + size_t nbytes = 0; + + for (lp = h-chunk; lp != 0; lp = lp-prev) +{ + nbytes += (size_t) (lp-limit - (char *) lp); +} + return nbytes; +} + so that also the freed bitmap elements are counted). With that, you can show that the reg_obstack and bitmap_default_obstack grow up to several GB that isn't released between passes. An added problem is that IRA puts its bitmap on its own obstack (as it should, really) but that means the 3GB of free elements on the reg_obstack and bitmap_default_obstack remain unused. So on the machine I use for testing (cfarm gcc17), the memory footprint is reduced by 2GB (~25%) with this hack: Index: ira.c === --- ira.c (revision 190442) +++ ira.c (working copy) @@ -4132,6 +4132,12 @@ int max_regno_before_ira, ira_max_point_before_emit; int rebuild_p; + /* There shouldn't be anything on these obstacks. */ + bitmap_obstack_release (NULL); + bitmap_obstack_initialize (NULL); + bitmap_obstack_release (reg_obstack); + bitmap_obstack_initialize (reg_obstack); + if (flag_caller_saves) init_caller_save (); There is in general a lot of BITMAP_ALLOC(NULL) abuse in the compiler. I have patches to address the cases in tree-ssa-live.c and dse.c, and I intend to look at the tree-ssa-ter and cfgexpand cases this weekend.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #51 from rguenther at suse dot de rguenther at suse dot de 2012-08-16 14:06:06 UTC --- On Thu, 16 Aug 2012, stevenb.gcc at gmail dot com wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #50 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot com 2012-08-16 13:55:40 UTC --- On Thu, Aug 16, 2012 at 2:10 PM, rguenth at gcc dot gnu.org gcc-bugzi...@gcc.gnu.org wrote: bitmap stats are confusing because they show leaks for bitmaps we free by releasing their obstack. DF and PTA bitmaps are largest. The bitmap obstack stats are not very reliable anyway, I think. I've been using my own stats for this (with a size_t version of obstack_memory_used: +static size_t +obstack_memory_used2 (struct obstack *h) +{ + struct _obstack_chunk* lp; + size_t nbytes = 0; + + for (lp = h-chunk; lp != 0; lp = lp-prev) +{ + nbytes += (size_t) (lp-limit - (char *) lp); +} + return nbytes; +} + so that also the freed bitmap elements are counted). With that, you can show that the reg_obstack and bitmap_default_obstack grow up to several GB that isn't released between passes. An added problem is Hum, I thought we release those obstacks after each pass ...
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #52 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-16 14:27:59 UTC --- Author: rguenth Date: Thu Aug 16 14:27:51 2012 New Revision: 190445 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190445 Log: 2012-08-16 Richard Guenther rguent...@suse.de PR middle-end/54146 * tree-ssa-loop-niter.c (find_loop_niter_by_eval): Free the exit vector. * ipa-pure-const.c (analyze_function): Use FOR_EACH_LOOP_BREAK. * cfgloop.h (FOR_EACH_LOOP_BREAK): Fix. * tree-ssa-structalias.c (handle_lhs_call): Properly free rhsc. * tree-into-ssa.c (get_ssa_name_ann): Allocate info only when needed. * tree-ssa-loop-im.c (analyze_memory_references): Adjust. (tree_ssa_lim_finalize): Free all mem_refs. * tree-ssa-sccvn.c (extract_and_process_scc_for_name): Free scc when bailing out. * modulo-sched.c (sms_schedule): Use FOR_EACH_LOOP_BREAK. * ira-build.c (loop_with_complex_edge_p): Free loop exit vector. * graphite-sese-to-poly.c (scop_ivs_can_be_represented): Use FOR_EACH_LOOP_BREAK. Modified: trunk/gcc/ChangeLog trunk/gcc/cfgloop.h trunk/gcc/graphite-sese-to-poly.c trunk/gcc/ipa-pure-const.c trunk/gcc/ira-build.c trunk/gcc/modulo-sched.c trunk/gcc/tree-into-ssa.c trunk/gcc/tree-ssa-loop-im.c trunk/gcc/tree-ssa-loop-niter.c trunk/gcc/tree-ssa-sccvn.c trunk/gcc/tree-ssa-structalias.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Attachment #28020|0 |1 is obsolete|| --- Comment #53 from Steven Bosscher steven at gcc dot gnu.org 2012-08-16 23:04:04 UTC --- Created attachment 28039 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28039 More dedicated obstacks This, together with Richi's leak plugs from earlier today, brings peak memory down to 5GB. That's 5GB now, down from 9GB this morning. Not bad for a day's work.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Attachment #27930|0 |1 is obsolete|| Attachment #27946|0 |1 is obsolete|| Attachment #27953|0 |1 is obsolete|| Attachment #27957|0 |1 is obsolete|| Attachment #27979|0 |1 is obsolete|| --- Comment #46 from Steven Bosscher steven at gcc dot gnu.org 2012-08-15 14:33:27 UTC --- Created attachment 28020 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=28020 Faster rewrite_into_loop_closed_ssa This reduces time spent in rewrite_into_loop_closed_ssa to something too small to show up in the timevar measurements.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #47 from Steven Bosscher steven at gcc dot gnu.org 2012-08-15 15:07:05 UTC --- (In reply to comment #46) Created attachment 28020 [details] Faster rewrite_into_loop_closed_ssa After this patch, IRA is the only major bottle-neck left, although there are still a few passes that take more time than I think is reasonable (out-of-SSA in particular). New top 10 spenders: integrated RA : 227.36 (30%) usr 678602 kB (19%) ggc tree SSA incremental: 66.17 ( 9%) usr 151828 kB ( 4%) ggc out of ssa : 43.76 ( 6%) usr608 kB ( 0%) ggc tree PTA: 43.06 ( 6%) usr 27466 kB ( 1%) ggc df live regs: 35.03 ( 5%) usr 0 kB ( 0%) ggc df liveinitialized regs: 34.25 ( 5%) usr 0 kB ( 0%) ggc tree SSA rewrite: 18.37 ( 2%) usr 24888 kB ( 1%) ggc remove unused locals: 17.97 ( 2%) usr 0 kB ( 0%) ggc dominance computation : 15.06 ( 2%) usr 0 kB ( 0%) ggc tree CFG cleanup: 11.79 ( 2%) usr 1123 kB ( 0%) ggc
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #44 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-14 12:38:41 UTC --- Author: rguenth Date: Tue Aug 14 12:38:32 2012 New Revision: 190382 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190382 Log: 2012-08-14 Richard Guenther rguent...@suse.de PR tree-optimization/54146 * tree-ssa-pre.c (do_regular_insertion): Use a VEC indexed by pred edge index for avail. (do_partial_partial_insertion): Likewise. (insert_into_preds_of_block): Adjust. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-pre.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #45 from Marc Glisse glisse at gcc dot gnu.org 2012-08-14 13:20:55 UTC --- (In reply to comment #11) Marc, do you know where the use of the flatten attribute comes from in your code? Comes from the Eigen library, I'll talk to them about it and see if they can comment. They mostly deal with simple number types (double, float), so the behavior with heavy types might have been unnoticed. For the record: after doing some benchmarks, they have removed the attribute, it didn't improve performance anymore (though it probably used to in the past). I have to say I am extremely impressed by all the improvements you guys have done based on this bad code (which I realize is now just a handy example of big code with the flatten attribute high enough in the call chain). If you keep at it, next time I might not even notice there is something wrong with the code ;-)
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #42 from Steven Bosscher steven at gcc dot gnu.org 2012-08-10 06:32:37 UTC --- (In reply to comment #40) Quite an achivement that Steven managed to chase out all the other cases. Thanks for the compliment :-) I'm still working on the rewrite_into_loop_closed_ssa slowness. I will have a patch ready this weekend. Next: On to -O2 :-)
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #43 from Jan Hubicka hubicka at gcc dot gnu.org 2012-08-10 07:52:31 UTC --- Author: hubicka Date: Fri Aug 10 07:52:23 2012 New Revision: 190283 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190283 Log: PR middle-end/54146 * ipa-inline-transform.c (inline_call): Add UPDATE_OVERALL_SUMMARY parameter; honnor it. * ipa-inline.c (recursive_inlining): Update call of inline_call. (inline_small_functions): Likewise. (ipa_inline): Likewise. (inline_always_inline_functions): Likewise. (early_inline_small_functions): Likewise. (flatten_function): Do separate update of summary info. * ipa-inline.h (inline_update_overall_summary): Declare. (inline_call): Update. * ipa-inline-analysis.c (inline_merge_summary): Break out updating code to ... (inline_update_overall_summary): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline-analysis.c trunk/gcc/ipa-inline-transform.c trunk/gcc/ipa-inline.c trunk/gcc/ipa-inline.h
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #37 from Jan Hubicka hubicka at ucw dot cz 2012-08-10 03:45:31 UTC --- I suppose it's the old issue that we update fibheap keys along each inlining decision - and with flatten there are just very many ... Honza? Well, I killed most of updating for mainline and at flattening time the keys are not even computed. So it is probably some other bookeping. I will profile it. It also seems we flatten depth-first (thus inline leafs) instead of the other way around: orig_callee = callee; inline_call (e, true, NULL, NULL); if (e-callee != orig_callee) orig_callee-symbol.aux = (void *) node; flatten_function (e-callee, early); if (e-callee != orig_callee) orig_callee-symbol.aux = NULL; This means we will materialize all intermediate flattenings in functions that will not be reclaimed, right? flattening foo should inline everything into foo, but not affect remaining callee bodies. I do not follow here. Callee is already the inline clone, so we will not affect other copies of foo... Honza Richard. -- Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #38 from Jan Hubicka hubicka at ucw dot cz 2012-08-10 03:50:26 UTC --- I don't understand how inline_merge_summary is supposed to work, so I'm going to leave that one for Richi and Honza. Well, it produces inline summary for function that is result of inlininig based on summaries of the former functions. You can't just bypass it for flattening or the summary of the function after flatting will be off. Obviously the function does fair amount of propagation. I will try to speed it up or limit the bounds. Since the summaries are of constant size, we should not explode here. Honza
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #39 from Jan Hubicka hubicka at ucw dot cz 2012-08-10 03:53:46 UTC --- Martin, can you look at comment #14 and the patch? I think what we want to do in flatten_function is before inline_call (e, true, NULL, NULL); reset the edge predicates so that inline_merge_summary becomes very cheap. Well, this will still result in (conservatively) wrong inline summary for flattened function. I am not sure if flatten attribute should laso mean do not care about inlining of the flattened function much. I think we want to handle this generically even if it is harder for small function inliner to explode here because it does a lot less cascaded inlining. Unfortunately that beast seems to have no early out (but instead uses true_predicate () ...). Can we speed it up for the case where we just want fast operation rather than precise accounting of sizes/time in the inlined-to caller? I will look. Honza
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #40 from Jan Hubicka hubicka at gcc dot gnu.org 2012-08-10 04:34:54 UTC --- OK, my simple ^C profiling shows: #14 0x0091f17f in estimate_edge_size_and_time (e=0x7fffb7192d68, size=optimized out, time=0x7fffc04110b0, prob=1) at ../../gcc/ipa-inline-analysis.c:2183 #15 0x0091f23a in estimate_calls_size_and_time (node=0x7fffb7193af8, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2274 #16 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffb71939c0, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #17 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffb7193888, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #18 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffb7191c30, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #19 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffb7180c30, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #20 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffb7147d68, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #21 0x0091f2d1 in estimate_calls_size_and_time (node=0x7fffcbb039c0, size=0x7fffc04110b4, time=0x7fffc04110b0, possible_truths=4294967294, known_vals=0x0, known_binfos=0x0) at ../../gcc/ipa-inline-analysis.c:2277 #22 0x009232a1 in inline_merge_summary (edge=0x7fffb6fba068) at ../../gcc/ipa-inline-analysis.c:2687 #23 0x00ed0cdf in inline_call (e=0x7fffb6fba068, update_original=optimized out, new_edges=0x0, overall_size=0x0) at ../../gcc/ipa-inline-transform.c:246 #24 0x00ece3a9 in flatten_function (node=0x7fffb6fb9ea0, early=1 '\001') at ../../gcc/ipa-inline.c:1605 #25 0x00ece3bf in flatten_function (node=0x7fffb6fb4c30, early=1 '\001') at ../../gcc/ipa-inline.c:1608 #26 0x00ece3bf in flatten_function (node=0x7fffb6f87270, early=1 '\001') at ../../gcc/ipa-inline.c:1608 #27 0x00ece3bf in flatten_function (node=0x7fffcbb039c0, early=1 '\001') at ../../gcc/ipa-inline.c:1608 #28 0x00ece616 in early_inliner () at ../../gcc/ipa-inline.c:1881 so the time is actually not spent in the predicate merging logic, but in computing overall size/time of the function after inline operation is performed. This is because of following call in inline_merge_summary: estimate_calls_size_and_time (to, info-size, info-time, ~(clause_t)(1 predicate_false_condition), NULL, NULL); Here we simply recompute the size/time by walking all callees of the function and since this involve recursive walk of the inlined callees that are many we hit quadratic time complexity. Ugly. One (particularly ugly) workaround would be to make inline_merge_summary to skip updating after each step of flattening because flattening do not care, but that is quite a kludge (similar to Steven's patch but with one extra update on the very end). Other solution is to work hard enough to update everything incrementally, but it is not 100% trivial (that is why I simply recompute). I actually tought of this case and concluded that if we do so deep inline trees we will blow up somewhere else. Quite an achivement that Steven managed to chase out all the other cases. Honza
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #41 from Jan Hubicka hubicka at gcc dot gnu.org 2012-08-10 05:18:52 UTC --- Created attachment 27979 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=27979 Path for inliner slowness Hi, this is patch I am testing. After some consideration I do not like the incremental update idea much. It asks for bookkeeping issues and has roundoff problems so unless it is really necessary I will try to stick with the recomputation scheme. This is more careful variant of Steven's patch. We skip the recomputation after each incremental inline and update the info once we are done with the flattening. Honza
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Component|tree-optimization |middle-end --- Comment #31 from Steven Bosscher steven at gcc dot gnu.org 2012-08-08 06:28:16 UTC --- Author: steven Date: Wed Aug 8 06:28:10 2012 New Revision: 190222 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190222 Log: PR middle-end/54146 * ifcvt.c: Include pointer-set.h. (cond_move_process_if_block): Change type of then_regs and else_regs from alloca'd array to pointer_sets. (check_cond_move_block): Update for this change. (cond_move_convert_if_block): Likewise. * Makefile.in: Fix dependencies for ifcvt.o. Modified: trunk/gcc/ChangeLog trunk/gcc/Makefile.in trunk/gcc/ifcvt.c --- Comment #32 from Steven Bosscher steven at gcc dot gnu.org 2012-08-08 06:29:16 UTC --- Author: steven Date: Wed Aug 8 06:29:12 2012 New Revision: 190223 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190223 Log: PR middle-end/54146 * ira.c (init_live_subregs): Take live_subregs_used as a bitmap. (build_insn_chain): Make live_subregs_used a bitmap. Use SBITMAP_SIZE to ignore the paradoxical bytes of subregs. Use sbitmap_free to free the live_subreg sbitmaps. Modified: trunk/gcc/ChangeLog trunk/gcc/ira.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added Component|tree-optimization |middle-end --- Comment #32 from Steven Bosscher steven at gcc dot gnu.org 2012-08-08 06:29:16 UTC --- Author: steven Date: Wed Aug 8 06:29:12 2012 New Revision: 190223 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190223 Log: PR middle-end/54146 * ira.c (init_live_subregs): Take live_subregs_used as a bitmap. (build_insn_chain): Make live_subregs_used a bitmap. Use SBITMAP_SIZE to ignore the paradoxical bytes of subregs. Use sbitmap_free to free the live_subreg sbitmaps. Modified: trunk/gcc/ChangeLog trunk/gcc/ira.c
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #33 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-08 08:09:25 UTC --- (In reply to comment #29) Created attachment 27957 [details] Do not traverse sibling loops The idea here is to note that for a nested loop we know for sure that the loop exits into a common loop father must all need PHI nodes, because we have have loop pre-headers. I'm not sure what the effect of this is on irreducible loops, and I haven't tested this much yet. Comments sought... Nice results. A few comments - if you rely on preheaders then please assert in rewrite_into_loop_closed_ssa that loops_state_satisfies_p (LOOPS_HAVE_PREHEADERS). Irreducible loops are one loop to cfgloop and to loop-closed SSA form, so they should be fine. What do you mean by for a nested loop we know for sure that the loop exits into a common loop father must all need PHI nodes? What happens for nested loops that do not exit into a common loop father? That is what's common here? I think what you want to say is that a loop cannot exit to a child of a sibling of a father, thus: for (;;) for (;;) goto x; for (;;) x: cannot happen in the sense that loop detection would not detect the loop nest as written (but had a single irreducible loop)? I think you should simply move compute_global_livein to its single use and make it static.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #34 from Steven Bosscher steven at gcc dot gnu.org 2012-08-08 10:10:46 UTC --- (In reply to comment #33) I think you should simply move compute_global_livein to its single use and make it static. Yes, and I need to add the same smarts there as in find_uses_to_rename_use to look through loops at the same nesting level, because this test case fails: extern unsigned use (unsigned); void foo (void) { unsigned i, j; do { i = use (0); } while (i); do { j = use (0); } while (j); if (i) use (j); } It fails in check_loop_closed_ssa_use at -O2 but passes at -O2 -fno-tree-vrp, so VRP is doing something destroying the initially correct loop-closed SSA form.
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #35 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-08 11:49:21 UTC --- (In reply to comment #34) (In reply to comment #33) I think you should simply move compute_global_livein to its single use and make it static. Yes, and I need to add the same smarts there as in find_uses_to_rename_use to look through loops at the same nesting level, because this test case fails: extern unsigned use (unsigned); void foo (void) { unsigned i, j; do { i = use (0); } while (i); do { j = use (0); } while (j); if (i) use (j); } It fails in check_loop_closed_ssa_use at -O2 but passes at -O2 -fno-tree-vrp, so VRP is doing something destroying the initially correct loop-closed SSA form. Jump threading I suppose. I think you can drop loop-closed SSA form right before calling vrp_finalize ().
[Bug middle-end/54146] Very slow compile with attribute((flatten))
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54146 --- Comment #36 from Steven Bosscher steven at gcc dot gnu.org 2012-08-08 17:39:49 UTC --- Author: steven Date: Wed Aug 8 17:39:46 2012 New Revision: 190235 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190235 Log: PR middle-end/54146 * gimpify.c (gimplify_body): Only verify_gimple_in_seq with checking enabled. * tree-ssa-loop-manip.c (add_exit_phis_var): Assert that var is a gimple_reg if checking is enabled. (find_uses_to_rename_stmt): Only look at non-virtual USE operands. * tree-into-ssa (compute_global_livein): Change the worklist type from an array to a VEC. Modified: trunk/gcc/ChangeLog trunk/gcc/gimplify.c trunk/gcc/tree-into-ssa.c trunk/gcc/tree-ssa-loop-manip.c