Hi, thanks for looking into this, I only have one question:
On Thu, Dec 10, 2015 at 08:30:37AM +0100, Jan Hubicka wrote: > Martin, > while looking into the ipa-cp dumps for bzip and Firefox I noticed few issues. > First of all, ipcp_cloning_candidate_p calls > optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl)) > which can not be used at WPA time, becuase we have no DECL_STRUCT_FUNCTION > around. I replaced it by node->optimize_for_size_p (). > > Second we perform incredible number of clones because we do obtain some sort > of > polymorphic call context for them. In wast majority of cases this is useless > effort, because the functions in question do not contain virtual calls and do > not pass the parameter further. For firefox about 40k out of 50k clones > created are created just because we found some context. > > I changed the code to only clone if this immediately leads to > devirtualization. > This do not cause any noticeable drop in number of devirtualized calls on > Firefox. I suppose we will miss the case where cloning a caller may allow > devirtualization in a clone of callee, but I do not think the heuristics for > context independent values can handle this as implemented right now and it > simply have way to many false positives. > > What we can do is to devirtualize w/o cloning for local functions and > speculatively devirtualize in case we would otherwise clone. > > Third problem I noticed is that > will_be_removed_from_program_if_no_direct_calls_p is used to decide if we can > ignore the function size when deciding about the code size impact. > This function is doing some analysis for inliner where it, for example, > analyses > if a comdat which is going to be inlined consistently in the whole program > will be removed. > > In the cloning case I do not see this to apply: we have no evidence that the > other units will pass the same constants to the function. I think you > basically want to assume that the function will be removed if it has no > address taken and it is not externally visibible. This is what local flag > is for. > > I gathered some stats: > > number of clones for all contexts: 49948->11102 > number of clones: 4376->4383 > > good_cloning_opportunity_p is called about 70k times, I wonder if the > thresholds are not simply set too high. For example, inliner does about 300k > inlines at Firefox. > > number of param replacements: 13041-> 13056 + 5383 aggregate replacements (I > do not have data on unpatched tree for this) > number of devirts: 956->933 > number of devirts happening at inline: 781->868 > number of indirect calls promoted: 512->512 > > Inliner stats from: Unit growth for small function inlining: 7965701->9130051 > (14%) > to: Unit growth for small function inlining: 7965010->9138577 > > So it seems that except for large drop in number of clones there is no > significant difference. > > I am bootstrapping/regtesting this on x86_64-linux, does it seem OK? > > Honza > > * ipa-cp.c (ipcp_cloning_candidate_p): Use node->optimize_for_size_p. > (good_cloning_opportunity_p): Likewise. > (gather_context_independent_values): Do not return true when > polymorphic call context is known or when we have known aggregate > value of unused parameter. > (estimate_local_effects): Try to create clone for all context > when either some params are substituted or devirtualization is possible > or some params can be removed; use local flag instead of > node->will_be_removed_from_program_if_no_direct_calls_p. > (identify_dead_nodes): Likewise. > Index: ipa-cp.c > =================================================================== > --- ipa-cp.c (revision 231477) > +++ ipa-cp.c (working copy) > @@ -613,7 +613,7 @@ ipcp_cloning_candidate_p (struct cgraph_ > return false; > } > > - if (!optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl))) > + if (node->optimize_for_size_p ()) > { > if (dump_file) > fprintf (dump_file, "Not considering %s for cloning; " > @@ -2267,7 +2267,7 @@ good_cloning_opportunity_p (struct cgrap > { > if (time_benefit == 0 > || !opt_for_fn (node->decl, flag_ipa_cp_clone) > - || !optimize_function_for_speed_p (DECL_STRUCT_FUNCTION (node->decl))) > + || node->optimize_for_size_p ()) > return false; > > gcc_assert (size_cost > 0); > @@ -2387,12 +2387,14 @@ gather_context_independent_values (struc > *removable_params_cost > += ipa_get_param_move_cost (info, i); > > + if (!ipa_is_param_used (info, i)) > + continue; > + Is this really necessary, is it not enough to remove the assignment to ret below? If the parameter is not used, devirtualization time bonus, which you then rely on estimate_local_effects, should be zero for it. It is a very minor point, I suppose, but if the function gets cloned for a different reason, it might still be beneficial to have as much context-independent information for it as possible too, because that can then be used in a callee (see the second call of gather_context_independent_values). Other than that, all the changes seem like a clear improvement. Thanks, Martin > ipcp_lattice<ipa_polymorphic_call_context> *ctxlat = &plats->ctxlat; > + /* Do not account known context as reason for cloning. We can see > + if it permits devirtualization. */ > if (ctxlat->is_single_const ()) > - { > - (*known_contexts)[i] = ctxlat->values->value; > - ret = true; > - } > + (*known_contexts)[i] = ctxlat->values->value; > > if (known_aggs) > { > @@ -2494,7 +2496,9 @@ estimate_local_effects (struct cgraph_no > &known_contexts, > &known_aggs, > &removable_params_cost); > known_aggs_ptrs = agg_jmp_p_vec_for_t_vec (known_aggs); > - if (always_const) > + int devirt_bonus = devirtualization_time_bonus (node, known_csts, > + known_contexts, known_aggs_ptrs); > + if (always_const || devirt_bonus || removable_params_cost) > { > struct caller_statistics stats; > inline_hints hints; > @@ -2505,8 +2509,7 @@ estimate_local_effects (struct cgraph_no > false); > estimate_ipcp_clone_size_and_time (node, known_csts, known_contexts, > known_aggs_ptrs, &size, &time, &hints); > - time -= devirtualization_time_bonus (node, known_csts, known_contexts, > - known_aggs_ptrs); > + time -= devirt_bonus; > time -= hint_time_bonus (hints); > time -= removable_params_cost; > size -= stats.n_calls * removable_params_cost; > @@ -2515,8 +2518,7 @@ estimate_local_effects (struct cgraph_no > fprintf (dump_file, " - context independent values, size: %i, " > "time_benefit: %i\n", size, base_time - time); > > - if (size <= 0 > - || node->will_be_removed_from_program_if_no_direct_calls_p ()) > + if (size <= 0 || node->local.local) > { > info->do_clone_for_all_contexts = true; > base_time = time; > @@ -2544,6 +2546,10 @@ estimate_local_effects (struct cgraph_no > "max_new_size would be reached with %li.\n", > size + overall_size); > } > + else if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, " Not cloning for all contexts because " > + "!good_cloning_opportunity_p.\n"); > + > } > > for (i = 0; i < count ; i++) > @@ -4419,7 +4425,7 @@ identify_dead_nodes (struct cgraph_node > { > struct cgraph_node *v; > for (v = node; v ; v = ((struct ipa_dfs_info *) v->aux)->next_cycle) > - if (v->will_be_removed_from_program_if_no_direct_calls_p () > + if (v->local.local > && !v->call_for_symbol_thunks_and_aliases > (has_undead_caller_from_outside_scc_p, NULL, true)) > IPA_NODE_REF (v)->node_dead = 1;