>       * Makefile.in: Add ipa-reorder.o.
>       * cgraph.c (cgraph_node::dump): Dump
>       text_sorted_order.
>       (cgraph_node_cmp_by_text_sorted):
>       New function that sorts functions based
>       on text_sorted_order.
>       * cgraph.h (cgraph_node): Add text_sorted_order.
>       (cgraph_node_cmp_by_text_sorted): New.
>       * cgraphclones.c (cgraph_node::create_clone):
>       Clone also text_sorted_order.
>       * cgraphunit.c (node_cmp): Remove.
>       (expand_all_functions): Use new function
>       cgraph_node_cmp_by_text_sorted.
>       * common.opt: Add new option reorder_functions_algorithm.
>       * flag-types.h (enum reorder_functions_algorithm):
>       New enum.
>       * ipa-reorder.c: New file.
>       * lto-cgraph.c (lto_output_node): Stream in and out
>       text_sorted_order.
>       (input_node): Likewise.
>       * passes.def: Add pass_ipa_reorder.
>       * timevar.def (TV_IPA_REORDER): New.
>       * tree-pass.h (make_pass_ipa_reorder): New.
>       * varasm.c (default_function_section): Assign text.sorted.X
>       section.
> 
> gcc/lto/ChangeLog:
> 
> 2019-11-25  Martin Liska  <mli...@suse.cz>
> 
>       * lto-partition.c (node_cmp): Remove.
>       (lto_balanced_map): Use new cgraph_node_cmp_by_text_sorted.
> +/* Sort cgraph_nodes by text_sorted_order if available, or by order.  */
> +
> +int
> +cgraph_node_cmp_by_text_sorted (const void *pa, const void *pb)
> +{
> +  const cgraph_node *a = *(const cgraph_node * const *) pa;
> +  const cgraph_node *b = *(const cgraph_node * const *) pb;
> +
> +  /* Functions with text_sorted_order should be before those
> +     without a profile.  */
> +  if (a->text_sorted_order == 0 || b->text_sorted_order == 0)
> +    return a->text_sorted_order - b->text_sorted_order;
> +
> +  return a->text_sorted_order != b->text_sorted_order
> +      ? b->text_sorted_order - a->text_sorted_order
> +      : b->order - a->order;
> +}

We now have tp_first_run_node_cmp, which needs to be updated and renamed
instead of adding a new comparator.
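To illustrate the intended semantics, here is a standalone sketch of such a merged comparator (mock `node` struct, not the real cgraph_node; the exact tie-breaking on `order` is an assumption):

```cpp
#include <cstdlib>
#include <cassert>

/* Minimal stand-in for cgraph_node; only the fields the comparator needs.  */
struct node
{
  int text_sorted_order;       /* 0 means no profile-based order assigned.  */
  int order;                   /* Definition order, used as a tie breaker.  */
};

/* Sort nodes with a text_sorted_order first (ascending); nodes without
   one go after all ordered nodes, falling back to definition order.  */
static int
node_cmp_by_text_sorted (const void *pa, const void *pb)
{
  const node *a = *(const node *const *) pa;
  const node *b = *(const node *const *) pb;

  /* Exactly one of the two lacks an order: it sorts last.  */
  if ((a->text_sorted_order == 0) != (b->text_sorted_order == 0))
    return a->text_sorted_order == 0 ? 1 : -1;

  if (a->text_sorted_order != b->text_sorted_order)
    return a->text_sorted_order - b->text_sorted_order;

  return a->order - b->order;
}
```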
>    /* Time profiler: first run of function.  */
>    int tp_first_run;
> +  /* Order in .text.sorted.* section.  */
> +  int text_sorted_order;

tp_first_run probably should go into a summary, since it will live from
profiling to the ipa-reorder pass, where it will be copied to
text_sorted_order? Or how do these two interface?
> diff --git a/gcc/cgraphclones.c b/gcc/cgraphclones.c
> index a79491e0b88..f624cbee185 100644
> --- a/gcc/cgraphclones.c
> +++ b/gcc/cgraphclones.c
> @@ -372,6 +372,7 @@ cgraph_node::create_clone (tree new_decl, profile_count prof_count,
>    new_node->rtl = rtl;
>    new_node->frequency = frequency;
>    new_node->tp_first_run = tp_first_run;
> +  new_node->text_sorted_order = text_sorted_order;
Copy it also in duplicate_thunk_for_node and create_version_node.
I think we should refactor the copying code and share it in all places
where we duplicate a function.
> +freorder-functions-algorithm=
> +Common Joined RejectNegative Enum(reorder_functions_algorithm) Var(flag_reorder_functions_algorithm) Init(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING) Optimization
> +-freorder-functions-algorithm=[first-run|call-chain-clustering]      Set the used function reordering algorithm.
> +
> +Enum
> +Name(reorder_functions_algorithm) Type(enum reorder_functions_algorithm) UnknownError(unknown function reordering algorithm %qs)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(first-run) Value(REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN)
> +
> +EnumValue
> +Enum(reorder_functions_algorithm) String(call-chain-clustering) Value(REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING)

I think we should also have "definition-order" which will sort by
node->order.  For testing we may also want to have "random" which will
do something stupid like sorting by symbol name checksum so we could see
how these compare.
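Such a deterministic "random" order could sort by a checksum of the assembler name; a minimal sketch (FNV-1a chosen arbitrarily here, any stable string hash would do):

```cpp
#include <cstdint>
#include <cassert>

/* FNV-1a string hash; stable across runs, so sorting by it gives a
   reproducible but essentially arbitrary function order.  */
static uint32_t
name_checksum (const char *name)
{
  uint32_t h = 2166136261u;
  for (; *name; name++)
    {
      h ^= (unsigned char) *name;
      h *= 16777619u;
    }
  return h;
}
```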
> +/* The algorithm used for function reordering.  */
> +enum reorder_functions_algorithm
> +{
> +  REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN,
> +  REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING
> +};

Isn't this supposed to be generated by awk scripts?
> +
> +#define C3_CLUSTER_THRESHOLD 1024
Needs a description, and perhaps this should be a PARAM?
> +
> +struct cluster_edge;
> +
> +/* Cluster is set of functions considered in C3 algorithm.  */
Perhaps a high-level overview of the C3 algorithm with a reference would
be nice.
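For the record, C3 in a nutshell: start with one cluster per function, then greedily process call edges from hottest to coldest, appending the callee's cluster to the caller's unless a size cap would be exceeded, so hot caller/callee pairs end up adjacent in the layout. A toy sketch of just that merge loop (standalone, not the patch's data structures; a real implementation must also redirect edges whose endpoints were already merged, which is omitted here):

```cpp
#include <vector>
#include <algorithm>
#include <cassert>

struct toy_cluster
{
  std::vector<int> fns;  /* Function ids, in layout order.  */
  int size;
};

struct toy_edge { toy_cluster *caller, *callee; long count; };

/* One round of C3: visit edges from hottest to coldest and append the
   callee's cluster to the caller's cluster if the size cap allows.  */
static void
c3_merge (std::vector<toy_edge> &edges, int size_cap)
{
  std::sort (edges.begin (), edges.end (),
	     [] (const toy_edge &a, const toy_edge &b)
	     { return a.count > b.count; });
  for (toy_edge &e : edges)
    {
      if (e.caller == e.callee
	  || e.caller->size + e.callee->size > size_cap)
	continue;
      e.caller->fns.insert (e.caller->fns.end (),
			    e.callee->fns.begin (), e.callee->fns.end ());
      e.caller->size += e.callee->size;
      e.callee->fns.clear ();
      e.callee->size = 0;
    }
}
```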
> +/* Sort functions based on the first execution of the function.  */
> +
> +static void
> +sort_functions_by_first_run (void)
> +{
> +  cgraph_node *node;
> +
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (node->tp_first_run && !node->alias)
> +      node->text_sorted_order = node->tp_first_run;

Why you care about excluding aliases?  We also have thunks where
tp_first_run is ignored.  I am just curious if you got some ICE here.
> +}
> +
> +/* Compare clusters by density after they are established.  */
> +
> +static int
> +cluster_cmp (const void *a_p, const void *b_p)
> +{
> +  const cluster *a = *(cluster * const *)a_p;
> +  const cluster *b = *(cluster * const *)b_p;
> +
> +  unsigned fncounta = a->m_functions.length ();
> +  unsigned fncountb = b->m_functions.length ();
I think we still put an empty line after declarations.
> +  if (fncounta <= 1 || fncountb <= 1)
> +    return fncountb - fncounta;
> +
> +  sreal r = b->m_time * a->m_size - a->m_time * b->m_size;
> +  return (r < 0) ? -1 : ((r > 0) ? 1 : 0);
> +}
> +
> +/* Visit callgraph edge CS until we reach a real cgraph_node (not a clone).
> +   Record such edge to EDGES or traverse recursively.  */
> +
> +static void
> +visit_all_edges_for_caller (auto_vec<cluster_edge *> *edges,
> +                         cgraph_node *node, cgraph_edge *cs)
> +{
> +  if (cs->inline_failed)
> +    {
> +      gcov_type count;
> +      profile_count pcount = cs->count.ipa ();
Similarly here.
> +      /* A real call edge.  */
> +      if (!cs->callee->alias
> +       && cs->callee->definition
> +       && pcount.initialized_p ()
> +       && (count = pcount.to_gcov_type ()) > 0)
> +     {
> +       cluster *caller = (cluster *)node->aux;
> +       cluster *callee = (cluster *)cs->callee->aux;
> +       cluster_edge **cedge = callee->m_callers.get (caller);
> +
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +         fprintf (dump_file, "Adding edge:%s->%s:%" PRIu64 "\n",
> +                  node->dump_name (), cs->callee->dump_name (), count);
> +
> +       if (cedge != NULL)
> +         (*cedge)->m_count += count;
> +       else
> +         {
> +           cluster_edge *cedge = new cluster_edge (caller, callee, count);
> +           edges->safe_push (cedge);
> +           callee->m_callers.put (caller, cedge);
> +         }
> +     }
> +    }
> +  else
> +    {
> +      cgraph_node *clone = cs->callee;
> +      for (cgraph_edge *cs = clone->callees; cs; cs = cs->next_callee)
> +     visit_all_edges_for_caller (edges, node, cs);
> +    }
> +}

With FDO we have likely targets for indirect calls which may or may not
be speculative calls at this moment.  We may want to put that info into
summaries and use here, but that is for incremental change.

I think not seeing indirect call targets may be one of the main reasons
for clusters being mistakenly broken up.
> +
> +/* Sort functions based on call chain clustering, which is an algorithm
> +   mentioned in the following article:
> +   https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf

I would use normal citation format - research.fb.com may change :)

Ottoni, Guilherme, and Bertrand Maher. "Optimizing function placement
for large-scale data-center applications." Proceedings of the 2017
International Symposium on Code Generation and Optimization. IEEE Press,
2017.
> +   .  */
> +
> +static void
> +sort_functions_by_c3 (void)
> +{
> +  cgraph_node *node;
> +  auto_vec<cluster *> clusters;
> +
> +  /* Create a cluster for each function.  */
Putting thunks into their own clusters will do no good, at least until we
learn to output them independently.  So I would add !node->thunk.
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (!node->alias
> +     && !node->inlined_to)
> +      {
> +     if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "Adding node:%s\n", node->dump_name ());
> +
> +     ipa_size_summary *size_summary = ipa_size_summaries->get (node);
> +     ipa_fn_summary *fn_summary = ipa_fn_summaries->get (node);
> +     cluster *c = new cluster (node, size_summary->size, fn_summary->time);
> +     node->aux = c;
> +     clusters.safe_push (c);
> +      }

Either you want to walk from edge->callee to function_node, or you want
to insert all associated aliases and thunks into the cluster, I think.
> +
> +  auto_vec<cluster_edge *> edges;
> +
> +  /* Insert edges between clusters that have a profile.  */
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cgraph_node *node = clusters[i]->m_functions[0];
> +      for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee)
> +     visit_all_edges_for_caller (&edges, node, cs);
> +    }
> +
> +  /* Now insert all created edges into a heap.  */
> +  fibonacci_heap <uint64_t, cluster_edge> heap (0);
> +
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cluster *c = clusters[i];
> +      for (hash_map<cluster *, cluster_edge *>::iterator it
> +        = c->m_callers.begin (); it != c->m_callers.end (); ++it)
> +     {
> +       cluster_edge *cedge = (*it).second;
> +       cedge->m_heap_node = heap.insert (cedge->inverted_count (), cedge);
> +     }
> +    }
> +
> +  while (!heap.empty ())
> +    {
> +      cluster_edge *cedge = heap.extract_min ();
> +      cluster *caller = cedge->m_caller;
> +      cluster *callee = cedge->m_callee;
> +      cedge->m_heap_node = NULL;
> +
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +     {
> +       fprintf (dump_file, "Processing cluster edge: %p->%p, count: %"
> +                PRIu64 "\n", (void *)caller, (void *)callee, cedge->m_count);
> +       fprintf (dump_file, "  source functions (%u): ", caller->m_size);
> +       for (unsigned j = 0; j < caller->m_functions.length (); j++)
> +         fprintf (dump_file, "%s ", caller->m_functions[j]->dump_name ());
> +       fprintf (dump_file, "\n  target functions (%u): ", callee->m_size);
> +       for (unsigned j = 0; j < callee->m_functions.length (); j++)
> +         fprintf (dump_file, "%s ", callee->m_functions[j]->dump_name ());
> +       fprintf (dump_file, "\n");
> +     }
> +
> +      if (caller == callee)
> +     continue;
> +      if (caller->m_size + callee->m_size <= C3_CLUSTER_THRESHOLD)

Did you experiment with the cluster threshold value?  In a way, since we
have such a bad idea about size, perhaps we may just set it to infinity.
> +     {
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +         fprintf (dump_file, "  (clusters merged)\n");
> +
> +       caller->m_size += callee->m_size;
> +       caller->m_time += callee->m_time;
> +
> +       /* Append all cgraph_nodes from callee to caller.  */
> +       for (unsigned i = 0; i < callee->m_functions.length (); i++)
> +         caller->m_functions.safe_push (callee->m_functions[i]);

The right data structure for this is probably union-find.  Don't we have
a function to append one vector to another?
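For reference, a minimal union-find (disjoint-set forest) along the lines suggested above; a standalone sketch, not a GCC API:

```cpp
#include <vector>
#include <utility>
#include <cassert>

/* Disjoint-set forest with path halving and union by size: near-constant
   amortized cost per merge/find, which fits the cluster-merging loop.  */
struct union_find
{
  std::vector<int> parent, size;

  explicit union_find (int n) : parent (n), size (n, 1)
  {
    for (int i = 0; i < n; i++)
      parent[i] = i;
  }

  int find (int x)
  {
    while (parent[x] != x)
      {
	parent[x] = parent[parent[x]];  /* Path halving.  */
	x = parent[x];
      }
    return x;
  }

  /* Merge the sets containing A and B; return the new root.  */
  int merge (int a, int b)
  {
    a = find (a); b = find (b);
    if (a == b)
      return a;
    if (size[a] < size[b])
      std::swap (a, b);
    parent[b] = a;
    size[a] += size[b];
    return a;
  }
};
```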
> +
> +       callee->m_functions.truncate (0);

don't we want to free it?
> +
> +  /* Sort the candidate clusters.  */
> +  clusters.qsort (cluster_cmp);

We still have info about what code is used at startup, what code is
likely/unlikely.

After inlining we also may have a lot of functions that have an IPA count
of 0 but still have tp_first_run set.  So I suppose we may want to
experiment with this incrementally.

Also, merging clusters at the end even over edges profiled 0 times could
improve locality if the profile run is not representative.
> +  /* Assign .text.sorted.* section names.  */
I would say just text_sorted_order, since I think the sections are just
one way of arranging it.
> +  int counter = 1;
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    {
> +      cluster *c = clusters[i];
> +      if (c->m_functions.length () <= 1)
> +     continue;
> +
> +      for (unsigned j = 0; j < c->m_functions.length (); j++)
> +     {
> +       cgraph_node *node = c->m_functions[j];
> +
> +       if (dump_file)
> +         fprintf (dump_file, "setting: %d for %s with size:%d\n",
> +                  counter, node->dump_asm_name (),
> +                  ipa_size_summaries->get (node)->size);
> +       node->text_sorted_order = counter++;
> +     }
> +    }
> +
> +  /* Release memory.  */
> +  FOR_EACH_DEFINED_FUNCTION (node)
> +    if (!node->alias)
> +      node->aux = NULL;
> +
> +  for (unsigned i = 0; i < clusters.length (); i++)
> +    delete clusters[i];
> +
> +  for (unsigned i = 0; i < edges.length (); i++)
> +    delete edges[i];
> +}
> +
> +/* The main function for function sorting.  */
> +
> +static unsigned int
> +ipa_reorder (void)
> +{
> +  switch (flag_reorder_functions_algorithm)
> +    {
> +    case REORDER_FUNCTIONS_ALGORITHM_CALL_CHAIN_CLUSTERING:
> +      sort_functions_by_c3 ();
> +      break;
> +    case REORDER_FUNCTIONS_ALGORITHM_FIRST_RUN:
> +      sort_functions_by_first_run ();
> +      break;
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  return 0;
> +}
> +
> +const pass_data pass_data_ipa_reorder =
> +{
> +  IPA_PASS, /* type */
> +  "reorder", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_IPA_REORDER, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_ipa_reorder : public ipa_opt_pass_d
> +{
> +public:
> +  pass_ipa_reorder (gcc::context *ctxt)
> +    : ipa_opt_pass_d (pass_data_ipa_reorder, ctxt,
> +                   NULL, /* generate_summary */
> +                   NULL, /* write_summary */
> +                   NULL, /* read_summary */
> +                   NULL, /* write_optimization_summary */
> +                   NULL, /* read_optimization_summary */
> +                   NULL, /* stmt_fixup */
> +                   0, /* function_transform_todo_flags_start */
> +                   NULL, /* function_transform */
> +                   NULL) /* variable_transform */
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *);
> +  virtual unsigned int execute (function *) { return ipa_reorder (); }
> +
> +}; // class pass_ipa_reorder
> +
> +bool
> +pass_ipa_reorder::gate (function *)
> +{
> +  return flag_profile_reorder_functions && flag_profile_use && flag_wpa;

There is nothing special about flag_wpa.  We can also run it
meaningfully with -flto-partitions=none or with -fwhole-program.

I think we may
 1) default to node->order if no FDO is available
    I think it is better than our current completely backward ordering
    We may tune clustering to beat node->order too.
 2) if FDO but no LTO is available use tp_first_run since the sections
    combine well across translation units.
    This also should apply for incremental LTO builds but not for
    -fwhole-program builds
 3) if FDO and non-incremental LTO or -fwhole-program is available use
     clustering.
Moreover, with current experiences I think flag_profile_reorder_functions
and flag_reorder_functions_algorithm need to be Optimization in the
.opt file, and thus the code needs to be able to work meaningfully on
mixed units.  Something like place

1) functions with tp_first_run
2) functions built with clustering
3) remaining functions

and always do all three algorithms depending on opt flags.  More funnily,
because the preferred logic is not known until link time, I guess we want
to have flag_reorder_functions_algorithm with REORDER_DEFAULT which will
do the above choice based on presence of profile (tested by
node->count.ipa ().nonzero_p ()), tp_first_run, in_lto_p, and
flag_whole_program.
> +  order.qsort (cgraph_node_cmp_by_text_sorted);
> +  noreorder.qsort (cgraph_node_cmp_by_text_sorted);

This also needs updating after my mainline fixes. We do not want to sort
noreorder this way.
> +  cgraph_node *node = cgraph_node::get (decl);
> +  if (freq != NODE_FREQUENCY_UNLIKELY_EXECUTED && node->text_sorted_order > 0)
> +    {
> +      char section_name[64];
> +      sprintf (section_name, ".text.sorted.%010d",
> +            node->text_sorted_order);
> +      return get_named_text_section (decl, section_name, NULL);
> +    }
> +

I would still control this by a flag (again with a bit complicated
default).  First, there are non-GNU linkers, and second, it only works
well for one translation unit in the whole DSO unless we do time
profiling.  Also, when we go this way, we want to disable sorting in
cgraphunit.c so we do not lose late propagators for no benefit.  Finally,
I still think gas fragments would be useful here to save the extra cost
of alignments and also to be able to use short call/jmp instructions
across functions.  -ffunction-sections is not free.
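For what it is worth, the %010d zero padding in the hunk above is what keeps lexicographic section-name order in sync with numeric order (assuming the linker script sorts the .text.sorted.* input sections by name); a quick standalone check:

```cpp
#include <cstdio>
#include <cstring>
#include <cassert>

/* Format a section name the way the patch's default_function_section
   does; the fixed-width %010d makes strcmp order equal numeric order.  */
static void
section_name (char *buf, int order)
{
  std::sprintf (buf, ".text.sorted.%010d", order);
}
```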

We still have a problem with functions in COMDAT groups, right?  You
should be able to see them with -flto -fno-use-linker-plugin.

Honza
