> -----Original Message-----
> From: Jan Hubicka <[email protected]>
> Sent: 11 December 2025 20:56
> To: Prathamesh Kulkarni <[email protected]>
> Cc: [email protected]
> Subject: Re: [RFC] Enable time profile function reordering with
> AutoFDO
> 
> > Hi Honza,
> > Thanks for the review, please find responses inline below.
> >
> > > I assume that the autofdo assigns timestamps to offline function
> > > instances, so this can also fit into function_instance
> > > datastructure, but I also suppose you want to keep it separate so
> it is optional?
> > function_instance does store timestamp, and uses it as a key into
> timestamp_info_map.
> > My intent of keeping a separate data-structure timestamp_info_map
> was
> > to sort 64-bit timestamps into ascending order, and then map the
> sorted timestamp values to (1..N), and assign the mapped value to
> node->tp_first_run.
> > So we don't need to up size of node->tp_first_run to 64bit (and I
> > guess we won't need the full 64-bit range here?)
> 
> Normal time profiling works in a way that time is increased only when
> new function is executed, so 2^32 is the limit on number of executed
> functions in the program which is probably still large enough: if one
> does LTO with more than 2^32 symbols we will run out of DECL_UIDs and
> other things, though it may be possible to link such a large program.
> 
> At runtime, all computations and streaming happens in 64bit, so
> updating tp_first_run to 64bit is also an alternative, but I guess the
> compression is a reasonable approach until we will need to update
> other datastructures as well.
> > > Also does the feature work with normal partitioning? It should make
> > > those function with tp_first_run nonzero go into early partitions,
> > > but then we will likely get relatively lot of calls to functions
> > > with zero counts just because we lost track of them?
> > The results I reported have been with balanced partitioning. I am
> seeing roughly the same numbers on top of locality partitioning too.
> 
> Great, happy that this works on both settings.
> > >
> > > I also wonder if offlining can behave meaningfully here.  I.e.
> when
> > > function profile is non-zero just copy first run from the caller.
> > IIUC, the purpose of afdo_offline pass is to remove cross module
> > inlining from function_instances so AutoFDO annotation gets a better
> mapping from source locations to CFG ?
> 
> Yes, old code assumed all relevant inlining to happen at early inline
> time via the afdo-inline path.  All inlining that did not happen led
> to lost profile data.  With LTO this becomes quite problematic since a
> lot of inlining is cross-module, so we need to take inline instances
> out and merge them with existing offline instances in order to get
> realistic profile before the link-time IPA-profile pass.
> >
> > Currently, the patch only assigns timestamps to functions whose
> outline copy is executed during train run (toplevel).
> > Would afdo offlining help for instance if a function is inlined into
> > all it's callers during train run (thus the outline copy has no
> timestamp info), but during AutoFDO, it may not get inlined into all
> it's callers ? Currently in this case, I guess we lose timestamp info
> (and thus time profile reordering for such a function).
> >
> > Currently the transform is gated explicitly on -fauto-profile -
> fprofile-reorder-functions (so users can opt-in instead of being
> enabled by default with -fauto-profile).
> > Should that be OK for gcc-16 ?
> 
> I think we want to do that (also for gcc-16, since this is contained
> in autofdo).  If we offline a function with non-zero counts in it,
> we may want to assign the offline copy first run timestamp derived
> from the outer function.  (Probably something like
> timestamp+inline depth).
> > Does the patch look OK if bootstrap+test passes (with and without
> AutoFDO) ?
> > gcc/ChangeLog:
> >       * auto-profile.cc: (string_table::filenames): New method.
> >       (function_instance::timestamp_): New member.
> >       (function_instance::timestamp): New accessor for timestamp_
> member.
> >       (autofdo::timestamp_info_map): New std::map.
> >       (function_instance::function_instance): Add argument timestamp
> and set
> >       it to member timestamp_.
> >       (function_instance::read_function_instance): Adjust prototype
> and read
> >       timestamp.
> >       (autofdo_source_profile::get_function_instance_by_decl): New
> argument
> >       filename with default value NULL.
> >       (autofdo_source_profile::read): Populate timestamp_info_map.
> >       (afdo_annotate_cfg): Assign node->tp_first_run based on
> >       timestamp_info_map and bail out of annotation if
> >       param auto-profile-reorder-only is enabled.
> >       * params.opt: New param auto-profile-reorder-only.
> >       * ipa-locality-cloning.cc (partition_by_timestamps): New
> function.
> >       (locality_partition_and_clone): Call partition_by_timestamps.
> >       * varasm.cc (default_function_section): Bail out if -fauto-
> profile
> >       -fprofile-reorder-functions is enabled and node->tp_first_run
> > 0.
> >
> > gcc/lto/ChangeLog:
> >       * lto-partition.cc (create_partition): Move higher up in the
> file.
> >       (partition_by_timestamps): New function.
> >       (lto_balanced_map): Call partition_by_timestamps if -fauto-
> profile
>       -fprofile-reorder-functions is passed and noreorder is empty.
> OK
> > +  /* perf timestamp associated with first execution of function.
> */
> Please add comment that tp_first_run is eventually computed using this
> value, but we need to watch overflows since it is 32bit.
> > +  gcov_type timestamp_;
> > +/* Map from timestamp -> <name, tp_first_run>.
> > +
> > +The purpose of this map is to map 64-bit timestamp values to
> (1..N),
> > +sorted by ascending order of timestamps and assign that to
> > +node->tp_first_run, since we don't need the full 64-bit range.  */
> Seems the comment is formatted wrongly.
> > +static std::map<gcov_type, std::pair<int, int>> timestamp_info_map;
> > +
> >  /* Scaling factor for afdo data.  Compared to normal profile
> >     AFDO profile counts are much lower, depending on sampling
> >     frequency.  We scale data up to reudce effects of roundoff @@
> > -2523,6 +2543,7 @@
> autofdo_source_profile::offline_unrealized_inlines
> > ()
> >  /* function instance profile format:
> >
> >     ENTRY_COUNT: 8 bytes
> > +   TIMESTAMP: 8 bytes (only for toplevel symbols)
> >     NAME_INDEX: 4 bytes
> >     NUM_POS_COUNTS: 4 bytes
> >     NUM_CALLSITES: 4 byte
> > @@ -2549,14 +2570,17 @@
> > autofdo_source_profile::offline_unrealized_inlines ()
> >
> >  function_instance *
> >  function_instance::read_function_instance (function_instance_stack
> *stack,
> > -                                        gcov_type head_count)
> > +                                        gcov_type head_count, bool
> > + toplevel)
> 
> There are also head count that is also streamed for toplevel instances
> that is handled by passing it around as a parameter:
> 
>       function_instance *s = function_instance::read_function_instance
> (
>           &stack, gcov_read_counter ());
> 
> I think "toplevel" flag is better, so please modify streaming of head
> count to work same way.
> >      = new function_instance (name, afdo_string_table-
> >get_filename_idx (name),
> > -                          head_count);
> > +                          head_count, timestamp);
> We will need setter api anyway to handle update while offlining, so
> perhaps instead of adding extra parameter of the ctor, just set
> timestamp after construction?
> > +  /* timestamp_info_map is std::map with timestamp as key,
> > +     so it's already sorted in ascending order wrt timestamps.
> > +     This loop maps function with lowest timestamp to 1, and so on.
> > +     In afdo_annotate_cfg, node->tp_first_run is then set to
> corresponding
> > +     tp_first_run value.  */
> > +
> > +  int tp_first_run = 1;
> > +  for (auto &p : timestamp_info_map)
> > +    p.second.second = tp_first_run++;
> For time being I guess you can walk inline instances here recursively
> and assign tp_first_runs to all of those that contains non-zero
> counts.
> Then offlining only needs to merge the timestamps at a time it merges
> the profile.
> > diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-
> cloning.cc
> > index 2684046bd2d..1653ea7b961 100644
> > --- a/gcc/ipa-locality-cloning.cc
> > +++ b/gcc/ipa-locality-cloning.cc
> 
> The changes above are OK with the changes mentioned.
Thanks for the suggestions. I have tried to address them in the attached
patch, and will send the partitioning changes in a separate patch.

This patch propagates the timestamp from the toplevel function_instance to
its inlined instances, and picks the lesser value while merging. On testing
the patch, I am seeing a ~4.1% improvement on a large internal workload,
and several more functions with non-zero tp_first_run values.
I have not fully investigated the cause yet, but IIUC this would happen if
ipa-inline during AutoFDO didn't do all the inlining that happened during
the train run?
So the outlined callee now gets tp_first_run set, and is grouped together
with the (first) caller (whereas without propagating tp_first_run, the
caller and outlined callee would get placed apart)?
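
To make the propagation and merge rule concrete, here is a minimal
standalone sketch. It is a simplified model and not the code in the patch:
the struct instance and the functions propagate and merge_timestamps are
hypothetical stand-ins for function_instance, prop_timestamp_1 and the
timestamp handling in merge. Treating 0 as "unset" when merging is an
assumption of this sketch.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for function_instance; only the fields needed
// to illustrate timestamp propagation are modelled.
struct instance
{
  int64_t total_count = 0;
  int64_t timestamp = 0;   // 0 means "no timestamp recorded".
  std::vector<std::unique_ptr<instance>> callsites;
};

// Give every inlined instance that has samples the timestamp TS of its
// toplevel function, unless it already carries one.
void
propagate (instance &fn, int64_t ts)
{
  if (fn.timestamp == 0 && fn.total_count > 0)
    fn.timestamp = ts;
  for (auto &callee : fn.callsites)
    propagate (*callee, ts);
}

// When two instances are merged, keep the earlier timestamp.
// Assumption of this sketch: 0 is treated as unset, not as "earliest".
int64_t
merge_timestamps (int64_t a, int64_t b)
{
  if (a == 0)
    return b;
  if (b == 0)
    return a;
  return a < b ? a : b;
}
```

With this model, an inlined callee that has samples inherits its toplevel
caller's timestamp, so if it is later offlined it sorts next to that caller
instead of falling back to "no time profile".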

Does the patch look OK?

Signed-off-by: Prathamesh Kulkarni <[email protected]> 

Thanks,
Prathamesh
> 
> The changes below can still be a separate patch (once tp_first_run is
> set, we should get sane partitioning and ordering by existing balanced
> partitioning code).
> 
> > @@ -994,6 +994,50 @@ locality_determine_static_order
> (auto_vec<locality_order *> *order)
> >    return true;
> >  }
> >
> > +/* Similar to lto-partition.cc:partition_by_timestamps.  */
> > +
> > +static locality_partition
> > +partition_by_timestamps (int64_t max_partition_size, int&
> > +npartitions)
> It is not called timestamps anymore, so perhaps
> partition_by_tp_first_run.
> > @@ -1134,6 +1187,19 @@ lto_balanced_map (int n_lto_partitions, int
> > max_partition_size)
> >
> >    original_total_size = total_size;
> >
> > +  /* With -fauto-profile -fprofile-reorder-functions, place all
> symbols which
> > +     are profiled together into as few partitions as possible. The
> rationale
> > +     for doing this with AutoFDO is that the number of sampled
> functions is a
> > +     fraction of total number of executed functions (unlike PGO
> where each
> > +     executed function gets instrumented with time profile
> counter), and
> > +     placing them together helps with code locality.  */
> > +
> > +  partition = NULL;
> > +  if (flag_auto_profile
> > +      && flag_profile_reorder_functions
> > +      && noreorder.length () == 0)
> > +    partition = partition_by_timestamps (max_partition_size,
> > + npartitions);
> 
> I do not see why this is needed.  Balanced partitioning starts by a
> fixed order of functions:
> 
>   order.qsort (tp_first_run_node_cmp);
> 
> which will prioritize functions with tp_first_run and
> flag_profile_reorder_functions enabled (this flag is per-function so it
> needs to be checked on each of them)
> 
> int
> tp_first_run_node_cmp (const void *pa, const void *pb) {
>   const cgraph_node *a = *(const cgraph_node * const *) pa;
>   const cgraph_node *b = *(const cgraph_node * const *) pb;
>   unsigned int tp_first_run_a = a->tp_first_run;
>   unsigned int tp_first_run_b = b->tp_first_run;
> 
>   if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
>       || a->no_reorder)
>     tp_first_run_a = 0;
>   if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
>       || b->no_reorder)
>     tp_first_run_b = 0;
> 
>   if (tp_first_run_a == tp_first_run_b)
>     return a->order - b->order;
> 
>   /* Functions with time profile must be before these without profile.
> */
>   tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
>   tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;
> 
>   return tp_first_run_a - tp_first_run_b; }
> 
> Once order is set, balanced partitioning assigns functions to
> partitions in that order, only trying to minimize the interface between
> the parts by looking for better cut points in a given range.
> 
> So it should give similar or better order than what you get with first
> putting non-zero tp first run into partitions using separate
> partitioner.
> > +
> > +  /* With -fauto-profile -fprofile-reorder-functions, give higher
> priority
> > +     to time profile based reordering, and ensure the reordering
> isn't split
> > +     by hot/cold partitioning.  */
> > +  if (flag_auto_profile
> > +      && flag_profile_reorder_functions
> > +      && cgraph_node::get (decl)->tp_first_run > 0)
> > +    return NULL;
> 
> I am also not sure why this would be needed.  If you have cold
> function that is run early (such as startup initialization code) it
> still makes sense to place it away from code that is run often.
> 
> Honza
Enable time profile function reordering with AutoFDO.

The patch enables time profile based reordering with AutoFDO with
-fauto-profile -fprofile-reorder-functions, by mapping timestamps obtained
from perf into node->tp_first_run.

The rationale for doing this is:
(1) GCC already implements time-profile function reordering with PGO; the
patch enables it with AutoFDO.
(2) While time profile ordering is primarily meant for optimizing startup time,
we've also observed good effects on code locality for large internal workloads.
(3) Possibly useful for function reordering when accurate profile annotation is
hard with AutoFDO, e.g. if branch samples are missing (due to the absence of
an LBR-like structure).

On the AutoFDO tools side, a corresponding patch extends gcov to emit a 64-bit
perf timestamp that records the first execution of a function, which loosely
corresponds to PGO's time_profile counter.
The timestamp is stored adjacent to the head field in the toplevel function
info.

On GCC side, this patch makes the following changes:

(1) Changes to auto-profile pass:
The patch adds a new field timestamp_ to function_instance,
and populates it in read_function_instance.
It maintains a new timestamp_info_map from timestamp -> tp_first_run,
which maps timestamps sorted in ascending order to (1..N), so the lowest
timestamp is mapped to 1 and so on. The rationale for this is that
timestamps are 64-bit integers, and we don't need the full 64-bit range
for ordering by tp_first_run.

During annotation, the timestamp associated with function_instance is looked up
in timestamp_info_map, and corresponding mapped value is assigned
to node->tp_first_run.
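
As an illustration of the mapping described above, here is a minimal
standalone sketch. The function rank_timestamps is a hypothetical helper
written for this message and is not part of the patch; it only mirrors the
idea of using a std::map to sort the 64-bit timestamps and then numbering
them 1..N.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical helper, not part of the patch: compress 64-bit timestamps
// into dense ranks 1..N, ordered by ascending timestamp.
std::map<int64_t, int>
rank_timestamps (const std::vector<int64_t> &timestamps)
{
  std::map<int64_t, int> ranks;
  // std::map keeps its keys sorted, and duplicate timestamps collapse
  // into a single entry.
  for (int64_t ts : timestamps)
    ranks.insert ({ts, 0});

  // Walk the keys in ascending order and number them from 1.
  int next_rank = 1;
  for (auto &entry : ranks)
    entry.second = next_rank++;
  return ranks;
}
```

The resulting rank always fits in the 32-bit tp_first_run field as long as
the number of distinct timestamps does, regardless of how large the raw
perf timestamps are.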

Dhruv's sourcefile tracking patch already handles LTO privatized symbols.
The patch adds a workaround for mismatched/empty filenames, which should go away
when the issues with AutoFDO tools dwarf parsing are resolved.

(2) Param to disable profile driven opts.
The patch adds param auto-profile-reorder-only, which enables only
time-profile reordering with AutoFDO:
(a) Useful as a debugging aid to isolate a regression to either function
reordering or profile driven opts.
(b) As a stopgap measure to avoid regressions with AutoFDO profile driven opts.
(c) Possibly useful for architectures which do not support branch sampling.

gcc/ChangeLog:
        * auto-profile.cc (string_table::filenames): New method.
        (function_instance::timestamp_): New member.
        (function_instance::timestamp): New accessor for timestamp_ member.
        (function_instance::set_timestamp): New function.
        (function_instance::prop_timestamp): Likewise.
        (function_instance::prop_timestamp_1): Likewise.
        (function_instance::function_instance): Initialize timestamp_ to 0.
        (function_instance::read_function_instance): Adjust prototype by
        replacing head_count with toplevel param with default value true, and
        stream in head_count and timestamp values from gcov file.
        (autofdo::timestamp_info_map): New std::map.
        (autofdo_source_profile::get_function_instance_by_decl): New argument
        filename with default value NULL.
        (autofdo_source_profile::read): Populate timestamp_info_map and
        propagate timestamp to inlined instances from toplevel function.
        (afdo_annotate_cfg): Assign node->tp_first_run based on
        timestamp_info_map and bail out of annotation if
        param_auto_profile_reorder_only is enabled.
        * params.opt: New param auto-profile-reorder-only.

Signed-off-by: Prathamesh Kulkarni <[email protected]>

diff --git a/gcc/auto-profile.cc b/gcc/auto-profile.cc
index 35874f465e5..2a55e4ffbaf 100644
--- a/gcc/auto-profile.cc
+++ b/gcc/auto-profile.cc
@@ -281,6 +281,8 @@ public:
 
   /* Return cgraph node corresponding to given name index.  */
   cgraph_node *get_cgraph_node (int);
+
+  const string_vector& filenames () { return filenames_; }
 private:
   typedef std::map<const char *, unsigned, string_compare> string_index_map;
   typedef std::map<const char *, auto_vec<unsigned>, string_compare>
@@ -348,8 +350,7 @@ public:
      HEAD_COUNT. Recursively read callsites to create nested function_instances
      too. STACK is used to track the recursive creation process.  */
   static function_instance *
-  read_function_instance (function_instance_stack *stack,
-                          gcov_type head_count);
+  read_function_instance (function_instance_stack *stack, bool toplevel = true);
 
   /* Recursively deallocate all callsites (nested function_instances).  */
   ~function_instance ();
@@ -373,6 +374,18 @@ public:
     return head_count_;
   }
 
+  gcov_type
+  timestamp () const
+  {
+    return timestamp_;
+  }
+
+  void set_timestamp (gcov_type timestamp) { timestamp_ = timestamp; }
+
+  /* Propagate timestamp from top-level function_instance to
+     inlined instances.  */
+  void prop_timestamp ();
+
   /* Traverse callsites of the current function_instance to find one at the
      location of LINENO and callee name represented in DECL.
      LOCATION should match LINENO and is used to output diagnostics.  */
@@ -536,9 +549,10 @@ private:
   function_instance (unsigned symbol_name, unsigned file_name,
                     gcov_type head_count)
     : descriptor_ (file_name, symbol_name), total_count_ (0),
-      head_count_ (head_count), removed_icall_target_ (false),
-      realized_ (false), in_worklist_ (false), inlined_to_ (NULL),
-      location_ (UNKNOWN_LOCATION), call_location_ (UNKNOWN_LOCATION)
+      head_count_ (head_count), timestamp_ (0),
+      removed_icall_target_ (false), realized_ (false), in_worklist_ (false),
+      inlined_to_ (NULL), location_ (UNKNOWN_LOCATION),
+      call_location_ (UNKNOWN_LOCATION)
   {
   }
 
@@ -554,6 +568,10 @@ private:
   /* Entry BB's sample count.  */
   gcov_type head_count_;
 
+  /* perf timestamp associated with first execution of function, which is
+     used to compute node->tp_first_run.  */
+  gcov_type timestamp_;
+
   /* Map from callsite location to callee function_instance.  */
   callsite_map callsites;
 
@@ -581,6 +599,9 @@ private:
   /* Turn inline instance to offline.  */
   static bool offline (function_instance *fn,
                       vec <function_instance *> &new_functions);
+
+  /* Helper routine for prop_timestamp.  */
+  void prop_timestamp_1 (gcov_type timestamp);
 };
 
 /* Profile for all functions.  */
@@ -601,7 +622,7 @@ public:
   ~autofdo_source_profile ();
 
   /* For a given DECL, returns the top-level function_instance.  */
-  function_instance *get_function_instance_by_decl (tree decl) const;
+  function_instance *get_function_instance_by_decl (tree decl, const char * = NULL) const;
 
   /* For a given DESCRIPTOR, return the matching instance if found.  */
   function_instance *
@@ -681,6 +702,13 @@ static autofdo_source_profile *afdo_source_profile;
 /* gcov_summary structure to store the profile_info.  */
 static gcov_summary *afdo_profile_info;
 
+/* Map from timestamp -> tp_first_run.
+
+   The purpose of this map is to map 64-bit timestamp values to (1..N) sorted
+   by ascending order of timestamps and assign that to node->tp_first_run,
+   since we don't need the full 64-bit range.  */
+static std::map<gcov_type, int> timestamp_info_map;
+
 /* Scaling factor for afdo data.  Compared to normal profile
    AFDO profile counts are much lower, depending on sampling
    frequency.  We scale data up to reduce effects of roundoff
@@ -1182,6 +1210,24 @@ function_instance::~function_instance ()
     delete iter->second;
 }
 
+/* Propagate timestamp TS of function_instance to inlined instances if it's
+   not already set.  */
+
+void
+function_instance::prop_timestamp_1 (gcov_type ts)
+{
+  if (!timestamp () && total_count () > 0)
+    set_timestamp (ts);
+  for (auto it = callsites.begin (); it != callsites.end (); ++it)
+    it->second->prop_timestamp_1 (ts);
+}
+
+void
+function_instance::prop_timestamp (void)
+{
+  prop_timestamp_1 (timestamp ());
+}
+
 /* Traverse callsites of the current function_instance to find one at the
    location of LINENO and callee name represented in DECL.  */
 
@@ -1244,6 +1290,10 @@ function_instance::merge (function_instance *other,
   else if (head_count_ != -1)
     head_count_ += other->head_count_;
 
+  /* While merging timestamps, keep the earlier one; 0 means unset.  */
+  if (other->timestamp_ && (!timestamp_ || other->timestamp_ < timestamp_))
+    set_timestamp (other->timestamp ());
+
   bool changed = true;
 
   while (changed)
@@ -2582,6 +2632,7 @@ autofdo_source_profile::offline_unrealized_inlines ()
 /* function instance profile format:
 
    ENTRY_COUNT: 8 bytes
+   TIMESTAMP: 8 bytes (only for toplevel symbols)
    NAME_INDEX: 4 bytes
    NUM_POS_COUNTS: 4 bytes
    NUM_CALLSITES: 4 byte
@@ -2608,8 +2659,15 @@ autofdo_source_profile::offline_unrealized_inlines ()
 
 function_instance *
 function_instance::read_function_instance (function_instance_stack *stack,
-                                          gcov_type head_count)
+                                          bool toplevel)
 {
+  gcov_type_unsigned timestamp = 0;
+  gcov_type head_count = -1;
+  if (toplevel)
+    {
+      head_count = gcov_read_counter ();
+      timestamp = (gcov_type_unsigned) gcov_read_counter ();
+    }
   unsigned name = gcov_read_unsigned ();
   unsigned num_pos_counts = gcov_read_unsigned ();
   unsigned num_callsites = gcov_read_unsigned ();
@@ -2617,6 +2675,8 @@ function_instance::read_function_instance (function_instance_stack *stack,
     = new function_instance (name,
                             afdo_string_table->get_filename_by_symbol (name),
                             head_count);
+  if (timestamp > 0)
+    s->set_timestamp (timestamp);
   if (!stack->is_empty ())
     s->set_inlined_to (stack->last ());
   stack->safe_push (s);
@@ -2644,7 +2704,7 @@ function_instance::read_function_instance (function_instance_stack *stack,
     {
       unsigned offset = gcov_read_unsigned ();
       function_instance *callee_function_instance
-          = read_function_instance (stack, -1);
+       = read_function_instance (stack, false);
       s->callsites[std::make_pair (offset,
                                   callee_function_instance->symbol_name ())]
        = callee_function_instance;
@@ -2665,9 +2725,10 @@ autofdo_source_profile::~autofdo_source_profile ()
 /* For a given DECL, returns the top-level function_instance.  */
 
 function_instance *
-autofdo_source_profile::get_function_instance_by_decl (tree decl) const
+autofdo_source_profile::get_function_instance_by_decl (tree decl, const char *filename) const
 {
-  const char *filename = get_normalized_path (DECL_SOURCE_FILE (decl));
+  if (!filename)
+    filename = get_normalized_path (DECL_SOURCE_FILE (decl));
   int index = afdo_string_table->get_index_by_decl (decl);
   if (index == -1)
     return NULL;
@@ -2898,8 +2959,7 @@ autofdo_source_profile::read ()
     {
       function_instance::function_instance_stack stack;
       function_instance *s
-       = function_instance::read_function_instance (&stack,
-                                                    gcov_read_counter ());
+       = function_instance::read_function_instance (&stack);
 
       if (find_function_instance (s->get_descriptor ()) == nullptr)
        add_function_instance (s);
@@ -2907,7 +2967,21 @@ autofdo_source_profile::read ()
        fatal_error (UNKNOWN_LOCATION,
                     "auto-profile contains duplicated function instance %s",
                     afdo_string_table->get_symbol_name (s->symbol_name ()));
+
+      s->prop_timestamp ();
+      timestamp_info_map.insert({s->timestamp (), 0});
     }
+
+  /* timestamp_info_map is std::map with timestamp as key,
+     so it's already sorted in ascending order wrt timestamps.
+     This loop maps function with lowest timestamp to 1, and so on.
+     In afdo_annotate_cfg, node->tp_first_run is then set to corresponding
+     tp_first_run value.  */
+
+  int tp_first_run = 1;
+  for (auto &p : timestamp_info_map)
+    p.second = tp_first_run++;
+
   int hot_frac = param_hot_bb_count_fraction;
   /* Scale up the profile, but leave some bits in case some counts gets
      bigger than sum_max eventually.  */
@@ -4277,6 +4351,17 @@ afdo_annotate_cfg (void)
       = afdo_source_profile->get_function_instance_by_decl (
           current_function_decl);
 
+  /* FIXME: This is a workaround for sourcefile tracking, if afdo_string_table
+     ends up with empty filename or incorrect filename for the function and
+     should be removed once issues with sourcefile tracking get fixed.  */
+  if (s == NULL)
+    for (unsigned i = 0; i < afdo_string_table->filenames ().length (); i++)
+      {
+       s = afdo_source_profile->get_function_instance_by_decl (current_function_decl, afdo_string_table->filenames()[i]);
+       if (s)
+         break;
+      }
+
   if (s == NULL)
     {
       if (dump_file)
@@ -4285,7 +4370,8 @@ afdo_annotate_cfg (void)
       /* create_gcov only dumps symbols with some samples in them.
         This means that we get nonempty zero_bbs only if some
         nonzero counts in profile were not matched with statements.  */
-      if (!flag_profile_partial_training)
+      if (!flag_profile_partial_training
+         && !param_auto_profile_reorder_only)
        {
          FOR_ALL_BB_FN (bb, cfun)
            if (bb->count.quality () == GUESSED_LOCAL)
@@ -4295,6 +4381,20 @@ afdo_annotate_cfg (void)
       return;
     }
 
+  if (auto it = timestamp_info_map.find (s->timestamp ());
+      it != timestamp_info_map.end ())
+    {
+      cgraph_node *node = cgraph_node::get (current_function_decl);
+      node->tp_first_run = it->second;
+
+      if (dump_file)
+       fprintf (dump_file, "Setting %s->tp_first_run to %d\n",
+                node->asm_name (), node->tp_first_run);
+    }
+
+  if (param_auto_profile_reorder_only)
+    return;
+
   calculate_dominance_info (CDI_POST_DOMINATORS);
   calculate_dominance_info (CDI_DOMINATORS);
   loop_optimizer_init (0);
diff --git a/gcc/params.opt b/gcc/params.opt
index ad5ee887367..936c5a2a116 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -70,6 +70,10 @@ Enable asan detection of use-after-return bugs.
 Common Joined UInteger Var(param_auto_profile_bbs) Init(1) IntegerRange(0, 1) Param Optimization
 Build basic block profile using auto profile.
 
+-param=auto-profile-reorder-only=
+Common Joined UInteger Var(param_auto_profile_reorder_only) Init(0) IntegerRange(0, 1) Param Optimization
+Enable only function reordering with auto-profile.
+
 -param=cycle-accurate-model=
 Common Joined UInteger Var(param_cycle_accurate_model) Init(1) IntegerRange(0, 1) Param Optimization
 Whether the scheduling description is mostly a cycle-accurate model of the target processor and is likely to be spill aggressively to fill any pipeline bubbles.
