> -----Original Message-----
> From: Prathamesh Kulkarni <[email protected]>
> Sent: 16 December 2025 17:37
> To: Jan Hubicka <[email protected]>
> Cc: [email protected]
> Subject: RE: [RFC] Enable time profile function reordering with
> AutoFDO
>
> External email: Use caution opening links or attachments
>
>
> > -----Original Message-----
> > From: Jan Hubicka <[email protected]>
> > Sent: 11 December 2025 20:56
> > To: Prathamesh Kulkarni <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [RFC] Enable time profile function reordering with
> > AutoFDO
> >
> > External email: Use caution opening links or attachments
> >
> >
> > > Hi Honza,
> > > Thanks for the review, please find responses inline below.
> > >
> > > > I assume that the autofdo assigns timestamps to offile function
> > > > instances, so this can also fit into function_instance
> > > > datastructure, but I also suppose you want to keep it separate
> so
> > it is optional?
> > > function_instance does store timestamp, and uses it as a key into
> > timestamp_info_map.
> > > My intent of keeping a separate data-structure timestamp_info_map
> > was
> > > to sort 64-bit timestamps into ascending order, and then map the
> > sorted timestamp values to (1..N), and assign the mapped value to
> > node->tp_first_run.
> > > So we don't need to up size of node->tp_first_run to 64bit (and I
> > > guess we won't need the full 64-bit range here?)
> >
> > Normal time profiling works in a way that time is increased only
> when
> > new function is executed, so 2^32 is the limit on number of executed
> > functions in the program which is probably still large enough: if
> one
> > does LTO with more than 2^32 symbols we will run out ot DECL_UIDs
> and
> > other things, though it may be possible to link such a large
> program.
> >
> > At runtime, all computations and streaming happens in 64bit, so
> > updating to_first_run into 64bit is also alternative, but I guess
> the
> > compression is a reasonable approach until we will need to update
> > other datastructures as well.
> > > > Also does the fature work with normal partitioning? It should
> make
> > > > those function with tp_first_run nonzero go into early
> partitions,
> > > > but then we will likely get relatively lot of calls to functions
> > > > with zero counts just because we lost track of them?
> > > The results I reported have been with balanced partitioning. I am
> > seeing roughly the same numbers on top of locality partitioning too.
> >
> > Great, happy that this works on both settings.
> > > >
> > > > I also wonder if offlining can behave meaningfully here. I.e.
> > when
> > > > function profile is non-zero just copy first run from the
> caller.
> > > IIUC, the purpose of afdo_offline pass is to remove cross module
> > > inlining from function_instances so AutoFDO annotation gets a
> better
> > mapping from source locations to CFG ?
> >
> > Yes, old code assumed all relevant inlining to happen at early
> inline
> > time via the afdo-inline path. All inlining that did not happen led
> > to lost profile data. With LTO this becomes quite problematic since
> a
> > lot of inlining is cross-module, so we need to take inline instances
> > out and merge them with existing offline instances in order to get
> > realistic profile before the link-time IPA-profile pass.
> > >
> > > Currently, the patch only assigns timestamps to functions whose
> > outline copy is executed during train run (toplevel).
> > > Would afdo offlining help for instance if a function is inlined
> into
> > > all it's callers during train run (thus the outline copy has no
> > timestamp info), but during AutoFDO, it may not get inlined into all
> > it's callers ? Currently in this case, I guess we lose timestamp
> info
> > (and thus time profile reordering for such a function).
> > >
> > > Currently the transform is gated explicitly on -fauto-profile -
> > fprofile-reorder-functions (so users can opt-in instead of being
> > enabled by default with -fauto-profile).
> > > Should that be OK for gcc-16 ?
> >
> > I think we want to do that (also for gcc-16, since this is contained
> > in autofdo). If we offline a function with a a non-zero counts in
> it,
> > we may want to assign the offline copy first run timestamp derived
> > from the outer function. (Probably something like
> > timestamp+inline depth).
> > > Does the patch look OK if bootstrap+test passes (with and without
> > AutoFDO) ?
> > > gcc/ChangeLog:
> > > * auto-profile.cc: (string_table::filenames): New method.
> > > (function_instance::timestamp_): New member.
> > > (function_instance::timestamp): New accessor for timestamp_
> > member.
> > > (autofdo::timestamp_info_map): New std::map.
> > > (function_instance::function_instance): Add argument
> timestamp
> > and set
> > > it to member timestamp_.
> > > (function_instance::read_function_instance): Adjust
> prototype
> > and read
> > > timestamp.
> > > (autofdo_source_profile::get_function_instance_by_decl): New
> > argument
> > > filename with default value NULL.
> > > (autofdo_source_profile::read): Populate timestamp_info_map.
> > > (afdo_annotate_cfg): Assign node->tp_first_run based on
> > > timestamp_info_map and bail out of annotation if
> > > param auto-profile-reorder-only is enabled.
> > > * params.opt: New param auto-profile-reorder-only.
> > > * ipa-locality-cloning.cc (partition_by_timestamps): New
> > function.
> > > (locality_partition_and_clone): Call
> partition_by_timestamps.
> > > * varasm.cc (default_function_section): Bail out if -fauto-
> > profile
> > > -fprofile-reorder-functions is enabled and node-
> >tp_first_run
> > > 0.
> > >
> > > gcc/lto/ChangeLog:
> > > * lto-partition.cc (create_partition): Move higher up in the
> > file.
> > > (partition_by_timestamps): New function.
> > > (lto_balanced_map): Call partition_by_timestamps if -fauto-
> > profile
> > > -fprofile-reorder-functions is passed and noreroder is
> empty.
> > OK
> > > + /* perf timestamp associated with first execution of function.
> > */
> > Please add comment that tp_first_run is eventually computed using
> this
> > value, but we need to watch overflows since it is 32bit.
> > > + gcov_type timestamp_;
> > > +/* Map from timestamp -> <name, tp_first_run>.
> > > +
> > > +The purpose of this map is to map 64-bit timestamp values to
> > (1..N),
> > > +sorted by ascending order of timestamps and assign that to
> > > +node->tp_first_run, since we don't need the full 64-bit range.
> */
> > Seems the comment is formatted wrongly.
> > > +static std::map<gcov_type, std::pair<int, int>>
> timestamp_info_map;
> > > +
> > > /* Scaling factor for afdo data. Compared to normal profile
> > > AFDO profile counts are much lower, depending on sampling
> > > frequency. We scale data up to reudce effects of roundoff @@
> > > -2523,6 +2543,7 @@
> > autofdo_source_profile::offline_unrealized_inlines
> > > ()
> > > /* function instance profile format:
> > >
> > > ENTRY_COUNT: 8 bytes
> > > + TIMESTAMP: 8 bytes (only for toplevel symbols)
> > > NAME_INDEX: 4 bytes
> > > NUM_POS_COUNTS: 4 bytes
> > > NUM_CALLSITES: 4 byte
> > > @@ -2549,14 +2570,17 @@
> > > autofdo_source_profile::offline_unrealized_inlines ()
> > >
> > > function_instance *
> > > function_instance::read_function_instance
> (function_instance_stack
> > *stack,
> > > - gcov_type head_count)
> > > + gcov_type head_count,
> bool
> > > + toplevel)
> >
> > There are also head count that is also streamed for toplevel
> instances
> > that is handled by passing it around as a parameter:
> >
> > function_instance *s =
> function_instance::read_function_instance
> > (
> > &stack, gcov_read_counter ());
> >
> > I think "toplevel" flag is better, so please modify streaming of
> head
> > count to work same way.
> > > = new function_instance (name, afdo_string_table-
> > >get_filename_idx (name),
> > > - head_count);
> > > + head_count, timestamp);
> > We will need setter api anyway to handle update while offlining, so
> > perhaps instead of adding extra parameter of the ctor, just set
> > timestamp after contruction?
> > > + /* timestamp_info_map is std::map with timestamp as key,
> > > + so it's already sorted in ascending order wrt timestamps.
> > > + This loop maps function with lowest timestamp to 1, and so
> on.
> > > + In afdo_annotate_cfg, node->tp_first_run is then set to
> > corresponding
> > > + tp_first_run value. */
> > > +
> > > + int tp_first_run = 1;
> > > + for (auto &p : timestamp_info_map)
> > > + p.second.second = tp_first_run++;
> > For time being I guess you can walk inline instances here
> recursively
> > and assign tp_first_runs to all of those that contains non-zero
> > counts.
> > Then offlining only needs to merge the timestamps at a time it
> merges
> > the profile.
> > > diff --git a/gcc/ipa-locality-cloning.cc b/gcc/ipa-locality-
> > cloning.cc
> > > index 2684046bd2d..1653ea7b961 100644
> > > --- a/gcc/ipa-locality-cloning.cc
> > > +++ b/gcc/ipa-locality-cloning.cc
> >
> > The changes above are OK with the changes mentioned.
> Thanks for the suggestions, I have tried to address them in the
> attached patch, and will send the partitioning changes in a separate
> patch.
>
> This patch propagates timestamp from toplevel function_instance to
> inlined instances, and picks the lesser value while merging. On
> testing the patch, I am seeing a ~4.1% improvement on a large internal
> workload, and several more functions with non-zero tp_first_run
> values.
> I didn't fully investigate the cause yet but IIUC, this would happen
> if ipa-inline during AutoFDO didn't do all the inlining that happened
> during train run ?
> So the outlined callee now gets tp_first_run set, and is grouped
> together with the (first) caller (while without propagating
> tp_first_run, the caller and outlined callee will get placed apart) ?
>
> Does the patch look OK ?
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2025-December/703774.html
Thanks,
Prathamesh
>
> Signed-off-by: Prathamesh Kulkarni <[email protected]>
>
> Thanks,
> Prathamesh
> >
> > The changes below can still be a separate patch (once tp_first_run
> is
> > set, we should get sane partitioning and ordering by existing
> balanced
> > partitioning code).
> >
> > > @@ -994,6 +994,50 @@ locality_determine_static_order
> > (auto_vec<locality_order *> *order)
> > > return true;
> > > }
> > >
> > > +/* Similar to lto-partition.cc:partition_by_timestamp.s */
> > > +
> > > +static locality_partition
> > > +partition_by_timestamps (int64_t max_partition_size, int&
> > > +npartitions)
> > It is not called timestamps anymore, so perhaps
> > parittion_by_tp_first_run.
> > > @@ -1134,6 +1187,19 @@ lto_balanced_map (int n_lto_partitions, int
> > > max_partition_size)
> > >
> > > original_total_size = total_size;
> > >
> > > + /* With -fauto-profile -fprofile-reorder-functions, place all
> > symbols which
> > > + are profiled together into as few partitions as possible.
> The
> > rationale
> > > + for doing this with AutoFDO is that the number of sampled
> > functions is a
> > > + fraction of total number of executed functions (unlike PGO
> > where each
> > > + executed function gets instrumented with time profile
> > counter), and
> > > + placing them together helps with code locality. */
> > > +
> > > + partition = NULL;
> > > + if (flag_auto_profile
> > > + && flag_profile_reorder_functions
> > > + && noreorder.length () == 0)
> > > + partition = partition_by_timestamps (max_partition_size,
> > > + npartitions);
> >
> > I do not see why this is needed. Balanced parititioning starts by a
> > fixed order of functions:
> >
> > order.qsort (tp_first_run_node_cmp);
> >
> > which will prioritize functions with tp_first_run and
> > flag_profile_reodr_functions enabled (this flag is per-function so
> it
> > needs to be checked on each of them)
> >
> > int
> > tp_first_run_node_cmp (const void *pa, const void *pb) {
> > const cgraph_node *a = *(const cgraph_node * const *) pa;
> > const cgraph_node *b = *(const cgraph_node * const *) pb;
> > unsigned int tp_first_run_a = a->tp_first_run;
> > unsigned int tp_first_run_b = b->tp_first_run;
> >
> > if (!opt_for_fn (a->decl, flag_profile_reorder_functions)
> > || a->no_reorder)
> > tp_first_run_a = 0;
> > if (!opt_for_fn (b->decl, flag_profile_reorder_functions)
> > || b->no_reorder)
> > tp_first_run_b = 0;
> >
> > if (tp_first_run_a == tp_first_run_b)
> > return a->order - b->order;
> >
> > /* Functions with time profile must be before these without
> profile.
> > */
> > tp_first_run_a = (tp_first_run_a - 1) & INT_MAX;
> > tp_first_run_b = (tp_first_run_b - 1) & INT_MAX;
> >
> > return tp_first_run_a - tp_first_run_b; }
> >
> > Once order is set, balanced partitioning assigns functions to
> > partitiong in that order only trying to minimize the interface
> between
> > the parts by looking for better cut points in a given range.
> >
> > So it should give similar or better order than what you get with
> first
> > putting non-zero tp first run into partitions using separate
> > partitioner.
> > > +
> > > + /* With -fauto-profile -fprofile-reorder-functions, give higher
> > priority
> > > + to time profile based reordering, and ensure the reordering
> > isn't split
> > > + by hot/cold partitioning. */
> > > + if (flag_auto_profile
> > > + && flag_profile_reorder_functions
> > > + && cgraph_node::get (decl)->tp_first_run > 0)
> > > + return NULL;
> >
> > I am also not sure why this would be needed. If you have cold
> > function that is run early (such as startup initialization code) it
> > still makes sense to place it away from code that is run often.
> >
> > Honza