https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114855

--- Comment #32 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Aldy Hernandez from comment #31)
> (In reply to rguent...@suse.de from comment #30)
> > On Wed, 4 Sep 2024, amacleod at redhat dot com wrote:
> 
> > > I tried running valgrind, which died, but up to that point it showed 77% 
> > > of the
> > > time spend in the body of 
> > > back_jt_path_registry::adjust_paths_after_duplication ()
> > 
> > Hmm, I can have a look there if you don't beat me to it (though the
> > profiles didn't have this part in AFAIR).
> 
> [Some background before somebody shoots me.]
> 
> This function should really go.  It was leftover from a PR I worked on a
> long time ago, before the backwards threader rewrite.  In working on it I
> realized that we were threading a lot of sub-threads that could be
> simplified.  I suggested this to Jeff, who may remember the details.  There
> is no testcase unfortunately.
> 
> It seemed cheap at the time, since the original backwards threader could
> only thread a handful of things, so the number of threads was always in the
> single digits.
> 
> The function is ugly.  I thought about removing it as part of the threader
> rewrite, but was wasn't brave enough because we didn't have a testcase and I
> had forgotten the details.
> 
> Ok, enough excuses.
> 
> Andrew was playing with this last night and saw a reduction from 1700
> seconds to 27 (?? in a subset of the testcase??) by just removing it.
> 
> We either need to rewrite this function, or delete it altogether.  I vote
> for the latter.  It's a micro optimization that is obviously too expensive
> to be worth it.  Perhaps we should benchmark how many times it actually
> triggers in a bootstrap, and what the supposed code size savings are.

I think this functionality should have been implemented during discovery
somehow - you'd have to build a way to search threading paths from
path prefixes.  (quite easy to do for forward threading, just to name
another reason why forward threading is "better")

In the end this would also affect costing of a path (or a pair of paths),
but doing better there, esp. if path A or B individually are not profitable
to thread but the combined thread would be within size/copy limits of
two paths, is a bit difficult.

It might be the easiest way would be to keep it at transform time but
instead of walking all paths after each threading just populate a
pair<start-of-path, orig-bb> -> copy-bb map and when threading a new
path lookup start-of-path, orig-bb for each of the threads path and
adjust it there?  You'd need multi-levels as well I guess if you
have three paths working on each others.  Maybe the map should be
two levels deep, the first key only on start-of-path yielding a map
for the orig-bb.

Note the situation that the start of a path is copied is also not handled,
which basically duplicates a threading opportunity to two paths.

That said, implementing separate threading for each path is probably
sub-optimal and instead the "threading web" could have a better overall
data structure where these combinations/duplications are more obvious
(and can be costed to their actual effect).

That said - I'd not remove the functionality but instead throw memory
at it to make the "lookup" O(log n) instead of a linear walk and perform
the adjustment at the time a thread is code generated (but fill the data
structure after each thread).

Reply via email to