On Fri, Apr 18, 2014 at 2:16 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Fri, Apr 18, 2014 at 12:27 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >> What I've observed on power is that LTO alone reduces performance and
>> >> LTO+FDO is not significantly different than FDO alone.
>> > On SPEC2k6?
>> >
>> > This is quite surprising; for our (well, SUSE's) SPEC testers (AMD64),
>> > LTO seems an off-noise win on SPEC2k6:
>> > http://gcc.opensuse.org/SPEC/CINT/sb-megrez-head-64-2006/recent.html
>> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006/recent.html
>> >
>> > I do not see why PPC should be significantly more constrained by
>> > register pressure.
>> >
>> > I do not have a head-to-head comparison of FDO and FDO+LTO for SPEC.
>> > http://gcc.opensuse.org/SPEC/CFP/sb-megrez-head-64-2006-patched-FDO/index.html
>> > shows a noticeable drop in calculix and gamess.
>> > Martin profiled calculix and tracked it down to a loop that is not
>> > trained but hot in the reference run. That makes it optimized for size.
>> >
>> > http://dromaeo.com/?id=219677,219672,219965,219877
>> > compares Firefox's dromaeo runs with the default build, LTO, FDO, and
>> > LTO+FDO. Here the benefits of LTO and FDO seem to add up nicely.
>>
>> >> I agree that an exact estimate of the register pressure would be a
>> >> difficult problem. I'm hoping that something that approximates potential
>> >> register pressure downstream will be sufficient to help inlining
>> >> decisions.
>> >
>> > Yep, register pressure and I-cache overhead estimates are used for
>> > inline decisions by some compilers.
>> >
>> > I am mostly concerned about the metric suffering from the GIGO principle
>> > if we mix together too many estimates that are somewhat wrong by their
>> > nature. This is why I mostly tried to focus on size/time estimates and
>> > not add too many other metrics.
>> > But perhaps it is time to experiment with these, since obviously we
>> > have pushed the current infrastructure mostly to its limits.
>> >
>>
>> I like the word GIGO here. Getting inline signals right requires deep
>> analysis (including interprocedural analysis). Different signals/hints
>> may also come with different quality and thus different weights.
>>
>> Another challenge is how to quantify cycle savings/overhead more
>> precisely. With that, we can abandon the threshold-based scheme -- any
>> callsite with a net saving will be considered.
>
> Inline hints are intended to do this -- at the moment we bump the limits
> up when we estimate big speedups from the inlining, and with today's patch
> and FDO we bypass the thresholds when we know from FDO that the call
> matters.
>
> Concerning your other email, indeed we should consider heavy callees (in
> Open64 terminology) that consume a lot of time, and not skip those call
> sites. An easy way would be to replace the maybe_hot_edge predicate by a
> maybe_hot_call that simply multiplies the count and estimated time. (We
> probably ought to get rid of the time capping and use wider arithmetic
> too.)
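[A rough sketch of the maybe_hot_call idea above: weight the call edge's
profile count by the callee's estimated time, rather than looking at the
count alone. The function name, parameters, and threshold below are
illustrative assumptions, not actual GCC code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical maybe_hot_call predicate.  Uses wide (64-bit) arithmetic
   so count * time cannot overflow, per the remark above about dropping
   the time capping.  */
static bool
maybe_hot_call (uint64_t edge_count, uint64_t callee_estimated_time,
                uint64_t hot_threshold)
{
  /* A call site matters when the total estimated cycles attributed to it
     (count * time) exceed the threshold, even when the raw edge count
     alone would look cold -- the "heavy callee" case.  */
  return edge_count * callee_estimated_time >= hot_threshold;
}
```

[With a threshold of 5000, a callsite with count 10 into a callee of
estimated time 1000 would qualify, although the count alone might fall
below a hot-edge cutoff.]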
That's what we did in the Google branches. We had two heuristics -- a hot
caller heuristic and a hot callee heuristic.

1) For the hot caller heuristic, other simple analyses are checked:
   a) the global working set size;
   b) a callsite argument check -- a very simple check to guess whether
      inlining this callsite would sharpen analysis.

2) We have not tuned the hot callee heuristic by doing more analysis --
   simply turning it on using hotness does not make a noticeable
   difference. Other hints are needed.

David

> I wonder if that is not too local, and if we should not try to estimate
> the cumulative time of the function and get more aggressive about
> inlining over the whole path leading to hot code.
>
> Honza
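[Read literally, the hot caller heuristic David describes could look
roughly like the sketch below. Every name, field, and limit here is
invented for illustration; the Google-branch code itself is not shown in
the thread.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-callsite facts gathered from FDO and a quick scan of the
   arguments (hypothetical structure).  */
struct callsite
{
  bool caller_is_hot;   /* FDO marks the calling function hot.  */
  size_t n_const_args;  /* Arguments known constant at this site.  */
};

/* Hot caller heuristic: inline into a hot caller only when
   (a) the global working set is small enough that the extra code growth
       is unlikely to hurt the I-cache, and
   (b) the simple callsite argument check suggests inlining would expose
       something (e.g. a constant argument) to later analysis.  */
static bool
want_inline_hot_caller_p (const struct callsite *cs,
                          size_t global_working_set,
                          size_t working_set_limit)
{
  if (!cs->caller_is_hot)
    return false;
  if (global_working_set > working_set_limit)   /* check (a) */
    return false;
  return cs->n_const_args > 0;                  /* check (b) */
}
```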