On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <l...@redhat.com> wrote: > On 11/19/13 10:24, Teresa Johnson wrote: >> >> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubi...@ucw.cz> wrote: >>> >>> Martin, >>> can you, please, generate the updated systemtap with >>> -freorder-blocks-and-partition enabled? >>> >>> I am in favour of enabling this - it is usefull pass and it is pointless >>> ot >>> have passes that are not enabled by default. >>> Is there reason why this would not work on other ELF target? Is it >>> working >>> with Darwin and Windows? >> >> >> I don't know how to test these (I don't see any machines listed in the >> gcc compile farm of those types). For Windows, I assume you mean >> MinGW, which should be enabled as it is under i386. Should I disable >> it there and for Darwin? >> >>> >>>> This patch enables -freorder-blocks-and-partition by default for x86 >>>> at -O2 and up. It is showing some modest gains in cpu2006 performance >>>> with profile feedback and -O2 on an Intel Westmere system. Specifically, >>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 483.xalancbmk >>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions. >>> >>> >>> This actually sounds very good ;) >>> >>> Lets see how the systemtap graphs goes. If we will end up with problem >>> of too many accesses to cold section, I would suggest making cold section >>> subdivided into .unlikely and .unlikely.part (we could have better name) >>> with the second consisting only of unlikely parts of hot&normal >>> functions. >>> >>> This should reduce the problems we are seeing with mistakely identifying >>> code to be cold because of roundoff errors (and it probably makes sense >>> in general, too). >>> We will however need to update gold and ld for that. >> >> >> Note that I don't think this would help much unless the linker is >> changed to move the cold split section close to the hot section. There >> is probably some fine-tuning we could do eventually in the linker >> under -ffunction-sections without putting the split portions in a >> separate section. I.e. clump the split parts together within unlikely. >> But hopefully this can all be done later on as follow-on work to boost >> the performance further. >> >>>> >>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal >>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were >>>> configured with --enable-languages=all,obj-c++ and tested for both >>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}". >>>> >>>> It would be good to enable this for additional targets as a follow on, >>>> but it needs more testing for both correctness and performance on those >>>> other targets (i.e for correctness because I see a number of places >>>> in other config/*/*.c files that do some special handling under this >>>> option for different targets or simply disable it, so I am not sure >>>> how well-tested it is under different architectural constraints). >>>> >>>> Ok for trunk? >>>> >>>> Thanks, >>>> Teresa >>>> >>>> 2013-11-19 Teresa Johnson <tejohn...@google.com> >>>> >>>> * common/config/i386/i386-common.c: Enable >>>> -freorder-blocks-and-partition at -O2 and up for x86. >>>> * opts.c (finish_options): Only warn if -freorder-blocks-and- >>>> partition was set on command line. >>> >>> >>> You probably mis doc/invoke.texi update. >>> Thank you for working on this! >> >> >> Yes, thanks. Here is the patch with the invoke.texi update. >> >> Teresa >> >> >> 2013-11-19 Teresa Johnson <tejohn...@google.com> >> >> * common/config/i386/i386-common.c: Enable >> -freorder-blocks-and-partition at -O2 and up for x86. >> * doc/invoke.texi: Update -freorder-blocks-and-partition default. >> * opts.c (finish_options): Only warn if -freorder-blocks-and- >> partition was set on command line. > > This is fine. Glad to see we're getting some good benefits from this code.
Ok, thanks. It is in as r205058. > > FWIW, I would expect the PA to show benefits from this work -- HP showed > good results for block positioning and procedure splitting eons ago. A > secondary effect you would expect to see would be an increase in the number > of long branch stubs (static count), but a decrease in the number of those > stubs that actually execute. Measuring those effects would obviously be > out-of-scope. > > I totally understand that the PA is irrelevant these days, but seeing good > results on another architecture would go a long way to answering the > question "should this be enabled by default across the board?" Yep, we saw benefits for IPF on hp-ux as well. It would be great to eventually enable this across the board and only selectively disable. I know there were people interested in trying this with ARM, that would be a good relevant arch to try this with now to see if it can be enabled more widely. Teresa > > jeff > > -- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413