Re: [PATCH] rs6000: Enable limited unrolling at -O2
Richard Biener writes: > On Mon, 28 Oct 2019, Jiufu Guo wrote: > >> Richard Biener writes: >> >> > On Fri, 25 Oct 2019, Jiufu Guo wrote: >> > >> >> Hi, >> >> >> >> In PR88760, there are a few disscussion about improve or tune unroller for >> >> targets. And we would agree to enable unroller for small loops at O2 >> >> first. >> >> And we could see performance improvement(~10%) for below code: >> >> ``` >> >> subroutine foo (i, i1, block) >> >> integer :: i, i1 >> >> integer :: block(9, 9, 9) >> >> block(i:9,1,i1) = block(i:9,1,i1) - 10 >> >> end subroutine foo >> >> >> >> ``` >> >> This kind of code occurs a few times in exchange2 benchmark. >> >> >> >> Similar C code: >> >> ``` >> >> for (i = 0; i < n; i++) >> >> arr[i] = arr[i] - 10; >> >> ``` >> >> >> >> On powerpc64le, for O2 , enable -funroll-loops and limit >> >> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% >> >> overall improvement for SPEC2017. >> > >> > Note the behavior with LTO will be surprising (and generally when >> > functions are compiled with different -O level via the optimize >> > attribute). It's a bad idea to base --param values on the value >> > of a global optimization option that can be set per function >> > (see PR92046). >> >> Thanks Richard, >> --param values are not save per function. If using different method to >> compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2 >> -funroll-loops --param xxx), compiling and linking would use different >> --param value. > > Yes, and more so if you have most units commpiled with -O2 but one > with -O3 then you get implicit "global" -O3 at link-time and thus > -O3 --param settings for all -O2 compiled functions. > >> > >> > A better patch would have been to adjust via the target hooks for >> > unroll adjust. >> Yeap. And in unroll adjust target hook, we can do fine tunning with more >> heuristic thougths. I'm going to refine this way. > > Thanks. Note enabling unrolling at -O2 for the target like you did > should work OK for LTO (even if some units explicitely disable it > via -fno-unroll-loop). > > Richard. Thanks Richard! I just send out a new patch which is using loop unroll adjust hook. The patch is still checking global option setting: x_flag_unroll_loops to avoid changing the behavior of explicit -funroll-loops. And then if there is no explicit -funroll-loops on some case, conservative unrolling (only unroll small loops 2 times) will be adopted. For example: -flto -c t.c -O2 -funroll-loops -flto t.o this linking may not be treated as explicit -funroll-loops. I thinking to keep this behavior untill removing global setting check. How do you think? Thanks in advance. Jiufu Guo BR. > >> Thanks again. >> >> Jiufu Guo >> BR. >> >> > >> > Thanks, >> > Richard. >> > >> >> This patch is only for rs6000 in which we see visible performance >> >> improvement. >> >> >> >> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk? >> >> >> >> Jiufu Guo >> >> BR >> >> >> >> >> >> gcc/ >> >> 2019-10-25 Jiufu Guo >> >> >> >> PR tree-optimization/88760 >> >> * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for >> >> O2, enable funroll-loops. >> >> * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller >> >> is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and >> >> PARAM_MAX_UNROLLED_INSNS=20 for small loops. >> >> >> >> gcc.testsuite/ >> >> 2019-10-25 Jiufu Guo >> >> >> >> PR tree-optimization/88760 >> >> * gcc.target/powerpc/small-loop-unroll.c: New test. >> >> * c-c++-common/tsan/thread_leak2.c: Update test. >> >> * gcc.dg/pr59643.c: Update test. >> >> * gcc.target/powerpc/loop_align.c: Update test. >> >> * gcc.target/powerpc/ppc-fma-1.c: Update test. >> >> * gcc.target/powerpc/ppc-fma-2.c: Update test. >> >> * gcc.target/powerpc/ppc-fma-3.c: Update test. >> >> * gcc.target/powerpc/ppc-fma-4.c: Update test. >> >> * gcc.target/powerpc/pr78604.c: Update test. >> >> >> >> --- >> >> gcc/common/config/rs6000/rs6000-common.c | 1 + >> >> gcc/config/rs6000/rs6000.c | 20 >> >> >> >> gcc/testsuite/c-c++-common/tsan/thread_leak2.c | 1 + >> >> gcc/testsuite/gcc.dg/pr59643.c | 1 + >> >> gcc/testsuite/gcc.target/powerpc/loop_align.c| 2 +- >> >> gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c | 2 +- >> >> gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c | 2 +- >> >> gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c | 2 +- >> >> gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c | 2 +- >> >> gcc/testsuite/gcc.target/powerpc/pr78604.c | 2 +- >> >> gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 + >> >> 11 files changed, 42 insertions(+), 6 deletions(-) >> >> create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c >> >> >> >> dif
Re: [PATCH] rs6000: Enable limited unrolling at -O2
On October 28, 2019 6:18:26 PM GMT+01:00, Segher Boessenkool wrote: >On Mon, Oct 28, 2019 at 09:08:04AM +0100, Richard Biener wrote: >> On Fri, 25 Oct 2019, Jiufu Guo wrote: >> > On powerpc64le, for O2 , enable -funroll-loops and limit >> > PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can >see >2% >> > overall improvement for SPEC2017. >> >> Note the behavior with LTO will be surprising (and generally when >> functions are compiled with different -O level via the optimize >> attribute). It's a bad idea to base --param values on the value >> of a global optimization option that can be set per function >> (see PR92046). >> >> A better patch would have been to adjust via the target hooks for >> unroll adjust. > >So we should add target hooks for all params, to have them work sanely >with LTO? No. There are existing target hooks for the unroller though. >What makes params special, different from normal command line options? They use a different machinery. They should be moved over to the general options handling so that per-function settings work. Richard. > >Segher
Re: [PATCH] rs6000: Enable limited unrolling at -O2
On Mon, Oct 28, 2019 at 09:08:04AM +0100, Richard Biener wrote: > On Fri, 25 Oct 2019, Jiufu Guo wrote: > > On powerpc64le, for O2 , enable -funroll-loops and limit > > PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% > > overall improvement for SPEC2017. > > Note the behavior with LTO will be surprising (and generally when > functions are compiled with different -O level via the optimize > attribute). It's a bad idea to base --param values on the value > of a global optimization option that can be set per function > (see PR92046). > > A better patch would have been to adjust via the target hooks for > unroll adjust. So we should add target hooks for all params, to have them work sanely with LTO? What makes params special, different from normal command line options? Segher
Re: [PATCH] rs6000: Enable limited unrolling at -O2
On Mon, 28 Oct 2019, Jiufu Guo wrote: > Richard Biener writes: > > > On Fri, 25 Oct 2019, Jiufu Guo wrote: > > > >> Hi, > >> > >> In PR88760, there are a few disscussion about improve or tune unroller for > >> targets. And we would agree to enable unroller for small loops at O2 first. > >> And we could see performance improvement(~10%) for below code: > >> ``` > >> subroutine foo (i, i1, block) > >> integer :: i, i1 > >> integer :: block(9, 9, 9) > >> block(i:9,1,i1) = block(i:9,1,i1) - 10 > >> end subroutine foo > >> > >> ``` > >> This kind of code occurs a few times in exchange2 benchmark. > >> > >> Similar C code: > >> ``` > >> for (i = 0; i < n; i++) > >> arr[i] = arr[i] - 10; > >> ``` > >> > >> On powerpc64le, for O2 , enable -funroll-loops and limit > >> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% > >> overall improvement for SPEC2017. > > > > Note the behavior with LTO will be surprising (and generally when > > functions are compiled with different -O level via the optimize > > attribute). It's a bad idea to base --param values on the value > > of a global optimization option that can be set per function > > (see PR92046). > > Thanks Richard, > --param values are not save per function. If using different method to > compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2 > -funroll-loops --param xxx), compiling and linking would use different > --param value. Yes, and more so if you have most units commpiled with -O2 but one with -O3 then you get implicit "global" -O3 at link-time and thus -O3 --param settings for all -O2 compiled functions. > > > > A better patch would have been to adjust via the target hooks for > > unroll adjust. > Yeap. And in unroll adjust target hook, we can do fine tunning with more > heuristic thougths. I'm going to refine this way. Thanks. Note enabling unrolling at -O2 for the target like you did should work OK for LTO (even if some units explicitely disable it via -fno-unroll-loop). Richard. > Thanks again. > > Jiufu Guo > BR. > > > > > Thanks, > > Richard. > > > >> This patch is only for rs6000 in which we see visible performance > >> improvement. > >> > >> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk? > >> > >> Jiufu Guo > >> BR > >> > >> > >> gcc/ > >> 2019-10-25 Jiufu Guo > >> > >>PR tree-optimization/88760 > >>* config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for > >>O2, enable funroll-loops. > >>* config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller > >>is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and > >>PARAM_MAX_UNROLLED_INSNS=20 for small loops. > >> > >> gcc.testsuite/ > >> 2019-10-25 Jiufu Guo > >> > >>PR tree-optimization/88760 > >>* gcc.target/powerpc/small-loop-unroll.c: New test. > >>* c-c++-common/tsan/thread_leak2.c: Update test. > >>* gcc.dg/pr59643.c: Update test. > >>* gcc.target/powerpc/loop_align.c: Update test. > >>* gcc.target/powerpc/ppc-fma-1.c: Update test. > >>* gcc.target/powerpc/ppc-fma-2.c: Update test. > >>* gcc.target/powerpc/ppc-fma-3.c: Update test. > >>* gcc.target/powerpc/ppc-fma-4.c: Update test. > >>* gcc.target/powerpc/pr78604.c: Update test. > >> > >> --- > >> gcc/common/config/rs6000/rs6000-common.c | 1 + > >> gcc/config/rs6000/rs6000.c | 20 > >> > >> gcc/testsuite/c-c++-common/tsan/thread_leak2.c | 1 + > >> gcc/testsuite/gcc.dg/pr59643.c | 1 + > >> gcc/testsuite/gcc.target/powerpc/loop_align.c| 2 +- > >> gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c | 2 +- > >> gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c | 2 +- > >> gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c | 2 +- > >> gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c | 2 +- > >> gcc/testsuite/gcc.target/powerpc/pr78604.c | 2 +- > >> gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 + > >> 11 files changed, 42 insertions(+), 6 deletions(-) > >> create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c > >> > >> diff --git a/gcc/common/config/rs6000/rs6000-common.c > >> b/gcc/common/config/rs6000/rs6000-common.c > >> index 4b0c205..b947196 100644 > >> --- a/gcc/common/config/rs6000/rs6000-common.c > >> +++ b/gcc/common/config/rs6000/rs6000-common.c > >> @@ -35,6 +35,7 @@ static const struct default_options > >> rs6000_option_optimization_table[] = > >> { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 }, > >> /* Enable -fsched-pressure for first pass instruction scheduling. */ > >> { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 }, > >> +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 }, > >> { OPT_LEVELS_NONE, 0, NULL, 0 } > >>}; > >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs60
Re: [PATCH] rs6000: Enable limited unrolling at -O2
Richard Biener writes: > On Fri, 25 Oct 2019, Jiufu Guo wrote: > >> Hi, >> >> In PR88760, there are a few disscussion about improve or tune unroller for >> targets. And we would agree to enable unroller for small loops at O2 first. >> And we could see performance improvement(~10%) for below code: >> ``` >> subroutine foo (i, i1, block) >> integer :: i, i1 >> integer :: block(9, 9, 9) >> block(i:9,1,i1) = block(i:9,1,i1) - 10 >> end subroutine foo >> >> ``` >> This kind of code occurs a few times in exchange2 benchmark. >> >> Similar C code: >> ``` >> for (i = 0; i < n; i++) >> arr[i] = arr[i] - 10; >> ``` >> >> On powerpc64le, for O2 , enable -funroll-loops and limit >> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% >> overall improvement for SPEC2017. > > Note the behavior with LTO will be surprising (and generally when > functions are compiled with different -O level via the optimize > attribute). It's a bad idea to base --param values on the value > of a global optimization option that can be set per function > (see PR92046). Thanks Richard, --param values are not save per function. If using different method to compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2 -funroll-loops --param xxx), compiling and linking would use different --param value. > > A better patch would have been to adjust via the target hooks for > unroll adjust. Yeap. And in unroll adjust target hook, we can do fine tunning with more heuristic thougths. I'm going to refine this way. Thanks again. Jiufu Guo BR. > > Thanks, > Richard. > >> This patch is only for rs6000 in which we see visible performance >> improvement. >> >> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk? >> >> Jiufu Guo >> BR >> >> >> gcc/ >> 2019-10-25 Jiufu Guo >> >> PR tree-optimization/88760 >> * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for >> O2, enable funroll-loops. >> * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller >> is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and >> PARAM_MAX_UNROLLED_INSNS=20 for small loops. >> >> gcc.testsuite/ >> 2019-10-25 Jiufu Guo >> >> PR tree-optimization/88760 >> * gcc.target/powerpc/small-loop-unroll.c: New test. >> * c-c++-common/tsan/thread_leak2.c: Update test. >> * gcc.dg/pr59643.c: Update test. >> * gcc.target/powerpc/loop_align.c: Update test. >> * gcc.target/powerpc/ppc-fma-1.c: Update test. >> * gcc.target/powerpc/ppc-fma-2.c: Update test. >> * gcc.target/powerpc/ppc-fma-3.c: Update test. >> * gcc.target/powerpc/ppc-fma-4.c: Update test. >> * gcc.target/powerpc/pr78604.c: Update test. >> >> --- >> gcc/common/config/rs6000/rs6000-common.c | 1 + >> gcc/config/rs6000/rs6000.c | 20 >> >> gcc/testsuite/c-c++-common/tsan/thread_leak2.c | 1 + >> gcc/testsuite/gcc.dg/pr59643.c | 1 + >> gcc/testsuite/gcc.target/powerpc/loop_align.c| 2 +- >> gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c | 2 +- >> gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c | 2 +- >> gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c | 2 +- >> gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c | 2 +- >> gcc/testsuite/gcc.target/powerpc/pr78604.c | 2 +- >> gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 + >> 11 files changed, 42 insertions(+), 6 deletions(-) >> create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c >> >> diff --git a/gcc/common/config/rs6000/rs6000-common.c >> b/gcc/common/config/rs6000/rs6000-common.c >> index 4b0c205..b947196 100644 >> --- a/gcc/common/config/rs6000/rs6000-common.c >> +++ b/gcc/common/config/rs6000/rs6000-common.c >> @@ -35,6 +35,7 @@ static const struct default_options >> rs6000_option_optimization_table[] = >> { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 }, >> /* Enable -fsched-pressure for first pass instruction scheduling. */ >> { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 }, >> +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 }, >> { OPT_LEVELS_NONE, 0, NULL, 0 } >>}; >> >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index a129137d..9a8ff9a 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p) >> global_options.x_param_values, >> global_options_set.x_param_values); >> >> + /* unroll very small loops 2 time if no -funroll-loops. */ >> + if (!global_options_set.x_flag_unroll_loops >> + && !global_options_set.x_flag_unroll_all_loops) >> +{ >> + maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2, >> +
Re: [PATCH] rs6000: Enable limited unrolling at -O2
On Fri, 25 Oct 2019, Jiufu Guo wrote: > Hi, > > In PR88760, there are a few disscussion about improve or tune unroller for > targets. And we would agree to enable unroller for small loops at O2 first. > And we could see performance improvement(~10%) for below code: > ``` > subroutine foo (i, i1, block) > integer :: i, i1 > integer :: block(9, 9, 9) > block(i:9,1,i1) = block(i:9,1,i1) - 10 > end subroutine foo > > ``` > This kind of code occurs a few times in exchange2 benchmark. > > Similar C code: > ``` > for (i = 0; i < n; i++) > arr[i] = arr[i] - 10; > ``` > > On powerpc64le, for O2 , enable -funroll-loops and limit > PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% > overall improvement for SPEC2017. Note the behavior with LTO will be surprising (and generally when functions are compiled with different -O level via the optimize attribute). It's a bad idea to base --param values on the value of a global optimization option that can be set per function (see PR92046). A better patch would have been to adjust via the target hooks for unroll adjust. Thanks, Richard. > This patch is only for rs6000 in which we see visible performance improvement. > > Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk? > > Jiufu Guo > BR > > > gcc/ > 2019-10-25 Jiufu Guo > > PR tree-optimization/88760 > * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for > O2, enable funroll-loops. > * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller > is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and > PARAM_MAX_UNROLLED_INSNS=20 for small loops. > > gcc.testsuite/ > 2019-10-25 Jiufu Guo > > PR tree-optimization/88760 > * gcc.target/powerpc/small-loop-unroll.c: New test. > * c-c++-common/tsan/thread_leak2.c: Update test. > * gcc.dg/pr59643.c: Update test. > * gcc.target/powerpc/loop_align.c: Update test. > * gcc.target/powerpc/ppc-fma-1.c: Update test. > * gcc.target/powerpc/ppc-fma-2.c: Update test. > * gcc.target/powerpc/ppc-fma-3.c: Update test. > * gcc.target/powerpc/ppc-fma-4.c: Update test. > * gcc.target/powerpc/pr78604.c: Update test. > > --- > gcc/common/config/rs6000/rs6000-common.c | 1 + > gcc/config/rs6000/rs6000.c | 20 > > gcc/testsuite/c-c++-common/tsan/thread_leak2.c | 1 + > gcc/testsuite/gcc.dg/pr59643.c | 1 + > gcc/testsuite/gcc.target/powerpc/loop_align.c| 2 +- > gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c | 2 +- > gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c | 2 +- > gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c | 2 +- > gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c | 2 +- > gcc/testsuite/gcc.target/powerpc/pr78604.c | 2 +- > gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 + > 11 files changed, 42 insertions(+), 6 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c > > diff --git a/gcc/common/config/rs6000/rs6000-common.c > b/gcc/common/config/rs6000/rs6000-common.c > index 4b0c205..b947196 100644 > --- a/gcc/common/config/rs6000/rs6000-common.c > +++ b/gcc/common/config/rs6000/rs6000-common.c > @@ -35,6 +35,7 @@ static const struct default_options > rs6000_option_optimization_table[] = > { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 }, > /* Enable -fsched-pressure for first pass instruction scheduling. */ > { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 }, > +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 }, > { OPT_LEVELS_NONE, 0, NULL, 0 } >}; > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index a129137d..9a8ff9a 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p) >global_options.x_param_values, >global_options_set.x_param_values); > > + /* unroll very small loops 2 time if no -funroll-loops. */ > + if (!global_options_set.x_flag_unroll_loops > + && !global_options_set.x_flag_unroll_all_loops) > + { > + maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2, > + global_options.x_param_values, > + global_options_set.x_param_values); > + > + maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20, > + global_options.x_param_values, > + global_options_set.x_param_values); > + > + /* If fweb or frename-registers are not specificed in command-line, > + do not turn them on implicitly. */ > + if (!global_options_set.x_flag_web) > + global_options.x_flag_web
Re: [PATCH] rs6000: Enable limited unrolling at -O2
Hi Jiufu, On Fri, Oct 25, 2019 at 10:44:39PM +0800, Jiufu Guo wrote: > In PR88760, there are a few disscussion about improve or tune unroller for > targets. And we would agree to enable unroller for small loops at O2 first. [ snip ] > PR tree-optimization/88760 > * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for > O2, enable funroll-loops. * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): Enable -funroll-loops for -O2 and above. > * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller > is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and > PARAM_MAX_UNROLLED_INSNS=20 for small loops. * config/rs6000/rs6000.c (rs6000_option_override_internal): Set PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20 if the unroller is not explicitly enabled. > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p) >global_options.x_param_values, >global_options_set.x_param_values); > > + /* unroll very small loops 2 time if no -funroll-loops. */ > + if (!global_options_set.x_flag_unroll_loops > + && !global_options_set.x_flag_unroll_all_loops) > + { > + maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2, > + global_options.x_param_values, > + global_options_set.x_param_values); > + > + maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20, > + global_options.x_param_values, > + global_options_set.x_param_values); > + > + /* If fweb or frename-registers are not specificed in command-line, (specified) > + do not turn them on implicitly. */ > + if (!global_options_set.x_flag_web) > + global_options.x_flag_web = 0; > + if (!global_options_set.x_flag_rename_registers) > + global_options.x_flag_rename_registers = 0; > + } This web and rnreg thing needs to be in the changelog, too. > diff --git a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c > b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c > index c9b8046..17aa5c6 100644 > --- a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c > +++ b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c > @@ -1,4 +1,5 @@ > /* { dg-shouldfail "tsan" } */ > +/* { dg-additional-options "-fno-unroll-loops" { target { powerpc-*-* > powerpc64-*-* powerpc64le-*-* } } } */ You can write that as just { target { powerpc*-*-* } } Could you explain why this is needed here? In the test file itself, preferably. > --- a/gcc/testsuite/gcc.dg/pr59643.c > +++ b/gcc/testsuite/gcc.dg/pr59643.c > @@ -1,6 +1,7 @@ > /* PR tree-optimization/59643 */ > /* { dg-do compile } */ > /* { dg-options "-O3 -fdump-tree-pcom-details" } */ > +/* { dg-additional-options "-fno-unroll-loops" { target { powerpc-*-* > powerpc64-*-* powerpc64le-*-* } } } */ Same here. How does this patch change behaviour here, anyway? The test uses -O3? > --- a/gcc/testsuite/gcc.target/powerpc/loop_align.c > +++ b/gcc/testsuite/gcc.target/powerpc/loop_align.c > @@ -1,6 +1,6 @@ > /* { dg-do compile { target { powerpc*-*-* } } } */ > /* { dg-skip-if "" { powerpc*-*-darwin* powerpc-ibm-aix* } } */ > -/* { dg-options "-O2 -mdejagnu-cpu=power7 -falign-functions=16" } */ > +/* { dg-options "-O2 -mdejagnu-cpu=power7 -falign-functions=16 > -fno-unroll-loops" } */ > /* { dg-final { scan-assembler ".p2align 5" } } */ For this one it is pretty much obvious. > --- a/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c > +++ b/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c > @@ -1,7 +1,7 @@ > /* { dg-do compile { target { powerpc*-*-* } } } */ > /* { dg-skip-if "" { powerpc*-*-darwin* } } */ > /* { dg-require-effective-target powerpc_vsx_ok } */ > -/* { dg-options "-O3 -ftree-vectorize -mdejagnu-cpu=power7 -ffast-math" } */ > +/* { dg-options "-O3 -ftree-vectorize -mdejagnu-cpu=power7 -ffast-math > -fno-unroll-loops" } */ And here, and all other powerpc-specific tests (they count the generated machine instructions). > --- /dev/null > +++ b/gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c > @@ -0,0 +1,13 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -fdump-rtl-loop2_unroll" } */ > + > +void __attribute__ ((noinline)) foo(int n, int *arr) > +{ > + int i; > + for (i = 0; i < n; i++) > +arr[i] = arr[i] - 10; > +} > +/* { dg-final { scan-rtl-dump-times "Unrolled loop 1 times" 1 "loop2_unroll" > } } */ > +/* { dg-final { scan-assembler-times "lwz" 3 } } */ > +/* { dg-final { scan-assembler-times "stw" 3 } } */ /* { dg-final { scan-assembler-times {\mlwz\M} 3 } } */ /* { dg-final { scan-assembler-times {\mstw\M} 3 } } */ so that it doesn't count when lwz or stw is only part of the word. This doesn't so easily matter for scan-assembler-
[PATCH] rs6000: Enable limited unrolling at -O2
Hi, In PR88760, there are a few disscussion about improve or tune unroller for targets. And we would agree to enable unroller for small loops at O2 first. And we could see performance improvement(~10%) for below code: ``` subroutine foo (i, i1, block) integer :: i, i1 integer :: block(9, 9, 9) block(i:9,1,i1) = block(i:9,1,i1) - 10 end subroutine foo ``` This kind of code occurs a few times in exchange2 benchmark. Similar C code: ``` for (i = 0; i < n; i++) arr[i] = arr[i] - 10; ``` On powerpc64le, for O2 , enable -funroll-loops and limit PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2% overall improvement for SPEC2017. This patch is only for rs6000 in which we see visible performance improvement. Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk? Jiufu Guo BR gcc/ 2019-10-25 Jiufu Guo PR tree-optimization/88760 * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for O2, enable funroll-loops. * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20 for small loops. gcc.testsuite/ 2019-10-25 Jiufu Guo PR tree-optimization/88760 * gcc.target/powerpc/small-loop-unroll.c: New test. * c-c++-common/tsan/thread_leak2.c: Update test. * gcc.dg/pr59643.c: Update test. * gcc.target/powerpc/loop_align.c: Update test. * gcc.target/powerpc/ppc-fma-1.c: Update test. * gcc.target/powerpc/ppc-fma-2.c: Update test. * gcc.target/powerpc/ppc-fma-3.c: Update test. * gcc.target/powerpc/ppc-fma-4.c: Update test. * gcc.target/powerpc/pr78604.c: Update test. --- gcc/common/config/rs6000/rs6000-common.c | 1 + gcc/config/rs6000/rs6000.c | 20 gcc/testsuite/c-c++-common/tsan/thread_leak2.c | 1 + gcc/testsuite/gcc.dg/pr59643.c | 1 + gcc/testsuite/gcc.target/powerpc/loop_align.c| 2 +- gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c | 2 +- gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c | 2 +- gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c | 2 +- gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c | 2 +- gcc/testsuite/gcc.target/powerpc/pr78604.c | 2 +- gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 + 11 files changed, 42 insertions(+), 6 deletions(-) create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c diff --git a/gcc/common/config/rs6000/rs6000-common.c b/gcc/common/config/rs6000/rs6000-common.c index 4b0c205..b947196 100644 --- a/gcc/common/config/rs6000/rs6000-common.c +++ b/gcc/common/config/rs6000/rs6000-common.c @@ -35,6 +35,7 @@ static const struct default_options rs6000_option_optimization_table[] = { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 }, /* Enable -fsched-pressure for first pass instruction scheduling. */ { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 }, +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 }, { OPT_LEVELS_NONE, 0, NULL, 0 } }; diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index a129137d..9a8ff9a 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p) global_options.x_param_values, global_options_set.x_param_values); + /* unroll very small loops 2 time if no -funroll-loops. */ + if (!global_options_set.x_flag_unroll_loops + && !global_options_set.x_flag_unroll_all_loops) + { + maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2, +global_options.x_param_values, +global_options_set.x_param_values); + + maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20, +global_options.x_param_values, +global_options_set.x_param_values); + + /* If fweb or frename-registers are not specificed in command-line, +do not turn them on implicitly. */ + if (!global_options_set.x_flag_web) + global_options.x_flag_web = 0; + if (!global_options_set.x_flag_rename_registers) + global_options.x_flag_rename_registers = 0; + } + /* If using typedef char *va_list, signal that __builtin_va_start (&ap, 0) can be optimized to ap = __builtin_next_arg (0). */ diff --git a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c index c9b8046..17aa5c6 100644 --- a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c +++ b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c @@ -1,4 +1,5 @@ /* { dg-shouldfail "tsan"