Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-29 Thread Jiufu Guo
Richard Biener  writes:

> On Mon, 28 Oct 2019, Jiufu Guo wrote:
>
>> Richard Biener  writes:
>> 
>> > On Fri, 25 Oct 2019, Jiufu Guo wrote:
>> >
>> >> Hi,
>> >> 
>> >> In PR88760, there are a few disscussion about improve or tune unroller for
>> >> targets. And we would agree to enable unroller for small loops at O2 
>> >> first.
>> >> And we could see performance improvement(~10%) for below code:
>> >> ```
>> >>   subroutine foo (i, i1, block)
>> >> integer :: i, i1
>> >> integer :: block(9, 9, 9)
>> >> block(i:9,1,i1) = block(i:9,1,i1) - 10
>> >>   end subroutine foo
>> >> 
>> >> ```
>> >> This kind of code occurs a few times in exchange2 benchmark.
>> >> 
>> >> Similar C code:
>> >> ```
>> >>   for (i = 0; i < n; i++)
>> >> arr[i] = arr[i] - 10;
>> >> ```
>> >> 
>> >> On powerpc64le, for O2 , enable -funroll-loops and limit
>> >> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
>> >> overall improvement for SPEC2017.
>> >
>> > Note the behavior with LTO will be surprising (and generally when
>> > functions are compiled with different -O level via the optimize
>> > attribute).  It's a bad idea to base --param values on the value
>> > of a global optimization option that can be set per function
>> > (see PR92046).
>> 
>> Thanks Richard,
>> --param values are not save per function. If using different method to
>> compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2
>> -funroll-loops --param xxx), compiling and linking would use different
>> --param value. 
>
> Yes, and more so if you have most units commpiled with -O2 but one
> with -O3 then you get implicit "global" -O3 at link-time and thus
> -O3 --param settings for all -O2 compiled functions.
>
>> >
>> > A better patch would have been to adjust via the target hooks for
>> > unroll adjust.
>> Yeap. And in unroll adjust target hook, we can do fine tunning with more
>> heuristic thougths. I'm going to refine this way.
>
> Thanks.  Note enabling unrolling at -O2 for the target like you did
> should work OK for LTO (even if some units explicitely disable it
> via -fno-unroll-loop).
>
> Richard.
Thanks Richard!
I just send out a new patch which is using loop unroll adjust hook.
The patch is still checking global option setting: x_flag_unroll_loops
to avoid changing the behavior of explicit -funroll-loops.

And then if there is no explicit -funroll-loops on some case,
conservative unrolling (only unroll small loops 2 times) will be
adopted. For example:

-flto -c t.c -O2 -funroll-loops 
-flto t.o

this linking may not be treated as explicit -funroll-loops. 
I thinking to keep this behavior untill removing global setting check.
How do you think?

Thanks in advance.

Jiufu Guo
BR.

>
>> Thanks again.
>> 
>> Jiufu Guo
>> BR.
>> 
>> >
>> > Thanks,
>> > Richard.
>> >
>> >> This patch is only for rs6000 in which we see visible performance 
>> >> improvement.
>> >> 
>> >> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk?
>> >>
>> >> Jiufu Guo
>> >> BR
>> >> 
>> >> 
>> >> gcc/
>> >> 2019-10-25  Jiufu Guo 
>> >> 
>> >>   PR tree-optimization/88760
>> >>   * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
>> >>   O2, enable funroll-loops.
>> >>   * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
>> >>   is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
>> >>   PARAM_MAX_UNROLLED_INSNS=20 for small loops.
>> >>   
>> >> gcc.testsuite/
>> >> 2019-10-25  Jiufu Guo  
>> >> 
>> >>   PR tree-optimization/88760
>> >>   * gcc.target/powerpc/small-loop-unroll.c: New test.
>> >>   * c-c++-common/tsan/thread_leak2.c: Update test.
>> >>   * gcc.dg/pr59643.c: Update test.
>> >>   * gcc.target/powerpc/loop_align.c: Update test.
>> >>   * gcc.target/powerpc/ppc-fma-1.c: Update test.
>> >>   * gcc.target/powerpc/ppc-fma-2.c: Update test.
>> >>   * gcc.target/powerpc/ppc-fma-3.c: Update test.
>> >>   * gcc.target/powerpc/ppc-fma-4.c: Update test.
>> >>   * gcc.target/powerpc/pr78604.c: Update test.
>> >> 
>> >> ---
>> >>  gcc/common/config/rs6000/rs6000-common.c |  1 +
>> >>  gcc/config/rs6000/rs6000.c   | 20 
>> >> 
>> >>  gcc/testsuite/c-c++-common/tsan/thread_leak2.c   |  1 +
>> >>  gcc/testsuite/gcc.dg/pr59643.c   |  1 +
>> >>  gcc/testsuite/gcc.target/powerpc/loop_align.c|  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c |  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c |  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c |  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c |  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/pr78604.c   |  2 +-
>> >>  gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 +
>> >>  11 files changed, 42 insertions(+), 6 deletions(-)
>> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
>> >> 
>> >> dif

Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-28 Thread Richard Biener
On October 28, 2019 6:18:26 PM GMT+01:00, Segher Boessenkool 
 wrote:
>On Mon, Oct 28, 2019 at 09:08:04AM +0100, Richard Biener wrote:
>> On Fri, 25 Oct 2019, Jiufu Guo wrote:
>> > On powerpc64le, for O2 , enable -funroll-loops and limit
>> > PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can
>see >2%
>> > overall improvement for SPEC2017.
>> 
>> Note the behavior with LTO will be surprising (and generally when
>> functions are compiled with different -O level via the optimize
>> attribute).  It's a bad idea to base --param values on the value
>> of a global optimization option that can be set per function
>> (see PR92046).
>> 
>> A better patch would have been to adjust via the target hooks for
>> unroll adjust.
>
>So we should add target hooks for all params, to have them work sanely
>with LTO?

No. There are existing target hooks for the unroller though. 

>What makes params special, different from normal command line options?

They use a different machinery. They should be moved over to the general 
options handling so that per-function settings work. 

Richard. 

>
>Segher



Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-28 Thread Segher Boessenkool
On Mon, Oct 28, 2019 at 09:08:04AM +0100, Richard Biener wrote:
> On Fri, 25 Oct 2019, Jiufu Guo wrote:
> > On powerpc64le, for O2 , enable -funroll-loops and limit
> > PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
> > overall improvement for SPEC2017.
> 
> Note the behavior with LTO will be surprising (and generally when
> functions are compiled with different -O level via the optimize
> attribute).  It's a bad idea to base --param values on the value
> of a global optimization option that can be set per function
> (see PR92046).
> 
> A better patch would have been to adjust via the target hooks for
> unroll adjust.

So we should add target hooks for all params, to have them work sanely
with LTO?

What makes params special, different from normal command line options?


Segher


Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-28 Thread Richard Biener
On Mon, 28 Oct 2019, Jiufu Guo wrote:

> Richard Biener  writes:
> 
> > On Fri, 25 Oct 2019, Jiufu Guo wrote:
> >
> >> Hi,
> >> 
> >> In PR88760, there are a few disscussion about improve or tune unroller for
> >> targets. And we would agree to enable unroller for small loops at O2 first.
> >> And we could see performance improvement(~10%) for below code:
> >> ```
> >>   subroutine foo (i, i1, block)
> >> integer :: i, i1
> >> integer :: block(9, 9, 9)
> >> block(i:9,1,i1) = block(i:9,1,i1) - 10
> >>   end subroutine foo
> >> 
> >> ```
> >> This kind of code occurs a few times in exchange2 benchmark.
> >> 
> >> Similar C code:
> >> ```
> >>   for (i = 0; i < n; i++)
> >> arr[i] = arr[i] - 10;
> >> ```
> >> 
> >> On powerpc64le, for O2 , enable -funroll-loops and limit
> >> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
> >> overall improvement for SPEC2017.
> >
> > Note the behavior with LTO will be surprising (and generally when
> > functions are compiled with different -O level via the optimize
> > attribute).  It's a bad idea to base --param values on the value
> > of a global optimization option that can be set per function
> > (see PR92046).
> 
> Thanks Richard,
> --param values are not save per function. If using different method to
> compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2
> -funroll-loops --param xxx), compiling and linking would use different
> --param value. 

Yes, and more so if you have most units commpiled with -O2 but one
with -O3 then you get implicit "global" -O3 at link-time and thus
-O3 --param settings for all -O2 compiled functions.

> >
> > A better patch would have been to adjust via the target hooks for
> > unroll adjust.
> Yeap. And in unroll adjust target hook, we can do fine tunning with more
> heuristic thougths. I'm going to refine this way.

Thanks.  Note enabling unrolling at -O2 for the target like you did
should work OK for LTO (even if some units explicitely disable it
via -fno-unroll-loop).

Richard.

> Thanks again.
> 
> Jiufu Guo
> BR.
> 
> >
> > Thanks,
> > Richard.
> >
> >> This patch is only for rs6000 in which we see visible performance 
> >> improvement.
> >> 
> >> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk?
> >>
> >> Jiufu Guo
> >> BR
> >> 
> >> 
> >> gcc/
> >> 2019-10-25  Jiufu Guo  
> >> 
> >>PR tree-optimization/88760
> >>* config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
> >>O2, enable funroll-loops.
> >>* config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
> >>is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
> >>PARAM_MAX_UNROLLED_INSNS=20 for small loops.
> >>
> >> gcc.testsuite/
> >> 2019-10-25  Jiufu Guo  
> >> 
> >>PR tree-optimization/88760
> >>* gcc.target/powerpc/small-loop-unroll.c: New test.
> >>* c-c++-common/tsan/thread_leak2.c: Update test.
> >>* gcc.dg/pr59643.c: Update test.
> >>* gcc.target/powerpc/loop_align.c: Update test.
> >>* gcc.target/powerpc/ppc-fma-1.c: Update test.
> >>* gcc.target/powerpc/ppc-fma-2.c: Update test.
> >>* gcc.target/powerpc/ppc-fma-3.c: Update test.
> >>* gcc.target/powerpc/ppc-fma-4.c: Update test.
> >>* gcc.target/powerpc/pr78604.c: Update test.
> >> 
> >> ---
> >>  gcc/common/config/rs6000/rs6000-common.c |  1 +
> >>  gcc/config/rs6000/rs6000.c   | 20 
> >> 
> >>  gcc/testsuite/c-c++-common/tsan/thread_leak2.c   |  1 +
> >>  gcc/testsuite/gcc.dg/pr59643.c   |  1 +
> >>  gcc/testsuite/gcc.target/powerpc/loop_align.c|  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/pr78604.c   |  2 +-
> >>  gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 +
> >>  11 files changed, 42 insertions(+), 6 deletions(-)
> >>  create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
> >> 
> >> diff --git a/gcc/common/config/rs6000/rs6000-common.c 
> >> b/gcc/common/config/rs6000/rs6000-common.c
> >> index 4b0c205..b947196 100644
> >> --- a/gcc/common/config/rs6000/rs6000-common.c
> >> +++ b/gcc/common/config/rs6000/rs6000-common.c
> >> @@ -35,6 +35,7 @@ static const struct default_options 
> >> rs6000_option_optimization_table[] =
> >>  { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
> >>  /* Enable -fsched-pressure for first pass instruction scheduling.  */
> >>  { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
> >> +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 },
> >>  { OPT_LEVELS_NONE, 0, NULL, 0 }
> >>};
> >>  
> >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs60

Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-28 Thread Jiufu Guo
Richard Biener  writes:

> On Fri, 25 Oct 2019, Jiufu Guo wrote:
>
>> Hi,
>> 
>> In PR88760, there are a few disscussion about improve or tune unroller for
>> targets. And we would agree to enable unroller for small loops at O2 first.
>> And we could see performance improvement(~10%) for below code:
>> ```
>>   subroutine foo (i, i1, block)
>> integer :: i, i1
>> integer :: block(9, 9, 9)
>> block(i:9,1,i1) = block(i:9,1,i1) - 10
>>   end subroutine foo
>> 
>> ```
>> This kind of code occurs a few times in exchange2 benchmark.
>> 
>> Similar C code:
>> ```
>>   for (i = 0; i < n; i++)
>> arr[i] = arr[i] - 10;
>> ```
>> 
>> On powerpc64le, for O2 , enable -funroll-loops and limit
>> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
>> overall improvement for SPEC2017.
>
> Note the behavior with LTO will be surprising (and generally when
> functions are compiled with different -O level via the optimize
> attribute).  It's a bad idea to base --param values on the value
> of a global optimization option that can be set per function
> (see PR92046).

Thanks Richard,
--param values are not save per function. If using different method to
compile and link (e.g. compiling with -O2 -flto, linking with -flto -O2
-funroll-loops --param xxx), compiling and linking would use different
--param value. 

>
> A better patch would have been to adjust via the target hooks for
> unroll adjust.
Yeap. And in unroll adjust target hook, we can do fine tunning with more
heuristic thougths. I'm going to refine this way.

Thanks again.

Jiufu Guo
BR.

>
> Thanks,
> Richard.
>
>> This patch is only for rs6000 in which we see visible performance 
>> improvement.
>> 
>> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk?
>>
>> Jiufu Guo
>> BR
>> 
>> 
>> gcc/
>> 2019-10-25  Jiufu Guo
>> 
>>  PR tree-optimization/88760
>>  * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
>>  O2, enable funroll-loops.
>>  * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
>>  is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
>>  PARAM_MAX_UNROLLED_INSNS=20 for small loops.
>>  
>> gcc.testsuite/
>> 2019-10-25  Jiufu Guo  
>> 
>>  PR tree-optimization/88760
>>  * gcc.target/powerpc/small-loop-unroll.c: New test.
>>  * c-c++-common/tsan/thread_leak2.c: Update test.
>>  * gcc.dg/pr59643.c: Update test.
>>  * gcc.target/powerpc/loop_align.c: Update test.
>>  * gcc.target/powerpc/ppc-fma-1.c: Update test.
>>  * gcc.target/powerpc/ppc-fma-2.c: Update test.
>>  * gcc.target/powerpc/ppc-fma-3.c: Update test.
>>  * gcc.target/powerpc/ppc-fma-4.c: Update test.
>>  * gcc.target/powerpc/pr78604.c: Update test.
>> 
>> ---
>>  gcc/common/config/rs6000/rs6000-common.c |  1 +
>>  gcc/config/rs6000/rs6000.c   | 20 
>> 
>>  gcc/testsuite/c-c++-common/tsan/thread_leak2.c   |  1 +
>>  gcc/testsuite/gcc.dg/pr59643.c   |  1 +
>>  gcc/testsuite/gcc.target/powerpc/loop_align.c|  2 +-
>>  gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/pr78604.c   |  2 +-
>>  gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 +
>>  11 files changed, 42 insertions(+), 6 deletions(-)
>>  create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
>> 
>> diff --git a/gcc/common/config/rs6000/rs6000-common.c 
>> b/gcc/common/config/rs6000/rs6000-common.c
>> index 4b0c205..b947196 100644
>> --- a/gcc/common/config/rs6000/rs6000-common.c
>> +++ b/gcc/common/config/rs6000/rs6000-common.c
>> @@ -35,6 +35,7 @@ static const struct default_options 
>> rs6000_option_optimization_table[] =
>>  { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
>>  /* Enable -fsched-pressure for first pass instruction scheduling.  */
>>  { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
>> +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 },
>>  { OPT_LEVELS_NONE, 0, NULL, 0 }
>>};
>>  
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index a129137d..9a8ff9a 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p)
>>   global_options.x_param_values,
>>   global_options_set.x_param_values);
>>  
>> +  /* unroll very small loops 2 time if no -funroll-loops.  */
>> +  if (!global_options_set.x_flag_unroll_loops
>> +  && !global_options_set.x_flag_unroll_all_loops)
>> +{
>> +  maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2,
>> +   

Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-28 Thread Richard Biener
On Fri, 25 Oct 2019, Jiufu Guo wrote:

> Hi,
> 
> In PR88760, there are a few disscussion about improve or tune unroller for
> targets. And we would agree to enable unroller for small loops at O2 first.
> And we could see performance improvement(~10%) for below code:
> ```
>   subroutine foo (i, i1, block)
> integer :: i, i1
> integer :: block(9, 9, 9)
> block(i:9,1,i1) = block(i:9,1,i1) - 10
>   end subroutine foo
> 
> ```
> This kind of code occurs a few times in exchange2 benchmark.
> 
> Similar C code:
> ```
>   for (i = 0; i < n; i++)
> arr[i] = arr[i] - 10;
> ```
> 
> On powerpc64le, for O2 , enable -funroll-loops and limit
> PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
> overall improvement for SPEC2017.

Note the behavior with LTO will be surprising (and generally when
functions are compiled with different -O level via the optimize
attribute).  It's a bad idea to base --param values on the value
of a global optimization option that can be set per function
(see PR92046).

A better patch would have been to adjust via the target hooks for
unroll adjust.

Thanks,
Richard.

> This patch is only for rs6000 in which we see visible performance improvement.
> 
> Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk?
>
> Jiufu Guo
> BR
> 
> 
> gcc/
> 2019-10-25  Jiufu Guo 
> 
>   PR tree-optimization/88760
>   * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
>   O2, enable funroll-loops.
>   * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
>   is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
>   PARAM_MAX_UNROLLED_INSNS=20 for small loops.
>   
> gcc.testsuite/
> 2019-10-25  Jiufu Guo  
> 
>   PR tree-optimization/88760
>   * gcc.target/powerpc/small-loop-unroll.c: New test.
>   * c-c++-common/tsan/thread_leak2.c: Update test.
>   * gcc.dg/pr59643.c: Update test.
>   * gcc.target/powerpc/loop_align.c: Update test.
>   * gcc.target/powerpc/ppc-fma-1.c: Update test.
>   * gcc.target/powerpc/ppc-fma-2.c: Update test.
>   * gcc.target/powerpc/ppc-fma-3.c: Update test.
>   * gcc.target/powerpc/ppc-fma-4.c: Update test.
>   * gcc.target/powerpc/pr78604.c: Update test.
> 
> ---
>  gcc/common/config/rs6000/rs6000-common.c |  1 +
>  gcc/config/rs6000/rs6000.c   | 20 
> 
>  gcc/testsuite/c-c++-common/tsan/thread_leak2.c   |  1 +
>  gcc/testsuite/gcc.dg/pr59643.c   |  1 +
>  gcc/testsuite/gcc.target/powerpc/loop_align.c|  2 +-
>  gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c |  2 +-
>  gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c |  2 +-
>  gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c |  2 +-
>  gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c |  2 +-
>  gcc/testsuite/gcc.target/powerpc/pr78604.c   |  2 +-
>  gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 +
>  11 files changed, 42 insertions(+), 6 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
> 
> diff --git a/gcc/common/config/rs6000/rs6000-common.c 
> b/gcc/common/config/rs6000/rs6000-common.c
> index 4b0c205..b947196 100644
> --- a/gcc/common/config/rs6000/rs6000-common.c
> +++ b/gcc/common/config/rs6000/rs6000-common.c
> @@ -35,6 +35,7 @@ static const struct default_options 
> rs6000_option_optimization_table[] =
>  { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
>  /* Enable -fsched-pressure for first pass instruction scheduling.  */
>  { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
> +{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 },
>  { OPT_LEVELS_NONE, 0, NULL, 0 }
>};
>  
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index a129137d..9a8ff9a 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p)
>global_options.x_param_values,
>global_options_set.x_param_values);
>  
> +  /* unroll very small loops 2 time if no -funroll-loops.  */
> +  if (!global_options_set.x_flag_unroll_loops
> +   && !global_options_set.x_flag_unroll_all_loops)
> + {
> +   maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2,
> +  global_options.x_param_values,
> +  global_options_set.x_param_values);
> +
> +   maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20,
> +  global_options.x_param_values,
> +  global_options_set.x_param_values);
> +
> +   /* If fweb or frename-registers are not specificed in command-line,
> +  do not turn them on implicitly.  */
> +   if (!global_options_set.x_flag_web)
> + global_options.x_flag_web

Re: [PATCH] rs6000: Enable limited unrolling at -O2

2019-10-25 Thread Segher Boessenkool
Hi Jiufu,

On Fri, Oct 25, 2019 at 10:44:39PM +0800, Jiufu Guo wrote:
> In PR88760, there are a few disscussion about improve or tune unroller for
> targets. And we would agree to enable unroller for small loops at O2 first.

[ snip ]

>   PR tree-optimization/88760
>   * config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
>   O2, enable funroll-loops.

* config/rs6000/rs6000-common.c (rs6000_option_optimization_table):
Enable -funroll-loops for -O2 and above.

>   * config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
>   is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
>   PARAM_MAX_UNROLLED_INSNS=20 for small loops.

* config/rs6000/rs6000.c (rs6000_option_override_internal): Set
PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20 if the
unroller is not explicitly enabled.

> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p)
>global_options.x_param_values,
>global_options_set.x_param_values);
>  
> +  /* unroll very small loops 2 time if no -funroll-loops.  */
> +  if (!global_options_set.x_flag_unroll_loops
> +   && !global_options_set.x_flag_unroll_all_loops)
> + {
> +   maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2,
> +  global_options.x_param_values,
> +  global_options_set.x_param_values);
> +
> +   maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20,
> +  global_options.x_param_values,
> +  global_options_set.x_param_values);
> +
> +   /* If fweb or frename-registers are not specificed in command-line,

(specified)

> +  do not turn them on implicitly.  */
> +   if (!global_options_set.x_flag_web)
> + global_options.x_flag_web = 0;
> +   if (!global_options_set.x_flag_rename_registers)
> + global_options.x_flag_rename_registers = 0;
> + }

This web and rnreg thing needs to be in the changelog, too.

> diff --git a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c 
> b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
> index c9b8046..17aa5c6 100644
> --- a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
> +++ b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
> @@ -1,4 +1,5 @@
>  /* { dg-shouldfail "tsan" } */
> +/* { dg-additional-options "-fno-unroll-loops" { target { powerpc-*-* 
> powerpc64-*-* powerpc64le-*-* } } } */

You can write that as just  { target { powerpc*-*-* } }

Could you explain why this is needed here?  In the test file itself,
preferably.

> --- a/gcc/testsuite/gcc.dg/pr59643.c
> +++ b/gcc/testsuite/gcc.dg/pr59643.c
> @@ -1,6 +1,7 @@
>  /* PR tree-optimization/59643 */
>  /* { dg-do compile } */
>  /* { dg-options "-O3 -fdump-tree-pcom-details" } */
> +/* { dg-additional-options "-fno-unroll-loops" { target { powerpc-*-* 
> powerpc64-*-* powerpc64le-*-* } } } */

Same here.  How does this patch change behaviour here, anyway?  The test
uses -O3?

> --- a/gcc/testsuite/gcc.target/powerpc/loop_align.c
> +++ b/gcc/testsuite/gcc.target/powerpc/loop_align.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile { target { powerpc*-*-* } } } */
>  /* { dg-skip-if "" { powerpc*-*-darwin* powerpc-ibm-aix* } } */
> -/* { dg-options "-O2 -mdejagnu-cpu=power7 -falign-functions=16" } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power7 -falign-functions=16 
> -fno-unroll-loops" } */
>  /* { dg-final { scan-assembler ".p2align 5" } } */

For this one it is pretty much obvious.

> --- a/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile { target { powerpc*-*-* } } } */
>  /* { dg-skip-if "" { powerpc*-*-darwin* } } */
>  /* { dg-require-effective-target powerpc_vsx_ok } */
> -/* { dg-options "-O3 -ftree-vectorize -mdejagnu-cpu=power7 -ffast-math" } */
> +/* { dg-options "-O3 -ftree-vectorize -mdejagnu-cpu=power7 -ffast-math 
> -fno-unroll-loops" } */

And here, and all other powerpc-specific tests (they count the generated
machine instructions).

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-rtl-loop2_unroll" } */
> +
> +void __attribute__ ((noinline)) foo(int n, int *arr)
> +{
> +  int i;
> +  for (i = 0; i < n; i++)
> +arr[i] = arr[i] - 10;
> +}
> +/* { dg-final { scan-rtl-dump-times "Unrolled loop 1 times" 1 "loop2_unroll" 
> } } */
> +/* { dg-final { scan-assembler-times "lwz" 3 } } */
> +/* { dg-final { scan-assembler-times "stw" 3 } } */

/* { dg-final { scan-assembler-times {\mlwz\M} 3 } } */
/* { dg-final { scan-assembler-times {\mstw\M} 3 } } */

so that it doesn't count when lwz or stw is only part of the word.  This
doesn't so easily matter for scan-assembler-

[PATCH] rs6000: Enable limited unrolling at -O2

2019-10-25 Thread Jiufu Guo
Hi,

In PR88760, there are a few disscussion about improve or tune unroller for
targets. And we would agree to enable unroller for small loops at O2 first.
And we could see performance improvement(~10%) for below code:
```
  subroutine foo (i, i1, block)
integer :: i, i1
integer :: block(9, 9, 9)
block(i:9,1,i1) = block(i:9,1,i1) - 10
  end subroutine foo

```
This kind of code occurs a few times in exchange2 benchmark.

Similar C code:
```
  for (i = 0; i < n; i++)
arr[i] = arr[i] - 10;
```

On powerpc64le, for O2 , enable -funroll-loops and limit
PARAM_MAX_UNROLL_TIMES=2 and PARAM_MAX_UNROLLED_INSNS=20, we can see >2%
overall improvement for SPEC2017.

This patch is only for rs6000 in which we see visible performance improvement.

Bootstrapped on powerpc64le, and cases are updated. Is this ok for trunk?

Jiufu Guo
BR


gcc/
2019-10-25  Jiufu Guo   

PR tree-optimization/88760
* config/rs6000/rs6000-common.c (rs6000_option_optimization_table): for
O2, enable funroll-loops.
* config/rs6000/rs6000.c (rs6000_option_override_internal): if unroller
is enabled throught O2, set constrains to PARAM_MAX_UNROLL_TIMES=2 and
PARAM_MAX_UNROLLED_INSNS=20 for small loops.

gcc.testsuite/
2019-10-25  Jiufu Guo  

PR tree-optimization/88760
* gcc.target/powerpc/small-loop-unroll.c: New test.
* c-c++-common/tsan/thread_leak2.c: Update test.
* gcc.dg/pr59643.c: Update test.
* gcc.target/powerpc/loop_align.c: Update test.
* gcc.target/powerpc/ppc-fma-1.c: Update test.
* gcc.target/powerpc/ppc-fma-2.c: Update test.
* gcc.target/powerpc/ppc-fma-3.c: Update test.
* gcc.target/powerpc/ppc-fma-4.c: Update test.
* gcc.target/powerpc/pr78604.c: Update test.

---
 gcc/common/config/rs6000/rs6000-common.c |  1 +
 gcc/config/rs6000/rs6000.c   | 20 
 gcc/testsuite/c-c++-common/tsan/thread_leak2.c   |  1 +
 gcc/testsuite/gcc.dg/pr59643.c   |  1 +
 gcc/testsuite/gcc.target/powerpc/loop_align.c|  2 +-
 gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c |  2 +-
 gcc/testsuite/gcc.target/powerpc/pr78604.c   |  2 +-
 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c | 13 +
 11 files changed, 42 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c

diff --git a/gcc/common/config/rs6000/rs6000-common.c 
b/gcc/common/config/rs6000/rs6000-common.c
index 4b0c205..b947196 100644
--- a/gcc/common/config/rs6000/rs6000-common.c
+++ b/gcc/common/config/rs6000/rs6000-common.c
@@ -35,6 +35,7 @@ static const struct default_options 
rs6000_option_optimization_table[] =
 { OPT_LEVELS_ALL, OPT_fsplit_wide_types_early, NULL, 1 },
 /* Enable -fsched-pressure for first pass instruction scheduling.  */
 { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
+{ OPT_LEVELS_2_PLUS, OPT_funroll_loops, NULL, 1 },
 { OPT_LEVELS_NONE, 0, NULL, 0 }
   };
 
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index a129137d..9a8ff9a 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -4540,6 +4540,26 @@ rs6000_option_override_internal (bool global_init_p)
 global_options.x_param_values,
 global_options_set.x_param_values);
 
+  /* unroll very small loops 2 time if no -funroll-loops.  */
+  if (!global_options_set.x_flag_unroll_loops
+ && !global_options_set.x_flag_unroll_all_loops)
+   {
+ maybe_set_param_value (PARAM_MAX_UNROLL_TIMES, 2,
+global_options.x_param_values,
+global_options_set.x_param_values);
+
+ maybe_set_param_value (PARAM_MAX_UNROLLED_INSNS, 20,
+global_options.x_param_values,
+global_options_set.x_param_values);
+
+ /* If fweb or frename-registers are not specificed in command-line,
+do not turn them on implicitly.  */
+ if (!global_options_set.x_flag_web)
+   global_options.x_flag_web = 0;
+ if (!global_options_set.x_flag_rename_registers)
+   global_options.x_flag_rename_registers = 0;
+   }
+
   /* If using typedef char *va_list, signal that
 __builtin_va_start (&ap, 0) can be optimized to
 ap = __builtin_next_arg (0).  */
diff --git a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c 
b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
index c9b8046..17aa5c6 100644
--- a/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
+++ b/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
@@ -1,4 +1,5 @@
 /* { dg-shouldfail "tsan"