Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-13 Thread Hongyu Wang via Gcc-patches
> OK. Note that the GCC documentation has been ported to Sphinx, so you need to
> port your invoke.texi changes to the new Sphinx files.

Yes, this is the patch I'm going to check in. Thanks.


Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-13 Thread Hongtao Liu via Gcc-patches
On Wed, Nov 9, 2022 at 9:29 AM Hongyu Wang  wrote:
>
> > Although ix86_small_unroll_insns is derived from issue_rate, it is tuned
> > for code size.
> > Making it exactly equal to issue_rate and unrolling by factor * issue_width /
> > loop->ninsns may increase code size too much.
> > So I prefer to add both parameters to the cost table for core
> > tunings instead of a single one.
>
> Yes, here is the updated patch that changes the cost table.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
OK. Note that the GCC documentation has been ported to Sphinx, so you need to
port your invoke.texi changes to the new Sphinx files.

RE: [PATCH V2] Enable small loop unrolling for O2

2022-11-10 Thread Wang, Hongyu via Gcc-patches
Thanks for the notification! I wasn't aware of the Compile Farm before. I will
check the impact of my patch on those targets.

Regards,
Hongyu, Wang

From: David Edelsohn 
Sent: Thursday, November 10, 2022 1:22 AM
To: Wang, Hongyu 
Cc: GCC Patches 
Subject: Re: [PATCH V2] Enable small loop unrolling for O2

> This patch does not change rs6000/s390 since I don't have machines to
> test them, but I suppose the default behavior is the same since they
> enable flag_unroll_loops at O2.

There are Power (rs6000) systems in the Compile Farm.

Trial Linux on Z (s390x) VMs are available through the Linux Community Cloud.
https://linuxone.cloud.marist.edu/#/register?flag=VM

Thanks, David




Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-09 Thread David Edelsohn via Gcc-patches
> This patch does not change rs6000/s390 since I don't have machines to
> test them, but I suppose the default behavior is the same since they
> enable flag_unroll_loops at O2.

There are Power (rs6000) systems in the Compile Farm.

Trial Linux on Z (s390x) VMs are available through the Linux Community
Cloud.
https://linuxone.cloud.marist.edu/#/register?flag=VM

Thanks, David


Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-08 Thread Hongyu Wang via Gcc-patches
> Although ix86_small_unroll_insns is derived from issue_rate, it is tuned
> for code size.
> Making it exactly equal to issue_rate and unrolling by factor * issue_width /
> loop->ninsns may increase code size too much.
> So I prefer to add both parameters to the cost table for core
> tunings instead of a single one.
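
To make the code-size concern in the quoted text concrete, here is a rough sketch that compares the two schemes. It is not GCC code: the helper names, the clamp to a minimum of 1, and the values used in self_check (insn limit 4, factor 2, issue width 4) are assumptions chosen only for illustration.

```cpp
#include <cassert>

// Static scheme: every loop at or under the insn limit gets the same factor.
constexpr unsigned static_factor(unsigned ninsns, unsigned limit,
                                 unsigned factor) {
  return ninsns <= limit ? factor : 1;
}

// Computed scheme from the review: (factor * issue_width) / ninsns.
// The clamp to at least 1 is an assumption of this sketch.
constexpr unsigned computed_factor(unsigned ninsns, unsigned issue_width,
                                   unsigned factor) {
  unsigned n = ninsns ? (factor * issue_width) / ninsns : 1;
  return n ? n : 1;
}

// Insns in the unrolled loop body for a given factor.
constexpr unsigned unrolled_insns(unsigned ninsns, unsigned factor) {
  return ninsns * factor;
}

// Illustrative values only: limit = 4, factor = 2, issue_width = 4.
inline void self_check() {
  // A 1-insn loop: 2 copies under the static scheme, 8 under the computed one.
  assert(unrolled_insns(1, static_factor(1, 4, 2)) == 2);
  assert(unrolled_insns(1, computed_factor(1, 4, 2)) == 8);
  // A 4-insn loop: both schemes end up with 8 insns.
  assert(unrolled_insns(4, static_factor(4, 4, 2)) == 8);
  assert(unrolled_insns(4, computed_factor(4, 4, 2)) == 8);
}
```

With these made-up numbers the computed scheme grows every small loop to roughly factor * issue_width insns, while the static scheme only doubles it, which is the kind of growth the quoted comment worries about.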

Yes, here is the updated patch that changes the cost table.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?


Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-07 Thread Hongtao Liu via Gcc-patches
On Mon, Nov 7, 2022 at 10:25 PM Richard Biener via Gcc-patches wrote:
>
> On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang  wrote:
> >
> > Hi, this is an updated version of the patch at
> > https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html.
> > It uses targetm.loop_unroll_adjust as the gate for enabling small loop
> > unrolling.
> >
> > This patch does not change rs6000/s390, since I don't have machines to
> > test them on, but I suppose the default behavior is the same since they
> > enable flag_unroll_loops at O2.
> >
> > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> >
> > Ok for trunk?
> >
> > -- Patch content 
> >
> > Modern processors have multi-way instruction decoders.
> > For x86, icelake/zen3 decode up to 5 uops per cycle, so for a small loop
> > with <= 4 instructions (usually 3 uops, with a cmp/jmp pair that can be
> > macro-fused), the decoder has a 2-uop bubble in each iteration
> > and the pipeline cannot be fully utilized.
> >
> > Therefore, this patch enables loop unrolling for small loops at O2
> > to fill the decoder as much as possible. It turns on RTL loop
> > unrolling when targetm.loop_unroll_adjust exists and the optimization
> > level is O2 or above, optimizing for speed only.
> > In the x86 backend, the default behavior is to unroll small loops with
> > fewer than 4 insns once.
> >
> > This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
> > 0.9% code-size increase. For other benchmarks the variations are minor,
> > and overall code size increases by 0.2%.
> >
> > The kernel image size increases by 0.06%, and there is no impact on eembc.
> >
> > gcc/ChangeLog:
> >
> > * common/config/i386/i386-common.cc (ix86_optimization_table):
> > Enable small loop unroll at O2 by default.
> > * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> > factor if -munroll-only-small-loops enabled and -funroll-loops/
> > -funroll-all-loops are disabled.
> > * config/i386/i386.opt: Add -munroll-only-small-loops,
> > -param=x86-small-unroll-ninsns= for loop insn limit,
> > -param=x86-small-unroll-factor= for unroll factor.
> > * doc/invoke.texi: Document -munroll-only-small-loops,
> > x86-small-unroll-ninsns and x86-small-unroll-factor.
> > * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> > loop unrolling for -O2-speed and above if target hook
> > loop_unroll_adjust exists.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.dg/guality/loop-1.c: Add additional option
> >   -mno-unroll-only-small-loops.
> > * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> > * gcc.target/i386/pr93002.c: Likewise.
> > ---
> >  gcc/common/config/i386/i386-common.cc   |  1 +
> >  gcc/config/i386/i386.cc | 18 ++
> >  gcc/config/i386/i386.opt| 13 +
> >  gcc/doc/invoke.texi | 16 
> >  gcc/loop-init.cc| 10 +++---
> >  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
> >  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
> >  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
> >  8 files changed, 59 insertions(+), 5 deletions(-)
> >
> > diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> > index f66bdd5a2af..c6891486078 100644
> > --- a/gcc/common/config/i386/i386-common.cc
> > +++ b/gcc/common/config/i386/i386-common.cc
> > @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
> >  /* The STC algorithm produces the smallest code at -Os, for x86.  */
> >  { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
> >REORDER_BLOCKS_ALGORITHM_STC },
> > +{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
> >  /* Turn off -fschedule-insns by default.  It tends to make the
> > problem with not enough registers even worse.  */
> >  { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index c0f37149ed0..0f94a3b609e 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
> >unsigned i;
> >unsigned mem_count = 0;
> >
> > +  /* Unroll small size loop when unroll factor is not explicitly
> > + specified.  */
> > +  if (!(flag_unroll_loops
> > +   || flag_unroll_all_loops
> > +   || loop->unroll))
> > +{
> > +  nunroll = 1;
> > +
> > +  /* Any explicit -f{no-}unroll-{all-}loops turns off
> > +-munroll-only-small-loops.  */
> > +  if (ix86_unroll_only_small_loops
> > + && !OPTION_SET_P (flag_unroll_loops))
> > +   if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
>
> either add braces or combine the two if's
>
> Otherwise the middle-end changes look OK.  The target maintainers need to 
> decide
> whether the two 

Re: [PATCH V2] Enable small loop unrolling for O2

2022-11-07 Thread Richard Biener via Gcc-patches
On Wed, Nov 2, 2022 at 4:37 AM Hongyu Wang  wrote:
>
> Hi, this is an updated version of the patch at
> https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html.
> It uses targetm.loop_unroll_adjust as the gate for enabling small loop
> unrolling.
>
> This patch does not change rs6000/s390, since I don't have machines to
> test them on, but I suppose the default behavior is the same since they
> enable flag_unroll_loops at O2.
>
> Bootstrapped & regrtested on x86_64-pc-linux-gnu.
>
> Ok for trunk?
>
> -- Patch content 
>
> Modern processors have multi-way instruction decoders.
> For x86, icelake/zen3 decode up to 5 uops per cycle, so for a small loop
> with <= 4 instructions (usually 3 uops, with a cmp/jmp pair that can be
> macro-fused), the decoder has a 2-uop bubble in each iteration
> and the pipeline cannot be fully utilized.
>
> Therefore, this patch enables loop unrolling for small loops at O2
> to fill the decoder as much as possible. It turns on RTL loop
> unrolling when targetm.loop_unroll_adjust exists and the optimization
> level is O2 or above, optimizing for speed only.
> In the x86 backend, the default behavior is to unroll small loops with
> fewer than 4 insns once.
>
> This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
> 0.9% code-size increase. For other benchmarks the variations are minor,
> and overall code size increases by 0.2%.
>
> The kernel image size increases by 0.06%, and there is no impact on eembc.
>
> gcc/ChangeLog:
>
> * common/config/i386/i386-common.cc (ix86_optimization_table):
> Enable small loop unroll at O2 by default.
> * config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
> factor if -munroll-only-small-loops enabled and -funroll-loops/
> -funroll-all-loops are disabled.
> * config/i386/i386.opt: Add -munroll-only-small-loops,
> -param=x86-small-unroll-ninsns= for loop insn limit,
> -param=x86-small-unroll-factor= for unroll factor.
> * doc/invoke.texi: Document -munroll-only-small-loops,
> x86-small-unroll-ninsns and x86-small-unroll-factor.
> * loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
> loop unrolling for -O2-speed and above if target hook
> loop_unroll_adjust exists.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/guality/loop-1.c: Add additional option
>   -mno-unroll-only-small-loops.
> * gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
> * gcc.target/i386/pr93002.c: Likewise.
> ---
>  gcc/common/config/i386/i386-common.cc   |  1 +
>  gcc/config/i386/i386.cc | 18 ++
>  gcc/config/i386/i386.opt| 13 +
>  gcc/doc/invoke.texi | 16 
>  gcc/loop-init.cc| 10 +++---
>  gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
>  gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
>  gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
>  8 files changed, 59 insertions(+), 5 deletions(-)
>
> diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
> index f66bdd5a2af..c6891486078 100644
> --- a/gcc/common/config/i386/i386-common.cc
> +++ b/gcc/common/config/i386/i386-common.cc
> @@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
>  /* The STC algorithm produces the smallest code at -Os, for x86.  */
>  { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
>REORDER_BLOCKS_ALGORITHM_STC },
> +{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
>  /* Turn off -fschedule-insns by default.  It tends to make the
> problem with not enough registers even worse.  */
>  { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index c0f37149ed0..0f94a3b609e 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
>unsigned i;
>unsigned mem_count = 0;
>
> +  /* Unroll small size loop when unroll factor is not explicitly
> + specified.  */
> +  if (!(flag_unroll_loops
> +   || flag_unroll_all_loops
> +   || loop->unroll))
> +{
> +  nunroll = 1;
> +
> +  /* Any explicit -f{no-}unroll-{all-}loops turns off
> +-munroll-only-small-loops.  */
> +  if (ix86_unroll_only_small_loops
> + && !OPTION_SET_P (flag_unroll_loops))
> +   if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)

either add braces or combine the two if's
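
For illustration, a minimal sketch of the "combine the two if's" option. The struct below is a hypothetical stand-in for the GCC globals and loop fields used in the hunk, so this is not the code that would go into i386.cc, only a way to show that the two shapes compute the same result.

```cpp
#include <cassert>

// Hypothetical stand-ins for the values the hunk reads.
struct UnrollContext {
  bool unroll_only_small_loops;   // ix86_unroll_only_small_loops
  bool unroll_loops_set_by_user;  // OPTION_SET_P (flag_unroll_loops)
  unsigned loop_ninsns;           // loop->ninsns
  unsigned small_unroll_ninsns;   // ix86_small_unroll_ninsns
  unsigned small_unroll_factor;   // ix86_small_unroll_factor
};

// Shape as posted: an unbraced `if` nested inside another `if`.
unsigned nested_form(const UnrollContext &c) {
  unsigned nunroll = 1;
  if (c.unroll_only_small_loops && !c.unroll_loops_set_by_user)
    if (c.loop_ninsns <= c.small_unroll_ninsns)
      nunroll = c.small_unroll_factor;
  return nunroll;
}

// Combined shape: all three tests in a single condition.
unsigned combined_form(const UnrollContext &c) {
  unsigned nunroll = 1;
  if (c.unroll_only_small_loops
      && !c.unroll_loops_set_by_user
      && c.loop_ninsns <= c.small_unroll_ninsns)
    nunroll = c.small_unroll_factor;
  return nunroll;
}

// Both shapes agree on a small loop and on a loop over the limit.
inline void self_check() {
  assert(nested_form({true, false, 3, 4, 2}) == 2);
  assert(combined_form({true, false, 3, 4, 2}) == 2);
  assert(nested_form({true, false, 9, 4, 2}) == 1);
  assert(combined_form({true, false, 9, 4, 2}) == 1);
}
```

Since there is no else branch, the two forms behave identically; the combined form just avoids the dangling-if shape.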

Otherwise the middle-end changes look OK.  The target maintainers need to decide
whether the two --params should be core tunings instead - I would assume that
given your rationale the decode and issue widths of the core play an important
role here.  That might also suggest a single parameter instead and unrolling
(factor * issue_width) / loop->ninsns times instead of a static 

[PATCH V2] Enable small loop unrolling for O2

2022-11-01 Thread Hongyu Wang via Gcc-patches
Hi, this is an updated version of the patch at
https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604345.html.
It uses targetm.loop_unroll_adjust as the gate for enabling small loop
unrolling.

This patch does not change rs6000/s390, since I don't have machines to
test them on, but I suppose the default behavior is the same since they
enable flag_unroll_loops at O2.

Bootstrapped & regrtested on x86_64-pc-linux-gnu.

Ok for trunk?

-- Patch content 

Modern processors have multi-way instruction decoders.
For x86, icelake/zen3 decode up to 5 uops per cycle, so for a small loop
with <= 4 instructions (usually 3 uops, with a cmp/jmp pair that can be
macro-fused), the decoder has a 2-uop bubble in each iteration
and the pipeline cannot be fully utilized.

Therefore, this patch enables loop unrolling for small loops at O2
to fill the decoder as much as possible. It turns on RTL loop
unrolling when targetm.loop_unroll_adjust exists and the optimization
level is O2 or above, optimizing for speed only.
In the x86 backend, the default behavior is to unroll small loops with
fewer than 4 insns once.
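
As a back-of-the-envelope check of the bubble arithmetic above, here is a toy model. It is not GCC code and not a description of real hardware: it assumes uops are decoded in groups of a fixed width, that a group never continues past the taken back-edge, and that the unrolled loop keeps a single macro-fused cmp/jmp. All names are made up for the sketch.

```cpp
#include <cassert>

// Uops per trip around the loop: `body_uops` of real work copied `copies`
// times, followed by one macro-fused cmp/jmp uop (an assumption; the
// unroller does not always remove the intermediate exit tests).
constexpr unsigned uops_per_trip(unsigned body_uops, unsigned copies) {
  return body_uops * copies + 1;
}

// Decode slots left empty per trip for a decoder `width` uops wide.
constexpr unsigned empty_decode_slots(unsigned width, unsigned uops) {
  unsigned cycles = (uops + width - 1) / width;  // ceiling division
  return cycles * width - uops;
}

inline void self_check() {
  // 2 body uops + fused cmp/jmp = 3 uops: 2 of 5 slots stay empty.
  assert(empty_decode_slots(5, uops_per_trip(2, 1)) == 2);
  // Two copies of the body: 5 uops fill the 5 slots exactly.
  assert(empty_decode_slots(5, uops_per_trip(2, 2)) == 0);
}
```

Under these assumptions the 3-uop loop wastes 2 of 5 decode slots per iteration, and one extra copy of the body removes the bubble.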

This improves 548.exchange2 by 9% on icelake and 7.4% on zen3 with a
0.9% code-size increase. For other benchmarks the variations are minor,
and overall code size increases by 0.2%.

The kernel image size increases by 0.06%, and there is no impact on eembc.

gcc/ChangeLog:

* common/config/i386/i386-common.cc (ix86_optimization_table):
Enable small loop unroll at O2 by default.
* config/i386/i386.cc (ix86_loop_unroll_adjust): Adjust unroll
factor if -munroll-only-small-loops enabled and -funroll-loops/
-funroll-all-loops are disabled.
* config/i386/i386.opt: Add -munroll-only-small-loops,
-param=x86-small-unroll-ninsns= for loop insn limit,
-param=x86-small-unroll-factor= for unroll factor.
* doc/invoke.texi: Document -munroll-only-small-loops,
x86-small-unroll-ninsns and x86-small-unroll-factor.
* loop-init.cc (pass_rtl_unroll_loops::gate): Enable rtl
loop unrolling for -O2-speed and above if target hook
loop_unroll_adjust exists.
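
The loop-init.cc hunk is not shown in this message, so as a rough guide to the last ChangeLog entry, here is a sketch of what the gate condition amounts to. The struct and function names are hypothetical, and the assumption that the gate otherwise fires only on an explicit unrolling request is mine, not taken from the patch text.

```cpp
#include <cassert>

// Hypothetical inputs standing in for GCC state.
struct GateInputs {
  bool flag_unroll_loops;              // -funroll-loops
  bool flag_unroll_all_loops;          // -funroll-all-loops
  bool has_unroll_pragma;              // function contains an unroll pragma
  int optimize;                        // -O level
  bool optimize_for_speed;             // false when optimizing for size
  bool target_has_loop_unroll_adjust;  // targetm.loop_unroll_adjust is set
};

// Run the RTL unroller on an explicit request, or at -O2 and above for
// speed when the target provides a loop_unroll_adjust hook.
bool rtl_unroll_gate(const GateInputs &g) {
  bool requested = g.flag_unroll_loops || g.flag_unroll_all_loops
                   || g.has_unroll_pragma;
  bool small_loop_path = g.optimize >= 2 && g.optimize_for_speed
                         && g.target_has_loop_unroll_adjust;
  return requested || small_loop_path;
}

inline void self_check() {
  // Plain -O2 for speed on a target with the hook: the pass runs.
  assert(rtl_unroll_gate({false, false, false, 2, true, true}));
  // Optimizing for size: the new path stays off.
  assert(!rtl_unroll_gate({false, false, false, 2, false, true}));
}
```

In this shape the target hook then decides the actual factor, so a target without the hook sees no change at O2.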

gcc/testsuite/ChangeLog:

* gcc.dg/guality/loop-1.c: Add additional option
  -mno-unroll-only-small-loops.
* gcc.target/i386/pr86270.c: Add -mno-unroll-only-small-loops.
* gcc.target/i386/pr93002.c: Likewise.
---
 gcc/common/config/i386/i386-common.cc   |  1 +
 gcc/config/i386/i386.cc                 | 18 ++
 gcc/config/i386/i386.opt                | 13 +
 gcc/doc/invoke.texi                     | 16
 gcc/loop-init.cc                        | 10 +++---
 gcc/testsuite/gcc.dg/guality/loop-1.c   |  2 ++
 gcc/testsuite/gcc.target/i386/pr86270.c |  2 +-
 gcc/testsuite/gcc.target/i386/pr93002.c |  2 +-
 8 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/gcc/common/config/i386/i386-common.cc b/gcc/common/config/i386/i386-common.cc
index f66bdd5a2af..c6891486078 100644
--- a/gcc/common/config/i386/i386-common.cc
+++ b/gcc/common/config/i386/i386-common.cc
@@ -1724,6 +1724,7 @@ static const struct default_options ix86_option_optimization_table[] =
 /* The STC algorithm produces the smallest code at -Os, for x86.  */
 { OPT_LEVELS_2_PLUS, OPT_freorder_blocks_algorithm_, NULL,
   REORDER_BLOCKS_ALGORITHM_STC },
+{ OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_munroll_only_small_loops, NULL, 1 },
 /* Turn off -fschedule-insns by default.  It tends to make the
problem with not enough registers even worse.  */
 { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index c0f37149ed0..0f94a3b609e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23827,6 +23827,24 @@ ix86_loop_unroll_adjust (unsigned nunroll, class loop *loop)
   unsigned i;
   unsigned mem_count = 0;
 
+  /* Unroll small size loop when unroll factor is not explicitly
+ specified.  */
+  if (!(flag_unroll_loops
+   || flag_unroll_all_loops
+   || loop->unroll))
+{
+  nunroll = 1;
+
+  /* Any explicit -f{no-}unroll-{all-}loops turns off
+-munroll-only-small-loops.  */
+  if (ix86_unroll_only_small_loops
+ && !OPTION_SET_P (flag_unroll_loops))
+   if (loop->ninsns <= (unsigned) ix86_small_unroll_ninsns)
+ nunroll = (unsigned) ix86_small_unroll_factor;
+
+  return nunroll;
+}
+
   if (!TARGET_ADJUST_UNROLL)
  return nunroll;
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 53d534f6392..6da9c8d670d 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -1224,3 +1224,16 @@ mavxvnniint8
 Target Mask(ISA2_AVXVNNIINT8) Var(ix86_isa_flags2) Save
 Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2 and
 AVXVNNIINT8 built-in functions and code generation.
+
+munroll-only-small-loops
+Target Var(ix86_unroll_only_small_loops) Init(0) Save
+Enable conservative small loop unrolling.
+