Re: znver3 tuning part 1

2021-03-31 Thread Jan Hubicka
> On 3/31/21 1:08 PM, Jan Hubicka wrote:
> > > 
> > > 2021-03-15  Jan Hubicka  
> > > 
> > >   * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
> > >   * config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
> > >   of znver2_cost.
> > 
> > I have backported the patch to gcc10 branch as
> > g:aa99212489545c6c970a8f91b3d37ea6466cb988.
> > 
> > Honza
> > 
> 
> Looking at the backport, it has likely enabled PR99753.
> Please consider backporting 4f00c4d40a539360938607561460904663c64cda.

Not this patch, but the backport of Venkat's original change, indeed
triggers PR99763.  I am now testing a backport of this revision.

Thanks,
Honza
> 
> Cheers,
> Martin


Re: znver3 tuning part 1

2021-03-31 Thread Martin Liška

On 3/31/21 1:08 PM, Jan Hubicka wrote:


2021-03-15  Jan Hubicka  

* config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
* config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
of znver2_cost.


I have backported the patch to gcc10 branch as
g:aa99212489545c6c970a8f91b3d37ea6466cb988.

Honza



Looking at the backport, it has likely enabled PR99753.
Please consider backporting 4f00c4d40a539360938607561460904663c64cda.

Cheers,
Martin


Re: znver3 tuning part 1

2021-03-31 Thread Jan Hubicka
> 
> 2021-03-15  Jan Hubicka  
> 
>   * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
>   * config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
>   of znver2_cost.

I have backported the patch to gcc10 branch as
g:aa99212489545c6c970a8f91b3d37ea6466cb988.

Honza


RE: znver3 tuning part 1

2021-03-23 Thread Kumar, Venkataramanan via Gcc-patches
[AMD Public Use]

Hi Honza,


> -----Original Message-----
> From: Jan Hubicka 
> Sent: Monday, March 22, 2021 4:31 PM
> To: Kumar, Venkataramanan 
> Cc: gcc-patches@gcc.gnu.org; mjam...@suse.cz
> Subject: Re: znver3 tuning part 1
> 
> [CAUTION: External Email]
> 
> > > Hi,
> > > I plan to commit some retuning of znver3 codegen that is based on
> > > real hardware benchmarks.  It turns out that there are not too many
> > > changes necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:
> > >
> > >  - some instructions (like idiv) have shorter latencies.  Adjusting
> > >    costs reduces code size a bit but seems within noise in benchmark
> > >    (since our cost calculation is quite off anyway because it does not
> > >    account for register pressure and parallelism that does make huge
> > >    difference here)
> > >  - gather instructions are still microcoded but a lot faster than in
> > >    znver1/znver2 and it turns out they are now beneficial for few TSVC
> > >    benchmarks, so I plan to enable them.
> >
> > Can we get a copy of this benchmark to try ?
> > we need to check on bigger benchmarks like SPEC also.
> 
> Yes, I am also running specs.  However for basic instruction selection tuning
> smaller benchmarks are doing quite well.  In general if there are relatively
> natural loops where gather helps, I think we should enable it and try to fix
> possible regressions (I did not see one in spec runs, but I plan to do more
> benchmarking this week).

Okay, thank you.

> 
> I did some work on TSVC mostly because zen3 seems very smooth update to
> zen2 for instruction selection (which is already happy with almost everything
> especially for scalar code) and vectorizer costs seems to be place where we
> seem to have most room for improvement.
> 
> I briefly analyzed all tsvc kernels where we regress compared to clang, aocc
> and icc.  You can search tsvc in bugzilla. Richard also wrote some
> observations there.
> These are related to missing features rather than cost model however.
> 
> One problem of tsvc is that it is FP only.  I hacked it for integer but it
> would be nice to have something else as well.
> >
> > >
> > >    It seems we missed revisiting this for znver2 tuning.
> > >    I think even for znver2 it may make sense to re-enable them, so I
> > >    will benchmark this as well.
> > >  - memcpy/memset expansion seems to work same way as for znver2,
> > >    so I am keeping same changes.
> > >  - instruction scheduler is already modified in trunk to some degree
> > >    reflecting new units.  Problem with instruction scheduling is that
> > >    it treats zen as in-order CPU and is unlikely going to fill all
> > >    execution resources this way.
> > >    We may want to try to model the out-of-order nature similar way as
> > >    LLVM does, but on the other hand the current scheduling logic seems
> > >    to do mostly fine (i.e. not worse than llvm's).  What matters is
> > >    to schedule for long latencies and just after branch boundaries
> > >    where simplified model seems to do just fine.
> >
> > So we can keep the existing model for znver3 for GCC 11 ?
> 
> I think so - I experimented with making the model bit more precise and it does
> not seem to add any performance improvements and makes the automaton a
> lot bigger.  The existing model already handles the updated
> zen3 latencies...
> 
> I think the only possible improvement here would be to start modelling
> explicitly the out of order nature but even then I am not sure how much
> benefits that can bring (given that we are limited to relatively small
> basic blocks and do not have a lot of information needed to model the
> execution precisely). Do you have some opinions on this?

Given that basic blocks are small and hardware itself reorders the 
instructions, I don't think precisely modelling the scheduler will give much 
benefit.

> 
> Honza

Regards,
Venkat.


Re: znver3 tuning part 1

2021-03-22 Thread Richard Biener via Gcc-patches
On Mon, Mar 22, 2021 at 12:02 PM Jan Hubicka  wrote:
>
> > > Hi,
> > > I plan to commit some retuning of znver3 codegen that is based on real
> > > hardware benchmarks.  It turns out that there are not too many changes
> > > necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:
> > >
> > >  - some instructions (like idiv) have shorter latencies.  Adjusting
> > >    costs reduces code size a bit but seems within noise in benchmark
> > >    (since our cost calculation is quite off anyway because it does not
> > >    account for register pressure and parallelism that does make huge
> > >    difference here)
> > >  - gather instructions are still microcoded but a lot faster than in
> > >    znver1/znver2 and it turns out they are now beneficial for few TSVC
> > >    benchmarks, so I plan to enable them.
> >
> > Can we get a copy of this benchmark to try ?
> > we need to check on bigger benchmarks like SPEC also.
>
> Yes, I am also running specs.  However for basic instruction selection
> tuning smaller benchmarks are doing quite well.  In general if there are
> relatively natural loops where gather helps, I think we should enable it
> and try to fix possible regressions (I did not see one in spec runs, but
> I plan to do more benchmarking this week).
>
> I did some work on TSVC mostly because zen3 seems very smooth update to
> zen2 for instruction selection (which is already happy with almost
> everything especially for scalar code) and vectorizer costs seems to be
> place where we seem to have most room for improvement.
>
> I briefly analyzed all tsvc kernels where we regress compared to clang,
> aocc and icc.  You can search tsvc in bugzilla. Richard also wrote some
> observations there.  These are related to missing features rather than
> cost model however.
>
> One problem of tsvc is that it is FP only.  I hacked it for integer but
> it would be nice to have something else as well.
> >
> > >
> > >    It seems we missed revisiting this for znver2 tuning.
> > >    I think even for znver2 it may make sense to re-enable them, so I
> > >    will benchmark this as well.
> > >  - memcpy/memset expansion seems to work same way as for znver2,
> > >    so I am keeping same changes.
> > >  - instruction scheduler is already modified in trunk to some degree
> > >    reflecting new units.  Problem with instruction scheduling is that
> > >    it treats zen as in-order CPU and is unlikely going to fill all
> > >    execution resources this way.
> > >    We may want to try to model the out-of-order nature similar way as
> > >    LLVM does, but on the other hand the current scheduling logic seems
> > >    to do mostly fine (i.e. not worse than llvm's).  What matters is
> > >    to schedule for long latencies and just after branch boundaries
> > >    where simplified model seems to do just fine.
> >
> > So we can keep the existing model for znver3 for GCC 11 ?
>
> I think so - I experimented with making the model bit more precise and
> it does not seem to add any performance improvements and makes the
> automaton a lot bigger.  The existing model already handles the updated
> zen3 latencies...
>
> I think the only possible improvement here would be to start modelling
> explicitly the out of order nature but even then I am not sure how much
> benefits that can bring (given that we are limited to relatively small
> basic blocks and do not have a lot of information needed to model the
> execution precisely). Do you have some opinions on this?

I think it makes sense to model instruction fetch quite precisely
(including, or rather either/or fetch from the uop cache) up to where
OOO starts.  From there on backwards only very long latency insns
and of course insn dependences should be a factor to maximise
issue width per fetch block.  Not sure if it makes sense to model the
uop cache at all or whether we should switch between L1 fetch
and uop cache assumption based on loop depth?

That said, for loops scheduling is somewhat moot but for cold
(in terms of the OOO window size) serial code it makes sense to
optimize for uop issue.  I also note that this seems to work out
quite well with the existing automata - if only as side-effect.

Richard.

> Honza


Re: znver3 tuning part 1

2021-03-22 Thread Jan Hubicka
> > Hi,
> > I plan to commit some retuning of znver3 codegen that is based on real
> > hardware benchmarks.  It turns out that there are not too many changes
> > necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:
> > 
> >  - some instructions (like idiv) have shorter latencies.  Adjusting
> >    costs reduces code size a bit but seems within noise in benchmark
> >    (since our cost calculation is quite off anyway because it does not
> >    account for register pressure and parallelism that does make huge
> >    difference here)
> >  - gather instructions are still microcoded but a lot faster than in
> >    znver1/znver2 and it turns out they are now beneficial for few TSVC
> >    benchmarks, so I plan to enable them.
> 
> Can we get a copy of this benchmark to try ?  
> we need to check on bigger benchmarks like SPEC also. 

Yes, I am also running specs.  However for basic instruction selection
tuning smaller benchmarks are doing quite well.  In general if there are
relatively natural loops where gather helps, I think we should enable it
and try to fix possible regressions (I did not see one in spec runs, but
I plan to do more benchmarking this week).
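
For readers unfamiliar with the kind of loop meant by "natural loops where
gather helps", here is a minimal sketch.  It is not taken from TSVC; the
function name indexed_add and its parameters are made up for illustration.
The load b[idx[i]] goes through an index array, so a vectorizer can only
load several elements at once by using a gather instruction (or by falling
back to scalar loads).  Whether GCC actually emits a gather here depends on
the target and on the tuning discussed in this thread.

```c
#include <stddef.h>

/* Hypothetical kernel: indirect (indexed) load from b[] through idx[].
   Every idx[i] is assumed to be a valid index into b[].  */
void
indexed_add (float *restrict a, const float *restrict b,
             const int *restrict idx, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] += b[idx[i]];   /* candidate for a vector gather load */
}
```

Comparing the generated assembly for different -march/-mtune settings is
one way to see whether the gather form was chosen.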

I did some work on TSVC mostly because zen3 seems very smooth update to
zen2 for instruction selection (which is already happy with almost
everything especially for scalar code) and vectorizer costs seems to be
place where we seem to have most room for improvement.

I briefly analyzed all tsvc kernels where we regress compared to clang,
aocc and icc.  You can search tsvc in bugzilla. Richard also wrote some
observations there.  These are related to missing features rather than 
cost model however.

One problem of tsvc is that it is FP only.  I hacked it for integer but
it would be nice to have something else as well.
> 
> > 
> >    It seems we missed revisiting this for znver2 tuning.
> >    I think even for znver2 it may make sense to re-enable them, so I
> >    will benchmark this as well.
> >  - memcpy/memset expansion seems to work same way as for znver2,
> >    so I am keeping same changes.
> >  - instruction scheduler is already modified in trunk to some degree
> >    reflecting new units.  Problem with instruction scheduling is that
> >    it treats zen as in-order CPU and is unlikely going to fill all
> >    execution resources this way.
> >    We may want to try to model the out-of-order nature similar way as
> >    LLVM does, but on the other hand the current scheduling logic seems
> >    to do mostly fine (i.e. not worse than llvm's).  What matters is
> >    to schedule for long latencies and just after branch boundaries
> >    where simplified model seems to do just fine.
> 
> So we can keep the existing model for znver3 for GCC 11 ?

I think so - I experimented with making the model bit more precise and
it does not seem to add any performance improvements and makes the
automaton a lot bigger.  The existing model already handles the updated
zen3 latencies...

I think the only possible improvement here would be to start modelling
explicitly the out of order nature but even then I am not sure how much
benefits that can bring (given that we are limited to relatively small
basic blocks and do not have a lot of information needed to model the
execution precisely). Do you have some opinions on this?

Honza


Re: znver3 tuning part 1

2021-03-22 Thread Martin Liška

On 3/22/21 9:43 AM, Kumar, Venkataramanan via Gcc-patches wrote:

[AMD Official Use Only - Internal Distribution Only]


Just a brief note about this. You CC a public mailing list, thus the disclaimer
does not make much sense. Please take a look here:
https://gcc.gnu.org/lists.html#policies



Hi Honza,

Thank you for working on this.


-----Original Message-----
From: Gcc-patches  On Behalf Of Jan
Hubicka
Sent: Monday, March 15, 2021 3:33 PM
To: gcc-patches@gcc.gnu.org; mjam...@suse.cz
Subject: znver3 tuning part 1

[CAUTION: External Email]

Hi,
I plan to commit some retuning of znver3 codegen that is based on real
hardware benchmarks.  It turns out that there are not too many changes
necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:

  - some instructions (like idiv) have shorter latencies.  Adjusting
    costs reduces code size a bit but seems within noise in benchmark
    (since our cost calculation is quite off anyway because it does not
    account for register pressure and parallelism that does make huge
    difference here)
  - gather instructions are still microcoded but a lot faster than in
    znver1/znver2 and it turns out they are now beneficial for few TSVC
    benchmarks, so I plan to enable them.


Can we get a copy of this benchmark to try ?


Sure: https://github.com/UoB-HPC/TSVC_2

Cheers,
Martin


we need to check on bigger benchmarks like SPEC also.



    It seems we missed revisiting this for znver2 tuning.
    I think even for znver2 it may make sense to re-enable them, so I
    will benchmark this as well.
  - memcpy/memset expansion seems to work same way as for znver2,
    so I am keeping same changes.
  - instruction scheduler is already modified in trunk to some degree
    reflecting new units.  Problem with instruction scheduling is that
    it treats zen as in-order CPU and is unlikely going to fill all
    execution resources this way.
    We may want to try to model the out-of-order nature similar way as
    LLVM does, but on the other hand the current scheduling logic seems
    to do mostly fine (i.e. not worse than llvm's).  What matters is
    to schedule for long latencies and just after branch boundaries
    where simplified model seems to do just fine.


So we can keep the existing model for znver3 for GCC 11 ?


  - some move instruction latencies do not reflect reality
    (at least the published latencies by Agner Fog or AMD optimization
    manual that themselves do not agree with each other).
    Adjusting tables however triggers regressions in ImageMagick and
    parest, so I am still looking if there is easy fix for this and if
    not, I will wait for next stage1 with these.
    Interesting property is that reg-reg moves have zero latency.
    Since costs are officially relative to reg-reg move it makes it bit
    hard to define here :)
  - fmadd was optimized and it is now 4 cycles (was 5 and 6 cycles on
    znver2 and znver1 respectively) like on Intel. However there is still
    problem with extending the critical chain in matrix multiplication
    loop.  The difference seems to be that Intel implementation needs the
    accumulator value to be ready only 1 cycle after the execution
    started processing the multiplication.

    So there is still a performance regression on matmul and thus I am
    keeping the logic to break critical chains.


My observation is also same here.
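
One way to read the "critical chain" remark is through a latency-only
model of the accumulation acc += a[i] * b[i].  The sketch below is an
illustration, not GCC code: the function names are made up, throughput and
issue width are ignored, and only the 4-cycle fmadd latency comes from this
thread.  The 3-cycle multiply and add latencies used in the usage note are
assumptions.

```c
/* Cycles spent on the loop-carried dependency chain for n iterations
   when each iteration is one fused multiply-add.  Every fmadd has to
   wait for the previous accumulator value, so the whole fmadd latency
   sits on the chain.  */
unsigned
chain_cycles_fused (unsigned n, unsigned fma_latency)
{
  return n * fma_latency;
}

/* Same chain when the multiply and the add are separate instructions.
   The multiplies do not depend on the accumulator and can run ahead,
   so after the first multiply only the adds are serialized.  */
unsigned
chain_cycles_split (unsigned n, unsigned mul_latency, unsigned add_latency)
{
  if (n == 0)
    return 0;
  return mul_latency + n * add_latency;
}
```

With a 4-cycle fmadd, chain_cycles_fused (100, 4) is 400, while with the
assumed 3-cycle multiply and add, chain_cycles_split (100, 3, 3) is 303.
Under these assumptions the fused form makes the serial chain longer even
though it executes fewer instructions, which is the effect that breaking
the chain is meant to avoid.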



This first patch is no-op and it only copies the cost tables.  I will adjust
them one-by-one for easier hunting of possible regressions.
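
The shape of such a no-op change can be sketched with a heavily reduced
model.  All toy_* names below are hypothetical stand-ins, not GCC's real
types.  The two values used (2 for the reg-reg move, 6 for the QImode load)
are the ones visible in the patch, and the halving helper follows the patch
comment that the tables hold doubled latencies.

```c
/* Hypothetical reduced stand-in for enum processor_type.  */
enum toy_processor { TOY_ZNVER2, TOY_ZNVER3, TOY_max };

/* Hypothetical reduced stand-in for struct processor_costs.  */
struct toy_costs
{
  int reg_move;  /* reg-reg move cost, the unit other costs relate to */
  int qi_load;   /* cost of loading QImode using movzbl */
};

static const struct toy_costs toy_znver2_cost = { 2, 6 };
/* Starts as a plain copy, so pointing the table at it changes nothing
   until individual entries are retuned.  */
static const struct toy_costs toy_znver3_cost = { 2, 6 };

/* One entry per processor, in enum order.  */
static const struct toy_costs *const toy_cost_table[] =
{
  &toy_znver2_cost,
  &toy_znver3_cost
};

/* Keep the array aligned with the enum.  */
_Static_assert (sizeof toy_cost_table / sizeof toy_cost_table[0] == TOY_max,
                "cost table must have one entry per processor");

const struct toy_costs *
toy_costs_for (enum toy_processor p)
{
  return toy_cost_table[p];
}

/* Table entries are doubled latencies, so halve to get cycles.  */
int
toy_cost_to_cycles (int cost)
{
  return cost / 2;
}
```

Because the new table is a separate object with identical contents, later
patches can change one field at a time and any regression can be traced to
that single change.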

Honza

2021-03-15  Jan Hubicka  

 * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
 * config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
 of znver2_cost.

diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
index e93935f6f2c..7865bc110a3 100644
--- a/gcc/config/i386/i386-options.c
+++ b/gcc/config/i386/i386-options.c
@@ -743,7 +743,7 @@ static const struct processor_costs *processor_cost_table[] =
   &btver2_cost,
   &znver1_cost,
   &znver2_cost,
-  &znver2_cost
+  &znver3_cost
 };

 /* Guarantee that the array is aligned with enum processor_type.  */
diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index cc27c7911e3..e655e668c7a 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1688,6 +1688,140 @@ struct processor_costs znver2_cost = {
   "16",  /* Func alignment.  */
 };

+struct processor_costs znver3_cost = {
+  {
+  /* Start of register allocator costs.  integer->integer move cost is
+     2. */
+
+  /* reg-reg moves are done by renaming and thus they are even cheaper than
+ 1 cycle.  Because reg-reg move cost is 2 and following tables correspond
+ to doubles of latencies, we do not model this correctly.  It does not
+ seem to make practical difference to bump prices up even more.  */
+  6,  

RE: znver3 tuning part 1

2021-03-22 Thread Kumar, Venkataramanan via Gcc-patches
[AMD Official Use Only - Internal Distribution Only]

Hi Honza,

Thank you for working on this.  

> -----Original Message-----
> From: Gcc-patches  On Behalf Of Jan
> Hubicka
> Sent: Monday, March 15, 2021 3:33 PM
> To: gcc-patches@gcc.gnu.org; mjam...@suse.cz
> Subject: znver3 tuning part 1
> 
> [CAUTION: External Email]
> 
> Hi,
> I plan to commit some retuning of znver3 codegen that is based on real
> hardware benchmarks.  It turns out that there are not too many changes
> necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:
> 
>  - some instructions (like idiv) have shorter latencies.  Adjusting
>    costs reduces code size a bit but seems within noise in benchmark
>    (since our cost calculation is quite off anyway because it does not
>    account for register pressure and parallelism that does make huge
>    difference here)
>  - gather instructions are still microcoded but a lot faster than in
>    znver1/znver2 and it turns out they are now beneficial for few TSVC
>    benchmarks, so I plan to enable them.

Can we get a copy of this benchmark to try ?  
we need to check on bigger benchmarks like SPEC also. 

> 
>    It seems we missed revisiting this for znver2 tuning.
>    I think even for znver2 it may make sense to re-enable them, so I
>    will benchmark this as well.
>  - memcpy/memset expansion seems to work same way as for znver2,
>    so I am keeping same changes.
>  - instruction scheduler is already modified in trunk to some degree
>    reflecting new units.  Problem with instruction scheduling is that
>    it treats zen as in-order CPU and is unlikely going to fill all
>    execution resources this way.
>    We may want to try to model the out-of-order nature similar way as
>    LLVM does, but on the other hand the current scheduling logic seems
>    to do mostly fine (i.e. not worse than llvm's).  What matters is
>    to schedule for long latencies and just after branch boundaries
>    where simplified model seems to do just fine.

So we can keep the existing model for znver3 for GCC 11 ?

>  - some move instruction latencies do not reflect reality
>    (at least the published latencies by Agner Fog or AMD optimization
>    manual that themselves do not agree with each other).
>    Adjusting tables however triggers regressions in ImageMagick and
>    parest, so I am still looking if there is easy fix for this and if
>    not, I will wait for next stage1 with these.
>    Interesting property is that reg-reg moves have zero latency.
>    Since costs are officially relative to reg-reg move it makes it bit
>    hard to define here :)
>  - fmadd was optimized and it is now 4 cycles (was 5 and 6 cycles on
>    znver2 and znver1 respectively) like on Intel. However there is still
>    problem with extending the critical chain in matrix multiplication
>    loop.  The difference seems to be that Intel implementation needs the
>    accumulator value to be ready only 1 cycle after the execution
>    started processing the multiplication.
> 
>So there is still a performance regression on matmul and thus I am
>keeping the logic to break critical chains.

My observation is also same here. 

> 
> This first patch is no-op and it only copies the cost tables.  I will adjust
> them one-by-one for easier hunting of possible regressions.
> 
> Honza
> 
> 2021-03-15  Jan Hubicka  
> 
> * config/i386/i386-options.c (processor_cost_table): Add znver3_cost.
> * config/i386/x86-tune-costs.h (znver3_cost): New global variable; copy
> of znver2_cost.
> 
> diff --git a/gcc/config/i386/i386-options.c b/gcc/config/i386/i386-options.c
> index e93935f6f2c..7865bc110a3 100644
> --- a/gcc/config/i386/i386-options.c
> +++ b/gcc/config/i386/i386-options.c
> @@ -743,7 +743,7 @@ static const struct processor_costs *processor_cost_table[] =
>    &btver2_cost,
>    &znver1_cost,
>    &znver2_cost,
> -  &znver2_cost
> +  &znver3_cost
>  };
> 
>  /* Guarantee that the array is aligned with enum processor_type.  */
> diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
> index cc27c7911e3..e655e668c7a 100644
> --- a/gcc/config/i386/x86-tune-costs.h
> +++ b/gcc/config/i386/x86-tune-costs.h
> @@ -1688,6 +1688,140 @@ struct processor_costs znver2_cost = {
>    "16",  /* Func alignment.  */
>  };
> 
> +struct processor_costs znver3_cost = {
> +  {
> +  /* Start of register allocator costs.  integer->integer move cost is
> +     2. */
> +
> +  /* reg-reg moves are done by renaming and thus they are even cheaper than
> + 1 cycle.  Because reg-reg move cost is 2 and following tables correspond
> + to doubles of latencies, we do not model this correctly.  It does not
> + seem to make practical difference to bump prices up even more.  */
> +  6,   /* cost for loading QImode using
> +  movzbl.  */
> +  {6, 6, 6},   /