[AMD Public Use]

Hi Honza,


> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Monday, March 22, 2021 4:31 PM
> To: Kumar, Venkataramanan <venkataramanan.ku...@amd.com>
> Cc: gcc-patches@gcc.gnu.org; mjam...@suse.cz
> Subject: Re: znver3 tuning part 1
> 
> > > Hi,
> > > I plan to commit some retuning of znver3 codegen that is based on
> > > real hardware benchmarks.  It turns out that there are not too many
> > > changes necessary since Zen3 is quite smooth upgrade to Zen2.  In summary:
> > >
> > >  - some instructions (like idiv) have shorter latencies.  Adjusting
> > >    costs reduces code size a bit but seems within noise in benchmark
> > >    (since our cost calculation is quite off anyway because it does not
> > >    account for register pressure and parallelism, which make a huge
> > >    difference here)
> > >  - gather instructions are still microcoded but a lot faster than in
> > >    znver1/znver2 and it turns out they are now beneficial for a few TSVC
> > >    benchmarks, so I plan to enable them.
> >
> > Can we get a copy of this benchmark to try?
> > We need to check on bigger benchmarks like SPEC also.
> 
> Yes, I am also running specs.  However for basic instruction selection tuning
> smaller benchmarks are doing quite well.  In general if there are relatively
> natural loops where gather helps, I think we should enable it and try to fix
> possible regressions (I did not see one in spec runs, but I plan to do more
> benchmarking this week).

Okay, thank you.
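
For readers unfamiliar with the kind of loop under discussion, here is a minimal sketch of a "natural loop where gather helps". The kernel and its name `gather_scale` are hypothetical illustrations, not taken from TSVC or from the patch.

```c
#include <stddef.h>

/* Hypothetical kernel, not taken from TSVC: each iteration loads
   table[idx[i]], an indexed (indirect) load that a vectorizer can
   implement with a gather instruction when the target tuning allows it. */
void gather_scale(float *restrict out, const float *restrict table,
                  const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 2.0f * table[idx[i]];
}
```

Compiling this with something like `gcc -O3 -march=znver3 -fopt-info-vec` and looking for `vgatherdps` in the assembly is one way to check whether gather is used; the outcome depends on the GCC version and on the tuning flags being discussed in this thread.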

> 
> I did some work on TSVC mostly because zen3 seems a very smooth update to
> zen2 for instruction selection (which is already happy with almost everything
> especially for scalar code) and vectorizer costs seem to be the place where we
> have most room for improvement.
> 
> I briefly analyzed all tsvc kernels where we regress compared to clang, aocc
> and icc.  You can search tsvc in bugzilla. Richard also wrote some
> observations there.
> These are related to missing features rather than cost model however.
> 
> One problem of tsvc is that it is FP only.  I hacked it for integer but it
> would be nice to have something else as well.
> >
> > >
> > >    It seems we missed revisiting this for znver2 tuning.
> > >    I think even for znver2 it may make sense to re-enable them, so I
> > >    will benchmark this as well.
> > >  - memcpy/memset expansion seems to work same way as for znver2,
> > >    so I am keeping same changes.
> > >  - instruction scheduler is already modified in trunk to some degree
> > >    reflecting new units.  Problem with instruction scheduling is that
> > >    it treats zen as an in-order CPU and is unlikely to fill all
> > >    execution resources this way.
> > >    We may want to try to model the out-of-order nature in a similar way
> > >    to LLVM, but on the other hand the current scheduling logic seems
> > >    to do mostly fine (i.e. not worse than llvm's).  What matters is
> > >    to schedule for long latencies and just after branch boundaries
> > >    where simplified model seems to do just fine.
> >
> > So we can keep the existing model for znver3 for GCC 11?
> 
> I think so - I experimented with making the model a bit more precise and it does
> not seem to add any performance improvements and makes the automaton a
> lot bigger.  The existing model already handles the updated
> zen3 latencies...
> 
> I think the only possible improvement here would be to start modelling
> explicitly the out-of-order nature but even then I am not sure how much
> benefit that can bring (given that we are limited to relatively small basic
> blocks and do not have a lot of information needed to model the execution
> precisely). Do you have some opinions on this?

Given that basic blocks are small and hardware itself reorders the 
instructions, I don't think precisely modelling the scheduler will give much 
benefit.

> 
> Honza

Regards,
Venkat.
