[AMD Public Use]

Hi Honza,
> -----Original Message-----
> From: Jan Hubicka <hubi...@ucw.cz>
> Sent: Monday, March 22, 2021 4:31 PM
> To: Kumar, Venkataramanan <venkataramanan.ku...@amd.com>
> Cc: gcc-patches@gcc.gnu.org; mjam...@suse.cz
> Subject: Re: znver3 tuning part 1
>
> > > Hi,
> > > I plan to commit some retuning of znver3 codegen that is based on
> > > real hardware benchmarks. It turns out that there are not too many
> > > changes necessary since Zen3 is quite a smooth upgrade to Zen2. In summary:
> > >
> > > - some instructions (like idiv) have shorter latencies. Adjusting
> > >   costs reduces code size a bit but seems within noise in benchmarks
> > >   (since our cost calculation is quite off anyway because it does not
> > >   account for register pressure and parallelism, which make a huge
> > >   difference here)
> > > - gather instructions are still microcoded but a lot faster than in
> > >   znver1/znver2, and it turns out they are now beneficial for a few TSVC
> > >   benchmarks, so I plan to enable them.
> >
> > Can we get a copy of this benchmark to try?
> > We need to check on bigger benchmarks like SPEC also.
>
> Yes, I am also running specs. However, for basic instruction selection tuning
> smaller benchmarks are doing quite well. In general, if there are relatively
> natural loops where gather helps, I think we should enable it and try to fix
> possible regressions (I did not see one in spec runs, but I plan to do more
> benchmarking this week).

Okay. Thank you.

> I did some work on TSVC mostly because zen3 seems a very smooth update to
> zen2 for instruction selection (which is already happy with almost everything,
> especially for scalar code) and vectorizer costs seem to be the place where we
> have the most room for improvement.
>
> I briefly analyzed all tsvc kernels where we regress compared to clang, aocc
> and icc. You can search for tsvc in bugzilla. Richard also wrote some
> observations there.
> These are related to missing features rather than the cost model, however.
>
> One problem of tsvc is that it is FP only. I hacked it for integer, but it
> would be nice to have something else as well.
>
> > >   It seems we missed revisiting this for znver2 tuning.
> > >   I think even for znver2 it may make sense to re-enable them, so I
> > >   will benchmark this as well.
> > > - memcpy/memset expansion seems to work the same way as for znver2,
> > >   so I am keeping the same changes.
> > > - instruction scheduler is already modified in trunk to some degree
> > >   reflecting new units. The problem with instruction scheduling is that
> > >   it treats zen as an in-order CPU and is unlikely to fill all
> > >   execution resources this way.
> > >   We may want to try to model the out-of-order nature in a similar way as
> > >   LLVM does, but on the other hand the current scheduling logic seems
> > >   to do mostly fine (i.e. not worse than llvm's). What matters is
> > >   to schedule for long latencies and just after branch boundaries,
> > >   where the simplified model seems to do just fine.
> >
> > So we can keep the existing model for znver3 for GCC 11?
>
> I think so - I experimented with making the model a bit more precise and it
> does not seem to add any performance improvements and makes the automaton a
> lot bigger. The existing model already handles the updated
> zen3 latencies...
>
> I think the only possible improvement here would be to start modelling
> explicitly the out-of-order nature, but even then I am not sure how much
> benefit that can bring (given that we are limited to relatively small basic
> blocks and do not have a lot of the information needed to model the execution
> precisely). Do you have some opinions on this?

Given that basic blocks are small and the hardware itself reorders the
instructions, I don't think precisely modelling the scheduler will give much
benefit.

> Honza

Regards,
Venkat.
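
For reference, the kind of "natural loop where gather helps" discussed above is an indexed load. The following is only a minimal sketch under stated assumptions: the function name gather_kernel and its signature are hypothetical and are not taken from TSVC or from the patch. Whether GCC actually emits a gather instruction (e.g. vgatherdps) for it depends on the vectorizer cost model and the tuning in effect, which is exactly what is being retuned here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical TSVC-style kernel (name and signature are illustrative only).
   The load b[idx[i]] is an indexed access; when compiled with something like
       gcc -O3 -march=znver3 -S
   the vectorizer may or may not turn it into a gather instruction,
   depending on its cost model.  Nothing in this code forces that choice. */
void gather_kernel(float *restrict a, const float *restrict b,
                   const int *restrict idx, size_t n)
{
    /* Sanity check on the arguments; compiled out with -DNDEBUG. */
    assert(n == 0 || (a != NULL && b != NULL && idx != NULL));

    for (size_t i = 0; i < n; i++)
        a[i] = b[idx[i]];   /* each element is loaded through an index */
}
```

Inspecting the generated assembly for this loop on znver2 versus znver3 settings is one way to see whether the gather tuning makes a difference for a given compiler build.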