On Wed, Nov 24, 2021 at 08:55:39AM -0500, Rodrigo Vivi wrote:
> On Wed, Nov 24, 2021 at 08:56:52AM +0000, Tvrtko Ursulin wrote:
> > 
> > On 23/11/2021 19:52, Rodrigo Vivi wrote:
> > > On Tue, Nov 23, 2021 at 09:39:25AM +0000, Tvrtko Ursulin wrote:
> > > > 
> > > > On 17/11/2021 22:49, Vinay Belgaumkar wrote:
> > > > > From: Chris Wilson <ch...@chris-wilson.co.uk>
> > > > > 
> > > > > Every time we come to the end of a virtual engine's context,
> > > > > re-randomise its siblings[]. As we schedule the siblings' tasklets
> > > > > in the order they are in the array, earlier entries are executed
> > > > > first (when idle) and so will be preferred when scheduling the next
> > > > > virtual request. Currently, we only update the array when switching
> > > > > onto a new idle engine, so we prefer to stick on the last executed
> > > > > engine, keeping the work compact. However, it can be beneficial to
> > > > > spread the work out across idle engines, so choose another sibling
> > > > > as our preferred target at the end of the context's execution.
> > > > 
> > > > This partially brings back, from a different angle, the more dynamic
> > > > scheduling behavior which has been lost since bugfix 90a987205c6c
> > > > ("drm/i915/gt: Only swap to a random sibling once upon creation").
> > > 
> > > Shouldn't we use the Fixes tag here since this is targeting to fix one
> > > of the performance regressions of this patch?
> > 
> > Probably not but hard to say. Note that it wasn't a performance regression
> > that was reported but power.
> > 
> > And to go back to what we said elsewhere in the thread, I am actually with
> > you in thinking that in the ideal world we need PnP testing across a variety
> > of workloads and platforms. And "in the ideal world" should really be in the
> > normal world. It is not professional to be reactive to isolated bug reports
> > from users, without being able to see the overall picture.
> 
> We surely need to address the bug report from users. I'm just asking to
> address that with the smallest fix that we can backport and fit to the
> product milestones.
> 
> Instead, we are creating another optimization feature in a rush, without
> proper validation.
> 
> I believe it is too risky to add an algorithm like that without broader
> testing. I see a big risk of introducing corner cases that will result in
> more bug reports from other users in the near future.
> 
> So, let's all be professionals and provide a smaller fix for the regression
> in the load-balancing scenario, and provide better validation with more
> data to justify this new feature.

Okay, after more IRC discussions I see that patch 2 is also part of the solution
and probably safe.

Let me be clear that my biggest complaint, and the risk, is with race-to-idle
in patch 3: trying to predict the rc6 behavior and increasing the freq to try
to complete the job faster and then get to rc6 faster... That one would need
a lot more validation.

> 
> Thanks,
> Rodrigo.
> 
> > 
> > > > One day we could experiment with using engine busyness as the
> > > > criterion (instead of random). Back in the day busyness was kind of
> > > > the best strategy, although sampled at submit, not at the trailing
> > > > edge like here, but it still may be able to settle down to a better
> > > > engine configuration in some scenarios. Only testing could say.
> > > > 
> > > > Still, from memory random also wasn't that bad, so this should be
> > > > okay for now.
> > > > 
> > > > Reviewed-by: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
> > > 
> > > Since you reviewed it, and it looks to be a middle-ground point in
> > > terms of when to rebalance (always, like in the initial implementation,
> > > vs. only once, like in 90a987205c6c):
> > > 
> > > If this one is really fixing the regression by itself:
> > > Acked-by: Rodrigo Vivi <rodrigo.v...@intel.com>
> > > on this patch here.
> > > 
> > > But I still don't want to take the risk of touching the freq with
> > > race-to-idle until I am convinced that it is absolutely needed and
> > > that we are not breaking the world out there.
> > 
> > Yes, agreed in principle; we have users with different priorities.
> > 
> > However the RPS patches in the series, definitely the 1st one which looks at
> > classes versus individual engines, sound plausible to me. Given the absence
> > of automated PnP testing mentioned above, in the past it was usually Chris
> > who was making the above and beyond effort to evaluate changes like these on
> > as many platforms as he could, and with different workloads. Not sure who
> > has the mandate and drive to fill that space but something will need to
> > happen.
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > > 
> > > > Regards,
> > > > 
> > > > Tvrtko
> > > > 
> > > > > Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> > > > > Cc: Vinay Belgaumkar <vinay.belgaum...@intel.com>
> > > > > Cc: Tvrtko Ursulin <tvrtko.ursu...@linux.intel.com>
> > > > > ---
> > > > >    .../drm/i915/gt/intel_execlists_submission.c  | 80 ++++++++++++-------
> > > > >    1 file changed, 52 insertions(+), 28 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > index ca03880fa7e4..b95bbc8fb91a 100644
> > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > > > > @@ -539,6 +539,41 @@ static void execlists_schedule_in(struct i915_request *rq, int idx)
> > > > >       GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
> > > > >    }
> > > > > +static void virtual_xfer_context(struct virtual_engine *ve,
> > > > > +                              struct intel_engine_cs *engine)
> > > > > +{
> > > > > +     unsigned int n;
> > > > > +
> > > > > +     if (likely(engine == ve->siblings[0]))
> > > > > +             return;
> > > > > +
> > > > > +     if (!intel_engine_has_relative_mmio(engine))
> > > > > +             lrc_update_offsets(&ve->context, engine);
> > > > > +
> > > > > +     /*
> > > > > +      * Move the bound engine to the top of the list for
> > > > > +      * future execution. We then kick this tasklet first
> > > > > +      * before checking others, so that we preferentially
> > > > > +      * reuse this set of bound registers.
> > > > > +      */
> > > > > +     for (n = 1; n < ve->num_siblings; n++) {
> > > > > +             if (ve->siblings[n] == engine) {
> > > > > +                     swap(ve->siblings[n], ve->siblings[0]);
> > > > > +                     break;
> > > > > +             }
> > > > > +     }
> > > > > +}
> > > > > +
> > > > > +static int ve_random_sibling(struct virtual_engine *ve)
> > > > > +{
> > > > > +     return prandom_u32_max(ve->num_siblings);
> > > > > +}
> > > > > +
> > > > > +static int ve_random_other_sibling(struct virtual_engine *ve)
> > > > > +{
> > > > > +     return 1 + prandom_u32_max(ve->num_siblings - 1);
> > > > > +}
> > > > > +
> > > > >    static void
> > > > >    resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve)
> > > > >    {
> > > > > @@ -578,8 +613,23 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
> > > > >           rq->execution_mask != engine->mask)
> > > > >               resubmit_virtual_request(rq, ve);
> > > > > -     if (READ_ONCE(ve->request))
> > > > > +     /*
> > > > > +      * Reschedule with a new "preferred" sibling.
> > > > > +      *
> > > > > +      * The tasklets are executed in the order of ve->siblings[], so
> > > > > +      * siblings[0] receives preferential treatment of greedily checking
> > > > > +      * for execution of the virtual engine. At this point, the virtual
> > > > > +      * engine is no longer in the current GPU cache due to idleness or
> > > > > +      * contention, so it can be executed on any engine without penalty.
> > > > > +      * We re-randomise at this point in order to spread light loads
> > > > > +      * across the system; heavy overlapping loads will continue to be
> > > > > +      * greedily executed by the first available engine.
> > > > > +      */
> > > > > +     if (READ_ONCE(ve->request)) {
> > > > > +             virtual_xfer_context(ve,
> > > > > +                                  ve->siblings[ve_random_other_sibling(ve)]);
> > > > >               tasklet_hi_schedule(&ve->base.sched_engine->tasklet);
> > > > > +     }
> > > > >    }
> > > > >    static void __execlists_schedule_out(struct i915_request * const rq,
> > > > > @@ -1030,32 +1080,6 @@ first_virtual_engine(struct intel_engine_cs *engine)
> > > > >       return NULL;
> > > > >    }
> > > > > -static void virtual_xfer_context(struct virtual_engine *ve,
> > > > > -                              struct intel_engine_cs *engine)
> > > > > -{
> > > > > -     unsigned int n;
> > > > > -
> > > > > -     if (likely(engine == ve->siblings[0]))
> > > > > -             return;
> > > > > -
> > > > > -     GEM_BUG_ON(READ_ONCE(ve->context.inflight));
> > > > > -     if (!intel_engine_has_relative_mmio(engine))
> > > > > -             lrc_update_offsets(&ve->context, engine);
> > > > > -
> > > > > -     /*
> > > > > -      * Move the bound engine to the top of the list for
> > > > > -      * future execution. We then kick this tasklet first
> > > > > -      * before checking others, so that we preferentially
> > > > > -      * reuse this set of bound registers.
> > > > > -      */
> > > > > -     for (n = 1; n < ve->num_siblings; n++) {
> > > > > -             if (ve->siblings[n] == engine) {
> > > > > -                     swap(ve->siblings[n], ve->siblings[0]);
> > > > > -                     break;
> > > > > -             }
> > > > > -     }
> > > > > -}
> > > > > -
> > > > >    static void defer_request(struct i915_request *rq, struct list_head * const pl)
> > > > >    {
> > > > >       LIST_HEAD(list);
> > > > > @@ -3590,7 +3614,7 @@ static void virtual_engine_initial_hint(struct virtual_engine *ve)
> > > > >        * NB This does not force us to execute on this engine, it will just
> > > > >        * typically be the first we inspect for submission.
> > > > > -     swp = prandom_u32_max(ve->num_siblings);
> > > > > +     swp = ve_random_sibling(ve);
> > > > >       if (swp)
> > > > >               swap(ve->siblings[swp], ve->siblings[0]);
> > > > >    }
> > > > > 
