Re: [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine

2021-08-10 Thread Daniel Vetter
On Mon, Aug 09, 2021 at 07:05:58PM +, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > > the hang check. As such if a different GSE's state machine hangs, the
> > > heartbeat cannot detect this hang. Add timer to each GSE which in turn
> > > can disable all submissions if it is hung.
> > > 
> > > Cc: John Harrison 
> > > Signed-off-by: Matthew Brost 
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++
> > >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> > >  2 files changed, 39 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> > > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index afb9b4bb8971..2d8296bcc583 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct 
> > > guc_submit_engine *gse)
> > >   return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > > +#define MAX_TASKLET_BLOCKED_NS   20
> > >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >   lockdep_assert_held(&gse->sched_engine.lock);
> > > + hrtimer_start_range_ns(&gse->hang_timer,
> > > +ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > > +HRTIMER_MODE_REL_PINNED);
> > >   set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > 
> > So with drm/scheduler the reset handling is assumed to be
> > single-threaded, and there's quite complex rules around that. I've
> > recently worked with Boris Brezillion to clarify all this a bit and
> > improve docs. Does this all still work in that glorious future? Might be
> > good to at least sprinkle some comments/thoughts around in the commit
> > message about the envisaged future direction for all this stuff, to keep
> > people in the loop. Especially future people.
> > 
> > Ofc plan is still to just largely land all this.
> > 
> > Also: set_bit is an unordered atomic, which means you need barriers, which
> > meanes ... *insert the full rant about justifying/documenting lockless
> > algorithms from earlier *
> >
> 
> lockdep_assert_held(&gse->sched_engine.lock);
> 
> Not lockless. Also spin locks act as barriers, right?

Well if that spinlock is protecting that bit then that's good, but then it
shouldn't be an atomic set_bit. In that case:
- either make the entire bitfield non-atomic so it's clear there's boring
  dumb locking going on
- or split out your new bit into a separate field so that there's no false
  sharing with the existing bitfield state machinery, and add a kernel doc
  to that field explaining the locking

set_bit itself is atomic and unordered, so means you need barriers and all
that. If you don't have a lockless algorithm, don't use atomic bitops to
avoid confusing readers because set_bit/test_bit sets of all the warning
bells.

And yes it's annoying that for bitops the atomic ones don't have an
atomic_ prefix. The non-atomic ones have a __ prefix. This is honestly why
I don't think we should use bitfields as much as we do, because the main
use-case for them is when you have bitfields which are longer than 64bits.
They come from the cpumask world, and linux supports a lot of cpus.

Open-coding non-atomic simple bitfields with the usual C operators is
perfectly fine and legible imo. But that part is maybe more a bikeshed.

> > But I think this all falls out with the removal of the guc-id allocation
> > scheme?
> 
> Yes, this patch is getting deleted.

That works too :-)
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > >  }
> > >  
> > >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >   lockdep_assert_held(&gse->sched_engine.lock);
> > > + hrtimer_cancel(&gse->hang_timer);
> > >   clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc 
> > > *guc)
> > >   if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > >   GEM_BUG_ON(!guc->ct.enabled);
> > >   __tasklet_disable_sync_once(&sched_engine->tasklet);
> > > + hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> > >   sched_engine->tasklet.callback = NULL;
> > >   }
> > >   }
> > > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref 
> > > *kref)
> > >   kfree(gse);
> > >  }
> > >  
> > > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > > +{
> > > + struct guc_submit_engine *gse =
> > > + container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > > + struct intel_guc *guc = gse->sched_engine.private_data;
> > 

Re: [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine

2021-08-09 Thread Matthew Brost
On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > the hang check. As such if a different GSE's state machine hangs, the
> > heartbeat cannot detect this hang. Add timer to each GSE which in turn
> > can disable all submissions if it is hung.
> > 
> > Cc: John Harrison 
> > Signed-off-by: Matthew Brost 
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++
> >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> >  2 files changed, 39 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> > b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index afb9b4bb8971..2d8296bcc583 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine 
> > *gse)
> > return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > +#define MAX_TASKLET_BLOCKED_NS 20
> >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> > lockdep_assert_held(&gse->sched_engine.lock);
> > +   hrtimer_start_range_ns(&gse->hang_timer,
> > +  ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > +  HRTIMER_MODE_REL_PINNED);
> > set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> 
> So with drm/scheduler the reset handling is assumed to be
> single-threaded, and there's quite complex rules around that. I've
> recently worked with Boris Brezillion to clarify all this a bit and
> improve docs. Does this all still work in that glorious future? Might be
> good to at least sprinkle some comments/thoughts around in the commit
> message about the envisaged future direction for all this stuff, to keep
> people in the loop. Especially future people.
> 
> Ofc plan is still to just largely land all this.
> 
> Also: set_bit is an unordered atomic, which means you need barriers, which
> meanes ... *insert the full rant about justifying/documenting lockless
> algorithms from earlier *
>

lockdep_assert_held(&gse->sched_engine.lock);

Not lockless. Also spin locks act as barriers, right?
 
> But I think this all falls out with the removal of the guc-id allocation
> scheme?

Yes, this patch is getting deleted.

Matt

> -Daniel
> 
> >  }
> >  
> >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> > lockdep_assert_held(&gse->sched_engine.lock);
> > +   hrtimer_cancel(&gse->hang_timer);
> > clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
> > if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > GEM_BUG_ON(!guc->ct.enabled);
> > __tasklet_disable_sync_once(&sched_engine->tasklet);
> > +   hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> > sched_engine->tasklet.callback = NULL;
> > }
> > }
> > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref 
> > *kref)
> > kfree(gse);
> >  }
> >  
> > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > +{
> > +   struct guc_submit_engine *gse =
> > +   container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > +   struct intel_guc *guc = gse->sched_engine.private_data;
> > +
> > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > +   if (guc->gse_hang_expected)
> > +   drm_dbg(&guc_to_gt(guc)->i915->drm,
> > +   "GSE[%i] hung, disabling submission", gse->id);
> > +   else
> > +   drm_err(&guc_to_gt(guc)->i915->drm,
> > +   "GSE[%i] hung, disabling submission", gse->id);
> > +#else
> > +   drm_err(&guc_to_gt(guc)->i915->drm,
> > +   "GSE[%i] hung, disabling submission", gse->id);
> > +#endif
> > +
> > +   /*
> > +* Tasklet not making forward progress, disable submission which in turn
> > +* will kick in the heartbeat to do a full GPU reset.
> > +*/
> > +   disable_submission(guc);
> > +
> > +   return HRTIMER_NORESTART;
> > +}
> > +
> >  static void guc_submit_engine_init(struct intel_guc *guc,
> >struct guc_submit_engine *gse,
> >int id)
> > @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc 
> > *guc,
> > sched_engine->retire_inflight_request_prio =
> > guc_retire_inflight_request_prio;
> > sched_engine->private_data = guc;
> > +   hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +   gse->hang_timer.function = gse_hang;
> > gse->id = id;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h 
> > 

Re: [PATCH 20/46] drm/i915/guc: Add hang check to GuC submit engine

2021-08-09 Thread Daniel Vetter
On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> the hang check. As such if a different GSE's state machine hangs, the
> heartbeat cannot detect this hang. Add timer to each GSE which in turn
> can disable all submissions if it is hung.
> 
> Cc: John Harrison 
> Signed-off-by: Matthew Brost 
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++
>  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index afb9b4bb8971..2d8296bcc583 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine 
> *gse)
>   return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> +#define MAX_TASKLET_BLOCKED_NS   20
>  static void set_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>   lockdep_assert_held(&gse->sched_engine.lock);
> + hrtimer_start_range_ns(&gse->hang_timer,
> +ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> +HRTIMER_MODE_REL_PINNED);
>   set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);

So with drm/scheduler the reset handling is assumed to be
single-threaded, and there's quite complex rules around that. I've
recently worked with Boris Brezillion to clarify all this a bit and
improve docs. Does this all still work in that glorious future? Might be
good to at least sprinkle some comments/thoughts around in the commit
message about the envisaged future direction for all this stuff, to keep
people in the loop. Especially future people.

Ofc plan is still to just largely land all this.

Also: set_bit is an unordered atomic, which means you need barriers, which
meanes ... *insert the full rant about justifying/documenting lockless
algorithms from earlier *

But I think this all falls out with the removal of the guc-id allocation
scheme?
-Daniel

>  }
>  
>  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>   lockdep_assert_held(&gse->sched_engine.lock);
> + hrtimer_cancel(&gse->hang_timer);
>   clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
>   if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>   GEM_BUG_ON(!guc->ct.enabled);
>   __tasklet_disable_sync_once(&sched_engine->tasklet);
> + hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
>   sched_engine->tasklet.callback = NULL;
>   }
>   }
> @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
>   kfree(gse);
>  }
>  
> +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> +{
> + struct guc_submit_engine *gse =
> + container_of(hrtimer, struct guc_submit_engine, hang_timer);
> + struct intel_guc *guc = gse->sched_engine.private_data;
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> + if (guc->gse_hang_expected)
> + drm_dbg(&guc_to_gt(guc)->i915->drm,
> + "GSE[%i] hung, disabling submission", gse->id);
> + else
> + drm_err(&guc_to_gt(guc)->i915->drm,
> + "GSE[%i] hung, disabling submission", gse->id);
> +#else
> + drm_err(&guc_to_gt(guc)->i915->drm,
> + "GSE[%i] hung, disabling submission", gse->id);
> +#endif
> +
> + /*
> +  * Tasklet not making forward progress, disable submission which in turn
> +  * will kick in the heartbeat to do a full GPU reset.
> +  */
> + disable_submission(guc);
> +
> + return HRTIMER_NORESTART;
> +}
> +
>  static void guc_submit_engine_init(struct intel_guc *guc,
>  struct guc_submit_engine *gse,
>  int id)
> @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc 
> *guc,
>   sched_engine->retire_inflight_request_prio =
>   guc_retire_inflight_request_prio;
>   sched_engine->private_data = guc;
> + hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> + gse->hang_timer.function = gse_hang;
>   gse->id = id;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> index a5933e07bdd2..eae2e9725ede 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> @@ -6,6 +6,8 @@
>  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
>  #define _INTEL_GUC_SUBMISSION_TYPES_H_
>  
> +#include 
> +
>  #in