Hi Andy,

Thanks for review.

On Thursday, 2 March 2023 01:42:05 CET Andi Shyti wrote:
> Hi Janusz,
> 
> On Sat, Feb 25, 2023 at 11:12:18PM +0100, Janusz Krzysztofik wrote:
> > Users reported oopses on list corruptions when using i915 perf with a
> > number of concurrently running graphics applications.  Root cause analysis
> > pointed at an issue in barrier processing code -- a race among perf open /
> > close replacing active barriers with perf requests on kernel context and
> > concurrent barrier preallocate / acquire operations performed during user
> > context first pin / last unpin.
> > 
> > When adding a request to a composite tracker, we try to reuse an existing
> > fence tracker, already allocated and registered with that composite.  The
> > tracker we obtain may already track another fence, may be an idle barrier,
> > or an active barrier.
> > 
> > If the tracker we get occurs a non-idle barrier then we try to delete that
> > barrier from a list of barrier tasks it belongs to.  However, while doing
> > that we don't respect return value from a function that performs the
> > barrier deletion.  Should the deletion ever failed, we would end up
> > reusing the tracker still registered as a barrier task.  Since the same
> > structure field is reused with both fence callback lists and barrier
> > tasks list, list corruptions would likely occur.
> > 
> > Barriers are now deleted from a barrier tasks list by temporarily removing
> > the list content, traversing that content with skip over the node to be
> > deleted, then populating the list back with the modified content.  Should
> > that intentionally racy concurrent deletion attempts be not serialized,
> > one or more of those may fail because of the list being temporary empty.
> > 
> > Related code that ignores the results of barrier deletion was initially
> > introduced in v5.4 by commit d8af05ff38ae ("drm/i915: Allow sharing the
> > idle-barrier from other kernel requests").  However, all users of the
> > barrier deletion routine were apparently serialized at that time, then the
> > issue didn't exhibit itself.  Results of git bisect with help of a newly
> > developed igt@gem_barrier_race@remote-request IGT test indicate that list
> > corruptions might start to appear after commit 311770173fac ("drm/i915/gt:
> > Schedule request retirement when timeline idles"), introduced in v5.5.
> > 
> > Respect results of barrier deletion attempts -- mark the barrier as idle
> > only if successfully deleted from the list.  Then, before proceeding with
> > setting our fence as the one currently tracked, make sure that the tracker
> > we've got is not a non-idle barrier.  If that check fails then don't use
> > that tracker but go back and try to acquire a new, usable one.
> > 
> > v2: no code changes,
> >   - blame commit 311770173fac ("drm/i915/gt: Schedule request retirement
> >     when timeline idles"), v5.5, not commit d8af05ff38ae ("drm/i915: Allow
> >     sharing the idle-barrier from other kernel requests"), v5.4,
> >   - reword commit description.
> 
> That's a very good explanation and very much needed for such a
> catch. Thanks!
> 
> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6333
> > Fixes: 311770173fac ("drm/i915/gt: Schedule request retirement when 
timeline idles")
> > Cc: Chris Wilson <ch...@chris-wilson.co.uk>
> > Cc: sta...@vger.kernel.org # v5.5
> > Signed-off-by: Janusz Krzysztofik <janusz.krzyszto...@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_active.c | 25 ++++++++++++++-----------
> >  1 file changed, 14 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/
i915_active.c
> > index 7412abf166a8c..f9282b8c87c1c 100644
> > --- a/drivers/gpu/drm/i915/i915_active.c
> > +++ b/drivers/gpu/drm/i915/i915_active.c
> > @@ -422,12 +422,12 @@ replace_barrier(struct i915_active *ref, struct 
i915_active_fence *active)
> >      * we can use it to substitute for the pending idle-barrer
> >      * request that we want to emit on the kernel_context.
> >      */
> > -   __active_del_barrier(ref, node_from_active(active));
> > -   return true;
> > +   return __active_del_barrier(ref, node_from_active(active));
> 
> In general, I support the idea of always checking the return
> value, even if we expect a certain outcome. In these cases, the
> likely/unlikely macros can be helpful. 

OK.

> Given this change, I
> believe the patch deserves an ack.
> 
> That being said, I was curious whether using an explicit lock
> and a normal list of active barriers, rather than a lockless
> list, could have solved the problem.  It seems like using a
> lockless list and iterating over it could be overkill, unless
> there are specific scenarios where the lockless properties are
> necessary.
> 
> Of course, this may be something to consider in a future cleanup,
> as it may be outside the scope of this particular patch.

Yes, I think so.

> 
> >  }
> >  
> >  int i915_active_add_request(struct i915_active *ref, struct i915_request 
*rq)
> >  {
> > +   u64 idx = i915_request_timeline(rq)->fence_context;
> >     struct dma_fence *fence = &rq->fence;
> >     struct i915_active_fence *active;
> >     int err;
> > @@ -437,16 +437,19 @@ int i915_active_add_request(struct i915_active *ref, 
struct i915_request *rq)
> >     if (err)
> >             return err;
> >  
> > -   active = active_instance(ref, i915_request_timeline(rq)-
>fence_context);
> > -   if (!active) {
> > -           err = -ENOMEM;
> > -           goto out;
> > -   }
> > +   do {
> > +           active = active_instance(ref, idx);
> > +           if (!active) {
> > +                   err = -ENOMEM;
> > +                   goto out;
> > +           }
> > +
> > +           if (replace_barrier(ref, active)) {
> > +                   RCU_INIT_POINTER(active->fence, NULL);
> > +                   atomic_dec(&ref->count);
> > +           }
> > +   } while (is_barrier(active));
> 
> unlikely()?

OK, please expect v3 with this added.

Thanks,
Janusz

> 
> It's worth noting that for each iteration, we are rebuilding the
> list of barriers.  Therefore, I believe it may be necessary to
> perform a cleanup within the replace_barrier() function.
> 
> Thanks,
> Andi
> 
> >  
> > -   if (replace_barrier(ref, active)) {
> > -           RCU_INIT_POINTER(active->fence, NULL);
> > -           atomic_dec(&ref->count);
> > -   }
> >     if (!__i915_active_fence_set(active, fence))
> >             __i915_active_acquire(ref);
> >  
> 




Reply via email to