[AMD Official Use Only - AMD Internal Distribution Only] Hi Philipp, Christian,
I modified the commit msg as: drm/sched: Remove optimization that causes hang when killing dependent jobs When application A submits jobs and application B submits a job with a dependency on A's fence, the normal flow wakes up the scheduler after processing each job. However, the optimization in drm_sched_entity_add_dependency_cb() uses a callback that only clears dependencies without waking up the scheduler. When application A is killed before its jobs can run, the callback gets triggered but only clears the dependency without waking up the scheduler, causing the scheduler to enter sleep state and application B to hang. Remove the optimization by deleting drm_sched_entity_clear_dep() and its usage, ensuring the scheduler is always woken up when dependencies are cleared. Cc: sta...@vger.kernel.org # v4.6+ Fixes: 777dbd458c89 ("drm/amdgpu: drop a dummy wakeup scheduler") Suggested-by: Christian König <christian.koe...@amd.com> Signed-off-by: Lin.Cao <linca...@amd.com> Thanks, Lin ________________________________ From: Philipp Stanner <pha...@mailbox.org> Sent: Wednesday, July 16, 2025 17:57 To: cao, lin <lin....@amd.com>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; pha...@kernel.org <pha...@kernel.org> Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily <emily.d...@amd.com>; Koenig, Christian <christian.koe...@amd.com>; d...@kernel.org <d...@kernel.org>; matthew.br...@intel.com <matthew.br...@intel.com> Subject: Re: [PATCH] drm/sched: Remove optimization that causes hang when killing dependent jobs On Wed, 2025-07-16 at 09:43 +0000, cao, lin wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > Hi Philipp, > > > Thank you for the review. I found that this optimization was > introduced 9 years ago in commit > 777dbd458c89d4ca74a659f85ffb5bc817f29a35 ("drm/amdgpu: drop a dummy > wakeup scheduler"). > > > Given that the codebase has undergone significant changes over these > 9 years. May I ask if I still need to include the Fixes: tag? Yes. It's a helpful marker to see where the problem comes from, and it adds redundancy helping the stable-kernel maintainers in figuring out to which kernels to backport it to. If stable can't apply a patch to a very old stable kernel because the code base changed too much, they'll ping us and we might provide a dedicated fix. So like that: Cc: sta...@vger.kernel.org # v4.6+ Fixes: 777dbd458c89 ("drm/amdgpu: drop a dummy wakeup scheduler") P. > > > Thanks, > Lin > > > From: Philipp Stanner <pha...@mailbox.org> > Sent: Wednesday, July 16, 2025 16:33 > To: cao, lin <lin....@amd.com>; dri-devel@lists.freedesktop.org > <dri-devel@lists.freedesktop.org> > Cc: Yin, ZhenGuo (Chris) <zhenguo....@amd.com>; Deng, Emily > <emily.d...@amd.com>; Koenig, Christian <christian.koe...@amd.com>; > pha...@kernel.org <pha...@kernel.org>; d...@kernel.org > <d...@kernel.org>; matthew.br...@intel.com <matthew.br...@intel.com> > Subject: Re: [PATCH] drm/sched: Remove optimization that causes hang > when killing dependent jobs > > > > > On Tue, 2025-07-15 at 21:50 +0800, Lin.Cao wrote: > > When application A submits jobs and application B submits a job > > with > > a > > dependency on A's fence, the normal flow wakes up the scheduler > > after > > processing each job. However, the optimization in > > drm_sched_entity_add_dependency_cb() uses a callback that only > > clears > > dependencies without waking up the scheduler. > > > > When application A is killed before its jobs can run, the callback > > gets > > triggered but only clears the dependency without waking up the > > scheduler, > > causing the scheduler to enter sleep state and application B to > > hang. > > > > Remove the optimization by deleting drm_sched_entity_clear_dep() > > and > > its > > usage, ensuring the scheduler is always woken up when dependencies > > are > > cleared. > > > > Signed-off-by: Lin.Cao <linca...@amd.com> > > This is, still, a bug fix, so it needs Fixes: and Cc: stable :) > > Could also include a Suggested-by: Christian > > P. > > > --- > > drivers/gpu/drm/scheduler/sched_entity.c | 21 ++------------------ > > - > > 1 file changed, 2 insertions(+), 19 deletions(-) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c > > b/drivers/gpu/drm/scheduler/sched_entity.c > > index e671aa241720..ac678de7fe5e 100644 > > --- a/drivers/gpu/drm/scheduler/sched_entity.c > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c > > @@ -355,17 +355,6 @@ void drm_sched_entity_destroy(struct > > drm_sched_entity *entity) > > } > > EXPORT_SYMBOL(drm_sched_entity_destroy); > > > > -/* drm_sched_entity_clear_dep - callback to clear the entities > > dependency */ > > -static void drm_sched_entity_clear_dep(struct dma_fence *f, > > - struct dma_fence_cb *cb) > > -{ > > - struct drm_sched_entity *entity = > > - container_of(cb, struct drm_sched_entity, cb); > > - > > - entity->dependency = NULL; > > - dma_fence_put(f); > > -} > > - > > /* > > * drm_sched_entity_wakeup - callback to clear the entity's > > dependency and > > * wake up the scheduler > > @@ -376,7 +365,8 @@ static void drm_sched_entity_wakeup(struct > > dma_fence *f, > > struct drm_sched_entity *entity = > > container_of(cb, struct drm_sched_entity, cb); > > > > - drm_sched_entity_clear_dep(f, cb); > > + entity->dependency = NULL; > > + dma_fence_put(f); > > drm_sched_wakeup(entity->rq->sched); > > } > > > > @@ -429,13 +419,6 @@ static bool > > drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity) > > fence = dma_fence_get(&s_fence->scheduled); > > dma_fence_put(entity->dependency); > > entity->dependency = fence; > > - if (!dma_fence_add_callback(fence, &entity->cb, > > - > > drm_sched_entity_clear_dep)) > > - return true; > > - > > - /* Ignore it when it is already scheduled */ > > - dma_fence_put(fence); > > - return false; > > } > > > > if (!dma_fence_add_callback(entity->dependency, &entity->cb, >