On Sat, 29 Jun 2024 13:10:35 +0100 Liviu Dudau <liviu.du...@arm.com> wrote:

> On Sat, Jun 29, 2024 at 10:52:56AM +0200, Boris Brezillon wrote:
> > On Fri, 28 Jun 2024 22:34:57 +0100
> > Liviu Dudau <liviu.du...@arm.com> wrote:
> > 
> > > On Fri, Jun 28, 2024 at 04:55:36PM +0200, Boris Brezillon wrote:
> > > > A sync-only job is meant to provide a synchronization point on a
> > > > queue, so we can't return a NULL fence there, we have to add a signal
> > > > operation to the command stream which executes after all other
> > > > previously submitted jobs are done.
> > > > 
> > > > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> > > > Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com>
> > > 
> > > Took me a bit longer to read, let's blame Friday.
> > > 
> > > > ---
> > > >  drivers/gpu/drm/panthor/panthor_sched.c | 41 ++++++++++++++++++++-----
> > > >  include/uapi/drm/panthor_drm.h          |  5 +++
> > > >  2 files changed, 38 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > > > index 79ffcbc41d78..951ff7e63ea8 100644
> > > > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > > > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > > > @@ -458,6 +458,16 @@ struct panthor_queue {
> > > >  		/** @seqno: Sequence number of the last initialized fence. */
> > > >  		atomic64_t seqno;
> > > >  
> > > > +		/**
> > > > +		 * @last_fence: Fence of the last submitted job.
> > > > +		 *
> > > > +		 * We return this fence when we get an empty command stream.
> > > > +		 * This way, we are guaranteed that all earlier jobs have
> > > > +		 * completed when drm_sched_job::s_fence::finished is
> > > > +		 * signaled, without having to feed the CS ring buffer with
> > > > +		 * a dummy job that only signals the fence.
> > > > +		 */
> > > > +		struct dma_fence *last_fence;
> > > > +
> > > >  		/**
> > > >  		 * @in_flight_jobs: List containing all in-flight jobs.
> > > >  		 *
> > > > @@ -829,6 +839,9 @@ static void group_free_queue(struct panthor_group *group, struct panthor_queue *
> > > >  	panthor_kernel_bo_destroy(queue->ringbuf);
> > > >  	panthor_kernel_bo_destroy(queue->iface.mem);
> > > >  
> > > > +	/* Release the last_fence we were holding, if any. */
> > > > +	dma_fence_put(queue->fence_ctx.last_fence);
> > > > +
> > > >  	kfree(queue);
> > > >  }
> > > >  
> > > > @@ -2865,11 +2878,14 @@ queue_run_job(struct drm_sched_job *sched_job)
> > > >  	static_assert(sizeof(call_instrs) % 64 == 0,
> > > >  		      "call_instrs is not aligned on a cacheline");
> > > >  
> > > > -	/* Stream size is zero, nothing to do => return a NULL fence and let
> > > > -	 * drm_sched signal the parent.
> > > > +	/* Stream size is zero, nothing to do except making sure all previously
> > > > +	 * submitted jobs are done before we signal the
> > > > +	 * drm_sched_job::s_fence::finished fence.
> > > >  	 */
> > > > -	if (!job->call_info.size)
> > > > -		return NULL;
> > > > +	if (!job->call_info.size) {
> > > > +		job->done_fence = dma_fence_get(queue->fence_ctx.last_fence);
> > > > +		return job->done_fence;
> > > 
> > > What happens if the last job's done_fence was cancelled or timed out? Is
> > > the sync job's done_fence going to be signalled with the same error?
> > 
> > It's the same object, so yes, the job will also be considered faulty
> > (the error is propagated to the job::s_fence::finished fence). I guess
> > synchronization jobs are not supposed to fail/timeout in theory, because
> > they don't do anything, but I don't think that's an issue in practice,
> > because dma_fence errors are never propagated to user-space (only the
> > queue status is).
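
(To make the error propagation above concrete, here's a minimal sketch
using the generic dma-fence helpers; fail_job_fence() is a hypothetical
helper, not something in the patch, and the actual panthor call sites may
differ:)

#include <linux/dma-fence.h>

/* Fail a job's fence the way a timeout handler typically would. A
 * sync-only job holds a reference to the *same* dma_fence object (via
 * dma_fence_get(last_fence)), so the error set here is visible through
 * that job too, and drm_sched carries it over to its s_fence::finished.
 */
static void fail_job_fence(struct dma_fence *done_fence, int error)
{
	dma_fence_set_error(done_fence, error);	/* e.g. -ETIMEDOUT */
	dma_fence_signal(done_fence);
}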
> > 
> > > Now that we're returning a fence here, should the job be also added
> > > into the in_flight_jobs?
> > 
> > Yeah, it's left out of in_flight_jobs on purpose, such that we don't
> > end up signalling the same dma_fence object twice (which is forbidden).
> 
> That's the thing I was trying to figure out on Friday: how do we stop the
> fence returned as last_fence for the sync job from being signalled after
> the job's done_fence has also been signalled. I can't say that I found a
> good answer, so if you can point me to what I've missed it will be
> appreciated.

Well, there's only one job added to in_flight_jobs for a given done_fence
object, so there's no risk of signalling such a fence more than once (even
in case of a timeout).

Now, let's say the previous 'real' job is already done when this
'dummy/sync' job is queued: the queue::fence_ctx::last_fence object will
have been signalled too, since it's essentially the same object, and when
we return an already-signalled fence to the drm_sched layer from our
::run_job() callback, it simply signals drm_sched_job::s_fence::finished
immediately, just like when we return a NULL fence.
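
For the record, the hunks that keep last_fence current aren't visible in
the quote above; the idea is roughly the following (a sketch of the
mechanism, not the verbatim patch):

	/* In queue_run_job(), after the empty-stream early return shown
	 * in the diff, the non-empty path remembers the fence of the job
	 * it just queued as the queue's new synchronization point:
	 */
	dma_fence_put(queue->fence_ctx.last_fence);
	queue->fence_ctx.last_fence = dma_fence_get(job->done_fence);

At queue creation, last_fence presumably also needs an initial value so
that a sync-only job submitted before any real job gets an
already-signalled fence; dma_fence_get_stub() would do, since it returns
a reference to a permanently signalled stub fence:

	queue->fence_ctx.last_fence = dma_fence_get_stub();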