Re: [PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime

2024-05-20 Thread Lucas De Marchi

On Tue, May 21, 2024 at 02:53:56AM GMT, Matthew Brost wrote:

On Sat, May 18, 2024 at 09:37:20AM -0500, Lucas De Marchi wrote:

On Fri, May 17, 2024 at 03:40:22PM GMT, Matt Roper wrote:
> On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:
> > From: Umesh Nerlige Ramappa 
> >
> > Add a helper to accumulate per-client runtime of all its
> > exec queues. This is called every time a sched job is finished.
> >
> > v2:
> >   - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
> > runtime when job is finished since xe_sched_job_completed() is not a
> > notification that job finished.
> >   - Stop trying to update runtime from xe_exec_queue_fini() - that is
> > redundant and may happen after xef is closed, leading to a
> > use-after-free
> >   - Do not special case the first timestamp read: the default LRC sets
> > CTX_TIMESTAMP to zero, so even the first sample should be a valid
> > one.
> >   - Handle the parallel submission case by multiplying the runtime by
> > width.
> > v3: Update comments
> >
> > Signed-off-by: Umesh Nerlige Ramappa 
> > Signed-off-by: Lucas De Marchi 
> > ---
> >  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
> >  drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
> >  drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
> >  drivers/gpu/drm/xe/xe_execlist.c |  1 +
> >  drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
> >  5 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 5c5e36de452a..bc97990fd032 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -555,6 +555,9 @@ struct xe_file {
> >   struct mutex lock;
> >   } exec_queue;
> >
> > + /** @runtime: hw engine class runtime in ticks for this drm client */
> > + u64 runtime[XE_ENGINE_CLASS_MAX];
> > +
> >   /** @client: drm client */
> >   struct xe_drm_client *client;
> >  };
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
> > index 395de93579fa..fa6dc996eca8 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > @@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
> >   q->lrc[0].fence_ctx.next_seqno - 1;
> >  }
> >
> > +/**
> > + * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
> > + * @q: The exec queue
> > + *
> > + * Update the timestamp saved by HW for this exec queue and save runtime
> > + * calculated by using the delta from last update. On multi-lrc case, only the
> > + * first is considered.
> > + */
> > +void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
> > +{
> > + struct xe_file *xef;
> > + struct xe_lrc *lrc;
> > + u32 old_ts, new_ts;
> > +
> > + /*
> > +  * Jobs that are run during driver load may use an exec_queue, but are
> > +  * not associated with a user xe file, so avoid accumulating busyness
> > +  * for kernel specific work.
> > +  */
> > + if (!q->vm || !q->vm->xef)
> > + return;
> > +
> > + xef = q->vm->xef;
> > +
> > + /*
> > +  * Only sample the first LRC. For parallel submission, all of them are
> > +  * scheduled together and we compensate that below by multiplying by
> > +  * width - this may introduce errors if that premise is not true and
> > +  * they don't exit 100% aligned. On the other hand, looping through
> > +  * the LRCs and reading them in different time could also introduce
> > +  * errors.
>
> At the time we're executing this function, those LRCs aren't executing
> on the hardware anymore and their timestamps aren't continuing to move,

not necessarily. Besides being called when execution ends, this is also
called when userspace reads the fdinfo file (see the last patch), which
potentially reads this for running LRCs.
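
As an illustration of that path, here is a rough sketch of fdinfo-time
sampling. Only xef->exec_queue.lock, xef->runtime[] and
xe_exec_queue_update_runtime() appear in this patch; the xarray name, the
print helper and the key format are assumptions, since the last patch is
not quoted here.

/* Sketch only: sample all exec queues of a client when its fdinfo is read */
static void show_runtime(struct drm_printer *p, struct xe_file *xef)
{
        struct xe_exec_queue *q;
        unsigned long i;

        /* Update accumulated runtime, including for still-running LRCs */
        mutex_lock(&xef->exec_queue.lock);
        xa_for_each(&xef->exec_queue.xa, i, q)
                xe_exec_queue_update_runtime(q);
        mutex_unlock(&xef->exec_queue.lock);

        for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
                drm_printf(p, "drm-engine-class-%lu:\t%llu ticks\n",
                           i, xef->runtime[i]);
}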

> right?  I don't see where error could creep in from just looping over
> each of them?
>
> I guess parallel submission is mostly just used by media these days,
> where the batches submitted in parallel are nearly identical and
> expected to run the same amount of time, right?  Do we have any

what I got from Matt Brost is that they are always scheduled together
and will run together on the different instances of that engine class,
regardless of whether they are different. As you said, I'm not sure there's
even a use case for running different batches.  +Matt Brost to confirm


I believe the current usage involves running the same batch program, but
it operates on a different set of data. Even if widely different batches
were submitted across multiple engines, the GuC schedules all engines
simultaneously and does not schedule anything else on the engine set
until all engines are idle. So, I think the logic as it stands is fine.


that's sufficient for me :)
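
For comparison, the per-LRC loop discussed above would look roughly like
the sketch below, reusing only symbols visible in the patch; it is not
what the series implements, which samples lrc[0] and multiplies the delta
by width.

/*
 * Sketch only: sample every LRC instead of lrc[0] * width. Each
 * xe_lrc_update_timestamp() call happens at a slightly different time,
 * which is the skew the comment in the patch worries about.
 */
static void update_runtime_per_lrc(struct xe_exec_queue *q, struct xe_file *xef)
{
        u32 old_ts, new_ts;
        int i;

        for (i = 0; i < q->width; i++) {
                new_ts = xe_lrc_update_timestamp(&q->lrc[i], &old_ts);
                xef->runtime[q->class] += new_ts - old_ts;
        }
}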


If everyone is paranoid, feel free to sample 

Re: [PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime

2024-05-20 Thread Matthew Brost
On Sat, May 18, 2024 at 09:37:20AM -0500, Lucas De Marchi wrote:
> On Fri, May 17, 2024 at 03:40:22PM GMT, Matt Roper wrote:
> > On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:
> > > From: Umesh Nerlige Ramappa 
> > > 
> > > Add a helper to accumulate per-client runtime of all its
> > > exec queues. This is called every time a sched job is finished.
> > > 
> > > v2:
> > >   - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
> > > runtime when job is finished since xe_sched_job_completed() is not a
> > > notification that job finished.
> > >   - Stop trying to update runtime from xe_exec_queue_fini() - that is
> > > redundant and may happen after xef is closed, leading to a
> > > use-after-free
> > >   - Do not special case the first timestamp read: the default LRC sets
> > > CTX_TIMESTAMP to zero, so even the first sample should be a valid
> > > one.
> > >   - Handle the parallel submission case by multiplying the runtime by
> > > width.
> > > v3: Update comments
> > > 
> > > Signed-off-by: Umesh Nerlige Ramappa 
> > > Signed-off-by: Lucas De Marchi 
> > > ---
> > >  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
> > >  drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
> > >  drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
> > >  drivers/gpu/drm/xe/xe_execlist.c |  1 +
> > >  drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
> > >  5 files changed, 44 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
> > > b/drivers/gpu/drm/xe/xe_device_types.h
> > > index 5c5e36de452a..bc97990fd032 100644
> > > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > > @@ -555,6 +555,9 @@ struct xe_file {
> > >   struct mutex lock;
> > >   } exec_queue;
> > > 
> > > + /** @runtime: hw engine class runtime in ticks for this drm client */
> > > + u64 runtime[XE_ENGINE_CLASS_MAX];
> > > +
> > >   /** @client: drm client */
> > >   struct xe_drm_client *client;
> > >  };
> > > diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
> > > b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > index 395de93579fa..fa6dc996eca8 100644
> > > --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> > > +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> > > @@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
> > >   q->lrc[0].fence_ctx.next_seqno - 1;
> > >  }
> > > 
> > > +/**
> > > + * xe_exec_queue_update_runtime() - Update runtime for this exec queue 
> > > from hw
> > > + * @q: The exec queue
> > > + *
> > > + * Update the timestamp saved by HW for this exec queue and save runtime
> > > + * calculated by using the delta from last update. On multi-lrc case, 
> > > only the
> > > + * first is considered.
> > > + */
> > > +void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
> > > +{
> > > + struct xe_file *xef;
> > > + struct xe_lrc *lrc;
> > > + u32 old_ts, new_ts;
> > > +
> > > + /*
> > > +  * Jobs that are run during driver load may use an exec_queue, but are
> > > +  * not associated with a user xe file, so avoid accumulating busyness
> > > +  * for kernel specific work.
> > > +  */
> > > + if (!q->vm || !q->vm->xef)
> > > + return;
> > > +
> > > + xef = q->vm->xef;
> > > +
> > > + /*
> > > +  * Only sample the first LRC. For parallel submission, all of them are
> > > +  * scheduled together and we compensate that below by multiplying by
> > > +  * width - this may introduce errors if that premise is not true and
> > > +  * they don't exit 100% aligned. On the other hand, looping through
> > > +  * the LRCs and reading them in different time could also introduce
> > > +  * errors.
> > 
> > At the time we're executing this function, those LRCs aren't executing
> > on the hardware anymore and their timestamps aren't continuing to move,
> 
> not necessarily. Besides being called when execution ends, this is also
> called when userspace reads the fdinfo file (see the last patch), which
> potentially reads this for running LRCs.
> 
> > right?  I don't see where error could creep in from just looping over
> > each of them?
> > 
> > I guess parallel submission is mostly just used by media these days,
> > where the batches submitted in parallel are nearly identical and
> > expected to run the same amount of time, right?  Do we have any
> 
> what I got from Matt Brost is that they are always scheduled together
> and will run together on the different instances of that engine class,
> regardless of whether they are different. As you said, I'm not sure there's
> even a use case for running different batches.  +Matt Brost to confirm

I believe the current usage involves running the same batch program, but
it operates on a different set of data. Even if widely different batches
were submitted across multiple engines, the GuC schedules all engines
simultaneously and does not schedule anything else on the engine set
until all engines are idle. So, I think the logic as it stands is fine.
If 

Re: [PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime

2024-05-18 Thread Lucas De Marchi

On Fri, May 17, 2024 at 03:40:22PM GMT, Matt Roper wrote:

On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:

From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.

v2:
  - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
runtime when job is finished since xe_sched_job_completed() is not a
notification that job finished.
  - Stop trying to update runtime from xe_exec_queue_fini() - that is
redundant and may happen after xef is closed, leading to a
use-after-free
  - Do not special case the first timestamp read: the default LRC sets
CTX_TIMESTAMP to zero, so even the first sample should be a valid
one.
  - Handle the parallel submission case by multiplying the runtime by
width.
v3: Update comments

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
 drivers/gpu/drm/xe/xe_device_types.h |  3 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
 drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
 drivers/gpu/drm/xe/xe_execlist.c |  1 +
 drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
 5 files changed, 44 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
b/drivers/gpu/drm/xe/xe_device_types.h
index 5c5e36de452a..bc97990fd032 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -555,6 +555,9 @@ struct xe_file {
struct mutex lock;
} exec_queue;

+   /** @runtime: hw engine class runtime in ticks for this drm client */
+   u64 runtime[XE_ENGINE_CLASS_MAX];
+
/** @client: drm client */
struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..fa6dc996eca8 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
q->lrc[0].fence_ctx.next_seqno - 1;
 }

+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+   struct xe_file *xef;
+   struct xe_lrc *lrc;
+   u32 old_ts, new_ts;
+
+   /*
+* Jobs that are run during driver load may use an exec_queue, but are
+* not associated with a user xe file, so avoid accumulating busyness
+* for kernel specific work.
+*/
+   if (!q->vm || !q->vm->xef)
+   return;
+
+   xef = q->vm->xef;
+
+   /*
+* Only sample the first LRC. For parallel submission, all of them are
+* scheduled together and we compensate that below by multiplying by
+* width - this may introduce errors if that premise is not true and
+* they don't exit 100% aligned. On the other hand, looping through
+* the LRCs and reading them in different time could also introduce
+* errors.


At the time we're executing this function, those LRCs aren't executing
on the hardware anymore and their timestamps aren't continuing to move,


not necessarily. Besides being called when execution ends, this is also
called when userspace reads the fdinfo file (see the last patch), which
potentially reads this for running LRCs.


right?  I don't see where error could creep in from just looping over
each of them?

I guess parallel submission is mostly just used by media these days,
where the batches submitted in parallel are nearly identical and
expected to run the same amount of time, right?  Do we have any


what I got from Matt Brost is that they are always scheduled together
and will run together on the different instances of that engine class,
regardless of whether they are different. As you said, I'm not sure there's
even a use case for running different batches.  +Matt Brost to confirm
my reasoning below.

Looking at our uapi and some tests in igt: this is controlled by the width
arg when creating the exec queue. Later, when executing, we can only pass
one sync obj to wait for completion. So even if userspace passes different
batch addresses during exec (i.e. different batches), the whole lot will
not complete until everything finishes. I think it's reasonable to
consider everything busy until the group completes.
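
To make that concrete, a minimal userspace sketch of a width-2 parallel
submission with a single out-fence for the whole group is below. The
struct and field names follow my reading of xe_drm.h and the igt tests,
so treat them as approximate rather than as part of this series.

#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/xe_drm.h>

/* Sketch: two batches submitted in one exec ioctl on a width-2 queue.
 * The single sync obj only signals when the whole group is done. */
static void submit_parallel(int fd, uint32_t vm_id,
                            struct drm_xe_engine_class_instance eci[2],
                            uint64_t batch_addr[2],
                            struct drm_xe_sync *out_sync)
{
        struct drm_xe_exec_queue_create create = {
                .vm_id = vm_id,
                .width = 2,                     /* two LRCs, scheduled together */
                .num_placements = 1,
                .instances = (uintptr_t)eci,    /* width * num_placements entries */
        };
        struct drm_xe_exec exec = {
                .num_batch_buffer = 2,            /* must match width */
                .address = (uintptr_t)batch_addr, /* array of batch addresses */
                .num_syncs = 1,
                .syncs = (uintptr_t)out_sync,     /* one fence for the whole group */
        };

        ioctl(fd, DRM_IOCTL_XE_EXEC_QUEUE_CREATE, &create);
        exec.exec_queue_id = create.exec_queue_id;
        ioctl(fd, DRM_IOCTL_XE_EXEC, &exec);
}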


userspace (or potential future userspace) that might submit
heterogeneous batches in parallel, which would make this inaccurate?

I'm not very familiar with the use cases of parallel submission, so I'll
trust that you've got a better understanding of the userspace usage than
I do; everything else here looks fine to me.


I kind of learned this part when doing this code :). It'd be good to
wait for an ack by 

Re: [PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime

2024-05-17 Thread Matt Roper
On Fri, May 17, 2024 at 01:43:07PM -0700, Lucas De Marchi wrote:
> From: Umesh Nerlige Ramappa 
> 
> Add a helper to accumulate per-client runtime of all its
> exec queues. This is called every time a sched job is finished.
> 
> v2:
>   - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
> runtime when job is finished since xe_sched_job_completed() is not a
> notification that job finished.
>   - Stop trying to update runtime from xe_exec_queue_fini() - that is
> redundant and may happen after xef is closed, leading to a
> use-after-free
>   - Do not special case the first timestamp read: the default LRC sets
> CTX_TIMESTAMP to zero, so even the first sample should be a valid
> one.
>   - Handle the parallel submission case by multiplying the runtime by
> width.
> v3: Update comments
> 
> Signed-off-by: Umesh Nerlige Ramappa 
> Signed-off-by: Lucas De Marchi 
> ---
>  drivers/gpu/drm/xe/xe_device_types.h |  3 +++
>  drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
>  drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
>  drivers/gpu/drm/xe/xe_execlist.c |  1 +
>  drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
>  5 files changed, 44 insertions(+)
> 
> diff --git a/drivers/gpu/drm/xe/xe_device_types.h 
> b/drivers/gpu/drm/xe/xe_device_types.h
> index 5c5e36de452a..bc97990fd032 100644
> --- a/drivers/gpu/drm/xe/xe_device_types.h
> +++ b/drivers/gpu/drm/xe/xe_device_types.h
> @@ -555,6 +555,9 @@ struct xe_file {
>   struct mutex lock;
>   } exec_queue;
>  
> + /** @runtime: hw engine class runtime in ticks for this drm client */
> + u64 runtime[XE_ENGINE_CLASS_MAX];
> +
>   /** @client: drm client */
>   struct xe_drm_client *client;
>  };
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c 
> b/drivers/gpu/drm/xe/xe_exec_queue.c
> index 395de93579fa..fa6dc996eca8 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.c
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.c
> @@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
>   q->lrc[0].fence_ctx.next_seqno - 1;
>  }
>  
> +/**
> + * xe_exec_queue_update_runtime() - Update runtime for this exec queue from 
> hw
> + * @q: The exec queue
> + *
> + * Update the timestamp saved by HW for this exec queue and save runtime
> + * calculated by using the delta from last update. On multi-lrc case, only 
> the
> + * first is considered.
> + */
> +void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
> +{
> + struct xe_file *xef;
> + struct xe_lrc *lrc;
> + u32 old_ts, new_ts;
> +
> + /*
> +  * Jobs that are run during driver load may use an exec_queue, but are
> +  * not associated with a user xe file, so avoid accumulating busyness
> +  * for kernel specific work.
> +  */
> + if (!q->vm || !q->vm->xef)
> + return;
> +
> + xef = q->vm->xef;
> +
> + /*
> +  * Only sample the first LRC. For parallel submission, all of them are
> +  * scheduled together and we compensate that below by multiplying by
> +  * width - this may introduce errors if that premise is not true and
> +  * they don't exit 100% aligned. On the other hand, looping through
> +  * the LRCs and reading them in different time could also introduce
> +  * errors.

At the time we're executing this function, those LRCs aren't executing
on the hardware anymore and their timestamps aren't continuing to move,
right?  I don't see where error could creep in from just looping over
each of them?

I guess parallel submission is mostly just used by media these days,
where the batches submitted in parallel are nearly identical and
expected to run the same amount of time, right?  Do we have any
userspace (or potential future userspace) that might submit
heterogeneous batches in parallel, which would make this inaccurate?

I'm not very familiar with the use cases of parallel submission, so I'll
trust that you've got a better understanding of the userspace usage than
I do; everything else here looks fine to me.

Reviewed-by: Matt Roper 


Matt

> +  */
> + lrc = &q->lrc[0];
> + new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
> + xef->runtime[q->class] += (new_ts - old_ts) * q->width;
> +}
> +
>  void xe_exec_queue_kill(struct xe_exec_queue *q)
>  {
>   struct xe_exec_queue *eq = q, *next;
> diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h 
> b/drivers/gpu/drm/xe/xe_exec_queue.h
> index 48f6da53a292..e0f07d28ee1a 100644
> --- a/drivers/gpu/drm/xe/xe_exec_queue.h
> +++ b/drivers/gpu/drm/xe/xe_exec_queue.h
> @@ -75,5 +75,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct 
> xe_exec_queue *e,
>  struct xe_vm *vm);
>  void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
> struct dma_fence *fence);
> +void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
>  
>  #endif
> diff --git a/drivers/gpu/drm/xe/xe_execlist.c 

[PATCH v5 5/8] drm/xe: Add helper to accumulate exec queue runtime

2024-05-17 Thread Lucas De Marchi
From: Umesh Nerlige Ramappa 

Add a helper to accumulate per-client runtime of all its
exec queues. This is called every time a sched job is finished.

v2:
  - Use guc_exec_queue_free_job() and execlist_job_free() to accumulate
runtime when job is finished since xe_sched_job_completed() is not a
notification that job finished.
  - Stop trying to update runtime from xe_exec_queue_fini() - that is
redundant and may happen after xef is closed, leading to a
use-after-free
  - Do not special case the first timestamp read: the default LRC sets
CTX_TIMESTAMP to zero, so even the first sample should be a valid
one.
  - Handle the parallel submission case by multiplying the runtime by
width.
v3: Update comments

Signed-off-by: Umesh Nerlige Ramappa 
Signed-off-by: Lucas De Marchi 
---
 drivers/gpu/drm/xe/xe_device_types.h |  3 +++
 drivers/gpu/drm/xe/xe_exec_queue.c   | 37 
 drivers/gpu/drm/xe/xe_exec_queue.h   |  1 +
 drivers/gpu/drm/xe/xe_execlist.c |  1 +
 drivers/gpu/drm/xe/xe_guc_submit.c   |  2 ++
 5 files changed, 44 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index 5c5e36de452a..bc97990fd032 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -555,6 +555,9 @@ struct xe_file {
struct mutex lock;
} exec_queue;
 
+   /** @runtime: hw engine class runtime in ticks for this drm client */
+   u64 runtime[XE_ENGINE_CLASS_MAX];
+
/** @client: drm client */
struct xe_drm_client *client;
 };
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 395de93579fa..fa6dc996eca8 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -769,6 +769,43 @@ bool xe_exec_queue_is_idle(struct xe_exec_queue *q)
q->lrc[0].fence_ctx.next_seqno - 1;
 }
 
+/**
+ * xe_exec_queue_update_runtime() - Update runtime for this exec queue from hw
+ * @q: The exec queue
+ *
+ * Update the timestamp saved by HW for this exec queue and save runtime
+ * calculated by using the delta from last update. On multi-lrc case, only the
+ * first is considered.
+ */
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q)
+{
+   struct xe_file *xef;
+   struct xe_lrc *lrc;
+   u32 old_ts, new_ts;
+
+   /*
+* Jobs that are run during driver load may use an exec_queue, but are
+* not associated with a user xe file, so avoid accumulating busyness
+* for kernel specific work.
+*/
+   if (!q->vm || !q->vm->xef)
+   return;
+
+   xef = q->vm->xef;
+
+   /*
+* Only sample the first LRC. For parallel submission, all of them are
+* scheduled together and we compensate that below by multiplying by
+* width - this may introduce errors if that premise is not true and
+* they don't exit 100% aligned. On the other hand, looping through
+* the LRCs and reading them in different time could also introduce
+* errors.
+*/
+   lrc = &q->lrc[0];
+   new_ts = xe_lrc_update_timestamp(lrc, &old_ts);
+   xef->runtime[q->class] += (new_ts - old_ts) * q->width;
+}
+
 void xe_exec_queue_kill(struct xe_exec_queue *q)
 {
struct xe_exec_queue *eq = q, *next;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.h b/drivers/gpu/drm/xe/xe_exec_queue.h
index 48f6da53a292..e0f07d28ee1a 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue.h
@@ -75,5 +75,6 @@ struct dma_fence *xe_exec_queue_last_fence_get(struct xe_exec_queue *e,
   struct xe_vm *vm);
 void xe_exec_queue_last_fence_set(struct xe_exec_queue *e, struct xe_vm *vm,
  struct dma_fence *fence);
+void xe_exec_queue_update_runtime(struct xe_exec_queue *q);
 
 #endif
diff --git a/drivers/gpu/drm/xe/xe_execlist.c b/drivers/gpu/drm/xe/xe_execlist.c
index e9dee1e14fef..bd7f27efe0e0 100644
--- a/drivers/gpu/drm/xe/xe_execlist.c
+++ b/drivers/gpu/drm/xe/xe_execlist.c
@@ -306,6 +306,7 @@ static void execlist_job_free(struct drm_sched_job *drm_job)
 {
struct xe_sched_job *job = to_xe_sched_job(drm_job);
 
+   xe_exec_queue_update_runtime(job->q);
xe_sched_job_put(job);
 }
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 4efb88e3e056..ad2b8067d071 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -749,6 +749,8 @@ static void guc_exec_queue_free_job(struct drm_sched_job *drm_job)
 {
struct xe_sched_job *job = to_xe_sched_job(drm_job);
 
+   xe_exec_queue_update_runtime(job->q);
+
trace_xe_sched_job_free(job);
xe_sched_job_put(job);
 }
-- 
2.43.0
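
For completeness: xe_lrc_update_timestamp() is added earlier in this
series and is not quoted above. Inferring only from the call site in
xe_exec_queue_update_runtime(), its contract is assumed to be roughly the
following sketch; the ctx_timestamp field and the register-read helper
are placeholders, not the series' implementation.

/* Assumed contract: return the CTX_TIMESTAMP value just read from the
 * context image and hand back the previously saved sample via @old_ts,
 * so the caller can accumulate the delta. */
u32 xe_lrc_update_timestamp(struct xe_lrc *lrc, u32 *old_ts)
{
        *old_ts = lrc->ctx_timestamp;                     /* previous sample */
        lrc->ctx_timestamp = read_lrc_ctx_timestamp(lrc); /* hypothetical accessor */

        return lrc->ctx_timestamp;
}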