On Mon, 23 Mar 2026 11:38:06 -0700
Matthew Brost <[email protected]> wrote:

> 
> Ok, getting stats is easier than I thought...
> 
> ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
> 
> This test creates one thread per engine instance (7 instances this BMG
> device) and submits 1k exec IOCTLs per thread, each performing a DW
> write. Each exec IOCTL typically does not have unsignaled input dependencies.
> 
> With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
> 
>              8,449      context-switches
>                412      cpu-migrations
>           2,531.43 msec task-clock
>      1,847,846,588      cpu_atom/cycles/
>      1,847,856,947      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        460,744,020      cpu_core/instructions/
> 
> With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> 
>              8,655      context-switches
>                229      cpu-migrations
>           2,571.33 msec task-clock
>        855,900,607      cpu_atom/cycles/
>        855,900,272      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        403,651,469      cpu_core/instructions/
> 
> With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> 
>              5,361      context-switches
>                169      cpu-migrations
>           2,577.44 msec task-clock
>        685,769,153      cpu_atom/cycles/
>        685,768,407      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        321,336,297      cpu_core/instructions/

Thanks for sharing those numbers. For completeness, can you also add the
"With IRQ putting of jobs on + no bypass" case?

I'm a bit surprised by the difference in the number of context switches,
given that I'd expect the local CPU to be picked in priority, so queuing
work items on the same wq from another work item should be almost free
in terms of scheduling. But I guess there's some load-balancing happening
when you execute jobs at such a high rate.

Also, I don't know if that's just noise or if it's reproducible, but
task-clock seems to be ~40msec lower (2,531.43 vs 2,571.33 msec) with the
deferred cleanup and no-bypass (higher throughput because you're not
blocking the dequeuing of the next job on the cleanup of the previous
one, I suspect).
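
To make the comparison easier to eyeball, here is a rough sketch that
computes the deltas against the no-bypass run. The numbers are copied
from the perf output quoted above; the dictionary keys and the helper
names (pct_change, compare) are made up for illustration, and a single
run per configuration says nothing about noise.

```python
# Figures copied from the perf stat output quoted above (cpu_core counters).
# The labels are shorthand chosen for this sketch, not names from the driver.
runs = {
    "put-off, no-bypass": {
        "context_switches": 8449, "cpu_migrations": 412,
        "task_clock_ms": 2531.43,
        "core_cycles": 1_847_856_947, "core_instructions": 460_744_020,
    },
    "put-off, bypass": {
        "context_switches": 8655, "cpu_migrations": 229,
        "task_clock_ms": 2571.33,
        "core_cycles": 855_900_272, "core_instructions": 403_651_469,
    },
    "put-irq, bypass": {
        "context_switches": 5361, "cpu_migrations": 169,
        "task_clock_ms": 2577.44,
        "core_cycles": 685_768_407, "core_instructions": 321_336_297,
    },
}


def pct_change(base, value):
    """Relative change of value against base, in percent."""
    return (value - base) / base * 100.0


def compare(all_runs, baseline):
    """Per-counter percentage change of every run against the baseline run."""
    base = all_runs[baseline]
    return {
        name: {key: pct_change(base[key], val) for key, val in stats.items()}
        for name, stats in all_runs.items()
        if name != baseline
    }


deltas = compare(runs, "put-off, no-bypass")
for name, stats in deltas.items():
    print(name)
    for key, pct in stats.items():
        print(f"  {key:>18}: {pct:+7.2f}%")
```

Repeating each configuration a few times (perf stat -r) would tell
whether the task-clock gap is larger than the run-to-run variance.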
