On Mon, 23 Mar 2026 11:38:06 -0700 Matthew Brost <[email protected]> wrote:
>
> Ok, getting stats is easier than I thought...
>
> ./perf stat -a -e
> context-switches,cpu-migrations,task-clock,cycles,instructions
> /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads
> --r threads-basic
>
> This test creates one thread per engine instance (7 instances this BMG
> device) and submits 1k exec IOCTLs per thread, each performing a DW
> write. Each exec IOCTL typically does not have unsignaled input dependencies.
>
> With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
>
>              8,449      context-switches
>                412      cpu-migrations
>           2,531.43 msec task-clock
>      1,847,846,588      cpu_atom/cycles/
>      1,847,856,947      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        460,744,020      cpu_core/instructions/
>
> With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
>
>              8,655      context-switches
>                229      cpu-migrations
>           2,571.33 msec task-clock
>        855,900,607      cpu_atom/cycles/
>        855,900,272      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        403,651,469      cpu_core/instructions/
>
> With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
>
>              5,361      context-switches
>                169      cpu-migrations
>           2,577.44 msec task-clock
>        685,769,153      cpu_atom/cycles/
>        685,768,407      cpu_core/cycles/
>    <not supported>      cpu_atom/instructions/
>        321,336,297      cpu_core/instructions/

Thanks for sharing those numbers. For completeness, can you also add the
"With IRQ putting of jobs on + no bypass" case?

I'm a bit surprised by the difference in the number of context switches,
given I'd expect the local CPU to be picked in priority, and so queuing
work items on the same wq from another work item to be almost free in
terms of scheduling. But I guess there's some load-balancing happening
when you execute jobs at such a high rate.

Also, I don't know if that's just noise or if it's reproducible, but
task-clock seems to be ~40 msec lower with the deferred cleanup and
no bypass (2,531.43 msec vs. 2,571.33 msec with bypass). I suspect
that's higher throughput, because you're not blocking the dequeuing of
the next job on the cleanup of the previous one.
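
To make it easier to check whether those differences hold up across reruns, here is a rough sketch that computes the deltas between configurations. The figures are copied from the perf stat output quoted above; the dictionary keys and the `delta()` helper are just labels made up for illustration, not anything from the driver or from perf.

```python
# perf stat figures copied from the quoted runs; the keys are made-up labels.
RUNS = {
    "irq_put_off_no_bypass": {
        "context_switches": 8449,
        "cpu_migrations": 412,
        "task_clock_msec": 2531.43,
        "cpu_core_cycles": 1_847_856_947,
    },
    "irq_put_off_bypass": {
        "context_switches": 8655,
        "cpu_migrations": 229,
        "task_clock_msec": 2571.33,
        "cpu_core_cycles": 855_900_272,
    },
    "irq_put_on_bypass": {
        "context_switches": 5361,
        "cpu_migrations": 169,
        "task_clock_msec": 2577.44,
        "cpu_core_cycles": 685_768_407,
    },
}


def delta(metric, base, other):
    """Return (absolute, percent) change of `metric` going from `base` to `other`."""
    b = RUNS[base][metric]
    o = RUNS[other][metric]
    return o - b, 100.0 * (o - b) / b


# Print every metric of every run relative to the first configuration.
BASELINE = "irq_put_off_no_bypass"
for name in RUNS:
    if name == BASELINE:
        continue
    for metric in RUNS[BASELINE]:
        abs_d, pct = delta(metric, BASELINE, name)
        print(f"{name:24s} {metric:18s} {abs_d:+18,.2f} ({pct:+.1f}%)")
```

A single run per configuration says nothing about variance, so the same comparison over several runs (perf stat has `-r` for repeats) would be needed before reading much into the smaller deltas.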
