On Tue, Mar 24, 2026 at 10:23:45AM +0100, Boris Brezillon wrote:
> On Mon, 23 Mar 2026 11:38:06 -0700
> Matthew Brost <[email protected]> wrote:
> 
> > 
> > Ok, getting stats is easier than I thought...
> > 
> > ./perf stat -a -e context-switches,cpu-migrations,task-clock,cycles,instructions \
> >   /home/mbrost/xe/source/drivers.gpu.i915.igt-gpu-tools/build/tests/xe_exec_threads --r threads-basic
> > 
> > This test creates one thread per engine instance (7 instances on this BMG
> > device) and submits 1k exec IOCTLs per thread, each performing a DW
> > write. Each exec IOCTL typically does not have unsignaled input 
> > dependencies.
> > 
> > With IRQ putting of jobs off + no bypass (drm_dep_queue_flags = 0):
> > 
> >              8,449      context-switches
> >                412      cpu-migrations
> >           2,531.43 msec task-clock
> >      1,847,846,588      cpu_atom/cycles/
> >      1,847,856,947      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        460,744,020      cpu_core/instructions/
> > 
> > With IRQ putting of jobs off + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED):
> > 
> >              8,655      context-switches
> >                229      cpu-migrations
> >           2,571.33 msec task-clock
> >        855,900,607      cpu_atom/cycles/
> >        855,900,272      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        403,651,469      cpu_core/instructions/
> > 
> > With IRQ putting of jobs on + bypass (drm_dep_queue_flags =
> > DRM_DEP_QUEUE_FLAGS_BYPASS_SUPPORTED |
> > DRM_DEP_QUEUE_FLAGS_JOB_PUT_IRQ_SAFE):
> > 
> >              5,361      context-switches
> >                169      cpu-migrations
> >           2,577.44 msec task-clock
> >        685,769,153      cpu_atom/cycles/
> >        685,768,407      cpu_core/cycles/
> >    <not supported>      cpu_atom/instructions/
> >        321,336,297      cpu_core/instructions/
> 
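For anyone who wants to compare the three runs above without doing the arithmetic by hand, here is a rough sketch that computes the percentage change of each counter relative to the first run (put-from-IRQ off, no bypass). The values are copied from the perf output quoted above; the dictionary keys and the helper name `relative_change` are made up for illustration and are not part of any tool.

```python
# Counter values copied from the quoted perf stat output (cpu_core PMU only).
# Run labels and the helper below are illustrative names, not from the patch set.
RUNS = {
    "irq_put_off_no_bypass": {
        "context-switches": 8449,
        "cpu-migrations": 412,
        "cpu_core/cycles": 1847856947,
        "cpu_core/instructions": 460744020,
    },
    "irq_put_off_bypass": {
        "context-switches": 8655,
        "cpu-migrations": 229,
        "cpu_core/cycles": 855900272,
        "cpu_core/instructions": 403651469,
    },
    "irq_put_on_bypass": {
        "context-switches": 5361,
        "cpu-migrations": 169,
        "cpu_core/cycles": 685768407,
        "cpu_core/instructions": 321336297,
    },
}


def relative_change(baseline, other):
    """Return the percentage change of each counter in `other` vs `baseline`."""
    return {
        name: 100.0 * (other[name] - baseline[name]) / baseline[name]
        for name in baseline
    }


if __name__ == "__main__":
    base = RUNS["irq_put_off_no_bypass"]
    for label, counters in RUNS.items():
        if counters is base:
            continue
        print(label)
        for name, pct in relative_change(base, counters).items():
            print(f"  {name:24s} {pct:+7.1f}%")
```

These are single runs, so the percentages only indicate direction and rough magnitude, not a statistically meaningful result.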
> Thanks for sharing those numbers. For completeness, can you also add the
> "With IRQ putting of jobs on + no bypass" case?
> 

Yes. I will also share a DRM sched baseline, and I figured out that
power can be measured as well. Initial results confirm what I expected:
lower power.

I'm putting together a doc based on running glxgears and another
benchmark on top of Ubuntu 24.10 + Wayland, which has explicit sync
(linux-drm-syncobj; it behaves like SurfaceFlinger when rendering, i.e.
it does not pass in fences to draw jobs).

Almost have all the data. Will share here once I have it.

> I'm a bit surprised by the difference in number of context switches
> given I'd expect the local-CPU to be picked in priority, and so queuing
> work items on the same wq from another work item to be almost free in
> terms of scheduling. But I guess there's some load-balancing happening
> when you execute jobs at such a high rate.
> 
> Also, I don't know if that's just noise or if it's reproducible, but
> task-clock seems to be ~40msec lower with the deferred cleanup and
> no-bypass (higher throughput because you're not blocking the dequeuing
> of the next job on the cleanup of the previous one, I suspect).

I think that is just noise from what the test is doing in user space;
that bounces around a bit.
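
To put a number on the task-clock gap being discussed, a quick sketch using the three task-clock values from the perf output quoted earlier in this thread (the variable and function names are mine, purely illustrative):

```python
# task-clock values in msec, copied from the quoted perf stat output.
TASK_CLOCK_MSEC = {
    "irq_put_off_no_bypass": 2531.43,
    "irq_put_off_bypass": 2571.33,
    "irq_put_on_bypass": 2577.44,
}


def task_clock_delta(baseline, other):
    """Return (absolute delta in msec, delta as a percentage of baseline)."""
    base = TASK_CLOCK_MSEC[baseline]
    delta = TASK_CLOCK_MSEC[other] - base
    return delta, 100.0 * delta / base


if __name__ == "__main__":
    for label in ("irq_put_off_bypass", "irq_put_on_bypass"):
        delta, pct = task_clock_delta("irq_put_off_no_bypass", label)
        print(f"{label}: {delta:+.2f} msec ({pct:+.2f}%)")
```

The spread between runs is under 2% of the total, which is small enough that it could plausibly be run-to-run variation; repeating the measurement (for example with perf stat's -r option) would be needed to tell.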

Matt

> 
