On Thu, Oct 6, 2022 at 1:25 PM Dave Airlie <airl...@gmail.com> wrote:
>
> [ 1234.778760] BUG: kernel NULL pointer dereference, address: 0000000000000088
> [ 1234.778813] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x140 [gpu_sched]
As far as I can tell, that's the line

        struct drm_gpu_scheduler *sched = s_fence->sched;

where 's_fence' is NULL. The code is

   0:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   5:   41 54                   push   %r12
   7:   55                      push   %rbp
   8:   53                      push   %rbx
   9:   48 89 fb                mov    %rdi,%rbx
   c:*  48 8b af 88 00 00 00    mov    0x88(%rdi),%rbp          <-- trapping instruction
  13:   f0 ff 8d f0 00 00 00    lock decl 0xf0(%rbp)
  1a:   48 8b 85 80 01 00 00    mov    0x180(%rbp),%rax

and that next 'lock decl' instruction would have been the

        atomic_dec(&sched->hw_rq_count);

at the top of drm_sched_job_done().

Now, as to *why* you'd have a NULL s_fence, it would seem that
drm_sched_job_cleanup() was called with an active job. Looking at that
code, it does

        if (kref_read(&job->s_fence->finished.refcount)) {
                /* drm_sched_job_arm() has been called */
                dma_fence_put(&job->s_fence->finished);
        ...

but then it does

        job->s_fence = NULL;

anyway, despite the job still being active.

The logic of that kind of "fake refcount" escapes me. The above looks
fundamentally racy, not to say pointless and wrong (a refcount is a
_count_, not a flag, so there could be multiple references to it, and
what says that you can just decrement one of them and say "I'm done"?).

Now, _why_ any of that happens, I have no idea. I'm just looking at the
immediate "that pointer is NULL" thing, and reacting to what looks like
a completely bogus refcount pattern.

But that odd refcount pattern isn't new, so it's presumably some user of
it on the amd gpu side that changed.

The problem hasn't happened again for me, but that's not saying a lot,
since it was very random to begin with.
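To spell out the interleaving I'm worried about, here's a rough sketch
of how that would end up dereferencing a NULL s_fence (illustrative
only - the exact call chains on the driver side are a guess, not
something I've traced):

        CPU A (cleanup path)                      CPU B (hw fence signals)
        --------------------                      ------------------------
        drm_sched_job_cleanup(job)
          sees finished.refcount != 0
          dma_fence_put(&job->s_fence->finished)
          job->s_fence = NULL
                                                  drm_sched_job_done()
                                                    loads s_fence, which is now NULL
                                                    s_fence->sched    <-- oops at offset 0x88
                                                    atomic_dec(&sched->hw_rq_count) never runs

              Linus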