amdgpu: Refactor GPU reset for XGMI hive case.

Christian König Mon, 26 Nov 2018 11:34:21 -0800

What I mean is: should we get rid of the dma_fence_add/remove_callback logic in drm_sched_job_timedout and do it in each driver instead, between
scheduler deactivation and reactivation?

Yes, exactly. That's the reason why I already have a revert for the patch, removing the dance from drm_sched_job_timedout again.


Christian.


On 26.11.18 at 20:28, Grodzovsky, Andrey wrote:

Actually, after looking again at drm_sched_job_timedout, from which amdgpu_device_gpu_recover will be called, I see that we already disconnect all the pending scheduler fences from the HW fence, including the guilty job. I also see that in drm_sched_job_timedout job_list_lock is released before calling sched->ops->timedout_job and then reacquired afterwards, so new jobs can slip into ring_mirror_list in between.
Also, I will end up going over the ring_mirror_list twice, once from amdgpu_device_post_asic_reset and later from drm_sched_job_timedout. This might cause double fence processing.
Isn't it more correct to only disconnect from the HW fence after the schedulers have been stopped, and to connect back before we restart the schedulers (as you pointed out here before)?
What I mean is: should we get rid of the dma_fence_add/remove_callback logic in drm_sched_job_timedout and do it in each driver instead, between
scheduler deactivation and reactivation?

Andrey
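
For readers following the thread, the ordering proposed in this message can be sketched as a small user-space model. This is only an illustration under stated assumptions: every name in it (model_sched, model_job, model_detach_all, model_attach_pending, model_recover) is hypothetical and is not a DRM scheduler or amdgpu API. It shows the callbacks being disconnected only once all schedulers are stopped, and reconnected before any scheduler restarts, so each job list is walked exactly once per direction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; these are NOT kernel structures or APIs. */
struct model_job {
    bool hw_signaled;  /* the hardware fence of this job has signaled */
    bool cb_attached;  /* scheduler fence is hooked to the hardware fence */
};

struct model_sched {
    bool running;            /* scheduler may push jobs to the hardware */
    struct model_job *jobs;  /* stands in for ring_mirror_list */
    size_t njobs;
    unsigned int walks;      /* how many times the job list was traversed */
};

/* Disconnect every job from its hardware fence; only legal while stopped. */
static void model_detach_all(struct model_sched *s)
{
    assert(!s->running);
    for (size_t i = 0; i < s->njobs; i++)
        s->jobs[i].cb_attached = false;
    s->walks++;
}

/* Reconnect jobs that are still pending; only legal while stopped. */
static void model_attach_pending(struct model_sched *s)
{
    assert(!s->running);
    for (size_t i = 0; i < s->njobs; i++)
        s->jobs[i].cb_attached = !s->jobs[i].hw_signaled;
    s->walks++;
}

/* Driver-side recovery: stop, detach, reset, attach, restart. */
static void model_recover(struct model_sched *scheds, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        scheds[i].running = false;         /* 1. stop all schedulers */
    for (i = 0; i < n; i++)
        model_detach_all(&scheds[i]);      /* 2. disconnect from HW fences */

    /* 3. the actual ASIC reset would happen here (omitted in this model) */

    for (i = 0; i < n; i++)
        model_attach_pending(&scheds[i]);  /* 4. reconnect pending jobs */
    for (i = 0; i < n; i++)
        scheds[i].running = true;          /* 5. restart all schedulers */
}
```

Because no scheduler is running between steps 1 and 5, no new job can slip into the list while the callbacks are detached, which is the race described in the message.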


On 11/22/2018 02:56 PM, Grodzovsky, Andrey wrote:
In addition to that, I would try to improve the pre, middle, post handling
towards checking if we made some progress in between.

In other words we stop all schedulers in the pre handling and
disconnect the scheduler fences from the hardware fence like I did in
patch "drm/sched: fix timeout handling v2".

Then before we do the actual reset in the middle handling we check if
the offending job has completed or at least made some progress in the
meantime.

I understand how to check if the job completed (its fence has already
signaled), but how do I test whether the job made 'at least some progress'?

Good question. Maybe we can somehow query from the hardware the number
of primitives or pixels processed so far and then compare after a moment?

I will check on this later. In the meantime I will update the code
with the proposed per-hive locking, and I will add a check for whether the
guilty job completed before the ASIC reset, skipping the reset if it did.

Andrey
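
The "skip the reset if the guilty job already completed" check mentioned in this message could look roughly like the following user-space sketch. All names (model_fence, model_guilty_job, model_decide_reset) are hypothetical stand-ins, not kernel types, and the "made some progress" test is deliberately left out because the thread does not settle how to measure it.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for a hardware fence and the offending job. */
struct model_fence {
    bool signaled;  /* true once the hardware finished the work */
};

struct model_guilty_job {
    struct model_fence hw_fence;
};

enum model_reset_action {
    MODEL_SKIP_RESET,  /* job finished in the meantime, nothing to recover */
    MODEL_DO_RESET     /* job is still pending, go ahead with the reset */
};

/*
 * Decide in the "middle" handling whether the ASIC reset is still needed.
 * Assumes the schedulers were already stopped in the "pre" handling, so
 * the fence state cannot be changed by newly submitted work.
 */
static enum model_reset_action
model_decide_reset(const struct model_guilty_job *job)
{
    if (job && job->hw_fence.signaled)
        return MODEL_SKIP_RESET;

    /* No identified job, or the job is still stuck: reset as before. */
    return MODEL_DO_RESET;
}
```

A caller would run this after stopping the schedulers and go straight to the post handling when it returns MODEL_SKIP_RESET.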
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 5/5] drm/amdgpu: Refactor GPU reset for XGMI hive case.
