Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
[SNIP] Maybe just empirically - let's try it and see under different test scenarios what actually happens? Not a good idea in general, we have that approach way too often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other devices, not the one being extracted; for the extracted one you still need to do all the HW fence signalling dance because the HW is gone and we block any TDRs (which won't help anyway). Andrey Do you agree to the above? Yeah, I think that is correct. But on the other hand what Daniel reminded me of is that the handling needs to be consistent over different devices. And since some devices already go with the approach of canceling everything we simply have to go down that route as well. Christian. What does it mean in our context? What needs to be done which we are not doing now? I think we are fine, we just need to continue with the approach of forcefully signaling all fences on hotplug. Christian. Andrey Andrey Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections, I guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device? I don't see why it has to be global and not a per-device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian.

/* Past this point no more fences are submitted to the HW rings and hence we can safely call force signal on all that are currently there.
 * Any subsequently created HW fences will be returned signaled with an error code right away */

for_each_ring(adev)
    amdgpu_fence_process(ring)

drm_dev_unplug(dev);
Stop schedulers
cancel_sync(all timers and queued works);
hw_fini
unmap_mmio
}

Andrey

Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it's not clear to me why we are focusing on the scheduler threads; any code path that can generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account, such as direct IB submissions, VM flushes, etc. You need to work together with the reset lock anyway, because a hotplug could run at the same time as a reset. If we go my way, I now see that I have to take the reset write side lock during HW fence signalling in order to protect against scheduler/HW fence detachment and reattachment during the schedulers stop/restart.
But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway - https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1 Yes, good point as well. Christian. Andrey Christian. Andrey Christian. Andrey Andrey
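A minimal sketch of the drm_dev_enter()/drm_dev_exit() scoping mentioned above, applied to a job-timeout handler. The function name and body are placeholders, not the actual amdgpu_job_timedout(); only the enter/exit pattern (from <drm/drm_drv.h>) is the point.

static void example_job_timedout(struct drm_device *ddev)
{
        int idx;

        if (!drm_dev_enter(ddev, &idx))
                return;  /* device already unplugged, skip TDR/GPU recovery */

        /* ... the normal timeout handling / GPU recovery path would run here ... */

        drm_dev_exit(idx);
}

Once drm_dev_unplug() has been called, drm_dev_enter() returns false and no new reset can start from the timeout path.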
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-15 3:02 a.m., Christian König wrote: Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky: On 2021-04-14 10:58 a.m., Christian König wrote: Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky: [SNIP] We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. Missed that. Ok, can I just use non sleeping RCU with a flag around drm_sched_entity_push_job at the amdgpu level (only 2 functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush and block in flight and future submissions to entity queue ? Double checking the code I think we can use the notifier_lock for this. E.g. in amdgpu_cs.c see where we have the goto error_abort. That is the place where such a check could be added without any additional overhead. Sure, I will just have to add this lock to amdgpu_job_submit too. Not ideal, but I think that's fine with me. You might want to rename the lock for this thought. [SNIP] Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other devises, not the one being extracted, for him you still need to do all the HW fence signalling dance because the HW is gone and we block any TDRs (which won't help anyway). Andrey Do you agree to the above ? Yeah, I think that is correct. But on the other hand what Daniel reminded me of is that the handling needs to be consistent over different devices. And since some device already go with the approach of canceling everything we simply have to go down that route as well. Christian. What does it mean in our context ? What needs to be done which we are not doing now ? Andrey Andrey Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. 
* Any subsequently created HW fences will be returned signaled with an error code right away */

for_each_ring(adev)
    amdgpu_fence_process(ring)

drm_dev_unplug(dev);
Stop schedulers
cancel_sync(all timers and queued works);
hw_fini
unmap_mmio
}

Andrey

Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it's not clear to me why we are focusing on the scheduler threads; any code path that can generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account, such as direct IB submissions, VM flushes, etc. You need to work together with the reset lock anyway, because a hotplug could run at the same time as a reset. If we go my way, I now see that I have to take the reset write side lock during HW fence signalling in order to protect against scheduler/HW fence detachment and reattachment during the schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway -
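One possible C rendering of the removal sequence sketched above, not the actual amdgpu_pci_remove(). The interrupt-disable, timer/work cancel, hw_fini and MMIO-unmap steps are placeholders, and adev->reset_sem is assumed to be the GPU reset rw-semaphore; drm_sched_stop(), amdgpu_fence_process() and drm_dev_unplug() are the real interfaces being referred to.

static void example_pci_remove_flow(struct amdgpu_device *adev)
{
        int i;

        example_disable_irqs(adev);                 /* placeholder: flush HW interrupt handlers */

        down_write(&adev->reset_sem);               /* assumed GPU reset rw-sem, blocks concurrent resets */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (ring && ring->sched.thread)
                        drm_sched_stop(&ring->sched, NULL);
        }

        /* Nothing can emit fences any more; force-signal whatever is left on the rings. */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                if (adev->rings[i])
                        amdgpu_fence_process(adev->rings[i]);
        }

        drm_dev_unplug(adev_to_drm(adev));          /* waits for in-flight drm_dev_enter() sections */
        up_write(&adev->reset_sem);

        example_cancel_timers_and_work(adev);       /* placeholder */
        example_hw_fini(adev);                      /* placeholder */
        example_unmap_mmio(adev);                   /* placeholder */
}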
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky: On 2021-04-14 10:58 a.m., Christian König wrote: Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky: [SNIP] We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. Missed that. Ok, can I just use non sleeping RCU with a flag around drm_sched_entity_push_job at the amdgpu level (only 2 functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush and block in flight and future submissions to entity queue ? Double checking the code I think we can use the notifier_lock for this. E.g. in amdgpu_cs.c see where we have the goto error_abort. That is the place where such a check could be added without any additional overhead. Sure, I will just have to add this lock to amdgpu_job_submit too. Not ideal, but I think that's fine with me. You might want to rename the lock for this thought. [SNIP] Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other devises, not the one being extracted, for him you still need to do all the HW fence signalling dance because the HW is gone and we block any TDRs (which won't help anyway). Andrey Do you agree to the above ? Yeah, I think that is correct. But on the other hand what Daniel reminded me of is that the handling needs to be consistent over different devices. And since some device already go with the approach of canceling everything we simply have to go down that route as well. Christian. Andrey Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. 
* Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway -
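A sketch of the notifier_lock idea discussed earlier in this message (the check near the goto error_abort path in amdgpu_cs.c): reuse the lock that already serializes fence emission and refuse new pushes once a teardown flag is set. The shutting_down flag is hypothetical, not an existing amdgpu field, and the helper itself is illustrative.

static int example_submit_check(struct amdgpu_device *adev,
                                struct drm_sched_entity *entity,
                                struct drm_sched_job *job)
{
        int r = 0;

        mutex_lock(&adev->notifier_lock);
        if (adev->shutting_down)                    /* hypothetical "teardown started" flag */
                r = -ENODEV;
        else
                drm_sched_entity_push_job(job, entity);
        mutex_unlock(&adev->notifier_lock);

        return r;
}

The same check would have to go into both amdgpu_cs_submit() and amdgpu_job_submit(), matching the discussion above.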
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-14 10:58 a.m., Christian König wrote: Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky: [SNIP] We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. Missed that. Ok, can I just use non sleeping RCU with a flag around drm_sched_entity_push_job at the amdgpu level (only 2 functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush and block in flight and future submissions to entity queue ? Double checking the code I think we can use the notifier_lock for this. E.g. in amdgpu_cs.c see where we have the goto error_abort. That is the place where such a check could be added without any additional overhead. Sure, I will just have to add this lock to amdgpu_job_submit too. Christian. For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other devises, not the one being extracted, for him you still need to do all the HW fence signalling dance because the HW is gone and we block any TDRs (which won't help anyway). Andrey Do you agree to the above ? Andrey Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. 
* Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky: [SNIP] We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. Missed that. Ok, can I just use non sleeping RCU with a flag around drm_sched_entity_push_job at the amdgpu level (only 2 functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush and block in flight and future submissions to entity queue ? Double checking the code I think we can use the notifier_lock for this. E.g. in amdgpu_cs.c see where we have the goto error_abort. That is the place where such a check could be added without any additional overhead. Christian. For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other devises, not the one being extracted, for him you still need to do all the HW fence signalling dance because the HW is gone and we block any TDRs (which won't help anyway). Andrey Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. 
* Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway -
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-14 3:01 a.m., Christian König wrote: Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky: On 2021-04-13 2:25 p.m., Christian König wrote: Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-13 2:03 p.m., Christian König wrote: Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For scheduler fences I am not worried, for the sched_fence->finished fence they are by definition attached to HW fences which already signaledfor sched_fence->scheduled we should run drm_sched_entity_kill_jobs for each entity after stopping the scheduler threads and before setting drm_dev_unplug. Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() assumes that there are no more jobs pushed into the entity. We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. Missed that. 
Ok, can I just use non sleeping RCU with a flag around drm_sched_entity_push_job at the amdgpu level (only 2 functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush and block in flight and future submissions to entity queue ? For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. To my understanding this is true for other
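A sketch of the "non-sleeping RCU with a flag" idea raised earlier in this message. The stopping flag is hypothetical; the teardown side would be expected to set it and call synchronize_rcu() before killing the entity jobs, so that no push can slip in afterwards.

static int example_push_job_rcu(struct amdgpu_device *adev,
                                struct drm_sched_entity *entity,
                                struct drm_sched_job *job)
{
        int r = 0;

        rcu_read_lock();
        if (READ_ONCE(adev->stopping))              /* hypothetical flag set on teardown */
                r = -ENODEV;
        else
                drm_sched_entity_push_job(job, entity);
        rcu_read_unlock();

        return r;
}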
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky: On 2021-04-13 2:25 p.m., Christian König wrote: Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-13 2:03 p.m., Christian König wrote: Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For scheduler fences I am not worried, for the sched_fence->finished fence they are by definition attached to HW fences which already signaledfor sched_fence->scheduled we should run drm_sched_entity_kill_jobs for each entity after stopping the scheduler threads and before setting drm_dev_unplug. Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() assumes that there are no more jobs pushed into the entity. We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem, we already handle that correct. The problem is that the entities might continue feeding stuff into the scheduler. For waiting for other device I have no idea if that couldn't deadlock somehow. 
Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Not a good idea in general, we have that approach way to often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in his reply that waiting for a fence while holding the SRCU is expected to work. So let's stick with the approach of high level locking for hotplug. Christian. Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers
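For reference, the "high level locking" pattern mentioned above boils down to something like this: the IOCTL takes the unplug SRCU read side via drm_dev_enter() and may still block on a dma_fence inside it, since drm_dev_unplug() only waits for such sections to finish rather than aborting them. The ioctl body is a placeholder.

static int example_ioctl(struct drm_device *dev, struct dma_fence *fence)
{
        signed long r;
        int idx;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        /* Blocking on a fence inside the SRCU read section is allowed. */
        r = dma_fence_wait_timeout(fence, true, msecs_to_jiffies(1000));

        drm_dev_exit(idx);

        return r < 0 ? r : 0;
}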
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On Tue, Apr 13, 2021 at 11:13 AM Li, Dennis wrote: > > [AMD Official Use Only - Internal Distribution Only] > > Hi, Christian and Andrey, > We maybe try to implement "wait" callback function of dma_fence_ops, > when GPU reset or unplug happen, make this callback return - ENODEV, to > notify the caller device lost. > > * Must return -ERESTARTSYS if the wait is intr = true and the wait > was > * interrupted, and remaining jiffies if fence has signaled, or 0 if > wait > * timed out. Can also return other error values on custom > implementations, > * which should be treated as if the fence is signaled. For example a > hardware > * lockup could be reported like that. > * > * This callback is optional. > */ > signed long (*wait)(struct dma_fence *fence, > bool intr, signed long timeout); Uh this callback is for old horros like unreliable irq delivery on radeon. Please don't use it for anything, if we need to make fences bail out on error then we need something that works for all fences. Also TDR should recovery you here already and make sure the dma_fence_wait() is bound in time. -Daniel > > Best Regards > Dennis Li > -Original Message- > From: Christian König > Sent: Tuesday, April 13, 2021 3:10 PM > To: Grodzovsky, Andrey ; Koenig, Christian > ; Li, Dennis ; > amd-gfx@lists.freedesktop.org; Deucher, Alexander > ; Kuehling, Felix ; Zhang, > Hawking ; Daniel Vetter > Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability > > Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: > > > > On 2021-04-12 3:18 p.m., Christian König wrote: > >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: > >>> [SNIP] > >>>>> > >>>>> So what's the right approach ? How we guarantee that when running > >>>>> amdgpu_fence_driver_force_completion we will signal all the HW > >>>>> fences and not racing against some more fences insertion into that > >>>>> array ? > >>>>> > >>>> > >>>> Well I would still say the best approach would be to insert this > >>>> between the front end and the backend and not rely on signaling > >>>> fences while holding the device srcu. > >>> > >>> > >>> My question is, even now, when we run > >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or > >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, > >>> what there prevents a race with another fence being at the same time > >>> emitted and inserted into the fence array ? Looks like nothing. > >>> > >> > >> Each ring can only be used by one thread at the same time, this > >> includes emitting fences as well as other stuff. > >> > >> During GPU reset we make sure nobody writes to the rings by stopping > >> the scheduler and taking the GPU reset lock (so that nobody else can > >> start the scheduler again). > > > > > > What about direct submissions not through scheduler - > > amdgpu_job_submit_direct, I don't see how this is protected. > > Those only happen during startup and GPU reset. > > >> > >>>> > >>>> BTW: Could it be that the device SRCU protects more than one device > >>>> and we deadlock because of this? > >>> > >>> > >>> I haven't actually experienced any deadlock until now but, yes, > >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the > >>> presence of multiple devices from same or different drivers we in > >>> fact are dependent on all their critical sections i guess. > >>> > >> > >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually > >> need to sync that up with Daniel and the rest of the i915 guys. 
> >> > >> IIRC we could actually have an amdgpu device in a docking station > >> which needs hotplug and the driver might depend on waiting for the > >> i915 driver as well. > > > > > > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I > > don't see why it has to be global and not per device thing. > > I'm really wondering the same thing for quite a while now. > > Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. > > Regards, > Christian. > > > > > Andrey > > > > > >>
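Purely to make the .wait callback proposal quoted above concrete (and with Daniel's caveat that this callback is legacy and should not be used for this), a custom wait op returning -ENODEV on device loss would look roughly like the following. The example_fence wrapper and its device_gone flag are hypothetical.

struct example_fence {
        struct dma_fence base;
        bool device_gone;                 /* hypothetical "device lost" flag */
};

static signed long example_fence_wait(struct dma_fence *f, bool intr,
                                      signed long timeout)
{
        struct example_fence *efence = container_of(f, struct example_fence, base);

        if (READ_ONCE(efence->device_gone))
                return -ENODEV;           /* report device lost to the waiter */

        return dma_fence_default_wait(f, intr, timeout);
}

static const struct dma_fence_ops example_fence_ops = {
        /* .get_driver_name, .get_timeline_name, etc. omitted */
        .wait = example_fence_wait,
};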
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On Tue, Apr 13, 2021 at 9:10 AM Christian König wrote: > > Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: > > > > On 2021-04-12 3:18 p.m., Christian König wrote: > >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: > >>> [SNIP] > > > > So what's the right approach ? How we guarantee that when running > > amdgpu_fence_driver_force_completion we will signal all the HW > > fences and not racing against some more fences insertion into that > > array ? > > > > Well I would still say the best approach would be to insert this > between the front end and the backend and not rely on signaling > fences while holding the device srcu. > >>> > >>> > >>> My question is, even now, when we run > >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or > >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, > >>> what there prevents a race with another fence being at the same time > >>> emitted and inserted into the fence array ? Looks like nothing. > >>> > >> > >> Each ring can only be used by one thread at the same time, this > >> includes emitting fences as well as other stuff. > >> > >> During GPU reset we make sure nobody writes to the rings by stopping > >> the scheduler and taking the GPU reset lock (so that nobody else can > >> start the scheduler again). > > > > > > What about direct submissions not through scheduler - > > amdgpu_job_submit_direct, I don't see how this is protected. > > Those only happen during startup and GPU reset. > > >> > > BTW: Could it be that the device SRCU protects more than one device > and we deadlock because of this? > >>> > >>> > >>> I haven't actually experienced any deadlock until now but, yes, > >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the > >>> presence of multiple devices from same or different drivers we in > >>> fact are dependent on all their critical sections i guess. > >>> > >> > >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually > >> need to sync that up with Daniel and the rest of the i915 guys. > >> > >> IIRC we could actually have an amdgpu device in a docking station > >> which needs hotplug and the driver might depend on waiting for the > >> i915 driver as well. > > > > > > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I > > don't see why it has to be global and not per device thing. > > I'm really wondering the same thing for quite a while now. > > Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. SRCU isn't exactly the cheapest thing, but aside from that we could make it per-device. I'm not seeing the point much since if you do end up being stuck on an ioctl this might happen with anything really. Also note that dma_fence_waits are supposed to be time bound, so you shouldn't end up waiting on them forever. It should all get sorted out in due time with TDR I hope (e.g. if i915 is stuck on a fence because you're unlucky). -Daniel > > Regards, > Christian. > > > > > Andrey > > > > > >> > >> Christian. > >> > >>> Andrey > >>> > >>> > > Christian. > > > Andrey > > > > > >> > >>> Andrey > >>> > >>> > > Christian. > > > /* Past this point no more fence are submitted to HW ring > > and hence we can safely call force signal on all that are > > currently there. 
> > * Any subsequently created HW fences will be returned > > signaled with an error code right away > > */ > > > > for_each_ring(adev) > > amdgpu_fence_process(ring) > > > > drm_dev_unplug(dev); > > Stop schedulers > > cancel_sync(all timers and queued works); > > hw_fini > > unmap_mmio > > > > } > > > > > > Andrey > > > > > >> > >> > >>> > >> > >> Alternatively grabbing the reset write side and stopping > >> and then restarting the scheduler could work as well. > >> > >> Christian. > > > > > > I didn't get the above and I don't see why I need to reuse > > the GPU reset rw_lock. I rely on the SRCU unplug flag for > > unplug. Also, not clear to me why are we focusing on the > > scheduler threads, any code patch to generate HW fences > > should be covered, so any code leading to > > amdgpu_fence_emit needs to be taken into account such as, > > direct IB submissions, VM flushes e.t.c > > You need to work together with the reset lock anyway, cause > a hotplug could run at the same time as a reset. > >>> > >>> > >>> For going my way indeed now I see now that I have to take > >>> reset write side lock during HW
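For context on why the SRCU couples devices: drm_drv.c keeps a single static SRCU for all DRM devices. Roughly (simplified, not a verbatim quote of the current code):

DEFINE_STATIC_SRCU(drm_unplug_srcu);

bool drm_dev_enter(struct drm_device *dev, int *idx)
{
        *idx = srcu_read_lock(&drm_unplug_srcu);

        if (dev->unplugged) {
                srcu_read_unlock(&drm_unplug_srcu, *idx);
                return false;
        }

        return true;
}

void drm_dev_exit(int idx)
{
        srcu_read_unlock(&drm_unplug_srcu, idx);
}

void drm_dev_unplug(struct drm_device *dev)
{
        dev->unplugged = true;
        /*
         * One global SRCU: this waits for read-side sections entered on
         * behalf of *every* DRM device, not just the one being unplugged.
         */
        synchronize_srcu(&drm_unplug_srcu);
        /* ... unregister and drop the final reference ... */
}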
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-13 2:25 p.m., Christian König wrote: Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-13 2:03 p.m., Christian König wrote: Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For scheduler fences I am not worried, for the sched_fence->finished fence they are by definition attached to HW fences which already signaledfor sched_fence->scheduled we should run drm_sched_entity_kill_jobs for each entity after stopping the scheduler threads and before setting drm_dev_unplug. Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() assumes that there are no more jobs pushed into the entity. We are racing here once more and need to handle that. But why, I wrote above that we first stop the all schedulers, then only call drm_sched_entity_kill_jobs. For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... 
So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Maybe just empirically - let's try it and see under different test scenarios what actually happens ? Andrey Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-13 2:03 p.m., Christian König wrote: Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For scheduler fences I am not worried, for the sched_fence->finished fence they are by definition attached to HW fences which already signaledfor sched_fence->scheduled we should run drm_sched_entity_kill_jobs for each entity after stopping the scheduler threads and before setting drm_dev_unplug. Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() assumes that there are no more jobs pushed into the entity. We are racing here once more and need to handle that. For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Good question. As you said that is really the hard path. 
Handling it all at once at IOCTL level certainly has some appeal as well, but I have no idea if we can guarantee that this is lock free. Christian. Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-13 2:03 p.m., Christian König wrote: Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For scheduler fences I am not worried, for the sched_fence->finished fence they are by definition attached to HW fences which already signaled,for sched_fence->scheduled we should run drm_sched_entity_kill_jobs for each entity after stopping the scheduler threads and before setting drm_dev_unplug. For waiting for other device I have no idea if that couldn't deadlock somehow. Yea, not sure for imported fences and dma_bufs, I would assume the other devices should not be impacted by our device removal but, who knows... So I guess we are NOT going with finalizing HW fences before drm_dev_unplug and instead will just call drm_dev_enter/exit at the back-ends all over the place where there are MMIO accesses ? Andrey Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? 
I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky: On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? drm_dev_unplug() will wait for the IOCTLs to finish. The IOCTLs in turn can wait for fences. That can be both hardware fences, scheduler fences, as well as fences from other devices (and KIQ fences for register writes under SRIOV, but we can hopefully ignore them for now). We have handled the hardware fences, but we have no idea when the scheduler fences or the fences from other devices will signal. Scheduler fences won't signal until the scheduler threads are restarted or we somehow cancel the submissions. Doable, but tricky as well. For waiting for other device I have no idea if that couldn't deadlock somehow. Regards, Christian. Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. 
Andrey Christian. Andrey Andrey Christian. /* Past this point no more fences are submitted to the HW ring and hence we can safely force-signal all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively, grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it is not clear to me why we are focusing on the scheduler threads; any code path that generates HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account, such as direct IB submissions, VM flushes, etc. You need to work together with the reset lock anyway, because a hotplug could run at the same time as a reset. Going my way, I indeed see now that I have to take the reset write-side lock during HW fence signalling in order to protect against scheduler/HW fence detachment and reattachment during scheduler stop/restart.
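For reference, here is a minimal sketch, in kernel C, of the removal ordering being debated above. Only disable_irq(), amdgpu_fence_process() and drm_dev_unplug() are real calls; drm_to_adev() and the reset_sem/rings fields match the amdgpu driver of that era, and stop_all_schedulers()/cancel_all_timers_and_works() are placeholders for steps the thread only names informally:

    /* Hypothetical amdgpu_pci_remove() ordering as proposed in the thread. */
    static void amdgpu_pci_remove_sketch(struct pci_dev *pdev)
    {
            struct drm_device *dev = pci_get_drvdata(pdev);
            struct amdgpu_device *adev = drm_to_adev(dev);
            int i;

            /* 1) Stop and flush all in-flight HW interrupt handlers. */
            disable_irq(pdev->irq);

            /* 2) Block new fence emission: take the reset write side and park
             *    the scheduler threads so nobody touches the rings anymore. */
            down_write(&adev->reset_sem);
            stop_all_schedulers(adev);                /* placeholder */

            /* 3) The fence array is now stable; force-signal what is left. */
            for (i = 0; i < adev->num_rings; i++)
                    if (adev->rings[i])
                            amdgpu_fence_process(adev->rings[i]);

            /* 4) Flush in-flight IOCTLs, then tear the rest down. */
            drm_dev_unplug(dev);
            up_write(&adev->reset_sem);
            cancel_all_timers_and_works(adev);        /* placeholder */
            /* hw_fini and MMIO unmap follow here. */
    }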
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-13 3:10 a.m., Christian König wrote: Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. Ok, but then looks like I am missing something, see the following steps in amdgpu_pci_remove - 1) Use disable_irq API function to stop and flush all in flight HW interrupts handlers 2) Grab the reset lock and stop all the schedulers After above 2 steps the HW fences array is idle, no more insertions and no more extractions from the array 3) Run one time amdgpu_fence_process to signal all current HW fences 4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the GPU reset lock and go on with the rest of the sequence (cancel timers, work items e.t.c) What's problematic in this sequence ? Andrey BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it is not clear to me why we are focusing on the scheduler threads; any code path that generates HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account, such as direct IB submissions, VM flushes, etc. You need to work together with the reset lock anyway, because a hotplug could run at the same time as a reset. Going my way, I indeed see now that I have to take the reset write-side lock during HW fence signalling in order to protect against scheduler/HW fence detachment and reattachment during scheduler stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway -
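A minimal sketch of what "scoping amdgpu_job_timeout with drm_dev_enter/exit" can look like. drm_dev_enter()/drm_dev_exit() are the real DRM helpers that take the unplug SRCU read side and report whether the device is still present; the handler body, the use of DRM_GPU_SCHED_STAT_ENODEV and the error propagation are illustrative assumptions, not the actual amdgpu implementation:

    /* Hypothetical job-timeout handler skeleton: bail out early, skipping the
     * whole GPU reset path, once drm_dev_unplug() has been called. */
    static enum drm_gpu_sched_stat amdgpu_job_timedout_sketch(struct drm_sched_job *s_job)
    {
            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
            int idx;

            if (!drm_dev_enter(adev_to_drm(ring->adev), &idx)) {
                    /* Device is gone: no point in recovery, never start a reset. */
                    dma_fence_set_error(&s_job->s_fence->finished, -ENODEV);
                    return DRM_GPU_SCHED_STAT_ENODEV;
            }

            /* ... normal hang detection and amdgpu_device_gpu_recover() here ... */

            drm_dev_exit(idx);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }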
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Hi Dennis, yeah, that just has the same down side of a lot of additional overhead as the is_signaled callback. Bouncing cache lines on the CPU isn't funny at all. Christian. Am 13.04.21 um 11:13 schrieb Li, Dennis: [AMD Official Use Only - Internal Distribution Only] Hi, Christian and Andrey, We maybe try to implement "wait" callback function of dma_fence_ops, when GPU reset or unplug happen, make this callback return - ENODEV, to notify the caller device lost. * Must return -ERESTARTSYS if the wait is intr = true and the wait was * interrupted, and remaining jiffies if fence has signaled, or 0 if wait * timed out. Can also return other error values on custom implementations, * which should be treated as if the fence is signaled. For example a hardware * lockup could be reported like that. * * This callback is optional. */ signed long (*wait)(struct dma_fence *fence, bool intr, signed long timeout); Best Regards Dennis Li -Original Message- From: Christian König Sent: Tuesday, April 13, 2021 3:10 PM To: Grodzovsky, Andrey ; Koenig, Christian ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. 
/* Past this point no more fences are submitted to the HW ring and hence we can safely force-signal all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively, grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, it is not clear to me why we are focusing on the scheduler threads; any code path that generates HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account, such as direct IB submissions, VM flushes, etc. You need to work together with the reset lock anyway, because a hotplug could run at the same time as a reset. Going my way, I indeed see now that I have to take the reset write-side lock during HW fence signalling in order to protect against scheduler/HW fence detachment and reattachment during scheduler stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug.
RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
[AMD Official Use Only - Internal Distribution Only] Hi, Christian and Andrey, We maybe try to implement "wait" callback function of dma_fence_ops, when GPU reset or unplug happen, make this callback return - ENODEV, to notify the caller device lost. * Must return -ERESTARTSYS if the wait is intr = true and the wait was * interrupted, and remaining jiffies if fence has signaled, or 0 if wait * timed out. Can also return other error values on custom implementations, * which should be treated as if the fence is signaled. For example a hardware * lockup could be reported like that. * * This callback is optional. */ signed long (*wait)(struct dma_fence *fence, bool intr, signed long timeout); Best Regards Dennis Li -Original Message- From: Christian König Sent: Tuesday, April 13, 2021 3:10 PM To: Grodzovsky, Andrey ; Koenig, Christian ; Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, Felix ; Zhang, Hawking ; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: > > On 2021-04-12 3:18 p.m., Christian König wrote: >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: >>> [SNIP] >>>>> >>>>> So what's the right approach ? How we guarantee that when running >>>>> amdgpu_fence_driver_force_completion we will signal all the HW >>>>> fences and not racing against some more fences insertion into that >>>>> array ? >>>>> >>>> >>>> Well I would still say the best approach would be to insert this >>>> between the front end and the backend and not rely on signaling >>>> fences while holding the device srcu. >>> >>> >>> My question is, even now, when we run >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, >>> what there prevents a race with another fence being at the same time >>> emitted and inserted into the fence array ? Looks like nothing. >>> >> >> Each ring can only be used by one thread at the same time, this >> includes emitting fences as well as other stuff. >> >> During GPU reset we make sure nobody writes to the rings by stopping >> the scheduler and taking the GPU reset lock (so that nobody else can >> start the scheduler again). > > > What about direct submissions not through scheduler - > amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. >> >>>> >>>> BTW: Could it be that the device SRCU protects more than one device >>>> and we deadlock because of this? >>> >>> >>> I haven't actually experienced any deadlock until now but, yes, >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the >>> presence of multiple devices from same or different drivers we in >>> fact are dependent on all their critical sections i guess. >>> >> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually >> need to sync that up with Daniel and the rest of the i915 guys. >> >> IIRC we could actually have an amdgpu device in a docking station >> which needs hotplug and the driver might depend on waiting for the >> i915 driver as well. > > > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I > don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. > > Andrey > > >> >> Christian. >> >>> Andrey >>> >>> >>>> >>>> Christian. 
>>>>> Andrey >>>>>>> Andrey >>>>>>>> Christian. >>>>>>>>> /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */
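A minimal sketch of the dma_fence_ops "wait" callback Dennis is proposing. The .wait hook, its signature and dma_fence_default_wait() are real dma-fence API; the amdgpu_fence_unplugged() helper and the idea of signaling with -ENODEV from inside the callback are assumptions taken from the proposal, and Christian's cache-line-bouncing concern above applies to exactly this kind of per-wait check:

    /* Hypothetical wait callback: report device loss instead of blocking
     * forever once the GPU has been unplugged. */
    static signed long amdgpu_fence_wait_sketch(struct dma_fence *f, bool intr,
                                                signed long timeout)
    {
            if (amdgpu_fence_unplugged(f)) {      /* placeholder for "device is gone" */
                    dma_fence_set_error(f, -ENODEV);
                    dma_fence_signal(f);
                    return -ENODEV;               /* caller treats this as device lost */
            }

            /* Otherwise fall back to the normal wait implementation. */
            return dma_fence_default_wait(f, intr, timeout);
    }

    static const struct dma_fence_ops amdgpu_fence_ops_sketch = {
            .get_driver_name   = amdgpu_fence_get_driver_name,   /* existing helpers */
            .get_timeline_name = amdgpu_fence_get_timeline_name,
            .wait              = amdgpu_fence_wait_sketch,
    };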
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky: On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. Those only happen during startup and GPU reset. BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. I'm really wondering the same thing for quite a while now. Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global. Regards, Christian. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. 
But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. In fact I already do it anyway - https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1 Yes, good point as well. Christian. Andrey Christian. Andrey Christian. Andrey Andrey
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 13.04.21 um 07:36 schrieb Andrey Grodzovsky: [SNIP] emit_fence(fence); /* We can't wait forever as the HW might be gone at any point */ dma_fence_wait_timeout(old_fence, 5S); You can pretty much ignore this wait here. It is only a last resort so that we never overwrite the ring buffers. If the device is present, how can I ignore this? I think you missed my question here. Sorry, I thought I answered that below. See, this is just the last resort so that we don't need to worry about ring buffer overflows during testing. We should not get here in practice, and if we get here, generating a deadlock might actually be the best handling. The alternative would be to call BUG(). BTW, I am not sure it's so improbable to get here in case of a sudden device remove: if you are in the middle of rapid command submission to the ring at that time, you could easily get a ring buffer overrun because EOP interrupts are gone and fences are not removed anymore, while new ones keep arriving from new submissions which haven't stopped yet. During normal operation hardware fences are only created by two code paths: 1. The scheduler when it pushes jobs to the hardware. 2. The KIQ when it does register access under SRIOV. Both are limited in how many submissions can be made. The only case where this here becomes necessary is during GPU reset, when we do direct submission bypassing the scheduler for IB and other tests. Christian. Andrey
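To make the "last resort" wait concrete, here is a sketch of the step being discussed: before reusing a slot in the ring's fence array, wait, with a timeout, for the fence that previously occupied it. The fence_drv fields are the ones quoted in the thread; the 5 second value, the local variables and the error handling are assumptions about a surrounding emit function (returning an errno), not the actual amdgpu code:

    /* Hypothetical excerpt from fence emission; assumes 'ring' and 'seq'
     * exist in the surrounding function. */
    struct dma_fence *old_fence;
    signed long r;

    old_fence = rcu_dereference_protected(
                    ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask],
                    true);
    if (old_fence && !dma_fence_is_signaled(old_fence)) {
            /* Bounded wait so a dead GPU cannot block us forever. Hitting the
             * timeout means a hang or a removed device, and what to do then
             * (BUG(), -ENODEV, ...) is exactly what is being debated here. */
            r = dma_fence_wait_timeout(old_fence, false, msecs_to_jiffies(5000));
            if (r <= 0)
                    return r ? r : -ETIMEDOUT;
    }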
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-12 2:23 p.m., Christian König wrote: Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-12 2:05 p.m., Christian König wrote: Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky: On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? I think you missed my question here Sorry I thought I answered that below. See this is just the last resort so that we don't need to worry about ring buffer overflows during testing. We should not get here in practice and if we get here generating a deadlock might actually be the best handling. The alternative would be to call BUG(). BTW, I am not sure it's so improbable to get here in case of sudden device remove, if you are during rapid commands submission to the ring during this time you could easily get to ring buffer overrun because EOP interrupts are gone and fences are not removed anymore but new ones keep arriving from new submissions which don't stop yet. Andrey But it should not have a timeout as far as I can see. 
Without a timeout on the wait the whole approach falls apart, as I can't call synchronize_srcu on this scope, because once the device is physically gone the wait here will last forever. Yeah, but this is intentional. The only alternative to avoid corruption is to wait with a timeout and call BUG() if that triggers. That isn't much better. ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); dma_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in-flight IOCTLs; this deals specifically with the issue of force-signalling HW fences. Yeah, but it adds the same overhead as the device SRCU. Christian. So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we signal all the HW fences without racing against more fences being inserted into that array? Well, I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device SRCU.
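A minimal sketch of the force-completion idea referenced above: bump the sequence number the hardware appears to have reached well past anything emitted (or about to be emitted), then run the normal processing so every fence currently parked in the array signals. amdgpu_fence_write(), amdgpu_fence_process() and the fence_drv fields are the ones already named in the thread; wrapping them in one helper is the only assumption:

    /* Hypothetical extended force-completion for one ring, per Christian's
     * sync_seq + 0x3FFF suggestion. */
    static void amdgpu_fence_force_completion_sketch(struct amdgpu_ring *ring)
    {
            /* Make the next 0x3FFF sequence numbers compare as signaled too,
             * so fences emitted after this point are covered as well. */
            amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

            /* Walk ring->fence_drv.fences and signal everything up to the
             * sequence number written above. */
            amdgpu_fence_process(ring);
    }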
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-12 3:18 p.m., Christian König wrote: Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). What about direct submissions not through scheduler - amdgpu_job_submit_direct, I don't see how this is protected. BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Can't we propose a patch to make drm_unplug_srcu per drm_device ? I don't see why it has to be global and not per device thing. Andrey Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1 Yes, good point as well. Christian. Andrey Christian. Andrey Christian. Andrey Andrey
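A rough sketch of the per-device alternative Andrey is asking about, purely to illustrate the idea. struct drm_device already has the 'unplugged' flag; the per-device srcu_struct (which would replace the global DEFINE_STATIC_SRCU(drm_unplug_srcu) in drm_drv.c and need an init_srcu_struct() call during device init) and both function bodies are hypothetical:

    /* Hypothetical per-device variants of drm_dev_enter()/drm_dev_unplug(),
     * assuming struct drm_device gained 'struct srcu_struct unplug_srcu'. */
    static bool drm_dev_enter_sketch(struct drm_device *dev, int *idx)
    {
            *idx = srcu_read_lock(&dev->unplug_srcu);
            if (dev->unplugged) {
                    srcu_read_unlock(&dev->unplug_srcu, *idx);
                    return false;
            }
            return true;
    }

    static void drm_dev_unplug_sketch(struct drm_device *dev)
    {
            dev->unplugged = true;
            /* Only this device's critical sections are waited for, not those
             * of every DRM device in the system. */
            synchronize_srcu(&dev->unplug_srcu);
    }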
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky: [SNIP] So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and inserted into the fence array ? Looks like nothing. Each ring can only be used by one thread at the same time, this includes emitting fences as well as other stuff. During GPU reset we make sure nobody writes to the rings by stopping the scheduler and taking the GPU reset lock (so that nobody else can start the scheduler again). BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? I haven't actually experienced any deadlock until now but, yes, drm_unplug_srcu is defined as static in drm_drv.c and so in the presence of multiple devices from same or different drivers we in fact are dependent on all their critical sections i guess. Shit, yeah the devil is a squirrel. So for A+I laptops we actually need to sync that up with Daniel and the rest of the i915 guys. IIRC we could actually have an amdgpu device in a docking station which needs hotplug and the driver might depend on waiting for the i915 driver as well. Christian. Andrey Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1 Yes, good point as well. Christian. Andrey Christian. Andrey Christian. Andrey Andrey
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-12 2:23 p.m., Christian König wrote: Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-12 2:05 p.m., Christian König wrote: Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky: On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? I think you missed my question here Sorry I thought I answered that below. See this is just the last resort so that we don't need to worry about ring buffer overflows during testing. We should not get here in practice and if we get here generating a deadlock might actually be the best handling. The alternative would be to call BUG(). But it should not have a timeout as far as I can see. Without timeout wait the who approach falls apart as I can't call srcu_synchronize on this scope because once device is physically gone the wait here will be forever Yeah, but this is intentional. The only alternative to avoid corruption is to wait with a timeout and call BUG() if that triggers. That isn't much better. 
ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); DMA_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in flight IOCTLs, this deals specifically with the issue of force signalling HW fences Yeah, but it adds the same overhead as the device srcu. Christian. So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. My question is, even now, when we run amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what there prevents a race with another fence being at the same time emitted and
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky: On 2021-04-12 2:05 p.m., Christian König wrote: Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky: On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? I think you missed my question here Sorry I thought I answered that below. See this is just the last resort so that we don't need to worry about ring buffer overflows during testing. We should not get here in practice and if we get here generating a deadlock might actually be the best handling. The alternative would be to call BUG(). But it should not have a timeout as far as I can see. Without timeout wait the who approach falls apart as I can't call srcu_synchronize on this scope because once device is physically gone the wait here will be forever Yeah, but this is intentional. The only alternative to avoid corruption is to wait with a timeout and call BUG() if that triggers. That isn't much better. 
ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); DMA_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in flight IOCTLs, this deals specifically with the issue of force signalling HW fences Yeah, but it adds the same overhead as the device srcu. Christian. So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Well I would still say the best approach would be to insert this between the front end and the backend and not rely on signaling fences while holding the device srcu. BTW: Could it be that the device SRCU protects more than one device and we deadlock because of this? Christian. Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. *
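Rendered as real C, the emit-side pseudocode above looks roughly like this. dma_fence_set_error(), dma_fence_signal(), dma_fence_get() and the srcu_* calls are real kernel APIs and the fence_drv fields are the ones named in the thread; the amdgpu_unplug_srcu instance and the adev->unplug flag are the driver-private pieces this proposal would add:

    /* Hypothetical driver-private unplug guard, as proposed in the thread. */
    DEFINE_STATIC_SRCU(amdgpu_unplug_srcu);

    struct dma_fence *amdgpu_fence_emit_sketch(struct amdgpu_ring *ring,
                                               struct dma_fence *fence)
    {
            struct amdgpu_device *adev = ring->adev;
            int idx = srcu_read_lock(&amdgpu_unplug_srcu);

            if (!adev->unplug) {        /* proposed flag, set in amdgpu_pci_remove */
                    u32 seq = ++ring->fence_drv.sync_seq;

                    /* emit the fence packet to the HW ring here, then park the
                     * fence in the slot amdgpu_fence_process() later signals */
                    rcu_assign_pointer(
                            ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask],
                            dma_fence_get(fence));
            } else {
                    /* Device already gone: hand back a fence signaled with an
                     * error so nobody ever blocks on it. */
                    dma_fence_set_error(fence, -ENODEV);
                    dma_fence_signal(fence);
            }

            srcu_read_unlock(&amdgpu_unplug_srcu, idx);
            return fence;
    }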
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-12 2:05 p.m., Christian König wrote: Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky: On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? I think you missed my question here But it should not have a timeout as far as I can see. Without timeout wait the who approach falls apart as I can't call srcu_synchronize on this scope because once device is physically gone the wait here will be forever Yeah, but this is intentional. The only alternative to avoid corruption is to wait with a timeout and call BUG() if that triggers. That isn't much better. ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); DMA_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. 
drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in flight IOCTLs, this deals specifically with the issue of force signalling HW fences Yeah, but it adds the same overhead as the device srcu. Christian. So what's the right approach ? How we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and not racing against some more fences insertion into that array ? Andrey Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky: On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? But it should not have a timeout as far as I can see. Without timeout wait the who approach falls apart as I can't call srcu_synchronize on this scope because once device is physically gone the wait here will be forever Yeah, but this is intentional. The only alternative to avoid corruption is to wait with a timeout and call BUG() if that triggers. That isn't much better. ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); DMA_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. 
drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in flight IOCTLs, this deals specifically with the issue of force signalling HW fences Yeah, but it adds the same overhead as the device srcu. Christian. Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. * Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that
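For reference, the "is_signaled callback" Christian mentions maps to the optional .signaled hook of dma_fence_ops. A minimal sketch, assuming a hypothetical way to get from the fence back to its ring plus the proposed unplug flag; amdgpu_fence_read() stands for the existing helper that returns the last sequence number the hardware wrote back. The downside named above is that dma_fence_is_signaled() then runs this on every check and bounces cache lines:

    /* Hypothetical .signaled implementation: report the fence as signaled as
     * soon as the device is gone, otherwise compare against the HW sequence. */
    static bool amdgpu_fence_is_signaled_sketch(struct dma_fence *f)
    {
            struct amdgpu_ring *ring = sketch_fence_to_ring(f);   /* placeholder */
            u32 seq = lower_32_bits(f->seqno);

            if (ring->adev->unplug)        /* proposed unplug flag */
                    return true;

            return (int)(amdgpu_fence_read(ring) - seq) >= 0;
    }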
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 2021-04-12 1:44 p.m., Christian König wrote: Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and if true add the new HW fence to list and proceed to HW emition as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, set the flag to false, and then iterate the list and force signal it. Will this not prevent any new HW fence creation from now on from any place trying to do so ? Way to much overhead. The fence processing is intentionally lock free to avoid cache line bouncing because the IRQ can move from CPU to CPU. We need something which at least the processing of fences in the interrupt handler doesn't affect at all. As far as I see in the code, amdgpu_fence_emit is only called from task context. Also, we can skip this list I proposed and just use amdgpu_fence_driver_force_completion for each ring to signal all created HW fences. Ah, wait a second this gave me another idea. See amdgpu_fence_driver_force_completion(): amdgpu_fence_write(ring, ring->fence_drv.sync_seq); If we change that to something like: amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF); Not only the currently submitted, but also the next 0x3FFF fences will be considered signaled. This basically solves out problem of making sure that new fences are also signaled without any additional overhead whatsoever. Problem with this is that the act of setting the sync_seq to some MAX value alone is not enough, you actually have to call amdgpu_fence_process to iterate and signal the fences currently stored in ring->fence_drv.fences array and to guarantee that once you done your signalling no more HW fences will be added to that array anymore. I was thinking to do something like bellow: Well we could implement the is_signaled callback once more, but I'm not sure if that is a good idea. This indeed could save the explicit signaling I am doing bellow but I also set an error code there which might be helpful to propagate to users amdgpu_fence_emit() { dma_fence_init(fence); srcu_read_lock(amdgpu_unplug_srcu) if (!adev->unplug)) { seq = ++ring->fence_drv.sync_seq; emit_fence(fence); */* We can't wait forever as the HW might be gone at any point*/** dma_fence_wait_timeout(old_fence, 5S);* You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite the ring buffers. If device is present how can I ignore this ? But it should not have a timeout as far as I can see. Without timeout wait the who approach falls apart as I can't call srcu_synchronize on this scope because once device is physically gone the wait here will be forever ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = fence; } else { dma_fence_set_error(fence, -ENODEV); DMA_fence_signal(fence) } srcu_read_unlock(amdgpu_unplug_srcu) return fence; } amdgpu_pci_remove { adev->unplug = true; synchronize_srcu(amdgpu_unplug_srcu) Well that is just duplicating what drm_dev_unplug() should be doing on a different level. drm_dev_unplug is on a much wider scope, for everything in the device including 'flushing' in flight IOCTLs, this deals specifically with the issue of force signalling HW fences Andrey Christian. /* Past this point no more fence are submitted to HW ring and hence we can safely call force signal on all that are currently there. 
* Any subsequently created HW fences will be returned signaled with an error code right away */ for_each_ring(adev) amdgpu_fence_process(ring) drm_dev_unplug(dev); Stop schedulers cancel_sync(all timers and queued works); hw_fini unmap_mmio } Andrey Alternatively grabbing the reset write side and stopping and then restarting the scheduler could work as well. Christian. I didn't get the above and I don't see why I need to reuse the GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not clear to me why are we focusing on the scheduler threads, any code patch to generate HW fences should be covered, so any code leading to amdgpu_fence_emit needs to be taken into account such as, direct IB submissions, VM flushes e.t.c You need to work together with the reset lock anyway, cause a hotplug could run at the same time as a reset. For going my way indeed now I see now that I have to take reset write side lock during HW fences signalling in order to protect against scheduler/HW fences detachment and reattachment during schedulers stop/restart. But if we go with your approach then calling drm_dev_unplug and scoping amdgpu_job_timeout with
On 2021-04-09 12:39 p.m., Christian König wrote: Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky: On 2021-04-09 3:01 a.m., Christian König wrote: Am 09.04.21 um 08:53 schrieb Christian König: Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky: [SNIP]

But inserting drm_dev_enter/exit at the highest level, in drm_ioctl, is much less effort and leaves less room for error than going through each IOCTL and trying to identify at what point (possibly multiple points) it is about to access HW; some of this is hidden deep in HAL layers such as the DC layer in the display driver or the multiple layers of the powerplay/SMU libraries. Also, we can't limit ourselves to the back-end only, if by this you mean ASIC-specific functions which access registers. We also need to take care of any MMIO kernel BO (VRAM BOs) where we may access MMIO space directly by pointer from the front end of the driver (HW agnostic) and the TTM/DRM layers.

Exactly, yes. The key point is we need to identify such places anyway for GPU reset to work properly. So we could just piggy back hotplug on top of that work and are done.

I see most of this was done by Denis in this patch https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=df9c8d1aa278c435c30a69b8f2418b4a52fcb929; indeed this doesn't cover the direct by-pointer accesses of MMIO, it will introduce many more of those, and, as people write new code, new places to cover will pop up, leading to regressions and extra work to fix. It would be much better if we could blanket cover it at the very top, such as the root of all IOCTLs or, for any queued work/timer, at the very top function, to handle it once and for all.

And exactly that's what is not possible. At least for the reset case you need to look into each hardware access and handle that bit by bit, and I think that for the hotplug case we should go down that route as well.

Our problem here is how to signal all the existing fences on one hand and on the other prevent any new dma_fence waits after we finished signaling the existing fences. Once we have solved this, there is no problem using drm_dev_unplug in conjunction with drm_dev_enter/exit at the highest level of drm_ioctl to flush any IOCTLs in flight and block any new ones.

IMHO when we speak about signalling all fences we don't mean ALL the currently existing dma_fence structs (they are spread all over the place) but rather all the HW fences, because the HW is what's gone and we can't expect those fences to ever be signaled. All the rest, such as scheduler fences, user fences, drm_gem reservation objects etc., are either dependent on those HW fences, and hence signaling the HW fences will in turn signal them, or are not impacted by the HW being gone and hence can still be waited on and will complete. If this assumption is correct then I think that we should use some flag to prevent any new submission to HW which creates HW fences (somewhere around amdgpu_fence_emit), then traverse all existing HW fences (currently they are spread in a few places so maybe we need to track them in a list) and signal them. After that it's safe to call drm_dev_unplug and be sure synchronize_srcu won't stall because of dma_fence_wait. After that we can proceed to canceling work items, stopping schedulers etc.

That is problematic as well, since you need to make sure that the scheduler is not creating a new hardware fence in the moment you try to signal all of them. It would require another SRCU or lock for this.
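For reference, the "blanket at the very top" approach, bracketing the generic ioctl path with the existing DRM unplug helpers, would look roughly like the sketch below. drm_dev_enter()/drm_dev_exit() and drm_ioctl() are real DRM core functions; the wrapper itself is only illustrative:

long amdgpu_drm_ioctl_unplug_aware(struct file *filp, unsigned int cmd,
                                   unsigned long arg)
{
        struct drm_file *file_priv = filp->private_data;
        struct drm_device *dev = file_priv->minor->dev;
        long ret;
        int idx;

        /* Refuse new IOCTLs once the device is unplugged, and make in-flight
         * ones visible to drm_dev_unplug()'s synchronize_srcu(). */
        if (!drm_dev_enter(dev, &idx))
                return -ENODEV;

        ret = drm_ioctl(filp, cmd, arg);        /* the regular dispatch */

        drm_dev_exit(idx);
        return ret;
}

As Christian points out in this exchange, such a wide bracket only works if nothing inside it can block on a dma_fence that will never signal, which is exactly why the HW fences have to be force completed before drm_dev_unplug() is called.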
On 2021-04-09 3:01 a.m., Christian König wrote: Am 09.04.21 um 08:53 schrieb Christian König: Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky: On 2021-04-08 2:58 p.m., Christian König wrote: Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky: On 2021-04-08 4:32 a.m., Christian König wrote: Am 08.04.21 um 10:22 schrieb Christian König: [SNIP]

Beyond blocking all delayed works and scheduler threads, we also need to guarantee that no IOCTL can access MMIO post device unplug, OR that in-flight IOCTLs are done before we finish pci_remove (amdgpu_pci_remove for us). For this I suggest we do something like what we worked on with Takashi Iwai, the ALSA maintainer, recently when he helped implement PCI BAR move support for snd_hda_intel. Take a look at https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=cbaa324799718e2b828a8c7b5b001dd896748497 and https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1&id=e36365d9ab5bbc30bdc221ab4b3437de34492440. We had the same issue there: how to prevent MMIO accesses while the BARs are migrating. What was done there is that a refcount was added to count all IOCTLs in flight; for any in-flight IOCTL the BAR migration handler would block until the refcount dropped to 0 before it would proceed, and any later IOCTL stops and waits if the device is in the migration state.

We don't even need the wait part, there is nothing to wait for, we just return with -ENODEV in this case. This is essentially what the DRM SRCU is doing as well. For the hotplug case we could do this at the top level since we can signal the fence and don't need to block memory management.

To make SRCU 'wait for' all IOCTLs in flight we would need to wrap every IOCTL (practically - just the drm_ioctl function) with drm_dev_enter/drm_dev_exit - can we do it?

Sorry, totally missed this question. Yes, exactly that. As discussed, for the hotplug case we can do this.

Thinking more about it - assuming we are treating synchronize_srcu as a 'wait for completion' of any in-flight {drm_dev_enter, drm_dev_exit} scope, some of those scopes might do dma_fence_wait inside. Since we haven't force signaled the fences yet we will end up in a deadlock. We have to signal all the various fences before doing the 'wait for'. But we can't signal the fences before setting 'dev->unplugged = true' to reject further CS and other stuff which might create more fences we were supposed to force signal and would now miss. Effectively, setting 'dev->unplugged = true' and doing synchronize_srcu in one call, like drm_dev_unplug does, without signalling all the fences in the device in between these two steps looks like a possible deadlock to me - what do you think?

Indeed, that is a really good argument to handle it the same way as the reset lock. E.g. not taking it at the high level IOCTL, but rather when the frontend of the driver has acquired all the necessary locks (BO resv, VM lock etc...) before calling into the backend to actually do things with the hardware.

Christian.

From what you said I understand that you want to solve this problem by using drm_dev_enter/exit brackets low enough in the code such that they will not include any fence wait. [SNIP]
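The ordering constraint being worked out here (force complete the HW fences first, only then let drm_dev_unplug() do its synchronize_srcu) can be written down as a rough removal-path sketch. The ring iteration and the force-completion helper follow the names used in this thread; everything else, including the function itself, is an assumption rather than the real amdgpu_pci_remove:

static void amdgpu_pci_remove_sketch(struct pci_dev *pdev)
{
        struct drm_device *dev = pci_get_drvdata(pdev);
        struct amdgpu_device *adev = drm_to_adev(dev);
        int i;

        /* 1) Make sure no dma_fence_wait() inside a drm_dev_enter()/exit()
         *    section can block forever: force complete every HW fence first. */
        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
                struct amdgpu_ring *ring = adev->rings[i];

                if (!ring || !ring->fence_drv.initialized)
                        continue;
                amdgpu_fence_driver_force_completion(ring);
        }

        /* 2) Only now flip the unplug state and wait for all
         *    drm_dev_enter()/drm_dev_exit() sections to drain. */
        drm_dev_unplug(dev);

        /* 3) Stop schedulers, cancel delayed work and timers, hw_fini,
         *    unmap MMIO, as in the teardown list earlier in the thread. */
}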
From: Li, Dennis
Sent: Thursday, 18 March 2021 09:28
To: Koenig, Christian; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

>>> Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.

What about adding a check for adev->in_gpu_reset in the work item? If we exchange the two steps, it may introduce a deadlock. For example, a user thread holds the read lock and waits for a fence; if the recovery thread tries to take the write lock first and only then completes the fences, the recovery thread will be blocked forever.

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to avoid recovery thread reentry and to make lower functions return earlier when recovery starts, but it can't block the recovery thread when it accesses hardware. The r/w semaphore reset_sem is used to solve the synchronization issues between the recovery thread and other threads.
>
> The original solution locked register access in lower functions, which introduces the following issues:
>
> 1) Many lower functions are used in both the recovery thread and others. Firstly we must harvest these functions; it is easy to miss some. Secondly these functions need to select which lock (read lock or write lock) to take according to the thread they run in. If the thread context isn't considered, the added lock easily introduces a deadlock. Besides that, developers easily forget to add locks for new functions.
>
> 2) Performance drop. Lower functions are called more frequently.
>
> 3) It easily introduces false positive lockdep complaints, because the write lock has a big range in the recovery thread, but low level functions holding the read lock may be protected by other locks in other threads.
>
> Therefore the new solution tries to add lock protection for the ioctls of kfd. Its goal is that no threads except for the recovery thread or its children (for xgmi) access hardware while doing GPU reset and resume. So refine the recovery thread as follows:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>   1) If it fails, the system already has a recovery thread running; the current thread exits directly.
>   2) If it succeeds, enter the recovery thread.
>
> Step 1: Cancel all delayed works, stop the drm scheduler, complete all unreceived fences and so on. This tries to stop or pause other threads.
>
> Step 2: Call down_write(&adev->reset_sem) to hold the write lock, which will block the recovery thread until other threads release their read locks.

Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.

Just to make it clear: until this is fixed the whole patch set is a NAK.

Regards, Christian.

> Step 3: Normally, only recovery threads are running to access hardware; it is safe to do the GPU reset now.
>
> Step 4: Do post GPU reset work, such as calling all IPs' resume functions.
>
> Step 5: Atomically set adev->in_gpu_reset to 0, wake up other threads and release the write lock. The recovery thread exits normally.
>
> Other threads call amdgpu_read_lock to synchronize with the recovery thread. If a thread finds that in_gpu_reset is 1, it releases the read lock if it holds one, and then blocks itself to wait for the recovery-finished event. If a thread successfully holds the read lock and in_gpu_reset is 0, it continues. It will exit normally or be stopped by the recovery thread in step 1.
>
> Dennis Li (4):
>   drm/amdgpu: remove reset lock from low level functions
>   drm/amdgpu: refine the GPU recovery sequence
>   drm/amdgpu: instead of using down/up_read directly
>   drm/amdkfd: add reset lock protection for kfd entry functions
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h         |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 173 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c  |   8 -
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c      |   4 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c       |   9 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c       |   5 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c       |   5 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c    | 172 -
>  drivers/
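A condensed C sketch of the Step 0 to Step 5 flow described above may help to see the locking order at a glance. The fields in_gpu_reset and reset_sem come from the patch set under discussion; every helper and the wait queue name are placeholders:

/* Illustrative only: the recovery flow from the cover letter. */
static int amdgpu_gpu_recover_sketch(struct amdgpu_device *adev)
{
        /* Step 0: allow only one recovery thread at a time. */
        if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                return 0;       /* someone else is already recovering */

        /* Step 1: cancel delayed works, stop the drm scheduler and
         * complete all unreceived fences, pausing the other threads. */
        amdgpu_pre_reset_quiesce(adev);                 /* hypothetical */

        /* Step 2: wait for every reader (ioctls etc.) to drop the lock. */
        down_write(&adev->reset_sem);

        /* Step 3: only the recovery thread touches the hardware now. */
        amdgpu_do_asic_reset_sketch(adev);              /* hypothetical */

        /* Step 4: post reset, resume all IP blocks. */
        amdgpu_resume_ip_blocks_sketch(adev);           /* hypothetical */

        /* Step 5: open the gate again and wake up waiters. */
        atomic_set(&adev->in_gpu_reset, 0);
        wake_up_all(&adev->reset_wait_queue);           /* name assumed */
        up_write(&adev->reset_sem);
        return 0;
}

Christian's NAK above is precisely about the window between Step 1 and Step 2: work items queued in that window still run before the write lock is taken.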
[SNIP] ... maybe we should handle it the same way as reset, or maybe we should have it at the top level.

If by top level you mean checking for device unplugged and bailing out at the entry to an IOCTL, or right at the start of any work_item/timer function we have, then that seems to me better and more clear. Once we have flushed all of them in flight, there is no reason for them to execute any more when the device is unplugged.

Andrey

Regards, Christian.

The above approach should allow us to wait for all the IOCTLs in flight, together with stopping scheduler threads and cancelling and flushing all in-flight work items and timers, I think. It should give us a full solution for the hot-unplug case of preventing any MMIO accesses post device pci_remove. Let me know what you think guys.

Andrey

And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky: Denis, Christian, are there any updates in the plan on how to move on with this? As you know, I need very similar code for my upstreaming of device hot-unplug. My latest solution (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) was not acceptable because of the low level guards on the register accessor level, which were hurting performance. Basically I need a way to prevent any MMIO write accesses from the kernel driver after the device is removed (UMD accesses are taken care of by page faulting a dummy page). We are now using the hot-unplug code for the Freemont program and so upstreaming became more of a priority than before. This MMIO access issue is currently my main blocker for upstreaming. Is there any way I can assist in pushing this on?

Andrey

On 2021-03-18 5:51 a.m., Christian König wrote: Am 18.03.21 um 10:30 schrieb Li, Dennis:

>>> The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to be avoided.

I understood your concern. It is more complex for DRM GFX, therefore I abandon adding lock protection for DRM ioctls for now. Maybe we can try to add all kernel dma_fence waits to a list, and signal them all in the recovery thread. Do you have the same concern for the compute cases?

Yes, compute (KFD) is even harder to handle. See, you can't signal the dma_fence waits. Waiting for a dma_fence also means you wait for the GPU reset to finish. If we signalled the dma_fence during the GPU reset then we would run into memory corruption, because the hardware jobs running after the GPU reset would access memory which is already freed.

>>> Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.

Agree. This approach will escape the monitoring of lockdep. Its goal is to block other threads when the GPU recovery thread starts. But I couldn't find a better method to solve this problem. Do you have some suggestion?

Well, completely abandon those changes here. What we need to do is to identify where hardware access happens and then insert taking the read side of the GPU reset lock, so that we don't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset.

Regards, Christian.
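What Christian asks for here, taking the read side of the GPU reset lock around the actual hardware access, would look roughly like the sketch below. Only reset_sem is taken from the patches; the helper and call site are made up for illustration:

/* Sketch: protect a backend hardware access against a concurrent reset. */
static int amdgpu_do_hw_work_sketch(struct amdgpu_device *adev)
{
        int r;

        down_read(&adev->reset_sem);
        /* Inside this section: touch registers and ring buffers only.
         * No dma_fence_wait() and no blocking memory allocation, so the
         * recovery thread holding the write side can never deadlock
         * against us. */
        r = program_the_hardware(adev);         /* hypothetical */
        up_read(&adev->reset_sem);

        return r;
}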
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 06.04.21 12:34, Christian König wrote:
> Hi Andrey, well good question. My job is to watch over the implementation and design, and while I always help, I can adjust anybody's schedule.

That should read "I can't adjust anybody's schedule".

Christian.
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Hi Andrey, well good question. My job is to watch over the implementation and design, and while I always help, I can adjust anybody's schedule.

Is the patch to print a warning when the hardware is accessed without holding the locks merged yet? If not, then that would probably be a good starting point.

Then we would need to unify this with the SRCU, to make sure that we both hold the reset lock and block the hotplug code from reusing the MMIO space.

And then testing, testing, testing to see if we have missed something.

Christian.
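For reference, a rough sketch of what such a warning could look like: a hypothetical checked register-write helper using the reset_sem from this series. This is only an illustration of the idea, not the patch Christian refers to:

#include "amdgpu.h"

static void amdgpu_mm_wreg_checked(struct amdgpu_device *adev, u32 reg, u32 v)
{
#ifdef CONFIG_LOCKDEP
        /* Complain (once) about MMIO writes done without the GPU reset read lock. */
        WARN_ONCE(!lockdep_is_held(&adev->reset_sem),
                  "MMIO write without holding the GPU reset lock\n");
#endif
        writel(v, adev->rmmio + (reg * 4));
}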
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Denis, Christian, are there any updates in the plan on how to move on with this? As you know I need very similar code for my upstreaming of device hot-unplug. My latest solution (https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) was not acceptable because of the low level guards at the register accessor level, which were hurting performance. Basically I need a way to prevent any MMIO write accesses from the kernel driver after the device is removed (UMD accesses are taken care of by page faulting into a dummy page). We are now using the hot-unplug code for the Freemont program, so upstreaming became more of a priority than before. This MMIO access issue is currently my main blocker for upstreaming. Is there any way I can assist in pushing this on?

Andrey
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 18.03.21 10:30, Li, Dennis wrote:
>>> The GPU reset doesn't complete the fences we wait for. It only completes
>>> the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to
>>> be avoided.
> I understood your concern. It is more complex for DRM GFX, therefore I abandon adding lock protection for DRM ioctls now. Maybe we can try to add all kernel dma_fence waits to a list and signal them all in the recovery thread. Do you have the same concern for compute cases?

Yes, compute (KFD) is even harder to handle. See, you can't signal the dma_fence waiting. Waiting for a dma_fence also means you wait for the GPU reset to finish. If we signalled the dma_fence during the GPU reset, we would run into memory corruption, because the hardware jobs running after the GPU reset would access memory which is already freed.

>>> Lockdep also complains about this when it is used correctly. The only
>>> reason it doesn't complain here is because you use an atomic+wait_event
>>> instead of a locking primitive.
> Agree. This approach escapes the monitoring of lockdep. Its goal is to block other threads when the GPU recovery thread starts. But I couldn't find a better method to solve this problem. Do you have some suggestion?

Well, completely abandon those changes here. What we need to do is to identify where hardware access happens and then insert taking the read side of the GPU reset lock, so that we don't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset.

Regards,
Christian.
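A minimal sketch of that pattern (a hypothetical hardware-only path, not code from the series): the read side of the reset lock wraps pure register access, with no fence waits and no memory allocation inside the critical section.

#include "amdgpu.h"

/* Hypothetical example of wrapping a hardware-only path. */
static void amdgpu_example_hw_access(struct amdgpu_device *adev, u32 reg, u32 val)
{
        down_read(&adev->reset_sem);
        /* Register access only: no dma_fence waits, no memory allocation here. */
        WREG32(reg, val);
        up_read(&adev->reset_sem);
}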
RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>> The GPU reset doesn't complete the fences we wait for. It only completes
>>> the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to
>>> be avoided.
I understood your concern. It is more complex for DRM GFX, therefore I abandon adding lock protection for DRM ioctls now. Maybe we can try to add all kernel dma_fence waits to a list and signal them all in the recovery thread. Do you have the same concern for compute cases?

>>> Lockdep also complains about this when it is used correctly. The only
>>> reason it doesn't complain here is because you use an atomic+wait_event
>>> instead of a locking primitive.
Agree. This approach escapes the monitoring of lockdep. Its goal is to block other threads when the GPU recovery thread starts. But I couldn't find a better method to solve this problem. Do you have some suggestion?

Best Regards
Dennis Li

From: Koenig, Christian
Sent: Thursday, March 18, 2021 4:59 PM
To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking
Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Exactly that's what you don't seem to understand. The GPU reset doesn't complete the fences we wait for. It only completes the hardware fences as part of the reset. So waiting for a fence while holding the reset lock is illegal and needs to be avoided. Lockdep also complains about this when it is used correctly. The only reason it doesn't complain here is because you use an atomic+wait_event instead of a locking primitive.

Regards,
Christian.
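To illustrate why lockdep cannot see this construct, here is a rough sketch of the atomic + wait_event gate being discussed (the wait queue name is hypothetical, the rest follows the cover letter's in_gpu_reset flag):

#include "amdgpu.h"

/*
 * Readers gate themselves on an atomic flag instead of a lock class, so
 * lockdep never records a dependency between this wait and the reset path.
 */
static int amdgpu_example_read_gate(struct amdgpu_device *adev)
{
        return wait_event_interruptible(adev->recovery_done_wq,   /* hypothetical */
                                        !atomic_read(&adev->in_gpu_reset));
}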
RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>> Those two steps need to be exchanged or otherwise it is possible that new
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we exchange the two steps, it may introduce a deadlock. For example, a user thread holds the read lock and is waiting for a fence; if the recovery thread tries to take the write lock first and only then completes the fences, the recovery thread will be blocked forever.

Best Regards
Dennis Li
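Dennis's scenario, sketched from the waiting thread's side (a hypothetical function using the series' reset_sem, not code from the patches):

#include "amdgpu.h"
#include <linux/dma-fence.h>

static long amdgpu_example_wait_in_read_section(struct amdgpu_device *adev,
                                                struct dma_fence *fence)
{
        long r;

        down_read(&adev->reset_sem);
        /*
         * If the recovery thread took down_write() *before* force-completing
         * the hardware fences, this wait could never finish: the writer is
         * blocked behind our read lock, and the fence would only be signalled
         * once the writer makes progress.
         */
        r = dma_fence_wait_timeout(fence, false, msecs_to_jiffies(100));
        up_read(&adev->reset_sem);

        return r;
}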
Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
On 18.03.21 08:23, Dennis Li wrote:
> We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to avoid recovery thread reentry and to make lower level functions return earlier when recovery starts, but it cannot block the recovery thread while it accesses hardware. The r/w semaphore reset_sem is used to solve these synchronization issues between the recovery thread and other threads.
>
> The original solution locked register access in the lower level functions, which introduces the following issues:
>
> 1) Many lower level functions are used by both the recovery thread and other threads. First, we must identify all of these functions, and it is easy to miss some. Second, these functions need to select which lock (read or write) to take according to the thread they run in; if the thread context isn't considered, the added lock easily introduces deadlocks. Besides that, developers easily forget to add locks for new functions.
>
> 2) Performance drop. The lower level functions are called more frequently.
>
> 3) It easily introduces false positive lockdep complaints, because the write lock covers a large range in the recovery thread, while the read lock taken in low level functions may be nested under other locks in other threads.
>
> Therefore the new solution tries to add lock protection at the kfd ioctl level. Its goal is that no threads except the recovery thread or its children (for xgmi) access hardware while a GPU reset and resume is in progress. So refine the recovery thread as follows:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
>   1) If it fails, a recovery thread is already running and the current thread exits directly.
>   2) If it succeeds, enter the recovery thread.
>
> Step 1: Cancel all delayed works, stop the drm scheduler, complete all unreceived fences and so on. This tries to stop or pause the other threads.
>
> Step 2: Call down_write(&adev->reset_sem) to take the write lock, which blocks the recovery thread until the other threads release their read locks.

Those two steps need to be exchanged or otherwise it is possible that new delayed work items etc are started before the lock is taken.

Just to make it clear: until this is fixed the whole patch set is a NAK.

Regards,
Christian.

> Step 3: Normally only the recovery thread is accessing the hardware now, so it is safe to do the GPU reset.
>
> Step 4: Do the post GPU reset work, such as calling all IPs' resume functions.
>
> Step 5: Atomically set adev->in_gpu_reset back to 0, wake up the other threads and release the write lock. The recovery thread exits normally.
>
> Other threads call amdgpu_read_lock to synchronize with the recovery thread. If a thread finds that in_gpu_reset is 1, it releases the read lock if it holds one and then blocks to wait for the recovery-finished event. If a thread successfully takes the read lock and in_gpu_reset is 0, it continues. It will exit normally or be stopped by the recovery thread in step 1.
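For readability, the sequence above compressed into a sketch (the wait queue name is hypothetical; steps 1 and 2 are kept in the cover letter's order, which is exactly what is being disputed here):

#include "amdgpu.h"

static int amdgpu_example_recover(struct amdgpu_device *adev)
{
        /* Step 0: only one recovery thread may enter. */
        if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                return 0;

        /* Step 1: cancel delayed works, park the schedulers, force-signal fences. */

        /* Step 2: wait for every reader to drop the reset read lock. */
        down_write(&adev->reset_sem);

        /* Step 3: perform the actual ASIC reset. */
        /* Step 4: call the IP blocks' resume functions. */

        /* Step 5: let the other threads run again. */
        atomic_set(&adev->in_gpu_reset, 0);
        up_write(&adev->reset_sem);
        wake_up_all(&adev->recovery_done_wq);   /* hypothetical wait queue */

        return 0;
}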
Dennis Li (4):
  drm/amdgpu: remove reset lock from low level functions
  drm/amdgpu: refine the GPU recovery sequence
  drm/amdgpu: instead of using down/up_read directly
  drm/amdkfd: add reset lock protection for kfd entry functions

 drivers/gpu/drm/amd/amdgpu/amdgpu.h            |   6 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c    |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c     | 173 +-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c     |   8 -
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c         |   4 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c          |   9 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c          |   5 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c          |   5 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c       | 172 -
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h          |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c       |   4 +
 .../amd/amdkfd/kfd_process_queue_manager.c     |  17 ++
 12 files changed, 345 insertions(+), 75 deletions(-)

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx