Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
My bad, I see u already fixed this in amd-staging-drm-next. We had an issue in an internal branch with this and just reinvented the wheel :)) Andrey On 2022-04-14 10:32, Andrey Grodzovsky wrote: Yea, i need to improve it a bit, ignore this one, will be back with V2. Andrey On 2022-04-14 03:12, Chen, Guchun wrote: It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* + * Locking adev->reset_domain->sem will prevent any external access + * to GPU during PCI error recovery + */ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* - * Locking adev->reset_domain->sem will prevent any external access - * to GPU during PCI error recovery - */ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
Yea, i need to improve it a bit, ignore this one, will be back with V2. Andrey On 2022-04-14 03:12, Chen, Guchun wrote: It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* +* Locking adev->reset_domain->sem will prevent any external access +* to GPU during PCI error recovery +*/ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* -* Locking adev->reset_domain->sem will prevent any external access -* to GPU during PCI error recovery -*/ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
On 2022-04-14 02:40, Christian König wrote: Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* + * Locking adev->reset_domain->sem will prevent any external access + * to GPU during PCI error recovery + */ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? In amdgpu_pci_resume, but you made realize I can do this better. I will be back with V2. Andrey /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* - * Locking adev->reset_domain->sem will prevent any external access - * to GPU during PCI error recovery - */ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
RE: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: > Lock reset domain unconditionally because on resume we unlock it > unconditionally. > This solved mutex deadlock when handling both FATAL and non FATAL PCI > errors one after another. > > Signed-off-by: Andrey Grodzovsky > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 1cc488a767d8..c65f25e3a0fc 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -5531,18 +5531,18 @@ pci_ers_result_t > amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta > > adev->pci_channel_state = state; > > + /* > + * Locking adev->reset_domain->sem will prevent any external access > + * to GPU during PCI error recovery > + */ > + amdgpu_device_lock_reset_domain(adev->reset_domain); > + amdgpu_device_set_mp1_state(adev); > + > switch (state) { > case pci_channel_io_normal: > return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? > /* Fatal error, prepare for slot reset */ > case pci_channel_io_frozen: > - /* > - * Locking adev->reset_domain->sem will prevent any external > access > - * to GPU during PCI error recovery > - */ > - amdgpu_device_lock_reset_domain(adev->reset_domain); > - amdgpu_device_set_mp1_state(adev); > - > /* >* Block any work scheduling as we do for regular GPU reset >* for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* +* Locking adev->reset_domain->sem will prevent any external access +* to GPU during PCI error recovery +*/ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* -* Locking adev->reset_domain->sem will prevent any external access -* to GPU during PCI error recovery -*/ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery