Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
My bad, I see u already fixed this in amd-staging-drm-next. We had an issue in an internal branch with this and just reinvented the wheel :)) Andrey On 2022-04-14 10:32, Andrey Grodzovsky wrote: Yea, i need to improve it a bit, ignore this one, will be back with V2. Andrey On 2022-04-14 03:12, Chen, Guchun wrote: It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* + * Locking adev->reset_domain->sem will prevent any external access + * to GPU during PCI error recovery + */ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* - * Locking adev->reset_domain->sem will prevent any external access - * to GPU during PCI error recovery - */ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
Yea, i need to improve it a bit, ignore this one, will be back with V2. Andrey On 2022-04-14 03:12, Chen, Guchun wrote: It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* +* Locking adev->reset_domain->sem will prevent any external access +* to GPU during PCI error recovery +*/ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* -* Locking adev->reset_domain->sem will prevent any external access -* to GPU during PCI error recovery -*/ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
On 2022-04-14 02:40, Christian König wrote: Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* + * Locking adev->reset_domain->sem will prevent any external access + * to GPU during PCI error recovery + */ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? In amdgpu_pci_resume, but you made realize I can do this better. I will be back with V2. Andrey /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* - * Locking adev->reset_domain->sem will prevent any external access - * to GPU during PCI error recovery - */ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
RE: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
It's in amdgpu_pci_resume. Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? Otherwise, an unset/unlock leak will happen when pci_channel_state != pci_channel_io_frozen. Regards, Guchun -Original Message- From: amd-gfx On Behalf Of Christian König Sent: Thursday, April 14, 2022 2:40 PM To: Grodzovsky, Andrey ; amd-gfx@lists.freedesktop.org Cc: Antonovitch, Anatoli Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: > Lock reset domain unconditionally because on resume we unlock it > unconditionally. > This solved mutex deadlock when handling both FATAL and non FATAL PCI > errors one after another. > > Signed-off-by: Andrey Grodzovsky > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 1cc488a767d8..c65f25e3a0fc 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -5531,18 +5531,18 @@ pci_ers_result_t > amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta > > adev->pci_channel_state = state; > > + /* > + * Locking adev->reset_domain->sem will prevent any external access > + * to GPU during PCI error recovery > + */ > + amdgpu_device_lock_reset_domain(adev->reset_domain); > + amdgpu_device_set_mp1_state(adev); > + > switch (state) { > case pci_channel_io_normal: > return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? > /* Fatal error, prepare for slot reset */ > case pci_channel_io_frozen: > - /* > - * Locking adev->reset_domain->sem will prevent any external > access > - * to GPU during PCI error recovery > - */ > - amdgpu_device_lock_reset_domain(adev->reset_domain); > - amdgpu_device_set_mp1_state(adev); > - > /* >* Block any work scheduling as we do for regular GPU reset >* for the duration of the recovery
Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler
Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky: Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* +* Locking adev->reset_domain->sem will prevent any external access +* to GPU during PCI error recovery +*/ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; BTW: Where are we unlocking that again? /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* -* Locking adev->reset_domain->sem will prevent any external access -* to GPU during PCI error recovery -*/ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery
[PATCH] drm/amdgpu: Move reset domain locking in DPC handler
Lock reset domain unconditionally because on resume we unlock it unconditionally. This solved mutex deadlock when handling both FATAL and non FATAL PCI errors one after another. Signed-off-by: Andrey Grodzovsky --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1cc488a767d8..c65f25e3a0fc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta adev->pci_channel_state = state; + /* +* Locking adev->reset_domain->sem will prevent any external access +* to GPU during PCI error recovery +*/ + amdgpu_device_lock_reset_domain(adev->reset_domain); + amdgpu_device_set_mp1_state(adev); + switch (state) { case pci_channel_io_normal: return PCI_ERS_RESULT_CAN_RECOVER; /* Fatal error, prepare for slot reset */ case pci_channel_io_frozen: - /* -* Locking adev->reset_domain->sem will prevent any external access -* to GPU during PCI error recovery -*/ - amdgpu_device_lock_reset_domain(adev->reset_domain); - amdgpu_device_set_mp1_state(adev); - /* * Block any work scheduling as we do for regular GPU reset * for the duration of the recovery -- 2.25.1