Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
My bad, I see u already fixed this in amd-staging-drm-next. We had an 
issue in an internal branch with this and just reinvented the wheel :))


Andrey

On 2022-04-14 10:32, Andrey Grodzovsky wrote:

Yea, i need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as 
well? Otherwise, an unset/unlock leak will happen when 
pci_channel_state != pci_channel_io_frozen.


Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of 
Christian König

Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org

Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC 
handler




Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solved mutex deadlock when handling both FATAL and non FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
      adev->pci_channel_state = state;
   +    /*
+ * Locking adev->reset_domain->sem will prevent any external 
access

+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
   switch (state) {
   case pci_channel_io_normal:
   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


   /* Fatal error, prepare for slot reset */
   case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any 
external access

- * to GPU during PCI error recovery
- */
- amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
   /*
    * Block any work scheduling as we do for regular GPU reset
    * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky

Yea, i need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solved mutex deadlock when handling both FATAL and non FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
   
   	adev->pci_channel_state = state;
   
+	/*

+* Locking adev->reset_domain->sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-* Locking adev->reset_domain->sem will prevent any external 
access
-* to GPU during PCI error recovery
-*/
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky



On 2022-04-14 02:40, Christian König wrote:



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solved mutex deadlock when handling both FATAL
and non FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t 
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta

    adev->pci_channel_state = state;
  +    /*
+ * Locking adev->reset_domain->sem will prevent any external access
+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
  switch (state) {
  case pci_channel_io_normal:
  return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?



In amdgpu_pci_resume, but you made realize I can do this better.
I will be back with V2.

Andrey





  /* Fatal error, prepare for slot reset */
  case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any external 
access

- * to GPU during PCI error recovery
- */
-    amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
  /*
   * Block any work scheduling as we do for regular GPU reset
   * for the duration of the recovery




RE: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Chen, Guchun
It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:
> Lock reset domain unconditionally because on resume we unlock it 
> unconditionally.
> This solved mutex deadlock when handling both FATAL and non FATAL PCI 
> errors one after another.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
>   1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1cc488a767d8..c65f25e3a0fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5531,18 +5531,18 @@ pci_ers_result_t 
> amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>   
>   adev->pci_channel_state = state;
>   
> + /*
> +  * Locking adev->reset_domain->sem will prevent any external access
> +  * to GPU during PCI error recovery
> +  */
> + amdgpu_device_lock_reset_domain(adev->reset_domain);
> + amdgpu_device_set_mp1_state(adev);
> +
>   switch (state) {
>   case pci_channel_io_normal:
>   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?

>   /* Fatal error, prepare for slot reset */
>   case pci_channel_io_frozen:
> - /*
> -  * Locking adev->reset_domain->sem will prevent any external 
> access
> -  * to GPU during PCI error recovery
> -  */
> - amdgpu_device_lock_reset_domain(adev->reset_domain);
> - amdgpu_device_set_mp1_state(adev);
> -
>   /*
>* Block any work scheduling as we do for regular GPU reset
>* for the duration of the recovery



Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Christian König




Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solved mutex deadlock when handling both FATAL
and non FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
  
  	adev->pci_channel_state = state;
  
+	/*

+* Locking adev->reset_domain->sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-* Locking adev->reset_domain->sem will prevent any external 
access
-* to GPU during PCI error recovery
-*/
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery