Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky
My bad, I see u already fixed this in amd-staging-drm-next. We had an 
issue in an internal branch with this and just reinvented the wheel :))


Andrey

On 2022-04-14 10:32, Andrey Grodzovsky wrote:

Yea, i need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as 
well? Otherwise, an unset/unlock leak will happen when 
pci_channel_state != pci_channel_io_frozen.


Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of 
Christian König

Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org

Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC 
handler




Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solved mutex deadlock when handling both FATAL and non FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
      adev->pci_channel_state = state;
   +    /*
+ * Locking adev->reset_domain->sem will prevent any external 
access

+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
   switch (state) {
   case pci_channel_io_normal:
   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


   /* Fatal error, prepare for slot reset */
   case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any 
external access

- * to GPU during PCI error recovery
- */
- amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
   /*
    * Block any work scheduling as we do for regular GPU reset
    * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky

Yea, i need to improve it a bit, ignore this one, will be back with V2.

Andrey

On 2022-04-14 03:12, Chen, Guchun wrote:

It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume we unlock it
unconditionally.
This solved mutex deadlock when handling both FATAL and non FATAL PCI
errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
   1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
   
   	adev->pci_channel_state = state;
   
+	/*

+* Locking adev->reset_domain->sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-* Locking adev->reset_domain->sem will prevent any external 
access
-* to GPU during PCI error recovery
-*/
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery


Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Andrey Grodzovsky



On 2022-04-14 02:40, Christian König wrote:



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solved mutex deadlock when handling both FATAL
and non FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t 
amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta

    adev->pci_channel_state = state;
  +    /*
+ * Locking adev->reset_domain->sem will prevent any external access
+ * to GPU during PCI error recovery
+ */
+    amdgpu_device_lock_reset_domain(adev->reset_domain);
+    amdgpu_device_set_mp1_state(adev);
+
  switch (state) {
  case pci_channel_io_normal:
  return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?



In amdgpu_pci_resume, but you made realize I can do this better.
I will be back with V2.

Andrey





  /* Fatal error, prepare for slot reset */
  case pci_channel_io_frozen:
-    /*
- * Locking adev->reset_domain->sem will prevent any external 
access

- * to GPU during PCI error recovery
- */
-    amdgpu_device_lock_reset_domain(adev->reset_domain);
-    amdgpu_device_set_mp1_state(adev);
-
  /*
   * Block any work scheduling as we do for regular GPU reset
   * for the duration of the recovery




RE: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Chen, Guchun
It's in amdgpu_pci_resume.

Andrey, shall we modify the code accordingly in amdgpu_pci_resume as well? 
Otherwise, an unset/unlock leak will happen when pci_channel_state != 
pci_channel_io_frozen.

Regards,
Guchun

-Original Message-
From: amd-gfx  On Behalf Of Christian 
König
Sent: Thursday, April 14, 2022 2:40 PM
To: Grodzovsky, Andrey ; 
amd-gfx@lists.freedesktop.org
Cc: Antonovitch, Anatoli 
Subject: Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler



Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:
> Lock reset domain unconditionally because on resume we unlock it 
> unconditionally.
> This solved mutex deadlock when handling both FATAL and non FATAL PCI 
> errors one after another.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
>   1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1cc488a767d8..c65f25e3a0fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5531,18 +5531,18 @@ pci_ers_result_t 
> amdgpu_pci_error_detected(struct pci_dev *pdev, pci_channel_sta
>   
>   adev->pci_channel_state = state;
>   
> + /*
> +  * Locking adev->reset_domain->sem will prevent any external access
> +  * to GPU during PCI error recovery
> +  */
> + amdgpu_device_lock_reset_domain(adev->reset_domain);
> + amdgpu_device_set_mp1_state(adev);
> +
>   switch (state) {
>   case pci_channel_io_normal:
>   return PCI_ERS_RESULT_CAN_RECOVER;

BTW: Where are we unlocking that again?

>   /* Fatal error, prepare for slot reset */
>   case pci_channel_io_frozen:
> - /*
> -  * Locking adev->reset_domain->sem will prevent any external 
> access
> -  * to GPU during PCI error recovery
> -  */
> - amdgpu_device_lock_reset_domain(adev->reset_domain);
> - amdgpu_device_set_mp1_state(adev);
> -
>   /*
>* Block any work scheduling as we do for regular GPU reset
>* for the duration of the recovery



Re: [PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-14 Thread Christian König




Am 13.04.22 um 21:31 schrieb Andrey Grodzovsky:

Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solved mutex deadlock when handling both FATAL
and non FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
  1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
  
  	adev->pci_channel_state = state;
  
+	/*

+* Locking adev->reset_domain->sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;


BTW: Where are we unlocking that again?


/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-* Locking adev->reset_domain->sem will prevent any external 
access
-* to GPU during PCI error recovery
-*/
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery




[PATCH] drm/amdgpu: Move reset domain locking in DPC handler

2022-04-13 Thread Andrey Grodzovsky
Lock reset domain unconditionally because on resume
we unlock it unconditionally.
This solved mutex deadlock when handling both FATAL
and non FATAL PCI errors one after another.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1cc488a767d8..c65f25e3a0fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5531,18 +5531,18 @@ pci_ers_result_t amdgpu_pci_error_detected(struct 
pci_dev *pdev, pci_channel_sta
 
adev->pci_channel_state = state;
 
+   /*
+* Locking adev->reset_domain->sem will prevent any external access
+* to GPU during PCI error recovery
+*/
+   amdgpu_device_lock_reset_domain(adev->reset_domain);
+   amdgpu_device_set_mp1_state(adev);
+
switch (state) {
case pci_channel_io_normal:
return PCI_ERS_RESULT_CAN_RECOVER;
/* Fatal error, prepare for slot reset */
case pci_channel_io_frozen:
-   /*
-* Locking adev->reset_domain->sem will prevent any external 
access
-* to GPU during PCI error recovery
-*/
-   amdgpu_device_lock_reset_domain(adev->reset_domain);
-   amdgpu_device_set_mp1_state(adev);
-
/*
 * Block any work scheduling as we do for regular GPU reset
 * for the duration of the recovery
-- 
2.25.1