Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-24 Thread Andrey Grodzovsky



On 2022-08-17 10:01, Andrey Grodzovsky wrote:


On 2022-08-17 09:44, Alex Deucher wrote:
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas  
wrote:

[AMD Official Use Only - General]

Hi Alex:
   When removing an amdgpu device, it may be difficult to change the 
order of psp_hw_fini calls.


1. The drm_dev_unplug call is at the beginning in the 
amdgpu_pci_remove function,  which makes the gpu device inaccessible 
for userspace operations.  If the call to psp_hw_fini was moved 
before drm_dev_unplug,  userspace could access the gpu device but 
the psp might be removing. It has unknown issues.



+Andrey Grodzovsky

We should fix the ordering in amdgpu_pci_remove() then I guess? There
are lots of places where drm_dev_enter() is used to protect access to
the hardware which could be similarly affected.

Alex



We probably can try to move drm_dev_unplug after 
amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must 
be the first thing we do in amdgpu_pci_remove and what impact it will 
have but maybe give it a try.
Also see if you can run libdrm hotplug test suite before and after the 
change.


Andrey



Thinking a bit more about this - i guess the main problem with this will 
be that in case of real hot unplug (which is hard to test unless you 
have a real GPU cage with extenal GPU) this move will cause trying to 
accesses HW registers
or MMIO ranges from VRAM BOs when HW is missing when you try to shut 
down the HW in HW fini IP block specific callbacks , this in turn will 
return garbage for reads (or all 1s maybe) which is what we probably 
were trying to avoid by putting drm_dev_unplug as the first thing. So 
it's probably a bit problematic.


Andrey







2. psp_hw_fini is called by the .hw_fini iterator in 
amdgpu_device_ip_fini_early, referring to the code starting from 
amdgpu_pci_remove to .hw_fini is called,
    there are many preparatory operations before calling .hw_fini,  
which makes it very difficult to change the order of psp_hw_fini or 
all block .hw_fini.


    So can we do a workaround in psp_cmd_submit_buf when removing 
amdgpu device?


-Original Message-
From: Alex Deucher 
Sent: Monday, August 15, 2022 10:22 PM
To: Chai, Thomas 
Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking 
; Chen, Guchun ; Chai, 
Thomas 
Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually 
sent to psp when amdgpu is uninstalled


On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai  
wrote:

The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
unload messages to psp to terminate ras, asd and tmr.
But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
 drm_dev_unplug
 ..
amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
 ->.hw_fini->psp_hw_fini->...
 ->psp_ta_unload->psp_cmd_submit_buf
 ..
}
The program will return when calling drm_dev_enter in
psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
so that the TA unload messages can be sent to the psp when amdgpu is
uninstalled.

Signed-off-by: YiPeng Chai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b067ce45d226..0578d8d094a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
 if (psp->adev->no_hw_access)
 return 0;

-   if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-   return 0;
-
This check is to prevent the hardware from being accessed if the 
card is removed.  I think we need to fix the ordering elsewhere.


Alex


 memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);

 memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct 
psp_context *psp,

 }

  exit:
-   drm_dev_exit(idx);
 return ret;
  }

--
2.25.1



Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-24 Thread Alex Deucher
On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai  wrote:
>
> The psp_cmd_submit_buf function is called by psp_hw_fini to
> send TA unload messages to psp to terminate ras, asd and tmr.
> But when amdgpu is uninstalled, drm_dev_unplug is called
> earlier than psp_hw_fini in amdgpu_pci_remove, the calling
> order as follows:
> static void amdgpu_pci_remove(struct pci_dev *pdev)
> {
> drm_dev_unplug
> ..
> amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
> ->.hw_fini->psp_hw_fini->...
> ->psp_ta_unload->psp_cmd_submit_buf
> ..
> }
> The program will return when calling drm_dev_enter in
> psp_cmd_submit_buf.
>
> So the call to drm_dev_enter in psp_cmd_submit_buf should
> be removed, so that the TA unload messages can be sent to the
> psp when amdgpu is uninstalled.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index b067ce45d226..0578d8d094a7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
> if (psp->adev->no_hw_access)
> return 0;
>
> -   if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
> -   return 0;
> -

This check is to prevent the hardware from being accessed if the card
is removed.  I think we need to fix the ordering elsewhere.

Alex

> memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>
> memcpy(psp->cmd_buf_mem, cmd, sizeof(struct psp_gfx_cmd_resp));
> @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
> }
>
>  exit:
> -   drm_dev_exit(idx);
> return ret;
>  }
>
> --
> 2.25.1
>


Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-17 Thread Andrey Grodzovsky



On 2022-08-17 09:44, Alex Deucher wrote:

On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas  wrote:

[AMD Official Use Only - General]

Hi Alex:
   When removing an amdgpu device, it may be difficult to change the order of 
psp_hw_fini calls.

1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove 
function,  which makes the gpu device inaccessible for userspace operations.  
If the call to psp_hw_fini was moved before drm_dev_unplug,  userspace could 
access the gpu device but the psp might be removing. It has unknown issues.


+Andrey Grodzovsky

We should fix the ordering in amdgpu_pci_remove() then I guess?  There
are lots of places where drm_dev_enter() is used to protect access to
the hardware which could be similarly affected.

Alex



We probably can try to move drm_dev_unplug after 
amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must 
be the first thing we do in amdgpu_pci_remove and what impact it will 
have but maybe give it a try.
Also see if you can run libdrm hotplug test suite before and after the 
change.


Andrey





2. psp_hw_fini is called by the .hw_fini iterator in 
amdgpu_device_ip_fini_early, referring to the code starting from 
amdgpu_pci_remove to .hw_fini is called,
there are many preparatory operations before calling .hw_fini,  which makes 
it very difficult to change the order of psp_hw_fini or all block .hw_fini.

So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device?

-Original Message-
From: Alex Deucher 
Sent: Monday, August 15, 2022 10:22 PM
To: Chai, Thomas 
Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Chen, Guchun 
; Chai, Thomas 
Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to 
psp when amdgpu is uninstalled

On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai  wrote:

The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
unload messages to psp to terminate ras, asd and tmr.
But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
 drm_dev_unplug
 ..
 amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
 ->.hw_fini->psp_hw_fini->...
 ->psp_ta_unload->psp_cmd_submit_buf
 ..
}
The program will return when calling drm_dev_enter in
psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
so that the TA unload messages can be sent to the psp when amdgpu is
uninstalled.

Signed-off-by: YiPeng Chai 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
  1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b067ce45d226..0578d8d094a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
 if (psp->adev->no_hw_access)
 return 0;

-   if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
-   return 0;
-

This check is to prevent the hardware from being accessed if the card is 
removed.  I think we need to fix the ordering elsewhere.

Alex


 memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);

 memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context 
*psp,
 }

  exit:
-   drm_dev_exit(idx);
 return ret;
  }

--
2.25.1



Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-17 Thread Alex Deucher
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas  wrote:
>
> [AMD Official Use Only - General]
>
> Hi Alex:
>   When removing an amdgpu device, it may be difficult to change the order of 
> psp_hw_fini calls.
>
> 1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove 
> function,  which makes the gpu device inaccessible for userspace operations.  
> If the call to psp_hw_fini was moved before drm_dev_unplug,  userspace could 
> access the gpu device but the psp might be removing. It has unknown issues.
>

+Andrey Grodzovsky

We should fix the ordering in amdgpu_pci_remove() then I guess?  There
are lots of places where drm_dev_enter() is used to protect access to
the hardware which could be similarly affected.

Alex

> 2. psp_hw_fini is called by the .hw_fini iterator in 
> amdgpu_device_ip_fini_early, referring to the code starting from 
> amdgpu_pci_remove to .hw_fini is called,
>there are many preparatory operations before calling .hw_fini,  which 
> makes it very difficult to change the order of psp_hw_fini or all block 
> .hw_fini.
>
>So can we do a workaround in psp_cmd_submit_buf when removing amdgpu 
> device?
>
> -Original Message-
> From: Alex Deucher 
> Sent: Monday, August 15, 2022 10:22 PM
> To: Chai, Thomas 
> Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
> Chen, Guchun ; Chai, Thomas 
> Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to 
> psp when amdgpu is uninstalled
>
> On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai  wrote:
> >
> > The psp_cmd_submit_buf function is called by psp_hw_fini to send TA
> > unload messages to psp to terminate ras, asd and tmr.
> > But when amdgpu is uninstalled, drm_dev_unplug is called earlier than
> > psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
> > static void amdgpu_pci_remove(struct pci_dev *pdev) {
> > drm_dev_unplug
> > ..
> > amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
> > ->.hw_fini->psp_hw_fini->...
> > ->psp_ta_unload->psp_cmd_submit_buf
> > ..
> > }
> > The program will return when calling drm_dev_enter in
> > psp_cmd_submit_buf.
> >
> > So the call to drm_dev_enter in psp_cmd_submit_buf should be removed,
> > so that the TA unload messages can be sent to the psp when amdgpu is
> > uninstalled.
> >
> > Signed-off-by: YiPeng Chai 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > index b067ce45d226..0578d8d094a7 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> > @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
> > if (psp->adev->no_hw_access)
> > return 0;
> >
> > -   if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
> > -   return 0;
> > -
>
> This check is to prevent the hardware from being accessed if the card is 
> removed.  I think we need to fix the ordering elsewhere.
>
> Alex
>
> > memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
> >
> > memcpy(psp->cmd_buf_mem, cmd, sizeof(struct
> > psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct 
> > psp_context *psp,
> > }
> >
> >  exit:
> > -   drm_dev_exit(idx);
> > return ret;
> >  }
> >
> > --
> > 2.25.1
> >


RE: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled

2022-08-16 Thread Chai, Thomas
[AMD Official Use Only - General]

Hi Alex:
  When removing an amdgpu device, it may be difficult to change the order of 
psp_hw_fini calls.

1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove 
function,  which makes the gpu device inaccessible for userspace operations.  
If the call to psp_hw_fini was moved before drm_dev_unplug,  userspace could 
access the gpu device but the psp might be removing. It has unknown issues.

2. psp_hw_fini is called by the .hw_fini iterator in 
amdgpu_device_ip_fini_early, referring to the code starting from 
amdgpu_pci_remove to .hw_fini is called,
   there are many preparatory operations before calling .hw_fini,  which makes 
it very difficult to change the order of psp_hw_fini or all block .hw_fini.
   
   So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device?

-Original Message-
From: Alex Deucher  
Sent: Monday, August 15, 2022 10:22 PM
To: Chai, Thomas 
Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; 
Chen, Guchun ; Chai, Thomas 
Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to 
psp when amdgpu is uninstalled

On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai  wrote:
>
> The psp_cmd_submit_buf function is called by psp_hw_fini to send TA 
> unload messages to psp to terminate ras, asd and tmr.
> But when amdgpu is uninstalled, drm_dev_unplug is called earlier than 
> psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
> static void amdgpu_pci_remove(struct pci_dev *pdev) {
> drm_dev_unplug
> ..
> amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
> ->.hw_fini->psp_hw_fini->...
> ->psp_ta_unload->psp_cmd_submit_buf
> ..
> }
> The program will return when calling drm_dev_enter in 
> psp_cmd_submit_buf.
>
> So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, 
> so that the TA unload messages can be sent to the psp when amdgpu is 
> uninstalled.
>
> Signed-off-by: YiPeng Chai 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index b067ce45d226..0578d8d094a7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp,
> if (psp->adev->no_hw_access)
> return 0;
>
> -   if (!drm_dev_enter(adev_to_drm(psp->adev), &idx))
> -   return 0;
> -

This check is to prevent the hardware from being accessed if the card is 
removed.  I think we need to fix the ordering elsewhere.

Alex

> memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE);
>
> memcpy(psp->cmd_buf_mem, cmd, sizeof(struct 
> psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context 
> *psp,
> }
>
>  exit:
> -   drm_dev_exit(idx);
> return ret;
>  }
>
> --
> 2.25.1
>