Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
On 2022-08-17 10:01, Andrey Grodzovsky wrote: On 2022-08-17 09:44, Alex Deucher wrote: On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas wrote: [AMD Official Use Only - General] Hi Alex: When removing an amdgpu device, it may be difficult to change the order of psp_hw_fini calls. 1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove function, which makes the gpu device inaccessible for userspace operations. If the call to psp_hw_fini was moved before drm_dev_unplug, userspace could access the gpu device but the psp might be removing. It has unknown issues. +Andrey Grodzovsky We should fix the ordering in amdgpu_pci_remove() then I guess? There are lots of places where drm_dev_enter() is used to protect access to the hardware which could be similarly affected. Alex We probably can try to move drm_dev_unplug after amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must be the first thing we do in amdgpu_pci_remove and what impact it will have but maybe give it a try. Also see if you can run libdrm hotplug test suite before and after the change. Andrey Thinking a bit more about this - i guess the main problem with this will be that in case of real hot unplug (which is hard to test unless you have a real GPU cage with extenal GPU) this move will cause trying to accesses HW registers or MMIO ranges from VRAM BOs when HW is missing when you try to shut down the HW in HW fini IP block specific callbacks , this in turn will return garbage for reads (or all 1s maybe) which is what we probably were trying to avoid by putting drm_dev_unplug as the first thing. So it's probably a bit problematic. Andrey 2. psp_hw_fini is called by the .hw_fini iterator in amdgpu_device_ip_fini_early, referring to the code starting from amdgpu_pci_remove to .hw_fini is called, there are many preparatory operations before calling .hw_fini, which makes it very difficult to change the order of psp_hw_fini or all block .hw_fini. So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device? -Original Message- From: Alex Deucher Sent: Monday, August 15, 2022 10:22 PM To: Chai, Thomas Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Chen, Guchun ; Chai, Thomas Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai wrote: The psp_cmd_submit_buf function is called by psp_hw_fini to send TA unload messages to psp to terminate ras, asd and tmr. But when amdgpu is uninstalled, drm_dev_unplug is called earlier than psp_hw_fini in amdgpu_pci_remove, the calling order as follows: static void amdgpu_pci_remove(struct pci_dev *pdev) { drm_dev_unplug .. amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... ->.hw_fini->psp_hw_fini->... ->psp_ta_unload->psp_cmd_submit_buf .. } The program will return when calling drm_dev_enter in psp_cmd_submit_buf. So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, so that the TA unload messages can be sent to the psp when amdgpu is uninstalled. Signed-off-by: YiPeng Chai --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index b067ce45d226..0578d8d094a7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp, if (psp->adev->no_hw_access) return 0; - if (!drm_dev_enter(adev_to_drm(psp->adev), &idx)) - return 0; - This check is to prevent the hardware from being accessed if the card is removed. I think we need to fix the ordering elsewhere. Alex memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE); memcpy(psp->cmd_buf_mem, cmd, sizeof(struct psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context *psp, } exit: - drm_dev_exit(idx); return ret; } -- 2.25.1
Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai wrote: > > The psp_cmd_submit_buf function is called by psp_hw_fini to > send TA unload messages to psp to terminate ras, asd and tmr. > But when amdgpu is uninstalled, drm_dev_unplug is called > earlier than psp_hw_fini in amdgpu_pci_remove, the calling > order as follows: > static void amdgpu_pci_remove(struct pci_dev *pdev) > { > drm_dev_unplug > .. > amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... > ->.hw_fini->psp_hw_fini->... > ->psp_ta_unload->psp_cmd_submit_buf > .. > } > The program will return when calling drm_dev_enter in > psp_cmd_submit_buf. > > So the call to drm_dev_enter in psp_cmd_submit_buf should > be removed, so that the TA unload messages can be sent to the > psp when amdgpu is uninstalled. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > index b067ce45d226..0578d8d094a7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp, > if (psp->adev->no_hw_access) > return 0; > > - if (!drm_dev_enter(adev_to_drm(psp->adev), &idx)) > - return 0; > - This check is to prevent the hardware from being accessed if the card is removed. I think we need to fix the ordering elsewhere. Alex > memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE); > > memcpy(psp->cmd_buf_mem, cmd, sizeof(struct psp_gfx_cmd_resp)); > @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context *psp, > } > > exit: > - drm_dev_exit(idx); > return ret; > } > > -- > 2.25.1 >
Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
On 2022-08-17 09:44, Alex Deucher wrote: On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas wrote: [AMD Official Use Only - General] Hi Alex: When removing an amdgpu device, it may be difficult to change the order of psp_hw_fini calls. 1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove function, which makes the gpu device inaccessible for userspace operations. If the call to psp_hw_fini was moved before drm_dev_unplug, userspace could access the gpu device but the psp might be removing. It has unknown issues. +Andrey Grodzovsky We should fix the ordering in amdgpu_pci_remove() then I guess? There are lots of places where drm_dev_enter() is used to protect access to the hardware which could be similarly affected. Alex We probably can try to move drm_dev_unplug after amdgpu_driver_unload_kms. I don't remember now why drm_dev_unplug must be the first thing we do in amdgpu_pci_remove and what impact it will have but maybe give it a try. Also see if you can run libdrm hotplug test suite before and after the change. Andrey 2. psp_hw_fini is called by the .hw_fini iterator in amdgpu_device_ip_fini_early, referring to the code starting from amdgpu_pci_remove to .hw_fini is called, there are many preparatory operations before calling .hw_fini, which makes it very difficult to change the order of psp_hw_fini or all block .hw_fini. So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device? -Original Message- From: Alex Deucher Sent: Monday, August 15, 2022 10:22 PM To: Chai, Thomas Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Chen, Guchun ; Chai, Thomas Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai wrote: The psp_cmd_submit_buf function is called by psp_hw_fini to send TA unload messages to psp to terminate ras, asd and tmr. But when amdgpu is uninstalled, drm_dev_unplug is called earlier than psp_hw_fini in amdgpu_pci_remove, the calling order as follows: static void amdgpu_pci_remove(struct pci_dev *pdev) { drm_dev_unplug .. amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... ->.hw_fini->psp_hw_fini->... ->psp_ta_unload->psp_cmd_submit_buf .. } The program will return when calling drm_dev_enter in psp_cmd_submit_buf. So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, so that the TA unload messages can be sent to the psp when amdgpu is uninstalled. Signed-off-by: YiPeng Chai --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c index b067ce45d226..0578d8d094a7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp, if (psp->adev->no_hw_access) return 0; - if (!drm_dev_enter(adev_to_drm(psp->adev), &idx)) - return 0; - This check is to prevent the hardware from being accessed if the card is removed. I think we need to fix the ordering elsewhere. Alex memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE); memcpy(psp->cmd_buf_mem, cmd, sizeof(struct psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context *psp, } exit: - drm_dev_exit(idx); return ret; } -- 2.25.1
Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
On Tue, Aug 16, 2022 at 10:54 PM Chai, Thomas wrote: > > [AMD Official Use Only - General] > > Hi Alex: > When removing an amdgpu device, it may be difficult to change the order of > psp_hw_fini calls. > > 1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove > function, which makes the gpu device inaccessible for userspace operations. > If the call to psp_hw_fini was moved before drm_dev_unplug, userspace could > access the gpu device but the psp might be removing. It has unknown issues. > +Andrey Grodzovsky We should fix the ordering in amdgpu_pci_remove() then I guess? There are lots of places where drm_dev_enter() is used to protect access to the hardware which could be similarly affected. Alex > 2. psp_hw_fini is called by the .hw_fini iterator in > amdgpu_device_ip_fini_early, referring to the code starting from > amdgpu_pci_remove to .hw_fini is called, >there are many preparatory operations before calling .hw_fini, which > makes it very difficult to change the order of psp_hw_fini or all block > .hw_fini. > >So can we do a workaround in psp_cmd_submit_buf when removing amdgpu > device? > > -Original Message- > From: Alex Deucher > Sent: Monday, August 15, 2022 10:22 PM > To: Chai, Thomas > Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; > Chen, Guchun ; Chai, Thomas > Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to > psp when amdgpu is uninstalled > > On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai wrote: > > > > The psp_cmd_submit_buf function is called by psp_hw_fini to send TA > > unload messages to psp to terminate ras, asd and tmr. > > But when amdgpu is uninstalled, drm_dev_unplug is called earlier than > > psp_hw_fini in amdgpu_pci_remove, the calling order as follows: > > static void amdgpu_pci_remove(struct pci_dev *pdev) { > > drm_dev_unplug > > .. > > amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... > > ->.hw_fini->psp_hw_fini->... > > ->psp_ta_unload->psp_cmd_submit_buf > > .. > > } > > The program will return when calling drm_dev_enter in > > psp_cmd_submit_buf. > > > > So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, > > so that the TA unload messages can be sent to the psp when amdgpu is > > uninstalled. > > > > Signed-off-by: YiPeng Chai > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 > > 1 file changed, 4 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > index b067ce45d226..0578d8d094a7 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > > @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp, > > if (psp->adev->no_hw_access) > > return 0; > > > > - if (!drm_dev_enter(adev_to_drm(psp->adev), &idx)) > > - return 0; > > - > > This check is to prevent the hardware from being accessed if the card is > removed. I think we need to fix the ordering elsewhere. > > Alex > > > memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE); > > > > memcpy(psp->cmd_buf_mem, cmd, sizeof(struct > > psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct > > psp_context *psp, > > } > > > > exit: > > - drm_dev_exit(idx); > > return ret; > > } > > > > -- > > 2.25.1 > >
RE: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled
[AMD Official Use Only - General] Hi Alex: When removing an amdgpu device, it may be difficult to change the order of psp_hw_fini calls. 1. The drm_dev_unplug call is at the beginning in the amdgpu_pci_remove function, which makes the gpu device inaccessible for userspace operations. If the call to psp_hw_fini was moved before drm_dev_unplug, userspace could access the gpu device but the psp might be removing. It has unknown issues. 2. psp_hw_fini is called by the .hw_fini iterator in amdgpu_device_ip_fini_early, referring to the code starting from amdgpu_pci_remove to .hw_fini is called, there are many preparatory operations before calling .hw_fini, which makes it very difficult to change the order of psp_hw_fini or all block .hw_fini. So can we do a workaround in psp_cmd_submit_buf when removing amdgpu device? -Original Message- From: Alex Deucher Sent: Monday, August 15, 2022 10:22 PM To: Chai, Thomas Cc: amd-gfx@lists.freedesktop.org; Zhang, Hawking ; Chen, Guchun ; Chai, Thomas Subject: Re: [PATCH] drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled On Mon, Aug 15, 2022 at 3:06 AM YiPeng Chai wrote: > > The psp_cmd_submit_buf function is called by psp_hw_fini to send TA > unload messages to psp to terminate ras, asd and tmr. > But when amdgpu is uninstalled, drm_dev_unplug is called earlier than > psp_hw_fini in amdgpu_pci_remove, the calling order as follows: > static void amdgpu_pci_remove(struct pci_dev *pdev) { > drm_dev_unplug > .. > amdgpu_driver_unload_kms->amdgpu_device_fini_hw->... > ->.hw_fini->psp_hw_fini->... > ->psp_ta_unload->psp_cmd_submit_buf > .. > } > The program will return when calling drm_dev_enter in > psp_cmd_submit_buf. > > So the call to drm_dev_enter in psp_cmd_submit_buf should be removed, > so that the TA unload messages can be sent to the psp when amdgpu is > uninstalled. > > Signed-off-by: YiPeng Chai > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > index b067ce45d226..0578d8d094a7 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c > @@ -585,9 +585,6 @@ psp_cmd_submit_buf(struct psp_context *psp, > if (psp->adev->no_hw_access) > return 0; > > - if (!drm_dev_enter(adev_to_drm(psp->adev), &idx)) > - return 0; > - This check is to prevent the hardware from being accessed if the card is removed. I think we need to fix the ordering elsewhere. Alex > memset(psp->cmd_buf_mem, 0, PSP_CMD_BUFFER_SIZE); > > memcpy(psp->cmd_buf_mem, cmd, sizeof(struct > psp_gfx_cmd_resp)); @@ -651,7 +648,6 @@ psp_cmd_submit_buf(struct psp_context > *psp, > } > > exit: > - drm_dev_exit(idx); > return ret; > } > > -- > 2.25.1 >