Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
I see, thanks for clarifying. So this is happening because we unmap the HIQ with direct MMIO register writes instead of using the KIQ. I'm OK with this patch as a workaround, but as a proper fix, we should probably add a hiq_hqd_destroy function that uses KIQ, similar to how we have hiq_mqd_load functions that use KIQ to map the HIQ.

Regards,
  Felix

On 2022-01-27 at 21:34, Yin, Tianci (Rico) wrote:
> [AMD Official Use Only]
>
> The error message is from the HIQ dequeue procedure, not from an HCQ, so
> no doorbell write is involved.
>
> Regards,
> Rico
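The difference between the two unmap paths discussed above can be sketched with a small toy model. This is only an illustration under stated assumptions: every type and function name below is hypothetical (none of them is amdgpu or KFD API), and the model simply assumes that a direct register write is lost while the block is in gfxoff, whereas a packet placed on a ring followed by a doorbell write wakes the block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. Every name here is hypothetical, not amdgpu/KFD API. */
#define TOY_RING_SIZE 8

enum toy_pkt { TOY_PKT_NONE = 0, TOY_PKT_UNMAP_HIQ };

struct toy_cp {
	bool powered;                     /* false while the GFX block is in gfxoff */
	bool hiq_mapped;                  /* HIQ currently occupies a hardware slot */
	enum toy_pkt ring[TOY_RING_SIZE]; /* stand-in for the KIQ ring in memory */
	size_t wptr, rptr;
};

/* Direct register path: the write is lost while the block is powered off. */
bool toy_unmap_hiq_mmio(struct toy_cp *cp)
{
	if (cp->powered)
		cp->hiq_mapped = false;
	return !cp->hiq_mapped;
}

/* Firmware side of the toy: consume every pending packet. */
static void toy_cp_process(struct toy_cp *cp)
{
	while (cp->rptr != cp->wptr) {
		if (cp->ring[cp->rptr % TOY_RING_SIZE] == TOY_PKT_UNMAP_HIQ)
			cp->hiq_mapped = false;
		cp->rptr++;
	}
}

/* Ring path: queue a packet in memory, then ring the doorbell. */
bool toy_unmap_hiq_ring(struct toy_cp *cp)
{
	cp->ring[cp->wptr % TOY_RING_SIZE] = TOY_PKT_UNMAP_HIQ;
	cp->wptr++;
	cp->powered = true; /* assumption: the doorbell write wakes the block */
	toy_cp_process(cp);
	return !cp->hiq_mapped;
}

/* Self-check: from gfxoff, the MMIO path fails and the ring path succeeds. */
void toy_unmap_selfcheck(void)
{
	struct toy_cp cp = { .powered = false, .hiq_mapped = true };

	assert(!toy_unmap_hiq_mmio(&cp));
	assert(toy_unmap_hiq_ring(&cp));
}
```

Calling toy_unmap_selfcheck() from a test program exercises both paths. A real hiq_hqd_destroy built on the KIQ would look quite different; the sketch only shows why routing the request through a ring sidesteps the powered-off register problem.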
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
[AMD Official Use Only]

The error message is from the HIQ dequeue procedure, not from an HCQ, so no doorbell write is involved.

Jan 25 16:10:58 lnx-ci-node kernel: [18161.477067] Call Trace:
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477072]  dump_stack+0x7d/0x9c
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477651]  hqd_destroy_v10_3+0x58/0x254 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.48]  destroy_mqd+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477884]  kernel_queue_uninit+0xcf/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.477985]  pm_uninit+0x1a/0x30 [amdgpu]    # kernel_queue_uninit(pm->priv_queue, hanging); this priv_queue == HIQ
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478127]  stop_cpsch+0x98/0x100 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478242]  kgd2kfd_suspend.part.0+0x32/0x50 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478338]  kgd2kfd_suspend+0x1b/0x20 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478433]  amdgpu_amdkfd_suspend+0x1e/0x30 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478529]  amdgpu_device_fini_hw+0x182/0x335 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478655]  amdgpu_driver_unload_kms+0x5c/0x80 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478732]  amdgpu_pci_remove+0x27/0x40 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478806]  pci_device_remove+0x3e/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478809]  device_release_driver_internal+0x103/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478813]  driver_detach+0x4c/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478814]  bus_remove_driver+0x5c/0xd0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478815]  driver_unregister+0x31/0x50
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478817]  pci_unregister_driver+0x40/0x90
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478818]  amdgpu_exit+0x15/0x2d1 [amdgpu]
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478942]  __x64_sys_delete_module+0x147/0x260
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478944]  ? exit_to_user_mode_prepare+0x41/0x1d0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478946]  ? ksys_write+0x67/0xe0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478948]  do_syscall_64+0x40/0xb0
Jan 25 16:10:58 lnx-ci-node kernel: [18161.478951]  entry_SYSCALL_64_after_hwframe+0x44/0xae

Regards,
Rico

From: Kuehling, Felix
Sent: Thursday, January 27, 2022 23:28
To: Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

The hang you're seeing is the result of a command submission of an UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this doorbell not trigger gfxoff exit during rmmod?

Regards,
  Felix
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
The hang you're seeing is the result of a command submission of an UNMAP_QUEUES and QUERY_STATUS command to the HIQ. This is done using a doorbell. KFD writes commands to the HIQ and rings a doorbell to wake up the HWS (see kq_submit_packet in kfd_kernel_queue.c). Why does this doorbell not trigger gfxoff exit during rmmod?

Regards,
  Felix

On 2022-01-26 at 22:38, Yin, Tianci (Rico) wrote:
> [AMD Official Use Only]
>
> The rmmod operation has two prerequisites: the multi-user target and a
> blacklisted amdgpu. This is an IGT requirement so that IGT can make
> itself DRM master to test KMS.
>
> igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload
>
> From my understanding, a KFD process takes the regular path to gfxoff
> exit, in which a doorbell write triggers the exit. For example, KFD maps
> an HCQ through a command on the HIQ or KIQ ring, or the UMD commits jobs
> on an HCQ; both trigger a doorbell write (please refer to
> gfx_v10_0_ring_set_wptr_compute()).
>
> As for the IGT reload test, the dequeue request does not go through a
> command on a ring; it writes CP registers directly, so the GFX core
> remains in gfxoff.
>
> Thanks,
> Rico
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
[AMD Official Use Only]

The rmmod operation has two prerequisites: the multi-user target and a blacklisted amdgpu. This is an IGT requirement so that IGT can make itself DRM master to test KMS.

igt-gpu-tools/build/tests/amdgpu/amd_module_load --run-subtest reload

From my understanding, a KFD process takes the regular path to gfxoff exit, in which a doorbell write triggers the exit. For example, KFD maps an HCQ through a command on the HIQ or KIQ ring, or the UMD commits jobs on an HCQ; both trigger a doorbell write (please refer to gfx_v10_0_ring_set_wptr_compute()).

As for the IGT reload test, the dequeue request does not go through a command on a ring; it writes CP registers directly, so the GFX core remains in gfxoff.

Thanks,
Rico

From: Kuehling, Felix
Sent: Wednesday, January 26, 2022 23:08
To: Yin, Tianci (Rico); Wang, Yang(Kevin); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

My question is, why is this problem only seen during module unload? Why aren't we seeing HWS hangs due to GFX_OFF all the time in normal operations? For example when the GPU is idle and a new KFD process is started, creating a new runlist. Are we just getting lucky because the process first has to allocate some memory, which maybe makes some HW access (flushing TLBs etc.) that wakes up the GPU?

Regards,
  Felix
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
My question is, why is this problem only seen during module unload? Why aren't we seeing HWS hangs due to GFX_OFF all the time in normal operations? For example when the GPU is idle and a new KFD process is started, creating a new runlist. Are we just getting lucky because the process first has to allocate some memory, which maybe makes some HW access (flushing TLBs etc.) that wakes up the GPU?

Regards,
  Felix

On 2022-01-26 at 01:43, Yin, Tianci (Rico) wrote:
> [AMD Official Use Only]
>
> Thanks Kevin and Felix!
>
> In gfxoff state, the dequeue request (made by writing CP registers)
> cannot make the GPU exit gfxoff. The CP is actually powered off, so the
> CP register write is invalid. Writing the doorbell registers (the
> regular way) or directly requesting the SMU to disable GFX powergating
> (by invoking amdgpu_gfx_off_ctrl) can trigger gfxoff exit.
>
> I have also tried
> amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_COMPUTE, false),
> but it has no effect.
>
> Thanks again!
> Rico
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
[AMD Official Use Only]

Thanks Kevin and Felix!

In gfxoff state, the dequeue request (made by writing CP registers) cannot make the GPU exit gfxoff. The CP is actually powered off, so the CP register write is invalid. Writing the doorbell registers (the regular way) or directly requesting the SMU to disable GFX powergating (by invoking amdgpu_gfx_off_ctrl) can trigger gfxoff exit.

I have also tried amdgpu_dpm_switch_power_profile(adev, PP_SMC_POWER_PROFILE_COMPUTE, false), but it has no effect.

[10386.162273] amdgpu: cp queue pipe 4 queue 0 preemption failed
[10671.225065] amdgpu: mmCP_HQD_ACTIVE : 0x
[10386.162290] amdgpu: mmCP_HQD_HQ_STATUS0 : 0x
[10386.162297] amdgpu: mmCP_STAT : 0x
[10386.162303] amdgpu: mmCP_BUSY_STAT : 0x
[10386.162308] amdgpu: mmRLC_STAT : 0x
[10386.162314] amdgpu: mmGRBM_STATUS : 0x
[10386.162320] amdgpu: mmGRBM_STATUS2: 0x

Thanks again!
Rico

From: Kuehling, Felix
Sent: Tuesday, January 25, 2022 23:31
To: Wang, Yang(Kevin); Yin, Tianci (Rico); amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Chen, Guchun
Subject: Re: [PATCH] drm/amdgpu: Fix an error message in rmmod

I have no objection to the change. It restores the sequence that was used before e9669fb78262. But I don't understand why GFX_OFF is causing a preemption error during module unload, but not when KFD is in normal use. Maybe it's because of the compute power profile that's normally set by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.

Either way, the patch is

Acked-by: Felix Kuehling
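The behaviour Rico describes (register writes are ignored in gfxoff unless something first forces the block awake) can be illustrated with a toy model. This is a sketch under assumptions, not driver code: the struct, the register field, and both functions are hypothetical stand-ins. toy_gfx_off_ctrl() is only assumed to play a role loosely similar to amdgpu_gfx_off_ctrl(adev, enable), using a simple request counter, and the toy re-enters gfxoff immediately where real hardware would presumably wait for an idle delay.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only. Every name here is hypothetical, not amdgpu API. */
struct toy_gfx {
	int      off_disable_count;   /* outstanding "keep GFX powered" requests */
	bool     in_gfxoff;           /* true while the block is power gated */
	uint32_t hqd_dequeue_request; /* stand-in for a CP dequeue register */
};

/*
 * enable == false: leave gfxoff and stay out until the request is dropped.
 * enable == true:  drop one request; the toy re-enters gfxoff at zero.
 */
void toy_gfx_off_ctrl(struct toy_gfx *g, bool enable)
{
	if (!enable) {
		g->off_disable_count++;
		g->in_gfxoff = false;
	} else if (g->off_disable_count > 0 && --g->off_disable_count == 0) {
		g->in_gfxoff = true;
	}
}

/* Returns true only if the register write actually reached the block. */
bool toy_write_dequeue_request(struct toy_gfx *g, uint32_t value)
{
	if (g->in_gfxoff)
		return false; /* block is powered off, the write is lost */
	g->hqd_dequeue_request = value;
	return true;
}

/* Self-check: the write lands only inside a disable/enable bracket. */
void toy_gfxoff_selfcheck(void)
{
	struct toy_gfx g = { .in_gfxoff = true };

	assert(!toy_write_dequeue_request(&g, 1));
	toy_gfx_off_ctrl(&g, false);
	assert(toy_write_dequeue_request(&g, 1));
	toy_gfx_off_ctrl(&g, true);
	assert(g.in_gfxoff);
}
```

The bracket pattern (disable gfxoff, touch registers, re-enable) is the point of the sketch; the counter is one plausible way to let nested callers share it.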
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
I have no objection to the change. It restores the sequence that was used before e9669fb78262. But I don't understand why GFX_OFF is causing a preemption error during module unload, but not when KFD is in normal use. Maybe it's because of the compute power profile that's normally set by amdgpu_amdkfd_set_compute_idle before we interact with the HWS.

Either way, the patch is

Acked-by: Felix Kuehling

On 2022-01-25 at 05:48, Wang, Yang(Kevin) wrote:
> [AMD Official Use Only]
>
> The issue was introduced by the following patch, so it would be better
> to add this information:
>
> fixes: (e9669fb78262) drm/amdgpu: Add early fini callback
>
> Reviewed-by: Yang Wang
>
> Best Regards,
> Kevin
Re: [PATCH] drm/amdgpu: Fix an error message in rmmod
[AMD Official Use Only]

The issue was introduced by the following patch, so it would be better to add this information:

fixes: (e9669fb78262) drm/amdgpu: Add early fini callback

Reviewed-by: Yang Wang

Best Regards,
Kevin

From: amd-gfx on behalf of Tianci Yin
Sent: Tuesday, January 25, 2022 6:03 PM
To: amd-gfx@lists.freedesktop.org
Cc: Grodzovsky, Andrey; Yin, Tianci (Rico); Chen, Guchun
Subject: [PATCH] drm/amdgpu: Fix an error message in rmmod

From: "Tianci.Yin"

[why]
In rmmod procedure, kfd sends cp a dequeue request, but the
request does not get response, then an error message "cp
queue pipe 4 queue 0 preemption failed" printed.

[how]
Performing kfd suspending after disabling gfxoff can fix it.
[PATCH] drm/amdgpu: Fix an error message in rmmod
From: "Tianci.Yin"

[why]
In rmmod procedure, kfd sends cp a dequeue request, but the
request does not get response, then an error message "cp
queue pipe 4 queue 0 preemption failed" printed.

[how]
Performing kfd suspending after disabling gfxoff can fix it.

Change-Id: I0453f28820542d4a5ab26e38fb5b87ed76ce6930
Signed-off-by: Tianci.Yin
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b75d67f644e5..77e9837ba342 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2720,11 +2720,11 @@ static int amdgpu_device_ip_fini_early(struct amdgpu_device *adev)
 		}
 	}
 
-	amdgpu_amdkfd_suspend(adev, false);
-
 	amdgpu_device_set_pg_state(adev, AMD_PG_STATE_UNGATE);
 	amdgpu_device_set_cg_state(adev, AMD_CG_STATE_UNGATE);
 
+	amdgpu_amdkfd_suspend(adev, false);
+
 	/* Workaroud for ASICs need to disable SMC first */
 	amdgpu_device_smu_fini_early(adev);
 
-- 
2.25.1
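The effect of the reordering in this patch can be sketched with a toy model of the early teardown sequence. The sketch is illustrative only: the struct and all toy_* functions are hypothetical stand-ins for the real amdgpu calls in the diff, and the model simply assumes that suspending KFD while the GFX block is still power gated produces one "preemption failed" event.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only. Every name here is hypothetical, not amdgpu API. */
struct toy_dev {
	bool gfx_power_gated; /* GFX may sit in gfxoff while gating is allowed */
	bool kfd_suspended;
	int  preempt_errors;  /* counts "preemption failed" events */
};

/* Stand-in for ungating power gating, which takes the block out of gfxoff. */
void toy_set_pg_ungate(struct toy_dev *d)
{
	d->gfx_power_gated = false;
}

/* Stand-in for the KFD suspend step, which dequeues the HIQ via registers. */
void toy_kfd_suspend(struct toy_dev *d)
{
	if (d->gfx_power_gated)
		d->preempt_errors++; /* the dequeue request gets no response */
	d->kfd_suspended = true;
}

/* Order before the patch: suspend KFD first, then ungate. */
int toy_fini_early_old(struct toy_dev *d)
{
	toy_kfd_suspend(d);
	toy_set_pg_ungate(d);
	return d->preempt_errors;
}

/* Order after the patch: ungate first, then suspend KFD. */
int toy_fini_early_new(struct toy_dev *d)
{
	toy_set_pg_ungate(d);
	toy_kfd_suspend(d);
	return d->preempt_errors;
}

/* Self-check: only the old order produces the error. */
void toy_fini_selfcheck(void)
{
	struct toy_dev before = { .gfx_power_gated = true };
	struct toy_dev after  = { .gfx_power_gated = true };

	assert(toy_fini_early_old(&before) == 1);
	assert(toy_fini_early_new(&after) == 0);
}
```

Both orders end with KFD suspended and the block ungated; the only difference in the model is whether the dequeue request is issued while the block can still respond.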