[PATCH] drm/amdgpu: get RAS poison status from DF v4_6_2
Add DF block and RAS poison mode query for DF v4_6_2. Signed-off-by: Tao Zhou Reviewed-by: Stanley.Yang --- drivers/gpu/drm/amd/amdgpu/Makefile | 3 +- drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c | 4 +++ drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c| 34 +++ drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h| 31 + 4 files changed, 71 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c create mode 100644 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile b/drivers/gpu/drm/amd/amdgpu/Makefile index ec1daf7112a9..260e32ef7bae 100644 --- a/drivers/gpu/drm/amd/amdgpu/Makefile +++ b/drivers/gpu/drm/amd/amdgpu/Makefile @@ -104,7 +104,8 @@ amdgpu-y += \ amdgpu-y += \ df_v1_7.o \ df_v3_6.o \ - df_v4_3.o + df_v4_3.o \ + df_v4_6_2.o # add GMC block amdgpu-y += \ diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c index 17d4311e22d5..8d3681172cea 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c @@ -35,6 +35,7 @@ #include "df_v1_7.h" #include "df_v3_6.h" #include "df_v4_3.h" +#include "df_v4_6_2.h" #include "nbio_v6_1.h" #include "nbio_v7_0.h" #include "nbio_v7_4.h" @@ -2557,6 +2558,9 @@ int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev) case IP_VERSION(4, 3, 0): adev->df.funcs = &df_v4_3_funcs; break; + case IP_VERSION(4, 6, 2): + adev->df.funcs = &df_v4_6_2_funcs; + break; default: break; } diff --git a/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c new file mode 100644 index ..a47960a0babd --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c @@ -0,0 +1,34 @@ +/* + * Copyright 2023 Advanced Micro Devices, Inc. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ +#include "amdgpu.h" +#include "df_v4_6_2.h" + +static bool df_v4_6_2_query_ras_poison_mode(struct amdgpu_device *adev) +{ + /* return true since related regs are inaccessible */ + return true; +} + +const struct amdgpu_df_funcs df_v4_6_2_funcs = { + .query_ras_poison_mode = df_v4_6_2_query_ras_poison_mode, +}; diff --git a/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h new file mode 100644 index ..3bc3e6d216e2 --- /dev/null +++ b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h @@ -0,0 +1,31 @@ +/* + * Copyright 2023 Advanced Micro Devices, Inc. 
+ * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#ifndef __DF_V4_6_2_H__ +#define __DF_V4_6_2_H__ + +#include "soc15_common.h" + +extern const struct amdgpu_df_funcs df_v4_6_2_funcs; + +#endif -- 2.35.1
RE: [PATCH 1/3] drm/amdgpu: ungate power gating when system suspend
[AMD Official Use Only - General] Reviewed-by: Kenneth Feng -----Original Message----- From: Yuan, Perry Sent: Tuesday, October 24, 2023 10:33 AM To: Zhang, Yifan ; Feng, Kenneth ; Limonciello, Mario Cc: Deucher, Alexander ; Wang, Yang(Kevin) ; amd-gfx@lists.freedesktop.org Subject: [PATCH 1/3] drm/amdgpu: ungate power gating when system suspend [Why] During suspend, if GFX DPM is enabled and the GFXOFF feature is enabled, the system may hang. So it is suggested to disable the GFXOFF feature during suspend and enable it after resume. [How] Update the code to disable the GFXOFF feature during suspend and enable it after resume. [ 311.396526] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x [ 311.396530] amdgpu :03:00.0: amdgpu: Fail to disable dpm features! [ 311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -62 Signed-off-by: Perry Yuan Signed-off-by: Kun Liu --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index d9ccacd06fba..6399bc71c56d 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3498,6 +3498,8 @@ static void gfx_v10_0_ring_invalidate_tlbs(struct amdgpu_ring *ring, static void gfx_v10_0_update_spm_vmid_internal(struct amdgpu_device *adev, unsigned int vmid); +static int gfx_v10_0_set_powergating_state(void *handle, + enum amd_powergating_state state); static void gfx10_kiq_set_resources(struct amdgpu_ring *kiq_ring, uint64_t queue_mask) { amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_RESOURCES, 6)); @@ -7172,6 +7174,13 @@ static int gfx_v10_0_hw_fini(void *handle) amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0); amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0); + /* WA added for Vangogh asic fixing the SMU suspend failure +* It needs to set power gating again during gfxoff control +* otherwise
disallowing gfxoff will fail to take effect. +*/ + if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(10, 3, 1)) + gfx_v10_0_set_powergating_state(handle, AMD_PG_STATE_UNGATE); + if (!adev->no_hw_access) { if (amdgpu_async_gfx_ring) { if (amdgpu_gfx_disable_kgq(adev, 0)) -- 2.34.1
Re: [PATCH 2/2] drm/amdgpu: Add timeout for sync wait
On 20.10.23 11:59, Emily Deng wrote: Issue: A deadlock happens during GPU recovery; the call sequence is as below: amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work-> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait Resolving a deadlock with a timeout is illegal in general. So this patch here is an obvious no-go. In addition to this problem, Xinhu already investigated that the delayed work is causing issues during suspend, because flushing doesn't guarantee that a new one isn't started right after doing that. After talking with Felix about this, the correct solution is to stop flushing the delayed work and instead submit it to the freezable work queue. Regards, Christian. It is because amdgpu_sync_wait is waiting for the bad job's fence and never returns, so the recovery couldn't continue. Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index dcd8c066bc1f..9d4f122a7bf0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr) int i, r; hash_for_each_safe(sync->fences, i, tmp, e, node) { - r = dma_fence_wait(e->fence, intr); - if (r) + struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence); + long timeout = msecs_to_jiffies(1); + + if (s_fence) + timeout = s_fence->sched->timeout; + r = dma_fence_wait_timeout(e->fence, intr, timeout); + if (r == 0) + r = -ETIMEDOUT; + if (r < 0) return r; amdgpu_sync_entry_free(e);
Re: [PATCH] drm/amdgpu: Initialize schedulers before using them
Am 24.10.23 um 04:55 schrieb Luben Tuikov: On 2023-10-23 01:49, Christian König wrote: Am 23.10.23 um 05:23 schrieb Luben Tuikov: Initialize ring schedulers before using them, very early in the amdgpu boot, at PCI probe time, specifically at frame-buffer dumb-create at fill-buffer. This was discovered by using dynamic scheduler run-queues, which showed that amdgpu was using a scheduler before calling drm_sched_init(), and the only reason it was working was because sched_rq[] was statically allocated in the scheduler structure. However, the scheduler structure had _not_ been initialized. When switching to dynamically allocated run-queues, this lack of initialization was causing an oops and a blank screen at boot up. This patch fixes this amdgpu bug. This patch depends on the "drm/sched: Convert the GPU scheduler to variable number of run-queues" patch, as that patch prevents subsequent scheduler initialization if a scheduler has already been initialized. Cc: Christian König Cc: Alex Deucher Cc: Felix Kuehling Cc: AMD Graphics Signed-off-by: Luben Tuikov --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 4e51dce3aab5d6..575ef7e1e30fd4 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -60,6 +60,7 @@ #include "amdgpu_atomfirmware.h" #include "amdgpu_res_cursor.h" #include "bif/bif_4_1_d.h" +#include "amdgpu_reset.h" MODULE_IMPORT_NS(DMA_BUF); @@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) ring = adev->mman.buffer_funcs_ring; sched = &ring->sched; + + r = drm_sched_init(sched, &amdgpu_sched_ops, + DRM_SCHED_PRIORITY_COUNT, + ring->num_hw_submission, 0, + adev->sdma_timeout, adev->reset_domain->wq, + ring->sched_score, ring->name, + adev->dev); + if (r) { + drm_err(adev, "%s: couldn't initialize ring:%s error:%d\n", + __func__, 
ring->name, r); + return; + } That doesn't look correct either. amdgpu_ttm_set_buffer_funcs_status() should only be called with enable=true as argument *after* the copy ring is initialized and valid to use. One part of this ring initialization is to setup the scheduler. It's the only way to keep the functionality of amdgpu_fill_buffer() from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(), from ... without an oops and a blank screen at boot up. Here is a stack of the oops: Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched] Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f Oct 20 22:12:34 fedora kernel: RSP: 0018:c90001613838 EFLAGS: 00010246 Oct 20 22:12:34 fedora kernel: RAX: RBX: 88812f33b400 RCX: 0004 Oct 20 22:12:34 fedora kernel: RDX: RSI: c9000395145c RDI: 88812eacf850 Oct 20 22:12:34 fedora kernel: RBP: 88812eacf850 R08: 0004 R09: 0003 Oct 20 22:12:34 fedora kernel: R10: c066b850 R11: bc848ef1 R12: Oct 20 22:12:34 fedora kernel: R13: 0004 R14: 00800300 R15: 0100 Oct 20 22:12:34 fedora kernel: FS: 7f7be4866940() GS:0ed0() knlGS: Oct 20 22:12:34 fedora kernel: CS: 0010 DS: ES: CR0: 80050033 Oct 20 22:12:34 fedora kernel: CR2: 0008 CR3: 00012cf22000 CR4: 003506e0 Oct 20 22:12:34 fedora kernel: Call Trace: Oct 20 22:12:34 fedora kernel: Oct 20 22:12:34 fedora kernel: ? __die+0x1f/0x70 Oct 20 22:12:34 fedora kernel: ? page_fault_oops+0x149/0x440 Oct 20 22:12:34 fedora kernel: ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched] Oct 20 22:12:34 fedora kernel: ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu] Oct 20 22:12:34 fedora kernel: ? srso_return_thunk+0x5/0x10 Oct 20 22:12:34 fedora kernel: ? do_user_addr_fault+0x65/0x650 Oct 20 22:12:34 fedora kernel: ? drm_client_framebuffer_create+0xa3/0x280 [drm] Oct 20 22:12:34 fedora kernel: ? 
exc_page_fault+0x7b/0x180 Oct 20 22:12:34 fedora kernel: ? asm_exc_page_fault+0x22/0x30 Oct 20 22:12:34 fedora kernel: ? local_pci_probe+0x41/0x90 Oct 20 22:12:34 fedora kernel: ? __pfx_sdma_v5_0_emit_fill_buffer+0x10/0x10 [amdgpu] Oct 20 22:12:34 fedora kernel: ? drm_sched_job_arm+0x1f/0x60 [gpu_sched] Oct 20 22:12:34 fedora kernel: ? drm_sched_job_arm+0x1b/0x60 [gpu_sched] Oct 20
RE: [PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes from S3
[AMD Official Use Only - General] -----Original Message----- From: Yuan, Perry Sent: Tuesday, October 24, 2023 10:33 AM To: Zhang, Yifan ; Feng, Kenneth ; Limonciello, Mario Cc: Deucher, Alexander ; Wang, Yang(Kevin) ; amd-gfx@lists.freedesktop.org Subject: [PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes from S3 Previously the CSIB command packet was sent to the GFX block every time the amdgpu driver loaded or the system resumed from S3. As the CP protocol requires, the CSIB does not need to be sent again when GC was not powered down, i.e. when resuming from an aborted S3 suspend sequence. When a PREAMBLE_CNTL packet arrives in the ring after a PG event where the RLC has already sent its copy of the CSIB, sending another CSIB packet causes a GFX IB test timeout on resume from S3. Add the flag `csib_initialized` to make sure a normal S3 suspend/resume still initializes the CSIB; when the system aborts an S3 suspend and resumes immediately because some suspend callback failed, the GPU was not powered down, so the CSIB command does not need to be sent again. Error dmesg log: amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110). [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110). PM: resume of devices complete after 2373.995 msecs PM: Finishing wakeup. Signed-off-by: Perry Yuan --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 29 ++--- 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 44df1a5bce7f..e5d85ea26a5e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1114,6 +1114,7 @@ struct amdgpu_device { bool debug_vm; bool debug_largebar; bool debug_disable_soft_recovery; + bool csib_initialized; [Kevin]: you'd better use spaces instead of a tab here, to align with the other fields.
}; static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 420196a17e22..a47c9f840754 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2468,6 +2468,11 @@ static int amdgpu_pmops_suspend_noirq(struct device *dev) if (amdgpu_acpi_should_gpu_reset(adev)) return amdgpu_asic_reset(adev); + /* update flag to make sure csib will be sent when system +* resume from normal S3 +*/ + adev->csib_initialized = false; + return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 6399bc71c56d..ab2e3e592dfc 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3481,6 +3481,7 @@ static uint64_t gfx_v10_0_get_gpu_clock_counter(struct amdgpu_device *adev); static void gfx_v10_0_select_se_sh(struct amdgpu_device *adev, u32 se_num, u32 sh_num, u32 instance, int xcc_id); static u32 gfx_v10_0_get_wgp_active_bitmap_per_sh(struct amdgpu_device *adev); +static int gfx_v10_0_wait_for_idle(void *handle); static int gfx_v10_0_rlc_backdoor_autoload_buffer_init(struct amdgpu_device *adev); static void gfx_v10_0_rlc_backdoor_autoload_buffer_fini(struct amdgpu_device *adev); @@ -5958,7 +5959,7 @@ static int gfx_v10_0_cp_gfx_load_microcode(struct amdgpu_device *adev) return 0; } -static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) +static int gfx_v10_csib_submit(struct amdgpu_device *adev) { struct amdgpu_ring *ring; const struct cs_section_def *sect = NULL; @@ -5966,13 +5967,6 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) int r, i; int ctx_reg_offset; - /* init the CP */ - WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT, -adev->gfx.config.max_hw_contexts - 1); - WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1); - - gfx_v10_0_cp_gfx_enable(adev, true); - ring = &adev->gfx.gfx_ring[0]; r = amdgpu_ring_alloc(ring, 
gfx_v10_0_get_csb_size(adev) + 4); if (r) { @@ -6035,6 +6029,25 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) amdgpu_ring_commit(ring); } + + gfx_v10_0_wait_for_idle(adev); [kevin]: Did you forget to check the return value here? If you want to ignore the result, you'd better add a comment saying so. Thanks. Best Regards, Kevin + adev->csib_initialized = true; + + return 0; +}; + +static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) { + /* init the CP */ + WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT, +
[PATCH] drm/amd/amdgpu: avoid disabling gfxhub interrupts when the driver is unloaded
Avoid disabling the gfxhub interrupt when the driver is unloaded on GMC v11. Signed-off-by: Kenneth Feng --- drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c index 80ca2c05b0b8..8e36a8395464 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c @@ -73,7 +73,8 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev, * fini/suspend, so the overall state doesn't * change over the course of suspend/resume. */ - if (!adev->in_s0ix) + if (!adev->in_s0ix && (adev->in_runpm || adev->in_suspend || + amdgpu_in_reset(adev))) amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), false); break; case AMDGPU_IRQ_STATE_ENABLE: -- 2.34.1
Re: [PATCH] drm/amdgpu: Initialize schedulers before using them
On 2023-10-23 01:49, Christian König wrote: > > > Am 23.10.23 um 05:23 schrieb Luben Tuikov: >> Initialize ring schedulers before using them, very early in the amdgpu boot, >> at PCI probe time, specifically at frame-buffer dumb-create at fill-buffer. >> >> This was discovered by using dynamic scheduler run-queues, which showed that >> amdgpu was using a scheduler before calling drm_sched_init(), and the only >> reason it was working was because sched_rq[] was statically allocated in the >> scheduler structure. However, the scheduler structure had _not_ been >> initialized. >> >> When switching to dynamically allocated run-queues, this lack of >> initialization was causing an oops and a blank screen at boot up. This patch >> fixes this amdgpu bug. >> >> This patch depends on the "drm/sched: Convert the GPU scheduler to variable >> number of run-queues" patch, as that patch prevents subsequent scheduler >> initialization if a scheduler has already been initialized. >> >> Cc: Christian König >> Cc: Alex Deucher >> Cc: Felix Kuehling >> Cc: AMD Graphics >> Signed-off-by: Luben Tuikov >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++ >> 1 file changed, 14 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> index 4e51dce3aab5d6..575ef7e1e30fd4 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> @@ -60,6 +60,7 @@ >> #include "amdgpu_atomfirmware.h" >> #include "amdgpu_res_cursor.h" >> #include "bif/bif_4_1_d.h" >> +#include "amdgpu_reset.h" >> >> MODULE_IMPORT_NS(DMA_BUF); >> >> @@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct >> amdgpu_device *adev, bool enable) >> >> ring = adev->mman.buffer_funcs_ring; >> sched = &ring->sched; >> + >> +r = drm_sched_init(sched, &amdgpu_sched_ops, >> + DRM_SCHED_PRIORITY_COUNT, >> + ring->num_hw_submission, 0, >> + adev->sdma_timeout, adev->reset_domain->wq, >> + ring->sched_score, 
ring->name, >> + adev->dev); >> +if (r) { >> +drm_err(adev, "%s: couldn't initialize ring:%s >> error:%d\n", >> +__func__, ring->name, r); >> +return; >> +} > > That doesn't look correct either. > > amdgpu_ttm_set_buffer_funcs_status() should only be called with > enable=true as argument *after* the copy ring is initialized and valid > to use. One part of this ring initialization is to setup the scheduler. It's the only way to keep the functionality of amdgpu_fill_buffer() from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(), from ... without an oops and a blank screen at boot up. Here is a stack of the oops: Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched] Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f Oct 20 22:12:34 fedora kernel: RSP: 0018:c90001613838 EFLAGS: 00010246 Oct 20 22:12:34 fedora kernel: RAX: RBX: 88812f33b400 RCX: 0004 Oct 20 22:12:34 fedora kernel: RDX: RSI: c9000395145c RDI: 88812eacf850 Oct 20 22:12:34 fedora kernel: RBP: 88812eacf850 R08: 0004 R09: 0003 Oct 20 22:12:34 fedora kernel: R10: c066b850 R11: bc848ef1 R12: Oct 20 22:12:34 fedora kernel: R13: 0004 R14: 00800300 R15: 0100 Oct 20 22:12:34 fedora kernel: FS: 7f7be4866940() GS:0ed0() knlGS: Oct 20 22:12:34 fedora kernel: CS: 0010 DS: ES: CR0: 80050033 Oct 20 22:12:34 fedora kernel: CR2: 0008 CR3: 00012cf22000 CR4: 003506e0 Oct 20 22:12:34 fedora kernel: Call Trace: Oct 20 22:12:34 fedora kernel: Oct 20 22:12:34 fedora kernel: ? __die+0x1f/0x70 Oct 20 22:12:34 fedora kernel: ? page_fault_oops+0x149/0x440 Oct 20 22:12:34 fedora kernel: ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched] Oct 20 22:12:34 fedora kernel: ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu] Oct 20 22:12:34 fedora kernel: ? srso_return_thunk+0x5/0x10 Oct 20 22:12:34 fedora kernel: ? 
do_user_addr_fault+0x65/0x650 Oct 20 22:12:34 fedora kernel: ? drm_client_framebuffer_create+0xa3/0x280 [drm] Oct 20 22:12:34 fedora kernel: ? exc_page_fault+0x7b/0x180 Oct 20 22:12:34 fedora kernel: ? asm_exc_page_fault+0x22/0x30 Oct 20 22:12:34 fedora kernel: ? local_pci_probe+0x41/0x90 Oct 20 22:12:34 fedora kernel: ? __pfx_sdma_v5_0_emit_fill_buffer+0x10/0x10 [amdgpu] Oct 20 22:12:34 fedora kernel: ? drm_sched_job_arm+
[PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes from S3
Previously the CSIB command packet was sent to the GFX block every time the amdgpu driver loaded or the system resumed from S3. As the CP protocol requires, the CSIB does not need to be sent again when GC was not powered down, i.e. when resuming from an aborted S3 suspend sequence. When a PREAMBLE_CNTL packet arrives in the ring after a PG event where the RLC has already sent its copy of the CSIB, sending another CSIB packet causes a GFX IB test timeout on resume from S3. Add the flag `csib_initialized` to make sure a normal S3 suspend/resume still initializes the CSIB; when the system aborts an S3 suspend and resumes immediately because some suspend callback failed, the GPU was not powered down, so the CSIB command does not need to be sent again. Error dmesg log: amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx_0.0.0 (-110). [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110). PM: resume of devices complete after 2373.995 msecs PM: Finishing wakeup.
Signed-off-by: Perry Yuan --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 + drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 29 ++--- 3 files changed, 27 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 44df1a5bce7f..e5d85ea26a5e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1114,6 +1114,7 @@ struct amdgpu_device { booldebug_vm; booldebug_largebar; booldebug_disable_soft_recovery; + boolcsib_initialized; }; static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 420196a17e22..a47c9f840754 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -2468,6 +2468,11 @@ static int amdgpu_pmops_suspend_noirq(struct device *dev) if (amdgpu_acpi_should_gpu_reset(adev)) return amdgpu_asic_reset(adev); + /* update flag to make sure csib will be sent when system +* resume from normal S3 +*/ + adev->csib_initialized = false; + return 0; } diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index 6399bc71c56d..ab2e3e592dfc 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3481,6 +3481,7 @@ static uint64_t gfx_v10_0_get_gpu_clock_counter(struct amdgpu_device *adev); static void gfx_v10_0_select_se_sh(struct amdgpu_device *adev, u32 se_num, u32 sh_num, u32 instance, int xcc_id); static u32 gfx_v10_0_get_wgp_active_bitmap_per_sh(struct amdgpu_device *adev); +static int gfx_v10_0_wait_for_idle(void *handle); static int gfx_v10_0_rlc_backdoor_autoload_buffer_init(struct amdgpu_device *adev); static void gfx_v10_0_rlc_backdoor_autoload_buffer_fini(struct amdgpu_device *adev); @@ -5958,7 +5959,7 @@ static int gfx_v10_0_cp_gfx_load_microcode(struct amdgpu_device *adev) return 0; } 
-static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) +static int gfx_v10_csib_submit(struct amdgpu_device *adev) { struct amdgpu_ring *ring; const struct cs_section_def *sect = NULL; @@ -5966,13 +5967,6 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) int r, i; int ctx_reg_offset; - /* init the CP */ - WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT, -adev->gfx.config.max_hw_contexts - 1); - WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1); - - gfx_v10_0_cp_gfx_enable(adev, true); - ring = &adev->gfx.gfx_ring[0]; r = amdgpu_ring_alloc(ring, gfx_v10_0_get_csb_size(adev) + 4); if (r) { @@ -6035,6 +6029,25 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) amdgpu_ring_commit(ring); } + + gfx_v10_0_wait_for_idle(adev); + adev->csib_initialized = true; + + return 0; +}; + +static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev) +{ + /* init the CP */ + WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT, +adev->gfx.config.max_hw_contexts - 1); + WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1); + + gfx_v10_0_cp_gfx_enable(adev, true); + + if (!adev->csib_initialized) + gfx_v10_csib_submit(adev); + return 0; } -- 2.34.1
[PATCH 3/3] drm/amdgpu: optimize RLC powerdown notification on Vangogh
The SMU needs to receive the RLC power-down message to sync the RLC state with the SMU. The RLC state update must be sent when the SMU begins its suspend sequence, otherwise the SMU will crash because the driver never notified it of the RLC state; and since the RLC state could still change after an earlier notification, notify the SMU at the end of the suspend sequence in amdgpu_device_suspend(), which makes sure the RLC state reported to the SMU is correct. [ 101.000590] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x [ 101.000598] amdgpu :03:00.0: amdgpu: Failed to disable gfxoff! [ 110.838026] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x [ 110.838035] amdgpu :03:00.0: amdgpu: Failed to disable smu features. [ 110.838039] amdgpu :03:00.0: amdgpu: Fail to disable dpm features! [ 110.838040] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -62 [ 110.884394] PM: suspend of devices aborted after 21213.620 msecs [ 110.884402] PM: start suspend of devices aborted after 21213.882 msecs [ 110.884405] PM: Some devices failed to suspend, or early wake event detected Signed-off-by: Perry Yuan --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 drivers/gpu/drm/amd/include/kgd_pp_interface.h | 1 + drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 18 ++ drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h| 2 ++ drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 10 ++ drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 5 + .../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c | 5 ++--- drivers/gpu/drm/amd/pm/swsmu/smu_internal.h| 1 + 8 files changed, 43 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index cc047fe0b7ee..be08ffc69231 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4428,6 +4428,10 @@ int amdgpu_device_suspend(struct drm_device
*dev, bool fbcon) if (amdgpu_sriov_vf(adev)) amdgpu_virt_release_full_gpu(adev, false); + r = amdgpu_dpm_notify_rlc_state(adev, false); + if (r) + return r; + return 0; } diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h b/drivers/gpu/drm/amd/include/kgd_pp_interface.h index 3201808c2dd8..4eacfdfcfd4b 100644 --- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h +++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h @@ -444,6 +444,7 @@ struct amd_pm_funcs { struct dpm_clocks *clock_table); int (*get_smu_prv_buf_details)(void *handle, void **addr, size_t *size); void (*pm_compute_clocks)(void *handle); + int (*notify_rlc_state)(void *handle, bool en); }; struct metrics_table_header { diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c index acf3527fff2d..ed7237bb64c8 100644 --- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c +++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c @@ -181,6 +181,24 @@ int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev, return ret; } +int amdgpu_dpm_notify_rlc_state(struct amdgpu_device *adev, bool en) +{ + int ret = 0; + const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs; + + if (pp_funcs && pp_funcs->notify_rlc_state) { + mutex_lock(&adev->pm.mutex); + + ret = pp_funcs->notify_rlc_state( + adev->powerplay.pp_handle, + en); + + mutex_unlock(&adev->pm.mutex); + } + + return ret; +} + bool amdgpu_dpm_is_baco_supported(struct amdgpu_device *adev) { const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs; diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h index feccd2a7120d..482ea30147ab 100644 --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h @@ -415,6 +415,8 @@ int amdgpu_dpm_mode1_reset(struct amdgpu_device *adev); int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev, enum pp_mp1_state mp1_state); +int amdgpu_dpm_notify_rlc_state(struct amdgpu_device *adev, bool en); + int 
amdgpu_dpm_set_gfx_power_up_by_imu(struct amdgpu_device *adev); int amdgpu_dpm_baco_exit(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c index a0b8d5d78beb..a8fb914f746b 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c @@ -1710,6 +1710,16 @@ static int smu_disable_dpms(struct smu_context *smu) } } + /* Notify SMU RLC is going to be off, stop RLC and SMU interaction. +* otherwise SMU will hang
[PATCH 1/3] drm/amdgpu: ungate power gating when system suspend
[Why] During suspend, if GFX DPM is enabled and GFXOFF feature is enabled the system may get hung. So, it is suggested to disable GFXOFF feature during suspend and enable it after resume. [How] Update the code to disable GFXOFF feature during suspend and enable it after resume. [ 311.396526] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x [ 311.396530] amdgpu :03:00.0: amdgpu: Fail to disable dpm features! [ 311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block failed -62 Signed-off-by: Perry Yuan Signed-off-by: Kun Liu --- drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c index d9ccacd06fba..6399bc71c56d 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c @@ -3498,6 +3498,8 @@ static void gfx_v10_0_ring_invalidate_tlbs(struct amdgpu_ring *ring, static void gfx_v10_0_update_spm_vmid_internal(struct amdgpu_device *adev, unsigned int vmid); +static int gfx_v10_0_set_powergating_state(void *handle, + enum amd_powergating_state state); static void gfx10_kiq_set_resources(struct amdgpu_ring *kiq_ring, uint64_t queue_mask) { amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_RESOURCES, 6)); @@ -7172,6 +7174,13 @@ static int gfx_v10_0_hw_fini(void *handle) amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0); amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0); + /* WA added for Vangogh asic fixing the SMU suspend failure +* It needs to set power gating again during gfxoff control +* otherwise the gfxoff disallowing will be failed to set. +*/ + if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(10, 3, 1)) + gfx_v10_0_set_powergating_state(handle, AMD_PG_STATE_UNGATE); + if (!adev->no_hw_access) { if (amdgpu_async_gfx_ring) { if (amdgpu_gfx_disable_kgq(adev, 0)) -- 2.34.1
[PATCH v3 05/10] drm/ci: clean up xfails (specially flakes list)
Since the script that collected the list of the expectation files was bogus and incorrectly placed tests in the flakes list, regenerate the expectation files with the corrected script. This greatly reduces the number of tests in the flakes list. Signed-off-by: Helen Koike Reviewed-by: David Heidelberg --- v2: - fix typo in the commit message - re-add kms_cursor_legacy@flip-vs-cursor-toggle back to msm-sdm845-flakes.txt - removed kms_async_flips@crc,Fail from i915-cml-fails.txt v3: - add kms_rmfb@close-fd,Fail to amdgpu-stoney-fails.txt - add kms_async_flips@crc to i915-kbl-flakes.txt Signed-off-by: Helen Koike --- .../gpu/drm/ci/xfails/amdgpu-stoney-fails.txt | 12 +- .../drm/ci/xfails/amdgpu-stoney-flakes.txt| 20 - drivers/gpu/drm/ci/xfails/i915-amly-fails.txt | 9 .../gpu/drm/ci/xfails/i915-amly-flakes.txt| 32 --- drivers/gpu/drm/ci/xfails/i915-apl-fails.txt | 11 - drivers/gpu/drm/ci/xfails/i915-apl-flakes.txt | 1 - drivers/gpu/drm/ci/xfails/i915-cml-fails.txt | 14 ++- drivers/gpu/drm/ci/xfails/i915-cml-flakes.txt | 38 - drivers/gpu/drm/ci/xfails/i915-glk-fails.txt | 17 drivers/gpu/drm/ci/xfails/i915-glk-flakes.txt | 41 --- drivers/gpu/drm/ci/xfails/i915-kbl-fails.txt | 7 drivers/gpu/drm/ci/xfails/i915-kbl-flakes.txt | 25 --- drivers/gpu/drm/ci/xfails/i915-tgl-fails.txt | 1 - drivers/gpu/drm/ci/xfails/i915-tgl-flakes.txt | 5 --- drivers/gpu/drm/ci/xfails/i915-whl-flakes.txt | 1 - .../drm/ci/xfails/mediatek-mt8173-flakes.txt | 0 .../drm/ci/xfails/mediatek-mt8183-fails.txt | 5 ++- .../drm/ci/xfails/mediatek-mt8183-flakes.txt | 14 --- .../gpu/drm/ci/xfails/meson-g12b-fails.txt| 14 --- .../gpu/drm/ci/xfails/meson-g12b-flakes.txt | 4 -- .../gpu/drm/ci/xfails/msm-apq8016-flakes.txt | 4 -- .../gpu/drm/ci/xfails/msm-apq8096-fails.txt | 2 + .../gpu/drm/ci/xfails/msm-apq8096-flakes.txt | 4 -- .../gpu/drm/ci/xfails/msm-sc7180-fails.txt| 15 --- .../gpu/drm/ci/xfails/msm-sc7180-flakes.txt | 24 +++ .../gpu/drm/ci/xfails/msm-sc7180-skips.txt| 18 +---
.../gpu/drm/ci/xfails/msm-sdm845-fails.txt| 9 +--- .../gpu/drm/ci/xfails/msm-sdm845-flakes.txt | 19 + .../drm/ci/xfails/rockchip-rk3288-fails.txt | 6 +++ .../drm/ci/xfails/rockchip-rk3288-flakes.txt | 9 .../drm/ci/xfails/rockchip-rk3399-fails.txt | 40 +- .../drm/ci/xfails/rockchip-rk3399-flakes.txt | 28 +++-- .../drm/ci/xfails/virtio_gpu-none-flakes.txt | 0 33 files changed, 162 insertions(+), 287 deletions(-) delete mode 100644 drivers/gpu/drm/ci/xfails/i915-amly-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/i915-apl-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/i915-cml-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/i915-glk-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/i915-tgl-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/i915-whl-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/mediatek-mt8183-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/meson-g12b-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/msm-apq8016-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/msm-apq8096-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/rockchip-rk3288-flakes.txt delete mode 100644 drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt diff --git a/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt b/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt index bd9392536e7c..ea87dc46bc2b 100644 --- a/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt +++ b/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt @@ -1,8 +1,14 @@ kms_addfb_basic@bad-pitch-65536,Fail kms_addfb_basic@bo-too-small,Fail +kms_addfb_basic@too-high,Fail +kms_async_flips@async-flip-with-page-flip-events,Fail +kms_async_flips@crc,Fail kms_async_flips@invalid-async-flip,Fail -kms_atomic@plane-immutable-zpos,Fail +kms_atomic_transition@plane-all-modeset-transition-internal-panels,Fail +kms_atomic_transition@plane-all-transition,Fail 
+kms_atomic_transition@plane-all-transition-nonblocking,Fail kms_atomic_transition@plane-toggle-modeset-transition,Fail +kms_atomic_transition@plane-use-after-nonblocking-unbind,Fail kms_bw@linear-tiling-1-displays-2560x1440p,Fail kms_bw@linear-tiling-1-displays-3840x2160p,Fail kms_bw@linear-tiling-2-displays-3840x2160p,Fail @@ -11,9 +17,11 @@ kms_color@degamma,Fail kms_cursor_crc@cursor-size-change,Fail kms_cursor_crc@pipe-A-cursor-size-change,Fail kms_cursor_crc@pipe-B-cursor-size-change,Fail -kms_cursor_legacy@forked-move,Fail +kms_flip@flip-vs-modeset-vs-hang,Fail +kms_flip@flip-vs-panning-vs-hang,Fail kms_hdr@bpc-switch,Fail kms_hdr@bpc-switch-dpms,Fail +kms_plane@pixel-format,Fail kms_plane_multiple@atomic-pipe-A-tiling-none,Fail kms_rmfb@close-fd,Fail kms_rotation_
Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"
[sorry, I hit send too early] On 2023-10-23 11:15, Christian König wrote: Am 23.10.23 um 15:06 schrieb Daniel Tang: That commit causes the screen to freeze a few moments after running clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer dereference message in dmesg and the process becoming a zombie whose unkillableness prevents shutdown without REISUB. Although llama.cpp and hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by this revert, pytorch-rocm is now working with stability and without whole-computer freezes caused by any accidental running of clinfo. This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. That result doesn't make much sense. Felix please correct me, but AFAIK the ATS stuff was completely removed by now. Are you sure that this is pure v6.6-rc7 and not some other patches applied? If yes then we must have missed something. This revert doesn't really affect systems with ATS. It moves the sanity check back out of the ATS-specific code. The NULL pointer dereference in the bug report comes from the CPU page table update code: [10089.267556] BUG: kernel NULL pointer dereference, address: [10089.267563] #PF: supervisor write access in kernel mode [10089.267566] #PF: error_code(0x0002) - not-present page [10089.267569] PGD 0 P4D 0 [10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI [10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G OE 6.5.0-9-generic #9-Ubuntu [10089.267582] Hardware name: Micro-Star International Co., Ltd. 
MS-7C37/X570-A PRO (MS-7C37), BIOS H.I0 08/10/2022 [10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] [10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f [10089.267823] RSP: 0018:b49805eeb8b0 EFLAGS: 00010246 [10089.267827] RAX: RBX: 0020 RCX: 00400480 [10089.267830] RDX: RSI: RDI: 9890d438 [10089.267832] RBP: b49805eeb8b0 R08: 00400480 R09: 0020 [10089.267835] R10: 000800100200 R11: 000800100200 R12: b49805eeba98 [10089.267837] R13: 0001 R14: 0020 R15: 0001 [10089.267840] FS: 7f8ca9f09740() GS:9897befc() knlGS: [10089.267843] CS: 0010 DS: ES: CR0: 80050033 [10089.267846] CR2: CR3: 0002e0746000 CR4: 00750ee0 [10089.267849] PKRU: 5554 [10089.267851] Call Trace: [10089.267853] [10089.267858] ? show_regs+0x6d/0x80 [10089.267865] ? __die+0x24/0x80 [10089.267870] ? page_fault_oops+0x99/0x1b0 [10089.267876] ? do_user_addr_fault+0x316/0x6b0 [10089.267879] ? srso_alias_return_thunk+0x5/0x7f [10089.267884] ? scsi_dispatch_cmd+0x91/0x240 [10089.267891] ? exc_page_fault+0x83/0x1b0 [10089.267896] ? asm_exc_page_fault+0x27/0x30 [10089.267904] ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu] [10089.268140] amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu] ... This revert is just a roundabout way of disabling CPU page table updates for compute VMs. But I don't think it really addresses the root cause. Regards, Felix Regards, Christian. 
Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 Signed-off-by: Daniel Tang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 82f25996ff5e..602f311ab766 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) if (r) return r; + /* Sanity checks */ + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { + r = -EINVAL; + goto unreserve_bo; + } + /* Check if PD needs to be reinitialized and do it before * changing any other state, in case it fails. */ if (pte_support_ats != vm->pte_support_ats) { - /* Sanity checks */ - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { - r = -EINVAL; - goto unreserve_bo; - } - vm->pte_support_ats = pte_support_ats; r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), false); -- 2.40.1
Re: [PATCH v3] drm/amdkfd: Use partial mapping in GPU page faults
On 2023-10-20 17:53, Xiaogang.Chen wrote: From: Xiaogang Chen After partial migration to recover a GPU page fault, this patch maps GPU VM space for the same page range that was migrated instead of mapping all pages of the svm range in which the page fault happened. Signed-off-by: Xiaogang Chen --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 29 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 54af7a2b29f8..3a71d04779b1 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1619,6 +1619,7 @@ static void *kfd_svm_page_owner(struct kfd_process *p, int32_t gpuidx) * 5. Release page table (and SVM BO) reservation */ static int svm_range_validate_and_map(struct mm_struct *mm, + unsigned long map_start, unsigned long map_last, struct svm_range *prange, int32_t gpuidx, bool intr, bool wait, bool flush_tlb) { @@ -1699,6 +1700,8 @@ static int svm_range_validate_and_map(struct mm_struct *mm, end = (prange->last + 1) << PAGE_SHIFT; for (addr = start; !r && addr < end; ) { struct hmm_range *hmm_range; + unsigned long map_start_vma; + unsigned long map_last_vma; struct vm_area_struct *vma; uint64_t vram_pages_vma; unsigned long next = 0; @@ -1747,9 +1750,16 @@ static int svm_range_validate_and_map(struct mm_struct *mm, r = -EAGAIN; } - if (!r) - r = svm_range_map_to_gpus(prange, offset, npages, readonly, - ctx->bitmap, wait, flush_tlb); + if (!r) { + map_start_vma = max(map_start, prange->start + offset); + map_last_vma = min(map_last, prange->start + offset + npages - 1); + if (map_start_vma <= map_last_vma) { + offset = map_start_vma - prange->start; + npages = map_last_vma - map_start_vma + 1; + r = svm_range_map_to_gpus(prange, offset, npages, readonly, + ctx->bitmap, wait, flush_tlb); + } + } if (!r && next == end) prange->mapped_to_gpu = true; @@ -1855,8 +1865,8 @@ static void svm_range_restore_work(struct work_struct *work) */ 
mutex_lock(&prange->migrate_mutex); - r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE, - false, true, false); + r = svm_range_validate_and_map(mm, prange->start, prange->last, prange, + MAX_GPU_INSTANCE, false, true, false); if (r) pr_debug("failed %d to map 0x%lx to gpus\n", r, prange->start); @@ -3069,6 +3079,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr, write_fault, timestamp); + start = prange->start; + last = prange->last; This means, page faults that don't migrate will map the whole range. Should we move the proper assignment of start and last out of the condition below, so it applies equally to page faults that migrate and those that don't? Regards, Felix if (prange->actual_loc != 0 || best_loc != 0) { migration = true; /* Align migration range start and size to granularity size */ @@ -3102,10 +3114,11 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, } } - r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false); + r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false, + false, false); if (r) pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n", -r, svms, prange->start, prange->last); +r, svms, start, last); kfd_smi_event_page_fault_end(node, p->lead_thread->pid, addr, migration); @@ -3650,7 +3663,7 @@ svm_range_set_attr(struct kfd_process *p, struct mm_struct *mm, flush_tlb = !migrated && update_mapping && prange->mapped_to_gpu; - r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE, + r = svm_range_validate_and_map(mm, prange->start, prange->last, prange, MAX_GPU_INSTANCE, true, true, flush_tlb); if (r)
Re: [PATCH 3/3] Revert "[PATCH] drm/amdkfd: Use partial migrations in GPU page faults"
On 2023-10-23 16:37, Philip Yang wrote: This reverts commit 1fd60d88c4b57d715c0ae09794061c0cc53009e3. The change prevents migrating the entire range to VRAM because retry fault restore_pages map the remaining system memory range to GPUs. It will work correctly to submit together with partial mapping to GPU patch later. Signed-off-by: Philip Yang The series is Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 150 ++- drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 6 +- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 83 +++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 6 +- 4 files changed, 85 insertions(+), 160 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 81d25a679427..6c25dab051d5 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -442,10 +442,10 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, goto out_free; } if (cpages != npages) - pr_debug("partial migration, 0x%lx/0x%llx pages collected\n", + pr_debug("partial migration, 0x%lx/0x%llx pages migrated\n", cpages, npages); else - pr_debug("0x%lx pages collected\n", cpages); + pr_debug("0x%lx pages migrated\n", cpages); r = svm_migrate_copy_to_vram(node, prange, &migrate, &mfence, scratch, ttm_res_offset); migrate_vma_pages(&migrate); @@ -479,8 +479,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, * svm_migrate_ram_to_vram - migrate svm range from system to device * @prange: range structure * @best_loc: the device to migrate to - * @start_mgr: start page to migrate - * @last_mgr: last page to migrate * @mm: the process mm structure * @trigger: reason of migration * @@ -491,7 +489,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, */ static int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, - unsigned long start_mgr, unsigned long last_mgr, struct mm_struct *mm, uint32_t trigger) { 
unsigned long addr, start, end; @@ -501,30 +498,23 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, unsigned long cpages = 0; long r = 0; - if (!best_loc) { - pr_debug("svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n", - prange->svms, start_mgr, last_mgr); + if (prange->actual_loc == best_loc) { + pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n", +prange->svms, prange->start, prange->last, best_loc); return 0; } - if (start_mgr < prange->start || last_mgr > prange->last) { - pr_debug("range [0x%lx 0x%lx] out prange [0x%lx 0x%lx]\n", -start_mgr, last_mgr, prange->start, prange->last); - return -EFAULT; - } - node = svm_range_get_node_by_id(prange, best_loc); if (!node) { pr_debug("failed to get kfd node by id 0x%x\n", best_loc); return -ENODEV; } - pr_debug("svms 0x%p [0x%lx 0x%lx] in [0x%lx 0x%lx] to gpu 0x%x\n", - prange->svms, start_mgr, last_mgr, prange->start, prange->last, - best_loc); + pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms, +prange->start, prange->last, best_loc); - start = start_mgr << PAGE_SHIFT; - end = (last_mgr + 1) << PAGE_SHIFT; + start = prange->start << PAGE_SHIFT; + end = (prange->last + 1) << PAGE_SHIFT; r = svm_range_vram_node_new(node, prange, true); if (r) { @@ -554,11 +544,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, if (cpages) { prange->actual_loc = best_loc; - prange->vram_pages = prange->vram_pages + cpages; - } else if (!prange->actual_loc) { - /* if no page migrated and all pages from prange are at -* sys ram drop svm_bo got from svm_range_vram_node_new -*/ + svm_range_dma_unmap(prange); + } else { svm_range_vram_node_free(prange); } @@ -676,8 +663,9 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange, * Context: Process context, caller hold mmap read lock, prange->migrate_mutex * * Return: + * 0 - success with all pages migrated * negative values - indicate error - * positive values or zero - number of pages got migrated + 
* positive values - partial migration, number of pages not migrated */ static long svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange, @@ -688,7 +676,6 @@ svm_migrate_vma_to_ram(struct kfd_nod
Re: [PATCH 3/3] drm/amd: Explicitly disable ASPM when dynamic switching disabled
On Mon, Oct 23, 2023 at 5:12 PM Mario Limonciello wrote: > > Currently there are separate but related checks: > * amdgpu_device_should_use_aspm() > * amdgpu_device_aspm_support_quirk() > * amdgpu_device_pcie_dynamic_switching_supported() > > Simplify into checking whether DPM was enabled or not in the auto > case. This works because amdgpu_device_pcie_dynamic_switching_supported() > populates that value. > > Signed-off-by: Mario Limonciello Series is: Reviewed-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 -- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++--- > drivers/gpu/drm/amd/amdgpu/nv.c| 7 +++ > drivers/gpu/drm/amd/amdgpu/vi.c| 2 +- > 4 files changed, 10 insertions(+), 22 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > index 44df1a5bce7f..c1c98bd2d489 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h > @@ -1339,9 +1339,7 @@ void amdgpu_device_pci_config_reset(struct > amdgpu_device *adev); > int amdgpu_device_pci_reset(struct amdgpu_device *adev); > bool amdgpu_device_need_post(struct amdgpu_device *adev); > bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev); > -bool amdgpu_device_pcie_dynamic_switching_supported(void); > bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev); > -bool amdgpu_device_aspm_support_quirk(void); > > void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes, > u64 num_vis_bytes); > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index 4e144be7f044..7ec32b44df05 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -1456,14 +1456,14 @@ bool amdgpu_device_seamless_boot_supported(struct > amdgpu_device *adev) > } > > /* > - * Intel hosts such as Raptor Lake and Sapphire Rapids don't support dynamic > - * speed switching. 
Until we have confirmation from Intel that a specific > host > - * supports it, it's safer that we keep it disabled for all. > + * Intel hosts such as Rocket Lake, Alder Lake, Raptor Lake and Sapphire > Rapids > + * don't support dynamic speed switching. Until we have confirmation from > Intel > + * that a specific host supports it, it's safer that we keep it disabled for > all. > * > * > https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/ > * https://gitlab.freedesktop.org/drm/amd/-/issues/2663 > */ > -bool amdgpu_device_pcie_dynamic_switching_supported(void) > +static bool amdgpu_device_pcie_dynamic_switching_supported(void) > { > #if IS_ENABLED(CONFIG_X86) > struct cpuinfo_x86 *c = &cpu_data(0); > @@ -1498,20 +1498,11 @@ bool amdgpu_device_should_use_aspm(struct > amdgpu_device *adev) > } > if (adev->flags & AMD_IS_APU) > return false; > + if (!(adev->pm.pp_feature & PP_PCIE_DPM_MASK)) > + return false; > return pcie_aspm_enabled(adev->pdev); > } > > -bool amdgpu_device_aspm_support_quirk(void) > -{ > -#if IS_ENABLED(CONFIG_X86) > - struct cpuinfo_x86 *c = &cpu_data(0); > - > - return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); > -#else > - return true; > -#endif > -} > - > /* if we get transitioned to only one device, take VGA back */ > /** > * amdgpu_device_vga_set_decode - enable/disable vga decode > diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c > index 9fa220de1490..4d7976b77767 100644 > --- a/drivers/gpu/drm/amd/amdgpu/nv.c > +++ b/drivers/gpu/drm/amd/amdgpu/nv.c > @@ -513,7 +513,7 @@ static int nv_set_vce_clocks(struct amdgpu_device *adev, > u32 evclk, u32 ecclk) > > static void nv_program_aspm(struct amdgpu_device *adev) > { > - if (!amdgpu_device_should_use_aspm(adev) || > !amdgpu_device_aspm_support_quirk()) > + if (!amdgpu_device_should_use_aspm(adev)) > return; > > if 
(adev->nbio.funcs->program_aspm) > @@ -608,9 +608,8 @@ static int nv_update_umd_stable_pstate(struct > amdgpu_device *adev, > if (adev->gfx.funcs->update_perfmon_mgcg) > adev->gfx.funcs->update_perfmon_mgcg(adev, !enter); > > - if (!(adev->flags & AMD_IS_APU) && > - (adev->nbio.funcs->enable_aspm) && > -amdgpu_device_should_use_aspm(adev)) > + if (adev->nbio.funcs->enable_aspm && > + amdgpu_device_should_use_aspm(adev)) > adev->nbio.funcs->enable_aspm(adev, !enter); > > return 0; > diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c > index 1a08052bade3..1a98812981f4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/vi.c > +++ b/drivers/gpu/drm/amd/amdgpu/vi.c > @@ -1124,7 +1
[PATCH 2/3] drm/amd: Move AMD_IS_APU check for ASPM into top level function
There is no need for every ASIC driver to perform the same check. Move the duplicated code into amdgpu_device_should_use_aspm(). Signed-off-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ drivers/gpu/drm/amd/amdgpu/cik.c | 4 drivers/gpu/drm/amd/amdgpu/nv.c| 3 +-- drivers/gpu/drm/amd/amdgpu/si.c| 2 -- drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +-- drivers/gpu/drm/amd/amdgpu/soc21.c | 3 +-- drivers/gpu/drm/amd/amdgpu/vi.c| 3 +-- 7 files changed, 6 insertions(+), 14 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index b345c7bcc3bc..4e144be7f044 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -1496,6 +1496,8 @@ bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev) default: return false; } + if (adev->flags & AMD_IS_APU) + return false; return pcie_aspm_enabled(adev->pdev); } diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c index 5641cf05d856..4cd13486a349 100644 --- a/drivers/gpu/drm/amd/amdgpu/cik.c +++ b/drivers/gpu/drm/amd/amdgpu/cik.c @@ -1725,10 +1725,6 @@ static void cik_program_aspm(struct amdgpu_device *adev) if (pci_is_root_bus(adev->pdev->bus)) return; - /* XXX double check APUs */ - if (adev->flags & AMD_IS_APU) - return; - orig = data = RREG32_PCIE(ixPCIE_LC_N_FTS_CNTL); data &= ~PCIE_LC_N_FTS_CNTL__LC_XMIT_N_FTS_MASK; data |= (0x24 << PCIE_LC_N_FTS_CNTL__LC_XMIT_N_FTS__SHIFT) | diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c index 1995c7459f20..9fa220de1490 100644 --- a/drivers/gpu/drm/amd/amdgpu/nv.c +++ b/drivers/gpu/drm/amd/amdgpu/nv.c @@ -516,8 +516,7 @@ static void nv_program_aspm(struct amdgpu_device *adev) if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_aspm_support_quirk()) return; - if (!(adev->flags & AMD_IS_APU) && - (adev->nbio.funcs->program_aspm)) + if (adev->nbio.funcs->program_aspm) 
adev->nbio.funcs->program_aspm(adev); } diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c index f64b87b11b1b..456ca581f517 100644 --- a/drivers/gpu/drm/amd/amdgpu/si.c +++ b/drivers/gpu/drm/amd/amdgpu/si.c @@ -2456,8 +2456,6 @@ static void si_program_aspm(struct amdgpu_device *adev) if (!amdgpu_device_should_use_aspm(adev)) return; - if (adev->flags & AMD_IS_APU) - return; orig = data = RREG32_PCIE_PORT(PCIE_LC_N_FTS_CNTL); data &= ~LC_XMIT_N_FTS_MASK; data |= LC_XMIT_N_FTS(0x24) | LC_XMIT_N_FTS_OVERRIDE_EN; diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c b/drivers/gpu/drm/amd/amdgpu/soc15.c index 66ed28136bc8..d4b8d62f4294 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc15.c +++ b/drivers/gpu/drm/amd/amdgpu/soc15.c @@ -646,8 +646,7 @@ static void soc15_program_aspm(struct amdgpu_device *adev) if (!amdgpu_device_should_use_aspm(adev)) return; - if (!(adev->flags & AMD_IS_APU) && - (adev->nbio.funcs->program_aspm)) + if (adev->nbio.funcs->program_aspm) adev->nbio.funcs->program_aspm(adev); } diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c b/drivers/gpu/drm/amd/amdgpu/soc21.c index 8c6cab641a1c..d5083c549330 100644 --- a/drivers/gpu/drm/amd/amdgpu/soc21.c +++ b/drivers/gpu/drm/amd/amdgpu/soc21.c @@ -433,8 +433,7 @@ static void soc21_program_aspm(struct amdgpu_device *adev) if (!amdgpu_device_should_use_aspm(adev)) return; - if (!(adev->flags & AMD_IS_APU) && - (adev->nbio.funcs->program_aspm)) + if (adev->nbio.funcs->program_aspm) adev->nbio.funcs->program_aspm(adev); } diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index fe8ba9e9837b..1a08052bade3 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -1127,8 +1127,7 @@ static void vi_program_aspm(struct amdgpu_device *adev) if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_pcie_dynamic_switching_supported()) return; - if (adev->flags & AMD_IS_APU || - adev->asic_type < CHIP_POLARIS10) + if (adev->asic_type < 
CHIP_POLARIS10) return; orig = data = RREG32_PCIE(ixPCIE_LC_CNTL); -- 2.34.1
[PATCH 3/3] drm/amd: Explicitly disable ASPM when dynamic switching disabled
Currently there are separate but related checks: * amdgpu_device_should_use_aspm() * amdgpu_device_aspm_support_quirk() * amdgpu_device_pcie_dynamic_switching_supported() Simplify into checking whether DPM was enabled or not in the auto case. This works because amdgpu_device_pcie_dynamic_switching_supported() populates that value. Signed-off-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/amdgpu.h| 2 -- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++--- drivers/gpu/drm/amd/amdgpu/nv.c| 7 +++ drivers/gpu/drm/amd/amdgpu/vi.c| 2 +- 4 files changed, 10 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 44df1a5bce7f..c1c98bd2d489 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -1339,9 +1339,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device *adev); int amdgpu_device_pci_reset(struct amdgpu_device *adev); bool amdgpu_device_need_post(struct amdgpu_device *adev); bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev); -bool amdgpu_device_pcie_dynamic_switching_supported(void); bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev); -bool amdgpu_device_aspm_support_quirk(void); void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes, u64 num_vis_bytes); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 4e144be7f044..7ec32b44df05 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -1456,14 +1456,14 @@ bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev) } /* - * Intel hosts such as Raptor Lake and Sapphire Rapids don't support dynamic - * speed switching. Until we have confirmation from Intel that a specific host - * supports it, it's safer that we keep it disabled for all. 
+ * Intel hosts such as Rocket Lake, Alder Lake, Raptor Lake and Sapphire Rapids + * don't support dynamic speed switching. Until we have confirmation from Intel + * that a specific host supports it, it's safer that we keep it disabled for all. * * https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/ * https://gitlab.freedesktop.org/drm/amd/-/issues/2663 */ -bool amdgpu_device_pcie_dynamic_switching_supported(void) +static bool amdgpu_device_pcie_dynamic_switching_supported(void) { #if IS_ENABLED(CONFIG_X86) struct cpuinfo_x86 *c = &cpu_data(0); @@ -1498,20 +1498,11 @@ bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev) } if (adev->flags & AMD_IS_APU) return false; + if (!(adev->pm.pp_feature & PP_PCIE_DPM_MASK)) + return false; return pcie_aspm_enabled(adev->pdev); } -bool amdgpu_device_aspm_support_quirk(void) -{ -#if IS_ENABLED(CONFIG_X86) - struct cpuinfo_x86 *c = &cpu_data(0); - - return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE); -#else - return true; -#endif -} - /* if we get transitioned to only one device, take VGA back */ /** * amdgpu_device_vga_set_decode - enable/disable vga decode diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c index 9fa220de1490..4d7976b77767 100644 --- a/drivers/gpu/drm/amd/amdgpu/nv.c +++ b/drivers/gpu/drm/amd/amdgpu/nv.c @@ -513,7 +513,7 @@ static int nv_set_vce_clocks(struct amdgpu_device *adev, u32 evclk, u32 ecclk) static void nv_program_aspm(struct amdgpu_device *adev) { - if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_aspm_support_quirk()) + if (!amdgpu_device_should_use_aspm(adev)) return; if (adev->nbio.funcs->program_aspm) @@ -608,9 +608,8 @@ static int nv_update_umd_stable_pstate(struct amdgpu_device *adev, if (adev->gfx.funcs->update_perfmon_mgcg) adev->gfx.funcs->update_perfmon_mgcg(adev, !enter); - if (!(adev->flags & AMD_IS_APU) && - 
(adev->nbio.funcs->enable_aspm) && -amdgpu_device_should_use_aspm(adev)) + if (adev->nbio.funcs->enable_aspm && + amdgpu_device_should_use_aspm(adev)) adev->nbio.funcs->enable_aspm(adev, !enter); return 0; diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 1a08052bade3..1a98812981f4 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct amdgpu_device *adev) bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_pcie_dynamic_switching_supported()) + if (!amdgpu_device_should_use_aspm(adev)) return; if (ade
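The consolidated policy in this patch can be modeled as a single predicate. This is a hedged sketch, not the actual driver code: `is_apu`, `pp_feature`, and `aspm_enabled_in_os` stand in for the real `adev->flags`, `adev->pm.pp_feature`, and `pcie_aspm_enabled()`, the `PP_PCIE_DPM_MASK` bit position is illustrative, and the `amdgpu_aspm` module-parameter handling in the auto case is omitted:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PP_PCIE_DPM_MASK (1u << 5) /* illustrative bit position, not the kernel's */

/* Sketch of the simplified policy: ASPM is used only on dGPUs where
 * PCIe DPM stayed enabled (i.e. dynamic speed switching is supported,
 * since patch 1/3 clears the bit otherwise) and the OS left ASPM on
 * for the device. */
static bool should_use_aspm(bool is_apu, uint32_t pp_feature,
                            bool aspm_enabled_in_os)
{
	if (is_apu)
		return false;
	if (!(pp_feature & PP_PCIE_DPM_MASK))
		return false;
	return aspm_enabled_in_os;
}
```

Because the dynamic-switching quirk now folds into `pp_feature`, the per-ASIC callers (`nv_program_aspm()`, `vi_program_aspm()`) only need this one check.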
[PATCH 1/3] drm/amd: Disable PP_PCIE_DPM_MASK when dynamic speed switching not supported
Rather than individual ASICs checking for the quirk, set the quirk at the driver level. Signed-off-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c | 4 +--- drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 2 +- drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c | 2 +- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index cc047fe0b7ee..b345c7bcc3bc 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2315,6 +2315,8 @@ static int amdgpu_device_ip_early_init(struct amdgpu_device *adev) adev->pm.pp_feature &= ~PP_GFXOFF_MASK; if (amdgpu_sriov_vf(adev) && adev->asic_type == CHIP_SIENNA_CICHLID) adev->pm.pp_feature &= ~PP_OVERDRIVE_MASK; + if (!amdgpu_device_pcie_dynamic_switching_supported()) + adev->pm.pp_feature &= ~PP_PCIE_DPM_MASK; total = true; for (i = 0; i < adev->num_ip_blocks; i++) { diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c index 5a2371484a58..11372fcc59c8 100644 --- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c +++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c @@ -1823,9 +1823,7 @@ static void smu7_init_dpm_defaults(struct pp_hwmgr *hwmgr) data->mclk_dpm_key_disabled = hwmgr->feature_mask & PP_MCLK_DPM_MASK ? false : true; data->sclk_dpm_key_disabled = hwmgr->feature_mask & PP_SCLK_DPM_MASK ? 
false : true; - data->pcie_dpm_key_disabled = - !amdgpu_device_pcie_dynamic_switching_supported() || - !(hwmgr->feature_mask & PP_PCIE_DPM_MASK); + data->pcie_dpm_key_disabled = !(hwmgr->feature_mask & PP_PCIE_DPM_MASK); /* need to set voltage control types before EVV patching */ data->voltage_control = SMU7_VOLTAGE_CONTROL_NONE; data->vddci_control = SMU7_VOLTAGE_CONTROL_NONE; diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c index 090249b6422a..97a5c9b3e941 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c @@ -2115,7 +2115,7 @@ static int sienna_cichlid_update_pcie_parameters(struct smu_context *smu, min_lane_width = min_lane_width > max_lane_width ? max_lane_width : min_lane_width; - if (!amdgpu_device_pcie_dynamic_switching_supported()) { + if (!(smu->adev->pm.pp_feature & PP_PCIE_DPM_MASK)) { pcie_table->pcie_gen[0] = max_gen_speed; pcie_table->pcie_lane[0] = max_lane_width; } else { diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c index bcb7ab9d2221..e06de3524a1a 100644 --- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c +++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c @@ -2437,7 +2437,7 @@ int smu_v13_0_update_pcie_parameters(struct smu_context *smu, uint32_t smu_pcie_arg; int ret, i; - if (!amdgpu_device_pcie_dynamic_switching_supported()) { + if (!(smu->adev->pm.pp_feature & PP_PCIE_DPM_MASK)) { if (pcie_table->pcie_gen[num_of_levels - 1] < pcie_gen_cap) pcie_gen_cap = pcie_table->pcie_gen[num_of_levels - 1]; -- 2.34.1
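The shape of the change — clear the feature bit once at early init, then have every consumer test only the mask — can be sketched as below. The bit value is illustrative, not the kernel's real `PP_PCIE_DPM_MASK`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PP_PCIE_DPM_MASK (1u << 5) /* illustrative bit position */

/* Sketch of the driver-level quirk from this patch: instead of each
 * ASIC calling amdgpu_device_pcie_dynamic_switching_supported()
 * itself, the feature bit is cleared once (in
 * amdgpu_device_ip_early_init) and every later consumer just tests
 * pp_feature. */
static uint32_t apply_pcie_dpm_quirk(uint32_t pp_feature,
                                     bool dyn_switching_supported)
{
	if (!dyn_switching_supported)
		pp_feature &= ~PP_PCIE_DPM_MASK;
	return pp_feature;
}
```

This is why `smu7_hwmgr.c`, `sienna_cichlid_ppt.c`, and `smu_v13_0.c` can each drop their direct call to the quirk helper and check `pp_feature & PP_PCIE_DPM_MASK` instead.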
[PATCH 2/3] Revert "drm/amdkfd:remove unused code"
This reverts commit d97e7b1eb8afd7a404466533b0bc192351b760c7. Needed for the next revert patch. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 60 drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 3 ++ 2 files changed, 63 insertions(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 4d000c63cde8..3422eee8d0d0 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1145,6 +1145,66 @@ svm_range_add_child(struct svm_range *prange, struct mm_struct *mm, list_add_tail(&pchild->child_list, &prange->child_list); } +/** + * svm_range_split_by_granularity - collect ranges within granularity boundary + * + * @p: the process with svms list + * @mm: mm structure + * @addr: the vm fault address in pages, to split the prange + * @parent: parent range if prange is from child list + * @prange: prange to split + * + * Trims @prange to be a single aligned block of prange->granularity if + * possible. The head and tail are added to the child_list in @parent. + * + * Context: caller must hold mmap_read_lock and prange->lock + * + * Return: + * 0 - OK, otherwise error code + */ +int +svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm, + unsigned long addr, struct svm_range *parent, + struct svm_range *prange) +{ + struct svm_range *head, *tail; + unsigned long start, last, size; + int r; + + /* Align splited range start and size to granularity size, then a single +* PTE will be used for whole range, this reduces the number of PTE +* updated and the L1 TLB space used for translation. 
+*/ + size = 1UL << prange->granularity; + start = ALIGN_DOWN(addr, size); + last = ALIGN(addr + 1, size) - 1; + + pr_debug("svms 0x%p split [0x%lx 0x%lx] to [0x%lx 0x%lx] size 0x%lx\n", +prange->svms, prange->start, prange->last, start, last, size); + + if (start > prange->start) { + r = svm_range_split(prange, start, prange->last, &head); + if (r) + return r; + svm_range_add_child(parent, mm, head, SVM_OP_ADD_RANGE); + } + + if (last < prange->last) { + r = svm_range_split(prange, prange->start, last, &tail); + if (r) + return r; + svm_range_add_child(parent, mm, tail, SVM_OP_ADD_RANGE); + } + + /* xnack on, update mapping on GPUs with ACCESS_IN_PLACE */ + if (p->xnack_enabled && prange->work_item.op == SVM_OP_ADD_RANGE) { + prange->work_item.op = SVM_OP_ADD_RANGE_AND_MAP; + pr_debug("change prange 0x%p [0x%lx 0x%lx] op %d\n", +prange, prange->start, prange->last, +SVM_OP_ADD_RANGE_AND_MAP); + } + return 0; +} static bool svm_nodes_in_same_hive(struct kfd_node *node_a, struct kfd_node *node_b) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index 026863a0abcd..be11ba0c4289 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -172,6 +172,9 @@ struct kfd_node *svm_range_get_node_by_id(struct svm_range *prange, int svm_range_vram_node_new(struct kfd_node *node, struct svm_range *prange, bool clear); void svm_range_vram_node_free(struct svm_range *prange); +int svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm, + unsigned long addr, struct svm_range *parent, + struct svm_range *prange); int svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, uint32_t vmid, uint32_t node_id, uint64_t addr, bool write_fault); -- 2.35.1
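The alignment math that `svm_range_split_by_granularity()` reintroduces can be exercised in isolation. A minimal sketch, assuming power-of-two granularity blocks as in the kernel's `ALIGN`/`ALIGN_DOWN` macros (`ALIGN_UP` is named differently here only to avoid clashing with the kernel macro):

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))          /* a must be a power of two */
#define ALIGN_UP(x, a)   (ALIGN_DOWN((x) + (a) - 1, (a)))

/* Mirror of the split math: the faulting page address is widened to a
 * single granularity-sized, aligned block [start, last], so one PTE
 * can cover the whole range. */
static void granularity_block(unsigned long addr, unsigned int granularity,
                              unsigned long *start, unsigned long *last)
{
	unsigned long size = 1UL << granularity;

	*start = ALIGN_DOWN(addr, size);
	*last = ALIGN_UP(addr + 1, size) - 1;
}
```

For example, with the default granularity of 9 (512 pages), a fault at page 1000 resolves to the block [512, 1023]; the head and tail outside that block are what get added to the parent's child list.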
[PATCH 3/3] Revert "[PATCH] drm/amdkfd: Use partial migrations in GPU page faults"
This reverts commit 1fd60d88c4b57d715c0ae09794061c0cc53009e3. The change prevents migrating the entire range to VRAM because retry fault restore_pages map the remaining system memory range to GPUs. It will work correctly to submit together with partial mapping to GPU patch later. Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 150 ++- drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 6 +- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 83 +++-- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 6 +- 4 files changed, 85 insertions(+), 160 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 81d25a679427..6c25dab051d5 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -442,10 +442,10 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, goto out_free; } if (cpages != npages) - pr_debug("partial migration, 0x%lx/0x%llx pages collected\n", + pr_debug("partial migration, 0x%lx/0x%llx pages migrated\n", cpages, npages); else - pr_debug("0x%lx pages collected\n", cpages); + pr_debug("0x%lx pages migrated\n", cpages); r = svm_migrate_copy_to_vram(node, prange, &migrate, &mfence, scratch, ttm_res_offset); migrate_vma_pages(&migrate); @@ -479,8 +479,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, * svm_migrate_ram_to_vram - migrate svm range from system to device * @prange: range structure * @best_loc: the device to migrate to - * @start_mgr: start page to migrate - * @last_mgr: last page to migrate * @mm: the process mm structure * @trigger: reason of migration * @@ -491,7 +489,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct svm_range *prange, */ static int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, - unsigned long start_mgr, unsigned long last_mgr, struct mm_struct *mm, uint32_t trigger) { unsigned long addr, start, end; @@ -501,30 +498,23 @@ svm_migrate_ram_to_vram(struct 
svm_range *prange, uint32_t best_loc, unsigned long cpages = 0; long r = 0; - if (!best_loc) { - pr_debug("svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n", - prange->svms, start_mgr, last_mgr); + if (prange->actual_loc == best_loc) { + pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n", +prange->svms, prange->start, prange->last, best_loc); return 0; } - if (start_mgr < prange->start || last_mgr > prange->last) { - pr_debug("range [0x%lx 0x%lx] out prange [0x%lx 0x%lx]\n", -start_mgr, last_mgr, prange->start, prange->last); - return -EFAULT; - } - node = svm_range_get_node_by_id(prange, best_loc); if (!node) { pr_debug("failed to get kfd node by id 0x%x\n", best_loc); return -ENODEV; } - pr_debug("svms 0x%p [0x%lx 0x%lx] in [0x%lx 0x%lx] to gpu 0x%x\n", - prange->svms, start_mgr, last_mgr, prange->start, prange->last, - best_loc); + pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms, +prange->start, prange->last, best_loc); - start = start_mgr << PAGE_SHIFT; - end = (last_mgr + 1) << PAGE_SHIFT; + start = prange->start << PAGE_SHIFT; + end = (prange->last + 1) << PAGE_SHIFT; r = svm_range_vram_node_new(node, prange, true); if (r) { @@ -554,11 +544,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, if (cpages) { prange->actual_loc = best_loc; - prange->vram_pages = prange->vram_pages + cpages; - } else if (!prange->actual_loc) { - /* if no page migrated and all pages from prange are at -* sys ram drop svm_bo got from svm_range_vram_node_new -*/ + svm_range_dma_unmap(prange); + } else { svm_range_vram_node_free(prange); } @@ -676,8 +663,9 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange, * Context: Process context, caller hold mmap read lock, prange->migrate_mutex * * Return: + * 0 - success with all pages migrated * negative values - indicate error - * positive values or zero - number of pages got migrated + * positive values - partial migration, number of pages not migrated */ static long 
svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange, @@ -688,7 +676,6 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange, uint64_t npages = (end - start) >> PAGE_SHI
[PATCH 1/3] Revert "drm/amdkfd: Use partial mapping in GPU page fault recovery"
This reverts commit c45c3bc930bf60e7658f87c519a40f77513b96aa. Found KFDSVMEvict test regression on vega10, kernel BUG backtrace: [ 135.365083] amdgpu: Migration failed during eviction [ 135.365090] [ cut here ] [ 135.365097] This was not the last reference [ 135.365122] WARNING: CPU: 5 PID: 1998 at drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3515 svm_range_evict_svm_bo_worker+0x21c/0x390 [amdgpu] [ 135.365836] svm_range_evict_svm_bo_worker+0x21c/0x390 [amdgpu] [ 135.366249] process_one_work+0x298/0x590 [ 135.366256] worker_thread+0x3d/0x3d0 .. [ 135.721257] kernel BUG at include/linux/swapops.h:472! [ 135.721537] Call Trace: [ 135.721540] [ 135.721592] hmm_vma_walk_pmd+0x5c8/0x780 [ 135.721598] walk_pgd_range+0x3bc/0x7c0 [ 135.721604] __walk_page_range+0x1ec/0x200 [ 135.721609] walk_page_range+0x119/0x1a0 [ 135.721613] hmm_range_fault+0x5d/0xb0 [ 135.721617] amdgpu_hmm_range_get_pages+0x159/0x240 [amdgpu] [ 135.721820] svm_range_validate_and_map+0x57f/0x16c0 [amdgpu] [ 135.722411] svm_range_restore_pages+0xcd8/0x1150 [amdgpu] [ 135.722613] amdgpu_vm_handle_fault+0xc2/0x360 [amdgpu] [ 135.722777] gmc_v9_0_process_interrupt+0x255/0x670 [amdgpu] Signed-off-by: Philip Yang --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 35 +--- 1 file changed, 11 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index f2b33fb2afcf..4d000c63cde8 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1565,7 +1565,6 @@ static void *kfd_svm_page_owner(struct kfd_process *p, int32_t gpuidx) * 5. 
Release page table (and SVM BO) reservation */ static int svm_range_validate_and_map(struct mm_struct *mm, - unsigned long map_start, unsigned long map_last, struct svm_range *prange, int32_t gpuidx, bool intr, bool wait, bool flush_tlb) { @@ -1646,8 +1645,6 @@ static int svm_range_validate_and_map(struct mm_struct *mm, end = (prange->last + 1) << PAGE_SHIFT; for (addr = start; !r && addr < end; ) { struct hmm_range *hmm_range; - unsigned long map_start_vma; - unsigned long map_last_vma; struct vm_area_struct *vma; uint64_t vram_pages_vma; unsigned long next = 0; @@ -1696,16 +1693,9 @@ static int svm_range_validate_and_map(struct mm_struct *mm, r = -EAGAIN; } - if (!r) { - map_start_vma = max(map_start, prange->start + offset); - map_last_vma = min(map_last, prange->start + offset + npages - 1); - if (map_start_vma <= map_last_vma) { - offset = map_start_vma - prange->start; - npages = map_last_vma - map_start_vma + 1; - r = svm_range_map_to_gpus(prange, offset, npages, readonly, - ctx->bitmap, wait, flush_tlb); - } - } + if (!r) + r = svm_range_map_to_gpus(prange, offset, npages, readonly, + ctx->bitmap, wait, flush_tlb); if (!r && next == end) prange->mapped_to_gpu = true; @@ -1811,8 +1801,8 @@ static void svm_range_restore_work(struct work_struct *work) */ mutex_lock(&prange->migrate_mutex); - r = svm_range_validate_and_map(mm, prange->start, prange->last, prange, - MAX_GPU_INSTANCE, false, true, false); + r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE, + false, true, false); if (r) pr_debug("failed %d to map 0x%lx to gpus\n", r, prange->start); @@ -3026,8 +3016,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr, write_fault, timestamp); - start = prange->start; - last = prange->last; if (prange->actual_loc != 0 || best_loc != 0) { migration = true; /* Align migration range start and size to granularity size */ @@ -3061,11 +3049,10 @@ 
svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, } } - r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false, - false, false); + r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false); if (r)
Re: [PATCH] drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL
Applied. Thanks! Alex On Mon, Oct 23, 2023 at 9:06 AM wrote: > > In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file > could result in an abnormal null pointer access when the smc_rreg pointer is > NULL. Below are the steps to reproduce this issue and the corresponding > exception log: > > 1. Navigate to the directory: /sys/kernel/debug/dri/0 > 2. Execute command: cat amdgpu_regs_smc > 3. Exception Log:: > [4005007.702554] BUG: kernel NULL pointer dereference, address: > > [4005007.702562] #PF: supervisor instruction fetch in kernel mode > [4005007.702567] #PF: error_code(0x0010) - not-present page > [4005007.702570] PGD 0 P4D 0 > [4005007.702576] Oops: 0010 [#1] SMP NOPTI > [4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G OE > 5.15.0-43-generic #46-Ubunt u > [4005007.702590] RIP: 0010:0x0 > [4005007.702598] Code: Unable to access opcode bytes at RIP > 0xffd6. > [4005007.702600] RSP: 0018:a82b46d27da0 EFLAGS: 00010206 > [4005007.702605] RAX: RBX: RCX: > a82b46d27e68 > [4005007.702609] RDX: 0001 RSI: RDI: > 9940656e > [4005007.702612] RBP: a82b46d27dd8 R08: R09: > 994060c07980 > [4005007.702615] R10: 0002 R11: R12: > 7f5e06753000 > [4005007.702618] R13: 9940656e R14: a82b46d27e68 R15: > 7f5e06753000 > [4005007.702622] FS: 7f5e0755b740() GS:99479d30() > knlGS: > [4005007.702626] CS: 0010 DS: ES: CR0: 80050033 > [4005007.702629] CR2: ffd6 CR3: 0003253fc000 CR4: > 003506e0 > [4005007.702633] Call Trace: > [4005007.702636] > [4005007.702640] amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu] > [4005007.703002] full_proxy_read+0x5c/0x80 > [4005007.703011] vfs_read+0x9f/0x1a0 > [4005007.703019] ksys_read+0x67/0xe0 > [4005007.703023] __x64_sys_read+0x19/0x20 > [4005007.703028] do_syscall_64+0x5c/0xc0 > [4005007.703034] ? do_user_addr_fault+0x1e3/0x670 > [4005007.703040] ? exit_to_user_mode_prepare+0x37/0xb0 > [4005007.703047] ? irqentry_exit_to_user_mode+0x9/0x20 > [4005007.703052] ? irqentry_exit+0x19/0x30 > [4005007.703057] ? 
exc_page_fault+0x89/0x160 > [4005007.703062] ? asm_exc_page_fault+0x8/0x30 > [4005007.703068] entry_SYSCALL_64_after_hwframe+0x44/0xae > [4005007.703075] RIP: 0033:0x7f5e07672992 > [4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 > 00 0f 1f 44 00 00 f3 0f1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f > 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 e c 28 48 89 54 24 > [4005007.703083] RSP: 002b:7ffe03097898 EFLAGS: 0246 ORIG_RAX: > > [4005007.703088] RAX: ffda RBX: 0002 RCX: > 7f5e07672992 > [4005007.703091] RDX: 0002 RSI: 7f5e06753000 RDI: > 0003 > [4005007.703094] RBP: 7f5e06753000 R08: 7f5e06752010 R09: > 7f5e06752010 > [4005007.703096] R10: 0022 R11: 0246 R12: > 00022000 > [4005007.703099] R13: 0003 R14: 0002 R15: > 0002 > [4005007.703105] > [4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash > af_alg binfmt_misc nls_ iso8859_1 ipmi_ssif ast intel_rapl_msr > intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t tm > edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl > sch_fq_codel ipmi_devintf ipm i_msghandler msr parport_pc ppdev lp > parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo mon > ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) > amdttm(OE) iommu_v 2 amd_sched(OE) amdkcl(OE) drm_kms_helper > syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_coredrm igb ahci > xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca > [4005007.703184] CR2: > [4005007.703188] ---[ end trace ac65a538d240da39 ]--- > [4005007.800865] RIP: 0010:0x0 > [4005007.800871] Code: Unable to access opcode bytes at RIP > 0xffd6. 
> [4005007.800874] RSP: 0018:a82b46d27da0 EFLAGS: 00010206 > [4005007.800878] RAX: RBX: RCX: > a82b46d27e68 > [4005007.800881] RDX: 0001 RSI: RDI: > 9940656e > [4005007.800883] RBP: a82b46d27dd8 R08: R09: > 994060c07980 > [4005007.800886] R10: 0002 R11: R12: > 7f5e06753000 > [4005007.800888] R13: 9940656e R14: a82b46d27e68 R15: > 7f5e06753000 > [4005007.800891] FS: 7f5e0755b740() GS:99479d30() > knlGS: > [4005007.800895] CS: 0010 DS: ES: CR0: 80050033 > [4005007.800898] CR2: ffd6 CR3: 0003253fc000 CR4: > 003506e0 > > Signed-
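The quoted mail contains only the oops, not the fix hunk, so the following is a hypothetical sketch of the class of fix: the debugfs read handler must not call through `adev->smc_rreg` when the ASIC (e.g. VEGA20) never installed one. The function name, the callback typedef, and the `-EOPNOTSUPP` choice are all assumptions:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard: return an error instead of dereferencing a NULL
 * register-read callback.  Mirrors the failure mode in the oops, where
 * RIP was 0x0 after an indirect call through a NULL smc_rreg. */
typedef uint32_t (*smc_rreg_t)(void *adev, uint32_t reg);

static int read_smc_reg(void *adev, smc_rreg_t smc_rreg,
                        uint32_t reg, uint32_t *out)
{
	if (!smc_rreg)
		return -EOPNOTSUPP; /* no SMC register interface on this chip */

	*out = smc_rreg(adev, reg);
	return 0;
}
```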
Re: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely
[Public] Acked-by: Alex Deucher From: amd-gfx on behalf of James Zhu Sent: Thursday, September 7, 2023 10:41 AM To: amd-gfx@lists.freedesktop.org Cc: Lin, Amber ; Zhu, James ; Kamal, Asad Subject: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely amdxcp unloads incompletely, and below error will be seen during load/unload, sysfs: cannot create duplicate filename '/devices/platform/amdgpu_xcp.0' devres_release_group will free xcp device at first, platform device will be unregistered later in platform_device_unregister. Signed-off-by: James Zhu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c index 353597fc908d..90ddd8371176 100644 --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c @@ -89,9 +89,10 @@ EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc); void amdgpu_xcp_drv_release(void) { for (--pdev_num; pdev_num >= 0; --pdev_num) { - devres_release_group(&xcp_dev[pdev_num]->pdev->dev, NULL); - platform_device_unregister(xcp_dev[pdev_num]->pdev); - xcp_dev[pdev_num]->pdev = NULL; + struct platform_device *pdev = xcp_dev[pdev_num]->pdev; + + devres_release_group(&pdev->dev, NULL); + platform_device_unregister(pdev); xcp_dev[pdev_num] = NULL; } pdev_num = 0; -- 2.34.1
Re: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects
On Sat, Oct 21, 2023 at 8:02 PM Lijo Lazar wrote: > > PCI domain/segment information of xccs is available through ACPI DSM > methods. Consider that also while looking for devices. > > Signed-off-by: Lijo Lazar Acked-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +--- > 1 file changed, 22 insertions(+), 18 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > index 2bca37044ad0..d62e49758635 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c > @@ -68,7 +68,7 @@ struct amdgpu_acpi_xcc_info { > struct amdgpu_acpi_dev_info { > struct list_head list; > struct list_head xcc_list; > - uint16_t bdf; > + uint32_t sbdf; > uint16_t supp_xcp_mode; > uint16_t xcp_mode; > uint16_t mem_mode; > @@ -927,7 +927,7 @@ static acpi_status amdgpu_acpi_get_node_id(acpi_handle > handle, > #endif > } > > -static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u16 bdf) > +static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u32 sbdf) > { > struct amdgpu_acpi_dev_info *acpi_dev; > > @@ -935,14 +935,14 @@ static struct amdgpu_acpi_dev_info > *amdgpu_acpi_get_dev(u16 bdf) > return NULL; > > list_for_each_entry(acpi_dev, &amdgpu_acpi_dev_list, list) > - if (acpi_dev->bdf == bdf) > + if (acpi_dev->sbdf == sbdf) > return acpi_dev; > > return NULL; > } > > static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info, > - struct amdgpu_acpi_xcc_info *xcc_info, u16 > bdf) > + struct amdgpu_acpi_xcc_info *xcc_info, u32 > sbdf) > { > struct amdgpu_acpi_dev_info *tmp; > union acpi_object *obj; > @@ -955,7 +955,7 @@ static int amdgpu_acpi_dev_init(struct > amdgpu_acpi_dev_info **dev_info, > > INIT_LIST_HEAD(&tmp->xcc_list); > INIT_LIST_HEAD(&tmp->list); > - tmp->bdf = bdf; > + tmp->sbdf = sbdf; > > obj = acpi_evaluate_dsm_typed(xcc_info->handle, &amd_xcc_dsm_guid, 0, > AMD_XCC_DSM_GET_SUPP_MODE, NULL, > @@ -1007,7 +1007,7 @@ static int 
amdgpu_acpi_dev_init(struct > amdgpu_acpi_dev_info **dev_info, > > DRM_DEBUG_DRIVER( > "New dev(%x): Supported xcp mode: %x curr xcp_mode : %x mem > mode : %x, tmr base: %llx tmr size: %llx ", > - tmp->bdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode, > + tmp->sbdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode, > tmp->tmr_base, tmp->tmr_size); > list_add_tail(&tmp->list, &amdgpu_acpi_dev_list); > *dev_info = tmp; > @@ -1023,7 +1023,7 @@ static int amdgpu_acpi_dev_init(struct > amdgpu_acpi_dev_info **dev_info, > } > > static int amdgpu_acpi_get_xcc_info(struct amdgpu_acpi_xcc_info *xcc_info, > - u16 *bdf) > + u32 *sbdf) > { > union acpi_object *obj; > acpi_status status; > @@ -1054,8 +1054,10 @@ static int amdgpu_acpi_get_xcc_info(struct > amdgpu_acpi_xcc_info *xcc_info, > xcc_info->phy_id = (obj->integer.value >> 32) & 0xFF; > /* xcp node of this xcc [47:40] */ > xcc_info->xcp_node = (obj->integer.value >> 40) & 0xFF; > + /* PF domain of this xcc [31:16] */ > + *sbdf = (obj->integer.value) & 0x; > /* PF bus/dev/fn of this xcc [63:48] */ > - *bdf = (obj->integer.value >> 48) & 0x; > + *sbdf |= (obj->integer.value >> 48) & 0x; > ACPI_FREE(obj); > obj = NULL; > > @@ -1079,7 +1081,7 @@ static int amdgpu_acpi_enumerate_xcc(void) > struct acpi_device *acpi_dev; > char hid[ACPI_ID_LEN]; > int ret, id; > - u16 bdf; > + u32 sbdf; > > INIT_LIST_HEAD(&amdgpu_acpi_dev_list); > xa_init(&numa_info_xa); > @@ -1107,16 +1109,16 @@ static int amdgpu_acpi_enumerate_xcc(void) > xcc_info->handle = acpi_device_handle(acpi_dev); > acpi_dev_put(acpi_dev); > > - ret = amdgpu_acpi_get_xcc_info(xcc_info, &bdf); > + ret = amdgpu_acpi_get_xcc_info(xcc_info, &sbdf); > if (ret) { > kfree(xcc_info); > continue; > } > > - dev_info = amdgpu_acpi_get_dev(bdf); > + dev_info = amdgpu_acpi_get_dev(sbdf); > > if (!dev_info) > - ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, bdf); > + ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, sbdf); > > if (ret == -ENOMEM) > return ret; > @@ -1136,13 
+1138,14 @@ int amdgpu_acpi_get_tmr_info(struct amdgpu_device > *adev, u64 *tmr_offset, >
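The field layout parsed out of the 64-bit `_DSM` integer — PCI domain in bits [31:16], bus/dev/fn in bits [63:48] — can be sketched standalone. Note the archived diff lost the hex constants (they appear as a bare `0x`), so the masks below are inferred from the bit-range comments and should be treated as an assumption:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the comments in the patch:
 *   [31:16]  PCI domain (segment) of the xcc
 *   [63:48]  PCI bus/dev/fn of the xcc
 * The combined sbdf keeps the domain in its upper 16 bits and the bdf
 * in its lower 16 bits, matching the patch's widening of u16 bdf to
 * u32 sbdf. */
static uint32_t dsm_value_to_sbdf(uint64_t v)
{
	uint32_t sbdf;

	sbdf = (uint32_t)(v & 0xFFFF0000u);      /* domain -> bits [31:16] */
	sbdf |= (uint32_t)((v >> 48) & 0xFFFFu); /* bdf    -> bits [15:0]  */
	return sbdf;
}
```

With this, two xccs on different PCI segments but identical bus/dev/fn no longer collide in `amdgpu_acpi_get_dev()`, which is the point of the patch.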
Re: [PATCH] drm/amdkfd: Address 'remap_list' not described in 'svm_range_add'
On 2023-10-23 12:12, Srinivasan Shanmugam wrote: Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2073: warning: Function parameter or member 'remap_list' not described in 'svm_range_add' Cc: Felix Kuehling Cc: Christian König Cc: Alex Deucher Cc: "Pan, Xinhui" Signed-off-by: Srinivasan Shanmugam Reviewed-by: Felix Kuehling --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index f2b33fb2afcf..f43dedf3e240 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2046,6 +2046,7 @@ svm_range_split_new(struct svm_range_list *svms, uint64_t start, uint64_t last, * @update_list: output, the ranges need validate and update GPU mapping * @insert_list: output, the ranges need insert to svms * @remove_list: output, the ranges are replaced and need remove from svms + * @remap_list: output, remap unaligned svm ranges * * Check if the virtual address range has overlap with any existing ranges, * split partly overlapping ranges and add new ranges in the gaps. All changes
Re: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects
[AMD Official Use Only - General] Thanks, Lijo From: amd-gfx on behalf of Lijo Lazar Sent: Friday, October 20, 2023 8:44:22 PM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Kasiviswanathan, Harish ; Zhang, Hawking Subject: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects PCI domain/segment information of xccs is available through ACPI DSM methods. Consider that also while looking for devices. Signed-off-by: Lijo Lazar --- drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +--- 1 file changed, 22 insertions(+), 18 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c index 2bca37044ad0..d62e49758635 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c @@ -68,7 +68,7 @@ struct amdgpu_acpi_xcc_info { struct amdgpu_acpi_dev_info { struct list_head list; struct list_head xcc_list; - uint16_t bdf; + uint32_t sbdf; uint16_t supp_xcp_mode; uint16_t xcp_mode; uint16_t mem_mode; @@ -927,7 +927,7 @@ static acpi_status amdgpu_acpi_get_node_id(acpi_handle handle, #endif } -static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u16 bdf) +static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u32 sbdf) { struct amdgpu_acpi_dev_info *acpi_dev; @@ -935,14 +935,14 @@ static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u16 bdf) return NULL; list_for_each_entry(acpi_dev, &amdgpu_acpi_dev_list, list) - if (acpi_dev->bdf == bdf) + if (acpi_dev->sbdf == sbdf) return acpi_dev; return NULL; } static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info, - struct amdgpu_acpi_xcc_info *xcc_info, u16 bdf) + struct amdgpu_acpi_xcc_info *xcc_info, u32 sbdf) { struct amdgpu_acpi_dev_info *tmp; union acpi_object *obj; @@ -955,7 +955,7 @@ static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info, INIT_LIST_HEAD(&tmp->xcc_list); INIT_LIST_HEAD(&tmp->list); - tmp->bdf = bdf; + tmp->sbdf = sbdf; obj = acpi_evaluate_dsm_typed(xcc_info->handle, 
&amd_xcc_dsm_guid, 0, AMD_XCC_DSM_GET_SUPP_MODE, NULL, @@ -1007,7 +1007,7 @@ static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info, DRM_DEBUG_DRIVER( "New dev(%x): Supported xcp mode: %x curr xcp_mode : %x mem mode : %x, tmr base: %llx tmr size: %llx ", - tmp->bdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode, + tmp->sbdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode, tmp->tmr_base, tmp->tmr_size); list_add_tail(&tmp->list, &amdgpu_acpi_dev_list); *dev_info = tmp; @@ -1023,7 +1023,7 @@ static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info, } static int amdgpu_acpi_get_xcc_info(struct amdgpu_acpi_xcc_info *xcc_info, - u16 *bdf) + u32 *sbdf) { union acpi_object *obj; acpi_status status; @@ -1054,8 +1054,10 @@ static int amdgpu_acpi_get_xcc_info(struct amdgpu_acpi_xcc_info *xcc_info, xcc_info->phy_id = (obj->integer.value >> 32) & 0xFF; /* xcp node of this xcc [47:40] */ xcc_info->xcp_node = (obj->integer.value >> 40) & 0xFF; + /* PF domain of this xcc [31:16] */ + *sbdf = (obj->integer.value) & 0x; /* PF bus/dev/fn of this xcc [63:48] */ - *bdf = (obj->integer.value >> 48) & 0x; + *sbdf |= (obj->integer.value >> 48) & 0x; ACPI_FREE(obj); obj = NULL; @@ -1079,7 +1081,7 @@ static int amdgpu_acpi_enumerate_xcc(void) struct acpi_device *acpi_dev; char hid[ACPI_ID_LEN]; int ret, id; - u16 bdf; + u32 sbdf; INIT_LIST_HEAD(&amdgpu_acpi_dev_list); xa_init(&numa_info_xa); @@ -1107,16 +1109,16 @@ static int amdgpu_acpi_enumerate_xcc(void) xcc_info->handle = acpi_device_handle(acpi_dev); acpi_dev_put(acpi_dev); - ret = amdgpu_acpi_get_xcc_info(xcc_info, &bdf); + ret = amdgpu_acpi_get_xcc_info(xcc_info, &sbdf); if (ret) { kfree(xcc_info); continue; } - dev_info = amdgpu_acpi_get_dev(bdf); + dev_info = amdgpu_acpi_get_dev(sbdf); if (!dev_info) - ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, bdf); + ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, sbdf); if (ret == -ENOMEM) return ret; @@ -1136,13 +1138,14 @@ int amdgpu_acpi_get_tmr_info
Re: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely
[AMD Official Use Only - General] ping ... Thanks & Best Regards! James Zhu From: Zhu, James Sent: Thursday, September 7, 2023 10:41 AM To: amd-gfx@lists.freedesktop.org Cc: Kamal, Asad ; Lin, Amber ; Zhu, James Subject: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely amdxcp unloads incompletely, and below error will be seen during load/unload, sysfs: cannot create duplicate filename '/devices/platform/amdgpu_xcp.0' devres_release_group will free xcp device at first, platform device will be unregistered later in platform_device_unregister. Signed-off-by: James Zhu --- drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c index 353597fc908d..90ddd8371176 100644 --- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c +++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c @@ -89,9 +89,10 @@ EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc); void amdgpu_xcp_drv_release(void) { for (--pdev_num; pdev_num >= 0; --pdev_num) { - devres_release_group(&xcp_dev[pdev_num]->pdev->dev, NULL); - platform_device_unregister(xcp_dev[pdev_num]->pdev); - xcp_dev[pdev_num]->pdev = NULL; + struct platform_device *pdev = xcp_dev[pdev_num]->pdev; + + devres_release_group(&pdev->dev, NULL); + platform_device_unregister(pdev); xcp_dev[pdev_num] = NULL; } pdev_num = 0; -- 2.34.1
[PATCH] drm/amdkfd: Address 'remap_list' not described in 'svm_range_add'
Fixes the below: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2073: warning: Function parameter or member 'remap_list' not described in 'svm_range_add' Cc: Felix Kuehling Cc: Christian König Cc: Alex Deucher Cc: "Pan, Xinhui" Signed-off-by: Srinivasan Shanmugam --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index f2b33fb2afcf..f43dedf3e240 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -2046,6 +2046,7 @@ svm_range_split_new(struct svm_range_list *svms, uint64_t start, uint64_t last, * @update_list: output, the ranges need validate and update GPU mapping * @insert_list: output, the ranges need insert to svms * @remove_list: output, the ranges are replaced and need remove from svms + * @remap_list: output, remap unaligned svm ranges * * Check if the virtual address range has overlap with any existing ranges, * split partly overlapping ranges and add new ranges in the gaps. All changes -- 2.34.1
Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"
Am 23.10.23 um 15:06 schrieb Daniel Tang: That commit causes the screen to freeze a few moments after running clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer deference message in dmesg and the process to become a zombie whose unkillableness prevents shutdown without REISUB. Although llama.cpp and hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by this revert, pytorch-rocm is now working with stability and without whole-computer freezes caused by any accidental running of clinfo. This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. That result doesn't make much sense. Felix please correct me, but AFAIK the ATS stuff was completely removed by now. Are you sure that this is pure v6.6-rc7 and not some other patches applied? If yes than we must have missed something. Regards, Christian. Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 Signed-off-by: Daniel Tang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 82f25996ff5e..602f311ab766 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) if (r) return r; + /* Sanity checks */ + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { + r = -EINVAL; + goto unreserve_bo; + } + /* Check if PD needs to be reinitialized and do it before * changing any other state, in case it fails. */ if (pte_support_ats != vm->pte_support_ats) { - /* Sanity checks */ - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { - r = -EINVAL; - goto unreserve_bo; - } - vm->pte_support_ats = pte_support_ats; r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), false); -- 2.40.1
Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue on smu 13
On Sun, Oct 22, 2023 at 9:05 PM Feng, Kenneth wrote: > > [AMD Official Use Only - General] > > Thanks Alex, I will make another patch. > And please refer to the comments inline below. > > > -Original Message- > From: Alex Deucher > Sent: Friday, October 20, 2023 9:58 PM > To: Feng, Kenneth > Cc: amd-gfx@lists.freedesktop.org; Wang, Yang(Kevin) > Subject: Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue > on smu 13 > > Caution: This message originated from an External Source. Use proper caution > when opening attachments, clicking links, or responding. > > > On Fri, Oct 20, 2023 at 4:32 AM Kenneth Feng wrote: > > > > fix the high voltage and temperature issue after the driver is > > unloaded on smu 13.0.0, smu 13.0.7 and smu 13.0.10 > > > > Signed-off-by: Kenneth Feng > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 36 +++ > > drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c| 4 +-- > > drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 27 -- > > drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h | 1 + > > drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h | 2 ++ > > .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 13 +++ > > .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c | 8 - > > .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c | 8 - > > 8 files changed, 86 insertions(+), 13 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index 31f8c3ead161..c5c892a8b3f9 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -3986,13 +3986,23 @@ int amdgpu_device_init(struct amdgpu_device *adev, > > } > > } > > } else { > > - tmp = amdgpu_reset_method; > > - /* It should do a default reset when loading or > > reloading the driver, > > -* regardless of the module parameter reset_method. 
> > -*/ > > - amdgpu_reset_method = AMD_RESET_METHOD_NONE; > > - r = amdgpu_asic_reset(adev); > > - amdgpu_reset_method = tmp; > > + switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) { > > + case IP_VERSION(13, 0, 0): > > + case IP_VERSION(13, 0, 7): > > + case IP_VERSION(13, 0, 10): > > + r = psp_gpu_reset(adev); > > + break; > > + default: > > + tmp = amdgpu_reset_method; > > + /* It should do a default reset when > > loading or reloading the driver, > > +* regardless of the module parameter > > reset_method. > > +*/ > > + amdgpu_reset_method = AMD_RESET_METHOD_NONE; > > + r = amdgpu_asic_reset(adev); > > + amdgpu_reset_method = tmp; > > + break; > > + } > > + > > if (r) { > > dev_err(adev->dev, "asic reset on init > > failed\n"); > > goto failed; @@ -5945,6 +5955,18 @@ > > int amdgpu_device_baco_exit(struct drm_device *dev) > > return -ENOTSUPP; > > > > ret = amdgpu_dpm_baco_exit(adev); > > + > > + if (!ret) > > + switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) { > > + case IP_VERSION(13, 0, 0): > > + case IP_VERSION(13, 0, 7): > > + case IP_VERSION(13, 0, 10): > > + adev->gfx.is_poweron = false; > > + break; > > + default: > > + break; > > + } > > Maybe better to move this into smu_v13_0_0_baco_exit() so we keep the asic > specific details out of the common files? > > > + > > if (ret) > > return ret; > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > > index 80ca2c05b0b8..3ad38e42773b 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c > > @@ -73,7 +73,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device > > *adev, > > * fini/suspend, so the overall state doesn't > > * change over the course of suspend/resume. > > */ > > - if (!adev->in_s0ix) > > + if (!adev->in_s0ix && adev->gfx.is_poweron) > > amdgpu_gmc_set_vm_fault_masks(adev, > > AMDGPU_GFXHUB(0), false); > > break; > > case AMDGPU_IRQ_STATE_ENABLE: > > @@ -85,7 +85,7 @@ gmc_v11_0_vm
RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
[Public] > -Original Message- > From: Deucher, Alexander > Sent: Monday, October 23, 2023 09:22 > To: Limonciello, Mario ; amd- > g...@lists.freedesktop.org > Cc: Limonciello, Mario ; > paolo.gent...@canonical.com > Subject: RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems > > [Public] > > > -Original Message- > > From: amd-gfx On Behalf Of > Mario > > Limonciello > > Sent: Monday, October 23, 2023 9:45 AM > > To: amd-gfx@lists.freedesktop.org > > Cc: Limonciello, Mario ; > > paolo.gent...@canonical.com > > Subject: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems > > > > Originally we were quirking ASPM disabled specifically for VI when used with > > Alder Lake, but it appears to have problems with Rocket Lake as well. > > > > Like we've done in the case of dpm for newer platforms, disable ASPM for all > > Intel systems. > > > > Cc: sta...@vger.kernel.org # 5.15+ > > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") > > Reported-and-tested-by: Paolo Gentili > > Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742 > > Signed-off-by: Mario Limonciello > > Reviewed-by: Alex Deucher > > As a follow on, we probably want to apply this to all of the program_aspm() > functions for each asic family. > Yeah; I had that thought too but wanted to have a narrow patch for fixes and stable first. I will merge and send a follow up for that. 
> Alex > > > --- > > drivers/gpu/drm/amd/amdgpu/vi.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c > > b/drivers/gpu/drm/amd/amdgpu/vi.c index 6a8494f98d3e..fe8ba9e9837b > > 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/vi.c > > +++ b/drivers/gpu/drm/amd/amdgpu/vi.c > > @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct > > amdgpu_device *adev) > > bool bL1SS = false; > > bool bClkReqSupport = true; > > > > - if (!amdgpu_device_should_use_aspm(adev) || > > !amdgpu_device_aspm_support_quirk()) > > + if (!amdgpu_device_should_use_aspm(adev) || > > +!amdgpu_device_pcie_dynamic_switching_supported()) > > return; > > > > if (adev->flags & AMD_IS_APU || > > -- > > 2.34.1 >
RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
[Public] > -Original Message- > From: amd-gfx On Behalf Of Mario > Limonciello > Sent: Monday, October 23, 2023 9:45 AM > To: amd-gfx@lists.freedesktop.org > Cc: Limonciello, Mario ; > paolo.gent...@canonical.com > Subject: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems > > Originally we were quirking ASPM disabled specifically for VI when used with > Alder Lake, but it appears to have problems with Rocket Lake as well. > > Like we've done in the case of dpm for newer platforms, disable ASPM for all > Intel systems. > > Cc: sta...@vger.kernel.org # 5.15+ > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") > Reported-and-tested-by: Paolo Gentili > Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742 > Signed-off-by: Mario Limonciello Reviewed-by: Alex Deucher As a follow on, we probably want to apply this to all of the program_aspm() functions for each asic family. Alex > --- > drivers/gpu/drm/amd/amdgpu/vi.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c > b/drivers/gpu/drm/amd/amdgpu/vi.c index 6a8494f98d3e..fe8ba9e9837b > 100644 > --- a/drivers/gpu/drm/amd/amdgpu/vi.c > +++ b/drivers/gpu/drm/amd/amdgpu/vi.c > @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct > amdgpu_device *adev) > bool bL1SS = false; > bool bClkReqSupport = true; > > - if (!amdgpu_device_should_use_aspm(adev) || > !amdgpu_device_aspm_support_quirk()) > + if (!amdgpu_device_should_use_aspm(adev) || > +!amdgpu_device_pcie_dynamic_switching_supported()) > return; > > if (adev->flags & AMD_IS_APU || > -- > 2.34.1
RE: [PATCH v2 00/24] DC Patches October 18, 2023
[Public] Hi all, This week this patchset was tested on the following systems: * Lenovo ThinkBook T13s Gen4 with AMD Ryzen 5 6600U * MSI Gaming X Trio RX 6800 * Gigabyte Gaming OC RX 7900 XTX These systems were tested on the following display/connection types: * eDP, (1080p 60hz [5650U]) (1920x1200 60hz [6600U]) (2560x1600 120hz[6600U]) * VGA and DVI (1680x1050 60hz [DP to VGA/DVI, USB-C to VGA/DVI]) * DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz, 4k 240hz [Includes USB-C to DP/HDMI adapters]) * Thunderbolt (LG Ultrafine 5k) * MST (Startech MST14DP123DP [DP to 3x DP] and 2x 4k 60Hz displays) * DSC (with Cable Matters 101075 [DP to 3x DP] with 3x 4k60 displays, and HP Hook G2 with 1 4k60 display) * USB 4 (Kensington SD5700T and 1x 4k 60Hz display) * PCON (Club3D CAC-1085 and 1x 4k 144Hz display [at 4k 120HZ, as that is the max the adapter supports]) The testing is a mix of automated and manual tests. Manual testing includes (but is not limited to): * Changing display configurations and settings * Benchmark testing * Feature testing (Freesync, etc.) Automated testing includes (but is not limited to): * Script testing (scripts to automate some of the manual checks) * IGT testing The patchset consists of the amd-staging-drm-next branch (Head commit - 310b5f1a3c9eb1ed96e437ead40f900f3b7bf530 -> drm/amd/display: Revert "drm/amd/display: Use drm_connector in create_validate_stream_for_sink") with new patches added on top of it. Tested on Ubuntu 22.04.3, on Wayland and X11, using KDE Plasma and Gnome. Tested-by: Daniel Wheeler Thank you, Dan Wheeler Sr. 
Technologist | AMD SW Display -- 1 Commerce Valley Dr E, Thornhill, ON L3T 7X6 amd.com -Original Message- From: roman...@amd.com Sent: Thursday, October 19, 2023 9:32 AM To: amd-gfx@lists.freedesktop.org Cc: Wentland, Harry ; Li, Sun peng (Leo) ; Siqueira, Rodrigo ; Pillai, Aurabindo ; Li, Roman ; Lin, Wayne ; Wang, Chao-kai (Stylon) ; Kotarac, Pavle ; Gutierrez, Agustin ; Chung, ChiaHsuan (Tom) ; Wu, Hersen ; Zuo, Jerry ; Li, Roman ; Wheeler, Daniel Subject: [PATCH v2 00/24] DC Patches October 18, 2023 From: Roman Li This DC patchset brings improvements in multiple areas. In summary, we highlight: * Fixes null-deref regression after "drm/amd/display: Update OPP counter from new interface" * Fixes display flashing when VSR and HDR enabled on dcn32 * Fixes dcn3x intermittent hangs due to FPO * Fixes MST Multi-Stream light up on dcn35 * Fixes green screen on DCN31x when DVI and HDMI monitors attached * Adds DML2 improvements * Adds idle power optimization improvements * Accommodates panels with lower nit backlight * Updates SDP VSC colorimetry from DP test automation request * Reverts "drm/amd/display: allow edp updates for virtual signal" Cc: Daniel Wheeler Agustin Gutierrez (1): drm/amd/display: Remove power sequencing check Alex Hung (2): drm/amd/display: Revert "drm/amd/display: allow edp updates for virtual signal" drm/amd/display: Set emulated sink type to HDMI accordingly. 
Alvin Lee (1): drm/amd/display: Update FAMS sequence for DCN30 & DCN32 Aric Cyr (1): drm/amd/display: 3.2.256 Aurabindo Pillai (1): drm/amd/display: add interface to query SubVP status Fangzhi Zuo (1): drm/amd/display: Fix MST Multi-Stream Not Lighting Up on dcn35 George Shen (1): drm/amd/display: Update SDP VSC colorimetry from DP test automation request Hugo Hu (1): drm/amd/display: reprogram det size while seamless boot Ilya Bakoulin (1): drm/amd/display: Fix shaper using bad LUT params Iswara Nagulendran (1): drm/amd/display: Read before writing Backlight Mode Set Register Michael Strauss (1): drm/amd/display: Disable SYMCLK32_SE RCO on DCN314 Nicholas Kazlauskas (2): drm/amd/display: Revert "Improve x86 and dmub ips handshake" drm/amd/display: Fix IPS handshake for idle optimizations Rodrigo Siqueira (3): drm/amd/display: Correct enum typo drm/amd/display: Add prefix to amdgpu crtc functions drm/amd/display: Add prefix for plane functions Samson Tam (2): drm/amd/display: fix num_ways overflow error drm/amd/display: add null check for invalid opps Sung Joon Kim (2): drm/amd/display: Add a check for idle power optimization drm/amd/display: Fix HDMI framepack 3D test issue Swapnil Patel (1): drm/amd/display: Reduce default backlight min from 5 nits to 1 nits Wenjing Liu (2): drm/amd/display: add pipe resource management callbacks to DML2 drm/amd/display: implement map dc pipe with callback in DML2 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 5 +- .../amd/display/amdgpu_dm/amdgpu_dm_crtc.c| 48 +- .../amd/display/amdgpu_dm/amdgpu_dm_debugfs.c | 4 + .../amd/display/amdgpu_dm/amdgpu_dm_plane.c | 542 +- ...
[PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
Originally we were quirking ASPM disabled specifically for VI when used with Alder Lake, but it appears to have problems with Rocket Lake as well. Like we've done in the case of dpm for newer platforms, disable ASPM for all Intel systems. Cc: sta...@vger.kernel.org # 5.15+ Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default") Reported-and-tested-by: Paolo Gentili Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742 Signed-off-by: Mario Limonciello --- drivers/gpu/drm/amd/amdgpu/vi.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c index 6a8494f98d3e..fe8ba9e9837b 100644 --- a/drivers/gpu/drm/amd/amdgpu/vi.c +++ b/drivers/gpu/drm/amd/amdgpu/vi.c @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct amdgpu_device *adev) bool bL1SS = false; bool bClkReqSupport = true; - if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_aspm_support_quirk()) + if (!amdgpu_device_should_use_aspm(adev) || !amdgpu_device_pcie_dynamic_switching_supported()) return; if (adev->flags & AMD_IS_APU || -- 2.34.1
[PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"
That commit causes the screen to freeze a few moments after running clinfo on v6.6-rc7 and ROCm 5.6. Sometimes the rest of the computer including ssh also freezes. On v6.5-rc1, it only results in a NULL pointer deference message in dmesg and the process to become a zombie whose unkillableness prevents shutdown without REISUB. Although llama.cpp and hashcat were working in v6.2 and ROCm 5.6, broke, and are not fixed by this revert, pytorch-rocm is now working with stability and without whole-computer freezes caused by any accidental running of clinfo. This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639. Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596 Signed-off-by: Daniel Tang --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 82f25996ff5e..602f311ab766 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm) if (r) return r; + /* Sanity checks */ + if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { + r = -EINVAL; + goto unreserve_bo; + } + /* Check if PD needs to be reinitialized and do it before * changing any other state, in case it fails. */ if (pte_support_ats != vm->pte_support_ats) { - /* Sanity checks */ - if (!amdgpu_vm_pt_is_root_clean(adev, vm)) { - r = -EINVAL; - goto unreserve_bo; - } - vm->pte_support_ats = pte_support_ats; r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo), false); -- 2.40.1
[PATCH] drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL
In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file could result in an abnormal null pointer access when the smc_rreg pointer is NULL. Below are the steps to reproduce this issue and the corresponding exception log: 1. Navigate to the directory: /sys/kernel/debug/dri/0 2. Execute command: cat amdgpu_regs_smc 3. Exception Log:: [4005007.702554] BUG: kernel NULL pointer dereference, address: [4005007.702562] #PF: supervisor instruction fetch in kernel mode [4005007.702567] #PF: error_code(0x0010) - not-present page [4005007.702570] PGD 0 P4D 0 [4005007.702576] Oops: 0010 [#1] SMP NOPTI [4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G OE 5.15.0-43-generic #46-Ubunt u [4005007.702590] RIP: 0010:0x0 [4005007.702598] Code: Unable to access opcode bytes at RIP 0xffd6. [4005007.702600] RSP: 0018:a82b46d27da0 EFLAGS: 00010206 [4005007.702605] RAX: RBX: RCX: a82b46d27e68 [4005007.702609] RDX: 0001 RSI: RDI: 9940656e [4005007.702612] RBP: a82b46d27dd8 R08: R09: 994060c07980 [4005007.702615] R10: 0002 R11: R12: 7f5e06753000 [4005007.702618] R13: 9940656e R14: a82b46d27e68 R15: 7f5e06753000 [4005007.702622] FS: 7f5e0755b740() GS:99479d30() knlGS: [4005007.702626] CS: 0010 DS: ES: CR0: 80050033 [4005007.702629] CR2: ffd6 CR3: 0003253fc000 CR4: 003506e0 [4005007.702633] Call Trace: [4005007.702636] [4005007.702640] amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu] [4005007.703002] full_proxy_read+0x5c/0x80 [4005007.703011] vfs_read+0x9f/0x1a0 [4005007.703019] ksys_read+0x67/0xe0 [4005007.703023] __x64_sys_read+0x19/0x20 [4005007.703028] do_syscall_64+0x5c/0xc0 [4005007.703034] ? do_user_addr_fault+0x1e3/0x670 [4005007.703040] ? exit_to_user_mode_prepare+0x37/0xb0 [4005007.703047] ? irqentry_exit_to_user_mode+0x9/0x20 [4005007.703052] ? irqentry_exit+0x19/0x30 [4005007.703057] ? exc_page_fault+0x89/0x160 [4005007.703062] ? 
asm_exc_page_fault+0x8/0x30 [4005007.703068] entry_SYSCALL_64_after_hwframe+0x44/0xae [4005007.703075] RIP: 0033:0x7f5e07672992 [4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 00 0f 1f 44 00 00 f3 0f1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 e c 28 48 89 54 24 [4005007.703083] RSP: 002b:7ffe03097898 EFLAGS: 0246 ORIG_RAX: [4005007.703088] RAX: ffda RBX: 0002 RCX: 7f5e07672992 [4005007.703091] RDX: 0002 RSI: 7f5e06753000 RDI: 0003 [4005007.703094] RBP: 7f5e06753000 R08: 7f5e06752010 R09: 7f5e06752010 [4005007.703096] R10: 0022 R11: 0246 R12: 00022000 [4005007.703099] R13: 0003 R14: 0002 R15: 0002 [4005007.703105] [4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash af_alg binfmt_misc nls_ iso8859_1 ipmi_ssif ast intel_rapl_msr intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t tm edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl sch_fq_codel ipmi_devintf ipm i_msghandler msr parport_pc ppdev lp parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo mon ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) iommu_v 2 amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_coredrm igb ahci xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca [4005007.703184] CR2: [4005007.703188] ---[ end trace ac65a538d240da39 ]--- [4005007.800865] RIP: 0010:0x0 [4005007.800871] Code: Unable to access opcode bytes at RIP 0xffd6. 
[4005007.800874] RSP: 0018:a82b46d27da0 EFLAGS: 00010206 [4005007.800878] RAX: RBX: RCX: a82b46d27e68 [4005007.800881] RDX: 0001 RSI: RDI: 9940656e [4005007.800883] RBP: a82b46d27dd8 R08: R09: 994060c07980 [4005007.800886] R10: 0002 R11: R12: 7f5e06753000 [4005007.800888] R13: 9940656e R14: a82b46d27e68 R15: 7f5e06753000 [4005007.800891] FS: 7f5e0755b740() GS:99479d30() knlGS: [4005007.800895] CS: 0010 DS: ES: CR0: 80050033 [4005007.800898] CR2: ffd6 CR3: 0003253fc000 CR4: 003506e0 Signed-off-by: Qu Huang --- drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c index a4faea4..05405da 100644 --- a/drivers/gpu/drm/
Re: [PATCH 1/2] drm/amdgpu: handle the return for sync wait
Am 20.10.23 um 11:59 schrieb Emily Deng: Add error handling for amdgpu_sync_wait. Signed-off-by: Emily Deng Reviewed-by: Christian König for this one. Going to discuss with Felix later today what we do with the timeout. Christian. --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 ++--- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 6 +- 2 files changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 54f31a420229..3011c191d7dd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -2668,7 +2668,7 @@ static int validate_invalid_user_pages(struct amdkfd_process_info *process_info) unreserve_out: ttm_eu_backoff_reservation(&ticket, &resv_list); - amdgpu_sync_wait(&sync, false); + ret = amdgpu_sync_wait(&sync, false); amdgpu_sync_free(&sync); out_free: kfree(pd_bo_list_entries); @@ -2939,8 +2939,11 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) } /* Wait for validate and PT updates to finish */ - amdgpu_sync_wait(&sync_obj, false); - + ret = amdgpu_sync_wait(&sync_obj, false); + if (ret) { + pr_err("Failed to wait for validate and PT updates to finish\n"); + goto validate_map_fail; + } /* Release old eviction fence and create new one, because fence only * goes from unsignaled to signaled, fence cannot be reused. * Use context and mm from the old fence. 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c index 70fe3b39c004..a63139277583 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c @@ -1153,7 +1153,11 @@ int amdgpu_mes_ctx_map_meta_data(struct amdgpu_device *adev, } amdgpu_sync_fence(&sync, vm->last_update); - amdgpu_sync_wait(&sync, false); + r = amdgpu_sync_wait(&sync, false); + if (r) { + DRM_ERROR("failed to wait sync\n"); + goto error; + } ttm_eu_backoff_reservation(&ticket, &list); amdgpu_sync_free(&sync);
Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait
Am 20.10.23 um 21:47 schrieb Felix Kuehling: On 2023-10-20 09:10, Christian König wrote: No, the wait forever is what is expected and perfectly valid user experience. Waiting with a timeout on the other hand sounds like a really bad idea to me. Every wait with a timeout needs a justification, e.g. for example that userspace explicitly specified it. And I absolutely don't see that here. In this case the wait is in a kernel worker thread, and the wait is not interruptible. Not having a timeout means, you can have a kernel worker stuck forever. The restore worker also has retry logic already, so it can handle a timeout perfectly well. But maybe this shouldn't be done automatically for all callers of amdgpu_sync_wait, but only for this particular caller in the restore_process_worker. So we'd need to add a timeout parameter to amdgpu_sync_wait. Adding a parameter sounds like a good idea to me, but it's mandatory that dma_fence operations finish in a reasonable amount of time in the first place. This is even documented by now and basically means we need timeouts in the area of 100ms for each operation and not between 10 and 60 seconds. If upstream starts to taint the kernel for longer timeouts we will need to reduce the current values massively. Regards, Christian. Regards, Felix Regards, Christian. Am 20.10.23 um 10:52 schrieb Deng, Emily: [AMD Official Use Only - General] Hi Christian, The issue is running a compute hang with a quark and trigger a compute job timeout. For compute, the timeout setting is 60s, but for gfx and sdma, it is 10s. So, get the timeout from the sched is reasonable for different sched. And if wait timeout, it will print error, so won't hint real issues. And even it has real issue, the wait forever is bad user experience, and driver couldn't work anymore. 
Emily Deng Best Wishes -Original Message- From: Christian König Sent: Friday, October 20, 2023 3:29 PM To: Deng, Emily ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait Am 20.10.23 um 08:13 schrieb Emily Deng: Issue: Dead heappen during gpu recover, the call sequence as below: amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset- flush_delayed_work -> amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait It is because the amdgpu_sync_wait is waiting for the bad job's fence, and never return, so the recover couldn't continue. Signed-off-by: Emily Deng --- drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c index dcd8c066bc1f..6253d6aab7f8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c @@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr) int i, r; hash_for_each_safe(sync->fences, i, tmp, e, node) { - r = dma_fence_wait(e->fence, intr); - if (r) + struct drm_sched_fence *s_fence = to_drm_sched_fence(e- fence); + long timeout = msecs_to_jiffies(1); That handling doesn't make much sense. If you need a timeout then you need a timeout for the whole function. Additional to that timeouts often just hide real problems which needs fixing. So this here needs a much better justification otherwise it's a pretty clear NAK. Regards, Christian. + + if (s_fence) + timeout = s_fence->sched->timeout; + + if (r == 0) + r = -ETIMEDOUT; + if (r < 0) return r; amdgpu_sync_entry_free(e);
Re: [PATCH 7/8] Documentation/gpu: Add an explanation about the DC weekly patches
On Fri, 20 Oct 2023, Rodrigo Siqueira wrote: > Sharing code with other OSes is confusing and raises some questions. > This patch introduces some explanation about our upstream process with > the shared code. Thanks for writing this! It does help with the transparency. Please find a comment inline. > > Cc: Mario Limonciello > Cc: Alex Deucher > Cc: Harry Wentland > Cc: Hamza Mahfooz > Signed-off-by: Rodrigo Siqueira > --- > Documentation/gpu/amdgpu/display/index.rst | 111 - > 1 file changed, 109 insertions(+), 2 deletions(-) > > diff --git a/Documentation/gpu/amdgpu/display/index.rst > b/Documentation/gpu/amdgpu/display/index.rst > index b09d1434754d..9d53a42c5339 100644 > --- a/Documentation/gpu/amdgpu/display/index.rst > +++ b/Documentation/gpu/amdgpu/display/index.rst > @@ -10,7 +10,114 @@ reason, our Display Core Driver is divided into two > pieces: > 1. **Display Core (DC)** contains the OS-agnostic components. Things like > hardware programming and resource management are handled here. > 2. **Display Manager (DM)** contains the OS-dependent components. Hooks to > the > - amdgpu base driver and DRM are implemented here. > + amdgpu base driver and DRM are implemented here. For example, you can > check > + display/amdgpu_dm/ folder. > + > + > +How AMD shares code? > + > + > +Maintaining the same code-base across multiple OSes requires a lot of > +synchronization effort between repositories. In the DC case, we maintain a > +central repository where everyone who works from other OSes can put their > +change in this centralized repository. In a simple way, this shared > repository > +is identical to all code that you can see in the display folder. The shared > +repo has integration tests with our Linux CI farm, and we run an exhaustive > set > +of IGT tests in various AMD GPUs/APUs. Our CI also checks ARM64/32, PPC64/32, > +and x86_64/32 compilation with DCN enabled and disabled. 
After all tests pass > +and the developer gets reviewed by someone else, the change gets merged into > +the shared repository. > + > +To maintain this shared code working properly, we run two activities every > +week: > + > +1. **Weekly backport**: We bring changes from Linux to the other shared > + repositories. This work gets massive support from our CI tools, which can > + detect new changes and send them to internal maintainers. > +2. **Weekly promotion**: Every week, we get changes from other teams in the > + shared repo that have yet to be made public. For this reason, at the > + beginning of each week, a developer will review that internal repo and > + prepare a series of patches that can be sent to the public upstream > + (promotion). > + > +For the context of this documentation, promotion is the essential part that > +deserves a good elaboration here. > + > +Weekly promotion > + > + > +As described in the previous sections, the display folder has its equivalent > as > +an internal repository shared with multiple teams. The promotion activity is > +the task of 'promoting' those internal changes to the upstream; this is > +possible thanks to numerous tools that help us manage the code-sharing > +challenges. The weekly promotion usually takes one week, sliced like this: > + > +1. Extract all merged patches from the previous week that can be sent to the > + upstream. In other words, we check the week's time frame. > +2. Evaluate if any potential new patches make sense to the upstream. > +3. Create a branch candidate with the latest amd-staging-drm-next code > together > + with the new patches. At this step, we must ensure that every patch > compiles > + and the entire series pass our set of IGT test in different hardware > (i.e., > + it has to pass to our CI). > +4. Send the new candidate branch for an internal quality test and extra CI > + validation. > +5. Send patches to amd-gfx for reviews. 
We wait a few days for community > + feedback after sending a series to the public mailing list. So we've debated this one before. :) Again, I applaud the transparency in writing the document, but I can't help feeling the weekly promotions are code drops that will generally be merged unchanged, with no comments. They have all been reviewed internally and get posted with Reviewed-by tags pre-filled; we have no visibility into the review. Since the code has already been merged internally and the batch has passed CI, it feels like the bar for changing anything at this point is pretty high. Just my two cents. BR, Jani. (Side note, there should be a \n before 6.) > +6. If there is an error, we debug as fast as possible; usually, a simple bisect in the > + weekly promotion patches points to a bad change, and we can take two > + possible actions: fix the issue or drop the patch. If we cannot identify the > + problem in the week interval, we drop the promotion and start over the > + following week; i
RE: [PATCH] drm/amdgpu/vpe: correct queue stop programming
This patch is: Reviewed-by: Yifan Zhang Best Regards, Yifan -Original Message- From: Yu, Lang Sent: Monday, October 23, 2023 5:25 PM To: amd-gfx@lists.freedesktop.org Cc: Deucher, Alexander ; Zhang, Yifan ; Chiu, Solomon ; Yu, Lang Subject: [PATCH] drm/amdgpu/vpe: correct queue stop programming IB test would fail if the queue is not stopped correctly. Signed-off-by: Lang Yu --- drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c index 756f39348dd9..174f13eff575 100644 --- a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c +++ b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c @@ -205,19 +205,21 @@ static int vpe_v6_1_ring_start(struct amdgpu_vpe *vpe) static int vpe_v6_1_ring_stop(struct amdgpu_vpe *vpe) { struct amdgpu_device *adev = vpe->ring.adev; - uint32_t rb_cntl, ib_cntl; + uint32_t queue_reset; + int ret; - rb_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL)); - rb_cntl = REG_SET_FIELD(rb_cntl, VPEC_QUEUE0_RB_CNTL, RB_ENABLE, 0); - WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL), rb_cntl); + queue_reset = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ)); + queue_reset = REG_SET_FIELD(queue_reset, VPEC_QUEUE_RESET_REQ, QUEUE0_RESET, 1); + WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ), +queue_reset); - ib_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL)); - ib_cntl = REG_SET_FIELD(ib_cntl, VPEC_QUEUE0_IB_CNTL, IB_ENABLE, 0); - WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL), ib_cntl); + ret = SOC15_WAIT_ON_RREG(VPE, 0, regVPEC_QUEUE_RESET_REQ, 0, +VPEC_QUEUE_RESET_REQ__QUEUE0_RESET_MASK); + if (ret) + dev_err(adev->dev, "VPE queue reset failed\n"); vpe->ring.sched.ready = false; - return 0; + return ret; } static int vpe_v6_1_set_trap_irq_state(struct amdgpu_device *adev, -- 2.25.1
Re: [PATCH v7 4/6] drm: Refuse to async flip with atomic prop changes
On Monday, October 23rd, 2023 at 10:42, Michel Dänzer wrote: > On 10/23/23 10:27, Simon Ser wrote: > > > On Sunday, October 22nd, 2023 at 12:12, Michel Dänzer > > michel.daen...@mailbox.org wrote: > > > > > On 10/17/23 14:16, Simon Ser wrote: > > > > > > > After discussing with André it seems like we missed a plane type check > > > > here. We need to make sure FB_ID changes are only allowed on primary > > > > planes. > > > > > > Can you elaborate why that's needed? > > > > Current drivers are in general not prepared to perform async page-flips > > on planes other than primary. For instance I don't think i915 has logic > > to perform async page-flip on an overlay plane FB_ID change. > > > That should be handled in the driver's atomic_check then? > > Async flips of overlay planes would be useful e.g. for presenting a windowed > application with tearing, while the rest of the desktop is tear-free. Yes, that would be useful, but requires more work. Small steps: first expose what the legacy uAPI can do in atomic, then later extend that in some drivers.
Re: [PATCH v7 4/6] drm: Refuse to async flip with atomic prop changes
On 10/23/23 10:27, Simon Ser wrote: > On Sunday, October 22nd, 2023 at 12:12, Michel Dänzer > wrote: >> On 10/17/23 14:16, Simon Ser wrote: >> >>> After discussing with André it seems like we missed a plane type check >>> here. We need to make sure FB_ID changes are only allowed on primary >>> planes. >> >> Can you elaborate why that's needed? > > Current drivers are in general not prepared to perform async page-flips > on planes other than primary. For instance I don't think i915 has logic > to perform async page-flip on an overlay plane FB_ID change. That should be handled in the driver's atomic_check then? Async flips of overlay planes would be useful e.g. for presenting a windowed application with tearing, while the rest of the desktop is tear-free. -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and Xwayland developer
Re: [PATCH v6 6/6] drm/doc: Define KMS atomic state set
On Tuesday, October 17th, 2023 at 14:10, Ville Syrjälä wrote: > On Mon, Oct 16, 2023 at 10:00:51PM +, Simon Ser wrote: > > > On Monday, October 16th, 2023 at 17:10, Ville Syrjälä > > ville.syrj...@linux.intel.com wrote: > > > > > On Mon, Oct 16, 2023 at 05:52:22PM +0300, Pekka Paalanen wrote: > > > > > > > On Mon, 16 Oct 2023 15:42:16 +0200 > > > > André Almeida andrealm...@igalia.com wrote: > > > > > > > > > Hi Pekka, > > > > > > > > > > On 10/16/23 14:18, Pekka Paalanen wrote: > > > > > > > > > > > On Mon, 16 Oct 2023 12:52:32 +0200 > > > > > > André Almeida andrealm...@igalia.com wrote: > > > > > > > > > > > > > Hi Michel, > > > > > > > > > > > > > > On 8/17/23 12:37, Michel Dänzer wrote: > > > > > > > > > > > > > > > On 8/15/23 20:57, André Almeida wrote: > > > > > > > > > > > > > > > > > From: Pekka Paalanen pekka.paala...@collabora.com > > > > > > > > > > > > > > > > > > Specify how the atomic state is maintained between userspace > > > > > > > > > and > > > > > > > > > kernel, plus the special case for async flips. > > > > > > > > > > > > > > > > > > Signed-off-by: Pekka Paalanen pekka.paala...@collabora.com > > > > > > > > > Signed-off-by: André Almeida andrealm...@igalia.com > > > > > > > > > [...] > > > > > > > > > > > > > > > > > +An atomic commit with the flag DRM_MODE_PAGE_FLIP_ASYNC is > > > > > > > > > allowed to > > > > > > > > > +effectively change only the FB_ID property on any planes. > > > > > > > > > No-operation changes > > > > > > > > > +are ignored as always. [...] > > > > > > > > > During the hackfest in Brno, it was mentioned that a commit > > > > > > > > > which re-sets the same FB_ID could actually have an effect > > > > > > > > > with VRR: It could trigger scanout of the next frame before > > > > > > > > > vertical blank has reached its maximum duration. Some kind of > > > > > > > > > mechanism is required for this in order to allow user space > > > > > > > > > to perform low frame rate compensation. 
> > > > > > > Xaver tested this hypothesis by flipping the same fb on a VRR monitor and it worked as expected, so this shouldn't be a concern. > > > > > > Right, so it must have some effect. It cannot be simply ignored like in the proposed doc wording. Do we special-case re-setting the same FB_ID as "not a no-op" or "not ignored" or some other way? > > > > > There's an effect in the refresh rate, the image won't change but it will report that a flip had happened asynchronously so the reported framerate will be increased. Maybe an additional wording could be like: > > > > > Flipping to the same FB_ID will result in an immediate flip as if it was changing to a different one, with no effect on the image but affecting the reported frame rate. > > > > Re-setting FB_ID to its current value is a special case regardless of PAGE_FLIP_ASYNC, is it not? > > > No. The rule has so far been that all side effects are observed even if you flip to the same fb. And that is one of my annoyances with this proposal. The rules will now be different for async flips vs. everything else. > > Well with the patches the async page-flip case is exactly the same as the non-async page-flip case. In both cases, if a FB_ID is included in an atomic commit then the side effects are triggered even if the property value didn't change. The rules are the same for everything. > I see it only checking if FB_ID changes or not. If it doesn't change then the implication is that the side effects will in fact be skipped as not all planes may even support async flips. Hm right.
So the problem is that setting any prop = same value as the previous one will result in a new page-flip for synchronous page-flips, but will not result in any side-effect for asynchronous page-flips. Does it actually matter though? For async page-flips, I don't think this would result in any actual difference in behavior?