[PATCH] drm/amdgpu: get RAS poison status from DF v4_6_2

2023-10-23 Thread Tao Zhou
Add DF block and RAS poison mode query for DF v4_6_2.

Signed-off-by: Tao Zhou 
Reviewed-by: Stanley.Yang 
---
 drivers/gpu/drm/amd/amdgpu/Makefile   |  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c |  4 +++
 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c        | 34 +++
 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h        | 31 +
 4 files changed, 71 insertions(+), 1 deletion(-)
 create mode 100644 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c
 create mode 100644 drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h

diff --git a/drivers/gpu/drm/amd/amdgpu/Makefile 
b/drivers/gpu/drm/amd/amdgpu/Makefile
index ec1daf7112a9..260e32ef7bae 100644
--- a/drivers/gpu/drm/amd/amdgpu/Makefile
+++ b/drivers/gpu/drm/amd/amdgpu/Makefile
@@ -104,7 +104,8 @@ amdgpu-y += \
 amdgpu-y += \
df_v1_7.o \
df_v3_6.o \
-   df_v4_3.o
+   df_v4_3.o \
+   df_v4_6_2.o
 
 # add GMC block
 amdgpu-y += \
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
index 17d4311e22d5..8d3681172cea 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c
@@ -35,6 +35,7 @@
 #include "df_v1_7.h"
 #include "df_v3_6.h"
 #include "df_v4_3.h"
+#include "df_v4_6_2.h"
 #include "nbio_v6_1.h"
 #include "nbio_v7_0.h"
 #include "nbio_v7_4.h"
@@ -2557,6 +2558,9 @@ int amdgpu_discovery_set_ip_blocks(struct amdgpu_device *adev)
case IP_VERSION(4, 3, 0):
adev->df.funcs = &df_v4_3_funcs;
break;
+   case IP_VERSION(4, 6, 2):
+   adev->df.funcs = &df_v4_6_2_funcs;
+   break;
default:
break;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c 
b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c
new file mode 100644
index ..a47960a0babd
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.c
@@ -0,0 +1,34 @@
+/*
+ * Copyright 2023 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+#include "amdgpu.h"
+#include "df_v4_6_2.h"
+
+static bool df_v4_6_2_query_ras_poison_mode(struct amdgpu_device *adev)
+{
+   /* return true since related regs are inaccessible */
+   return true;
+}
+
+const struct amdgpu_df_funcs df_v4_6_2_funcs = {
+   .query_ras_poison_mode = df_v4_6_2_query_ras_poison_mode,
+};
diff --git a/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h 
b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h
new file mode 100644
index ..3bc3e6d216e2
--- /dev/null
+++ b/drivers/gpu/drm/amd/amdgpu/df_v4_6_2.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright 2023 Advanced Micro Devices, Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
+ * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+ * OTHER DEALINGS IN THE SOFTWARE.
+ *
+ */
+
+#ifndef __DF_V4_6_2_H__
+#define __DF_V4_6_2_H__
+
+#include "soc15_common.h"
+
+extern const struct amdgpu_df_funcs df_v4_6_2_funcs;
+
+#endif
-- 
2.35.1



RE: [PATCH 1/3] drm/amdgpu: ungate power gating when system suspend

2023-10-23 Thread Feng, Kenneth
[AMD Official Use Only - General]

Reviewed-by: Kenneth Feng 


-Original Message-
From: Yuan, Perry 
Sent: Tuesday, October 24, 2023 10:33 AM
To: Zhang, Yifan ; Feng, Kenneth ; 
Limonciello, Mario 
Cc: Deucher, Alexander ; Wang, Yang(Kevin) 
; amd-gfx@lists.freedesktop.org
Subject: [PATCH 1/3] drm/amdgpu: ungate power gating when system suspend

[Why] During suspend, if GFX DPM is enabled and the GFXOFF feature is
enabled, the system may hang. So it is suggested to disable the GFXOFF
feature during suspend and enable it again after resume.

[How] Update the code to disable the GFXOFF feature during suspend and
enable it after resume.

[  311.396526] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x
[  311.396530] amdgpu :03:00.0: amdgpu: Fail to disable dpm features!
[  311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block  failed -62

Signed-off-by: Perry Yuan 
Signed-off-by: Kun Liu 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index d9ccacd06fba..6399bc71c56d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3498,6 +3498,8 @@ static void gfx_v10_0_ring_invalidate_tlbs(struct amdgpu_ring *ring,
 static void gfx_v10_0_update_spm_vmid_internal(struct amdgpu_device *adev,
					       unsigned int vmid);
 
+static int gfx_v10_0_set_powergating_state(void *handle,
+					   enum amd_powergating_state state);
 static void gfx10_kiq_set_resources(struct amdgpu_ring *kiq_ring, uint64_t queue_mask)
 {
	amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_RESOURCES, 6));
@@ -7172,6 +7174,13 @@ static int gfx_v10_0_hw_fini(void *handle)
	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
 
+	/* WA added for Vangogh asic fixing the SMU suspend failure
+	 * It needs to set power gating again during gfxoff control
+	 * otherwise the gfxoff disallowing will be failed to set.
+	 */
+	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(10, 3, 1))
+		gfx_v10_0_set_powergating_state(handle, AMD_PG_STATE_UNGATE);
+
	if (!adev->no_hw_access) {
		if (amdgpu_async_gfx_ring) {
			if (amdgpu_gfx_disable_kgq(adev, 0))
--
2.34.1



Re: [PATCH 2/2] drm/amdgpu: Add timeout for sync wait

2023-10-23 Thread Christian König

On 2023-10-20 11:59, Emily Deng wrote:

Issue: Deadlock happens during GPU recovery; the call sequence is as below:

amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work->
amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait


Resolving a deadlock with a timeout is illegal in general. So this patch 
here is an obvious no-go.


In addition to this problem, Xinhu already investigated that the delayed 
work is causing issues during suspend, because flushing doesn't 
guarantee that a new one isn't started right after doing that.


After talking with Felix about this, the correct solution is to stop 
flushing the delayed work and instead submit it to the freezable 
work queue.


Regards,
Christian.



It is because amdgpu_sync_wait is waiting for the bad job's fence and
never returns, so the recovery couldn't continue.

Signed-off-by: Emily Deng 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..9d4f122a7bf0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
int i, r;
  
  	hash_for_each_safe(sync->fences, i, tmp, e, node) {

-   r = dma_fence_wait(e->fence, intr);
-   if (r)
+   struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+   long timeout = msecs_to_jiffies(1);
+
+   if (s_fence)
+   timeout = s_fence->sched->timeout;
+   r = dma_fence_wait_timeout(e->fence, intr, timeout);
+   if (r == 0)
+   r = -ETIMEDOUT;
+   if (r < 0)
return r;
  
  		amdgpu_sync_entry_free(e);




Re: [PATCH] drm/amdgpu: Initialize schedulers before using them

2023-10-23 Thread Christian König

On 2023-10-24 04:55, Luben Tuikov wrote:

On 2023-10-23 01:49, Christian König wrote:


On 2023-10-23 05:23, Luben Tuikov wrote:

Initialize ring schedulers before using them, very early in the amdgpu boot,
at PCI probe time, specifically at frame-buffer dumb-create at fill-buffer.

This was discovered by using dynamic scheduler run-queues, which showed that
amdgpu was using a scheduler before calling drm_sched_init(), and the only
reason it was working was because sched_rq[] was statically allocated in the
scheduler structure. However, the scheduler structure had _not_ been
initialized.

When switching to dynamically allocated run-queues, this lack of
initialization was causing an oops and a blank screen at boot up. This patch
fixes this amdgpu bug.

This patch depends on the "drm/sched: Convert the GPU scheduler to variable
number of run-queues" patch, as that patch prevents subsequent scheduler
initialization if a scheduler has already been initialized.

Cc: Christian König 
Cc: Alex Deucher 
Cc: Felix Kuehling 
Cc: AMD Graphics 
Signed-off-by: Luben Tuikov 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++
   1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 4e51dce3aab5d6..575ef7e1e30fd4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -60,6 +60,7 @@
   #include "amdgpu_atomfirmware.h"
   #include "amdgpu_res_cursor.h"
   #include "bif/bif_4_1_d.h"
+#include "amdgpu_reset.h"
   
   MODULE_IMPORT_NS(DMA_BUF);
   
@@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
   
   		ring = adev->mman.buffer_funcs_ring;

sched = &ring->sched;
+
+   r = drm_sched_init(sched, &amdgpu_sched_ops,
+  DRM_SCHED_PRIORITY_COUNT,
+  ring->num_hw_submission, 0,
+  adev->sdma_timeout, adev->reset_domain->wq,
+  ring->sched_score, ring->name,
+  adev->dev);
+   if (r) {
+		drm_err(adev, "%s: couldn't initialize ring:%s error:%d\n",
+   __func__, ring->name, r);
+   return;
+   }

That doesn't look correct either.

amdgpu_ttm_set_buffer_funcs_status() should only be called with
enable=true as argument *after* the copy ring is initialized and valid
to use. One part of this ring initialization is to setup the scheduler.

It's the only way to keep the functionality of amdgpu_fill_buffer()
from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(),
from ... without an oops and a blank screen at boot up.

Here is a stack of the oops:

Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f
Oct 20 22:12:34 fedora kernel: RSP: 0018:c90001613838 EFLAGS: 00010246
Oct 20 22:12:34 fedora kernel: RAX:  RBX: 88812f33b400 RCX: 0004
Oct 20 22:12:34 fedora kernel: RDX:  RSI: c9000395145c RDI: 88812eacf850
Oct 20 22:12:34 fedora kernel: RBP: 88812eacf850 R08: 0004 R09: 0003
Oct 20 22:12:34 fedora kernel: R10: c066b850 R11: bc848ef1 R12: 
Oct 20 22:12:34 fedora kernel: R13: 0004 R14: 00800300 R15: 0100
Oct 20 22:12:34 fedora kernel: FS:  7f7be4866940() GS:0ed0() knlGS:
Oct 20 22:12:34 fedora kernel: CS:  0010 DS:  ES:  CR0: 80050033
Oct 20 22:12:34 fedora kernel: CR2: 0008 CR3: 00012cf22000 CR4: 003506e0
Oct 20 22:12:34 fedora kernel: Call Trace:
Oct 20 22:12:34 fedora kernel:  
Oct 20 22:12:34 fedora kernel:  ? __die+0x1f/0x70
Oct 20 22:12:34 fedora kernel:  ? page_fault_oops+0x149/0x440
Oct 20 22:12:34 fedora kernel:  ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:12:34 fedora kernel:  ? do_user_addr_fault+0x65/0x650
Oct 20 22:12:34 fedora kernel:  ? drm_client_framebuffer_create+0xa3/0x280 [drm]
Oct 20 22:12:34 fedora kernel:  ? exc_page_fault+0x7b/0x180
Oct 20 22:12:34 fedora kernel:  ? asm_exc_page_fault+0x22/0x30
Oct 20 22:12:34 fedora kernel:  ? local_pci_probe+0x41/0x90
Oct 20 22:12:34 fedora kernel:  ? __pfx_sdma_v5_0_emit_fill_buffer+0x10/0x10 
[amdgpu]
Oct 20 22:12:34 fedora kernel:  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? drm_sched_job_arm+0x1b/0x60 [gpu_sched]
Oct 20 

RE: [PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes from S3

2023-10-23 Thread Wang, Yang(Kevin)
[AMD Official Use Only - General]

-Original Message-
From: Yuan, Perry 
Sent: Tuesday, October 24, 2023 10:33 AM
To: Zhang, Yifan ; Feng, Kenneth ; 
Limonciello, Mario 
Cc: Deucher, Alexander ; Wang, Yang(Kevin) 
; amd-gfx@lists.freedesktop.org
Subject: [PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes 
from S3

Previously the CSIB command packet was always sent to the GFX block at 
amdgpu driver load or S3 resume time.
As the CP protocol requires, the CSIB does not need to be sent again 
when GC was not powered down, which is the case when resuming from an 
aborted S3 suspend sequence.

If a PREAMBLE_CNTL packet comes in the ring after a PG event where the 
RLC already sent its copy of the CSIB, sending another CSIB packet will 
cause a GFX IB test timeout when the system resumes from S3.

Add the flag `csib_initialized` to make sure a normal S3 suspend/resume 
still initializes the CSIB normally; when the system aborts the S3 
suspend and resumes immediately because of some failed suspend callback, 
the GPU was not powered down at that time, so the CSIB command does not 
need to be sent again.

Error dmesg log:
amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed 
on gfx_0.0.0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test 
failed (-110).
PM: resume of devices complete after 2373.995 msecs
PM: Finishing wakeup.

Signed-off-by: Perry Yuan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  5 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 29 ++---
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 44df1a5bce7f..e5d85ea26a5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1114,6 +1114,7 @@ struct amdgpu_device {
booldebug_vm;
booldebug_largebar;
booldebug_disable_soft_recovery;
+   boolcsib_initialized;
[Kevin]:
you'd better use spaces instead of a tab here, to align with the other fields.

 };

 static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev, 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420196a17e22..a47c9f840754 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2468,6 +2468,11 @@ static int amdgpu_pmops_suspend_noirq(struct device *dev)
if (amdgpu_acpi_should_gpu_reset(adev))
return amdgpu_asic_reset(adev);

+   /* update flag to make sure csib will be sent when system
+* resume from normal S3
+*/
+   adev->csib_initialized = false;
+
return 0;
 }

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 6399bc71c56d..ab2e3e592dfc 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3481,6 +3481,7 @@ static uint64_t gfx_v10_0_get_gpu_clock_counter(struct amdgpu_device *adev);
 static void gfx_v10_0_select_se_sh(struct amdgpu_device *adev, u32 se_num,
				   u32 sh_num, u32 instance, int xcc_id);
 static u32 gfx_v10_0_get_wgp_active_bitmap_per_sh(struct amdgpu_device *adev);
+static int gfx_v10_0_wait_for_idle(void *handle);

 static int gfx_v10_0_rlc_backdoor_autoload_buffer_init(struct amdgpu_device *adev);
 static void gfx_v10_0_rlc_backdoor_autoload_buffer_fini(struct amdgpu_device *adev);
@@ -5958,7 +5959,7 @@ static int gfx_v10_0_cp_gfx_load_microcode(struct amdgpu_device *adev)
return 0;
 }

-static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
+static int gfx_v10_csib_submit(struct amdgpu_device *adev)
 {
struct amdgpu_ring *ring;
	const struct cs_section_def *sect = NULL;
@@ -5966,13 +5967,6 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
	int r, i;
int ctx_reg_offset;

-   /* init the CP */
-   WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT,
-adev->gfx.config.max_hw_contexts - 1);
-   WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1);
-
-   gfx_v10_0_cp_gfx_enable(adev, true);
-
ring = &adev->gfx.gfx_ring[0];
r = amdgpu_ring_alloc(ring, gfx_v10_0_get_csb_size(adev) + 4);
if (r) {
@@ -6035,6 +6029,25 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)

amdgpu_ring_commit(ring);
}
+
+   gfx_v10_0_wait_for_idle(adev);
[Kevin]:
Did you forget to check the return value here? If you want to ignore the 
return result, you'd better put a comment here.
Thanks.

Best Regards,
Kevin

+   adev->csib_initialized = true;
+
+   return 0;
+};
+
+static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
+{
+   /* init the CP */
+   WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT,
+

[PATCH] drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded

2023-10-23 Thread Kenneth Feng
Avoid disabling the gfxhub interrupt when the driver is unloaded on GMC 11.

Signed-off-by: Kenneth Feng 
---
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
index 80ca2c05b0b8..8e36a8395464 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
@@ -73,7 +73,8 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device *adev,
 * fini/suspend, so the overall state doesn't
 * change over the course of suspend/resume.
 */
-   if (!adev->in_s0ix)
+   if (!adev->in_s0ix && (adev->in_runpm || adev->in_suspend ||
+			       amdgpu_in_reset(adev)))
 			amdgpu_gmc_set_vm_fault_masks(adev, AMDGPU_GFXHUB(0), false);
break;
case AMDGPU_IRQ_STATE_ENABLE:
-- 
2.34.1



Re: [PATCH] drm/amdgpu: Initialize schedulers before using them

2023-10-23 Thread Luben Tuikov
On 2023-10-23 01:49, Christian König wrote:
> 
> 
> On 2023-10-23 05:23, Luben Tuikov wrote:
>> Initialize ring schedulers before using them, very early in the amdgpu boot,
>> at PCI probe time, specifically at frame-buffer dumb-create at fill-buffer.
>>
>> This was discovered by using dynamic scheduler run-queues, which showed that
>> amdgpu was using a scheduler before calling drm_sched_init(), and the only
>> reason it was working was because sched_rq[] was statically allocated in the
>> scheduler structure. However, the scheduler structure had _not_ been
>> initialized.
>>
>> When switching to dynamically allocated run-queues, this lack of
>> initialization was causing an oops and a blank screen at boot up. This patch
>> fixes this amdgpu bug.
>>
>> This patch depends on the "drm/sched: Convert the GPU scheduler to variable
>> number of run-queues" patch, as that patch prevents subsequent scheduler
>> initialization if a scheduler has already been initialized.
>>
>> Cc: Christian König 
>> Cc: Alex Deucher 
>> Cc: Felix Kuehling 
>> Cc: AMD Graphics 
>> Signed-off-by: Luben Tuikov 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index 4e51dce3aab5d6..575ef7e1e30fd4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -60,6 +60,7 @@
>>   #include "amdgpu_atomfirmware.h"
>>   #include "amdgpu_res_cursor.h"
>>   #include "bif/bif_4_1_d.h"
>> +#include "amdgpu_reset.h"
>>   
>>   MODULE_IMPORT_NS(DMA_BUF);
>>   
>> @@ -2059,6 +2060,19 @@ void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
>>   
>>  ring = adev->mman.buffer_funcs_ring;
>>  sched = &ring->sched;
>> +
>> +r = drm_sched_init(sched, &amdgpu_sched_ops,
>> +   DRM_SCHED_PRIORITY_COUNT,
>> +   ring->num_hw_submission, 0,
>> +   adev->sdma_timeout, adev->reset_domain->wq,
>> +   ring->sched_score, ring->name,
>> +   adev->dev);
>> +if (r) {
>> +		drm_err(adev, "%s: couldn't initialize ring:%s error:%d\n",
>> +__func__, ring->name, r);
>> +return;
>> +}
> 
> That doesn't look correct either.
> 
> amdgpu_ttm_set_buffer_funcs_status() should only be called with 
> enable=true as argument *after* the copy ring is initialized and valid 
> to use. One part of this ring initialization is to setup the scheduler.

It's the only way to keep the functionality of amdgpu_fill_buffer()
from amdgpu_mode_dumb_create(), from drm_client_framebuffer_create(),
from ... without an oops and a blank screen at boot up.

Here is a stack of the oops:

Oct 20 22:12:34 fedora kernel: RIP: 0010:drm_sched_job_arm+0x1f/0x60 [gpu_sched]
Oct 20 22:12:34 fedora kernel: Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 53 48 8b 6f 58 48 85 ed 74 3f 48 89 fb 48 89 ef e8 95 34 00 00 48 8b 45 10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 54 b8 01 00 00 00 f0 48 0f
Oct 20 22:12:34 fedora kernel: RSP: 0018:c90001613838 EFLAGS: 00010246
Oct 20 22:12:34 fedora kernel: RAX:  RBX: 88812f33b400 RCX: 0004
Oct 20 22:12:34 fedora kernel: RDX:  RSI: c9000395145c RDI: 88812eacf850
Oct 20 22:12:34 fedora kernel: RBP: 88812eacf850 R08: 0004 R09: 0003
Oct 20 22:12:34 fedora kernel: R10: c066b850 R11: bc848ef1 R12: 
Oct 20 22:12:34 fedora kernel: R13: 0004 R14: 00800300 R15: 0100
Oct 20 22:12:34 fedora kernel: FS:  7f7be4866940() GS:0ed0() knlGS:
Oct 20 22:12:34 fedora kernel: CS:  0010 DS:  ES:  CR0: 80050033
Oct 20 22:12:34 fedora kernel: CR2: 0008 CR3: 00012cf22000 CR4: 003506e0
Oct 20 22:12:34 fedora kernel: Call Trace:
Oct 20 22:12:34 fedora kernel:  
Oct 20 22:12:34 fedora kernel:  ? __die+0x1f/0x70
Oct 20 22:12:34 fedora kernel:  ? page_fault_oops+0x149/0x440
Oct 20 22:12:34 fedora kernel:  ? drm_sched_fence_alloc+0x1a/0x40 [gpu_sched]
Oct 20 22:12:34 fedora kernel:  ? amdgpu_job_alloc_with_ib+0x34/0xb0 [amdgpu]
Oct 20 22:12:34 fedora kernel:  ? srso_return_thunk+0x5/0x10
Oct 20 22:12:34 fedora kernel:  ? do_user_addr_fault+0x65/0x650
Oct 20 22:12:34 fedora kernel:  ? drm_client_framebuffer_create+0xa3/0x280 [drm]
Oct 20 22:12:34 fedora kernel:  ? exc_page_fault+0x7b/0x180
Oct 20 22:12:34 fedora kernel:  ? asm_exc_page_fault+0x22/0x30
Oct 20 22:12:34 fedora kernel:  ? local_pci_probe+0x41/0x90
Oct 20 22:12:34 fedora kernel:  ? __pfx_sdma_v5_0_emit_fill_buffer+0x10/0x10 
[amdgpu]
Oct 20 22:12:34 fedora kernel:  ? drm_sched_job_arm+

[PATCH 2/3] drm/amdgpu: avoid sending csib command when system resumes from S3

2023-10-23 Thread Perry Yuan
Previously the CSIB command packet was always sent to the GFX block at
amdgpu driver load or S3 resume time.
As the CP protocol requires, the CSIB does not need to be sent again
when GC was not powered down, which is the case when resuming from an
aborted S3 suspend sequence.

If a PREAMBLE_CNTL packet comes in the ring after a PG event where the
RLC already sent its copy of the CSIB, sending another CSIB packet will
cause a GFX IB test timeout when the system resumes from S3.

Add the flag `csib_initialized` to make sure a normal S3 suspend/resume
still initializes the CSIB normally; when the system aborts the S3
suspend and resumes immediately because of some failed suspend callback,
the GPU was not powered down at that time, so the CSIB command does not
need to be sent again.

Error dmesg log:
amdgpu :04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed 
on gfx_0.0.0 (-110).
[drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test 
failed (-110).
PM: resume of devices complete after 2373.995 msecs
PM: Finishing wakeup.

Signed-off-by: Perry Yuan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |  5 +
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 29 ++---
 3 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 44df1a5bce7f..e5d85ea26a5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1114,6 +1114,7 @@ struct amdgpu_device {
booldebug_vm;
booldebug_largebar;
booldebug_disable_soft_recovery;
+   boolcsib_initialized;
 };
 
 static inline uint32_t amdgpu_ip_version(const struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 420196a17e22..a47c9f840754 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -2468,6 +2468,11 @@ static int amdgpu_pmops_suspend_noirq(struct device *dev)
if (amdgpu_acpi_should_gpu_reset(adev))
return amdgpu_asic_reset(adev);
 
+   /* update flag to make sure csib will be sent when system
+* resume from normal S3
+*/
+   adev->csib_initialized = false;
+
return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 6399bc71c56d..ab2e3e592dfc 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3481,6 +3481,7 @@ static uint64_t gfx_v10_0_get_gpu_clock_counter(struct amdgpu_device *adev);
 static void gfx_v10_0_select_se_sh(struct amdgpu_device *adev, u32 se_num,
   u32 sh_num, u32 instance, int xcc_id);
 static u32 gfx_v10_0_get_wgp_active_bitmap_per_sh(struct amdgpu_device *adev);
+static int gfx_v10_0_wait_for_idle(void *handle);
 
 static int gfx_v10_0_rlc_backdoor_autoload_buffer_init(struct amdgpu_device *adev);
 static void gfx_v10_0_rlc_backdoor_autoload_buffer_fini(struct amdgpu_device *adev);
@@ -5958,7 +5959,7 @@ static int gfx_v10_0_cp_gfx_load_microcode(struct amdgpu_device *adev)
return 0;
 }
 
-static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
+static int gfx_v10_csib_submit(struct amdgpu_device *adev)
 {
struct amdgpu_ring *ring;
const struct cs_section_def *sect = NULL;
@@ -5966,13 +5967,6 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
int r, i;
int ctx_reg_offset;
 
-   /* init the CP */
-   WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT,
-adev->gfx.config.max_hw_contexts - 1);
-   WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1);
-
-   gfx_v10_0_cp_gfx_enable(adev, true);
-
ring = &adev->gfx.gfx_ring[0];
r = amdgpu_ring_alloc(ring, gfx_v10_0_get_csb_size(adev) + 4);
if (r) {
@@ -6035,6 +6029,25 @@ static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
 
amdgpu_ring_commit(ring);
}
+
+   gfx_v10_0_wait_for_idle(adev);
+   adev->csib_initialized = true;
+
+   return 0;
+};
+
+static int gfx_v10_0_cp_gfx_start(struct amdgpu_device *adev)
+{
+   /* init the CP */
+   WREG32_SOC15(GC, 0, mmCP_MAX_CONTEXT,
+adev->gfx.config.max_hw_contexts - 1);
+   WREG32_SOC15(GC, 0, mmCP_DEVICE_ID, 1);
+
+   gfx_v10_0_cp_gfx_enable(adev, true);
+
+   if (!adev->csib_initialized)
+   gfx_v10_csib_submit(adev);
+
return 0;
 }
 
-- 
2.34.1



[PATCH 3/3] drm/amdgpu: optimize RLC powerdown notification on Vangogh

2023-10-23 Thread Perry Yuan
The SMU needs to get the RLC power-down message to sync the RLC state
with the SMU. The RLC state update message needs to be sent when the SMU
begins its suspend sequence, otherwise the SMU will crash if the RLC
state is not notified by the driver, and the RLC state probably changed
after that notification. So the driver needs to notify the RLC state to
the SMU at the end of the suspend sequence in amdgpu_device_suspend(),
which makes sure the RLC state is correctly set in the SMU.

[  101.000590] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your 
previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x
[  101.000598] amdgpu :03:00.0: amdgpu: Failed to disable gfxoff!
[  110.838026] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your 
previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x
[  110.838035] amdgpu :03:00.0: amdgpu: Failed to disable smu features.
[  110.838039] amdgpu :03:00.0: amdgpu: Fail to disable dpm features!
[  110.838040] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend 
of IP block  failed -62
[  110.884394] PM: suspend of devices aborted after 21213.620 msecs
[  110.884402] PM: start suspend of devices aborted after 21213.882 msecs
[  110.884405] PM: Some devices failed to suspend, or early wake event detected

Signed-off-by: Perry Yuan 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  4 
 drivers/gpu/drm/amd/include/kgd_pp_interface.h |  1 +
 drivers/gpu/drm/amd/pm/amdgpu_dpm.c| 18 ++
 drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h|  2 ++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c  | 10 ++
 drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h  |  5 +
 .../gpu/drm/amd/pm/swsmu/smu11/vangogh_ppt.c   |  5 ++---
 drivers/gpu/drm/amd/pm/swsmu/smu_internal.h|  1 +
 8 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cc047fe0b7ee..be08ffc69231 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4428,6 +4428,10 @@ int amdgpu_device_suspend(struct drm_device *dev, bool fbcon)
if (amdgpu_sriov_vf(adev))
amdgpu_virt_release_full_gpu(adev, false);
 
+   r = amdgpu_dpm_notify_rlc_state(adev, false);
+   if (r)
+   return r;
+
return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/include/kgd_pp_interface.h 
b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
index 3201808c2dd8..4eacfdfcfd4b 100644
--- a/drivers/gpu/drm/amd/include/kgd_pp_interface.h
+++ b/drivers/gpu/drm/amd/include/kgd_pp_interface.h
@@ -444,6 +444,7 @@ struct amd_pm_funcs {
   struct dpm_clocks *clock_table);
int (*get_smu_prv_buf_details)(void *handle, void **addr, size_t *size);
void (*pm_compute_clocks)(void *handle);
+   int (*notify_rlc_state)(void *handle, bool en);
 };
 
 struct metrics_table_header {
diff --git a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c 
b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
index acf3527fff2d..ed7237bb64c8 100644
--- a/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
+++ b/drivers/gpu/drm/amd/pm/amdgpu_dpm.c
@@ -181,6 +181,24 @@ int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev,
return ret;
 }
 
+int amdgpu_dpm_notify_rlc_state(struct amdgpu_device *adev, bool en)
+{
+   int ret = 0;
+   const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
+
+   if (pp_funcs && pp_funcs->notify_rlc_state) {
+   mutex_lock(&adev->pm.mutex);
+
+   ret = pp_funcs->notify_rlc_state(
+   adev->powerplay.pp_handle,
+   en);
+
+   mutex_unlock(&adev->pm.mutex);
+   }
+
+   return ret;
+}
+
 bool amdgpu_dpm_is_baco_supported(struct amdgpu_device *adev)
 {
const struct amd_pm_funcs *pp_funcs = adev->powerplay.pp_funcs;
diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
index feccd2a7120d..482ea30147ab 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_dpm.h
@@ -415,6 +415,8 @@ int amdgpu_dpm_mode1_reset(struct amdgpu_device *adev);
 int amdgpu_dpm_set_mp1_state(struct amdgpu_device *adev,
 enum pp_mp1_state mp1_state);
 
+int amdgpu_dpm_notify_rlc_state(struct amdgpu_device *adev, bool en);
+
 int amdgpu_dpm_set_gfx_power_up_by_imu(struct amdgpu_device *adev);
 
 int amdgpu_dpm_baco_exit(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index a0b8d5d78beb..a8fb914f746b 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -1710,6 +1710,16 @@ static int smu_disable_dpms(struct smu_context *smu)
}
}
 
+   /* Notify SMU that RLC is going to be off; stop RLC and SMU
+* interaction, otherwise SMU will hang

[PATCH 1/3] drm/amdgpu: ungate power gating when system suspend

2023-10-23 Thread Perry Yuan
[Why] During suspend, if GFX DPM and the GFXOFF feature are both
enabled, the system may hang. It is therefore suggested to disable the
GFXOFF feature during suspend and re-enable it after resume.

[How] Update the code to disable the GFXOFF feature during suspend and
re-enable it after resume.

[  311.396526] amdgpu :03:00.0: amdgpu: SMU: I'm not done with your 
previous command: SMN_C2PMSG_66:0x001E SMN_C2PMSG_82:0x
[  311.396530] amdgpu :03:00.0: amdgpu: Fail to disable dpm features!
[  311.396531] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend 
of IP block  failed -62

Signed-off-by: Perry Yuan 
Signed-off-by: Kun Liu 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index d9ccacd06fba..6399bc71c56d 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -3498,6 +3498,8 @@ static void gfx_v10_0_ring_invalidate_tlbs(struct 
amdgpu_ring *ring,
 static void gfx_v10_0_update_spm_vmid_internal(struct amdgpu_device *adev,
   unsigned int vmid);
 
+static int gfx_v10_0_set_powergating_state(void *handle,
+ enum amd_powergating_state state);
 static void gfx10_kiq_set_resources(struct amdgpu_ring *kiq_ring, uint64_t 
queue_mask)
 {
amdgpu_ring_write(kiq_ring, PACKET3(PACKET3_SET_RESOURCES, 6));
@@ -7172,6 +7174,13 @@ static int gfx_v10_0_hw_fini(void *handle)
amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
 
+   /* WA for the Vangogh asic to fix the SMU suspend failure:
+* power gating needs to be set again during gfxoff control,
+* otherwise disallowing gfxoff will fail to take effect.
+*/
+   if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(10, 3, 1))
+   gfx_v10_0_set_powergating_state(handle, AMD_PG_STATE_UNGATE);
+
if (!adev->no_hw_access) {
if (amdgpu_async_gfx_ring) {
if (amdgpu_gfx_disable_kgq(adev, 0))
-- 
2.34.1



[PATCH v3 05/10] drm/ci: clean up xfails (specially flakes list)

2023-10-23 Thread Helen Koike
Since the script that collected the expectation files was bogus and was
placing tests in the flakes list incorrectly, restart the expectation
files with the corrected script.

This greatly reduces the number of tests in the flakes list.

Signed-off-by: Helen Koike 
Reviewed-by: David Heidelberg 

---

v2:
- fix typo in the commit message
- re-add kms_cursor_legacy@flip-vs-cursor-toggle back to msm-sdm845-flakes.txt
- removed kms_async_flips@crc,Fail from i915-cml-fails.txt

v3:
- add kms_rmfb@close-fd,Fail to amdgpu-stoney-fails.txt
- add kms_async_flips@crc to i915-kbl-flakes.txt

Signed-off-by: Helen Koike 
---
 .../gpu/drm/ci/xfails/amdgpu-stoney-fails.txt | 12 +-
 .../drm/ci/xfails/amdgpu-stoney-flakes.txt| 20 -
 drivers/gpu/drm/ci/xfails/i915-amly-fails.txt |  9 
 .../gpu/drm/ci/xfails/i915-amly-flakes.txt| 32 ---
 drivers/gpu/drm/ci/xfails/i915-apl-fails.txt  | 11 -
 drivers/gpu/drm/ci/xfails/i915-apl-flakes.txt |  1 -
 drivers/gpu/drm/ci/xfails/i915-cml-fails.txt  | 14 ++-
 drivers/gpu/drm/ci/xfails/i915-cml-flakes.txt | 38 -
 drivers/gpu/drm/ci/xfails/i915-glk-fails.txt  | 17 
 drivers/gpu/drm/ci/xfails/i915-glk-flakes.txt | 41 ---
 drivers/gpu/drm/ci/xfails/i915-kbl-fails.txt  |  7 
 drivers/gpu/drm/ci/xfails/i915-kbl-flakes.txt | 25 ---
 drivers/gpu/drm/ci/xfails/i915-tgl-fails.txt  |  1 -
 drivers/gpu/drm/ci/xfails/i915-tgl-flakes.txt |  5 ---
 drivers/gpu/drm/ci/xfails/i915-whl-flakes.txt |  1 -
 .../drm/ci/xfails/mediatek-mt8173-flakes.txt  |  0
 .../drm/ci/xfails/mediatek-mt8183-fails.txt   |  5 ++-
 .../drm/ci/xfails/mediatek-mt8183-flakes.txt  | 14 ---
 .../gpu/drm/ci/xfails/meson-g12b-fails.txt| 14 ---
 .../gpu/drm/ci/xfails/meson-g12b-flakes.txt   |  4 --
 .../gpu/drm/ci/xfails/msm-apq8016-flakes.txt  |  4 --
 .../gpu/drm/ci/xfails/msm-apq8096-fails.txt   |  2 +
 .../gpu/drm/ci/xfails/msm-apq8096-flakes.txt  |  4 --
 .../gpu/drm/ci/xfails/msm-sc7180-fails.txt| 15 ---
 .../gpu/drm/ci/xfails/msm-sc7180-flakes.txt   | 24 +++
 .../gpu/drm/ci/xfails/msm-sc7180-skips.txt| 18 +---
 .../gpu/drm/ci/xfails/msm-sdm845-fails.txt|  9 +---
 .../gpu/drm/ci/xfails/msm-sdm845-flakes.txt   | 19 +
 .../drm/ci/xfails/rockchip-rk3288-fails.txt   |  6 +++
 .../drm/ci/xfails/rockchip-rk3288-flakes.txt  |  9 
 .../drm/ci/xfails/rockchip-rk3399-fails.txt   | 40 +-
 .../drm/ci/xfails/rockchip-rk3399-flakes.txt  | 28 +++--
 .../drm/ci/xfails/virtio_gpu-none-flakes.txt  |  0
 33 files changed, 162 insertions(+), 287 deletions(-)
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-amly-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-apl-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-cml-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-glk-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-tgl-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/i915-whl-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/mediatek-mt8173-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/mediatek-mt8183-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/meson-g12b-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/msm-apq8016-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/msm-apq8096-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/rockchip-rk3288-flakes.txt
 delete mode 100644 drivers/gpu/drm/ci/xfails/virtio_gpu-none-flakes.txt

diff --git a/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt 
b/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt
index bd9392536e7c..ea87dc46bc2b 100644
--- a/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt
+++ b/drivers/gpu/drm/ci/xfails/amdgpu-stoney-fails.txt
@@ -1,8 +1,14 @@
 kms_addfb_basic@bad-pitch-65536,Fail
 kms_addfb_basic@bo-too-small,Fail
+kms_addfb_basic@too-high,Fail
+kms_async_flips@async-flip-with-page-flip-events,Fail
+kms_async_flips@crc,Fail
 kms_async_flips@invalid-async-flip,Fail
-kms_atomic@plane-immutable-zpos,Fail
+kms_atomic_transition@plane-all-modeset-transition-internal-panels,Fail
+kms_atomic_transition@plane-all-transition,Fail
+kms_atomic_transition@plane-all-transition-nonblocking,Fail
 kms_atomic_transition@plane-toggle-modeset-transition,Fail
+kms_atomic_transition@plane-use-after-nonblocking-unbind,Fail
 kms_bw@linear-tiling-1-displays-2560x1440p,Fail
 kms_bw@linear-tiling-1-displays-3840x2160p,Fail
 kms_bw@linear-tiling-2-displays-3840x2160p,Fail
@@ -11,9 +17,11 @@ kms_color@degamma,Fail
 kms_cursor_crc@cursor-size-change,Fail
 kms_cursor_crc@pipe-A-cursor-size-change,Fail
 kms_cursor_crc@pipe-B-cursor-size-change,Fail
-kms_cursor_legacy@forked-move,Fail
+kms_flip@flip-vs-modeset-vs-hang,Fail
+kms_flip@flip-vs-panning-vs-hang,Fail
 kms_hdr@bpc-switch,Fail
 kms_hdr@bpc-switch-dpms,Fail
+kms_plane@pixel-format,Fail
 kms_plane_multiple@atomic-pipe-A-tiling-none,Fail
 kms_rmfb@close-fd,Fail
 kms_rotation_

Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"

2023-10-23 Thread Felix Kuehling

[sorry, I hit send too early]


On 2023-10-23 11:15, Christian König wrote:

Am 23.10.23 um 15:06 schrieb Daniel Tang:

That commit causes the screen to freeze a few moments after running
clinfo on v6.6-rc7 with ROCm 5.6. Sometimes the rest of the computer,
including ssh, also freezes. On v6.5-rc1 it only results in a NULL
pointer dereference message in dmesg and turns the process into an
unkillable zombie that prevents shutdown without REISUB. Although
llama.cpp and hashcat worked on v6.2 with ROCm 5.6, later broke, and
are not fixed by this revert, pytorch-rocm now works stably and without
whole-computer freezes caused by any accidental running of clinfo.

This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.


That result doesn't make much sense. Felix please correct me, but 
AFAIK the ATS stuff was completely removed by now.


Are you sure that this is pure v6.6-rc7 and not some other patches 
applied? If yes then we must have missed something.


This revert doesn't really affect systems with ATS. It moves the sanity 
check back out of the ATS-specific code.


The NULL pointer dereference in the bug report comes from the CPU page 
table update code:


[10089.267556] BUG: kernel NULL pointer dereference, address: 
[10089.267563] #PF: supervisor write access in kernel mode
[10089.267566] #PF: error_code(0x0002) - not-present page
[10089.267569] PGD 0 P4D 0
[10089.267574] Oops: 0002 [#1] PREEMPT SMP NOPTI
[10089.267578] CPU: 23 PID: 18191 Comm: clinfo Tainted: G   OE  
6.5.0-9-generic #9-Ubuntu
[10089.267582] Hardware name: Micro-Star International Co., Ltd. MS-7C37/X570-A 
PRO (MS-7C37), BIOS H.I0 08/10/2022
[10089.267585] RIP: 0010:amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.267820] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 48 b8 00 f0 ff ff ff ff 00 
00 55 48 21 c1 8d 04 d5 00 00 00 00 4c 09 c1 48 01 c6 48 89 e5 <48> 89 0e 31 c0 
5d 31 d2 31 c9 31 f6 45 31 c0 e9 89 7e 27 fb 66 0f
[10089.267823] RSP: 0018:b49805eeb8b0 EFLAGS: 00010246
[10089.267827] RAX:  RBX: 0020 RCX: 00400480
[10089.267830] RDX:  RSI:  RDI: 9890d438
[10089.267832] RBP: b49805eeb8b0 R08: 00400480 R09: 0020
[10089.267835] R10: 000800100200 R11: 000800100200 R12: b49805eeba98
[10089.267837] R13: 0001 R14: 0020 R15: 0001
[10089.267840] FS:  7f8ca9f09740() GS:9897befc() 
knlGS:
[10089.267843] CS:  0010 DS:  ES:  CR0: 80050033
[10089.267846] CR2:  CR3: 0002e0746000 CR4: 00750ee0
[10089.267849] PKRU: 5554
[10089.267851] Call Trace:
[10089.267853]  
[10089.267858]  ? show_regs+0x6d/0x80
[10089.267865]  ? __die+0x24/0x80
[10089.267870]  ? page_fault_oops+0x99/0x1b0
[10089.267876]  ? do_user_addr_fault+0x316/0x6b0
[10089.267879]  ? srso_alias_return_thunk+0x5/0x7f
[10089.267884]  ? scsi_dispatch_cmd+0x91/0x240
[10089.267891]  ? exc_page_fault+0x83/0x1b0
[10089.267896]  ? asm_exc_page_fault+0x27/0x30
[10089.267904]  ? amdgpu_gmc_set_pte_pde+0x23/0x40 [amdgpu]
[10089.268140]  amdgpu_vm_cpu_update+0xa9/0x130 [amdgpu]
...

This revert is just a roundabout way of disabling CPU page table updates 
for compute VMs. But I don't think it really addresses the root cause.


Regards,
  Felix




Regards,
Christian.



Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596
Signed-off-by: Daniel Tang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c

index 82f25996ff5e..602f311ab766 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct 
amdgpu_device *adev, struct amdgpu_vm *vm)

  if (r)
  return r;
  +    /* Sanity checks */
+    if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+    r = -EINVAL;
+    goto unreserve_bo;
+    }
+
  /* Check if PD needs to be reinitialized and do it before
   * changing any other state, in case it fails.
   */
  if (pte_support_ats != vm->pte_support_ats) {
-    /* Sanity checks */
-    if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
-    r = -EINVAL;
-    goto unreserve_bo;
-    }
-
  vm->pte_support_ats = pte_support_ats;
  r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
 false);
-- 2.40.1








Re: [PATCH v3] drm/amdkfd: Use partial mapping in GPU page faults

2023-10-23 Thread Felix Kuehling

On 2023-10-20 17:53, Xiaogang.Chen wrote:

From: Xiaogang Chen 

After a partial migration to recover a GPU page fault, this patch maps
only the same page range that was migrated into the GPU vm space,
instead of mapping all pages of the svm range in which the page fault
happened.

Signed-off-by: Xiaogang Chen
---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 29 
  1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 54af7a2b29f8..3a71d04779b1 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1619,6 +1619,7 @@ static void *kfd_svm_page_owner(struct kfd_process *p, 
int32_t gpuidx)
   * 5. Release page table (and SVM BO) reservation
   */
  static int svm_range_validate_and_map(struct mm_struct *mm,
+ unsigned long map_start, unsigned long 
map_last,
  struct svm_range *prange, int32_t gpuidx,
  bool intr, bool wait, bool flush_tlb)
  {
@@ -1699,6 +1700,8 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
end = (prange->last + 1) << PAGE_SHIFT;
for (addr = start; !r && addr < end; ) {
struct hmm_range *hmm_range;
+   unsigned long map_start_vma;
+   unsigned long map_last_vma;
struct vm_area_struct *vma;
uint64_t vram_pages_vma;
unsigned long next = 0;
@@ -1747,9 +1750,16 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
r = -EAGAIN;
}
  
-		if (!r)

-   r = svm_range_map_to_gpus(prange, offset, npages, 
readonly,
- ctx->bitmap, wait, flush_tlb);
+   if (!r) {
+   map_start_vma = max(map_start, prange->start + offset);
+   map_last_vma = min(map_last, prange->start + offset + 
npages - 1);
+   if (map_start_vma <= map_last_vma) {
+   offset = map_start_vma - prange->start;
+   npages = map_last_vma - map_start_vma + 1;
+   r = svm_range_map_to_gpus(prange, offset, 
npages, readonly,
+ ctx->bitmap, wait, 
flush_tlb);
+   }
+   }
  
  		if (!r && next == end)

prange->mapped_to_gpu = true;
@@ -1855,8 +1865,8 @@ static void svm_range_restore_work(struct work_struct 
*work)
 */
mutex_lock(&prange->migrate_mutex);
  
-		r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,

-  false, true, false);
+   r = svm_range_validate_and_map(mm, prange->start, prange->last, 
prange,
+  MAX_GPU_INSTANCE, false, true, 
false);
if (r)
pr_debug("failed %d to map 0x%lx to gpus\n", r,
 prange->start);
@@ -3069,6 +3079,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr,
   write_fault, timestamp);
  
+	start = prange->start;

+   last = prange->last;


This means, page faults that don't migrate will map the whole range. 
Should we move the proper assignment of start and last out of the 
condition below, so it applies equally to page faults that migrate and 
those that don't?


Regards,
  Felix



if (prange->actual_loc != 0 || best_loc != 0) {
migration = true;
/* Align migration range start and size to granularity size */
@@ -3102,10 +3114,11 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
}
}
  
-	r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false);

+   r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false,
+  false, false);
if (r)
pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n",
-r, svms, prange->start, prange->last);
+r, svms, start, last);
  
  	kfd_smi_event_page_fault_end(node, p->lead_thread->pid, addr,

 migration);
@@ -3650,7 +3663,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
  
  		flush_tlb = !migrated && update_mapping && prange->mapped_to_gpu;
  
-		r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,

+   r = svm_range_validate_and_map(mm, prange->start, prange->last, 
prange, MAX_GPU_INSTANCE,
   true, true, flush_tlb);
if (r)

Re: [PATCH 3/3] Revert "[PATCH] drm/amdkfd: Use partial migrations in GPU page faults"

2023-10-23 Thread Felix Kuehling

On 2023-10-23 16:37, Philip Yang wrote:

This reverts commit 1fd60d88c4b57d715c0ae09794061c0cc53009e3.

The change prevents the entire range from migrating to VRAM, because
the retry-fault restore_pages path maps the remaining system memory
range to the GPUs. It will work correctly when submitted together with
the later partial-mapping-to-GPU patch.

Signed-off-by: Philip Yang 


The series is Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 150 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  83 +++--
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   6 +-
  4 files changed, 85 insertions(+), 160 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 81d25a679427..6c25dab051d5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -442,10 +442,10 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
goto out_free;
}
if (cpages != npages)
-   pr_debug("partial migration, 0x%lx/0x%llx pages collected\n",
+   pr_debug("partial migration, 0x%lx/0x%llx pages migrated\n",
 cpages, npages);
else
-   pr_debug("0x%lx pages collected\n", cpages);
+   pr_debug("0x%lx pages migrated\n", cpages);
  
  	r = svm_migrate_copy_to_vram(node, prange, &migrate, &mfence, scratch, ttm_res_offset);

migrate_vma_pages(&migrate);
@@ -479,8 +479,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
   * svm_migrate_ram_to_vram - migrate svm range from system to device
   * @prange: range structure
   * @best_loc: the device to migrate to
- * @start_mgr: start page to migrate
- * @last_mgr: last page to migrate
   * @mm: the process mm structure
   * @trigger: reason of migration
   *
@@ -491,7 +489,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
   */
  static int
  svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
-   unsigned long start_mgr, unsigned long last_mgr,
struct mm_struct *mm, uint32_t trigger)
  {
unsigned long addr, start, end;
@@ -501,30 +498,23 @@ svm_migrate_ram_to_vram(struct svm_range *prange, 
uint32_t best_loc,
unsigned long cpages = 0;
long r = 0;
  
-	if (!best_loc) {

-   pr_debug("svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n",
-   prange->svms, start_mgr, last_mgr);
+   if (prange->actual_loc == best_loc) {
+   pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
+prange->svms, prange->start, prange->last, best_loc);
return 0;
}
  
-	if (start_mgr < prange->start || last_mgr > prange->last) {

-   pr_debug("range [0x%lx 0x%lx] out prange [0x%lx 0x%lx]\n",
-start_mgr, last_mgr, prange->start, 
prange->last);
-   return -EFAULT;
-   }
-
node = svm_range_get_node_by_id(prange, best_loc);
if (!node) {
pr_debug("failed to get kfd node by id 0x%x\n", best_loc);
return -ENODEV;
}
  
-	pr_debug("svms 0x%p [0x%lx 0x%lx] in [0x%lx 0x%lx] to gpu 0x%x\n",

-   prange->svms, start_mgr, last_mgr, prange->start, prange->last,
-   best_loc);
+   pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
+prange->start, prange->last, best_loc);
  
-	start = start_mgr << PAGE_SHIFT;

-   end = (last_mgr + 1) << PAGE_SHIFT;
+   start = prange->start << PAGE_SHIFT;
+   end = (prange->last + 1) << PAGE_SHIFT;
  
  	r = svm_range_vram_node_new(node, prange, true);

if (r) {
@@ -554,11 +544,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t 
best_loc,
  
  	if (cpages) {

prange->actual_loc = best_loc;
-   prange->vram_pages = prange->vram_pages + cpages;
-   } else if (!prange->actual_loc) {
-   /* if no page migrated and all pages from prange are at
-* sys ram drop svm_bo got from svm_range_vram_node_new
-*/
+   svm_range_dma_unmap(prange);
+   } else {
svm_range_vram_node_free(prange);
}
  
@@ -676,8 +663,9 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,

   * Context: Process context, caller hold mmap read lock, prange->migrate_mutex
   *
   * Return:
+ *   0 - success with all pages migrated
   *   negative values - indicate error
- *   positive values or zero - number of pages got migrated
+ *   positive values - partial migration, number of pages not migrated
   */
  static long
  svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
@@ -688,7 +676,6 @@ svm_migrate_vma_to_ram(struct kfd_nod

Re: [PATCH 3/3] drm/amd: Explicitly disable ASPM when dynamic switching disabled

2023-10-23 Thread Alex Deucher
On Mon, Oct 23, 2023 at 5:12 PM Mario Limonciello
 wrote:
>
> Currently there are separate but related checks:
> * amdgpu_device_should_use_aspm()
> * amdgpu_device_aspm_support_quirk()
> * amdgpu_device_pcie_dynamic_switching_supported()
>
> Simplify into checking whether DPM was enabled or not in the auto
> case.  This works because amdgpu_device_pcie_dynamic_switching_supported()
> populates that value.
>
> Signed-off-by: Mario Limonciello 

Series is:
Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++---
>  drivers/gpu/drm/amd/amdgpu/nv.c|  7 +++
>  drivers/gpu/drm/amd/amdgpu/vi.c|  2 +-
>  4 files changed, 10 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 44df1a5bce7f..c1c98bd2d489 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1339,9 +1339,7 @@ void amdgpu_device_pci_config_reset(struct 
> amdgpu_device *adev);
>  int amdgpu_device_pci_reset(struct amdgpu_device *adev);
>  bool amdgpu_device_need_post(struct amdgpu_device *adev);
>  bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev);
> -bool amdgpu_device_pcie_dynamic_switching_supported(void);
>  bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev);
> -bool amdgpu_device_aspm_support_quirk(void);
>
>  void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
>   u64 num_vis_bytes);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4e144be7f044..7ec32b44df05 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1456,14 +1456,14 @@ bool amdgpu_device_seamless_boot_supported(struct 
> amdgpu_device *adev)
>  }
>
>  /*
> - * Intel hosts such as Raptor Lake and Sapphire Rapids don't support dynamic
> - * speed switching. Until we have confirmation from Intel that a specific 
> host
> - * supports it, it's safer that we keep it disabled for all.
> + * Intel hosts such as Rocket Lake, Alder Lake, Raptor Lake and Sapphire 
> Rapids
> + * don't support dynamic speed switching. Until we have confirmation from 
> Intel
> + * that a specific host supports it, it's safer that we keep it disabled for 
> all.
>   *
>   * 
> https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/
>   * https://gitlab.freedesktop.org/drm/amd/-/issues/2663
>   */
> -bool amdgpu_device_pcie_dynamic_switching_supported(void)
> +static bool amdgpu_device_pcie_dynamic_switching_supported(void)
>  {
>  #if IS_ENABLED(CONFIG_X86)
> struct cpuinfo_x86 *c = &cpu_data(0);
> @@ -1498,20 +1498,11 @@ bool amdgpu_device_should_use_aspm(struct 
> amdgpu_device *adev)
> }
> if (adev->flags & AMD_IS_APU)
> return false;
> +   if (!(adev->pm.pp_feature & PP_PCIE_DPM_MASK))
> +   return false;
> return pcie_aspm_enabled(adev->pdev);
>  }
>
> -bool amdgpu_device_aspm_support_quirk(void)
> -{
> -#if IS_ENABLED(CONFIG_X86)
> -   struct cpuinfo_x86 *c = &cpu_data(0);
> -
> -   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
> -#else
> -   return true;
> -#endif
> -}
> -
>  /* if we get transitioned to only one device, take VGA back */
>  /**
>   * amdgpu_device_vga_set_decode - enable/disable vga decode
> diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
> index 9fa220de1490..4d7976b77767 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nv.c
> @@ -513,7 +513,7 @@ static int nv_set_vce_clocks(struct amdgpu_device *adev, 
> u32 evclk, u32 ecclk)
>
>  static void nv_program_aspm(struct amdgpu_device *adev)
>  {
> -   if (!amdgpu_device_should_use_aspm(adev) || 
> !amdgpu_device_aspm_support_quirk())
> +   if (!amdgpu_device_should_use_aspm(adev))
> return;
>
> if (adev->nbio.funcs->program_aspm)
> @@ -608,9 +608,8 @@ static int nv_update_umd_stable_pstate(struct 
> amdgpu_device *adev,
> if (adev->gfx.funcs->update_perfmon_mgcg)
> adev->gfx.funcs->update_perfmon_mgcg(adev, !enter);
>
> -   if (!(adev->flags & AMD_IS_APU) &&
> -   (adev->nbio.funcs->enable_aspm) &&
> -amdgpu_device_should_use_aspm(adev))
> +   if (adev->nbio.funcs->enable_aspm &&
> +   amdgpu_device_should_use_aspm(adev))
> adev->nbio.funcs->enable_aspm(adev, !enter);
>
> return 0;
> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
> index 1a08052bade3..1a98812981f4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> @@ -1124,7 +1

[PATCH 2/3] drm/amd: Move AMD_IS_APU check for ASPM into top level function

2023-10-23 Thread Mario Limonciello
There is no need for every ASIC driver to perform the same check.
Move the duplicated code into amdgpu_device_should_use_aspm().

Signed-off-by: Mario Limonciello 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
 drivers/gpu/drm/amd/amdgpu/cik.c   | 4 
 drivers/gpu/drm/amd/amdgpu/nv.c| 3 +--
 drivers/gpu/drm/amd/amdgpu/si.c| 2 --
 drivers/gpu/drm/amd/amdgpu/soc15.c | 3 +--
 drivers/gpu/drm/amd/amdgpu/soc21.c | 3 +--
 drivers/gpu/drm/amd/amdgpu/vi.c| 3 +--
 7 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b345c7bcc3bc..4e144be7f044 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1496,6 +1496,8 @@ bool amdgpu_device_should_use_aspm(struct amdgpu_device 
*adev)
default:
return false;
}
+   if (adev->flags & AMD_IS_APU)
+   return false;
return pcie_aspm_enabled(adev->pdev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/cik.c b/drivers/gpu/drm/amd/amdgpu/cik.c
index 5641cf05d856..4cd13486a349 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik.c
@@ -1725,10 +1725,6 @@ static void cik_program_aspm(struct amdgpu_device *adev)
if (pci_is_root_bus(adev->pdev->bus))
return;
 
-   /* XXX double check APUs */
-   if (adev->flags & AMD_IS_APU)
-   return;
-
orig = data = RREG32_PCIE(ixPCIE_LC_N_FTS_CNTL);
data &= ~PCIE_LC_N_FTS_CNTL__LC_XMIT_N_FTS_MASK;
data |= (0x24 << PCIE_LC_N_FTS_CNTL__LC_XMIT_N_FTS__SHIFT) |
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 1995c7459f20..9fa220de1490 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -516,8 +516,7 @@ static void nv_program_aspm(struct amdgpu_device *adev)
if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_aspm_support_quirk())
return;
 
-   if (!(adev->flags & AMD_IS_APU) &&
-   (adev->nbio.funcs->program_aspm))
+   if (adev->nbio.funcs->program_aspm)
adev->nbio.funcs->program_aspm(adev);
 
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/si.c b/drivers/gpu/drm/amd/amdgpu/si.c
index f64b87b11b1b..456ca581f517 100644
--- a/drivers/gpu/drm/amd/amdgpu/si.c
+++ b/drivers/gpu/drm/amd/amdgpu/si.c
@@ -2456,8 +2456,6 @@ static void si_program_aspm(struct amdgpu_device *adev)
if (!amdgpu_device_should_use_aspm(adev))
return;
 
-   if (adev->flags & AMD_IS_APU)
-   return;
orig = data = RREG32_PCIE_PORT(PCIE_LC_N_FTS_CNTL);
data &= ~LC_XMIT_N_FTS_MASK;
data |= LC_XMIT_N_FTS(0x24) | LC_XMIT_N_FTS_OVERRIDE_EN;
diff --git a/drivers/gpu/drm/amd/amdgpu/soc15.c 
b/drivers/gpu/drm/amd/amdgpu/soc15.c
index 66ed28136bc8..d4b8d62f4294 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc15.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc15.c
@@ -646,8 +646,7 @@ static void soc15_program_aspm(struct amdgpu_device *adev)
if (!amdgpu_device_should_use_aspm(adev))
return;
 
-   if (!(adev->flags & AMD_IS_APU) &&
-   (adev->nbio.funcs->program_aspm))
+   if (adev->nbio.funcs->program_aspm)
adev->nbio.funcs->program_aspm(adev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/soc21.c 
b/drivers/gpu/drm/amd/amdgpu/soc21.c
index 8c6cab641a1c..d5083c549330 100644
--- a/drivers/gpu/drm/amd/amdgpu/soc21.c
+++ b/drivers/gpu/drm/amd/amdgpu/soc21.c
@@ -433,8 +433,7 @@ static void soc21_program_aspm(struct amdgpu_device *adev)
if (!amdgpu_device_should_use_aspm(adev))
return;
 
-   if (!(adev->flags & AMD_IS_APU) &&
-   (adev->nbio.funcs->program_aspm))
+   if (adev->nbio.funcs->program_aspm)
adev->nbio.funcs->program_aspm(adev);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index fe8ba9e9837b..1a08052bade3 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -1127,8 +1127,7 @@ static void vi_program_aspm(struct amdgpu_device *adev)
if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_pcie_dynamic_switching_supported())
return;
 
-   if (adev->flags & AMD_IS_APU ||
-   adev->asic_type < CHIP_POLARIS10)
+   if (adev->asic_type < CHIP_POLARIS10)
return;
 
orig = data = RREG32_PCIE(ixPCIE_LC_CNTL);
-- 
2.34.1



[PATCH 3/3] drm/amd: Explicitly disable ASPM when dynamic switching disabled

2023-10-23 Thread Mario Limonciello
Currently there are separate but related checks:
* amdgpu_device_should_use_aspm()
* amdgpu_device_aspm_support_quirk()
* amdgpu_device_pcie_dynamic_switching_supported()

Simplify these into a single check of whether PCIe DPM was enabled in the
auto case.  This works because amdgpu_device_pcie_dynamic_switching_supported()
now populates that value.

Signed-off-by: Mario Limonciello 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++---
 drivers/gpu/drm/amd/amdgpu/nv.c|  7 +++
 drivers/gpu/drm/amd/amdgpu/vi.c|  2 +-
 4 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 44df1a5bce7f..c1c98bd2d489 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1339,9 +1339,7 @@ void amdgpu_device_pci_config_reset(struct amdgpu_device 
*adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
 bool amdgpu_device_need_post(struct amdgpu_device *adev);
 bool amdgpu_device_seamless_boot_supported(struct amdgpu_device *adev);
-bool amdgpu_device_pcie_dynamic_switching_supported(void);
 bool amdgpu_device_should_use_aspm(struct amdgpu_device *adev);
-bool amdgpu_device_aspm_support_quirk(void);
 
 void amdgpu_cs_report_moved_bytes(struct amdgpu_device *adev, u64 num_bytes,
  u64 num_vis_bytes);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4e144be7f044..7ec32b44df05 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1456,14 +1456,14 @@ bool amdgpu_device_seamless_boot_supported(struct 
amdgpu_device *adev)
 }
 
 /*
- * Intel hosts such as Raptor Lake and Sapphire Rapids don't support dynamic
- * speed switching. Until we have confirmation from Intel that a specific host
- * supports it, it's safer that we keep it disabled for all.
+ * Intel hosts such as Rocket Lake, Alder Lake, Raptor Lake and Sapphire Rapids
+ * don't support dynamic speed switching. Until we have confirmation from Intel
+ * that a specific host supports it, it's safer that we keep it disabled for 
all.
  *
  * 
https://edc.intel.com/content/www/us/en/design/products/platforms/details/raptor-lake-s/13th-generation-core-processors-datasheet-volume-1-of-2/005/pci-express-support/
  * https://gitlab.freedesktop.org/drm/amd/-/issues/2663
  */
-bool amdgpu_device_pcie_dynamic_switching_supported(void)
+static bool amdgpu_device_pcie_dynamic_switching_supported(void)
 {
 #if IS_ENABLED(CONFIG_X86)
struct cpuinfo_x86 *c = &cpu_data(0);
@@ -1498,20 +1498,11 @@ bool amdgpu_device_should_use_aspm(struct amdgpu_device 
*adev)
}
if (adev->flags & AMD_IS_APU)
return false;
+   if (!(adev->pm.pp_feature & PP_PCIE_DPM_MASK))
+   return false;
return pcie_aspm_enabled(adev->pdev);
 }
 
-bool amdgpu_device_aspm_support_quirk(void)
-{
-#if IS_ENABLED(CONFIG_X86)
-   struct cpuinfo_x86 *c = &cpu_data(0);
-
-   return !(c->x86 == 6 && c->x86_model == INTEL_FAM6_ALDERLAKE);
-#else
-   return true;
-#endif
-}
-
 /* if we get transitioned to only one device, take VGA back */
 /**
  * amdgpu_device_vga_set_decode - enable/disable vga decode
diff --git a/drivers/gpu/drm/amd/amdgpu/nv.c b/drivers/gpu/drm/amd/amdgpu/nv.c
index 9fa220de1490..4d7976b77767 100644
--- a/drivers/gpu/drm/amd/amdgpu/nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/nv.c
@@ -513,7 +513,7 @@ static int nv_set_vce_clocks(struct amdgpu_device *adev, 
u32 evclk, u32 ecclk)
 
 static void nv_program_aspm(struct amdgpu_device *adev)
 {
-   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_aspm_support_quirk())
+   if (!amdgpu_device_should_use_aspm(adev))
return;
 
if (adev->nbio.funcs->program_aspm)
@@ -608,9 +608,8 @@ static int nv_update_umd_stable_pstate(struct amdgpu_device 
*adev,
if (adev->gfx.funcs->update_perfmon_mgcg)
adev->gfx.funcs->update_perfmon_mgcg(adev, !enter);
 
-   if (!(adev->flags & AMD_IS_APU) &&
-   (adev->nbio.funcs->enable_aspm) &&
-amdgpu_device_should_use_aspm(adev))
+   if (adev->nbio.funcs->enable_aspm &&
+   amdgpu_device_should_use_aspm(adev))
adev->nbio.funcs->enable_aspm(adev, !enter);
 
return 0;
diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 1a08052bade3..1a98812981f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct amdgpu_device *adev)
bool bL1SS = false;
bool bClkReqSupport = true;
 
-   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_pcie_dynamic_switching_supported())
+   if (!amdgpu_device_should_use_aspm(adev))
return;
 
if (ade

[PATCH 1/3] drm/amd: Disable PP_PCIE_DPM_MASK when dynamic speed switching not supported

2023-10-23 Thread Mario Limonciello
Rather than individual ASICs checking for the quirk, set the quirk at the
driver level.

Signed-off-by: Mario Limonciello 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 2 ++
 drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c | 4 +---
 drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c | 2 +-
 drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c  | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index cc047fe0b7ee..b345c7bcc3bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2315,6 +2315,8 @@ static int amdgpu_device_ip_early_init(struct 
amdgpu_device *adev)
adev->pm.pp_feature &= ~PP_GFXOFF_MASK;
if (amdgpu_sriov_vf(adev) && adev->asic_type == CHIP_SIENNA_CICHLID)
adev->pm.pp_feature &= ~PP_OVERDRIVE_MASK;
+   if (!amdgpu_device_pcie_dynamic_switching_supported())
+   adev->pm.pp_feature &= ~PP_PCIE_DPM_MASK;
 
total = true;
for (i = 0; i < adev->num_ip_blocks; i++) {
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c 
b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
index 5a2371484a58..11372fcc59c8 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/smu7_hwmgr.c
@@ -1823,9 +1823,7 @@ static void smu7_init_dpm_defaults(struct pp_hwmgr *hwmgr)
 
data->mclk_dpm_key_disabled = hwmgr->feature_mask & PP_MCLK_DPM_MASK ? 
false : true;
data->sclk_dpm_key_disabled = hwmgr->feature_mask & PP_SCLK_DPM_MASK ? 
false : true;
-   data->pcie_dpm_key_disabled =
-   !amdgpu_device_pcie_dynamic_switching_supported() ||
-   !(hwmgr->feature_mask & PP_PCIE_DPM_MASK);
+   data->pcie_dpm_key_disabled = !(hwmgr->feature_mask & PP_PCIE_DPM_MASK);
/* need to set voltage control types before EVV patching */
data->voltage_control = SMU7_VOLTAGE_CONTROL_NONE;
data->vddci_control = SMU7_VOLTAGE_CONTROL_NONE;
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index 090249b6422a..97a5c9b3e941 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -2115,7 +2115,7 @@ static int sienna_cichlid_update_pcie_parameters(struct 
smu_context *smu,
min_lane_width = min_lane_width > max_lane_width ?
 max_lane_width : min_lane_width;
 
-   if (!amdgpu_device_pcie_dynamic_switching_supported()) {
+   if (!(smu->adev->pm.pp_feature & PP_PCIE_DPM_MASK)) {
pcie_table->pcie_gen[0] = max_gen_speed;
pcie_table->pcie_lane[0] = max_lane_width;
} else {
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
index bcb7ab9d2221..e06de3524a1a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c
@@ -2437,7 +2437,7 @@ int smu_v13_0_update_pcie_parameters(struct smu_context 
*smu,
uint32_t smu_pcie_arg;
int ret, i;
 
-   if (!amdgpu_device_pcie_dynamic_switching_supported()) {
+   if (!(smu->adev->pm.pp_feature & PP_PCIE_DPM_MASK)) {
if (pcie_table->pcie_gen[num_of_levels - 1] < pcie_gen_cap)
pcie_gen_cap = pcie_table->pcie_gen[num_of_levels - 1];
 
-- 
2.34.1



[PATCH 2/3] Revert "drm/amdkfd:remove unused code"

2023-10-23 Thread Philip Yang
This reverts commit d97e7b1eb8afd7a404466533b0bc192351b760c7.

Needed for the next revert patch.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 60 
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  3 ++
 2 files changed, 63 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 4d000c63cde8..3422eee8d0d0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1145,6 +1145,66 @@ svm_range_add_child(struct svm_range *prange, struct 
mm_struct *mm,
list_add_tail(&pchild->child_list, &prange->child_list);
 }
 
+/**
+ * svm_range_split_by_granularity - collect ranges within granularity boundary
+ *
+ * @p: the process with svms list
+ * @mm: mm structure
+ * @addr: the vm fault address in pages, to split the prange
+ * @parent: parent range if prange is from child list
+ * @prange: prange to split
+ *
+ * Trims @prange to be a single aligned block of prange->granularity if
+ * possible. The head and tail are added to the child_list in @parent.
+ *
+ * Context: caller must hold mmap_read_lock and prange->lock
+ *
+ * Return:
+ * 0 - OK, otherwise error code
+ */
+int
+svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm,
+  unsigned long addr, struct svm_range *parent,
+  struct svm_range *prange)
+{
+   struct svm_range *head, *tail;
+   unsigned long start, last, size;
+   int r;
+
+   /* Align the split range start and size to the granularity size, so a
+* single PTE is used for the whole range; this reduces the number of
+* PTEs updated and the L1 TLB space used for translation.
+*/
+   size = 1UL << prange->granularity;
+   start = ALIGN_DOWN(addr, size);
+   last = ALIGN(addr + 1, size) - 1;
+
+   pr_debug("svms 0x%p split [0x%lx 0x%lx] to [0x%lx 0x%lx] size 0x%lx\n",
+prange->svms, prange->start, prange->last, start, last, size);
+
+   if (start > prange->start) {
+   r = svm_range_split(prange, start, prange->last, &head);
+   if (r)
+   return r;
+   svm_range_add_child(parent, mm, head, SVM_OP_ADD_RANGE);
+   }
+
+   if (last < prange->last) {
+   r = svm_range_split(prange, prange->start, last, &tail);
+   if (r)
+   return r;
+   svm_range_add_child(parent, mm, tail, SVM_OP_ADD_RANGE);
+   }
+
+   /* xnack on, update mapping on GPUs with ACCESS_IN_PLACE */
+   if (p->xnack_enabled && prange->work_item.op == SVM_OP_ADD_RANGE) {
+   prange->work_item.op = SVM_OP_ADD_RANGE_AND_MAP;
+   pr_debug("change prange 0x%p [0x%lx 0x%lx] op %d\n",
+prange, prange->start, prange->last,
+SVM_OP_ADD_RANGE_AND_MAP);
+   }
+   return 0;
+}
 static bool
 svm_nodes_in_same_hive(struct kfd_node *node_a, struct kfd_node *node_b)
 {
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 026863a0abcd..be11ba0c4289 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -172,6 +172,9 @@ struct kfd_node *svm_range_get_node_by_id(struct svm_range 
*prange,
 int svm_range_vram_node_new(struct kfd_node *node, struct svm_range *prange,
bool clear);
 void svm_range_vram_node_free(struct svm_range *prange);
+int svm_range_split_by_granularity(struct kfd_process *p, struct mm_struct *mm,
+  unsigned long addr, struct svm_range *parent,
+  struct svm_range *prange);
 int svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
uint32_t vmid, uint32_t node_id, uint64_t addr,
bool write_fault);
-- 
2.35.1



[PATCH 3/3] Revert "[PATCH] drm/amdkfd: Use partial migrations in GPU page faults"

2023-10-23 Thread Philip Yang
This reverts commit 1fd60d88c4b57d715c0ae09794061c0cc53009e3.

The reverted change prevents migrating the entire range to VRAM, because the
retry-fault restore_pages path maps the remaining system-memory range to the
GPUs.  It will work correctly once resubmitted together with the later
partial-mapping-to-GPU patch.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 150 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c |  83 +++--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   6 +-
 4 files changed, 85 insertions(+), 160 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 81d25a679427..6c25dab051d5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -442,10 +442,10 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
goto out_free;
}
if (cpages != npages)
-   pr_debug("partial migration, 0x%lx/0x%llx pages collected\n",
+   pr_debug("partial migration, 0x%lx/0x%llx pages migrated\n",
 cpages, npages);
else
-   pr_debug("0x%lx pages collected\n", cpages);
+   pr_debug("0x%lx pages migrated\n", cpages);
 
r = svm_migrate_copy_to_vram(node, prange, &migrate, &mfence, scratch, 
ttm_res_offset);
migrate_vma_pages(&migrate);
@@ -479,8 +479,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
  * svm_migrate_ram_to_vram - migrate svm range from system to device
  * @prange: range structure
  * @best_loc: the device to migrate to
- * @start_mgr: start page to migrate
- * @last_mgr: last page to migrate
  * @mm: the process mm structure
  * @trigger: reason of migration
  *
@@ -491,7 +489,6 @@ svm_migrate_vma_to_vram(struct kfd_node *node, struct 
svm_range *prange,
  */
 static int
 svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc,
-   unsigned long start_mgr, unsigned long last_mgr,
struct mm_struct *mm, uint32_t trigger)
 {
unsigned long addr, start, end;
@@ -501,30 +498,23 @@ svm_migrate_ram_to_vram(struct svm_range *prange, 
uint32_t best_loc,
unsigned long cpages = 0;
long r = 0;
 
-   if (!best_loc) {
-   pr_debug("svms 0x%p [0x%lx 0x%lx] migrate to sys ram\n",
-   prange->svms, start_mgr, last_mgr);
+   if (prange->actual_loc == best_loc) {
+   pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n",
+prange->svms, prange->start, prange->last, best_loc);
return 0;
}
 
-   if (start_mgr < prange->start || last_mgr > prange->last) {
-   pr_debug("range [0x%lx 0x%lx] out prange [0x%lx 0x%lx]\n",
-start_mgr, last_mgr, prange->start, 
prange->last);
-   return -EFAULT;
-   }
-
node = svm_range_get_node_by_id(prange, best_loc);
if (!node) {
pr_debug("failed to get kfd node by id 0x%x\n", best_loc);
return -ENODEV;
}
 
-   pr_debug("svms 0x%p [0x%lx 0x%lx] in [0x%lx 0x%lx] to gpu 0x%x\n",
-   prange->svms, start_mgr, last_mgr, prange->start, prange->last,
-   best_loc);
+   pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms,
+prange->start, prange->last, best_loc);
 
-   start = start_mgr << PAGE_SHIFT;
-   end = (last_mgr + 1) << PAGE_SHIFT;
+   start = prange->start << PAGE_SHIFT;
+   end = (prange->last + 1) << PAGE_SHIFT;
 
r = svm_range_vram_node_new(node, prange, true);
if (r) {
@@ -554,11 +544,8 @@ svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t 
best_loc,
 
if (cpages) {
prange->actual_loc = best_loc;
-   prange->vram_pages = prange->vram_pages + cpages;
-   } else if (!prange->actual_loc) {
-   /* if no page migrated and all pages from prange are at
-* sys ram drop svm_bo got from svm_range_vram_node_new
-*/
+   svm_range_dma_unmap(prange);
+   } else {
svm_range_vram_node_free(prange);
}
 
@@ -676,8 +663,9 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
  * Context: Process context, caller hold mmap read lock, prange->migrate_mutex
  *
  * Return:
+ *   0 - success with all pages migrated
  *   negative values - indicate error
- *   positive values or zero - number of pages got migrated
+ *   positive values - partial migration, number of pages not migrated
  */
 static long
 svm_migrate_vma_to_ram(struct kfd_node *node, struct svm_range *prange,
@@ -688,7 +676,6 @@ svm_migrate_vma_to_ram(struct kfd_node *node, struct 
svm_range *prange,
uint64_t npages = (end - start) >> PAGE_SHI

[PATCH 1/3] Revert "drm/amdkfd: Use partial mapping in GPU page fault recovery"

2023-10-23 Thread Philip Yang
This reverts commit c45c3bc930bf60e7658f87c519a40f77513b96aa.

Found KFDSVMEvict test regression on vega10, kernel BUG backtrace:

[  135.365083] amdgpu: Migration failed during eviction
[  135.365090] [ cut here ]
[  135.365097] This was not the last reference
[  135.365122] WARNING: CPU: 5 PID: 1998 at
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:3515
svm_range_evict_svm_bo_worker+0x21c/0x390 [amdgpu]
[  135.365836]  svm_range_evict_svm_bo_worker+0x21c/0x390 [amdgpu]
[  135.366249]  process_one_work+0x298/0x590
[  135.366256]  worker_thread+0x3d/0x3d0
..
[  135.721257] kernel BUG at include/linux/swapops.h:472!
[  135.721537] Call Trace:
[  135.721540]  
[  135.721592]  hmm_vma_walk_pmd+0x5c8/0x780
[  135.721598]  walk_pgd_range+0x3bc/0x7c0
[  135.721604]  __walk_page_range+0x1ec/0x200
[  135.721609]  walk_page_range+0x119/0x1a0
[  135.721613]  hmm_range_fault+0x5d/0xb0
[  135.721617]  amdgpu_hmm_range_get_pages+0x159/0x240 [amdgpu]
[  135.721820]  svm_range_validate_and_map+0x57f/0x16c0 [amdgpu]
[  135.722411]  svm_range_restore_pages+0xcd8/0x1150 [amdgpu]
[  135.722613]  amdgpu_vm_handle_fault+0xc2/0x360 [amdgpu]
[  135.722777]  gmc_v9_0_process_interrupt+0x255/0x670 [amdgpu]

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 35 +---
 1 file changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f2b33fb2afcf..4d000c63cde8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1565,7 +1565,6 @@ static void *kfd_svm_page_owner(struct kfd_process *p, 
int32_t gpuidx)
  * 5. Release page table (and SVM BO) reservation
  */
 static int svm_range_validate_and_map(struct mm_struct *mm,
- unsigned long map_start, unsigned long 
map_last,
  struct svm_range *prange, int32_t gpuidx,
  bool intr, bool wait, bool flush_tlb)
 {
@@ -1646,8 +1645,6 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
end = (prange->last + 1) << PAGE_SHIFT;
for (addr = start; !r && addr < end; ) {
struct hmm_range *hmm_range;
-   unsigned long map_start_vma;
-   unsigned long map_last_vma;
struct vm_area_struct *vma;
uint64_t vram_pages_vma;
unsigned long next = 0;
@@ -1696,16 +1693,9 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
r = -EAGAIN;
}
 
-   if (!r) {
-   map_start_vma = max(map_start, prange->start + offset);
-   map_last_vma = min(map_last, prange->start + offset + 
npages - 1);
-   if (map_start_vma <= map_last_vma) {
-   offset = map_start_vma - prange->start;
-   npages = map_last_vma - map_start_vma + 1;
-   r = svm_range_map_to_gpus(prange, offset, 
npages, readonly,
- ctx->bitmap, wait, 
flush_tlb);
-   }
-   }
+   if (!r)
+   r = svm_range_map_to_gpus(prange, offset, npages, 
readonly,
+ ctx->bitmap, wait, flush_tlb);
 
if (!r && next == end)
prange->mapped_to_gpu = true;
@@ -1811,8 +1801,8 @@ static void svm_range_restore_work(struct work_struct 
*work)
 */
mutex_lock(&prange->migrate_mutex);
 
-   r = svm_range_validate_and_map(mm, prange->start, prange->last, 
prange,
-  MAX_GPU_INSTANCE, false, true, 
false);
+   r = svm_range_validate_and_map(mm, prange, MAX_GPU_INSTANCE,
+  false, true, false);
if (r)
pr_debug("failed %d to map 0x%lx to gpus\n", r,
 prange->start);
@@ -3026,8 +3016,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
kfd_smi_event_page_fault_start(node, p->lead_thread->pid, addr,
   write_fault, timestamp);
 
-   start = prange->start;
-   last = prange->last;
if (prange->actual_loc != 0 || best_loc != 0) {
migration = true;
/* Align migration range start and size to granularity size */
@@ -3061,11 +3049,10 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
}
}
 
-   r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false,
-  false, false);
+   r = svm_range_validate_and_map(mm, prange, gpuidx, false, false, false);
if (r)

Re: [PATCH] drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL

2023-10-23 Thread Alex Deucher
Applied.  Thanks!

Alex

On Mon, Oct 23, 2023 at 9:06 AM  wrote:
>
> In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file 
> could result in an abnormal null pointer access when the smc_rreg pointer is 
> NULL. Below are the steps to reproduce this issue and the corresponding 
> exception log:
>
> 1. Navigate to the directory: /sys/kernel/debug/dri/0
> 2. Execute command: cat amdgpu_regs_smc
> 3. Exception Log::
> [4005007.702554] BUG: kernel NULL pointer dereference, address: 
> 
> [4005007.702562] #PF: supervisor instruction fetch in kernel mode
> [4005007.702567] #PF: error_code(0x0010) - not-present page
> [4005007.702570] PGD 0 P4D 0
> [4005007.702576] Oops: 0010 [#1] SMP NOPTI
> [4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G   OE 
> 5.15.0-43-generic #46-Ubuntu
> [4005007.702590] RIP: 0010:0x0
> [4005007.702598] Code: Unable to access opcode bytes at RIP 
> 0xffd6.
> [4005007.702600] RSP: 0018:a82b46d27da0 EFLAGS: 00010206
> [4005007.702605] RAX:  RBX:  RCX: 
> a82b46d27e68
> [4005007.702609] RDX: 0001 RSI:  RDI: 
> 9940656e
> [4005007.702612] RBP: a82b46d27dd8 R08:  R09: 
> 994060c07980
> [4005007.702615] R10: 0002 R11:  R12: 
> 7f5e06753000
> [4005007.702618] R13: 9940656e R14: a82b46d27e68 R15: 
> 7f5e06753000
> [4005007.702622] FS:  7f5e0755b740() GS:99479d30() 
> knlGS:
> [4005007.702626] CS:  0010 DS:  ES:  CR0: 80050033
> [4005007.702629] CR2: ffd6 CR3: 0003253fc000 CR4: 
> 003506e0
> [4005007.702633] Call Trace:
> [4005007.702636]  
> [4005007.702640]  amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu]
> [4005007.703002]  full_proxy_read+0x5c/0x80
> [4005007.703011]  vfs_read+0x9f/0x1a0
> [4005007.703019]  ksys_read+0x67/0xe0
> [4005007.703023]  __x64_sys_read+0x19/0x20
> [4005007.703028]  do_syscall_64+0x5c/0xc0
> [4005007.703034]  ? do_user_addr_fault+0x1e3/0x670
> [4005007.703040]  ? exit_to_user_mode_prepare+0x37/0xb0
> [4005007.703047]  ? irqentry_exit_to_user_mode+0x9/0x20
> [4005007.703052]  ? irqentry_exit+0x19/0x30
> [4005007.703057]  ? exc_page_fault+0x89/0x160
> [4005007.703062]  ? asm_exc_page_fault+0x8/0x30
> [4005007.703068]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [4005007.703075] RIP: 0033:0x7f5e07672992
> [4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 
> 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 
> 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [4005007.703083] RSP: 002b:7ffe03097898 EFLAGS: 0246 ORIG_RAX: 
> 
> [4005007.703088] RAX: ffda RBX: 0002 RCX: 
> 7f5e07672992
> [4005007.703091] RDX: 0002 RSI: 7f5e06753000 RDI: 
> 0003
> [4005007.703094] RBP: 7f5e06753000 R08: 7f5e06752010 R09: 
> 7f5e06752010
> [4005007.703096] R10: 0022 R11: 0246 R12: 
> 00022000
> [4005007.703099] R13: 0003 R14: 0002 R15: 
> 0002
> [4005007.703105]  
> [4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash 
> af_alg binfmt_misc nls_   iso8859_1 ipmi_ssif ast intel_rapl_msr 
> intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac t   tm 
> edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl 
> sch_fq_codel ipmi_devintf ipm   i_msghandler msr parport_pc ppdev lp 
> parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solo   mon 
> ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) 
> amdttm(OE) iommu_v   2 amd_sched(OE) amdkcl(OE) drm_kms_helper 
> syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_coredrm igb ahci 
> xhci_pci libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca
> [4005007.703184] CR2: 
> [4005007.703188] ---[ end trace ac65a538d240da39 ]---
> [4005007.800865] RIP: 0010:0x0
> [4005007.800871] Code: Unable to access opcode bytes at RIP 
> 0xffd6.
> [4005007.800874] RSP: 0018:a82b46d27da0 EFLAGS: 00010206
> [4005007.800878] RAX:  RBX:  RCX: 
> a82b46d27e68
> [4005007.800881] RDX: 0001 RSI:  RDI: 
> 9940656e
> [4005007.800883] RBP: a82b46d27dd8 R08:  R09: 
> 994060c07980
> [4005007.800886] R10: 0002 R11:  R12: 
> 7f5e06753000
> [4005007.800888] R13: 9940656e R14: a82b46d27e68 R15: 
> 7f5e06753000
> [4005007.800891] FS:  7f5e0755b740() GS:99479d30() 
> knlGS:
> [4005007.800895] CS:  0010 DS:  ES:  CR0: 80050033
> [4005007.800898] CR2: ffd6 CR3: 0003253fc000 CR4: 
> 003506e0
>
> Signed-

Re: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely

2023-10-23 Thread Deucher, Alexander
[Public]

Acked-by: Alex Deucher 

From: amd-gfx  on behalf of James Zhu 

Sent: Thursday, September 7, 2023 10:41 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Lin, Amber ; Zhu, James ; Kamal, Asad 

Subject: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely

amdxcp unloads incompletely, and the error below will be seen during
load/unload:
sysfs: cannot create duplicate filename '/devices/platform/amdgpu_xcp.0'

devres_release_group() frees the xcp device first; the platform device is
unregistered later in platform_device_unregister().

Signed-off-by: James Zhu 
---
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c 
b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index 353597fc908d..90ddd8371176 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -89,9 +89,10 @@ EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
 void amdgpu_xcp_drv_release(void)
 {
 for (--pdev_num; pdev_num >= 0; --pdev_num) {
-   devres_release_group(&xcp_dev[pdev_num]->pdev->dev, NULL);
-   platform_device_unregister(xcp_dev[pdev_num]->pdev);
-   xcp_dev[pdev_num]->pdev = NULL;
+   struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+
+   devres_release_group(&pdev->dev, NULL);
+   platform_device_unregister(pdev);
 xcp_dev[pdev_num] = NULL;
 }
 pdev_num = 0;
--
2.34.1



Re: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects

2023-10-23 Thread Alex Deucher
On Sat, Oct 21, 2023 at 8:02 PM Lijo Lazar  wrote:
>
> PCI domain/segment information of xccs is available through ACPI DSM
> methods. Consider that also while looking for devices.
>
> Signed-off-by: Lijo Lazar 

Acked-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +---
>  1 file changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> index 2bca37044ad0..d62e49758635 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
> @@ -68,7 +68,7 @@ struct amdgpu_acpi_xcc_info {
>  struct amdgpu_acpi_dev_info {
> struct list_head list;
> struct list_head xcc_list;
> -   uint16_t bdf;
> +   uint32_t sbdf;
> uint16_t supp_xcp_mode;
> uint16_t xcp_mode;
> uint16_t mem_mode;
> @@ -927,7 +927,7 @@ static acpi_status amdgpu_acpi_get_node_id(acpi_handle 
> handle,
>  #endif
>  }
>
> -static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u16 bdf)
> +static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u32 sbdf)
>  {
> struct amdgpu_acpi_dev_info *acpi_dev;
>
> @@ -935,14 +935,14 @@ static struct amdgpu_acpi_dev_info 
> *amdgpu_acpi_get_dev(u16 bdf)
> return NULL;
>
> list_for_each_entry(acpi_dev, &amdgpu_acpi_dev_list, list)
> -   if (acpi_dev->bdf == bdf)
> +   if (acpi_dev->sbdf == sbdf)
> return acpi_dev;
>
> return NULL;
>  }
>
>  static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info,
> -   struct amdgpu_acpi_xcc_info *xcc_info, u16 
> bdf)
> +   struct amdgpu_acpi_xcc_info *xcc_info, u32 
> sbdf)
>  {
> struct amdgpu_acpi_dev_info *tmp;
> union acpi_object *obj;
> @@ -955,7 +955,7 @@ static int amdgpu_acpi_dev_init(struct 
> amdgpu_acpi_dev_info **dev_info,
>
> INIT_LIST_HEAD(&tmp->xcc_list);
> INIT_LIST_HEAD(&tmp->list);
> -   tmp->bdf = bdf;
> +   tmp->sbdf = sbdf;
>
> obj = acpi_evaluate_dsm_typed(xcc_info->handle, &amd_xcc_dsm_guid, 0,
>   AMD_XCC_DSM_GET_SUPP_MODE, NULL,
> @@ -1007,7 +1007,7 @@ static int amdgpu_acpi_dev_init(struct 
> amdgpu_acpi_dev_info **dev_info,
>
> DRM_DEBUG_DRIVER(
> "New dev(%x): Supported xcp mode: %x curr xcp_mode : %x mem 
> mode : %x, tmr base: %llx tmr size: %llx  ",
> -   tmp->bdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode,
> +   tmp->sbdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode,
> tmp->tmr_base, tmp->tmr_size);
> list_add_tail(&tmp->list, &amdgpu_acpi_dev_list);
> *dev_info = tmp;
> @@ -1023,7 +1023,7 @@ static int amdgpu_acpi_dev_init(struct 
> amdgpu_acpi_dev_info **dev_info,
>  }
>
>  static int amdgpu_acpi_get_xcc_info(struct amdgpu_acpi_xcc_info *xcc_info,
> -   u16 *bdf)
> +   u32 *sbdf)
>  {
> union acpi_object *obj;
> acpi_status status;
> @@ -1054,8 +1054,10 @@ static int amdgpu_acpi_get_xcc_info(struct 
> amdgpu_acpi_xcc_info *xcc_info,
> xcc_info->phy_id = (obj->integer.value >> 32) & 0xFF;
> /* xcp node of this xcc [47:40] */
> xcc_info->xcp_node = (obj->integer.value >> 40) & 0xFF;
> +   /* PF domain of this xcc [31:16] */
> +   *sbdf = (obj->integer.value) & 0x;
> /* PF bus/dev/fn of this xcc [63:48] */
> -   *bdf = (obj->integer.value >> 48) & 0x;
> +   *sbdf |= (obj->integer.value >> 48) & 0x;
> ACPI_FREE(obj);
> obj = NULL;
>
> @@ -1079,7 +1081,7 @@ static int amdgpu_acpi_enumerate_xcc(void)
> struct acpi_device *acpi_dev;
> char hid[ACPI_ID_LEN];
> int ret, id;
> -   u16 bdf;
> +   u32 sbdf;
>
> INIT_LIST_HEAD(&amdgpu_acpi_dev_list);
> xa_init(&numa_info_xa);
> @@ -1107,16 +1109,16 @@ static int amdgpu_acpi_enumerate_xcc(void)
> xcc_info->handle = acpi_device_handle(acpi_dev);
> acpi_dev_put(acpi_dev);
>
> -   ret = amdgpu_acpi_get_xcc_info(xcc_info, &bdf);
> +   ret = amdgpu_acpi_get_xcc_info(xcc_info, &sbdf);
> if (ret) {
> kfree(xcc_info);
> continue;
> }
>
> -   dev_info = amdgpu_acpi_get_dev(bdf);
> +   dev_info = amdgpu_acpi_get_dev(sbdf);
>
> if (!dev_info)
> -   ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, bdf);
> +   ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, sbdf);
>
> if (ret == -ENOMEM)
> return ret;
> @@ -1136,13 +1138,14 @@ int amdgpu_acpi_get_tmr_info(struct amdgpu_device 
> *adev, u64 *tmr_offset,
>

Re: [PATCH] drm/amdkfd: Address 'remap_list' not described in 'svm_range_add'

2023-10-23 Thread Felix Kuehling

On 2023-10-23 12:12, Srinivasan Shanmugam wrote:

Fixes the below:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2073: warning: Function 
parameter or member 'remap_list' not described in 'svm_range_add'

Cc: Felix Kuehling 
Cc: Christian König 
Cc: Alex Deucher 
Cc: "Pan, Xinhui" 
Signed-off-by: Srinivasan Shanmugam 


Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f2b33fb2afcf..f43dedf3e240 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2046,6 +2046,7 @@ svm_range_split_new(struct svm_range_list *svms, uint64_t 
start, uint64_t last,
   * @update_list: output, the ranges need validate and update GPU mapping
   * @insert_list: output, the ranges need insert to svms
   * @remove_list: output, the ranges are replaced and need remove from svms
+ * @remap_list: output, remap unaligned svm ranges
   *
   * Check if the virtual address range has overlap with any existing ranges,
   * split partly overlapping ranges and add new ranges in the gaps. All changes


Re: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects

2023-10-23 Thread Lazar, Lijo
[AMD Official Use Only - General]



Thanks,
Lijo

From: amd-gfx  on behalf of Lijo Lazar 

Sent: Friday, October 20, 2023 8:44:22 PM
To: amd-gfx@lists.freedesktop.org 
Cc: Deucher, Alexander ; Kasiviswanathan, Harish 
; Zhang, Hawking 
Subject: [PATCH] drm/amdgpu: Use pcie domain of xcc acpi objects

PCI domain/segment information of xccs is available through ACPI DSM
methods. Consider it as well when looking for devices.

Signed-off-by: Lijo Lazar 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c | 40 +---
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
index 2bca37044ad0..d62e49758635 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_acpi.c
@@ -68,7 +68,7 @@ struct amdgpu_acpi_xcc_info {
 struct amdgpu_acpi_dev_info {
 struct list_head list;
 struct list_head xcc_list;
-   uint16_t bdf;
+   uint32_t sbdf;
 uint16_t supp_xcp_mode;
 uint16_t xcp_mode;
 uint16_t mem_mode;
@@ -927,7 +927,7 @@ static acpi_status amdgpu_acpi_get_node_id(acpi_handle 
handle,
 #endif
 }

-static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u16 bdf)
+static struct amdgpu_acpi_dev_info *amdgpu_acpi_get_dev(u32 sbdf)
 {
 struct amdgpu_acpi_dev_info *acpi_dev;

@@ -935,14 +935,14 @@ static struct amdgpu_acpi_dev_info 
*amdgpu_acpi_get_dev(u16 bdf)
 return NULL;

 list_for_each_entry(acpi_dev, &amdgpu_acpi_dev_list, list)
-   if (acpi_dev->bdf == bdf)
+   if (acpi_dev->sbdf == sbdf)
 return acpi_dev;

 return NULL;
 }

 static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info **dev_info,
-   struct amdgpu_acpi_xcc_info *xcc_info, u16 bdf)
+   struct amdgpu_acpi_xcc_info *xcc_info, u32 sbdf)
 {
 struct amdgpu_acpi_dev_info *tmp;
 union acpi_object *obj;
@@ -955,7 +955,7 @@ static int amdgpu_acpi_dev_init(struct amdgpu_acpi_dev_info 
**dev_info,

 INIT_LIST_HEAD(&tmp->xcc_list);
 INIT_LIST_HEAD(&tmp->list);
-   tmp->bdf = bdf;
+   tmp->sbdf = sbdf;

 obj = acpi_evaluate_dsm_typed(xcc_info->handle, &amd_xcc_dsm_guid, 0,
   AMD_XCC_DSM_GET_SUPP_MODE, NULL,
@@ -1007,7 +1007,7 @@ static int amdgpu_acpi_dev_init(struct 
amdgpu_acpi_dev_info **dev_info,

 DRM_DEBUG_DRIVER(
 "New dev(%x): Supported xcp mode: %x curr xcp_mode : %x mem 
mode : %x, tmr base: %llx tmr size: %llx  ",
-   tmp->bdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode,
+   tmp->sbdf, tmp->supp_xcp_mode, tmp->xcp_mode, tmp->mem_mode,
 tmp->tmr_base, tmp->tmr_size);
 list_add_tail(&tmp->list, &amdgpu_acpi_dev_list);
 *dev_info = tmp;
@@ -1023,7 +1023,7 @@ static int amdgpu_acpi_dev_init(struct 
amdgpu_acpi_dev_info **dev_info,
 }

 static int amdgpu_acpi_get_xcc_info(struct amdgpu_acpi_xcc_info *xcc_info,
-   u16 *bdf)
+   u32 *sbdf)
 {
 union acpi_object *obj;
 acpi_status status;
@@ -1054,8 +1054,10 @@ static int amdgpu_acpi_get_xcc_info(struct 
amdgpu_acpi_xcc_info *xcc_info,
 xcc_info->phy_id = (obj->integer.value >> 32) & 0xFF;
 /* xcp node of this xcc [47:40] */
 xcc_info->xcp_node = (obj->integer.value >> 40) & 0xFF;
+   /* PF domain of this xcc [31:16] */
+   *sbdf = (obj->integer.value) & 0x;
 /* PF bus/dev/fn of this xcc [63:48] */
-   *bdf = (obj->integer.value >> 48) & 0x;
+   *sbdf |= (obj->integer.value >> 48) & 0x;
 ACPI_FREE(obj);
 obj = NULL;

@@ -1079,7 +1081,7 @@ static int amdgpu_acpi_enumerate_xcc(void)
 struct acpi_device *acpi_dev;
 char hid[ACPI_ID_LEN];
 int ret, id;
-   u16 bdf;
+   u32 sbdf;

 INIT_LIST_HEAD(&amdgpu_acpi_dev_list);
 xa_init(&numa_info_xa);
@@ -1107,16 +1109,16 @@ static int amdgpu_acpi_enumerate_xcc(void)
 xcc_info->handle = acpi_device_handle(acpi_dev);
 acpi_dev_put(acpi_dev);

-   ret = amdgpu_acpi_get_xcc_info(xcc_info, &bdf);
+   ret = amdgpu_acpi_get_xcc_info(xcc_info, &sbdf);
 if (ret) {
 kfree(xcc_info);
 continue;
 }

-   dev_info = amdgpu_acpi_get_dev(bdf);
+   dev_info = amdgpu_acpi_get_dev(sbdf);

 if (!dev_info)
-   ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, bdf);
+   ret = amdgpu_acpi_dev_init(&dev_info, xcc_info, sbdf);

 if (ret == -ENOMEM)
 return ret;
@@ -1136,13 +1138,14 @@ int amdgpu_acpi_get_tmr_info

Re: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely

2023-10-23 Thread Zhu, James
[AMD Official Use Only - General]

ping ...


Thanks & Best Regards!


James Zhu


From: Zhu, James 
Sent: Thursday, September 7, 2023 10:41 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Kamal, Asad ; Lin, Amber ; Zhu, 
James 
Subject: [PATCH] drm/amdxcp: fix amdxcp unloads incompletely

amdxcp does not unload completely, and the error below is seen during a
load/unload cycle:
sysfs: cannot create duplicate filename '/devices/platform/amdgpu_xcp.0'

devres_release_group() frees the xcp device first, while the platform
device is only unregistered later in platform_device_unregister().

Signed-off-by: James Zhu 
---
 drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c 
b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
index 353597fc908d..90ddd8371176 100644
--- a/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
+++ b/drivers/gpu/drm/amd/amdxcp/amdgpu_xcp_drv.c
@@ -89,9 +89,10 @@ EXPORT_SYMBOL(amdgpu_xcp_drm_dev_alloc);
 void amdgpu_xcp_drv_release(void)
 {
 for (--pdev_num; pdev_num >= 0; --pdev_num) {
-   devres_release_group(&xcp_dev[pdev_num]->pdev->dev, NULL);
-   platform_device_unregister(xcp_dev[pdev_num]->pdev);
-   xcp_dev[pdev_num]->pdev = NULL;
+   struct platform_device *pdev = xcp_dev[pdev_num]->pdev;
+
+   devres_release_group(&pdev->dev, NULL);
+   platform_device_unregister(pdev);
 xcp_dev[pdev_num] = NULL;
 }
 pdev_num = 0;
--
2.34.1



[PATCH] drm/amdkfd: Address 'remap_list' not described in 'svm_range_add'

2023-10-23 Thread Srinivasan Shanmugam
Fixes the below:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_svm.c:2073: warning: Function 
parameter or member 'remap_list' not described in 'svm_range_add'

Cc: Felix Kuehling 
Cc: Christian König 
Cc: Alex Deucher 
Cc: "Pan, Xinhui" 
Signed-off-by: Srinivasan Shanmugam 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f2b33fb2afcf..f43dedf3e240 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2046,6 +2046,7 @@ svm_range_split_new(struct svm_range_list *svms, uint64_t 
start, uint64_t last,
  * @update_list: output, the ranges need validate and update GPU mapping
  * @insert_list: output, the ranges need insert to svms
  * @remove_list: output, the ranges are replaced and need remove from svms
+ * @remap_list: output, remap unaligned svm ranges
  *
  * Check if the virtual address range has overlap with any existing ranges,
  * split partly overlapping ranges and add new ranges in the gaps. All changes
-- 
2.34.1



Re: [PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"

2023-10-23 Thread Christian König

On 23.10.23 at 15:06, Daniel Tang wrote:

That commit causes the screen to freeze a few moments after running
clinfo on v6.6-rc7 with ROCm 5.6; sometimes the rest of the machine,
including ssh, freezes as well. On v6.5-rc1 it only results in a NULL
pointer dereference message in dmesg and leaves the process a zombie
whose unkillable state prevents shutdown without REISUB. Although
llama.cpp and hashcat worked on v6.2 with ROCm 5.6, then broke, and are
not fixed by this revert, pytorch-rocm now runs stably, without the
whole-machine freezes caused by any accidental run of clinfo.

This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.


That result doesn't make much sense. Felix, please correct me, but AFAIK 
the ATS stuff has been completely removed by now.


Are you sure that this is pure v6.6-rc7 and not some other patches 
applied? If yes, then we must have missed something.


Regards,
Christian.



Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596
Signed-off-by: Daniel Tang 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 82f25996ff5e..602f311ab766 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
if (r)
return r;
  
+	/* Sanity checks */

+   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+   r = -EINVAL;
+   goto unreserve_bo;
+   }
+
/* Check if PD needs to be reinitialized and do it before
 * changing any other state, in case it fails.
 */
if (pte_support_ats != vm->pte_support_ats) {
-   /* Sanity checks */
-   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
-   r = -EINVAL;
-   goto unreserve_bo;
-   }
-
vm->pte_support_ats = pte_support_ats;
r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
   false);
--
2.40.1







Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue on smu 13

2023-10-23 Thread Alex Deucher
On Sun, Oct 22, 2023 at 9:05 PM Feng, Kenneth  wrote:
>
> [AMD Official Use Only - General]
>
> Thanks Alex, I will make another patch.
> And please refer to the comments inline below.
>
>
> -Original Message-
> From: Alex Deucher 
> Sent: Friday, October 20, 2023 9:58 PM
> To: Feng, Kenneth 
> Cc: amd-gfx@lists.freedesktop.org; Wang, Yang(Kevin) 
> Subject: Re: [PATCH] drm/amd/pm: fix the high voltage and temperature issue 
> on smu 13
>
> Caution: This message originated from an External Source. Use proper caution 
> when opening attachments, clicking links, or responding.
>
>
> On Fri, Oct 20, 2023 at 4:32 AM Kenneth Feng  wrote:
> >
> > fix the high voltage and temperature issue after the driver is
> > unloaded on smu 13.0.0, smu 13.0.7 and smu 13.0.10
> >
> > Signed-off-by: Kenneth Feng 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 36 +++
> >  drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c|  4 +--
> >  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c | 27 --
> >  drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h |  1 +
> > drivers/gpu/drm/amd/pm/swsmu/inc/smu_v13_0.h  |  2 ++
> >  .../gpu/drm/amd/pm/swsmu/smu13/smu_v13_0.c| 13 +++
> >  .../drm/amd/pm/swsmu/smu13/smu_v13_0_0_ppt.c  |  8 -
> > .../drm/amd/pm/swsmu/smu13/smu_v13_0_7_ppt.c  |  8 -
> >  8 files changed, 86 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 31f8c3ead161..c5c892a8b3f9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3986,13 +3986,23 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> > }
> > }
> > } else {
> > -   tmp = amdgpu_reset_method;
> > -   /* It should do a default reset when loading or 
> > reloading the driver,
> > -* regardless of the module parameter reset_method.
> > -*/
> > -   amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > -   r = amdgpu_asic_reset(adev);
> > -   amdgpu_reset_method = tmp;
> > +   switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +   case IP_VERSION(13, 0, 0):
> > +   case IP_VERSION(13, 0, 7):
> > +   case IP_VERSION(13, 0, 10):
> > +   r = psp_gpu_reset(adev);
> > +   break;
> > +   default:
> > +   tmp = amdgpu_reset_method;
> > +   /* It should do a default reset when 
> > loading or reloading the driver,
> > +* regardless of the module parameter 
> > reset_method.
> > +*/
> > +   amdgpu_reset_method = AMD_RESET_METHOD_NONE;
> > +   r = amdgpu_asic_reset(adev);
> > +   amdgpu_reset_method = tmp;
> > +   break;
> > +   }
> > +
> > if (r) {
> > dev_err(adev->dev, "asic reset on init 
> > failed\n");
> > goto failed; @@ -5945,6 +5955,18 @@
> > int amdgpu_device_baco_exit(struct drm_device *dev)
> > return -ENOTSUPP;
> >
> > ret = amdgpu_dpm_baco_exit(adev);
> > +
> > +   if (!ret)
> > +   switch (amdgpu_ip_version(adev, MP1_HWIP, 0)) {
> > +   case IP_VERSION(13, 0, 0):
> > +   case IP_VERSION(13, 0, 7):
> > +   case IP_VERSION(13, 0, 10):
> > +   adev->gfx.is_poweron = false;
> > +   break;
> > +   default:
> > +   break;
> > +   }
>
> Maybe better to move this into smu_v13_0_0_baco_exit() so we keep the asic 
> specific details out of the common files?
>
> > +
> > if (ret)
> > return ret;
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > index 80ca2c05b0b8..3ad38e42773b 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c
> > @@ -73,7 +73,7 @@ gmc_v11_0_vm_fault_interrupt_state(struct amdgpu_device 
> > *adev,
> >  * fini/suspend, so the overall state doesn't
> >  * change over the course of suspend/resume.
> >  */
> > -   if (!adev->in_s0ix)
> > +   if (!adev->in_s0ix && adev->gfx.is_poweron)
> > amdgpu_gmc_set_vm_fault_masks(adev, 
> > AMDGPU_GFXHUB(0), false);
> > break;
> > case AMDGPU_IRQ_STATE_ENABLE:
> > @@ -85,7 +85,7 @@ gmc_v11_0_vm

RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems

2023-10-23 Thread Limonciello, Mario
[Public]

> -Original Message-
> From: Deucher, Alexander 
> Sent: Monday, October 23, 2023 09:22
> To: Limonciello, Mario ; amd-
> g...@lists.freedesktop.org
> Cc: Limonciello, Mario ;
> paolo.gent...@canonical.com
> Subject: RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
>
> [Public]
>
> > -Original Message-
> > From: amd-gfx  On Behalf Of
> Mario
> > Limonciello
> > Sent: Monday, October 23, 2023 9:45 AM
> > To: amd-gfx@lists.freedesktop.org
> > Cc: Limonciello, Mario ;
> > paolo.gent...@canonical.com
> > Subject: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
> >
> > Originally we were quirking ASPM disabled specifically for VI when used with
> > Alder Lake, but it appears to have problems with Rocket Lake as well.
> >
> > Like we've done in the case of dpm for newer platforms, disable ASPM for all
> > Intel systems.
> >
> > Cc: sta...@vger.kernel.org # 5.15+
> > Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> > Reported-and-tested-by: Paolo Gentili 
> > Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742
> > Signed-off-by: Mario Limonciello 
>
> Reviewed-by: Alex Deucher 
>
> As a follow on, we probably want to apply this to all of the program_aspm()
> functions for each asic family.
>

Yeah; I had that thought too but wanted to have a narrow patch for fixes and 
stable first.
I will merge and send a follow up for that.

> Alex
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/vi.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c
> > b/drivers/gpu/drm/amd/amdgpu/vi.c index 6a8494f98d3e..fe8ba9e9837b
> > 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> > @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct
> > amdgpu_device *adev)
> >   bool bL1SS = false;
> >   bool bClkReqSupport = true;
> >
> > - if (!amdgpu_device_should_use_aspm(adev) ||
> > !amdgpu_device_aspm_support_quirk())
> > + if (!amdgpu_device_should_use_aspm(adev) ||
> > +!amdgpu_device_pcie_dynamic_switching_supported())
> >   return;
> >
> >   if (adev->flags & AMD_IS_APU ||
> > --
> > 2.34.1
>



RE: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems

2023-10-23 Thread Deucher, Alexander
[Public]

> -Original Message-
> From: amd-gfx  On Behalf Of Mario
> Limonciello
> Sent: Monday, October 23, 2023 9:45 AM
> To: amd-gfx@lists.freedesktop.org
> Cc: Limonciello, Mario ;
> paolo.gent...@canonical.com
> Subject: [PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems
>
> Originally we were quirking ASPM disabled specifically for VI when used with
> Alder Lake, but it appears to have problems with Rocket Lake as well.
>
> Like we've done in the case of dpm for newer platforms, disable ASPM for all
> Intel systems.
>
> Cc: sta...@vger.kernel.org # 5.15+
> Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
> Reported-and-tested-by: Paolo Gentili 
> Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742
> Signed-off-by: Mario Limonciello 

Reviewed-by: Alex Deucher 

As a follow on, we probably want to apply this to all of the program_aspm() 
functions for each asic family.

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/vi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c
> b/drivers/gpu/drm/amd/amdgpu/vi.c index 6a8494f98d3e..fe8ba9e9837b
> 100644
> --- a/drivers/gpu/drm/amd/amdgpu/vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/vi.c
> @@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct
> amdgpu_device *adev)
>   bool bL1SS = false;
>   bool bClkReqSupport = true;
>
> - if (!amdgpu_device_should_use_aspm(adev) ||
> !amdgpu_device_aspm_support_quirk())
> + if (!amdgpu_device_should_use_aspm(adev) ||
> +!amdgpu_device_pcie_dynamic_switching_supported())
>   return;
>
>   if (adev->flags & AMD_IS_APU ||
> --
> 2.34.1



RE: [PATCH v2 00/24] DC Patches October 18, 2023

2023-10-23 Thread Wheeler, Daniel
[Public]

Hi all,

This week this patchset was tested on the following systems:
* Lenovo ThinkBook T13s Gen4 with AMD Ryzen 5 6600U
* MSI Gaming X Trio RX 6800
* Gigabyte Gaming OC RX 7900 XTX

These systems were tested on the following display/connection types:
* eDP, (1080p 60hz [5650U]) (1920x1200 60hz [6600U]) (2560x1600 
120hz[6600U])
* VGA and DVI (1680x1050 60hz [DP to VGA/DVI, USB-C to VGA/DVI])
* DP/HDMI/USB-C (1440p 170hz, 4k 60hz, 4k 144hz, 4k 240hz [Includes 
USB-C to DP/HDMI adapters])
* Thunderbolt (LG Ultrafine 5k)
* MST (Startech MST14DP123DP [DP to 3x DP] and 2x 4k 60Hz displays)
* DSC (with Cable Matters 101075 [DP to 3x DP] with 3x 4k60 displays, 
and HP Hook G2 with 1 4k60 display)
* USB 4 (Kensington SD5700T and 1x 4k 60Hz display)
* PCON (Club3D CAC-1085 and 1x 4k 144Hz display [at 4k 120HZ, as that 
is the max the adapter supports])

The testing is a mix of automated and manual tests. Manual testing includes 
(but is not limited to):
* Changing display configurations and settings
* Benchmark testing
* Feature testing (Freesync, etc.)

Automated testing includes (but is not limited to):
* Script testing (scripts to automate some of the manual checks)
* IGT testing

The patchset consists of the amd-staging-drm-next branch (Head commit - 
310b5f1a3c9eb1ed96e437ead40f900f3b7bf530 -> drm/amd/display: Revert 
"drm/amd/display: Use drm_connector in create_validate_stream_for_sink") with 
new patches added on top of it.

Tested on Ubuntu 22.04.3, on Wayland and X11, using KDE Plasma and Gnome.

Tested-by: Daniel Wheeler 


Thank you,

Dan Wheeler
Sr. Technologist | AMD
SW Display
--
1 Commerce Valley Dr E, Thornhill, ON L3T 7X6
amd.com

-Original Message-
From: roman...@amd.com 
Sent: Thursday, October 19, 2023 9:32 AM
To: amd-gfx@lists.freedesktop.org
Cc: Wentland, Harry ; Li, Sun peng (Leo) 
; Siqueira, Rodrigo ; Pillai, 
Aurabindo ; Li, Roman ; Lin, Wayne 
; Wang, Chao-kai (Stylon) ; Kotarac, 
Pavle ; Gutierrez, Agustin ; 
Chung, ChiaHsuan (Tom) ; Wu, Hersen 
; Zuo, Jerry ; Li, Roman 
; Wheeler, Daniel 
Subject: [PATCH v2 00/24] DC Patches October 18, 2023

From: Roman Li 

This DC patchset brings improvements in multiple areas. In summary, we
highlight:

* Fixes null-deref regression after
  "drm/amd/display: Update OPP counter from new interface"
* Fixes display flashing when VSR and HDR enabled on dcn32
* Fixes dcn3x intermittent hangs due to FPO
* Fixes MST Multi-Stream light up on dcn35
* Fixes green screen on DCN31x when DVI and HDMI monitors attached
* Adds DML2 improvements
* Adds idle power optimization improvements
* Accommodates panels with lower nit backlight
* Updates SDP VSC colorimetry from DP test automation request
* Reverts "drm/amd/display: allow edp updates for virtual signal"

Cc: Daniel Wheeler 

Agustin Gutierrez (1):
  drm/amd/display: Remove power sequencing check

Alex Hung (2):
  drm/amd/display: Revert "drm/amd/display: allow edp updates for
virtual signal"
  drm/amd/display: Set emulated sink type to HDMI accordingly.

Alvin Lee (1):
  drm/amd/display: Update FAMS sequence for DCN30 & DCN32

Aric Cyr (1):
  drm/amd/display: 3.2.256

Aurabindo Pillai (1):
  drm/amd/display: add interface to query SubVP status

Fangzhi Zuo (1):
  drm/amd/display: Fix MST Multi-Stream Not Lighting Up on dcn35

George Shen (1):
  drm/amd/display: Update SDP VSC colorimetry from DP test automation
request

Hugo Hu (1):
  drm/amd/display: reprogram det size while seamless boot

Ilya Bakoulin (1):
  drm/amd/display: Fix shaper using bad LUT params

Iswara Nagulendran (1):
  drm/amd/display: Read before writing Backlight Mode Set Register

Michael Strauss (1):
  drm/amd/display: Disable SYMCLK32_SE RCO on DCN314

Nicholas Kazlauskas (2):
  drm/amd/display: Revert "Improve x86 and dmub ips handshake"
  drm/amd/display: Fix IPS handshake for idle optimizations

Rodrigo Siqueira (3):
  drm/amd/display: Correct enum typo
  drm/amd/display: Add prefix to amdgpu crtc functions
  drm/amd/display: Add prefix for plane functions

Samson Tam (2):
  drm/amd/display: fix num_ways overflow error
  drm/amd/display: add null check for invalid opps

Sung Joon Kim (2):
  drm/amd/display: Add a check for idle power optimization
  drm/amd/display: Fix HDMI framepack 3D test issue

Swapnil Patel (1):
  drm/amd/display: Reduce default backlight min from 5 nits to 1 nits

Wenjing Liu (2):
  drm/amd/display: add pipe resource management callbacks to DML2
  drm/amd/display: implement map dc pipe with callback in DML2

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |   5 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_crtc.c|  48 +-
 .../amd/display/amdgpu_dm/amdgpu_dm_debugfs.c |   4 +
 .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 542 +-
 ...

[PATCH] drm/amd: Disable ASPM for VI w/ all Intel systems

2023-10-23 Thread Mario Limonciello
Originally we were quirking ASPM disabled specifically for VI when
used with Alder Lake, but it appears to have problems with Rocket
Lake as well.

Like we've done in the case of dpm for newer platforms, disable
ASPM for all Intel systems.

Cc: sta...@vger.kernel.org # 5.15+
Fixes: 0064b0ce85bb ("drm/amd/pm: enable ASPM by default")
Reported-and-tested-by: Paolo Gentili 
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036742
Signed-off-by: Mario Limonciello 
---
 drivers/gpu/drm/amd/amdgpu/vi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 6a8494f98d3e..fe8ba9e9837b 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -1124,7 +1124,7 @@ static void vi_program_aspm(struct amdgpu_device *adev)
bool bL1SS = false;
bool bClkReqSupport = true;
 
-   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_aspm_support_quirk())
+   if (!amdgpu_device_should_use_aspm(adev) || 
!amdgpu_device_pcie_dynamic_switching_supported())
return;
 
if (adev->flags & AMD_IS_APU ||
-- 
2.34.1



[PATCH] Revert "drm/amdgpu: remove vm sanity check from amdgpu_vm_make_compute"

2023-10-23 Thread Daniel Tang
That commit causes the screen to freeze a few moments after running
clinfo on v6.6-rc7 with ROCm 5.6; sometimes the rest of the machine,
including ssh, freezes as well. On v6.5-rc1 it only results in a NULL
pointer dereference message in dmesg and leaves the process a zombie
whose unkillable state prevents shutdown without REISUB. Although
llama.cpp and hashcat worked on v6.2 with ROCm 5.6, then broke, and are
not fixed by this revert, pytorch-rocm now runs stably, without the
whole-machine freezes caused by any accidental run of clinfo.

This reverts commit 1d7776cc148b9f2f3ebaf1181662ba695a29f639.

Closes: https://github.com/RadeonOpenCompute/ROCm/issues/2596
Signed-off-by: Daniel Tang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 82f25996ff5e..602f311ab766 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -2243,16 +2243,16 @@ int amdgpu_vm_make_compute(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
if (r)
return r;
 
+   /* Sanity checks */
+   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
+   r = -EINVAL;
+   goto unreserve_bo;
+   }
+
/* Check if PD needs to be reinitialized and do it before
 * changing any other state, in case it fails.
 */
if (pte_support_ats != vm->pte_support_ats) {
-   /* Sanity checks */
-   if (!amdgpu_vm_pt_is_root_clean(adev, vm)) {
-   r = -EINVAL;
-   goto unreserve_bo;
-   }
-
vm->pte_support_ats = pte_support_ats;
r = amdgpu_vm_pt_clear(adev, vm, to_amdgpu_bo_vm(vm->root.bo),
   false);
--
2.40.1





[PATCH] drm/amdgpu: Fix a null pointer access when the smc_rreg pointer is NULL

2023-10-23 Thread qu . huang
In certain types of chips, such as VEGA20, reading the amdgpu_regs_smc file 
results in a NULL pointer dereference when the smc_rreg pointer is 
NULL. Below are the steps to reproduce the issue and the corresponding 
exception log:

1. Navigate to the directory: /sys/kernel/debug/dri/0
2. Execute command: cat amdgpu_regs_smc
3. Exception log:
[4005007.702554] BUG: kernel NULL pointer dereference, address: 
[4005007.702562] #PF: supervisor instruction fetch in kernel mode
[4005007.702567] #PF: error_code(0x0010) - not-present page
[4005007.702570] PGD 0 P4D 0
[4005007.702576] Oops: 0010 [#1] SMP NOPTI
[4005007.702581] CPU: 4 PID: 62563 Comm: cat Tainted: G   OE 5.15.0-43-generic #46-Ubuntu
[4005007.702590] RIP: 0010:0x0
[4005007.702598] Code: Unable to access opcode bytes at RIP 0xffd6.
[4005007.702600] RSP: 0018:a82b46d27da0 EFLAGS: 00010206
[4005007.702605] RAX:  RBX:  RCX: 
a82b46d27e68
[4005007.702609] RDX: 0001 RSI:  RDI: 
9940656e
[4005007.702612] RBP: a82b46d27dd8 R08:  R09: 
994060c07980
[4005007.702615] R10: 0002 R11:  R12: 
7f5e06753000
[4005007.702618] R13: 9940656e R14: a82b46d27e68 R15: 
7f5e06753000
[4005007.702622] FS:  7f5e0755b740() GS:99479d30() 
knlGS:
[4005007.702626] CS:  0010 DS:  ES:  CR0: 80050033
[4005007.702629] CR2: ffd6 CR3: 0003253fc000 CR4: 
003506e0
[4005007.702633] Call Trace:
[4005007.702636]  
[4005007.702640]  amdgpu_debugfs_regs_smc_read+0xb0/0x120 [amdgpu]
[4005007.703002]  full_proxy_read+0x5c/0x80
[4005007.703011]  vfs_read+0x9f/0x1a0
[4005007.703019]  ksys_read+0x67/0xe0
[4005007.703023]  __x64_sys_read+0x19/0x20
[4005007.703028]  do_syscall_64+0x5c/0xc0
[4005007.703034]  ? do_user_addr_fault+0x1e3/0x670
[4005007.703040]  ? exit_to_user_mode_prepare+0x37/0xb0
[4005007.703047]  ? irqentry_exit_to_user_mode+0x9/0x20
[4005007.703052]  ? irqentry_exit+0x19/0x30
[4005007.703057]  ? exc_page_fault+0x89/0x160
[4005007.703062]  ? asm_exc_page_fault+0x8/0x30
[4005007.703068]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[4005007.703075] RIP: 0033:0x7f5e07672992
[4005007.703079] Code: c0 e9 b2 fe ff ff 50 48 8d 3d fa b2 0c 00 e8 c5 1d 02 00 
0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 
<48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[4005007.703083] RSP: 002b:7ffe03097898 EFLAGS: 0246 ORIG_RAX: 

[4005007.703088] RAX: ffda RBX: 0002 RCX: 
7f5e07672992
[4005007.703091] RDX: 0002 RSI: 7f5e06753000 RDI: 
0003
[4005007.703094] RBP: 7f5e06753000 R08: 7f5e06752010 R09: 
7f5e06752010
[4005007.703096] R10: 0022 R11: 0246 R12: 
00022000
[4005007.703099] R13: 0003 R14: 0002 R15: 
0002
[4005007.703105]  
[4005007.703107] Modules linked in: nf_tables libcrc32c nfnetlink algif_hash 
af_alg binfmt_misc nls_iso8859_1 ipmi_ssif ast intel_rapl_msr 
intel_rapl_common drm_vram_helper drm_ttm_helper amd64_edac ttm 
edac_mce_amd kvm_amd ccp mac_hid k10temp kvm acpi_ipmi ipmi_si rapl 
sch_fq_codel ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp 
parport mtd pstore_blk efi_pstore ramoops pstore_zone reed_solomon 
ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) 
amdttm(OE) iommu_v2 amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops cec rc_core drm igb ahci xhci_pci 
libahci i2c_piix4 i2c_algo_bit xhci_pci_renesas dca
[4005007.703184] CR2: 
[4005007.703188] ---[ end trace ac65a538d240da39 ]---
[4005007.800865] RIP: 0010:0x0
[4005007.800871] Code: Unable to access opcode bytes at RIP 0xffd6.
[4005007.800874] RSP: 0018:a82b46d27da0 EFLAGS: 00010206
[4005007.800878] RAX:  RBX:  RCX: 
a82b46d27e68
[4005007.800881] RDX: 0001 RSI:  RDI: 
9940656e
[4005007.800883] RBP: a82b46d27dd8 R08:  R09: 
994060c07980
[4005007.800886] R10: 0002 R11:  R12: 
7f5e06753000
[4005007.800888] R13: 9940656e R14: a82b46d27e68 R15: 
7f5e06753000
[4005007.800891] FS:  7f5e0755b740() GS:99479d30() 
knlGS:
[4005007.800895] CS:  0010 DS:  ES:  CR0: 80050033
[4005007.800898] CR2: ffd6 CR3: 0003253fc000 CR4: 
003506e0

Signed-off-by: Qu Huang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index a4faea4..05405da 100644
--- a/drivers/gpu/drm/

Re: [PATCH 1/2] drm/amdgpu: handle the return for sync wait

2023-10-23 Thread Christian König

On 20.10.23 at 11:59, Emily Deng wrote:

Add error handling for amdgpu_sync_wait.

Signed-off-by: Emily Deng 


Reviewed-by: Christian König  for this one.

Going to discuss with Felix later today what we do with the timeout.

Christian.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 9 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c  | 6 +-
  2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 54f31a420229..3011c191d7dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -2668,7 +2668,7 @@ static int validate_invalid_user_pages(struct 
amdkfd_process_info *process_info)
  
  unreserve_out:

ttm_eu_backoff_reservation(&ticket, &resv_list);
-   amdgpu_sync_wait(&sync, false);
+   ret = amdgpu_sync_wait(&sync, false);
amdgpu_sync_free(&sync);
  out_free:
kfree(pd_bo_list_entries);
@@ -2939,8 +2939,11 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, 
struct dma_fence **ef)
}
  
  	/* Wait for validate and PT updates to finish */

-   amdgpu_sync_wait(&sync_obj, false);
-
+   ret = amdgpu_sync_wait(&sync_obj, false);
+   if (ret) {
+   pr_err("Failed to wait for validate and PT updates to 
finish\n");
+   goto validate_map_fail;
+   }
/* Release old eviction fence and create new one, because fence only
 * goes from unsignaled to signaled, fence cannot be reused.
 * Use context and mm from the old fence.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 70fe3b39c004..a63139277583 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -1153,7 +1153,11 @@ int amdgpu_mes_ctx_map_meta_data(struct amdgpu_device 
*adev,
}
amdgpu_sync_fence(&sync, vm->last_update);
  
-	amdgpu_sync_wait(&sync, false);

+   r = amdgpu_sync_wait(&sync, false);
+   if (r) {
+   DRM_ERROR("failed to wait sync\n");
+   goto error;
+   }
ttm_eu_backoff_reservation(&ticket, &list);
  
  	amdgpu_sync_free(&sync);




Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait

2023-10-23 Thread Christian König

On 20.10.23 at 21:47, Felix Kuehling wrote:


On 2023-10-20 09:10, Christian König wrote:
No, the wait forever is what is expected and perfectly valid user 
experience.


Waiting with a timeout on the other hand sounds like a really bad 
idea to me.


Every wait with a timeout needs a justification, e.g. for example 
that userspace explicitly specified it. And I absolutely don't see 
that here.
In this case the wait is in a kernel worker thread, and the wait is 
not interruptible. Not having a timeout means you can have a kernel 
worker stuck forever. The restore worker also has retry logic already, 
so it can handle a timeout perfectly well. But maybe this shouldn't be 
done automatically for all callers of amdgpu_sync_wait, but only for 
this particular caller in the restore_process_worker. So we'd need to 
add a timeout parameter to amdgpu_sync_wait.
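Felix's proposal can be sketched in plain userspace C. Everything below is a hypothetical model — `fake_fence` and `sync_wait_timeout` are stand-ins, not the actual amdgpu/dma_fence API: a negative timeout keeps today's wait-forever semantics, while the restore worker would pass a finite timeout and handle -ETIMEDOUT through its existing retry logic.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a dma_fence: just a signaled flag. */
struct fake_fence {
	bool signaled;
};

/*
 * Model of the proposed amdgpu_sync_wait() with a timeout parameter.
 * timeout_ms < 0 models today's wait-forever behaviour (the actual
 * blocking is elided here); a finite timeout makes an unsignaled
 * fence report -ETIMEDOUT so the caller can back off and retry.
 */
static int sync_wait_timeout(const struct fake_fence *fences, int n,
			     long timeout_ms)
{
	for (int i = 0; i < n; i++) {
		if (fences[i].signaled)
			continue;
		if (timeout_ms < 0)
			continue;	/* wait-forever case elided */
		return -ETIMEDOUT;	/* dma_fence_wait_timeout() == 0 */
	}
	return 0;
}
```

The point of the model is only the calling convention: callers that can retry (like the restore worker) get a bounded wait, everyone else keeps the old behaviour.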


Adding a parameter sounds like a good idea to me, but it's mandatory 
that dma_fence operations finish in a reasonable amount of time in the 
first place.


This is even documented by now and basically means we need timeouts in 
the area of 100ms for each operation and not between 10 and 60 seconds.


If upstream starts to taint the kernel for longer timeouts we will need 
to reduce the current values massively.


Regards,
Christian.



Regards,
  Felix




Regards,
Christian.

Am 20.10.23 um 10:52 schrieb Deng, Emily:

[AMD Official Use Only - General]

Hi Christian,
  The issue is a compute hang triggered by running a quark, which causes 
a compute job timeout. For compute, the timeout setting is 60s, but 
for gfx and sdma, it is 10s.

So, getting the timeout from the sched is reasonable, since it differs 
per sched. And if the wait times out, it will print an error, so it 
won't hide real issues. And even if there is a real issue, waiting 
forever is a bad user experience, and the driver couldn't work anymore.


Emily Deng
Best Wishes




-Original Message-
From: Christian König 
Sent: Friday, October 20, 2023 3:29 PM
To: Deng, Emily ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/2] drm/amdgpu: Add timeout for sync wait

Am 20.10.23 um 08:13 schrieb Emily Deng:

Issue: Deadlock happens during gpu recover; the call sequence is as below:

amdgpu_device_gpu_recover->amdgpu_amdkfd_pre_reset->flush_delayed_work
->amdgpu_amdkfd_gpuvm_restore_process_bos->amdgpu_sync_wait

It is because amdgpu_sync_wait is waiting for the bad job's fence,
and never returns, so the recovery couldn't continue.


Signed-off-by: Emily Deng 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c | 11 +--
   1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
index dcd8c066bc1f..6253d6aab7f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c
@@ -406,8 +406,15 @@ int amdgpu_sync_wait(struct amdgpu_sync *sync, bool intr)
  int i, r;

  hash_for_each_safe(sync->fences, i, tmp, e, node) {
-    r = dma_fence_wait(e->fence, intr);
-    if (r)
+    struct drm_sched_fence *s_fence = to_drm_sched_fence(e->fence);
+    long timeout = msecs_to_jiffies(1);
That handling doesn't make much sense. If you need a timeout then 
you need

a timeout for the whole function.

Additional to that timeouts often just hide real problems which 
needs fixing.


So this here needs a much better justification otherwise it's a 
pretty clear NAK.


Regards,
Christian.


+
+    if (s_fence)
+    timeout = s_fence->sched->timeout;
+
+    if (r == 0)
+    r = -ETIMEDOUT;
+    if (r < 0)
  return r;

  amdgpu_sync_entry_free(e);






Re: [PATCH 7/8] Documentation/gpu: Add an explanation about the DC weekly patches

2023-10-23 Thread Jani Nikula
On Fri, 20 Oct 2023, Rodrigo Siqueira  wrote:
> Sharing code with other OSes is confusing and raises some questions.
> This patch introduces some explanation about our upstream process with
> the shared code.

Thanks for writing this! It does help with the transparency.

Please find a comment inline.

>
> Cc: Mario Limonciello 
> Cc: Alex Deucher 
> Cc: Harry Wentland 
> Cc: Hamza Mahfooz 
> Signed-off-by: Rodrigo Siqueira 
> ---
>  Documentation/gpu/amdgpu/display/index.rst | 111 -
>  1 file changed, 109 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/gpu/amdgpu/display/index.rst 
> b/Documentation/gpu/amdgpu/display/index.rst
> index b09d1434754d..9d53a42c5339 100644
> --- a/Documentation/gpu/amdgpu/display/index.rst
> +++ b/Documentation/gpu/amdgpu/display/index.rst
> @@ -10,7 +10,114 @@ reason, our Display Core Driver is divided into two pieces:
>  1. **Display Core (DC)** contains the OS-agnostic components. Things like
> hardware programming and resource management are handled here.
>  2. **Display Manager (DM)** contains the OS-dependent components. Hooks to 
> the
> -   amdgpu base driver and DRM are implemented here.
> +   amdgpu base driver and DRM are implemented here. For example, you can 
> check
> +   display/amdgpu_dm/ folder.
> +
> +
> +How AMD shares code?
> +
> +
> +Maintaining the same code-base across multiple OSes requires a lot of
> +synchronization effort between repositories. In the DC case, we maintain a
> +central repository where everyone who works from other OSes can put their
> +change in this centralized repository. In a simple way, this shared 
> repository
> +is identical to all code that you can see in the display folder. The shared
> +repo has integration tests with our Linux CI farm, and we run an exhaustive 
> set
> +of IGT tests in various AMD GPUs/APUs. Our CI also checks ARM64/32, PPC64/32,
> +and x86_64/32 compilation with DCN enabled and disabled. After all tests pass
> +and the developer gets reviewed by someone else, the change gets merged into
> +the shared repository.
> +
> +To maintain this shared code working properly, we run two activities every
> +week:
> +
> +1. **Weekly backport**: We bring changes from Linux to the other shared
> +   repositories. This work gets massive support from our CI tools, which can
> +   detect new changes and send them to internal maintainers.
> +2. **Weekly promotion**: Every week, we get changes from other teams in the
> +   shared repo that have yet to be made public. For this reason, at the
> +   beginning of each week, a developer will review that internal repo and
> +   prepare a series of patches that can be sent to the public upstream
> +   (promotion).
> +
> +For the context of this documentation, promotion is the essential part that
> +deserves a good elaboration here.
> +
> +Weekly promotion
> +
> +
> +As described in the previous sections, the display folder has its equivalent 
> as
> +an internal repository shared with multiple teams. The promotion activity is
> +the task of 'promoting' those internal changes to the upstream; this is
> +possible thanks to numerous tools that help us manage the code-sharing
> +challenges. The weekly promotion usually takes one week, sliced like this:
> +
> +1. Extract all merged patches from the previous week that can be sent to the
> +   upstream. In other words, we check the week's time frame.
> +2. Evaluate if any potential new patches make sense to the upstream.
> +3. Create a branch candidate with the latest amd-staging-drm-next code 
> together
> +   with the new patches. At this step, we must ensure that every patch 
> compiles
> +   and the entire series pass our set of IGT test in different hardware 
> (i.e.,
> +   it has to pass to our CI).
> +4. Send the new candidate branch for an internal quality test and extra CI
> +   validation.
> +5. Send patches to amd-gfx for reviews. We wait a few days for community
> +   feedback after sending a series to the public mailing list.

So we've debated this one before. :)

Again, I applaud the transparency in writing the document, but I can't
help feeling the weekly promotions are code drops that will generally be
merged unchanged, with no comments. They have all been reviewed
internally, get posted with Reviewed-by tags pre-filled, we have no
visibility to the review. Since the code has already been merged
internally and the batch has passed CI, feels like the bar for changing
anything at this point is pretty high.

Just my two cents.


BR,
Jani.


(Side note, there should be a \n before 6.)

> +6. If there is an error, we debug as fast as possible; usually, a simple
> +   bisect in the weekly promotion patches points to a bad change, and we
> +   can take two possible actions: fix the issue or drop the patch. If we
> +   cannot identify the problem in the week interval, we drop the promotion
> +   and start over the following week; i

RE: [PATCH] drm/amdgpu/vpe: correct queue stop programing

2023-10-23 Thread Zhang, Yifan
[AMD Official Use Only - General]

This patch is:

Reviewed-by: Yifan Zhang 

Best Regards,
Yifan

-Original Message-
From: Yu, Lang 
Sent: Monday, October 23, 2023 5:25 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander ; Zhang, Yifan 
; Chiu, Solomon ; Yu, Lang 

Subject: [PATCH] drm/amdgpu/vpe: correct queue stop programing

IB test would fail if the queue is not stopped correctly.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c 
b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
index 756f39348dd9..174f13eff575 100644
--- a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
@@ -205,19 +205,21 @@ static int vpe_v6_1_ring_start(struct amdgpu_vpe *vpe)
 static int vpe_v_6_1_ring_stop(struct amdgpu_vpe *vpe)
 {
struct amdgpu_device *adev = vpe->ring.adev;
-   uint32_t rb_cntl, ib_cntl;
+   uint32_t queue_reset;
+   int ret;

-   rb_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL));
-   rb_cntl = REG_SET_FIELD(rb_cntl, VPEC_QUEUE0_RB_CNTL, RB_ENABLE, 0);
-   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL), rb_cntl);
+   queue_reset = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ));
+   queue_reset = REG_SET_FIELD(queue_reset, VPEC_QUEUE_RESET_REQ, QUEUE0_RESET, 1);
+   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ), queue_reset);

-   ib_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL));
-   ib_cntl = REG_SET_FIELD(ib_cntl, VPEC_QUEUE0_IB_CNTL, IB_ENABLE, 0);
-   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL), ib_cntl);
+   ret = SOC15_WAIT_ON_RREG(VPE, 0, regVPEC_QUEUE_RESET_REQ, 0,
+VPEC_QUEUE_RESET_REQ__QUEUE0_RESET_MASK);
+   if (ret)
+   dev_err(adev->dev, "VPE queue reset failed\n");

vpe->ring.sched.ready = false;

-   return 0;
+   return ret;
 }

 static int vpe_v6_1_set_trap_irq_state(struct amdgpu_device *adev,
--
2.25.1



[PATCH] drm/amdgpu/vpe: correct queue stop programing

2023-10-23 Thread Lang Yu
IB test would fail if the queue is not stopped correctly.

Signed-off-by: Lang Yu 
---
 drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c 
b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
index 756f39348dd9..174f13eff575 100644
--- a/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
+++ b/drivers/gpu/drm/amd/amdgpu/vpe_v6_1.c
@@ -205,19 +205,21 @@ static int vpe_v6_1_ring_start(struct amdgpu_vpe *vpe)
 static int vpe_v_6_1_ring_stop(struct amdgpu_vpe *vpe)
 {
struct amdgpu_device *adev = vpe->ring.adev;
-   uint32_t rb_cntl, ib_cntl;
+   uint32_t queue_reset;
+   int ret;
 
-   rb_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL));
-   rb_cntl = REG_SET_FIELD(rb_cntl, VPEC_QUEUE0_RB_CNTL, RB_ENABLE, 0);
-   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_RB_CNTL), rb_cntl);
+   queue_reset = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ));
+   queue_reset = REG_SET_FIELD(queue_reset, VPEC_QUEUE_RESET_REQ, QUEUE0_RESET, 1);
+   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE_RESET_REQ), queue_reset);
 
-   ib_cntl = RREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL));
-   ib_cntl = REG_SET_FIELD(ib_cntl, VPEC_QUEUE0_IB_CNTL, IB_ENABLE, 0);
-   WREG32(vpe_get_reg_offset(vpe, 0, regVPEC_QUEUE0_IB_CNTL), ib_cntl);
+   ret = SOC15_WAIT_ON_RREG(VPE, 0, regVPEC_QUEUE_RESET_REQ, 0,
+VPEC_QUEUE_RESET_REQ__QUEUE0_RESET_MASK);
+   if (ret)
+   dev_err(adev->dev, "VPE queue reset failed\n");
 
vpe->ring.sched.ready = false;
 
-   return 0;
+   return ret;
 }
 
 static int vpe_v6_1_set_trap_irq_state(struct amdgpu_device *adev,
-- 
2.25.1
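The stop sequence in this patch follows a common hardware pattern: set the reset-request bit, then poll until the hardware clears it (SOC15_WAIT_ON_RREG does the polling in the real driver). A self-contained model of that pattern, with a fake register that "acks" after a few reads — all names here are illustrative, not the VPE register interface:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Fake RESET_REQ register; bit 0 models QUEUE0_RESET. */
#define QUEUE0_RESET_MASK 0x1u

static uint32_t reset_req_reg;
static int poll_budget;	/* reads remaining until the "hw" acks */

static void write_reset_req(uint32_t v)
{
	reset_req_reg = v;
}

static uint32_t read_reset_req(void)
{
	/* Model the firmware clearing the request after a few reads. */
	if ((reset_req_reg & QUEUE0_RESET_MASK) && poll_budget-- <= 0)
		reset_req_reg &= ~QUEUE0_RESET_MASK;
	return reset_req_reg;
}

/* Request a queue reset, then poll-until-clear with a bounded budget. */
static int queue_stop(int max_polls)
{
	write_reset_req(reset_req_reg | QUEUE0_RESET_MASK);
	for (int i = 0; i < max_polls; i++) {
		if (!(read_reset_req() & QUEUE0_RESET_MASK))
			return 0;
	}
	return -ETIMEDOUT;
}
```

The bounded poll is also why the patch can propagate `ret` instead of returning 0 unconditionally: a queue that never acks the reset now surfaces as an error rather than being silently ignored.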



Re: [PATCH v7 4/6] drm: Refuse to async flip with atomic prop changes

2023-10-23 Thread Simon Ser
On Monday, October 23rd, 2023 at 10:42, Michel Dänzer 
 wrote:

> On 10/23/23 10:27, Simon Ser wrote:
> 
> > On Sunday, October 22nd, 2023 at 12:12, Michel Dänzer 
> > michel.daen...@mailbox.org wrote:
> > 
> > > On 10/17/23 14:16, Simon Ser wrote:
> > > 
> > > > After discussing with André it seems like we missed a plane type check
> > > > here. We need to make sure FB_ID changes are only allowed on primary
> > > > planes.
> > > 
> > > Can you elaborate why that's needed?
> > 
> > Current drivers are in general not prepared to perform async page-flips
> > on planes other than primary. For instance I don't think i915 has logic
> > to perform async page-flip on an overlay plane FB_ID change.
> 
> 
> That should be handled in the driver's atomic_check then?
> 
> Async flips of overlay planes would be useful e.g. for presenting a windowed 
> application with tearing, while the rest of the desktop is tear-free.

Yes, that would be useful, but requires more work. Small steps: first
expose what the legacy uAPI can do in atomic, then later extend that in
some drivers.


Re: [PATCH v7 4/6] drm: Refuse to async flip with atomic prop changes

2023-10-23 Thread Michel Dänzer
On 10/23/23 10:27, Simon Ser wrote:
> On Sunday, October 22nd, 2023 at 12:12, Michel Dänzer 
>  wrote:
>> On 10/17/23 14:16, Simon Ser wrote:
>>
>>> After discussing with André it seems like we missed a plane type check
>>> here. We need to make sure FB_ID changes are only allowed on primary
>>> planes.
>>
>> Can you elaborate why that's needed?
> 
> Current drivers are in general not prepared to perform async page-flips
> on planes other than primary. For instance I don't think i915 has logic
> to perform async page-flip on an overlay plane FB_ID change.

That should be handled in the driver's atomic_check then?

Async flips of overlay planes would be useful e.g. for presenting a windowed 
application with tearing, while the rest of the desktop is tear-free.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH v7 4/6] drm: Refuse to async flip with atomic prop changes

2023-10-23 Thread Simon Ser
On Sunday, October 22nd, 2023 at 12:12, Michel Dänzer 
 wrote:

> On 10/17/23 14:16, Simon Ser wrote:
> 
> > After discussing with André it seems like we missed a plane type check
> > here. We need to make sure FB_ID changes are only allowed on primary
> > planes.
> 
> Can you elaborate why that's needed?

Current drivers are in general not prepared to perform async page-flips
on planes other than primary. For instance I don't think i915 has logic
to perform async page-flip on an overlay plane FB_ID change.
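The check Simon describes boils down to a filter over the planes touched by an async commit. A simplified model (not the actual drm_atomic_uapi code): FB_ID may change only on primary planes, and an FB_ID change anywhere else is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum plane_type { PLANE_PRIMARY, PLANE_OVERLAY, PLANE_CURSOR };

struct plane_update {
	enum plane_type type;
	bool fb_id_changed;
};

/*
 * For an async (PAGE_FLIP_ASYNC) atomic commit, only primary-plane
 * FB_ID changes are allowed; an FB_ID change on an overlay or cursor
 * plane is rejected, mirroring what the legacy page-flip ioctl could do.
 */
static int check_async_flip(const struct plane_update *updates, int n)
{
	for (int i = 0; i < n; i++) {
		if (updates[i].fb_id_changed &&
		    updates[i].type != PLANE_PRIMARY)
			return -EINVAL;
	}
	return 0;
}
```

Lifting the restriction later (e.g. for tearing overlay flips, as Michel suggests) would then be a per-driver relaxation of this filter rather than a uAPI change.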


Re: [PATCH v6 6/6] drm/doc: Define KMS atomic state set

2023-10-23 Thread Simon Ser
On Tuesday, October 17th, 2023 at 14:10, Ville Syrjälä 
 wrote:

> On Mon, Oct 16, 2023 at 10:00:51PM +, Simon Ser wrote:
> 
> > On Monday, October 16th, 2023 at 17:10, Ville Syrjälä 
> > ville.syrj...@linux.intel.com wrote:
> > 
> > > On Mon, Oct 16, 2023 at 05:52:22PM +0300, Pekka Paalanen wrote:
> > > 
> > > > On Mon, 16 Oct 2023 15:42:16 +0200
> > > > André Almeida andrealm...@igalia.com wrote:
> > > > 
> > > > > Hi Pekka,
> > > > > 
> > > > > On 10/16/23 14:18, Pekka Paalanen wrote:
> > > > > 
> > > > > > On Mon, 16 Oct 2023 12:52:32 +0200
> > > > > > André Almeida andrealm...@igalia.com wrote:
> > > > > > 
> > > > > > > Hi Michel,
> > > > > > > 
> > > > > > > On 8/17/23 12:37, Michel Dänzer wrote:
> > > > > > > 
> > > > > > > > On 8/15/23 20:57, André Almeida wrote:
> > > > > > > > 
> > > > > > > > > From: Pekka Paalanen pekka.paala...@collabora.com
> > > > > > > > > 
> > > > > > > > > Specify how the atomic state is maintained between userspace 
> > > > > > > > > and
> > > > > > > > > kernel, plus the special case for async flips.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Pekka Paalanen pekka.paala...@collabora.com
> > > > > > > > > Signed-off-by: André Almeida andrealm...@igalia.com
> > > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > +An atomic commit with the flag DRM_MODE_PAGE_FLIP_ASYNC is 
> > > > > > > > > allowed to
> > > > > > > > > +effectively change only the FB_ID property on any planes. 
> > > > > > > > > No-operation changes
> > > > > > > > > +are ignored as always. [...]
> > > > > > > > 
> > > > > > > > During the hackfest in Brno, it was mentioned that a commit
> > > > > > > > which re-sets the same FB_ID could actually have an effect
> > > > > > > > with VRR: It could trigger scanout of the next frame before
> > > > > > > > vertical blank has reached its maximum duration. Some kind of
> > > > > > > > mechanism is required for this in order to allow user space
> > > > > > > > to perform low frame rate compensation.
> > > > > > > 
> > > > > > > Xaver tested this hypothesis in flipping the same fb on a VRR
> > > > > > > monitor and it worked as expected, so this shouldn't be a concern.
> > > > > > 
> > > > > > Right, so it must have some effect. It cannot be simply ignored like
> > > > > > in the proposed doc wording. Do we special-case re-setting the same
> > > > > > FB_ID as "not a no-op" or "not ignored" or some other way?
> > > > > 
> > > > > There's an effect in the refresh rate, the image won't change but it
> > > > > will report that a flip had happened asynchronously so the reported
> > > > > framerate will be increased. Maybe an additional wording could be like:
> > > > > 
> > > > > Flipping to the same FB_ID will result in an immediate flip as if it
> > > > > was changing to a different one, with no effect on the image but
> > > > > affecting the reported frame rate.
> > > > 
> > > > Re-setting FB_ID to its current value is a special case regardless of
> > > > PAGE_FLIP_ASYNC, is it not?
> > > 
> > > No. The rule has so far been that all side effects are observed
> > > even if you flip to the same fb. And that is one of my annoyances
> > > with this proposal. The rules will now be different for async flips
> > > vs. everything else.
> > 
> > Well with the patches the async page-flip case is exactly the same as
> > the non-async page-flip case. In both cases, if a FB_ID is included in
> > an atomic commit then the side effects are triggered even if the property
> > value didn't change. The rules are the same for everything.
> 
> I see it only checking if FB_ID changes or not. If it doesn't
> change then the implication is that the side effects will in
> fact be skipped as not all planes may even support async flips.

Hm right. So the problem is that setting any prop = same value as the
previous one will result in a new page-flip for synchronous page-flips,
but will not result in any side-effect for asynchronous page-flips.

Does it actually matter though? For async page-flips, I don't think this
would result in any actual difference in behavior?
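The asymmetry under discussion can be made concrete with a toy model (illustrative only — real KMS side effects are more than a counter): a synchronous commit that sets FB_ID always generates a flip event, even when the value is unchanged, while the proposed async path skips an unchanged FB_ID as a no-op.

```c
#include <assert.h>
#include <stdint.h>

struct crtc_state {
	uint32_t fb_id;
	int flips;	/* how many page-flip events were generated */
};

/* Sync commit semantics: setting FB_ID always flips, even to the
 * same value (side effects are observed regardless). */
static void commit_sync(struct crtc_state *s, uint32_t fb_id)
{
	s->fb_id = fb_id;
	s->flips++;
}

/* Proposed async semantics: an unchanged FB_ID is treated as a
 * no-op and skipped, so no flip event is generated. */
static void commit_async(struct crtc_state *s, uint32_t fb_id)
{
	if (fb_id == s->fb_id)
		return;
	s->fb_id = fb_id;
	s->flips++;
}
```

Under this reading, the VRR use case (re-flipping the same fb to trigger early scanout for low-frame-rate compensation) only works through the sync-style path, which is exactly the wrinkle Ville and Michel are pointing at.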