[PATCH] drm/amd/display: Fix a switch statement in populate_dml_output_cfg_from_stream_state()

2024-01-15 Thread Christophe JAILLET
The statement that sets 'dml_edp' is likely misplaced: it sits after a
"break;" and just before the "case SIGNAL_TYPE_EDP" label, so it is never
executed. Move it into the correct "case SIGNAL_TYPE_EDP".

Fixes: 7966f319c66d ("drm/amd/display: Introduce DML2")
Signed-off-by: Christophe JAILLET 
---
 drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c 
b/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
index fa6a93dd9629..64d01a9cd68c 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
@@ -626,8 +626,8 @@ static void 
populate_dml_output_cfg_from_stream_state(struct dml_output_cfg_st *
if (is_dp2p0_output_encoder(pipe))
out->OutputEncoder[location] = dml_dp2p0;
break;
-   out->OutputEncoder[location] = dml_edp;
case SIGNAL_TYPE_EDP:
+   out->OutputEncoder[location] = dml_edp;
break;
case SIGNAL_TYPE_HDMI_TYPE_A:
case SIGNAL_TYPE_DVI_SINGLE_LINK:
-- 
2.43.0



[PATCH] drm/amdgpu: Remove usage of the deprecated ida_simple_xx() API

2024-01-15 Thread Christophe JAILLET
ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

Note that the upper limit of ida_simple_get() is exclusive, but the one of
ida_alloc_range() is inclusive. So a -1 has been added when needed.
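
For reference, the old and new calls below request the same ID range; the
only difference is that ida_simple_get() takes an exclusive end while
ida_alloc_range() takes an inclusive max (shown only to illustrate the
semantics, not as part of the diff):

	/* old API: the end argument (1U << bits) is exclusive */
	pasid = ida_simple_get(&amdgpu_pasid_ida,
			       1U << (bits - 1), 1U << bits, GFP_KERNEL);

	/* new API: the max argument is inclusive, hence the -1 */
	pasid = ida_alloc_range(&amdgpu_pasid_ida,
				1U << (bits - 1), (1U << bits) - 1, GFP_KERNEL);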

Signed-off-by: Christophe JAILLET 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index ddd0891da116..3d7fcdeaf8cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -62,9 +62,8 @@ int amdgpu_pasid_alloc(unsigned int bits)
int pasid = -EINVAL;
 
for (bits = min(bits, 31U); bits > 0; bits--) {
-   pasid = ida_simple_get(&amdgpu_pasid_ida,
-  1U << (bits - 1), 1U << bits,
-  GFP_KERNEL);
+   pasid = ida_alloc_range(&amdgpu_pasid_ida, 1U << (bits - 1),
+   (1U << bits) - 1, GFP_KERNEL);
if (pasid != -ENOSPC)
break;
}
@@ -82,7 +81,7 @@ int amdgpu_pasid_alloc(unsigned int bits)
 void amdgpu_pasid_free(u32 pasid)
 {
trace_amdgpu_pasid_freed(pasid);
-   ida_simple_remove(&amdgpu_pasid_ida, pasid);
+   ida_free(&amdgpu_pasid_ida, pasid);
 }
 
 static void amdgpu_pasid_free_cb(struct dma_fence *fence,
-- 
2.43.0



[PATCH] drm/amd/display: remove kernel-doc misuses in dmub_replay.c

2024-01-15 Thread Randy Dunlap
Change non-kernel-doc comments from "/**" to common "/*" to prevent
kernel-doc warnings:

dmub_replay.c:262: warning: This comment starts with '/**', but isn't a 
kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * Set REPLAY power optimization flags and coasting vtotal.
dmub_replay.c:284: warning: This comment starts with '/**', but isn't a 
kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
 * send Replay general cmd to DMUB.

Signed-off-by: Randy Dunlap 
Cc: Harry Wentland 
Cc: Leo Li 
Cc: Rodrigo Siqueira 
Cc: amd-gfx@lists.freedesktop.org
Cc: Alex Deucher 
Cc: Christian König 
Cc: "Pan, Xinhui" 
Cc: Tom Chung 
Cc: dri-de...@lists.freedesktop.org
---
 drivers/gpu/drm/amd/display/dc/dce/dmub_replay.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/drivers/gpu/drm/amd/display/dc/dce/dmub_replay.c 
b/drivers/gpu/drm/amd/display/dc/dce/dmub_replay.c
--- a/drivers/gpu/drm/amd/display/dc/dce/dmub_replay.c
+++ b/drivers/gpu/drm/amd/display/dc/dce/dmub_replay.c
@@ -258,7 +258,7 @@ static void dmub_replay_residency(struct
*residency = 0;
 }
 
-/**
+/*
  * Set REPLAY power optimization flags and coasting vtotal.
  */
 static void dmub_replay_set_power_opt_and_coasting_vtotal(struct dmub_replay 
*dmub,
@@ -280,7 +280,7 @@ static void dmub_replay_set_power_opt_an
dc_wake_and_execute_dmub_cmd(dc, &cmd, DM_DMUB_WAIT_TYPE_WAIT);
 }
 
-/**
+/*
  * send Replay general cmd to DMUB.
  */
 static void dmub_replay_send_cmd(struct dmub_replay *dmub,


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error code.


Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well while Michel is voting for better 
ignoring them.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, 
struct amdgpu_job *job)
dma_fence_set_error(fence, -ENODATA);
spin_unlock_irqrestore(fence->lock, flags);
  
+	if (job->vm)

+   drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
while (!dma_fence_is_signaled(fence) &&
   ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 13.01.24 um 23:55 schrieb Joshua Ashton:

+Marek

On 1/13/24 21:35, André Almeida wrote:

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the 
soft

recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an 
application

which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/ 



Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on, I also wrote this patch for it, and it does 
basically the same thing as was proposed. :-)


For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs, if we go down, 
it's likely Steam/CEF died with us anyway atm, can solve that battle 
some other day) and more about the applications run under it.


Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is 
not prevented.


And that's exactly what I see even in a simple test app doing a fault 
-> hang every frame.


Right now, given that soft recovery never marks a context as guilty, 
it means that every app I tested is never stopped from submitting 
garbage. That holds true for basically any app that GPU hangs, and it makes 
soft recovery totally useless in my testing without this.


Yeah, the problem is that your patch wouldn't help with that. A testing 
app can still re-create the context for each submission and so crash the 
system over and over again.


The question here is really if we should handle soft recovered errors 
as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it 
was not guilty, so any faulting/hanging application causes every 
Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupted the GPU 
execution and the application is affected it is mandatory to forward 
this error to the application.


Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)






RE: [PATCH v3] drm/amd/amdgpu: Update RLC_SPM_MC_CNT by ring wreg

2024-01-15 Thread YuanShang Mao (River)
[AMD Official Use Only - General]

Ping...

-Original Message-
From: YuanShang Mao (River) 
Sent: Saturday, January 13, 2024 2:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: YuanShang Mao (River) ; YuanShang Mao (River) 

Subject: [PATCH v3] drm/amd/amdgpu: Update RLC_SPM_MC_CNT by ring wreg

[Why]
RLC_SPM_MC_CNTL cannot be updated by MMIO
since MMIO protection is enabled during runtime in the guest machine.

[How]
Submit a wreg command on the amdgpu ring to update RLC_SPM_MC_CNTL.

Signed-off-by: YuanShang 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.h |  2 +-  
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c  |  2 +-  
drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  |  2 +-  
drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c  | 12 +---
 drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c   |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c   |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_4_3.c |  4 ++--
 8 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.h
index b591d33af264..5a17e0ff2ab8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_rlc.h
@@ -169,7 +169,7 @@ struct amdgpu_rlc_funcs {
void (*stop)(struct amdgpu_device *adev);
void (*reset)(struct amdgpu_device *adev);
void (*start)(struct amdgpu_device *adev);
-   void (*update_spm_vmid)(struct amdgpu_device *adev, unsigned vmid);
+   void (*update_spm_vmid)(struct amdgpu_device *adev, struct amdgpu_ring
+*ring, unsigned vmid);
bool (*is_rlcg_access_range)(struct amdgpu_device *adev, uint32_t reg); 
 };

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 7da71b6a9dc6..13b2c82e5f48 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -650,7 +650,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
amdgpu_gmc_emit_pasid_mapping(ring, job->vmid, job->pasid);

if (spm_update_needed && adev->gfx.rlc.funcs->update_spm_vmid)
-   adev->gfx.rlc.funcs->update_spm_vmid(adev, job->vmid);
+   adev->gfx.rlc.funcs->update_spm_vmid(adev, ring, job->vmid);

if (!ring->is_mes_queue && ring->funcs->emit_gds_switch &&
gds_switch_needed) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index c8a3bf01743f..830ed6fe1baf 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -7951,7 +7951,7 @@ static void gfx_v10_0_update_spm_vmid_internal(struct 
amdgpu_device *adev,
WREG32_SOC15_NO_KIQ(GC, 0, mmRLC_SPM_MC_CNTL, data);  }

-static void gfx_v10_0_update_spm_vmid(struct amdgpu_device *adev, unsigned int 
vmid)
+static void gfx_v10_0_update_spm_vmid(struct amdgpu_device *adev,
+struct amdgpu_ring *ring, unsigned int vmid)
 {
amdgpu_gfx_off_ctrl(adev, false);

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index c659ef0f47ce..42e9976c053e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -749,7 +749,7 @@ static int gfx_v11_0_rlc_init(struct amdgpu_device *adev)

/* init spm vmid with 0xf */
if (adev->gfx.rlc.funcs->update_spm_vmid)
-   adev->gfx.rlc.funcs->update_spm_vmid(adev, 0xf);
+   adev->gfx.rlc.funcs->update_spm_vmid(adev, NULL, 0xf);

return 0;
 }
@@ -5002,7 +5002,7 @@ static int gfx_v11_0_update_gfx_clock_gating(struct 
amdgpu_device *adev,
return 0;
 }

-static void gfx_v11_0_update_spm_vmid(struct amdgpu_device *adev, unsigned 
vmid)
+static void gfx_v11_0_update_spm_vmid(struct amdgpu_device *adev,
+struct amdgpu_ring *ring, unsigned vmid)
 {
u32 data;

@@ -5013,9 +5013,15 @@ static void gfx_v11_0_update_spm_vmid(struct 
amdgpu_device *adev, unsigned vmid)
data &= ~RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK;
data |= (vmid & RLC_SPM_MC_CNTL__RLC_SPM_VMID_MASK) << 
RLC_SPM_MC_CNTL__RLC_SPM_VMID__SHIFT;

-   WREG32_SOC15_NO_KIQ(GC, 0, regRLC_SPM_MC_CNTL, data);
+   if (ring == NULL)
+   WREG32_SOC15_NO_KIQ(GC, 0, regRLC_SPM_MC_CNTL, data);

amdgpu_gfx_off_ctrl(adev, true);
+
+   if (ring) {
+   uint32_t reg = SOC15_REG_OFFSET(GC, 0, regRLC_SPM_MC_CNTL);
+   amdgpu_ring_emit_wreg(ring, reg, data);
+   }
 }

 static const struct amdgpu_rlc_funcs gfx_v11_0_rlc_funcs = {
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
index c2faf6b4c2fc..86a4865b1ae5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v7_0.c
@@ -3274,7 +3274,7 @@ static int gfx_v7_0_rlc_init(struct amdgpu_device *adev)

/* init spm vmid with 0xf */
if (adev->gfx.rlc.funcs->update_spm_vmid)
-   adev->gfx.rl

Re: [PATCH] drm/amdgpu: Remove usage of the deprecated ida_simple_xx() API

2024-01-15 Thread Christian König

Am 14.01.24 um 16:14 schrieb Christophe JAILLET:

ida_alloc() and ida_free() should be preferred to the deprecated
ida_simple_get() and ida_simple_remove().

Note that the upper limit of ida_simple_get() is exclusive, but the one of
ida_alloc_range() is inclusive. So a -1 has been added when needed.

Signed-off-by: Christophe JAILLET 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index ddd0891da116..3d7fcdeaf8cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
@@ -62,9 +62,8 @@ int amdgpu_pasid_alloc(unsigned int bits)
int pasid = -EINVAL;
  
  	for (bits = min(bits, 31U); bits > 0; bits--) {

-   pasid = ida_simple_get(&amdgpu_pasid_ida,
-  1U << (bits - 1), 1U << bits,
-  GFP_KERNEL);
+   pasid = ida_alloc_range(&amdgpu_pasid_ida, 1U << (bits - 1),
+   (1U << bits) - 1, GFP_KERNEL);
if (pasid != -ENOSPC)
break;
}
@@ -82,7 +81,7 @@ int amdgpu_pasid_alloc(unsigned int bits)
  void amdgpu_pasid_free(u32 pasid)
  {
trace_amdgpu_pasid_freed(pasid);
-   ida_simple_remove(&amdgpu_pasid_ida, pasid);
+   ida_free(&amdgpu_pasid_ida, pasid);
  }
  
  static void amdgpu_pasid_free_cb(struct dma_fence *fence,




Re: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

2024-01-15 Thread Christian König

Am 15.01.24 um 11:17 schrieb Thomas Perrot:

Hello Christian,

On Fri, 2024-01-12 at 09:17 +0100, Christian König wrote:

Well the driver load is interrupted for some reason.

Have you set any timeout for modprobe?


We don't set a modprobe timeout.


Well you somehow abort probing the driver.

This seems to be an external event and not something the driver can 
influence.


Regards,
Christian.



Kind regards,
Thomas


Regards,
Christian.

Am 12.01.24 um 09:11 schrieb Thomas Perrot:

Hello,

We are updating the kernel from the 6.1 to the 6.6 and we observe
an
amdgpu’s regression with Radeon RX580 8GB and SiFive Unmatched:
“workqueue: Failed to create a rescuer kthread for wq 'amdgpu-
reset-
dev': -EINTR
[drm:amdgpu_reset_create_reset_domain [amdgpu]] *ERROR* Failed to
allocate wq for amdgpu_reset_domain!
amdgpu :07:00.0: amdgpu: Fatal error during GPU init
amdgpu :07:00.0: amdgpu: amdgpu: finishing device.
amdgpu: probe of :07:00.0 failed with error -12”

We tried to figure it out without success for the moment, do you
have
some advice to identify the root cause and to fix it?

Kind regards,
Thomas Perrot





Re: [PATCH 2/2] drm/amdgpu: Process fences on IH overflow

2024-01-15 Thread Christian König

Am 14.01.24 um 14:00 schrieb Friedrich Vock:

If the IH ring buffer overflows, it's possible that fence signal events
were lost. Check each ring for progress to prevent job timeouts/GPU
hangs due to the fences staying unsignaled despite the work being done.


That's completely unnecessary and in some cases even harmful.

We already have a timeout handler for that, and overflows point to a severe 
system problem, so they should never occur in a production system.


Regards,
Christian.



Cc: Joshua Ashton 
Cc: Alex Deucher 
Cc: sta...@vger.kernel.org

Signed-off-by: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index f3b0aaf3ebc6..2a246db1d3a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -209,6 +209,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
  {
unsigned int count;
u32 wptr;
+   int i;

if (!ih->enabled || adev->shutdown)
return IRQ_NONE;
@@ -227,6 +228,20 @@ int amdgpu_ih_process(struct amdgpu_device *adev, struct 
amdgpu_ih_ring *ih)
ih->rptr &= ih->ptr_mask;
}

+   /* If the ring buffer overflowed, we might have lost some fence
+* signal interrupts. Check if there was any activity so the signal
+* doesn't get lost.
+*/
+   if (ih->overflow) {
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->fence_drv.initialized)
+   continue;
+   amdgpu_fence_process(ring);
+   }
+   }
+
amdgpu_ih_set_rptr(adev, ih);
wake_up_all(&ih->wait_process);

--
2.43.0





Re: [PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

2024-01-15 Thread Friedrich Vock

Adding the original Ccs from the thread since they seemed to be missing
in the reply.

On 15.01.24 11:55, Christian König wrote:

Am 14.01.24 um 14:00 schrieb Friedrich Vock:

Allows us to detect subsequent IH ring buffer overflows as well.


Well, the suggested handling here is certainly broken, see below.



Cc: Joshua Ashton 
Cc: Alex Deucher 
Cc: sta...@vger.kernel.org

Signed-off-by: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  |  2 ++
  drivers/gpu/drm/amd/amdgpu/cik_ih.c | 13 +
  drivers/gpu/drm/amd/amdgpu/cz_ih.c  | 14 +-
  drivers/gpu/drm/amd/amdgpu/iceland_ih.c | 14 +-
  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c    | 13 +
  drivers/gpu/drm/amd/amdgpu/ih_v6_1.c    | 13 +
  drivers/gpu/drm/amd/amdgpu/navi10_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/si_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/tonga_ih.c   | 13 +
  drivers/gpu/drm/amd/amdgpu/vega10_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 12 
  11 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
index 508f02eb0cf8..6041ec727f06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -69,6 +69,8 @@ struct amdgpu_ih_ring {
  unsigned    rptr;
  struct amdgpu_ih_regs    ih_regs;

+    bool overflow;
+
  /* For waiting on IH processing at checkpoint. */
  wait_queue_head_t wait_process;
  uint64_t    processed_timestamp;
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
index 6f7c031dd197..807cc30c9e33 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
@@ -204,6 +204,7 @@ static u32 cik_ih_get_wptr(struct amdgpu_device
*adev,
  tmp = RREG32(mmIH_RB_CNTL);
  tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
  WREG32(mmIH_RB_CNTL, tmp);
+    ih->overflow = true;
  }
  return (wptr & ih->ptr_mask);
  }
@@ -274,7 +275,19 @@ static void cik_ih_decode_iv(struct
amdgpu_device *adev,
  static void cik_ih_set_rptr(struct amdgpu_device *adev,
  struct amdgpu_ih_ring *ih)
  {
+    u32 tmp;
+
  WREG32(mmIH_RB_RPTR, ih->rptr);
+
+    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
bit),
+ * reset it here to detect more overflows if they occur.
+ */
+    if (ih->overflow) {
+    tmp = RREG32(mmIH_RB_CNTL);
+    tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
+    WREG32(mmIH_RB_CNTL, tmp);
+    ih->overflow = false;
+    }


Well that is an extremely bad idea. We already reset the overflow
after reading the WPTR.


This is not resetting the overflow bit. This is resetting a "clear
overflow" bit. I don't have the hardware docs, but the name (and my
observations) strongly suggest that setting this bit actually prevents
the hardware from setting the overflow bit ever again.

Right now, IH overflows, even if they occur repeatedly, only get
registered once. If not registering IH overflows can trivially lead to
system crashes, it's amdgpu's current handling that is broken.

The possibility of a repeated IH overflow in between reading the wptr
and updating the rptr is a good point, but how can we detect that at
all? It seems to me like we can't set the OVERFLOW_CLEAR bit at all
then, because we're guaranteed to miss any overflows that happen while
the bit is set.

Regards,
Friedrich



When you clear the overflow again when updating the RPTR you could
lose another overflow which might have happened in between and so
potentially process corrupted IVs.

That can trivially crash the system.

Regards,
Christian.


  }

  static int cik_ih_early_init(void *handle)
diff --git a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
index b8c47e0cf37a..076559668573 100644
--- a/drivers/gpu/drm/amd/amdgpu/cz_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cz_ih.c
@@ -215,7 +215,7 @@ static u32 cz_ih_get_wptr(struct amdgpu_device
*adev,
  tmp = RREG32(mmIH_RB_CNTL);
  tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 1);
  WREG32(mmIH_RB_CNTL, tmp);
-
+    ih->overflow = true;

  out:
  return (wptr & ih->ptr_mask);
@@ -266,7 +266,19 @@ static void cz_ih_decode_iv(struct amdgpu_device
*adev,
  static void cz_ih_set_rptr(struct amdgpu_device *adev,
 struct amdgpu_ih_ring *ih)
  {
+    u32 tmp;
+
  WREG32(mmIH_RB_RPTR, ih->rptr);
+
+    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
bit),
+ * reset it here to detect more overflows if they occur.
+ */
+    if (ih->overflow) {
+    tmp = RREG32(mmIH_RB_CNTL);
+    tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_CLEAR, 0);
+    WREG32(mmIH_RB_CNTL, tmp);
+    ih->overflow = false;
+    }
  }

  static int cz_ih_early_init(void *handle)
diff --git a/drivers/gpu/dr

Re: [PATCH 2/2] drm/amdgpu: Process fences on IH overflow

2024-01-15 Thread Friedrich Vock

On 15.01.24 11:26, Christian König wrote:

Am 14.01.24 um 14:00 schrieb Friedrich Vock:

If the IH ring buffer overflows, it's possible that fence signal events
were lost. Check each ring for progress to prevent job timeouts/GPU
hangs due to the fences staying unsignaled despite the work being done.


That's completely unnecessary and in some cases even harmful.

How is it harmful? The only effect it can have is to prevent unnecessary
GPU hangs, no? It's not like it hides any legitimate errors that you'd
otherwise see.


We already have a timeout handler for that, and overflows point to a
severe system problem, so they should never occur in a production system.


IH ring buffer overflows are pretty reliably reproducible if you trigger
a lot of page faults, at least on Deck. Why shouldn't enough page faults
in quick succession be able to overflow the IH ring buffer?

The fence fallback timer as it is now is useless for this because it
only gets triggered once after 0.5s. I guess an alternative approach
would be to make a timer trigger for each work item in flight every
0.5s, but why should that be better than just handling overflow errors
as they occur?

Regards,
Friedrich



Regards,
Christian.



Cc: Joshua Ashton 
Cc: Alex Deucher 
Cc: sta...@vger.kernel.org

Signed-off-by: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index f3b0aaf3ebc6..2a246db1d3a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -209,6 +209,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev,
struct amdgpu_ih_ring *ih)
  {
  unsigned int count;
  u32 wptr;
+    int i;

  if (!ih->enabled || adev->shutdown)
  return IRQ_NONE;
@@ -227,6 +228,20 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)
  ih->rptr &= ih->ptr_mask;
  }

+    /* If the ring buffer overflowed, we might have lost some fence
+ * signal interrupts. Check if there was any activity so the signal
+ * doesn't get lost.
+ */
+    if (ih->overflow) {
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    if (!ring || !ring->fence_drv.initialized)
+    continue;
+    amdgpu_fence_process(ring);
+    }
+    }
+
  amdgpu_ih_set_rptr(adev, ih);
  wake_up_all(&ih->wait_process);

--
2.43.0





Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error code.


The fence error code does not result in ctx's being marked as guilty, 
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the fence 
error ourselves here?





Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well while Michel is voting for better 
ignoring them.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an application
which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring 
*ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)







Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 09:47, Christian König wrote:

Am 13.01.24 um 23:55 schrieb Joshua Ashton:

+Marek

On 1/13/24 21:35, André Almeida wrote:

Hi Joshua,

Em 13/01/2024 11:02, Joshua Ashton escreveu:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.

Without this feedback, the application may keep pushing through the 
soft

recoveries, continually hanging the system with jobs that timeout.

There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an 
application

which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.



I sent a similar patch in the past, maybe you find the discussion 
interesting:


https://lore.kernel.org/lkml/20230424014324.218531-1-andrealm...@igalia.com/


Thanks, I had a peruse through that old thread.

Marek definitely had the right idea here, given he mentions:
"That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck)"

Given that is what I work on, I also wrote this patch for it, and it does 
basically the same thing as was proposed. :-)


For context though, I am less interested in Gamescope (the Steam Deck 
compositor) hanging (we don't have code that hangs, if we go down, 
it's likely Steam/CEF died with us anyway atm, can solve that battle 
some other day) and more about the applications run under it.


Marek is very right when he says applications that fault/hang will 
submit one IB after another that also fault/hang -- especially if they 
write to descriptors from the GPU (descriptor buffers), or use draw 
indirect or anything bindless or...
That's basically functionally equivalent to DOSing a system if it is 
not prevented.


And that's exactly what I see even in a simple test app doing a fault 
-> hang every frame.


Right now, given that soft recovery never marks a context as guilty, 
it means that every app I tested is never stopped from submitting 
garbage. That holds true for basically any app that GPU hangs, and it makes 
soft recovery totally useless in my testing without this.


Yeah, the problem is that your patch wouldn't help with that. A testing 
app can still re-create the context for each submission and so crash the 
system over and over again.


It is still definitely possible for an application to re-create its 
context and hang yet again -- however that is not the problem I am 
trying to solve here.


The problem I am trying to solve is that applications do not even get 
marked guilty when triggering soft recovery right now.


The patch does help with that on SteamOS, as the type of applications we 
deal with that hang just abort on VK_ERROR_DEVICE_LOST.


If a UI toolkit that handles DEVICE_LOST keeps doing this, then yes it 
would not help, but this patch is also a necessary step towards fixing 
that someday. (Eg. some policy where processes are killed for hanging 
too many times, etc)




The question here is really if we should handle soft recovered errors 
as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from the 
query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it 
was not guilty, so any faulting/hanging application causes every 
Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupted the GPU 
execution and the application is affected it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the specific 
VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are completely 
unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no 
impact to their work.


Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with 
HangApp, FWIU the way that the clear-out works being vmid specific means 
that they would be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.

[PATCH] drm/amdgpu: fix sdma ecc irq unbalanced issue

2024-01-15 Thread Yang Wang
fix sdma ecc irq unbalanced issue when doing mode2 reset.

Fixes: 90b87f67124a ("drm/amdgpu: add sdma v4.4.2 ACA support")

Signed-off-by: Yang Wang 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 4bb055eacad5..fec5a3d1c4bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -2267,21 +2267,12 @@ static int sdma_v4_4_2_ras_late_init(struct 
amdgpu_device *adev, struct ras_comm
 {
int r;
 
-   r = amdgpu_ras_block_late_init(adev, ras_block);
+   r = amdgpu_sdma_ras_late_init(adev, ras_block);
if (r)
return r;
 
-   r = amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__SDMA,
-   &sdma_v4_4_2_aca_info, NULL);
-   if (r)
-   goto late_fini;
-
-   return 0;
-
-late_fini:
-   amdgpu_ras_block_late_fini(adev, ras_block);
-
-   return r;
+   return amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__SDMA,
+  &sdma_v4_4_2_aca_info, NULL);
 }
 
 static struct amdgpu_sdma_ras sdma_v4_4_2_ras = {
-- 
2.34.1



RE: [PATCH] drm/amdgpu: fix sdma ecc irq unbalanced issue

2024-01-15 Thread Kamal, Asad
[AMD Official Use Only - General]

Reviewed-by: Asad Kamal 

Thanks & Regards
Asad

-Original Message-
From: amd-gfx  On Behalf Of Yang Wang
Sent: Monday, January 15, 2024 5:33 PM
To: amd-gfx@lists.freedesktop.org
Cc: Wang, Yang(Kevin) ; Zhang, Hawking 

Subject: [PATCH] drm/amdgpu: fix sdma ecc irq unbalanced issue

fix sdma ecc irq unbalanced issue when doing mode2 reset.

Fixes: 90b87f67124a ("drm/amdgpu: add sdma v4.4.2 ACA support")

Signed-off-by: Yang Wang 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
index 4bb055eacad5..fec5a3d1c4bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
@@ -2267,21 +2267,12 @@ static int sdma_v4_4_2_ras_late_init(struct 
amdgpu_device *adev, struct ras_comm  {
int r;

-   r = amdgpu_ras_block_late_init(adev, ras_block);
+   r = amdgpu_sdma_ras_late_init(adev, ras_block);
if (r)
return r;

-   r = amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__SDMA,
-   &sdma_v4_4_2_aca_info, NULL);
-   if (r)
-   goto late_fini;
-
-   return 0;
-
-late_fini:
-   amdgpu_ras_block_late_fini(adev, ras_block);
-
-   return r;
+   return amdgpu_ras_bind_aca(adev, AMDGPU_RAS_BLOCK__SDMA,
+  &sdma_v4_4_2_aca_info, NULL);
 }

 static struct amdgpu_sdma_ras sdma_v4_4_2_ras = {
--
2.34.1



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error 
code.


The fence error code does not result in ctx's being marked as guilty, 
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the 
fence error ourselves here?


No, I'm proposing to completely abandon the concept of guilty contexts. 
Basically what we should do is to return an error from the CS IOCTL 
whenever a previous submission resulted in a fatal error as suggested by 
Marek.


That we query the context for guilty was just a design decision we 
copied over from our closed source drivers which turned out to 
absolutely not solve anything.


Marek can probably comment as well why the whole idea of querying the 
kernel if something fatal happens instead of just rejecting submissions 
is broken by design.




Without this feedback, the application may keep pushing through the 
soft

recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well while Michel is voting for better 
ignoring them.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft 
recovery should be fatal while Michel is saying that soft recovery being 
non fatal improves stability for him :)


Should we somehow make that configurable, or make it depend on whether it's 
the display server or a user application?


Regards,
Christian.



The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an 
application

which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)









Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 12:54 schrieb Joshua Ashton:

[SNIP]


The question here is really if we should handle soft recovered 
errors as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from 
the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost 
and it was not guilty, so any faulting/hanging application causes 
every Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupted the GPU 
execution and the application is affected it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the 
specific VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are 
completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as 
there was no impact to their work.


Ok, that is something I totally agree on. But why would the Gamescope 
and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery 
in the first place?


IIRC a soft recovery doesn't increment the reset counter in any way. So 
they should never be affected.


Regards,
Christian.



Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
with HangApp, FWIU the way that the clear-out works being vmid 
specific means that they would be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)










Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 13:19, Christian König wrote:

Am 15.01.24 um 12:54 schrieb Joshua Ashton:

[SNIP]


The question here is really if we should handle soft recovered 
errors as fatal or not. Marek is in favor of that, Michel is against it.


Figure out what you want in userspace and I'm happy to implement it :)



(That being said, without my patches, RADV treats *any* reset from 
the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost 
and it was not guilty, so any faulting/hanging application causes 
every Vulkan app to die e_e. This is fixed in 
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )


That is actually intended behavior. When something disrupted the GPU 
execution and the application is affected it is mandatory to forward 
this error to the application.


No. If said context was entirely unaffected, then it should be 
completely transparent to the application.


Consider the following:

 - I have Counter-Strike 2 running
 - I have Gamescope running

I then go ahead and start HangApp that hangs the GPU.

Soft recovery happens and that clears out all the work for the 
specific VMID for HangApp's submissions and signals the submission fence.


In this instance, the Gamescope and Counter-Strike 2 ctxs are 
completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as 
there was no impact to their work.


Ok, that is something I totally agree on. But why would the Gamescope 
and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery 
in the first place?


IIRC a soft recovery doesn't increment the reset counter in any way. So 
they should never be affected.


It does, and without assigning any guilt, amdgpu_ring_soft_recovery 
calls atomic_inc(&ring->adev->gpu_reset_counter).





Regards,
Christian.



Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem 
with HangApp, FWIU the way that the clear-out works being vmid 
specific means that they would be unaffected, right?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨




Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)










- Joshie 🐸✨


Re: Failed to create a rescuer kthread for the amdgpu-reset-dev workqueue

2024-01-15 Thread Thomas Perrot
Hello Christian,

On Fri, 2024-01-12 at 09:17 +0100, Christian König wrote:
> Well the driver load is interrupted for some reason.
> 
> Have you set any timeout for modprobe?
> 

We don't set a modprobe timeout.

Kind regards,
Thomas

> Regards,
> Christian.
> 
> Am 12.01.24 um 09:11 schrieb Thomas Perrot:
> > Hello,
> > 
> > We are updating the kernel from the 6.1 to the 6.6 and we observe
> > an
> > amdgpu’s regression with Radeon RX580 8GB and SiFive Unmatched:
> > “workqueue: Failed to create a rescuer kthread for wq 'amdgpu-
> > reset-
> > dev': -EINTR
> > [drm:amdgpu_reset_create_reset_domain [amdgpu]] *ERROR* Failed to
> > allocate wq for amdgpu_reset_domain!
> > amdgpu :07:00.0: amdgpu: Fatal error during GPU init
> > amdgpu :07:00.0: amdgpu: amdgpu: finishing device.
> > amdgpu: probe of :07:00.0 failed with error -12”
> > 
> > We tried to figure it out without success for the moment, do you
> > have
> > some advice to identify the root cause and to fix it?
> > 
> > Kind regards,
> > Thomas Perrot
> > 
> 

-- 
Thomas Perrot, Bootlin
Embedded Linux and kernel engineering
https://bootlin.com



signature.asc
Description: This is a digitally signed message part


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 13:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:

We need to bump the karma of the drm_sched job in order for the context
that we just recovered to get correct feedback that it is guilty of
hanging.


Big NAK to that approach, the karma handling is completely deprecated.

When you want to signal execution errors please use the fence error 
code.


The fence error code does not result in ctx's being marked as guilty, 
only the karma path does.


See drm_sched_increase_karma.

Are you proposing that we instead mark contexts as guilty with the 
fence error ourselves here?


No, I'm proposing to completely abandon the concept of guilty contexts. 
Basically what we should do is to return an error from the CS IOCTL 
whenever a previous submission resulted in a fatal error as suggested by 
Marek.


Oh, I agree that is broken by design, but this is already implemented 
with the current guilt system!


The ioctls will already return -ECANCELED if you are guilty of a hang:

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c#L64

The query merely exists to give more feedback as to the situation, which 
is fine.
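
For reference, the check is roughly the following (paraphrased from memory
rather than copied verbatim from amdgpu_cs.c, so treat it as a sketch):

	/* A context that was marked guilty gets its submissions rejected
	 * up front by the CS IOCTL. */
	if (atomic_read(&p->ctx->guilty))
		return -ECANCELED;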




That we query the context for guilty was just a design decision we 
copied over from our closed source drivers which turned out to 
absolutely not solve anything.


Marek can probably comment as well why the whole idea of querying the 
kernel if something fatal happens instead of just rejecting submissions 
is broken by design.




Without this feedback, the application may keep pushing through the 
soft

recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft 
recovered errors fatal as well while Michel is voting for better 
ignoring them.


I'm not really sure what to do. If you guys think that soft recovered 
hangs should be fatal as well then we can certainly do this.


They have to be!

As Marek and I have pointed out, applications that hang or fault will 
just hang or fault again, especially when they use things like draw 
indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft 
recovery should be fatal while Michel is saying that soft recovery being 
non fatal improves stability for him :)


Should we somehow make that configurable, or make it depend on whether it's 
the display server or a user application?


I could probably get every RADV developer, and all of the Proton 
graphics team to come in and preach the same thing also. :P


If a compositor/display server is guilty of a hang, it can just listen 
for DEVICE_LOST from vkQueueSubmit and re-create its context, re-import 
the dmabufs etc (or restart itself).
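
A minimal sketch of that on the compositor side (recreate_vulkan_device() is
a hypothetical helper standing in for the compositor's own teardown/rebuild
path, not a Vulkan or Gamescope function):

	#include <vulkan/vulkan.h>

	/* hypothetical: tear down the VkDevice, re-import dmabufs,
	 * rebuild the swapchain */
	void recreate_vulkan_device(void);

	static void submit_and_handle_loss(VkQueue queue,
					   const VkSubmitInfo *submit,
					   VkFence fence)
	{
		VkResult res = vkQueueSubmit(queue, 1, submit, fence);

		if (res == VK_ERROR_DEVICE_LOST)
			recreate_vulkan_device();
	}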


FWIU, in the email chain, the thing Daenzer was saying was that Firefox 
was falling back to software rendering even when it was innocent, but 
that seems incredibly broken by design.


- Joshie 🐸✨



Regards,
Christian.



The majority of apps these days have a lot of side effects and 
persistence between frames and submissions.


- Joshie 🐸✨



Regards,
Christian.



There is an accompanying Mesa/RADV patch here
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050
to properly handle device loss state when VRAM is not lost.

With these, I was able to run Counter-Strike 2 and launch an 
application

which can fault the GPU in a variety of ways, and still have Steam +
Counter-Strike 2 + Gamescope (compositor) stay up and continue
functioning on Steam Deck.

Signed-off-by: Joshua Ashton 

Cc: Friedrich Vock 
Cc: Bas Nieuwenhuizen 
Cc: Christian König 
Cc: André Almeida 
Cc: sta...@vger.kernel.org
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c

index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct 
amdgpu_ring *ring, struct amdgpu_job *job)

  dma_fence_set_error(fence, -ENODATA);
  spin_unlock_irqrestore(fence->lock, flags);
+    if (job->vm)
+    drm_sched_increase_karma(&job->base);
  atomic_inc(&ring->adev->gpu_reset_counter);
  while (!dma_fence_is_signaled(fence) &&
 ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)









Re: [PATCH 2/7] drm/uAPI: Add "active color format" drm property as feedback for userspace

2024-01-15 Thread Sebastian Wick
On Thu, Jan 11, 2024 at 05:17:46PM +, Andri Yngvason wrote:
> mið., 10. jan. 2024 kl. 13:26 skrifaði Daniel Stone :
> > >
> > > This thing here works entirely differently, and I think we need somewhat
> > > new semantics for this:
> > >
> > > - I agree it should be read-only for userspace, so immutable sounds right.
> > >
> > > - But I also agree with Daniel Stone that this should be tied more
> > >   directly to the modeset state.
> > >
> > > So I think the better approach would be to put the output type into
> > > drm_connector_state, require that drivers compute it in their
> > > ->atomic_check code (which in the future would allow us to report it out
> > > for TEST_ONLY commits too), and so guarantee that the value is updated
> > > right after the kms ioctl returns (and not somewhen later for non-blocking
> > > commits).
> >
> > That's exactly the point. Whether userspace gets an explicit
> > notification or it has to 'know' when to read isn't much of an issue -
> > just as long as it's well defined. I think the suggestion of 'do it in
> > atomic_check and then it's guaranteed to be readable when the commit
> > completes' is a good one.
> >
> > I do still have some reservations - for instance, why do we have the
> > fallback to auto when userspace has explicitly requested a certain
> > type? - but they may have been covered previously.
> >
> 
> We discussed this further on IRC and this is summary of that discussion:
> 
> Sima proposed a new type of property that can be used to give feedback to
> userspace after atomic ioctl. The user supplies a list of output properties
> that they want to query and the kernel fills it in before returning from the
> ioctl. This would help to get some information about why things failed
> during TEST_ONLY.
> 
> Emersion raised the point that you might not know how much memory is needed
> beforehand for the returned properties, to which sima replied: blob
> property. There was some discussion about how that makes it possible to leak
> kernel memory, especially if userspace does not know about a new property
> blob. Emersion pointed out that userspace should only request properties
> that it understands and pq agreed.
> 
> Emersion asked how the user should inform the kernel that it's done with the
> blob, to which sima replied: DRM_IOCTL_MODE_DESTROYPROPBLOB. Sima also
> mentioned using some sort of weak reference garbage collection scheme for
> properties and there was some further discussion, but I'm not sure there was
> any conclusion.
> 
> I asked if it made sense to add color format capabilities to the mode info
> struct, but the conclusion was that it wouldn't really be useful because we
> need TEST_ONLY anyway to see if the color format setting is compatible with
> other settings.
> 
> I asked again if we should drop the "active color format" property as it
> seems to be more trouble than it's worth for now. pq mentioned that there
> are 2 separate use cases (in his words):
> - People playing with a settings UI would like to know what "auto" would result
>   in, but that's just nice to have
> - The other use case is the flicker-free boot into known configuration He
>   went on to point out that the main problem with "auto" is that any modeset
>   could make the driver decide differently. This means that we cannot fully
>   rely on the previously set property.
> 
> However, leaving out "active color property" did not put us in a worse
> situation than before, so the conclusion was that we should leave it out for
> now. For GUI selectors, the current TEST_ONLY should be good enough, and all
> the fancy stuff discussed previously isn't needed for now.
> 
> To summarise the summary: this means that we will drop the "active
> color format" property and rename the "preferred color format"
> property to "force color format" or just "color format" and failing to
> satisfy that constraint will fail the modeset rather than falling back
> to the "auto" behaviour.

That's a good idea.

Anything else won't work with the new color pipeline API anyways because
user space will be responsible for setting up the color pipeline API in
a way so that the monitor will receive the correctly encoded data.

> Cheers,
> Andri
> 



Re: [PATCH] drm/amdkfd: init drm_client with funcs hook

2024-01-15 Thread Felix Kuehling

On 2024-01-12 3:05, Flora Cui wrote:

otherwise drm_client_dev_unregister() would try to
kfree(&adev->kfd.client).

Signed-off-by: Flora Cui 


Thank you for finding and fixing this bug. You can add:

Fixes: 1819200166ce ("drm/amdkfd: Export DMABufs from KFD using GEM 
handles")

Reviewed-by: Felix Kuehling 



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 067690ba7bff..81af6bf2f052 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -138,6 +138,9 @@ static void amdgpu_amdkfd_reset_work(struct work_struct 
*work)
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
  }
  
+static const struct drm_client_funcs kfd_client_funcs = {

+   .unregister = drm_client_release,
+};
  void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
  {
int i;
@@ -161,7 +164,7 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
.enable_mes = adev->enable_mes,
};
  
-		ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", NULL);

+   ret = drm_client_init(&adev->ddev, &adev->kfd.client, "kfd", 
&kfd_client_funcs);
if (ret) {
dev_err(adev->dev, "Failed to init DRM client: %d\n", 
ret);
return;


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 14:17, Christian König wrote:
> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>> On 1/15/24 09:40, Christian König wrote:
>>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>>
 Without this feedback, the application may keep pushing through the soft
 recoveries, continually hanging the system with jobs that timeout.
>>>
>>> Well, that is intentional behavior. Marek is voting for making soft 
>>> recovered errors fatal as well while Michel is voting for better ignoring 
>>> them.
>>>
>>> I'm not really sure what to do. If you guys think that soft recovered hangs 
>>> should be fatal as well then we can certainly do this.

A possible compromise might be making soft resets fatal if they happen 
repeatedly (within a certain period of time?).


>> They have to be!
>>
>> As Marek and I have pointed out, applications that hang or fault will just 
>> hang or fault again, especially when they use things like draw indirect, 
>> buffer device address, descriptor buffers, etc.
> 
> Ok, well then I now have two people (Marek and you) saying that soft recovery 
> should be fatal while Michel is saying that soft recovery being non fatal 
> improves stability for him :)

That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without issues[0] 
on multiple occasions, whereas Marek's proposal at the time would have kicked 
me back to the login screen every time. > 0 vs effectively 0 chance of survival.

[0] Except for Firefox unnecessarily falling back to software rendering, which 
is a side note, not the main point.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making soft recovered 
errors fatal as well while Michel is voting for better ignoring them.

I'm not really sure what to do. If you guys think that soft recovered hangs 
should be fatal as well then we can certainly do this.


A possible compromise might be making soft resets fatal if they happen 
repeatedly (within a certain period of time?).



No, no and no. Aside from introducing issues by side effects not 
surfacing and all of the stuff I mentioned about descriptor buffers, 
bda, draw indirect and stuff just resulting in more faults and hangs...


You are proposing we throw out every promise we made to an application 
on the API contract level because it "might work". That's just wrong!


Let me put this in explicit terms: What you are proposing is in direct 
violation of the GL and Vulkan specification.


You can't just chose to break these contracts because you think it 
'might' be a better user experience.





They have to be!

As Marek and I have pointed out, applications that hang or fault will just hang 
or fault again, especially when they use things like draw indirect, buffer 
device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft recovery 
should be fatal while Michel is saying that soft recovery being non fatal 
improves stability for him :)


That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without issues[0] on 
multiple occasions, whereas Marek's proposal at the time would have kicked me back 
to the login screen every time. > 0 vs effectively 0 chance of survival.


The correct thing for GNOME/Mutter to do is to simply re-create its 
context, reimport its DMABUFs, etc.


The fact that it survives and keeps soldiering on with whatever side 
effects it missed is purely coincidental and not valid API usage.
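
For reference, the plumbing for that recovery already exists via GL
robustness. A minimal sketch of what such a path could look like, assuming
a context created with robust access and a hypothetical
recreate_context_and_reimport_buffers() helper standing in for the
compositor's own teardown/reimport code:

#include <stdbool.h>
#include <epoxy/gl.h>

/* Hypothetical compositor helper: destroy the GL/EGL context, create a new
 * one and re-import all dmabuf-backed textures into it. */
void recreate_context_and_reimport_buffers(void);

static bool check_and_recover_from_gpu_reset(void)
{
	/* From GL_KHR_robustness / GL 4.5: reports whether this context was
	 * involved in a reset (guilty, innocent or unknown). */
	GLenum status = glGetGraphicsResetStatus();

	if (status == GL_NO_ERROR)
		return false;

	/* The context is lost either way; every object in it is gone and
	 * must be recreated, including re-imported DMABUFs. */
	recreate_context_and_reimport_buffers();
	return true;
}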


If you want such behaviour for hangs for Mutter, you should propose a 
GL/VK extension for it, but I really doubt that will get anywhere.


- Joshie 🐸✨



[0] Except for Firefox unnecessarily falling back to software rendering, which 
is a side note, not the main point.






[PATCH v2 1/4] drm/amd/display: Remove unnecessary SIGNAL_TYPE_HDMI_TYPE_A check

2024-01-15 Thread Andri Yngvason
From: Werner Sembach 

Remove unnecessary SIGNAL_TYPE_HDMI_TYPE_A check that was performed in the
drm_mode_is_420_only() case, but not in the drm_mode_is_420_also() &&
force_yuv420_output case.

Without further knowledge of whether YCbCr 4:2:0 is supported outside of
HDMI, there is no reason to use RGB when the display reports
drm_mode_is_420_only(), even on a non-HDMI connection.

This patch also moves both checks into the same if-case, which eliminates
an extra else-if-case.

Signed-off-by: Werner Sembach 
Signed-off-by: Andri Yngvason 
Tested-by: Andri Yngvason 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index f6575d7dee971..cc4d1f7f97b98 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5575,11 +5575,7 @@ static void fill_stream_properties_from_drm_display_mode(
timing_out->v_border_bottom = 0;
/* TODO: un-hardcode */
if (drm_mode_is_420_only(info, mode_in)
-   && stream->signal == SIGNAL_TYPE_HDMI_TYPE_A)
-   timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR420;
-   else if (drm_mode_is_420_also(info, mode_in)
-   && aconnector
-   && aconnector->force_yuv420_output)
+   || (drm_mode_is_420_also(info, mode_in) && 
aconnector->force_yuv420_output))
timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR420;
else if ((connector->display_info.color_formats & 
DRM_COLOR_FORMAT_YCBCR444)
&& stream->signal == SIGNAL_TYPE_HDMI_TYPE_A)
-- 
2.43.0



[PATCH v2 3/4] drm/amd/display: Add handling for new "force color format" property

2024-01-15 Thread Andri Yngvason
From: Werner Sembach 

This commit implements the "force color format" drm property for the
AMD GPU driver.

Signed-off-by: Werner Sembach 
Co-Developed-by: Andri Yngvason 
Signed-off-by: Andri Yngvason 
Tested-by: Andri Yngvason 
---

Changes in v2:
 - Renamed to "force color format" from "preferred color format"
 - Modeset will fail if color format cannot be satisfied

---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 63 ---
 .../display/amdgpu_dm/amdgpu_dm_mst_types.c   |  4 ++
 2 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index cc4d1f7f97b98..26c4260c78d7b 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -5573,15 +5573,32 @@ static void 
fill_stream_properties_from_drm_display_mode(
timing_out->h_border_right = 0;
timing_out->v_border_top = 0;
timing_out->v_border_bottom = 0;
-   /* TODO: un-hardcode */
-   if (drm_mode_is_420_only(info, mode_in)
-   || (drm_mode_is_420_also(info, mode_in) && 
aconnector->force_yuv420_output))
+
+   if (connector_state
+   && (connector_state->force_color_format == 
DRM_COLOR_FORMAT_YCBCR420
+   || aconnector->force_yuv420_output) && 
drm_mode_is_420(info, mode_in))
timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR420;
-   else if ((connector->display_info.color_formats & 
DRM_COLOR_FORMAT_YCBCR444)
-   && stream->signal == SIGNAL_TYPE_HDMI_TYPE_A)
+   else if (connector_state
+   && connector_state->force_color_format == 
DRM_COLOR_FORMAT_YCBCR444
+   && connector->display_info.color_formats & 
DRM_COLOR_FORMAT_YCBCR444)
timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR444;
-   else
+   else if (connector_state
+   && connector_state->force_color_format == 
DRM_COLOR_FORMAT_RGB444
+   && !drm_mode_is_420_only(info, mode_in))
timing_out->pixel_encoding = PIXEL_ENCODING_RGB;
+   else
+   /*
+* connector_state->force_color_format not possible
+* || connector_state->force_color_format == 0 (auto)
+* || connector_state->force_color_format == 
DRM_COLOR_FORMAT_YCBCR422
+*/
+   if (drm_mode_is_420_only(info, mode_in))
+   timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR420;
+   else if ((connector->display_info.color_formats & 
DRM_COLOR_FORMAT_YCBCR444)
+   && stream->signal == SIGNAL_TYPE_HDMI_TYPE_A)
+   timing_out->pixel_encoding = PIXEL_ENCODING_YCBCR444;
+   else
+   timing_out->pixel_encoding = PIXEL_ENCODING_RGB;
 
timing_out->timing_3d_format = TIMING_3D_FORMAT_NONE;
timing_out->display_color_depth = convert_color_depth_from_display_info(
@@ -6685,6 +6702,33 @@ static enum dc_status 
dm_validate_stream_and_context(struct dc *dc,
return dc_result;
 }
 
+static enum dc_status
+dm_validate_stream_color_format(const struct drm_connector_state *drm_state,
+   const struct dc_stream_state *stream)
+{
+   if (!drm_state->force_color_format)
+   return DC_OK;
+
+   enum dc_pixel_encoding encoding = PIXEL_ENCODING_UNDEFINED;
+   switch (drm_state->force_color_format) {
+   case DRM_COLOR_FORMAT_RGB444:
+   encoding = PIXEL_ENCODING_RGB;
+   break;
+   case DRM_COLOR_FORMAT_YCBCR444:
+   encoding = PIXEL_ENCODING_YCBCR444;
+   break;
+   case DRM_COLOR_FORMAT_YCBCR422:
+   encoding = PIXEL_ENCODING_YCBCR422;
+   break;
+   case DRM_COLOR_FORMAT_YCBCR420:
+   encoding = PIXEL_ENCODING_YCBCR420;
+   break;
+   }
+
+   return encoding == stream->timing.pixel_encoding ?
+  DC_OK : DC_UNSUPPORTED_VALUE;
+}
+
 struct dc_stream_state *
 create_validate_stream_for_sink(struct amdgpu_dm_connector *aconnector,
const struct drm_display_mode *drm_mode,
@@ -6717,6 +6761,9 @@ create_validate_stream_for_sink(struct 
amdgpu_dm_connector *aconnector,
if (dc_result == DC_OK)
dc_result = dm_validate_stream_and_context(adev->dm.dc, 
stream);
 
+   if (dc_result == DC_OK)
+   dc_result = dm_validate_stream_color_format(drm_state, 
stream);
+
if (dc_result != DC_OK) {
DRM_DEBUG_KMS("Mode %dx%d (clk %d) failed DC validation 
with error %d (%s)\n",
  drm_mode->hdisplay,
@@ -7512,8 +7559,10 @@ void amdgpu_dm_connector_init_helper(struct 
amdgpu_display_m

[PATCH v2 2/4] drm/uAPI: Add "force color format" drm property as setting for userspace

2024-01-15 Thread Andri Yngvason
From: Werner Sembach 

Add a new general drm property "force color format" which can be used
by userspace to tell the graphics driver which color format to use.

Possible options are:
- auto (default/current behaviour)
- rgb
- ycbcr444
- ycbcr422 (supported by neither amdgpu nor i915)
- ycbcr420

In theory the auto option should choose the best available option for the
current setup, but because some monitors have poor internal conversion,
some look better with rgb and some with ycbcr444.

Also, because of badly shielded connectors and/or cables, it might be
preferable to use the less bandwidth-heavy ycbcr422 and ycbcr420 formats
for a signal that is less susceptible to interference.

In the future, automatic color calibration for screens might also depend on
this option being available.
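
For illustration, a minimal sketch of how userspace could request one of
these formats through the atomic API. The property-ID lookup helper is
hypothetical, and a real client would carry the rest of its CRTC/plane
state in the same request:

#include <xf86drm.h>
#include <xf86drmMode.h>

/* Hypothetical helper: resolve a connector property ID by name via
 * drmModeObjectGetProperties()/drmModeGetProperty(). */
uint32_t lookup_connector_prop_id(int fd, uint32_t connector_id, const char *name);

static int set_force_color_format(int fd, uint32_t connector_id, uint64_t enum_value)
{
	uint32_t prop_id = lookup_connector_prop_id(fd, connector_id,
						    "force color format");
	drmModeAtomicReq *req = drmModeAtomicAlloc();
	int ret;

	/* enum_value is one of the values exposed in the property's enum list
	 * ("auto", "rgb", "ycbcr444", "ycbcr422", "ycbcr420"). */
	drmModeAtomicAddProperty(req, connector_id, prop_id, enum_value);

	/* With the v2 semantics the commit fails instead of silently falling
	 * back to "auto" when the requested format cannot be satisfied. */
	ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_ALLOW_MODESET, NULL);

	drmModeAtomicFree(req);
	return ret;
}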

Signed-off-by: Werner Sembach 
Signed-off-by: Andri Yngvason 
Tested-by: Andri Yngvason 
---

Changes in v2:
 - Renamed to "force color format" from "preferred color format"
 - Removed Reported-by pointing to invalid email address

---
 drivers/gpu/drm/drm_atomic_helper.c |  4 +++
 drivers/gpu/drm/drm_atomic_uapi.c   |  4 +++
 drivers/gpu/drm/drm_connector.c | 48 +
 include/drm/drm_connector.h | 16 ++
 4 files changed, 72 insertions(+)

diff --git a/drivers/gpu/drm/drm_atomic_helper.c 
b/drivers/gpu/drm/drm_atomic_helper.c
index 39ef0a6addeba..1dabd164c4f09 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -707,6 +707,10 @@ drm_atomic_helper_check_modeset(struct drm_device *dev,
if (old_connector_state->max_requested_bpc !=
new_connector_state->max_requested_bpc)
new_crtc_state->connectors_changed = true;
+
+   if (old_connector_state->force_color_format !=
+   new_connector_state->force_color_format)
+   new_crtc_state->connectors_changed = true;
}
 
if (funcs->atomic_check)
diff --git a/drivers/gpu/drm/drm_atomic_uapi.c 
b/drivers/gpu/drm/drm_atomic_uapi.c
index 29d4940188d49..e45949bf4615f 100644
--- a/drivers/gpu/drm/drm_atomic_uapi.c
+++ b/drivers/gpu/drm/drm_atomic_uapi.c
@@ -776,6 +776,8 @@ static int drm_atomic_connector_set_property(struct 
drm_connector *connector,
state->max_requested_bpc = val;
} else if (property == connector->privacy_screen_sw_state_property) {
state->privacy_screen_sw_state = val;
+   } else if (property == connector->force_color_format_property) {
+   state->force_color_format = val;
} else if (connector->funcs->atomic_set_property) {
return connector->funcs->atomic_set_property(connector,
state, property, val);
@@ -859,6 +861,8 @@ drm_atomic_connector_get_property(struct drm_connector 
*connector,
*val = state->max_requested_bpc;
} else if (property == connector->privacy_screen_sw_state_property) {
*val = state->privacy_screen_sw_state;
+   } else if (property == connector->force_color_format_property) {
+   *val = state->force_color_format;
} else if (connector->funcs->atomic_get_property) {
return connector->funcs->atomic_get_property(connector,
state, property, val);
diff --git a/drivers/gpu/drm/drm_connector.c b/drivers/gpu/drm/drm_connector.c
index b0516505f7ae9..e0535e58b4535 100644
--- a/drivers/gpu/drm/drm_connector.c
+++ b/drivers/gpu/drm/drm_connector.c
@@ -1061,6 +1061,14 @@ static const struct drm_prop_enum_list 
drm_dp_subconnector_enum_list[] = {
{ DRM_MODE_SUBCONNECTOR_Native,  "Native"}, /* DP */
 };
 
+static const struct drm_prop_enum_list drm_force_color_format_enum_list[] = {
+   { 0, "auto" },
+   { DRM_COLOR_FORMAT_RGB444, "rgb" },
+   { DRM_COLOR_FORMAT_YCBCR444, "ycbcr444" },
+   { DRM_COLOR_FORMAT_YCBCR422, "ycbcr422" },
+   { DRM_COLOR_FORMAT_YCBCR420, "ycbcr420" },
+};
+
 DRM_ENUM_NAME_FN(drm_get_dp_subconnector_name,
 drm_dp_subconnector_enum_list)
 
@@ -1396,6 +1404,15 @@ static const u32 dp_colorspaces =
  * drm_connector_attach_max_bpc_property() to create and attach the
  * property to the connector during initialization.
  *
+ * force color format:
+ * This property is used by userspace to change the used color format. When
+ * used the driver will use the selected format if valid for the hardware,
+ * sink, and current resolution and refresh rate combination. Drivers to
+ * use the function drm_connector_attach_force_color_format_property()
+ * to create and attach the property to the connector during
+ * initialization. Possible values are "auto", "rgb", "ycbcr444",
+ * "ycbcr422", and "ycbcr420".
+ *
  * Connectors also have one standardized atomic property:
  *
  

[PATCH v2 4/4] drm/i915/display: Add handling for new "force color format" property

2024-01-15 Thread Andri Yngvason
From: Werner Sembach 

This commit implements the "force color format" drm property for the
Intel GPU driver.

Signed-off-by: Werner Sembach 
Co-Developed-by: Andri Yngvason 
Signed-off-by: Andri Yngvason 
Tested-by: Andri Yngvason 
---

Changes in v2:
 - Renamed to "force color format" from "preferred color format"
 - Modeset will fail if color format cannot be satisfied

---
 drivers/gpu/drm/i915/display/intel_dp.c | 35 -
 drivers/gpu/drm/i915/display/intel_dp_mst.c |  5 +++
 drivers/gpu/drm/i915/display/intel_hdmi.c   | 29 ++---
 3 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_dp.c 
b/drivers/gpu/drm/i915/display/intel_dp.c
index 7d2b8ce48fda1..71e822483572e 100644
--- a/drivers/gpu/drm/i915/display/intel_dp.c
+++ b/drivers/gpu/drm/i915/display/intel_dp.c
@@ -2799,6 +2799,16 @@ static bool intel_dp_has_audio(struct intel_encoder 
*encoder,
return intel_conn_state->force_audio == HDMI_AUDIO_ON;
 }
 
+static u32 intel_output_format_to_drm_color_format(enum intel_output_format 
input)
+{
+   switch (input) {
+   case INTEL_OUTPUT_FORMAT_RGB: return DRM_COLOR_FORMAT_RGB444;
+   case INTEL_OUTPUT_FORMAT_YCBCR444: return DRM_COLOR_FORMAT_YCBCR444;
+   case INTEL_OUTPUT_FORMAT_YCBCR420: return DRM_COLOR_FORMAT_YCBCR420;
+   }
+   return 0;
+}
+
 static int
 intel_dp_compute_output_format(struct intel_encoder *encoder,
   struct intel_crtc_state *crtc_state,
@@ -2810,17 +2820,20 @@ intel_dp_compute_output_format(struct intel_encoder 
*encoder,
struct intel_connector *connector = intel_dp->attached_connector;
const struct drm_display_info *info = &connector->base.display_info;
const struct drm_display_mode *adjusted_mode = 
&crtc_state->hw.adjusted_mode;
-   bool ycbcr_420_only;
+   bool ycbcr_420_output;
int ret;
 
-   ycbcr_420_only = drm_mode_is_420_only(info, adjusted_mode);
+   ycbcr_420_output = drm_mode_is_420_only(info, adjusted_mode) ||
+  (conn_state->force_color_format == 
DRM_COLOR_FORMAT_YCBCR420 &&
+   drm_mode_is_420_also(&connector->base.display_info, 
adjusted_mode));
 
-   if (ycbcr_420_only && !connector->base.ycbcr_420_allowed) {
+   crtc_state->sink_format = ycbcr_420_output ? 
INTEL_OUTPUT_FORMAT_YCBCR420 :
+INTEL_OUTPUT_FORMAT_RGB;
+
+   if (ycbcr_420_output && !connector->base.ycbcr_420_allowed) {
drm_dbg_kms(&i915->drm,
"YCbCr 4:2:0 mode but YCbCr 4:2:0 output not 
possible. Falling back to RGB.\n");
crtc_state->sink_format = INTEL_OUTPUT_FORMAT_RGB;
-   } else {
-   crtc_state->sink_format = intel_dp_sink_format(connector, 
adjusted_mode);
}
 
crtc_state->output_format = intel_dp_output_format(connector, 
crtc_state->sink_format);
@@ -2840,6 +2853,11 @@ intel_dp_compute_output_format(struct intel_encoder 
*encoder,
   respect_downstream_limits);
}
 
+   if (conn_state->force_color_format && conn_state->force_color_format !=
+   intel_output_format_to_drm_color_format(crtc_state->sink_format)) {
+   ret = -EINVAL;
+   }
+
return ret;
 }
 
@@ -6179,10 +6197,13 @@ intel_dp_add_properties(struct intel_dp *intel_dp, 
struct drm_connector *connect
intel_attach_force_audio_property(connector);
 
intel_attach_broadcast_rgb_property(connector);
-   if (HAS_GMCH(dev_priv))
+   if (HAS_GMCH(dev_priv)) {
drm_connector_attach_max_bpc_property(connector, 6, 10);
-   else if (DISPLAY_VER(dev_priv) >= 5)
+   drm_connector_attach_force_color_format_property(connector);
+   } else if (DISPLAY_VER(dev_priv) >= 5) {
drm_connector_attach_max_bpc_property(connector, 6, 12);
+   drm_connector_attach_force_color_format_property(connector);
+   }
 
/* Register HDMI colorspace for case of lspcon */
if 
(intel_bios_encoder_is_lspcon(dp_to_dig_port(intel_dp)->base.devdata)) {
diff --git a/drivers/gpu/drm/i915/display/intel_dp_mst.c 
b/drivers/gpu/drm/i915/display/intel_dp_mst.c
index 8a94323350303..dcb3abcc6d83e 100644
--- a/drivers/gpu/drm/i915/display/intel_dp_mst.c
+++ b/drivers/gpu/drm/i915/display/intel_dp_mst.c
@@ -1572,6 +1572,11 @@ static struct drm_connector 
*intel_dp_add_mst_connector(struct drm_dp_mst_topolo
drm_dbg_kms(&dev_priv->drm, "[%s:%d] HDCP MST init failed, 
skipping.\n",
connector->name, connector->base.id);
 
+   connector->force_color_format_property =
+   intel_dp->attached_connector->base.force_color_format_property;
+   if (connector->force_color_format_property)
+   drm_connector_attach_force_color_format_pr

[PATCH v2 0/4] New DRM properties for output color format

2024-01-15 Thread Andri Yngvason
After some discussion, we decided to drop the "active color format"
property and rename the "preferred color format" property to "force
color format". 

The user can probe available color formats in combination with other
properties using TEST_ONLY commits.
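
A rough sketch of such a probe, assuming the connector exposes the property
and that a real client would add the rest of its desired mode/plane state
to the same test request:

#include <stdio.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

static void probe_color_formats(int fd, uint32_t connector_id, uint32_t prop_id)
{
	drmModePropertyPtr prop = drmModeGetProperty(fd, prop_id);

	for (int i = 0; prop && i < prop->count_enums; i++) {
		drmModeAtomicReq *req = drmModeAtomicAlloc();

		drmModeAtomicAddProperty(req, connector_id, prop_id,
					 prop->enums[i].value);
		/* TEST_ONLY validates the full state without applying it, so
		 * success means this format would work with the tested mode. */
		if (drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_TEST_ONLY, NULL) == 0)
			printf("supported: %s\n", prop->enums[i].name);

		drmModeAtomicFree(req);
	}
	drmModeFreeProperty(prop);
}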

v1: 
https://lore.kernel.org/dri-devel/20240109181104.1670304-1-an...@yngvason.is/

v2
 - Dropped "active color format"
 - Replaced "preferred color format" with "force color format"


Werner Sembach (4):
  drm/amd/display: Remove unnecessary SIGNAL_TYPE_HDMI_TYPE_A check
  drm/uAPI: Add "force color format" drm property as setting for
userspace
  drm/amd/display: Add handling for new "force color format" property
  drm/i915/display: Add handling for new "force color format" property

 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 67 ---
 .../display/amdgpu_dm/amdgpu_dm_mst_types.c   |  4 ++
 drivers/gpu/drm/drm_atomic_helper.c   |  4 ++
 drivers/gpu/drm/drm_atomic_uapi.c |  4 ++
 drivers/gpu/drm/drm_connector.c   | 48 +
 drivers/gpu/drm/i915/display/intel_dp.c   | 35 --
 drivers/gpu/drm/i915/display/intel_dp_mst.c   |  5 ++
 drivers/gpu/drm/i915/display/intel_hdmi.c | 29 ++--
 include/drm/drm_connector.h   | 16 +
 9 files changed, 190 insertions(+), 22 deletions(-)


base-commit: 052d534373b7ed33712a63d5e17b2b6cdbce84fd
-- 
2.43.0



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock



On 15.01.24 16:43, Joshua Ashton wrote:



On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).



No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just chose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)






They have to be!

As Marek and I have pointed out, applications that hang or fault
will just hang or fault again, especially when they use things like
draw indirect, buffer device address, descriptor buffers, etc.


Ok, well then I now have two people (Marek and you) saying that soft
recovery should be fatal while Michel is saying that soft recovery
being non fatal improves stability for him :)


That's not quite what I wrote before.

I pointed out that my GNOME session has survived a soft reset without
issues[0] on multiple occasions, whereas Marek's proposal at the time
would have kicked me back to the login screen every time. > 0 vs
effectively 0 chance of survival.


The correct thing for GNOME/Mutter to do is to simply re-create it's
context, reimport it's DMABUFs, etc.

The fact that it survives and keeps soldiering on with whatever side
effects missed is purely coincidental and not valid API usage.

If you want such behaviour for hangs for Mutter, you should propose a
GL/VK extension for it, but I really doubt that will get anywhere.

- Joshie 🐸✨



[0] Except for Firefox unnecessarily falling back to software
rendering, which is a side note, not the main point.






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 17:19, Friedrich Vock wrote:
> On 15.01.24 16:43, Joshua Ashton wrote:
>> On 1/15/24 15:25, Michel Dänzer wrote:
>>> On 2024-01-15 14:17, Christian König wrote:
 Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> On 1/15/24 09:40, Christian König wrote:
>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>
>>> Without this feedback, the application may keep pushing through
>>> the soft
>>> recoveries, continually hanging the system with jobs that timeout.
>>
>> Well, that is intentional behavior. Marek is voting for making
>> soft recovered errors fatal as well while Michel is voting for
>> better ignoring them.
>>
>> I'm not really sure what to do. If you guys think that soft
>> recovered hangs should be fatal as well then we can certainly do
>> this.
>>>
>>> A possible compromise might be making soft resets fatal if they
>>> happen repeatedly (within a certain period of time?).
>>
>> No, no and no. Aside from introducing issues by side effects not
>> surfacing and all of the stuff I mentioned about descriptor buffers,
>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>
>> You are proposing we throw out every promise we made to an application
>> on the API contract level because it "might work". That's just wrong!
>>
>> Let me put this in explicit terms: What you are proposing is in direct
>> violation of the GL and Vulkan specification.
>>
>> You can't just chose to break these contracts because you think it
>> 'might' be a better user experience.
> 
> Is the original issue that motivated soft resets to be non-fatal even an
> issue anymore?
> 
> If I read that old thread correctly, the rationale for that was that
> assigning guilt to a context was more broken than not doing it, because
> the compositor/Xwayland process would also crash despite being unrelated
> to the hang.
> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> keeping soft resets non-fatal provides any benefit to the user experience.
> The potential detriments to user experience have been outlined multiple
> times in this thread already.
> 
> (I suppose if the compositor itself faults it might still bring down a
> session, but I've literally never seen that, and it's not like a
> compositor triggering segfaults on CPU stays alive either.)

That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)

Note that I'm not saying that case can't happen. Making soft resets fatal only 
if they happen repeatedly could address both issues, rather than only one or 
the other. Seems like a win-win.
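
For what it's worth, a rough kernel-side sketch of that compromise; the
per-context fields and thresholds here are purely hypothetical and nothing
like this exists in amdgpu today:

/* Assumed per-context fields (hypothetical):
 *   unsigned long last_soft_reset;
 *   unsigned int soft_reset_count; */
#define SOFT_RESET_WINDOW	(60 * HZ)	/* assumed: 1 minute window */
#define SOFT_RESET_LIMIT	3		/* assumed: strikes before fatal */

static bool amdgpu_ctx_soft_reset_is_fatal(struct amdgpu_ctx *ctx)
{
	unsigned long now = jiffies;

	/* Forget old soft resets once the window has expired. */
	if (time_after(now, ctx->last_soft_reset + SOFT_RESET_WINDOW))
		ctx->soft_reset_count = 0;

	ctx->last_soft_reset = now;

	/* Only repeated soft resets within the window mark the ctx guilty. */
	return ++ctx->soft_reset_count >= SOFT_RESET_LIMIT;
}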


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).


No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just chose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)


That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)


Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.


The right thing to do is to not break all of our api contracts with the 
application and not egregiously violate the VK and GL specs. :-)


No-one is saying your session should go down.
It looks like Mutter already has some handling for GL robustness anyway...
So this should already just work with these patches if it is not broken?

- Joshie 🐸✨



Note that I'm not saying that case can't happen. Making soft resets fatal only 
if they happen repeatedly could address both issues, rather than only one or 
the other. Seems like a win-win.




Re: [PATCH] drm/amd/display: Avoid enum conversion warning

2024-01-15 Thread Alex Deucher
Applied.  Thanks!

On Wed, Jan 10, 2024 at 3:56 PM Nathan Chancellor  wrote:
>
> Clang warns (or errors with CONFIG_WERROR=y) when performing arithmetic
> with different enumerated types, which is usually a bug:
>
> 
> drivers/gpu/drm/amd/amdgpu/../display/dc/link/protocols/link_dp_dpia_bw.c:548:24:
>  error: arithmetic between different enumeration types ('const enum 
> dc_link_rate' and 'const enum dc_lane_count') [-Werror,-Wenum-enum-conversion]
>   548 | link_cap->link_rate * 
> link_cap->lane_count * LINK_RATE_REF_FREQ_IN_KHZ * 8;
>   | ~~~ ^ 
> 1 error generated.
>
> In this case, there is not a problem because the enumerated types are
> basically treated as '#define' values. Add an explicit cast to an
> integral type to silence the warning.
>
> Closes: https://github.com/ClangBuiltLinux/linux/issues/1976
> Fixes: 5f3bce13266e ("drm/amd/display: Request usb4 bw for mst streams")
> Signed-off-by: Nathan Chancellor 
> ---
>  drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c 
> b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> index 4ef1a6a1d129..dd0d2b206462 100644
> --- a/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> +++ b/drivers/gpu/drm/amd/display/dc/link/protocols/link_dp_dpia_bw.c
> @@ -544,8 +544,9 @@ int link_dp_dpia_get_dp_overhead_in_dp_tunneling(struct 
> dc_link *link)
>  */
> const struct dc_link_settings *link_cap =
> dc_link_get_link_cap(link);
> -   uint32_t link_bw_in_kbps =
> -   link_cap->link_rate * link_cap->lane_count * 
> LINK_RATE_REF_FREQ_IN_KHZ * 8;
> +   uint32_t link_bw_in_kbps = (uint32_t)link_cap->link_rate *
> +  (uint32_t)link_cap->lane_count *
> +  LINK_RATE_REF_FREQ_IN_KHZ * 8;
> link_mst_overhead = (link_bw_in_kbps / 64) + 
> ((link_bw_in_kbps % 64) ? 1 : 0);
> }
>
>
> ---
> base-commit: 6e7a29f011ac03a765f53844f7c3f04cdd421715
> change-id: 20240110-amdgpu-display-enum-enum-conversion-e2451adbf4a7
>
> Best regards,
> --
> Nathan Chancellor 
>


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 17:46, Joshua Ashton wrote:
> On 1/15/24 16:34, Michel Dänzer wrote:
>> On 2024-01-15 17:19, Friedrich Vock wrote:
>>> On 15.01.24 16:43, Joshua Ashton wrote:
 On 1/15/24 15:25, Michel Dänzer wrote:
> On 2024-01-15 14:17, Christian König wrote:
>> Am 15.01.24 um 12:37 schrieb Joshua Ashton:
>>> On 1/15/24 09:40, Christian König wrote:
 Am 13.01.24 um 15:02 schrieb Joshua Ashton:

> Without this feedback, the application may keep pushing through
> the soft
> recoveries, continually hanging the system with jobs that timeout.

 Well, that is intentional behavior. Marek is voting for making
 soft recovered errors fatal as well while Michel is voting for
 better ignoring them.

 I'm not really sure what to do. If you guys think that soft
 recovered hangs should be fatal as well then we can certainly do
 this.
>
> A possible compromise might be making soft resets fatal if they
> happen repeatedly (within a certain period of time?).

 No, no and no. Aside from introducing issues by side effects not
 surfacing and all of the stuff I mentioned about descriptor buffers,
 bda, draw indirect and stuff just resulting in more faults and hangs...

 You are proposing we throw out every promise we made to an application
 on the API contract level because it "might work". That's just wrong!

 Let me put this in explicit terms: What you are proposing is in direct
 violation of the GL and Vulkan specification.

 You can't just chose to break these contracts because you think it
 'might' be a better user experience.
>>>
>>> Is the original issue that motivated soft resets to be non-fatal even an
>>> issue anymore?
>>>
>>> If I read that old thread correctly, the rationale for that was that
>>> assigning guilt to a context was more broken than not doing it, because
>>> the compositor/Xwayland process would also crash despite being unrelated
>>> to the hang.
>>> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
>>> keeping soft resets non-fatal provides any benefit to the user experience.
>>> The potential detriments to user experience have been outlined multiple
>>> times in this thread already.
>>>
>>> (I suppose if the compositor itself faults it might still bring down a
>>> session, but I've literally never seen that, and it's not like a
>>> compositor triggering segfaults on CPU stays alive either.)
>>
>> That's indeed what happened for me, multiple times. And each time the 
>> session continued running fine for days after the soft reset.
>>
>> But apparently my experience isn't valid somehow, and I should have been 
>> forced to log in again to please the GL gods...
>>
>>
>> Conversely, I can't remember hitting a case where an app kept running into 
>> soft resets. It's almost as if different people may have different 
>> experiences! ;)
> 
> Your anecdote of whatever application coincidentally managing to soldier 
> through being hung is really not relevant.

But yours is, got it.


> It looks like Mutter already has some handling for GL robustness anyway...

That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock

On 15.01.24 18:09, Michel Dänzer wrote:

On 2024-01-15 17:46, Joshua Ashton wrote:

On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.

Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.

A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).

No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just chose to break these contracts because you think it
'might' be a better user experience.

Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)

That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)

Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.

But yours is, got it.

The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

I don't think it makes sense to argue against fixing broken code just
because it doesn't affect all apps (and is convenient for some of them).

This isn't anything personal. I don't think your experience is invalid
(I think I just misunderstood the original thread. Sorry if I came
across as condescending!), all I'm arguing is that this should be fixed
somewhere else than the kernel, because what the kernel does right now
makes it impossible to implement modern APIs fully correctly. If mutter
needs to be robust against faults it caused itself, it should be robust
against GPU resets.





It looks like Mutter already has some handling for GL robustness anyway...

That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Michel Dänzer
On 2024-01-15 18:26, Friedrich Vock wrote:
> On 15.01.24 18:09, Michel Dänzer wrote:
>> On 2024-01-15 17:46, Joshua Ashton wrote:
>>> On 1/15/24 16:34, Michel Dänzer wrote:
 On 2024-01-15 17:19, Friedrich Vock wrote:
> On 15.01.24 16:43, Joshua Ashton wrote:
>> On 1/15/24 15:25, Michel Dänzer wrote:
>>> On 2024-01-15 14:17, Christian König wrote:
 Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> On 1/15/24 09:40, Christian König wrote:
>> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
>>
>>> Without this feedback, the application may keep pushing through
>>> the soft
>>> recoveries, continually hanging the system with jobs that timeout.
>> Well, that is intentional behavior. Marek is voting for making
>> soft recovered errors fatal as well while Michel is voting for
>> better ignoring them.
>>
>> I'm not really sure what to do. If you guys think that soft
>> recovered hangs should be fatal as well then we can certainly do
>> this.
>>> A possible compromise might be making soft resets fatal if they
>>> happen repeatedly (within a certain period of time?).
>> No, no and no. Aside from introducing issues by side effects not
>> surfacing and all of the stuff I mentioned about descriptor buffers,
>> bda, draw indirect and stuff just resulting in more faults and hangs...
>>
>> You are proposing we throw out every promise we made to an application
>> on the API contract level because it "might work". That's just wrong!
>>
>> Let me put this in explicit terms: What you are proposing is in direct
>> violation of the GL and Vulkan specification.
>>
>> You can't just chose to break these contracts because you think it
>> 'might' be a better user experience.
> Is the original issue that motivated soft resets to be non-fatal even an
> issue anymore?
>
> If I read that old thread correctly, the rationale for that was that
> assigning guilt to a context was more broken than not doing it, because
> the compositor/Xwayland process would also crash despite being unrelated
> to the hang.
> With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> keeping soft resets non-fatal provides any benefit to the user experience.
> The potential detriments to user experience have been outlined multiple
> times in this thread already.
>
> (I suppose if the compositor itself faults it might still bring down a
> session, but I've literally never seen that, and it's not like a
> compositor triggering segfaults on CPU stays alive either.)
 That's indeed what happened for me, multiple times. And each time the 
 session continued running fine for days after the soft reset.

 But apparently my experience isn't valid somehow, and I should have been 
 forced to log in again to please the GL gods...


 Conversely, I can't remember hitting a case where an app kept running into 
 soft resets. It's almost as if different people may have different 
 experiences! ;)
>>> Your anecdote of whatever application coincidentally managing to soldier 
>>> through being hung is really not relevant.
>> But yours is, got it.
> The fundamental problem here is that not telling applications that
> something went wrong when you just canceled their work midway is an
> out-of-spec hack.
> When there is a report of real-world apps breaking because of that hack,
> reports of different apps working (even if it's convenient that they
> work) doesn't justify keeping the broken code.

If the breaking apps hit multiple soft resets in a row, I've laid out a 
pragmatic solution which covers both cases.


> If mutter needs to be robust against faults it caused itself, it should be 
> robust
> against GPU resets.

It's unlikely that the hangs I've seen were caused by mutter itself, more 
likely Mesa or amdgpu.

Anyway, this will happen at some point, the reality is it hasn't yet though.


-- 
Earthling Michel Dänzer|  https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 17:09, Michel Dänzer wrote:

On 2024-01-15 17:46, Joshua Ashton wrote:

On 1/15/24 16:34, Michel Dänzer wrote:

On 2024-01-15 17:19, Friedrich Vock wrote:

On 15.01.24 16:43, Joshua Ashton wrote:

On 1/15/24 15:25, Michel Dänzer wrote:

On 2024-01-15 14:17, Christian König wrote:

Am 15.01.24 um 12:37 schrieb Joshua Ashton:

On 1/15/24 09:40, Christian König wrote:

Am 13.01.24 um 15:02 schrieb Joshua Ashton:


Without this feedback, the application may keep pushing through
the soft
recoveries, continually hanging the system with jobs that timeout.


Well, that is intentional behavior. Marek is voting for making
soft recovered errors fatal as well while Michel is voting for
better ignoring them.

I'm not really sure what to do. If you guys think that soft
recovered hangs should be fatal as well then we can certainly do
this.


A possible compromise might be making soft resets fatal if they
happen repeatedly (within a certain period of time?).


No, no and no. Aside from introducing issues by side effects not
surfacing and all of the stuff I mentioned about descriptor buffers,
bda, draw indirect and stuff just resulting in more faults and hangs...

You are proposing we throw out every promise we made to an application
on the API contract level because it "might work". That's just wrong!

Let me put this in explicit terms: What you are proposing is in direct
violation of the GL and Vulkan specification.

You can't just chose to break these contracts because you think it
'might' be a better user experience.


Is the original issue that motivated soft resets to be non-fatal even an
issue anymore?

If I read that old thread correctly, the rationale for that was that
assigning guilt to a context was more broken than not doing it, because
the compositor/Xwayland process would also crash despite being unrelated
to the hang.
With Joshua's Mesa fixes, this is not the case anymore, so I don't think
keeping soft resets non-fatal provides any benefit to the user experience.
The potential detriments to user experience have been outlined multiple
times in this thread already.

(I suppose if the compositor itself faults it might still bring down a
session, but I've literally never seen that, and it's not like a
compositor triggering segfaults on CPU stays alive either.)


That's indeed what happened for me, multiple times. And each time the session 
continued running fine for days after the soft reset.

But apparently my experience isn't valid somehow, and I should have been forced 
to log in again to please the GL gods...


Conversely, I can't remember hitting a case where an app kept running into soft 
resets. It's almost as if different people may have different experiences! ;)


Your anecdote of whatever application coincidentally managing to soldier 
through being hung is really not relevant.


But yours is, got it.


Yes, because what I am stating is backed by the specification for the APIs
we are using.
You previously said things are not black and white, but this very
explicitly is -- we have specifications for a reason!


Your app just-so happening to survive a command buffer being ignored is 
not a useful or valid addition to this discussion at all.


- Joshie 🐸✨





It looks like Mutter already has some handling for GL robustness anyway...


That's just for suspend/resume with the nvidia driver. It can't recover from 
GPU hangs yet.






Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Friedrich Vock

Re-sending as plaintext, sorry about that

On 15.01.24 18:54, Michel Dänzer wrote:

On 2024-01-15 18:26, Friedrich Vock wrote:

[snip]
The fundamental problem here is that not telling applications that
something went wrong when you just canceled their work midway is an
out-of-spec hack.
When there is a report of real-world apps breaking because of that hack,
reports of different apps working (even if it's convenient that they
work) doesn't justify keeping the broken code.

If the breaking apps hit multiple soft resets in a row, I've laid out a 
pragmatic solution which covers both cases.

Hitting soft reset every time is the lucky path. Once GPU work is
interrupted out of nowhere, all bets are off and it might as well
trigger a full system hang next time. No hang recovery should be able to
cause that under any circumstance.




If mutter needs to be robust against faults it caused itself, it should be 
robust
against GPU resets.

It's unlikely that the hangs I've seen were caused by mutter itself, more 
likely Mesa or amdgpu.

Anyway, this will happen at some point, the reality is it hasn't yet though.




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Bas Nieuwenhuizen
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
wrote:

> Re-sending as plaintext, sorry about that
>
> On 15.01.24 18:54, Michel Dänzer wrote:
> > On 2024-01-15 18:26, Friedrich Vock wrote:
> >> [snip]
> >> The fundamental problem here is that not telling applications that
> >> something went wrong when you just canceled their work midway is an
> >> out-of-spec hack.
> >> When there is a report of real-world apps breaking because of that hack,
> >> reports of different apps working (even if it's convenient that they
> >> work) doesn't justify keeping the broken code.
> > If the breaking apps hit multiple soft resets in a row, I've laid out a
> pragmatic solution which covers both cases.
> Hitting soft reset every time is the lucky path. Once GPU work is
> interrupted out of nowhere, all bets are off and it might as well
> trigger a full system hang next time. No hang recovery should be able to
> cause that under any circumstance.
>

I think the more insidious situation is no further hangs but wrong results
because we skipped some work. That we skipped work may e.g. result in some
texture not being uploaded or some GPGPU work not being done, causing
further errors downstream (say, if a game is doing AI/physics on the GPU,
to say nothing of actual GPGPU work one might be doing, like AI)


> >
> >
> >> If mutter needs to be robust against faults it caused itself, it should
> be robust
> >> against GPU resets.
> > It's unlikely that the hangs I've seen were caused by mutter itself,
> more likely Mesa or amdgpu.
> >
> > Anyway, this will happen at some point, the reality is it hasn't yet
> though.
> >
> >
>


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 18:30, Bas Nieuwenhuizen wrote:



On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


Re-sending as plaintext, sorry about that

On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications that
 >> something went wrong when you just canceled their work midway is an
 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
that hack,
 >> reports of different apps working (even if it's convenient that they
 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
out a pragmatic solution which covers both cases.
Hitting soft reset every time is the lucky path. Once GPU work is
interrupted out of nowhere, all bets are off and it might as well
trigger a full system hang next time. No hang recovery should be able to
cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not being 
done and causing further errors downstream (say if a game is doing 
AI/physics on the GPU not to say anything of actual GPGPU work one might 
be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!

Now imagine this is VulkanSC displaying something in the car dashboard, 
or some medical device doing some compute work to show something on a 
graph...


I am not saying you should be doing any of that with RADV + AMDGPU, but 
it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts like 
this, it's flat-out wrong.


- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
yet though.
 >
 >



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications 
that
 >> something went wrong when you just canceled their work midway 
is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should be 
able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU work 
one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just don't do is prevent further submissions from this 
application, e.g. enforcing that the application is punished for bad 
behavior.




Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
    yet though.
 >
 >





Re: Proposal to add CRIU support to DRM render nodes

2024-01-15 Thread Felix Kuehling
I haven't seen any replies to this proposal. Either it got lost in the 
pre-holiday noise, or there is genuinely no interest in this.


If it's the latter, I would look for an AMDGPU driver-specific solution 
with minimally invasive changes in DRM and DMABuf code, if needed. Maybe 
it could be generalized later if there is interest then.


Regards,
  Felix


On 2023-12-06 16:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.


We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.


Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.


Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.


After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.


One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?


With that out of the way, some considerations for a possible DRM CRIU 
API (either generic of AMDGPU driver specific): The API goes through 
several phases during checkpoint and restore:


Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can
allocate memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this
time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)

For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md
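
To make these phases concrete, here is a purely illustrative sketch of 
what a generic ioctl argument structure could look like. Every name in 
it is a placeholder invented for this discussion (loosely modeled on the 
KFD interface); it is not an existing or proposed ABI:

    /* Hypothetical, illustration only -- not an existing or proposed ABI. */
    enum drm_criu_op {
            DRM_CRIU_OP_PROCESS_INFO,   /* enumerate objects, stop GPU execution */
            DRM_CRIU_OP_CHECKPOINT,     /* store object metadata (BOs, queues, ...) */
            DRM_CRIU_OP_UNPAUSE,        /* resume execution after the checkpoint */
            DRM_CRIU_OP_RESTORE,        /* recreate objects from the checkpoint */
            DRM_CRIU_OP_RESUME,         /* final fixups, resume execution */
    };

    struct drm_criu_args {
            __u64 objects_ptr;          /* user pointer to object metadata array */
            __u32 num_objects;          /* number of entries at objects_ptr */
            __u32 op;                   /* enum drm_criu_op */
            __u64 priv_data_ptr;        /* opaque driver-specific blob, see below */
            __u64 priv_data_size;       /* size of the opaque blob in bytes */
    };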


Potential objections to a KFD-style CRIU API in DRM render nodes, I'll 
address each of them in more detail below:


  * Opaque information in the checkpoint data that user mode can't
interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
relationships rather than replaying BO creation, export and import
from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that 
information. Therefore, much of the information in our KFD CRIU ioctl 
API is opaque. It is written by kernel mode in the checkpoint, it is 
consumed by kernel mode when restoring the checkpoint, but user mode 
doesn't care about the contents or binary layout, so there is no user 
mode ABI to break. This is how we were able to maintain CRIU support 
when we added the SVM API to KFD without changing the CRIU plugin and 
without breaking our ABI.


Opaque information may also lend itself to API abstraction, if this 
becomes a generic DRM API with driver-specific callbacks that fill in 
HW-specific opaque data.


# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint needs more 
information than the usual BO alloc and similar APIs provide. For 
example, we need to restore BOs with the same GEM handles so that user 
mode can continue using those handles after resuming execution. If BOs 
are shared through DMABufs without dynamic attachment, we need to 
restore pinne

[PATCH] drm/amdkfd: Use S_ENDPGM_SAVED in trap handler

2024-01-15 Thread Jay Cornwall
This instruction has no functional difference from S_ENDPGM,
but it allows performance counters to track save events correctly.

Signed-off-by: Jay Cornwall 
---
 drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h | 14 +++---
 .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm |  2 +-
 .../gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm  |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h 
b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
index df75863393fc..d1caaf0e6a7c 100644
--- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
+++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler.h
@@ -674,7 +674,7 @@ static const uint32_t cwsr_trap_gfx9_hex[] = {
0x86ea6a6a, 0x8f6e837a,
0xb96ee0c2, 0xbf82,
0xb97a0002, 0xbf8a,
-   0xbe801f6c, 0xbf81,
+   0xbe801f6c, 0xbf9b,
 };
 
 static const uint32_t cwsr_trap_nv1x_hex[] = {
@@ -1091,7 +1091,7 @@ static const uint32_t cwsr_trap_nv1x_hex[] = {
0xb9eef807, 0x876dff6d,
0x, 0x87fe7e7e,
0x87ea6a6a, 0xb9faf802,
-   0xbe80226c, 0xbf81,
+   0xbe80226c, 0xbf9b,
0xbf9f, 0xbf9f,
0xbf9f, 0xbf9f,
0xbf9f, 0x,
@@ -1574,7 +1574,7 @@ static const uint32_t cwsr_trap_arcturus_hex[] = {
0x86ea6a6a, 0x8f6e837a,
0xb96ee0c2, 0xbf82,
0xb97a0002, 0xbf8a,
-   0xbe801f6c, 0xbf81,
+   0xbe801f6c, 0xbf9b,
 };
 
 static const uint32_t cwsr_trap_aldebaran_hex[] = {
@@ -2065,7 +2065,7 @@ static const uint32_t cwsr_trap_aldebaran_hex[] = {
0x86ea6a6a, 0x8f6e837a,
0xb96ee0c2, 0xbf82,
0xb97a0002, 0xbf8a,
-   0xbe801f6c, 0xbf81,
+   0xbe801f6c, 0xbf9b,
 };
 
 static const uint32_t cwsr_trap_gfx10_hex[] = {
@@ -2500,7 +2500,7 @@ static const uint32_t cwsr_trap_gfx10_hex[] = {
0x876dff6d, 0x,
0x87fe7e7e, 0x87ea6a6a,
0xb9faf802, 0xbe80226c,
-   0xbf81, 0xbf9f,
+   0xbf9b, 0xbf9f,
0xbf9f, 0xbf9f,
0xbf9f, 0xbf9f,
 };
@@ -2944,7 +2944,7 @@ static const uint32_t cwsr_trap_gfx11_hex[] = {
0xb8eef802, 0xbf0d866e,
0xbfa20002, 0xb97af802,
0xbe80486c, 0xb97af802,
-   0xbe804a6c, 0xbfb0,
+   0xbe804a6c, 0xbfb1,
0xbf9f, 0xbf9f,
0xbf9f, 0xbf9f,
0xbf9f, 0x,
@@ -3436,5 +3436,5 @@ static const uint32_t cwsr_trap_gfx9_4_3_hex[] = {
0x86ea6a6a, 0x8f6e837a,
0xb96ee0c2, 0xbf82,
0xb97a0002, 0xbf8a,
-   0xbe801f6c, 0xbf81,
+   0xbe801f6c, 0xbf9b,
 };
diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm 
b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
index e0140df0b0ec..71b3dc0c7363 100644
--- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
+++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx10.asm
@@ -1104,7 +1104,7 @@ L_RETURN_WITHOUT_PRIV:
s_rfe_b64   s_restore_pc_lo 
//Return to the main shader program and resume execution
 
 L_END_PGM:
-   s_endpgm
+   s_endpgm_saved
 end
 
 function write_hwreg_to_mem(s, s_rsrc, s_mem_offset)
diff --git a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm 
b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
index e506411ad28a..bb26338204f4 100644
--- a/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
+++ b/drivers/gpu/drm/amd/amdkfd/cwsr_trap_handler_gfx9.asm
@@ -921,7 +921,7 @@ L_RESTORE:
 /* the END   */
 /**/
 L_END_PGM:
-s_endpgm
+s_endpgm_saved
 
 end
 
-- 
2.25.1



Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock > wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling applications 
that
 >> something went wrong when you just canceled their work midway 
is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've laid
    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should be 
able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU work 
one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for bad 
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I assume you mean checking the 
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution, which is to just mark it 
as guilty with what already exists...
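
For reference, the "what already exists" path is the per-context reset 
query, roughly like this (a sketch assuming libdrm's 
amdgpu_cs_query_reset_state2 wrapper and the AMDGPU_CTX_QUERY2 flag 
bits; error handling trimmed):

    #include <stdint.h>
    #include <amdgpu.h>        /* libdrm */
    #include <amdgpu_drm.h>

    /* Ask the kernel whether this context saw a reset and whether it was
     * the guilty party.
     */
    static uint64_t query_ctx_reset_flags(amdgpu_context_handle ctx)
    {
            uint64_t flags = 0;

            if (amdgpu_cs_query_reset_state2(ctx, &flags))
                    return 0;       /* query failed, report no reset */

            /* flags may contain AMDGPU_CTX_QUERY2_FLAGS_RESET,
             * AMDGPU_CTX_QUERY2_FLAGS_GUILTY and
             * AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST.
             */
            return flags;
    }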


- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused itself, it
    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it hasn't
    yet though.
 >
 >





- Joshie 🐸✨


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've 
laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU 
work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for bad 
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I you assume you mean checking the 
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of reasons.

What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


    /* Ignore soft recovered fences here */
    r = drm_sched_entity_error(s_entity);
    if (r && r != -ENODATA)
    goto error;

This will bubble up errors from soft recoveries into the entity as well 
and makes sure that further submissions are rejected.
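
In code terms the suggestion boils down to dropping the -ENODATA special 
case, roughly (untested sketch, not a formal patch):

    /* amdgpu_job_prepare_job(): treat soft-recovery errors as fatal for
     * the entity as well, so later submissions on it are rejected too.
     */
    r = drm_sched_entity_error(s_entity);
    if (r)          /* was: if (r && r != -ENODATA) */
            goto error;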


Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it 
hasn't

    yet though.
 >
 >





- Joshie 🐸✨




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking because of
    that hack,
 >> reports of different apps working (even if it's convenient 
that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, I've 
laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU work is
    interrupted out of nowhere, all bets are off and it might as well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but wrong 
results because we skipped some work. That we skipped work may e.g. 
result in some texture not being uploaded or some GPGPU work not 
being done and causing further errors downstream (say if a game is 
doing AI/physics on the GPU not to say anything of actual GPGPU 
work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even knowing!


Well on the kernel side we do provide an API to query the result of a 
submission. That includes canceling submissions with a soft recovery.


What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for bad 
behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I you assume you mean checking the 
submission fence error somehow? That doesn't seem very ergonomic for a 
Vulkan driver compared to the simple solution which is to just mark it 
as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of reasons.

What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


     /* Ignore soft recovered fences here */
     r = drm_sched_entity_error(s_entity);
     if (r && r != -ENODATA)
     goto error;

This will bubble up errors from soft recoveries into the entity as well 
and makes sure that further submissions are rejected.


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egregious about the ctx->guilty 
handling; is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + AMDGPU, 
but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API contracts 
like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, this will happen at some point, the reality is it 
hasn't

    yet though.
 >
 >





- Joshie 🐸✨




Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking 
because of

    that hack,
 >> reports of different apps working (even if it's 
convenient that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, 
I've laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU 
work is
    interrupted out of nowhere, all bets are off and it might as 
well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but 
wrong results because we skipped some work. That we skipped work 
may e.g. result in some texture not being uploaded or some GPGPU 
work not being done and causing further errors downstream (say if 
a game is doing AI/physics on the GPU not to say anything of 
actual GPGPU work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even 
knowing!


Well on the kernel side we do provide an API to query the result of 
a submission. That includes canceling submissions with a soft 
recovery.


What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for 
bad behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I you assume you mean checking 
the submission fence error somehow? That doesn't seem very ergonomic 
for a Vulkan driver compared to the simple solution which is to just 
mark it as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of 
reasons.


What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


 /* Ignore soft recovered fences here */
 r = drm_sched_entity_error(s_entity);
 if (r && r != -ENODATA)
 goto error;

This will bubble up errors from soft recoveries into the entity as 
well and makes sure that further submissions are rejected.


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


No, it clearly gets that signaled. We should probably replace the guilty 
atomic with a call to drm_sched_entity_error().


It's just that this isn't what Marek and I had in mind for this, 
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or 
AMDGPU_CTX_OP_QUERY_STATE2.


Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA you have been guilty of causing a soft 
recovery, when you get an -ETIME you are guilty of causing a timeout 
which had to be hard recovered. When you get an -ECANCELED you are an 
innocent victim of a hard recovery somebody else caused.


What we haven't defined yet is an error code for losing VRAM, but that 
should be trivial to do.
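
To illustrate how user space could consume those codes, a rough sketch 
(the enum and helper below are made up for illustration; only the error 
codes above are the actual kernel contract, and err is the negative 
errno as propagated e.g. by libdrm):

    #include <errno.h>

    enum reset_status { NO_RESET, GUILTY_SOFT, GUILTY_HARD, INNOCENT_VICTIM };

    /* Classify the error returned for this context's own submission. */
    static enum reset_status classify_submission_error(int err)
    {
            switch (err) {
            case 0:
                    return NO_RESET;        /* submission completed normally */
            case -ENODATA:
                    return GUILTY_SOFT;     /* we caused a soft recovery */
            case -ETIME:
                    return GUILTY_HARD;     /* we caused a timeout/hard reset */
            case -ECANCELED:
                    return INNOCENT_VICTIM; /* someone else caused the reset */
            default:
                    return NO_RESET;        /* unrelated failure */
            }
    }

A GL driver would then translate the guilty/innocent cases into the 
corresponding *_CONTEXT_RESET_EXT robustness status, and a Vulkan driver 
into VK_ERROR_DEVICE_LOST plus its reset-query equivalents.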


Regards,
Christian.



We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egrigious about the ctx->guilty 
handling, is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + 
AMDGPU, but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API 
contracts like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it caused 
itself, it

    should be robust
 >> against GPU resets.
 > It's unlikely that the hangs I've seen were caused by mutter
    itself, more likely Mesa or amdgpu.
 >
 > Anyway, th

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Joshua Ashton




On 1/15/24 19:57, Christian König wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock 
mailto:friedrich.v...@gmx.de>> wrote:


    Re-sending as plaintext, sorry about that

    On 15.01.24 18:54, Michel Dänzer wrote:
 > On 2024-01-15 18:26, Friedrich Vock wrote:
 >> [snip]
 >> The fundamental problem here is that not telling 
applications that
 >> something went wrong when you just canceled their work 
midway is an

 >> out-of-spec hack.
 >> When there is a report of real-world apps breaking 
because of

    that hack,
 >> reports of different apps working (even if it's 
convenient that they

 >> work) doesn't justify keeping the broken code.
 > If the breaking apps hit multiple soft resets in a row, 
I've laid

    out a pragmatic solution which covers both cases.
    Hitting soft reset every time is the lucky path. Once GPU 
work is
    interrupted out of nowhere, all bets are off and it might as 
well
    trigger a full system hang next time. No hang recovery should 
be able to

    cause that under any circumstance.


I think the more insidious situation is no further hangs but 
wrong results because we skipped some work. That we skipped work 
may e.g. result in some texture not being uploaded or some GPGPU 
work not being done and causing further errors downstream (say if 
a game is doing AI/physics on the GPU not to say anything of 
actual GPGPU work one might be doing like AI)


Even worse if this is compute on eg. OpenCL for something 
science/math/whatever related, or training a model.


You could randomly just get invalid/wrong results without even 
knowing!


Well on the kernel side we do provide an API to query the result of 
a submission. That includes canceling submissions with a soft 
recovery.


What we just doesn't do is to prevent further submissions from this 
application. E.g. enforcing that the application is punished for 
bad behavior.


You do prevent future submissions for regular resets though: Those 
increase karma which sets ctx->guilty, and if ctx->guilty then 
-ECANCELED is returned for a submission.


ctx->guilty is never true for soft recovery though, as it doesn't 
increase karma, which is the problem this patch is trying to solve.


By the submission result query API, I you assume you mean checking 
the submission fence error somehow? That doesn't seem very ergonomic 
for a Vulkan driver compared to the simple solution which is to just 
mark it as guilty with what already exists...


Well as I said the guilty handling is broken for quite a number of 
reasons.


What we can do rather trivially is changing this code in 
amdgpu_job_prepare_job():


 /* Ignore soft recovered fences here */
 r = drm_sched_entity_error(s_entity);
 if (r && r != -ENODATA)
 goto error;

This will bubble up errors from soft recoveries into the entity as 
well and makes sure that further submissions are rejected.


That makes sense to do, but at least for GL_EXT_robustness, that will 
not tell the app that it was guilty.


No, it clearly gets that signaled. We should probably replace the guilty 
atomic with a calls to drm_sched_entity_error().


It's just that this isn't what Marek and I had in mind for this, 
basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or 
AMDGPU_CTX_OP_QUERY_STATE2.


Instead just look at the return value of the CS or query fence result 
IOCTL.


When you get an -ENODATA you have been guilty of causing a soft 
recovery, when you get an -ETIME you are guilty of causing a timeout 
which had to be hard recovered. When you get an -ECANCELED you are an 
innocent victim of a hard recovery somebody else caused.


What we haven't defined yet is an error code for loosing VRAM, but that 
should be trivial to do.


Thanks for the info, I will test things out here and likely send a patch 
to change if (r && r != -ENODATA) -> if (r) if things work out.


- Joshie 🐸✨



Regards,
Christian.



We could always return UNKNOWN_CONTEXT_RESET_EXT in that case, I guess.

I am not super sure what is wrong/egrigious about the ctx->guilty 
handling, is there a previous email thread explaining that?


- Joshie 🐸✨



Regards,
Christian.



- Joshie 🐸✨





Now imagine this is VulkanSC displaying something in the car 
dashboard, or some medical device doing some compute work to show 
something on a graph...


I am not saying you should be doing any of that with RADV + 
AMDGPU, but it's just food for thought... :-)


As I have been saying, you simply cannot just violate API 
contracts like this, it's flatout wrong.


Yeah, completely agree to that.

Regards,
Christian.



- Joshie 🐸✨



 >
 >
 >> If mutter needs to be robust against faults it

Re: Documentation for RGB strip on RX 7900 XTX (Reference)

2024-01-15 Thread Harry Wentland



On 2024-01-09 03:31, Christian König wrote:
> Am 09.01.24 um 09:23 schrieb Alexander Koskovich:
>> Thanks for the info, will take a look.
>>
>> Also just to clarify, this is the first party AMD 7900 XTX, not a third 
>> party AIB (e.g. Sapphire, ASRock, etc).
> 
> Yeah, but that doesn't matter.
> 
> For the reference designs AMD basically just says how it should look like and 
> then somebody else stitches it together. The handling of for example how 
> connectors are detected is still the same.
> 
> AMD should somewhere have detailed documentation what's on that reference 
> board, but to be honest I have absolutely no idea who to ask for that. It's 
> simpler to just look into the tables used by all vendors.
> 
> Regards,
> Christian.
> 
>>
>>
>> On Tuesday, January 9th, 2024 at 3:02 AM, Christian König 
>>  wrote:
>>
>>
>>>
>>> Am 08.01.24 um 23:32 schrieb Deucher, Alexander:
>>>
 [Public]

> -Original Message-
> From: amd-gfx amd-gfx-boun...@lists.freedesktop.org On Behalf Of
> Alexander Koskovich
> Sent: Sunday, January 7, 2024 11:19 PM
> To: amd-gfx@lists.freedesktop.org
> Subject: Documentation for RGB strip on RX 7900 XTX (Reference)
>
> Hello,
>
> I was wondering if AMD would be able provide any documentation for the
> RGB strip on the reference cooler
> (https://www.amd.com/en/products/graphics/amd-radeon-rx-7900xtx)? It
> looks to be handled via I2C commands to the SMU, but having proper
> documentation would be extremely helpful.
> It depends on the AIB/OEM and how they designed the specific board. The 
> RGB controller will either be attached to the DDCVGA i2c bus on the 
> display hardware or the second SMU i2c bus. The former will require 
> changes to the amdgpu display code to register display i2c buses that are 
> not used by the display connectors on the board so they can be used by 
> 3rd party applications. Currently we only register i2c buses used for 
> display connectors. The latter buses are already registered with the i2c 
> subsystem since they are used for other things like EEPROMs on server and 
> workstation cards and should be available via standard Linux i2c APIs. 
> I'm not sure what i2c LED controllers each AIB vendor uses off hand. 
> https://openrgb.org/index.html would probably be a good resource for that 
> information.
>>>
>>>
>>> It might also be a good idea to look some of the ATOMBIOS tables found
>>> on your device.
>>>
>>> Those tables are filled in by the AIB/OEM with the information which
>>> connectors (HDMI, DVI, DP etc...) are on the board and I bet that the
>>> information which RGB controller is used and where to find it is
>>> somewhere in there as well.
>>>
>>> Adding Harry from our display team, might be that he has some more hints
>>> as well.
>>>

I don't know how the RGB strips are connected and controlled.

Harry

>>> Christian.
>>>
 Alex
> 



Re: [PATCH] drm/amd/display: Drop 'acrtc' and add 'new_crtc_state' NULL check for writeback requests.

2024-01-15 Thread Alex Hung

Thanks for catching this.

Reviewed-by: Alex Hung 

On 2024-01-13 02:11, Srinivasan Shanmugam wrote:

The return value of 'to_amdgpu_crtc', which is container_of(...), can't be
NULL, so its NULL check on 'acrtc' is dropped.

Fixing the below:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:9302 
amdgpu_dm_atomic_commit_tail() error: we previously assumed 'acrtc' could be 
null (see line 9299)

Add a 'new_crtc_state' NULL check for the function
'drm_atomic_get_new_crtc_state', which retrieves the new state for a CRTC,
while enabling writeback requests.

Cc: sta...@vger.kernel.org
Cc: Alex Hung 
Cc: Aurabindo Pillai 
Cc: Rodrigo Siqueira 
Cc: Hamza Mahfooz 
Signed-off-by: Srinivasan Shanmugam 
---
  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 95ff3800fc87..8eb381d5f6b8 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -9294,10 +9294,10 @@ static void amdgpu_dm_atomic_commit_tail(struct 
drm_atomic_state *state)
if (!new_con_state->writeback_job)
continue;
  
-		new_crtc_state = NULL;

+   new_crtc_state = drm_atomic_get_new_crtc_state(state, 
&acrtc->base);
  
-		if (acrtc)

-   new_crtc_state = drm_atomic_get_new_crtc_state(state, 
&acrtc->base);
+   if (!new_crtc_state)
+   continue;
  
  		if (acrtc->wb_enabled)

continue;


Re: [PATCH] drm/amd/display: Fix a switch statement in populate_dml_output_cfg_from_stream_state()

2024-01-15 Thread Hamza Mahfooz

On 1/13/24 09:58, Christophe JAILLET wrote:

It is likely that the statement related to 'dml_edp' is misplaced. So move
it in the correct "case SIGNAL_TYPE_EDP".

Fixes: 7966f319c66d ("drm/amd/display: Introduce DML2")
Signed-off-by: Christophe JAILLET 


Nice catch! Applied, thanks!


---
  drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c 
b/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
index fa6a93dd9629..64d01a9cd68c 100644
--- a/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
+++ b/drivers/gpu/drm/amd/display/dc/dml2/dml2_translation_helper.c
@@ -626,8 +626,8 @@ static void 
populate_dml_output_cfg_from_stream_state(struct dml_output_cfg_st *
if (is_dp2p0_output_encoder(pipe))
out->OutputEncoder[location] = dml_dp2p0;
break;
-   out->OutputEncoder[location] = dml_edp;
case SIGNAL_TYPE_EDP:
+   out->OutputEncoder[location] = dml_edp;
break;
case SIGNAL_TYPE_HDMI_TYPE_A:
case SIGNAL_TYPE_DVI_SINGLE_LINK:

--
Hamza



[PATCH] drm/amdkfd: Correct partial migration virtual addr

2024-01-15 Thread Philip Yang
Partial migration to system memory should use migrate.addr, not
prange->start, as the virtual address when allocating the system memory page.

Fixes: 18eb61bd5a6a ("drm/amdkfd: Use partial migrations/mapping for GPU/CPU 
page faults in SVM")
Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f856901055d3..bdc01ca9609a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -574,7 +574,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, prange->start,
 prange->last);
 
-   addr = prange->start << PAGE_SHIFT;
+   addr = migrate->start;
 
src = (uint64_t *)(scratch + npages);
dst = scratch;
-- 
2.35.1



[PATCH v4] drm/amdkfd: Set correct svm range actual loc after spliting

2024-01-15 Thread Philip Yang
While an svm range is partially migrating to system memory, clear the
dma_addr VRAM domain flag; otherwise a future split will get incorrect
vram_pages and actual_loc.

After range splitting, set the new and old range actual_loc:
the new range actual_loc is 0 if new->vram_pages is 0.
the old range actual_loc is 0 if old->vram_pages - new->vram_pages == 0.

The new range takes an svm_bo reference only if its vram_pages is not 0.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  8 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 42 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 +
 3 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index bdc01ca9609a..79baa195ccac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -564,6 +564,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
dma_addr_t *scratch, uint64_t npages)
 {
struct device *dev = adev->dev;
+   dma_addr_t *dma_addr;
uint64_t *src;
dma_addr_t *dst;
struct page *dpage;
@@ -575,6 +576,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
 prange->last);
 
addr = migrate->start;
+   dma_addr = svm_get_dma_addr_for_page_count(prange, addr);
 
src = (uint64_t *)(scratch + npages);
dst = scratch;
@@ -623,6 +625,12 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
goto out_oom;
}
 
+   /* Clear VRAM flag when page is migrated to ram, to count vram
+* pages correctly when spliting the range.
+*/
+   if (dma_addr && (dma_addr[i] & SVM_RANGE_VRAM_DOMAIN))
+   dma_addr[i] = 0;
+
pr_debug_ratelimited("dma mapping dst to 0x%llx, pfn 0x%lx\n",
 dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f84547eccd28..78b4968e4c95 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -362,7 +362,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, 
uint64_t start,
INIT_LIST_HEAD(&prange->child_list);
atomic_set(&prange->invalid, 0);
prange->validate_timestamp = 0;
-   prange->vram_pages = 0;
mutex_init(&prange->migrate_mutex);
mutex_init(&prange->lock);
 
@@ -965,6 +964,24 @@ svm_range_split_array(void *ppnew, void *ppold, size_t 
size,
return 0;
 }
 
+dma_addr_t *
+svm_get_dma_addr_for_page_count(struct svm_range *prange, u64 addr)
+{
+   struct kfd_process *p = container_of(prange->svms, struct kfd_process, 
svms);
+   dma_addr_t *dma_addr;
+   s32 gpuidx;
+
+   gpuidx = kfd_process_gpuidx_from_gpuid(p, prange->actual_loc);
+   if (gpuidx < 0) {
+   pr_debug("no GPU id 0x%x found\n", prange->actual_loc);
+   return NULL;
+   }
+
+   dma_addr = prange->dma_addr[gpuidx];
+   dma_addr += (addr >> PAGE_SHIFT) - prange->start;
+   return dma_addr;
+}
+
 static int
 svm_range_split_pages(struct svm_range *new, struct svm_range *old,
  uint64_t start, uint64_t last)
@@ -980,9 +997,14 @@ svm_range_split_pages(struct svm_range *new, struct 
svm_range *old,
if (r)
return r;
}
-   if (old->actual_loc)
+   if (old->actual_loc && new->vram_pages) {
old->vram_pages -= new->vram_pages;
-
+   new->actual_loc = old->actual_loc;
+   if (!old->vram_pages)
+   old->actual_loc = 0;
+   }
+   pr_debug("new->vram_pages 0x%llx loc 0x%x old->vram_pages 0x%llx loc 
0x%x\n",
+new->vram_pages, new->actual_loc, old->vram_pages, 
old->actual_loc);
return 0;
 }
 
@@ -1002,13 +1024,14 @@ svm_range_split_nodes(struct svm_range *new, struct 
svm_range *old,
new->offset = old->offset + npages;
}
 
-   new->svm_bo = svm_range_bo_ref(old->svm_bo);
-   new->ttm_res = old->ttm_res;
-
-   spin_lock(&new->svm_bo->list_lock);
-   list_add(&new->svm_bo_list, &new->svm_bo->range_list);
-   spin_unlock(&new->svm_bo->list_lock);
+   if (new->vram_pages) {
+   new->svm_bo = svm_range_bo_ref(old->svm_bo);
+   new->ttm_res = old->ttm_res;
 
+   spin_lock(&new->svm_bo->list_lock);
+   list_add(&new->svm_bo_list, &new->svm_bo->range_list);
+   spin_unlock(&new->svm_bo->list_lock);
+   }
return 0;
 }
 
@@ -1058,7 +1081,6 @@ svm_range_split_adjust(struct svm_range *new, struct 
svm_range *old,
new->flags = old->flags;
new->preferred_loc = old->prefe

[pull] amdgpu, amdkfd drm-fixes-6.8

2024-01-15 Thread Alex Deucher
Hi Dave, Sima,

Same PR as Friday, but with the new clang warning fixed.

The following changes since commit e54478fbdad20f2c58d0a4f99d01299ed8e7fe9c:

  Merge tag 'amd-drm-next-6.8-2024-01-05' of 
https://gitlab.freedesktop.org/agd5f/linux into drm-next (2024-01-09 09:07:50 
+1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.8-2024-01-15

for you to fetch changes up to 86718cf93237a0f45773bbc49b6006733fc4e051:

  drm/amd/display: Avoid enum conversion warning (2024-01-15 16:28:05 -0500)


amd-drm-fixes-6.8-2024-01-15:

amdgpu:
- SubVP fixes
- VRR fixes
- USB4 fixes
- DCN 3.5 fixes
- GFX11 harvesting fix
- RAS fixes
- Misc small fixes
- KFD dma-buf import fixes
- Power reporting fixes
- ATHUB 3.3 fix
- SR-IOV fix
- Add missing fw release for fiji
- GFX 11.5 fix
- Debugging module parameter fix
- SMU 13.0.6 fixes
- Fix new clang warning

amdkfd:
- Fix lockdep warnings
- Fix sparse __rcu warnings
- Bump interface version so userspace knows that the kernel supports dma-bufs 
exported from KFD
  Most of the fixes for this went into 6.7, but the last fix is in this PR
- HMM fix
- SVM fix


Alex Deucher (4):
  drm/amdgpu: fix avg vs input power reporting on smu7
  drm/amdgpu: fall back to INPUT power for AVG power via INFO IOCTL
  drm/amdgpu/pm: clarify debugfs pm output
  drm/amdgpu: drop exp hw support check for GC 9.4.3

Aric Cyr (1):
  drm/amd/display: 3.2.266

Candice Li (2):
  drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
  drm/amdgpu: Support poison error injection via ras_ctrl debugfs

Charlene Liu (1):
  drm/amd/display: Update z8 latency

Dafna Hirschfeld (1):
  drm/amdkfd: fixes for HMM mem allocation

Daniel Miess (1):
  Revert "drm/amd/display: Fix conversions between bytes and KB"

Felix Kuehling (4):
  drm/amdkfd: Fix lock dependency warning
  drm/amdkfd: Fix sparse __rcu annotation warnings
  drm/amdgpu: Auto-validate DMABuf imports in compute VMs
  drm/amdkfd: Bump KFD ioctl version

George Shen (1):
  drm/amd/display: Disconnect phantom pipe OPP from OPTC being disabled

Hawking Zhang (1):
  drm/amdgpu: Packed socket_id to ras feature mask

Ivan Lipski (1):
  Revert "drm/amd/display: fix bandwidth validation failure on DCN 2.1"

James Zhu (1):
  drm/amdgpu: make a correction on comment

Le Ma (3):
  Revert "drm/amdgpu: add param to specify fw bo location for front-door 
loading"
  drm/amdgpu: add debug flag to place fw bo on vram for frontdoor loading
  drm/amdgpu: move debug options init prior to amdgpu device init

Lijo Lazar (2):
  drm/amd/pm: Add error log for smu v13.0.6 reset
  drm/amd/pm: Fix smuv13.0.6 current clock reporting

Likun Gao (1):
  drm/amdgpu: correct the cu count for gfx v11

Martin Leung (2):
  drm/amd/display: revert "for FPO & SubVP/DRR config program vmin/max"
  drm/amd/display: revert "Optimize VRR updates to only necessary ones"

Martin Tsai (1):
  drm/amd/display: To adjust dprefclk by down spread percentage

Meenakshikumar Somasundaram (1):
  drm/amd/display: Dpia hpd status not in sync after S4

Melissa Wen (1):
  drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color

Nathan Chancellor (1):
  drm/amd/display: Avoid enum conversion warning

Peichen Huang (1):
  drm/amd/display: Request usb4 bw for mst streams

Philip Yang (1):
  drm/amdkfd: Fix lock dependency warning with srcu

Srinivasan Shanmugam (6):
  drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 
'get_platform_power_management_table()'
  drm/amdgpu: Fix with right return code '-EIO' in 
'amdgpu_gmc_vram_checking()'
  drm/amdgpu: Fix unsigned comparison with less than zero in 
vpe_u1_8_from_fraction()
  drm/amdgpu: Release 'adev->pm.fw' before return in 
'amdgpu_device_need_post()'
  drm/amd/display: Fix variable deferencing before NULL check in 
edp_setup_replay()
  drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()'

Victor Lu (1):
  drm/amdgpu: Do not program VM_L2_CNTL under SRIOV

Yifan Zhang (3):
  drm/amdgpu: update headers for nbio v7.11
  drm/amdgpu: update ATHUB_MISC_CNTL offset for athub v3.3
  drm/amdgpu: update regGL2C_CTRL4 value in golden setting

 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 12 +++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c   | 43 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |  6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c  |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c|  4 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 15 ++---
 drivers/gpu/drm/amd/amdgpu/amdg

[PATCH v4 1/7] drm/amdkfd: Add helper function svm_range_need_access_gpus

2024-01-15 Thread Philip Yang
Add a helper function to get the bitmap of all GPUs that need to access
the svm range. This helper will be used in the following patch to check
if prange is mapped on all GPUs.

Refactor svm_range_validate_and_map to use the helper function; no
functional change.

Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 74 
 1 file changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 18f8c82a849c..14dbc0fd51a9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1169,6 +1169,44 @@ svm_range_add_child(struct svm_range *prange, struct 
mm_struct *mm,
list_add_tail(&pchild->child_list, &prange->child_list);
 }
 
+static int
+svm_range_need_access_gpus(unsigned long *bitmap, struct svm_range *prange)
+{
+   struct kfd_process *p = container_of(prange->svms, struct kfd_process, 
svms);
+   u32 gpuidx;
+
+   if (p->xnack_enabled) {
+   bitmap_copy(bitmap, prange->bitmap_aip, MAX_GPU_INSTANCE);
+
+   /* If prefetch range to GPU, or GPU retry fault migrate range to
+* GPU, which has ACCESS attribute to the range, create mapping
+* on that GPU.
+*/
+   if (prange->actual_loc) {
+   gpuidx = kfd_process_gpuidx_from_gpuid(p, 
prange->actual_loc);
+   if (gpuidx < 0)
+   return -EINVAL;
+
+   if (test_bit(gpuidx, prange->bitmap_access))
+   bitmap_set(bitmap, gpuidx, 1);
+   }
+
+   /*
+* If prange is already mapped or with always mapped flag,
+* update mapping on GPUs with ACCESS attribute
+*/
+   if (bitmap_empty(bitmap, MAX_GPU_INSTANCE)) {
+   if (prange->mapped_to_gpu ||
+   prange->flags & 
KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)
+   bitmap_copy(bitmap, prange->bitmap_access, 
MAX_GPU_INSTANCE);
+   }
+   } else {
+   bitmap_or(bitmap, prange->bitmap_access,
+ prange->bitmap_aip, MAX_GPU_INSTANCE);
+   }
+   return 0;
+}
+
 static bool
 svm_nodes_in_same_hive(struct kfd_node *node_a, struct kfd_node *node_b)
 {
@@ -1609,38 +1647,12 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
if (gpuidx < MAX_GPU_INSTANCE) {
bitmap_zero(ctx->bitmap, MAX_GPU_INSTANCE);
bitmap_set(ctx->bitmap, gpuidx, 1);
-   } else if (ctx->process->xnack_enabled) {
-   bitmap_copy(ctx->bitmap, prange->bitmap_aip, MAX_GPU_INSTANCE);
-
-   /* If prefetch range to GPU, or GPU retry fault migrate range to
-* GPU, which has ACCESS attribute to the range, create mapping
-* on that GPU.
-*/
-   if (prange->actual_loc) {
-   gpuidx = kfd_process_gpuidx_from_gpuid(ctx->process,
-   prange->actual_loc);
-   if (gpuidx < 0) {
-   WARN_ONCE(1, "failed get device by id 0x%x\n",
-prange->actual_loc);
-   r = -EINVAL;
-   goto free_ctx;
-   }
-   if (test_bit(gpuidx, prange->bitmap_access))
-   bitmap_set(ctx->bitmap, gpuidx, 1);
-   }
-
-   /*
-* If prange is already mapped or with always mapped flag,
-* update mapping on GPUs with ACCESS attribute
-*/
-   if (bitmap_empty(ctx->bitmap, MAX_GPU_INSTANCE)) {
-   if (prange->mapped_to_gpu ||
-   prange->flags & 
KFD_IOCTL_SVM_FLAG_GPU_ALWAYS_MAPPED)
-   bitmap_copy(ctx->bitmap, prange->bitmap_access, 
MAX_GPU_INSTANCE);
-   }
} else {
-   bitmap_or(ctx->bitmap, prange->bitmap_access,
- prange->bitmap_aip, MAX_GPU_INSTANCE);
+   r = svm_range_need_access_gpus(ctx->bitmap, prange);
+   if (r) {
+   WARN_ONCE(1, "failed get device by id 0x%x\n", 
prange->actual_loc);
+   goto free_ctx;
+   }
}
 
if (bitmap_empty(ctx->bitmap, MAX_GPU_INSTANCE)) {
-- 
2.35.1



[PATCH v4 3/7] drm/amdkfd: Add granularity size based bitmap map flag

2024-01-15 Thread Philip Yang
Replace prange->mapped_to_gpu with prange->bitmap_map[], a per-GPU flag
whose bitmap bits are based on the prange granularity. Align the
map-to-GPU or unmap-from-GPU range size to the granularity size and
update the corresponding bitmap_map flag bits. This optimizes multi-GPU
map, unmap and retry-fault recovery.

svm_range_partial_mapped is false only if no part of the range is mapped
on any GPU.

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 258 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |   7 +-
 2 files changed, 219 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index a2c96f5760ff..a003406db067 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -307,12 +307,12 @@ static void svm_range_free(struct svm_range *prange, bool 
do_unmap)
KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, 0);
}
 
-   /* free dma_addr array for each gpu */
+   /* free dma_addr array, bitmap_map for each gpu */
for (gpuidx = 0; gpuidx < MAX_GPU_INSTANCE; gpuidx++) {
-   if (prange->dma_addr[gpuidx]) {
+   if (prange->dma_addr[gpuidx])
kvfree(prange->dma_addr[gpuidx]);
-   prange->dma_addr[gpuidx] = NULL;
-   }
+   if (prange->bitmap_map[gpuidx])
+   bitmap_free(prange->bitmap_map[gpuidx]);
}
 
mutex_destroy(&prange->lock);
@@ -338,19 +338,38 @@ svm_range *svm_range_new(struct svm_range_list *svms, 
uint64_t start,
uint64_t size = last - start + 1;
struct svm_range *prange;
struct kfd_process *p;
-
-   prange = kzalloc(sizeof(*prange), GFP_KERNEL);
-   if (!prange)
-   return NULL;
+   unsigned int nbits;
+   u32 gpuidx;
 
p = container_of(svms, struct kfd_process, svms);
if (!p->xnack_enabled && update_mem_usage &&
amdgpu_amdkfd_reserve_mem_limit(NULL, size << PAGE_SHIFT,
KFD_IOC_ALLOC_MEM_FLAGS_USERPTR, 0)) {
pr_info("SVM mapping failed, exceeds resident system memory 
limit\n");
-   kfree(prange);
return NULL;
}
+
+   prange = kzalloc(sizeof(*prange), GFP_KERNEL);
+   if (!prange)
+   return NULL;
+
+   svm_range_set_default_attributes(&prange->preferred_loc,
+&prange->prefetch_loc,
+&prange->granularity, &prange->flags);
+
+   nbits = svm_range_map_nbits(start, last, prange->granularity);
+   pr_debug("prange 0x%p [0x%llx 0x%llx] bitmap_map nbits %d\n", prange,
+start, last, nbits);
+   for_each_set_bit(gpuidx, p->svms.bitmap_supported, p->n_pdds) {
+   prange->bitmap_map[gpuidx] = bitmap_zalloc(nbits, GFP_KERNEL);
+   if (!prange->bitmap_map[gpuidx]) {
+   while (gpuidx--)
+   bitmap_free(prange->bitmap_map[gpuidx]);
+   kfree(prange);
+   return NULL;
+   }
+   }
+
prange->npages = size;
prange->svms = svms;
prange->start = start;
@@ -369,10 +388,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, 
uint64_t start,
bitmap_copy(prange->bitmap_access, svms->bitmap_supported,
MAX_GPU_INSTANCE);
 
-   svm_range_set_default_attributes(&prange->preferred_loc,
-&prange->prefetch_loc,
-&prange->granularity, &prange->flags);
-
pr_debug("svms 0x%p [0x%llx 0x%llx]\n", svms, start, last);
 
return prange;
@@ -1017,6 +1032,51 @@ svm_range_split_nodes(struct svm_range *new, struct 
svm_range *old,
return 0;
 }
 
+static int
+svm_range_split_bitmap_map(struct svm_range *new, struct svm_range *old,
+  u64 start, u64 last)
+{
+   struct kfd_process *p = container_of(new->svms, struct kfd_process, 
svms);
+   u32 new_nbits, old_nbits, old_nbits2;
+   unsigned long *bits;
+   u32 gpuidx;
+
+   new_nbits = svm_range_map_nbits(new->start, new->last, 
new->granularity);
+   old_nbits = svm_range_map_nbits(old->start, old->last, 
old->granularity);
+   old_nbits2 = svm_range_map_nbits(start, last, old->granularity);
+
+   pr_debug("old 0x%p [0x%lx 0x%lx] => [0x%llx 0x%llx] nbits %d => %d\n",
+old, old->start, old->last, start, last, old_nbits, 
old_nbits2);
+   pr_debug("new 0x%p [0x%lx 0x%lx] nbits %d\n", new, new->start, 
new->last,
+new_nbits);
+
+   for_each_set_bit(gpuidx, p->svms.bitmap_supported, p->n_pdds) {
+   bits = bitmap_alloc(old_nbits2, GFP_KERNEL);
+   if (!bits)
+

[PATCH v4 6/7] drm/amdkfd: Check bitmap_map flag to skip retry fault

2024-01-15 Thread Philip Yang
Remove prange->validate_timestamp, which is not accurate for multiple
GPUs.

Use the bitmap_map flag to skip retry faults from different pages of
the same granularity range if that granularity range is already mapped
on the specific GPU.

Signed-off-by: Philip Yang 
Reviewed-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 24 
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 -
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index ebc4cce801bf..b36d997e7a3d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -45,10 +45,6 @@
 
 #define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
 
-/* Long enough to ensure no retry fault comes after svm range is restored and
- * page table is updated.
- */
-#define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING   (2UL * NSEC_PER_MSEC)
 #if IS_ENABLED(CONFIG_DYNAMIC_DEBUG)
 #define dynamic_svm_range_dump(svms) \
_dynamic_func_call_no_desc("svm_range_dump", svm_range_debug_dump, svms)
@@ -380,7 +376,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, 
uint64_t start,
INIT_LIST_HEAD(&prange->deferred_list);
INIT_LIST_HEAD(&prange->child_list);
atomic_set(&prange->invalid, 0);
-   prange->validate_timestamp = 0;
mutex_init(&prange->migrate_mutex);
mutex_init(&prange->lock);
 
@@ -1965,8 +1960,6 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
}
 
svm_range_unreserve_bos(ctx);
-   if (!r)
-   prange->validate_timestamp = ktime_get_boottime();
 
 free_ctx:
kfree(ctx);
@@ -3226,15 +3219,6 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
goto out_unlock_mm;
}
 
-   /* skip duplicate vm fault on different pages of same range */
-   if (ktime_before(timestamp, ktime_add_ns(prange->validate_timestamp,
-   AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING))) {
-   pr_debug("svms 0x%p [0x%lx %lx] already restored\n",
-svms, prange->start, prange->last);
-   r = 0;
-   goto out_unlock_mm;
-   }
-
/* __do_munmap removed VMA, return success as we are handling stale
 * retry fault.
 */
@@ -3260,6 +3244,14 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
goto out_unlock_mm;
}
 
+   /* skip duplicate vm fault on different pages of same granularity range 
*/
+   if (svm_range_partial_mapped_dev(gpuidx, prange, addr, addr)) {
+   pr_debug("svms 0x%p [0x%lx %lx] addr 0x%llx already mapped on 
gpu %d\n",
+svms, prange->start, prange->last, addr, gpuidx);
+   r = 0;
+   goto out_unlock_mm;
+   }
+
mutex_lock(&prange->migrate_mutex);
 
pr_debug("svms %p [0x%lx 0x%lx] best restore 0x%x, actual loc 0x%x\n",
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index a10eeb77f83e..5a9688d5c18c 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -129,7 +129,6 @@ struct svm_range {
uint32_tactual_loc;
uint8_t granularity;
atomic_tinvalid;
-   ktime_t validate_timestamp;
struct mmu_interval_notifiernotifier;
struct svm_work_list_item   work_item;
struct list_headdeferred_list;
-- 
2.35.1



[PATCH v4 4/7] amd/amdkfd: Unmap range from GPU based on granularity

2024-01-15 Thread Philip Yang
When the MMU notifier invalidates the range, align the start and last
addresses to the range granularity before unmapping from the GPU and
updating the bitmap_map flag. Skip the GPU unmap if the range is already
unmapped according to the bitmap_map flag. This avoids unmapping a single
page from the GPU and flushing the TLB, and also solves the rocgdb CWSR
migration related issue.

Unmapping the range from the CPU removes and splits the range, so the
start and last addresses cannot be aligned to the range granularity there.
Therefore split the range and its bitmap_map flag first, then unmap the
range from the GPU. If the GPU unmap happened first, the bitmap_map flag
would already be updated and the split could compute an incorrect
bitmap_map for the remaining ranges.
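
For illustration only (not part of the change), a small worked example
assuming granularity 9 (512-page units):

/*
 * an invalidation of [0x1250, 0x1260] inside a range [0x1000, 0x1fff] is
 * widened to a GPU unmap of [0x1200, 0x13ff], i.e. the whole 512-page
 * unit, and the bitmap_map bit for that unit is cleared so the next retry
 * fault in that unit is not skipped.
 */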

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 42 +++-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index a003406db067..7a30c3e58234 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2114,6 +2114,13 @@ svm_range_evict(struct svm_range *prange, struct 
mm_struct *mm,
} else {
unsigned long s, l;
uint32_t trigger;
+   u64 size = 1UL << prange->granularity;
+
+   if (!svm_range_partial_mapped(prange, start, last)) {
+   pr_debug("svms 0x%p [0x%lx 0x%lx] unmapped already\n",
+prange->svms, start, last);
+   return 0;
+   }
 
if (event == MMU_NOTIFY_MIGRATE)
trigger = KFD_SVM_UNMAP_TRIGGER_MMU_NOTIFY_MIGRATE;
@@ -2122,16 +2129,17 @@ svm_range_evict(struct svm_range *prange, struct 
mm_struct *mm,
 
pr_debug("invalidate unmap svms 0x%p [0x%lx 0x%lx] from GPUs\n",
 prange->svms, start, last);
+
list_for_each_entry(pchild, &prange->child_list, child_list) {
mutex_lock_nested(&pchild->lock, 1);
-   s = max(start, pchild->start);
-   l = min(last, pchild->last);
+   s = svm_range_align_start(start, pchild->start, size);
+   l = svm_range_align_last(last, pchild->last, size);
if (l >= s)
svm_range_unmap_from_gpus(pchild, s, l, 
trigger);
mutex_unlock(&pchild->lock);
}
-   s = max(start, prange->start);
-   l = min(last, prange->last);
+   s = svm_range_align_start(start, prange->start, size);
+   l = svm_range_align_last(last, prange->last, size);
if (l >= s)
svm_range_unmap_from_gpus(prange, s, l, trigger);
}
@@ -2645,24 +2653,32 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, struct 
svm_range *prange,
 
list_for_each_entry(pchild, &prange->child_list, child_list) {
mutex_lock_nested(&pchild->lock, 1);
-   s = max(start, pchild->start);
-   l = min(last, pchild->last);
-   if (l >= s)
-   svm_range_unmap_from_gpus(pchild, s, l, trigger);
svm_range_unmap_split(mm, prange, pchild, start, last);
mutex_unlock(&pchild->lock);
}
-   s = max(start, prange->start);
-   l = min(last, prange->last);
-   if (l >= s)
-   svm_range_unmap_from_gpus(prange, s, l, trigger);
svm_range_unmap_split(mm, prange, prange, start, last);
-
if (unmap_parent)
svm_range_add_list_work(svms, prange, mm, SVM_OP_UNMAP_RANGE);
else
svm_range_add_list_work(svms, prange, mm,
SVM_OP_UPDATE_RANGE_NOTIFIER);
+
+   list_for_each_entry(pchild, &prange->child_list, child_list) {
+   if (pchild->work_item.op != SVM_OP_UNMAP_RANGE)
+   continue;
+
+   s = max(start, pchild->start);
+   l = min(last, pchild->last);
+   if (l >= s)
+   svm_range_unmap_from_gpus(pchild, s, l, trigger);
+   }
+   if (prange->work_item.op == SVM_OP_UNMAP_RANGE) {
+   s = max(start, prange->start);
+   l = min(last, prange->last);
+   if (l >= s)
+   svm_range_unmap_from_gpus(prange, s, l, trigger);
+   }
+
schedule_deferred_list_work(svms);
 
kfd_unref_process(p);
-- 
2.35.1



[PATCH v4 5/7] drm/amdkfd: Change range granularity update bitmap_map

2024-01-15 Thread Philip Yang
When changing the svm range granularity, update the svm range
bitmap_map based on the new range granularity.
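
For illustration only (not part of the change), a worked example of the
remapping, assuming a granularity-aligned 1024-page range:

/*
 * granularity 9: one bit covers 512 pages -> the range needs 2 bits
 * granularity 8: one bit covers 256 pages -> the range needs 4 bits
 *
 * growing  (2 -> 4 bits, n = 2): old set bit i sets new bits [i*2, i*2+1]
 *     old 0b10   (second 512-page unit mapped)  ->  new 0b1100
 * shrinking (4 -> 2 bits, n = 2): old set bit i sets new bit i/2
 *     old 0b0100 (third 256-page unit mapped)   ->  new 0b10
 */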

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 49 +++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 7a30c3e58234..ebc4cce801bf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -757,6 +757,53 @@ svm_range_check_attr(struct kfd_process *p,
return 0;
 }
 
+static void
+svm_range_change_granularity(struct svm_range *prange, u8 value)
+{
+   struct kfd_process *p = container_of(prange->svms, struct kfd_process, 
svms);
+   u32 new_nbits, old_nbits, i, n;
+   unsigned long *new_bits, *old_bits;
+   u32 gpuidx;
+
+   if (prange->granularity == value)
+   return;
+
+   old_nbits = svm_range_map_nbits(prange->start, prange->last, 
prange->granularity);
+   new_nbits = svm_range_map_nbits(prange->start, prange->last, value);
+   if (new_nbits > old_nbits) {
+   n = new_nbits / old_nbits;
+   if (new_nbits % old_nbits)
+   n++;
+   } else {
+   n = old_nbits / new_nbits;
+   if (old_nbits % new_nbits)
+   n++;
+   }
+
+   pr_debug("prange 0x%p [0x%lx 0x%lx] bitmap_map nbits %d -> %d\n",
+prange, prange->start, prange->last, old_nbits, new_nbits);
+
+   for_each_set_bit(gpuidx, p->svms.bitmap_supported, p->n_pdds) {
+   old_bits = prange->bitmap_map[gpuidx];
+   if (bitmap_empty(old_bits, old_nbits))
+   continue;
+
+   new_bits = bitmap_zalloc(new_nbits, GFP_KERNEL);
+   if (!new_bits)
+   return;
+
+   for_each_set_bit(i, old_bits, old_nbits) {
+   if (new_nbits > old_nbits)
+   bitmap_set(new_bits, i * n, n);
+   else
+   bitmap_set(new_bits, i / n, 1);
+   }
+   prange->bitmap_map[gpuidx] = new_bits;
+   bitmap_free(old_bits);
+   }
+   prange->granularity = value;
+}
+
 static void
 svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange,
  uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs,
@@ -801,7 +848,7 @@ svm_range_apply_attrs(struct kfd_process *p, struct 
svm_range *prange,
prange->flags &= ~attrs[i].value;
break;
case KFD_IOCTL_SVM_ATTR_GRANULARITY:
-   prange->granularity = min_t(uint32_t, attrs[i].value, 
0x3F);
+   svm_range_change_granularity(prange, min_t(u32, 
attrs[i].value, 0x3F));
break;
default:
WARN_ONCE(1, "svm_range_check_attrs wasn't called?");
-- 
2.35.1



[PATCH v4 2/7] drm/amdkfd: Add helper function align range start last

2024-01-15 Thread Philip Yang
Calculate the range start and last addresses aligned to the range
granularity size. This removes duplicated code, and the helper functions
will be used in a future patch to handle map and unmap to the GPU based
on range granularity. No functional change.
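
For illustration only (not part of the change), a minimal user-space
sketch of the math the two helpers perform; the simplified ALIGN macros
and the example numbers are assumptions, not the kernel definitions:

#include <stdio.h>
#include <stdint.h>

/* simplified stand-ins for ALIGN/ALIGN_DOWN (power-of-two sizes only) */
#define ALIGN_DOWN(x, a)  ((x) & ~((uint64_t)(a) - 1))
#define ALIGN(x, a)       ALIGN_DOWN((x) + (a) - 1, (a))

static uint64_t align_start(uint64_t addr, uint64_t range_start, uint64_t size)
{
        uint64_t s = ALIGN_DOWN(addr, size);            /* round fault addr down */
        return s > range_start ? s : range_start;       /* clamp to range start */
}

static uint64_t align_last(uint64_t addr, uint64_t range_last, uint64_t size)
{
        uint64_t l = ALIGN(addr + 1, size) - 1;         /* round fault addr up */
        return l < range_last ? l : range_last;         /* clamp to range last */
}

int main(void)
{
        uint64_t size = 1ULL << 9;      /* granularity 9 -> 512-page units */

        /* a fault at page 0x1234 inside a range [0x1000, 0x1fff] */
        printf("start 0x%llx last 0x%llx\n",
               (unsigned long long)align_start(0x1234, 0x1000, size),
               (unsigned long long)align_last(0x1234, 0x1fff, size));
        /* prints: start 0x1200 last 0x13ff */
        return 0;
}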

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 10 --
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 10 ++
 3 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index dae05f70257b..64eb9023d66b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -986,8 +986,8 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
 
/* Align migration range start and size to granularity size */
size = 1UL << prange->granularity;
-   start = max(ALIGN_DOWN(addr, size), prange->start);
-   last = min(ALIGN(addr + 1, size) - 1, prange->last);
+   start = svm_range_align_start(addr, prange->start, size);
+   last = svm_range_align_last(addr, prange->last, size);
 
r = svm_migrate_vram_to_ram(prange, vmf->vma->vm_mm, start, last,
KFD_MIGRATE_TRIGGER_PAGEFAULT_CPU, 
vmf->page);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 14dbc0fd51a9..a2c96f5760ff 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2698,10 +2698,8 @@ svm_range_get_range_boundaries(struct kfd_process *p, 
int64_t addr,
 (vma->vm_start <= vma->vm_mm->start_stack &&
  vma->vm_end >= vma->vm_mm->start_stack);
 
-   start_limit = max(vma->vm_start >> PAGE_SHIFT,
- (unsigned long)ALIGN_DOWN(addr, 2UL << 8));
-   end_limit = min(vma->vm_end >> PAGE_SHIFT,
-   (unsigned long)ALIGN(addr + 1, 2UL << 8));
+   start_limit = svm_range_align_start(addr, vma->vm_start >> PAGE_SHIFT, 
2UL << 8);
+   end_limit = svm_range_align_last(addr, (vma->vm_end >> PAGE_SHIFT) - 1, 
2UL << 8) + 1;
/* First range that starts after the fault address */
node = interval_tree_iter_first(&p->svms.objects, addr + 1, ULONG_MAX);
if (node) {
@@ -3043,8 +3041,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
 
/* Align migration range start and size to granularity size */
size = 1UL << prange->granularity;
-   start = max_t(unsigned long, ALIGN_DOWN(addr, size), prange->start);
-   last = min_t(unsigned long, ALIGN(addr + 1, size) - 1, prange->last);
+   start = svm_range_align_start(addr, prange->start, size);
+   last = svm_range_align_last(addr, prange->last, size);
if (prange->actual_loc != 0 || best_loc != 0) {
migration = true;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
index 026863a0abcd..806bcac6d101 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h
@@ -159,6 +159,16 @@ static inline struct svm_range_bo *svm_range_bo_ref(struct 
svm_range_bo *svm_bo)
return svm_bo;
 }
 
+static inline u64 svm_range_align_start(u64 addr, u64 range_start, u64 
align_size)
+{
+   return max(ALIGN_DOWN(addr, align_size), range_start);
+}
+
+static inline u64 svm_range_align_last(u64 addr, u64 range_last, u64 
align_size)
+{
+   return min(ALIGN(addr + 1, align_size) - 1, range_last);
+}
+
 int svm_range_list_init(struct kfd_process *p);
 void svm_range_list_fini(struct kfd_process *p);
 int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start,
-- 
2.35.1



[PATCH v4 7/7] drm/amdkfd: Wait update sdma fence before tlb flush

2024-01-15 Thread Philip Yang
If SDMA is used to update the GPU page table, the kfd TLB flush does
nothing when the VM update fence callback hasn't updated vm->tlb_seq yet.
This works today only because a retry fault will come in, update the page
table again and finally flush the TLB.

With the bitmap_map flag, retry fault recovery only updates the GPU page
table once, so we have to wait for the SDMA update fence and then flush
the TLB.

No change when the CPU updates the GPU page table (large BAR), because
there is no VM update fence in that case.

Remove the wait parameter of svm_range_validate_and_map because it is
always called with true now.
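
For illustration only (not part of the change), a rough sketch of the
resulting control flow when SDMA updates the page tables; apart from
dma_fence_wait()/dma_fence_put() and kfd_flush_tlb(), the names are
simplified stand-ins rather than the exact code:

        r = map_range_to_gpu(prange, ..., &fence);      /* SDMA page table update */
        if (!r && fence) {
                r = dma_fence_wait(fence, false);       /* wait for the update to land */
                dma_fence_put(fence);
                fence = NULL;
        }
        if (!r && flush_tlb)
                kfd_flush_tlb(pdd, TLB_FLUSH_LEGACY);   /* flush only after the fence signaled */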

Signed-off-by: Philip Yang 
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index b36d997e7a3d..9e5f6e12c498 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -1677,7 +1677,7 @@ svm_range_map_to_gpu(struct kfd_process_device *pdd, 
struct svm_range *prange,
 static int
 svm_range_map_to_gpus(struct svm_range *prange, unsigned long offset,
  unsigned long npages, bool readonly,
- unsigned long *bitmap, bool wait, bool flush_tlb)
+ unsigned long *bitmap, bool flush_tlb)
 {
struct kfd_process_device *pdd;
struct amdgpu_device *bo_adev = NULL;
@@ -1710,8 +1710,7 @@ svm_range_map_to_gpus(struct svm_range *prange, unsigned 
long offset,
 
r = svm_range_map_to_gpu(pdd, prange, offset, npages, readonly,
 prange->dma_addr[gpuidx],
-bo_adev, wait ? &fence : NULL,
-flush_tlb);
+bo_adev, &fence, flush_tlb);
if (r)
break;
 
@@ -1837,7 +1836,7 @@ static void *kfd_svm_page_owner(struct kfd_process *p, 
int32_t gpuidx)
 static int svm_range_validate_and_map(struct mm_struct *mm,
  unsigned long map_start, unsigned long 
map_last,
  struct svm_range *prange, int32_t gpuidx,
- bool intr, bool wait, bool flush_tlb)
+ bool intr, bool flush_tlb)
 {
struct svm_validate_context *ctx;
unsigned long start, end, addr;
@@ -1950,7 +1949,7 @@ static int svm_range_validate_and_map(struct mm_struct 
*mm,
offset = map_start_vma - prange->start;
npages = map_last_vma - map_start_vma + 1;
r = svm_range_map_to_gpus(prange, offset, 
npages, readonly,
- ctx->bitmap, wait, 
flush_tlb);
+ ctx->bitmap, 
flush_tlb);
}
}
 
@@ -2041,7 +2040,7 @@ static void svm_range_restore_work(struct work_struct 
*work)
mutex_lock(&prange->migrate_mutex);
 
r = svm_range_validate_and_map(mm, prange->start, prange->last, 
prange,
-  MAX_GPU_INSTANCE, false, true, 
false);
+  MAX_GPU_INSTANCE, false, false);
if (r)
pr_debug("failed %d to map 0x%lx to gpus\n", r,
 prange->start);
@@ -3303,7 +3302,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, 
unsigned int pasid,
mmap_read_lock(mm);
 
r = svm_range_validate_and_map(mm, start, last, prange, gpuidx, false,
-  false, false);
+  false);
if (r)
pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n",
 r, svms, start, last);
@@ -3847,7 +3846,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
flush_tlb = !migrated && update_mapping &&
svm_range_partial_mapped(prange, prange->start, 
prange->last);
r = svm_range_validate_and_map(mm, prange->start, prange->last, 
prange,
-  MAX_GPU_INSTANCE, true, true, 
flush_tlb);
+  MAX_GPU_INSTANCE, true, 
flush_tlb);
if (r)
pr_debug("failed %d to map svm range\n", r);
 
@@ -3863,7 +3862,7 @@ svm_range_set_attr(struct kfd_process *p, struct 
mm_struct *mm,
mutex_lock(&prange->migrate_mutex);
flush_tlb = svm_range_partial_mapped(prange, prange->start, 
prange->last);
r = svm_range_validate_and_map(mm,  prange->start, 
prange->last, prange,
-  MAX_GPU_INSTANCE, true, true, 
flush_tlb);
+  MAX_GPU_INS

[PATCH] drm/amdgpu: Remove unnecessary NULL check

2024-01-15 Thread Felix Kuehling
A static checker pointed out that bo_va->base.bo was already dereferenced
earlier in the same scope. Therefore this check is unnecessary here.

Reported-by: Dan Carpenter 
Fixes: 79e7fdec71f2 ("drm/amdgpu: Auto-validate DMABuf imports in compute VMs")
Signed-off-by: Felix Kuehling 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 77d015ebb201..b7f07cc52b1b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -1477,7 +1477,7 @@ int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
/* Remember evicted DMABuf imports in compute VMs for later
 * validation
 */
-   if (vm->is_compute_context && bo_va->base.bo &&
+   if (vm->is_compute_context &&
bo_va->base.bo->tbo.base.import_attach &&
(!bo_va->base.bo->tbo.resource ||
 bo_va->base.bo->tbo.resource->mem_type == TTM_PL_SYSTEM))
-- 
2.34.1



Re: [pull] amdgpu, amdkfd drm-fixes-6.8

2024-01-15 Thread Felix Kuehling



On 2024-01-15 17:08, Alex Deucher wrote:

Hi Dave, Sima,

Same PR as Friday, but with the new clang warning fixed.

The following changes since commit e54478fbdad20f2c58d0a4f99d01299ed8e7fe9c:

   Merge tag 'amd-drm-next-6.8-2024-01-05' of 
https://gitlab.freedesktop.org/agd5f/linux into drm-next (2024-01-09 09:07:50 
+1000)

are available in the Git repository at:

   https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.8-2024-01-15

for you to fetch changes up to 86718cf93237a0f45773bbc49b6006733fc4e051:

   drm/amd/display: Avoid enum conversion warning (2024-01-15 16:28:05 -0500)


amd-drm-fixes-6.8-2024-01-15:

amdgpu:
- SubVP fixes
- VRR fixes
- USB4 fixes
- DCN 3.5 fixes
- GFX11 harvesting fix
- RAS fixes
- Misc small fixes
- KFD dma-buf import fixes
- Power reporting fixes
- ATHUB 3.3 fix
- SR-IOV fix
- Add missing fw release for fiji
- GFX 11.5 fix
- Debugging module parameter fix
- SMU 13.0.6 fixes
- Fix new clang warning

amdkfd:
- Fix lockdep warnings
- Fix sparse __rcu warnings
- Bump interface version so userspace knows that the kernel supports dma-bufs 
exported from KFD
   Most of the fixes for this went into 6.7, but the last fix is in this PR
- HMM fix
- SVM fix


Alex Deucher (4):
   drm/amdgpu: fix avg vs input power reporting on smu7
   drm/amdgpu: fall back to INPUT power for AVG power via INFO IOCTL
   drm/amdgpu/pm: clarify debugfs pm output
   drm/amdgpu: drop exp hw support check for GC 9.4.3

Aric Cyr (1):
   drm/amd/display: 3.2.266

Candice Li (2):
   drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
   drm/amdgpu: Support poison error injection via ras_ctrl debugfs

Charlene Liu (1):
   drm/amd/display: Update z8 latency

Dafna Hirschfeld (1):
   drm/amdkfd: fixes for HMM mem allocation

Daniel Miess (1):
   Revert "drm/amd/display: Fix conversions between bytes and KB"

Felix Kuehling (4):
   drm/amdkfd: Fix lock dependency warning
   drm/amdkfd: Fix sparse __rcu annotation warnings
   drm/amdgpu: Auto-validate DMABuf imports in compute VMs
   drm/amdkfd: Bump KFD ioctl version


I don't think the last two patches should be on -fixes. This is really 
for a new interoperability feature and not meant to fix other 
functionality that was already expected to work. That's why the user 
mode code checks for the API version.


Regards,
  Felix




George Shen (1):
   drm/amd/display: Disconnect phantom pipe OPP from OPTC being disabled

Hawking Zhang (1):
   drm/amdgpu: Packed socket_id to ras feature mask

Ivan Lipski (1):
   Revert "drm/amd/display: fix bandwidth validation failure on DCN 2.1"

James Zhu (1):
   drm/amdgpu: make a correction on comment

Le Ma (3):
   Revert "drm/amdgpu: add param to specify fw bo location for front-door 
loading"
   drm/amdgpu: add debug flag to place fw bo on vram for frontdoor loading
   drm/amdgpu: move debug options init prior to amdgpu device init

Lijo Lazar (2):
   drm/amd/pm: Add error log for smu v13.0.6 reset
   drm/amd/pm: Fix smuv13.0.6 current clock reporting

Likun Gao (1):
   drm/amdgpu: correct the cu count for gfx v11

Martin Leung (2):
   drm/amd/display: revert "for FPO & SubVP/DRR config program vmin/max"
   drm/amd/display: revert "Optimize VRR updates to only necessary ones"

Martin Tsai (1):
   drm/amd/display: To adjust dprefclk by down spread percentage

Meenakshikumar Somasundaram (1):
   drm/amd/display: Dpia hpd status not in sync after S4

Melissa Wen (1):
   drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color

Nathan Chancellor (1):
   drm/amd/display: Avoid enum conversion warning

Peichen Huang (1):
   drm/amd/display: Request usb4 bw for mst streams

Philip Yang (1):
   drm/amdkfd: Fix lock dependency warning with srcu

Srinivasan Shanmugam (6):
   drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 
'get_platform_power_management_table()'
   drm/amdgpu: Fix with right return code '-EIO' in 
'amdgpu_gmc_vram_checking()'
   drm/amdgpu: Fix unsigned comparison with less than zero in 
vpe_u1_8_from_fraction()
   drm/amdgpu: Release 'adev->pm.fw' before return in 
'amdgpu_device_need_post()'
   drm/amd/display: Fix variable deferencing before NULL check in 
edp_setup_replay()
   drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()'

Victor Lu (1):
   drm/amdgpu: Do not program VM_L2_CNTL under SRIOV

Yifan Zhang (3):
   drm/amdgpu: update headers for nbio v7.11
   drm/amdgpu: update ATHUB_MISC_CNTL offset for athub v3.3
   drm/amdgpu: update regGL2C_CTRL4 value in golden setting

  drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 +-
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 12 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_g

Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Marek Olšák
On Mon, Jan 15, 2024 at 11:41 AM Michel Dänzer  wrote:
>
> On 2024-01-15 17:19, Friedrich Vock wrote:
> > On 15.01.24 16:43, Joshua Ashton wrote:
> >> On 1/15/24 15:25, Michel Dänzer wrote:
> >>> On 2024-01-15 14:17, Christian König wrote:
>  Am 15.01.24 um 12:37 schrieb Joshua Ashton:
> > On 1/15/24 09:40, Christian König wrote:
> >> Am 13.01.24 um 15:02 schrieb Joshua Ashton:
> >>
> >>> Without this feedback, the application may keep pushing through
> >>> the soft
> >>> recoveries, continually hanging the system with jobs that timeout.
> >>
> >> Well, that is intentional behavior. Marek is voting for making
> >> soft recovered errors fatal as well while Michel is voting for
> >> better ignoring them.
> >>
> >> I'm not really sure what to do. If you guys think that soft
> >> recovered hangs should be fatal as well then we can certainly do
> >> this.
> >>>
> >>> A possible compromise might be making soft resets fatal if they
> >>> happen repeatedly (within a certain period of time?).
> >>
> >> No, no and no. Aside from introducing issues by side effects not
> >> surfacing and all of the stuff I mentioned about descriptor buffers,
> >> bda, draw indirect and stuff just resulting in more faults and hangs...
> >>
> >> You are proposing we throw out every promise we made to an application
> >> on the API contract level because it "might work". That's just wrong!
> >>
> >> Let me put this in explicit terms: What you are proposing is in direct
> >> violation of the GL and Vulkan specification.
> >>
> >> You can't just chose to break these contracts because you think it
> >> 'might' be a better user experience.
> >
> > Is the original issue that motivated soft resets to be non-fatal even an
> > issue anymore?
> >
> > If I read that old thread correctly, the rationale for that was that
> > assigning guilt to a context was more broken than not doing it, because
> > the compositor/Xwayland process would also crash despite being unrelated
> > to the hang.
> > With Joshua's Mesa fixes, this is not the case anymore, so I don't think
> > keeping soft resets non-fatal provides any benefit to the user experience.
> > The potential detriments to user experience have been outlined multiple
> > times in this thread already.
> >
> > (I suppose if the compositor itself faults it might still bring down a
> > session, but I've literally never seen that, and it's not like a
> > compositor triggering segfaults on CPU stays alive either.)
>
> That's indeed what happened for me, multiple times. And each time the session 
> continued running fine for days after the soft reset.
>
> But apparently my experience isn't valid somehow, and I should have been 
> forced to log in again to please the GL gods...
>
>
> Conversely, I can't remember hitting a case where an app kept running into 
> soft resets. It's almost as if different people may have different 
> experiences! ;)
>
> Note that I'm not saying that case can't happen. Making soft resets fatal 
> only if they happen repeatedly could address both issues, rather than only 
> one or the other. Seems like a win-win.

This is exactly the comment that shouldn't have been sent, and you are
not the only one.

Nobody should ever care about subjective experiences. We can only do
this properly by looking at the whole system and its rules and try to
find a solution that works for everything on paper first. DrawIndirect
is one case where the current system fails. "Works for me because I
don't use DrawIndirect" is a horrible way to do this.

Marek


[pull] amdgpu, amdkfd drm-fixes-6.8

2024-01-15 Thread Alex Deucher
Hi Dave, Sima,

Fixes for 6.8.  Same PR as Friday, but with new clang warning fixed and
dropped KFD changes at Felix' request.

The following changes since commit e54478fbdad20f2c58d0a4f99d01299ed8e7fe9c:

  Merge tag 'amd-drm-next-6.8-2024-01-05' of 
https://gitlab.freedesktop.org/agd5f/linux into drm-next (2024-01-09 09:07:50 
+1000)

are available in the Git repository at:

  https://gitlab.freedesktop.org/agd5f/linux.git 
tags/amd-drm-fixes-6.8-2024-01-15-1

for you to fetch changes up to d7643fe6fb76edb1f2f1497bf5e8b8f4774b5129:

  drm/amd/display: Avoid enum conversion warning (2024-01-15 18:35:07 -0500)


amd-drm-fixes-6.8-2024-01-15-1:

amdgpu:
- SubVP fixes
- VRR fixes
- USB4 fixes
- DCN 3.5 fixes
- GFX11 harvesting fix
- RAS fixes
- Misc small fixes
- KFD dma-buf import fixes
- Power reporting fixes
- ATHUB 3.3 fix
- SR-IOV fix
- Add missing fw release for fiji
- GFX 11.5 fix
- Debugging module parameter fix
- SMU 13.0.6 fixes
- Fix new clang warning

amdkfd:
- Fix lockdep warnings
- Fix sparse __rcu warnings
- HMM fix
- SVM fix


Alex Deucher (4):
  drm/amdgpu: fix avg vs input power reporting on smu7
  drm/amdgpu: fall back to INPUT power for AVG power via INFO IOCTL
  drm/amdgpu/pm: clarify debugfs pm output
  drm/amdgpu: drop exp hw support check for GC 9.4.3

Aric Cyr (1):
  drm/amd/display: 3.2.266

Candice Li (2):
  drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
  drm/amdgpu: Support poison error injection via ras_ctrl debugfs

Charlene Liu (1):
  drm/amd/display: Update z8 latency

Dafna Hirschfeld (1):
  drm/amdkfd: fixes for HMM mem allocation

Daniel Miess (1):
  Revert "drm/amd/display: Fix conversions between bytes and KB"

Felix Kuehling (2):
  drm/amdkfd: Fix lock dependency warning
  drm/amdkfd: Fix sparse __rcu annotation warnings

George Shen (1):
  drm/amd/display: Disconnect phantom pipe OPP from OPTC being disabled

Hawking Zhang (1):
  drm/amdgpu: Packed socket_id to ras feature mask

Ivan Lipski (1):
  Revert "drm/amd/display: fix bandwidth validation failure on DCN 2.1"

James Zhu (1):
  drm/amdgpu: make a correction on comment

Le Ma (3):
  Revert "drm/amdgpu: add param to specify fw bo location for front-door 
loading"
  drm/amdgpu: add debug flag to place fw bo on vram for frontdoor loading
  drm/amdgpu: move debug options init prior to amdgpu device init

Lijo Lazar (2):
  drm/amd/pm: Add error log for smu v13.0.6 reset
  drm/amd/pm: Fix smuv13.0.6 current clock reporting

Likun Gao (1):
  drm/amdgpu: correct the cu count for gfx v11

Martin Leung (2):
  drm/amd/display: revert "for FPO & SubVP/DRR config program vmin/max"
  drm/amd/display: revert "Optimize VRR updates to only necessary ones"

Martin Tsai (1):
  drm/amd/display: To adjust dprefclk by down spread percentage

Meenakshikumar Somasundaram (1):
  drm/amd/display: Dpia hpd status not in sync after S4

Melissa Wen (1):
  drm/amd/display: cleanup inconsistent indenting in amdgpu_dm_color

Nathan Chancellor (1):
  drm/amd/display: Avoid enum conversion warning

Peichen Huang (1):
  drm/amd/display: Request usb4 bw for mst streams

Philip Yang (1):
  drm/amdkfd: Fix lock dependency warning with srcu

Srinivasan Shanmugam (6):
  drm/amd/powerplay: Fix kzalloc parameter 'ATOM_Tonga_PPM_Table' in 
'get_platform_power_management_table()'
  drm/amdgpu: Fix with right return code '-EIO' in 
'amdgpu_gmc_vram_checking()'
  drm/amdgpu: Fix unsigned comparison with less than zero in 
vpe_u1_8_from_fraction()
  drm/amdgpu: Release 'adev->pm.fw' before return in 
'amdgpu_device_need_post()'
  drm/amd/display: Fix variable deferencing before NULL check in 
edp_setup_replay()
  drm/amdkfd: Fix 'node' NULL check in 'svm_range_get_range_boundaries()'

Victor Lu (1):
  drm/amdgpu: Do not program VM_L2_CNTL under SRIOV

Yifan Zhang (3):
  drm/amdgpu: update headers for nbio v7.11
  drm/amdgpu: update ATHUB_MISC_CNTL offset for athub v3.3
  drm/amdgpu: update regGL2C_CTRL4 value in golden setting

 drivers/gpu/drm/amd/amdgpu/amdgpu.h|  3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c   |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_discovery.c  |  2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c| 15 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c| 21 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c|  7 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c|  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c| 26 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h |  2 +-
 drivers/g

Re: [PATCH] drm/amdkfd: Correct partial migration virtual addr

2024-01-15 Thread Chen, Xiaogang

This patch is:

Reviewed-by: Xiaogang Chen 

On 1/15/2024 4:00 PM, Philip Yang wrote:

Partial migration to system memory should use migrate.addr, not
prange->start as virtual address to allocate system memory page.

Fixes: 18eb61bd5a6a ("drm/amdkfd: Use partial migrations/mapping for GPU/CPU page 
faults in SVM")
Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index f856901055d3..bdc01ca9609a 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -574,7 +574,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, prange->start,
 prange->last);
  
-	addr = prange->start << PAGE_SHIFT;

+   addr = migrate->start;
  
  	src = (uint64_t *)(scratch + npages);

dst = scratch;


Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Marek Olšák
On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:
>
> Am 15.01.24 um 20:30 schrieb Joshua Ashton:
> > On 1/15/24 19:19, Christian König wrote:
> >> Am 15.01.24 um 20:13 schrieb Joshua Ashton:
> >>> On 1/15/24 18:53, Christian König wrote:
>  Am 15.01.24 um 19:35 schrieb Joshua Ashton:
> > On 1/15/24 18:30, Bas Nieuwenhuizen wrote:
> >> On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
> >> mailto:friedrich.v...@gmx.de>> wrote:
> >>
> >> Re-sending as plaintext, sorry about that
> >>
> >> On 15.01.24 18:54, Michel Dänzer wrote:
> >>  > On 2024-01-15 18:26, Friedrich Vock wrote:
> >>  >> [snip]
> >>  >> The fundamental problem here is that not telling
> >> applications that
> >>  >> something went wrong when you just canceled their work
> >> midway is an
> >>  >> out-of-spec hack.
> >>  >> When there is a report of real-world apps breaking
> >> because of
> >> that hack,
> >>  >> reports of different apps working (even if it's
> >> convenient that they
> >>  >> work) doesn't justify keeping the broken code.
> >>  > If the breaking apps hit multiple soft resets in a row,
> >> I've laid
> >> out a pragmatic solution which covers both cases.
> >> Hitting soft reset every time is the lucky path. Once GPU
> >> work is
> >> interrupted out of nowhere, all bets are off and it might as
> >> well
> >> trigger a full system hang next time. No hang recovery should
> >> be able to
> >> cause that under any circumstance.
> >>
> >>
> >> I think the more insidious situation is no further hangs but
> >> wrong results because we skipped some work. That we skipped work
> >> may e.g. result in some texture not being uploaded or some GPGPU
> >> work not being done and causing further errors downstream (say if
> >> a game is doing AI/physics on the GPU not to say anything of
> >> actual GPGPU work one might be doing like AI)
> >
> > Even worse if this is compute on eg. OpenCL for something
> > science/math/whatever related, or training a model.
> >
> > You could randomly just get invalid/wrong results without even
> > knowing!
> 
>  Well on the kernel side we do provide an API to query the result of
>  a submission. That includes canceling submissions with a soft
>  recovery.
> 
>  What we just doesn't do is to prevent further submissions from this
>  application. E.g. enforcing that the application is punished for
>  bad behavior.
> >>>
> >>> You do prevent future submissions for regular resets though: Those
> >>> increase karma which sets ctx->guilty, and if ctx->guilty then
> >>> -ECANCELED is returned for a submission.
> >>>
> >>> ctx->guilty is never true for soft recovery though, as it doesn't
> >>> increase karma, which is the problem this patch is trying to solve.
> >>>
> >>> By the submission result query API, I you assume you mean checking
> >>> the submission fence error somehow? That doesn't seem very ergonomic
> >>> for a Vulkan driver compared to the simple solution which is to just
> >>> mark it as guilty with what already exists...
> >>
> >> Well as I said the guilty handling is broken for quite a number of
> >> reasons.
> >>
> >> What we can do rather trivially is changing this code in
> >> amdgpu_job_prepare_job():
> >>
> >>  /* Ignore soft recovered fences here */
> >>  r = drm_sched_entity_error(s_entity);
> >>  if (r && r != -ENODATA)
> >>  goto error;
> >>
> >> This will bubble up errors from soft recoveries into the entity as
> >> well and makes sure that further submissions are rejected.
> >
> > That makes sense to do, but at least for GL_EXT_robustness, that will
> > not tell the app that it was guilty.
>
> No, it clearly gets that signaled. We should probably replace the guilty
> atomic with a calls to drm_sched_entity_error().
>
> It's just that this isn't what Marek and I had in mind for this,
> basically completely forget about AMDGPU_CTX_OP_QUERY_STATE or
> AMDGPU_CTX_OP_QUERY_STATE2.
>
> Instead just look at the return value of the CS or query fence result IOCTL.
>
> When you get an -ENODATA you have been guilty of causing a soft
> recovery, when you get an -ETIME you are guilty of causing a timeout
> which had to be hard recovered. When you get an -ECANCELED you are an
> innocent victim of a hard recovery somebody else caused.
>
> What we haven't defined yet is an error code for loosing VRAM, but that
> should be trivial to do.

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
d

Re: [PATCH v4] drm/amdkfd: Set correct svm range actual loc after spliting

2024-01-15 Thread Chen, Xiaogang



With a nitpick below, this patch is:

Reviewed-by: Xiaogang Chen

On 1/15/2024 4:02 PM, Philip Yang wrote:

While partially migrating an svm range to system memory, clear the
dma_addr vram domain flag, otherwise a future split will compute incorrect
vram_pages and actual loc.

After splitting a range, set the new and the old range actual_loc:
the new range actual_loc is 0 if new->vram_pages is 0.
the old range actual_loc is 0 if old->vram_pages - new->vram_pages == 0.

The new range takes an svm_bo ref only if its vram_pages is not 0.
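
For illustration only (not part of the change), a worked example with
made-up numbers:

/*
 * old range [0x000, 0x3ff], actual_loc = some GPU, vram_pages = 0x200,
 * with only the upper half resident in VRAM; splitting off
 * new = [0x200, 0x3ff] gives new->vram_pages = 0x200, so
 *   new->actual_loc = old->actual_loc and new takes an svm_bo ref,
 *   old->vram_pages becomes 0, therefore old->actual_loc = 0.
 */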

Signed-off-by: Philip Yang 
---
  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  8 +
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 42 ++--
  drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  1 +
  3 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index bdc01ca9609a..79baa195ccac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -564,6 +564,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
dma_addr_t *scratch, uint64_t npages)
  {
struct device *dev = adev->dev;
+   dma_addr_t *dma_addr;
uint64_t *src;
dma_addr_t *dst;
struct page *dpage;
@@ -575,6 +576,7 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
 prange->last);
  
  	addr = migrate->start;

+   dma_addr = svm_get_dma_addr_for_page_count(prange, addr);
  
  	src = (uint64_t *)(scratch + npages);

dst = scratch;
@@ -623,6 +625,12 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
goto out_oom;
}
  
+		/* Clear VRAM flag when page is migrated to ram, to count vram

+* pages correctly when spliting the range.
+*/
+   if (dma_addr && (dma_addr[i] & SVM_RANGE_VRAM_DOMAIN))
+   dma_addr[i] = 0;
pr_debug_ratelimited("dma mapping dst to 0x%llx, pfn 0x%lx\n",
 dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
  
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c

index f84547eccd28..78b4968e4c95 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -362,7 +362,6 @@ svm_range *svm_range_new(struct svm_range_list *svms, 
uint64_t start,
INIT_LIST_HEAD(&prange->child_list);
atomic_set(&prange->invalid, 0);
prange->validate_timestamp = 0;
-   prange->vram_pages = 0;
mutex_init(&prange->migrate_mutex);
mutex_init(&prange->lock);
  
@@ -965,6 +964,24 @@ svm_range_split_array(void *ppnew, void *ppold, size_t size,

return 0;
  }
  
+dma_addr_t *

+svm_get_dma_addr_for_page_count(struct svm_range *prange, u64 addr)


I think this function would be better named svm_get_dma_addr_for_page_addr.
Here we do not count pages, but get the correct place to store the dma
address for a specific gpu vm addr.


Regards

Xiaogang


+{
+   struct kfd_process *p = container_of(prange->svms, struct kfd_process, 
svms);
+   dma_addr_t *dma_addr;
+   s32 gpuidx;
+
+   gpuidx = kfd_process_gpuidx_from_gpuid(p, prange->actual_loc);
+   if (gpuidx < 0) {
+   pr_debug("no GPU id 0x%x found\n", prange->actual_loc);
+   return NULL;
+   }
+
+   dma_addr = prange->dma_addr[gpuidx];
+   dma_addr += (addr >> PAGE_SHIFT) - prange->start;
+   return dma_addr;
+}
+
  static int
  svm_range_split_pages(struct svm_range *new, struct svm_range *old,
  uint64_t start, uint64_t last)
@@ -980,9 +997,14 @@ svm_range_split_pages(struct svm_range *new, struct 
svm_range *old,
if (r)
return r;
}
-   if (old->actual_loc)
+   if (old->actual_loc && new->vram_pages) {
old->vram_pages -= new->vram_pages;
-
+   new->actual_loc = old->actual_loc;
+   if (!old->vram_pages)
+   old->actual_loc = 0;
+   }
+   pr_debug("new->vram_pages 0x%llx loc 0x%x old->vram_pages 0x%llx loc 
0x%x\n",
+new->vram_pages, new->actual_loc, old->vram_pages, 
old->actual_loc);
return 0;
  }
  
@@ -1002,13 +1024,14 @@ svm_range_split_nodes(struct svm_range *new, struct svm_range *old,

new->offset = old->offset + npages;
}
  
-	new->svm_bo = svm_range_bo_ref(old->svm_bo);

-   new->ttm_res = old->ttm_res;
-
-   spin_lock(&new->svm_bo->list_lock);
-   list_add(&new->svm_bo_list, &new->svm_bo->range_list);
-   spin_unlock(&new->svm_bo->list_lock);
+   if (new->vram_pages) {
+   new->svm_bo = svm_range_bo_ref(old->svm_bo);
+   new->ttm_res = old->ttm_res;
  
+		spin_lock(&new->svm_bo->list_lock);

+   list_add(&new->

[PATCH 1/2] drm/atomic: Allow drivers to write their own plane check for async flips

2024-01-15 Thread André Almeida
Some hardware is more flexible about what it can flip asynchronously, so
rework the plane check so that drivers can implement their own check,
lifting some of the restrictions.
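
A driver opts in by filling the new hook in its drm_plane_funcs; patch
2/2 wires this up for amdgpu roughly like the sketch below (the example_*
names are placeholders):

static const struct drm_plane_funcs example_plane_funcs = {
        .update_plane   = drm_atomic_helper_update_plane,
        .disable_plane  = drm_atomic_helper_disable_plane,
        /* ... */
        .check_async_props = example_plane_check_async_props,
};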

Signed-off-by: André Almeida 
---
 drivers/gpu/drm/drm_atomic_uapi.c | 62 ++-
 include/drm/drm_atomic_uapi.h | 12 ++
 include/drm/drm_plane.h   |  5 +++
 3 files changed, 62 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/drm_atomic_uapi.c 
b/drivers/gpu/drm/drm_atomic_uapi.c
index aee4a65d4959..6d5b9fec90c7 100644
--- a/drivers/gpu/drm/drm_atomic_uapi.c
+++ b/drivers/gpu/drm/drm_atomic_uapi.c
@@ -620,7 +620,7 @@ static int drm_atomic_plane_set_property(struct drm_plane 
*plane,
return 0;
 }
 
-static int
+int
 drm_atomic_plane_get_property(struct drm_plane *plane,
const struct drm_plane_state *state,
struct drm_property *property, uint64_t *val)
@@ -683,6 +683,7 @@ drm_atomic_plane_get_property(struct drm_plane *plane,
 
return 0;
 }
+EXPORT_SYMBOL(drm_atomic_plane_get_property);
 
 static int drm_atomic_set_writeback_fb_for_connector(
struct drm_connector_state *conn_state,
@@ -1026,18 +1027,54 @@ int drm_atomic_connector_commit_dpms(struct 
drm_atomic_state *state,
return ret;
 }
 
-static int drm_atomic_check_prop_changes(int ret, uint64_t old_val, uint64_t 
prop_value,
+int drm_atomic_check_prop_changes(int ret, uint64_t old_val, uint64_t 
prop_value,
 struct drm_property *prop)
 {
if (ret != 0 || old_val != prop_value) {
drm_dbg_atomic(prop->dev,
-  "[PROP:%d:%s] No prop can be changed during 
async flip\n",
+  "[PROP:%d:%s] This prop cannot be changed during 
async flip\n",
   prop->base.id, prop->name);
return -EINVAL;
}
 
return 0;
 }
+EXPORT_SYMBOL(drm_atomic_check_prop_changes);
+
+/* plane changes may have exceptions, so we have a special function for them */
+static int drm_atomic_check_plane_changes(struct drm_property *prop,
+ struct drm_plane *plane,
+ struct drm_plane_state *plane_state,
+ struct drm_mode_object *obj,
+ u64 prop_value, u64 old_val)
+{
+   struct drm_mode_config *config = &plane->dev->mode_config;
+   int ret;
+
+   if (plane->funcs->check_async_props)
+   return plane->funcs->check_async_props(prop, plane, plane_state,
+obj, prop_value, 
old_val);
+
+   /*
+* if you are trying to change something other than the FB ID, your
+* change will be either rejected or ignored, so we can stop the check
+* here
+*/
+   if (prop != config->prop_fb_id) {
+   ret = drm_atomic_plane_get_property(plane, plane_state,
+   prop, &old_val);
+   return drm_atomic_check_prop_changes(ret, old_val, prop_value, 
prop);
+   }
+
+   if (plane_state->plane->type != DRM_PLANE_TYPE_PRIMARY) {
+   drm_dbg_atomic(prop->dev,
+  "[OBJECT:%d] Only primary planes can be changed 
during async flip\n",
+  obj->id);
+   return -EINVAL;
+   }
+
+   return 0;
+}
 
 int drm_atomic_set_property(struct drm_atomic_state *state,
struct drm_file *file_priv,
@@ -1100,7 +1137,6 @@ int drm_atomic_set_property(struct drm_atomic_state 
*state,
case DRM_MODE_OBJECT_PLANE: {
struct drm_plane *plane = obj_to_plane(obj);
struct drm_plane_state *plane_state;
-   struct drm_mode_config *config = &plane->dev->mode_config;
 
plane_state = drm_atomic_get_plane_state(state, plane);
if (IS_ERR(plane_state)) {
@@ -1108,19 +1144,11 @@ int drm_atomic_set_property(struct drm_atomic_state 
*state,
break;
}
 
-   if (async_flip && prop != config->prop_fb_id) {
-   ret = drm_atomic_plane_get_property(plane, plane_state,
-   prop, &old_val);
-   ret = drm_atomic_check_prop_changes(ret, old_val, 
prop_value, prop);
-   break;
-   }
-
-   if (async_flip && plane_state->plane->type != 
DRM_PLANE_TYPE_PRIMARY) {
-   drm_dbg_atomic(prop->dev,
-  "[OBJECT:%d] Only primary planes can be 
changed during async flip\n",
-  obj->id);
-   ret = -EINVAL;
-   break;
+   if (async_flip) {
+ 

[PATCH 0/2] drm/atomic: Allow drivers to write their own plane check for async

2024-01-15 Thread André Almeida
Hi,

AMD hardware can do more on the async flip path than just the primary plane, so
to lift the current restrictions, this patchset allows drivers to write their
own plane check for async flips.

I'm not sure if adding something new to drm_plane_funcs is the right way to do
this, because if we want to extend it to the other object types (crtc,
connector) we would need to add it to their own drm_XXX_funcs as well, so
feedback is welcome!

André

André Almeida (2):
  drm/atomic: Allow drivers to write their own plane check for async
flips
  drm/amdgpu: Implement check_async_props for planes

 .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 30 +
 drivers/gpu/drm/drm_atomic_uapi.c | 62 ++-
 include/drm/drm_atomic_uapi.h | 12 
 include/drm/drm_plane.h   |  5 ++
 4 files changed, 92 insertions(+), 17 deletions(-)

-- 
2.43.0



[PATCH 2/2] drm/amdgpu: Implement check_async_props for planes

2024-01-15 Thread André Almeida
AMD GPUs can do async flips with overlay planes and other props rather
than just FB ID, so implement a custom check_async_props for AMD planes.

Signed-off-by: André Almeida 
---
 .../amd/display/amdgpu_dm/amdgpu_dm_plane.c   | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
index 116121e647ca..127cae1f9fb4 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
@@ -25,6 +25,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1430,6 +1431,34 @@ static void 
amdgpu_dm_plane_drm_plane_destroy_state(struct drm_plane *plane,
drm_atomic_helper_plane_destroy_state(plane, state);
 }
 
+static int amdgpu_dm_plane_check_async_props(struct drm_property *prop,
+ struct drm_plane *plane,
+ struct drm_plane_state *plane_state,
+ struct drm_mode_object *obj,
+ u64 prop_value, u64 old_val)
+{
+   struct drm_mode_config *config = &plane->dev->mode_config;
+   int ret;
+
+   if (prop != config->prop_fb_id &&
+   prop != config->prop_in_fence_fd &&
+   prop != config->prop_fb_damage_clips) {
+   ret = drm_atomic_plane_get_property(plane, plane_state,
+   prop, &old_val);
+   return drm_atomic_check_prop_changes(ret, old_val, prop_value, 
prop);
+   }
+
+   if (plane_state->plane->type != DRM_PLANE_TYPE_PRIMARY
+   && plane_state->plane->type != DRM_PLANE_TYPE_OVERLAY) {
+   drm_dbg_atomic(prop->dev,
+  "[OBJECT:%d] Only primary or overlay planes can 
be changed during async flip\n",
+  obj->id);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static const struct drm_plane_funcs dm_plane_funcs = {
.update_plane   = drm_atomic_helper_update_plane,
.disable_plane  = drm_atomic_helper_disable_plane,
@@ -1438,6 +1467,7 @@ static const struct drm_plane_funcs dm_plane_funcs = {
.atomic_duplicate_state = amdgpu_dm_plane_drm_plane_duplicate_state,
.atomic_destroy_state = amdgpu_dm_plane_drm_plane_destroy_state,
.format_mod_supported = amdgpu_dm_plane_format_mod_supported,
+   .check_async_props = amdgpu_dm_plane_check_async_props,
 };
 
 int amdgpu_dm_plane_init(struct amdgpu_display_manager *dm,
-- 
2.43.0



Re: [PATCH 1/2] drm/amdgpu: Reset IH OVERFLOW_CLEAR bit after writing rptr

2024-01-15 Thread Christian König

Am 15.01.24 um 12:18 schrieb Friedrich Vock:

Adding the original Ccs from the thread since they seemed to be missing
in the reply.

On 15.01.24 11:55, Christian König wrote:

Am 14.01.24 um 14:00 schrieb Friedrich Vock:

Allows us to detect subsequent IH ring buffer overflows as well.


Well that suggested handling here is certainly broken, see below.



Cc: Joshua Ashton 
Cc: Alex Deucher 
Cc: sta...@vger.kernel.org

Signed-off-by: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h  |  2 ++
  drivers/gpu/drm/amd/amdgpu/cik_ih.c | 13 +
  drivers/gpu/drm/amd/amdgpu/cz_ih.c  | 14 +-
  drivers/gpu/drm/amd/amdgpu/iceland_ih.c | 14 +-
  drivers/gpu/drm/amd/amdgpu/ih_v6_0.c    | 13 +
  drivers/gpu/drm/amd/amdgpu/ih_v6_1.c    | 13 +
  drivers/gpu/drm/amd/amdgpu/navi10_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/si_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/tonga_ih.c   | 13 +
  drivers/gpu/drm/amd/amdgpu/vega10_ih.c  | 12 
  drivers/gpu/drm/amd/amdgpu/vega20_ih.c  | 12 
  11 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
index 508f02eb0cf8..6041ec727f06 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
@@ -69,6 +69,8 @@ struct amdgpu_ih_ring {
  unsigned    rptr;
  struct amdgpu_ih_regs    ih_regs;

+    bool overflow;
+
  /* For waiting on IH processing at checkpoint. */
  wait_queue_head_t wait_process;
  uint64_t    processed_timestamp;
diff --git a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
index 6f7c031dd197..807cc30c9e33 100644
--- a/drivers/gpu/drm/amd/amdgpu/cik_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/cik_ih.c
@@ -204,6 +204,7 @@ static u32 cik_ih_get_wptr(struct amdgpu_device
*adev,
  tmp = RREG32(mmIH_RB_CNTL);
  tmp |= IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
  WREG32(mmIH_RB_CNTL, tmp);
+    ih->overflow = true;
  }
  return (wptr & ih->ptr_mask);
  }
@@ -274,7 +275,19 @@ static void cik_ih_decode_iv(struct
amdgpu_device *adev,
  static void cik_ih_set_rptr(struct amdgpu_device *adev,
  struct amdgpu_ih_ring *ih)
  {
+    u32 tmp;
+
  WREG32(mmIH_RB_RPTR, ih->rptr);
+
+    /* If we overflowed previously (and thus set the OVERFLOW_CLEAR
bit),
+ * reset it here to detect more overflows if they occur.
+ */
+    if (ih->overflow) {
+    tmp = RREG32(mmIH_RB_CNTL);
+    tmp &= ~IH_RB_CNTL__WPTR_OVERFLOW_CLEAR_MASK;
+    WREG32(mmIH_RB_CNTL, tmp);
+    ih->overflow = false;
+    }


Well that is an extremely bad idea. We already reset the overflow
after reading the WPTR.


This is not resetting the overflow bit. This is resetting a "clear
overflow" bit. I don't have the hardware docs, but the name (and my
observations) strongly suggest that setting this bit actually prevents
the hardware from setting the overflow bit ever again.


Well that doesn't make any sense at all. The hardware documentation 
clearly states that this bit is write only and should always read as zero.


Setting this bit will clear the overflow flag in the WPTR register and 
clearing it has no effect at all.


I could only ping the hw engineer responsible for this block to double 
check whether the documentation is somehow outdated, but I really doubt it.



Right now, IH overflows, even if they occur repeatedly, only get
registered once. If not registering IH overflows can trivially lead to
system crashes, it's amdgpu's current handling that is broken.


It's years that we last tested this but according to the HW 
documentation this should work fine.


What could potentially happen is that the IH has silenced the source of 
the overflow. We never implemented resetting those, but in this case 
that here won't help either.




The possibility of a repeated IH overflow in between reading the wptr
and updating the rptr is a good point, but how can we detect that at
all? It seems to me like we can't set the OVERFLOW_CLEAR bit at all
then, because we're guaranteed to miss any overflows that happen while
the bit is set.


When an IH overflow is signaled we clear that flag by writing 1 into the 
OVERFLOW_CLEAR bit and skip one entry in the IH ring buffer.


What can of course happen is that the IH ring buffer overflows more than 
this single entry and we process IVs which are potentially corrupted, 
but we won't miss any additional overflows since we only start 
processing after resetting the flag.


An IH overflow is also something you should *never* see in a production 
system. This is purely for driver bringup and as a fallback when the HW 
is severely misprogrammed.


The only exception to that is page fault handling on MI products because 
of a hardware bug; to mitigate this we are processing page faults on a 
separate IH

Re: [PATCH 2/2] drm/amdgpu: Process fences on IH overflow

2024-01-15 Thread Christian König

Am 15.01.24 um 12:19 schrieb Friedrich Vock:

On 15.01.24 11:26, Christian König wrote:

Am 14.01.24 um 14:00 schrieb Friedrich Vock:

If the IH ring buffer overflows, it's possible that fence signal events
were lost. Check each ring for progress to prevent job timeouts/GPU
hangs due to the fences staying unsignaled despite the work being done.


That's completely unnecessary and in some cases even harmful.

How is it harmful? The only effect it can have is to prevent unnecessary
GPU hangs, no? It's not like it hides any legitimate errors that you'd
otherwise see.


We have no guarantee that all ring buffers are actually fully 
initialized to allow fence processing.


Apart from that, fence processing is the least of your problems when an 
IV overflow occurs. Other interrupt sources which are not repeated are 
usually far worse.




We already have a timeout handler for that and overflows point to
severe system problem so they should never occur in a production system.


IH ring buffer overflows are pretty reliably reproducible if you trigger
a lot of page faults, at least on Deck. Why shouldn't enough page faults
in quick succession be able to overflow the IH ring buffer?


At least not on recent hw generations. Since gfx9 we have a rate limit 
on the number of page faults generated.


What could maybe do as well is to change the default of vm_fault_stop, 
but for your case that would be even worse in production.




The fence fallback timer as it is now is useless for this because it
only gets triggered once after 0.5s. I guess an alternative approach
would be to make a timer trigger for each work item in flight every
0.5s, but why should that be better than just handling overflow errors
as they occur?


That is intentional. As I said an IH overflow just points out that there 
is something massively wrong in the HW programming.


After gfx9 the IH should never produce an overflow any more, otherwise 
either the ratelimit doesn't work or isn't enabled for some reason, or 
the IH ring buffer is just too small.


Regards,
Christian.



Regards,
Friedrich



Regards,
Christian.



Cc: Joshua Ashton 
Cc: Alex Deucher 
Cc: sta...@vger.kernel.org

Signed-off-by: Friedrich Vock 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
index f3b0aaf3ebc6..2a246db1d3a7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
@@ -209,6 +209,7 @@ int amdgpu_ih_process(struct amdgpu_device *adev,
struct amdgpu_ih_ring *ih)
  {
  unsigned int count;
  u32 wptr;
+    int i;

  if (!ih->enabled || adev->shutdown)
  return IRQ_NONE;
@@ -227,6 +228,20 @@ int amdgpu_ih_process(struct amdgpu_device
*adev, struct amdgpu_ih_ring *ih)
  ih->rptr &= ih->ptr_mask;
  }

+    /* If the ring buffer overflowed, we might have lost some fence
+ * signal interrupts. Check if there was any activity so the 
signal

+ * doesn't get lost.
+ */
+    if (ih->overflow) {
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    if (!ring || !ring->fence_drv.initialized)
+    continue;
+    amdgpu_fence_process(ring);
+    }
+    }
+
  amdgpu_ih_set_rptr(adev, ih);
  wake_up_all(&ih->wait_process);

--
2.43.0







Re: [PATCH 2/2] drm/amdgpu: Mark ctx as guilty in ring_soft_recovery path

2024-01-15 Thread Christian König

Am 16.01.24 um 01:05 schrieb Marek Olšák:

On Mon, Jan 15, 2024 at 3:06 PM Christian König
 wrote:

Am 15.01.24 um 20:30 schrieb Joshua Ashton:

On 1/15/24 19:19, Christian König wrote:

Am 15.01.24 um 20:13 schrieb Joshua Ashton:

On 1/15/24 18:53, Christian König wrote:

Am 15.01.24 um 19:35 schrieb Joshua Ashton:

On 1/15/24 18:30, Bas Nieuwenhuizen wrote:

On Mon, Jan 15, 2024 at 7:14 PM Friedrich Vock
mailto:friedrich.v...@gmx.de>> wrote:

 Re-sending as plaintext, sorry about that

 On 15.01.24 18:54, Michel Dänzer wrote:
  > On 2024-01-15 18:26, Friedrich Vock wrote:
  >> [snip]
  >> The fundamental problem here is that not telling applications that
  >> something went wrong when you just canceled their work midway is an
  >> out-of-spec hack.
  >> When there is a report of real-world apps breaking because of that hack,
  >> reports of different apps working (even if it's convenient that they
  >> work) doesn't justify keeping the broken code.
  > If the breaking apps hit multiple soft resets in a row, I've laid out
  > a pragmatic solution which covers both cases.
 Hitting soft reset every time is the lucky path. Once GPU work is
 interrupted out of nowhere, all bets are off and it might as well
 trigger a full system hang next time. No hang recovery should be able to
 cause that under any circumstance.


I think the more insidious situation is no further hangs but
wrong results because we skipped some work. The skipped work may
e.g. result in some texture not being uploaded or some GPGPU work
not being done, causing further errors downstream (say if a game
is doing AI/physics on the GPU, not to mention actual GPGPU work
one might be doing, like AI).

Even worse if this is compute on eg. OpenCL for something
science/math/whatever related, or training a model.

You could randomly just get invalid/wrong results without even
knowing!

Well on the kernel side we do provide an API to query the result of
a submission. That includes canceling submissions with a soft
recovery.

What we just don't do is prevent further submissions from this
application, e.g. enforcing that the application is punished for
bad behavior.

You do prevent future submissions for regular resets though: Those
increase karma which sets ctx->guilty, and if ctx->guilty then
-ECANCELED is returned for a submission.

ctx->guilty is never true for soft recovery though, as it doesn't
increase karma, which is the problem this patch is trying to solve.
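
To spell that flow out, here is a minimal standalone sketch of the
behaviour described above. The struct, field and limit names are made
up for illustration and are not the real amdgpu code:

#include <errno.h>

/* Illustrative types only; not the real amdgpu structures. */
struct example_ctx {
        int karma;
        int guilty;
};

#define EXAMPLE_KARMA_LIMIT 2

/* Hard recoveries bump karma; enough karma marks the context guilty. */
static void example_hard_recovery(struct example_ctx *ctx)
{
        if (++ctx->karma > EXAMPLE_KARMA_LIMIT)
                ctx->guilty = 1;
}

/* Soft recoveries skip the karma bump, so ctx->guilty never becomes
 * true on that path. */
static void example_soft_recovery(struct example_ctx *ctx)
{
        (void)ctx;
}

/* Submissions from a guilty context are rejected with -ECANCELED. */
static int example_cs_submit(struct example_ctx *ctx)
{
        return ctx->guilty ? -ECANCELED : 0;
}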

By the submission result query API, I assume you mean checking
the submission fence error somehow? That doesn't seem very ergonomic
for a Vulkan driver compared to the simple solution which is to just
mark it as guilty with what already exists...

Well as I said the guilty handling is broken for quite a number of
reasons.

What we can do rather trivially is changing this code in
amdgpu_job_prepare_job():

    /* Ignore soft recovered fences here */
    r = drm_sched_entity_error(s_entity);
    if (r && r != -ENODATA)
        goto error;

This will bubble up errors from soft recoveries into the entity as
well and makes sure that further submissions are rejected.
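
In other words, the suggestion is presumably to drop the -ENODATA
exception, roughly along these lines (a sketch, not a tested patch):

    /* Treat soft recovered fences as errors for the entity too, so
     * further submissions from it are rejected as well. */
    r = drm_sched_entity_error(s_entity);
    if (r)
        goto error;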

That makes sense to do, but at least for GL_EXT_robustness, that will
not tell the app that it was guilty.

No, it clearly gets that signaled. We should probably replace the guilty
atomic with calls to drm_sched_entity_error().

It's just that this isn't what Marek and I had in mind for this:
basically, completely forget about AMDGPU_CTX_OP_QUERY_STATE or
AMDGPU_CTX_OP_QUERY_STATE2.

Instead just look at the return value of the CS or query fence result IOCTL.

When you get an -ENODATA you have been guilty of causing a soft
recovery, when you get an -ETIME you are guilty of causing a timeout
which had to be hard recovered. When you get an -ECANCELED you are an
innocent victim of a hard recovery somebody else caused.

What we haven't defined yet is an error code for losing VRAM, but that
should be trivial to do.
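
For illustration, a userspace driver could then classify the fence
error roughly like this, using the kernel-style negative error codes
from the description above (the enum and helper are hypothetical, not
an existing UMD interface):

#include <errno.h>

enum example_reset_status {
        EXAMPLE_NO_RESET,
        EXAMPLE_GUILTY_SOFT,     /* caused a soft recovery */
        EXAMPLE_GUILTY_TIMEOUT,  /* caused a timeout -> hard recovery */
        EXAMPLE_INNOCENT,        /* victim of someone else's hard recovery */
};

static enum example_reset_status example_classify(int fence_error)
{
        switch (fence_error) {
        case -ENODATA:
                return EXAMPLE_GUILTY_SOFT;
        case -ETIME:
                return EXAMPLE_GUILTY_TIMEOUT;
        case -ECANCELED:
                return EXAMPLE_INNOCENT;
        default:
                return EXAMPLE_NO_RESET;
        }
}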

So far we have implemented the GPU reset and soft reset, but we
haven't done anything to have a robust system recovery. Under the
current system, things can easily keep hanging indefinitely because
nothing prevents that.

The reset status query should stay. Robust apps will use it to tell
when they should recreate their context and resources even if they
don't submit anything. Let's fully trust robust apps here. In the
future we might change our mind about that, but for now, let's just
focus on API conformance, and later we can change it as long as we
stay API conformant.
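
As a concrete example of the robust-app side, with GL 4.5 /
GL_KHR_robustness an application can poll the reset status and rebuild
its context when it comes back non-zero. This is only a sketch: the
entry point is usually resolved via glXGetProcAddress or
eglGetProcAddress, and the rebuild helper is hypothetical:

#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

static void recreate_context_and_resources(void)
{
        /* hypothetical: tear down the lost context and rebuild it */
}

static void example_check_reset(void)
{
        GLenum status = glGetGraphicsResetStatus();

        if (status == GL_GUILTY_CONTEXT_RESET ||
            status == GL_INNOCENT_CONTEXT_RESET ||
            status == GL_UNKNOWN_CONTEXT_RESET)
                recreate_context_and_resources();
}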

Non-robust apps must be terminated when they hang or are innocent but
affected. Their existence is a security and usability problem and a
source of frustrations for users. 100% guaranteed system recovery is
impossible if they continue to live.

IBs should be rejected for all guilty and affected innocent contexts
unconditionally, both robust a