Bug#1035971: linux-image-6.3.0-0-amd64: IRQ warnings from amdgpu Navi 33 / Radeon RX 7700S ...

2023-05-18 Thread David Reviejo

Hi, Nathan

I see similar warnings with the latest longterm 6.1.27 image from
bookworm, in my case when suspending to RAM.

Seems to be an amdgpu bug introduced two or three kernel releases ago, as
you can see googling around; for example here:

https://bugzilla.redhat.com/show_bug.cgi?id=2191739

or here:

https://gitlab.freedesktop.org/drm/amd/-/issues/2522

If it's this bug, the fix seems to be in yesterday's latest upstream
kernel update: 6.3.3 (and 6.1.29 for the stable longterm).

We can only hope that the developers are not too busy with the bookworm
release to apply these patches ASAP ;)

Cheers,
--
David



Bug#1035971: linux-image-6.3.0-0-amd64: IRQ warnings from amdgpu Navi 33 / Radeon RX 7700S ...

2023-05-18 Thread Diederik de Haas
On Thursday, 18 May 2023 12:52:24 CEST David Reviejo wrote:
> Seems to be an amdgpu bug introduced two or three kernel releases ago, as
> you can see googling around; for example here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2191739
> 
> or here:
> 
> https://gitlab.freedesktop.org/drm/amd/-/issues/2522
> 
> If it's this bug, the fix seems to be in yesterday's latest upstream
> kernel update: 6.3.3 (and 6.1.29 for the stable longterm).

I _think_ I got the right commit for the 6.3 branch attached.

Para 4.5(.2) of the Debian Kernel Handbook describes how to test a simple patch:
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official

Could you try that and see whether it indeed fixes your issue?

>From c5123c193696bf97fdf259c825ebfac517b54e44 Mon Sep 17 00:00:00 2001
From: Guchun Chen 
Date: Sat, 6 May 2023 16:52:59 +0800
Subject: drm/amdgpu: disable sdma ecc irq only when sdma RAS is enabled in
 suspend

commit 8b229ada2669b74fdae06c83fbfda5a5a99fc253 upstream.

sdma_v4_0_ip is shared on a few asics, but in sdma_v4_0_hw_fini,
driver unconditionally disables ecc_irq which is only enabled on
those asics enabling sdma ecc. This will introduce a warning in
suspend cycle on those chips with sdma ip v4.0, while without
sdma ecc. So this patch correct this.

[ 7283.166354] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu]
[ 7283.167001] RSP: 0018:9a5fc3967d08 EFLAGS: 00010246
[ 7283.167019] RAX: 98d88afd3770 RBX: 0001 RCX: 
[ 7283.167023] RDX:  RSI: 98d89da30390 RDI: 98d89da2
[ 7283.167025] RBP: 98d89da2 R08: 00036838 R09: 0006
[ 7283.167028] R10: d5764243c008 R11:  R12: 98d89da30390
[ 7283.167030] R13: 98d89da38978 R14: 999ae15a R15: 98d880130105
[ 7283.167032] FS:  () GS:98d996f0() knlGS:
[ 7283.167036] CS:  0010 DS:  ES:  CR0: 80050033
[ 7283.167039] CR2: f7a9d178 CR3: 0001c42ea000 CR4: 003506e0
[ 7283.167041] Call Trace:
[ 7283.167046]  <TASK>
[ 7283.167048]  sdma_v4_0_hw_fini+0x38/0xa0 [amdgpu]
[ 7283.167704]  amdgpu_device_ip_suspend_phase2+0x101/0x1a0 [amdgpu]
[ 7283.168296]  amdgpu_device_suspend+0x103/0x180 [amdgpu]
[ 7283.168875]  amdgpu_pmops_freeze+0x21/0x60 [amdgpu]
[ 7283.169464]  pci_pm_freeze+0x54/0xc0

Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2522
Signed-off-by: Guchun Chen 
Reviewed-by: Tao Zhou 
Signed-off-by: Alex Deucher 
Cc: sta...@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index b5affba221569..8b8ddf0502661 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1903,9 +1903,11 @@ static int sdma_v4_0_hw_fini(void *handle)
 		return 0;
 	}
 
-	for (i = 0; i < adev->sdma.num_instances; i++) {
-		amdgpu_irq_put(adev, &adev->sdma.ecc_irq,
-			   AMDGPU_SDMA_IRQ_INSTANCE0 + i);
+	if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__SDMA)) {
+		for (i = 0; i < adev->sdma.num_instances; i++) {
+			amdgpu_irq_put(adev, &adev->sdma.ecc_irq,
+				       AMDGPU_SDMA_IRQ_INSTANCE0 + i);
+		}
 	}
 
 	sdma_v4_0_ctx_switch_enable(adev, false);
-- 
cgit 





Bug#1035971: linux-image-6.3.0-0-amd64: IRQ warnings from amdgpu Navi 33 / Radeon RX 7700S ...

2023-05-18 Thread Diederik de Haas
On Thursday, 18 May 2023 13:19:52 CEST Diederik de Haas wrote:
> I _think_ I got the right commit for the 6.3 branch attached.

It seems a '>' snuck into the attachment/patch as the very first character, so
you may want to remove that.
