Re: Radeon regression in 6.6 kernel
On Mon, Dec 11, 2023 at 7:28 PM Phillip Susi wrote: > > Phillip Susi writes: > > > And it works, but 6.7-rc5 does not, even though it includes that patch. > > Here's the syslog from the attempt. I'll start bisecting again. > > I checked out the patch that got merged upstream and it also fails. I > looked at the two commits, and I see what happened. Your original patch > MOVED the call to amdgpu_ttm_set_buffer_funcs_status(). The one that > got merged into Linus' tree instead DUPLICATES the call. Oops? > Yeah, I messed up the patch somehow when I applied it. Fix up already queued up to add the missing chunk. Alex
Re: Radeon regression in 6.6 kernel
Phillip Susi writes: > And it works, but 6.7-rc5 does not, even though it includes that patch. > Here's the syslog from the attempt. I'll start bisecting again. I checked out the patch that got merged upstream and it also fails. I looked at the two commits, and I see what happened. Your original patch MOVED the call to amdgpu_ttm_set_buffer_funcs_status(). The one that got merged into Linus' tree instead DUPLICATES the call. Oops?
Re: Radeon regression in 6.6 kernel
On Sun, Dec 3, 2023 at 3:40 PM Phillip Susi wrote: > > Alex Deucher writes: > > > Phillip, > > > > Can you test this patch? I was not able to repro the issue on the > > navi2x card I had handy, but I think it should fix it. > > I pulled -rc4 and it still had the problem. I applied this patch, and > yes, it fixed it! > Great! I'll include it this week. Alex
Re: Radeon regression in 6.6 kernel
Alex Deucher writes: > Phillip, > > Can you test this patch? I was not able to repro the issue on the > navi2x card I had handy, but I think it should fix it. I pulled -rc4 and it still had the problem. I applied this patch, and yes, it fixed it!
Re: Radeon regression in 6.6 kernel
Phillip, Can you test this patch? I was not able to repro the issue on the navi2x card I had handy, but I think it should fix it. Thanks, Alex On Wed, Nov 29, 2023 at 3:49 PM Alex Deucher wrote: > > On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: > > > > Actually I think I see the problem. I'll try and send out a patch > > later today to test. > > Does the attached patch fix it? > > Alex > > > > > Alex > > > > On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: > > > > > > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > > > > > > > On 2023-11-29 10:22, Alex Deucher wrote: > > > > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > > > > > wrote: > > > > >> > > > > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov > > > > >> wrote: > > > > >>> > > > > >>> On 2023-11-28 17:13, Alex Deucher wrote: > > > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > > > > wrote: > > > > > > > > > > Alex Deucher writes: > > > > > > > > > >>> In that case those are the already known problems with the > > > > >>> scheduler > > > > >>> changes, aren't they? > > > > >> > > > > >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe > > > > >> I'm > > > > >> misunderstanding what the original report was actually testing. > > > > >> If it > > > > >> was 6.7, then try reverting: > > > > >> 56e449603f0ac580700621a356d35d5716a62ce5 > > > > >> b70438004a14f4d0f9890b3297cd66248728546c > > > > > > > > > > At some point it was suggested that I file a gitlab issue, but I > > > > > took > > > > > this to mean it was already known and being worked on. -rc3 came > > > > > out > > > > > today and still has the problem. Is there a known issue I could > > > > > track? > > > > > > > > > > > > > At this point, unless there are any objections, I think we should > > > > just > > > > revert the two patches > > > > >>> Uhm, no. > > > > >>> > > > > >>> Why "the two" patches? > > > > >>> > > > > >>> This email, part of this thread, > > > > >>> > > > > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > > > >>> > > > > >>> clearly states that reverting *only* this commit, > > > > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable > > > > >>> number of run-queues > > > > >>> *does not* mitigate the failed suspend. (Furthermore, this commit > > > > >>> doesn't really change > > > > >>> anything operational, other than using an allocated array, instead > > > > >>> of a static one, in DRM, > > > > >>> while the 2nd patch is solely contained within the amdgpu driver > > > > >>> code.) > > > > >>> > > > > >>> Leaving us with only this change, > > > > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > > > > >>> to be at fault, as the kernel log attached in the linked email > > > > >>> above shows. > > > > >>> > > > > >>> The conclusion is that only b70438004a14f4 needs reverting. > > > > >> > > > > >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > > > > >> 56e449603f0ac5 breaks amdgpu. > > > > > > > > > > We can try and re-enable it in the next kernel. I'm just not sure > > > > > we'll be able to fix this in time for 6.7 with the holidays and all > > > > > and I don't want to cause a lot of scheduler churn at the end of the > > > > > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > > > > > best short term solution. > > > > > > > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it > > > > opened > > > > the opportunity for a 1-to-1 relationship between an entity and a > > > > scheduler. > > > > (Should've always been the case, from the outset. Not sure why it was > > > > coded as > > > > a fixed-size array.) > > > > > > > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the > > > > problem > > > > is wholly contained in amdgpu, and no other driver has this problem, > > > > there is > > > > no reason to have to "churn", i.e. go back and forth in DRM, only to > > > > cover up > > > > an init bug in amdgpu. See the response I just sent in @this thread: > > > > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com > > > > > > > > And it's not like this issue is unknown. I first posted about it on > > > > 2023-10-16. > > > > > > > > Ideally, amdgpu would just fix their init code. > > > > > > You can't make changes to core code that break other drivers. > > > Arguably 56e449603f0ac5 should not have gone in in the first place if > > > it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init > > > code, but as a side effect it seems to have broken suspend for some > > > users. > > > > > > Alex
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 10:47 PM Luben Tuikov wrote: > > On 2023-11-29 22:36, Luben Tuikov wrote: > > On 2023-11-29 15:49, Alex Deucher wrote: > >> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: > >>> > >>> Actually I think I see the problem. I'll try and send out a patch > >>> later today to test. > >> > >> Does the attached patch fix it? > > > > Thanks for the patch, Alex. > > > > Is it possible for AMD to also reproduce this issue and test this patch on > > a Navi23 system? > > > >> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001 > >> From: Alex Deucher > >> Date: Wed, 29 Nov 2023 15:44:25 -0500 > >> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend > >> > >> We need to make disable this after the last eviction > > > > "make disable" --> "disable" > > > >> call, but before we disable the SDMA IP. > >> > >> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level") > >> Link: > >> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html > > > > Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net > > > > Let's link the start of the thread. > > > > Regards, > > Luben > > > >> Signed-off-by: Alex Deucher > >> Cc: Phillip Susi > >> Cc: Luben Tuikov > >> --- > >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- > >> 1 file changed, 2 insertions(+), 2 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> index b5edf40b5d03..78553e027db4 100644 > >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > >> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, > >> bool fbcon) > >> > >> amdgpu_ras_suspend(adev); > >> > >> -amdgpu_ttm_set_buffer_funcs_status(adev, false); > >> - > >> amdgpu_device_ip_suspend_phase1(adev); > >> > >> if (!adev->in_s0ix) > >> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, > >> bool fbcon) > >> if (r) > >> return r; > >> > >> +amdgpu_ttm_set_buffer_funcs_status(adev, false); > >> + > > If you're moving this past phase 1, there's another instance in > amdgpu_device_ip_suspend(), > which may need to be moved down. I think that one should be ok since we don't do any evictions in amdgpu_device_ip_suspend(). Alex > > Regards, > Luben > > >> amdgpu_fence_driver_hw_fini(adev); > >> > >> amdgpu_device_ip_suspend_phase2(adev); > > > >> > >> Alex > >> > >>> > >>> Alex > >>> > >>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher > >>> wrote: > > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov > wrote: > > > > On 2023-11-29 10:22, Alex Deucher wrote: > >> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > >> wrote: > >>> > >>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov > >>> wrote: > > On 2023-11-28 17:13, Alex Deucher wrote: > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > > wrote: > >> > >> Alex Deucher writes: > >> > In that case those are the already known problems with the > scheduler > changes, aren't they? > >>> > >>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe > >>> I'm > >>> misunderstanding what the original report was actually testing. > >>> If it > >>> was 6.7, then try reverting: > >>> 56e449603f0ac580700621a356d35d5716a62ce5 > >>> b70438004a14f4d0f9890b3297cd66248728546c > >> > >> At some point it was suggested that I file a gitlab issue, but I > >> took > >> this to mean it was already known and being worked on. -rc3 came > >> out > >> today and still has the problem. Is there a known issue I could > >> track? > >> > > > > At this point, unless there are any objections, I think we should > > just > > revert the two patches > Uhm, no. > > Why "the two" patches? > > This email, part of this thread, > > https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > clearly states that reverting *only* this commit, > 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable > number of run-queues > *does not* mitigate the failed suspend. (Furthermore, this commit > doesn't really change > anything operational, other than using an allocated array, instead > of a static one, in DRM, > while the 2nd patch is solely contained within the amdgpu driver > code.) > > Leaving us with only this change, > b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > to be at fault, as the kernel log attached in the linked email above
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 10:36 PM Luben Tuikov wrote: > > On 2023-11-29 15:49, Alex Deucher wrote: > > On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: > >> > >> Actually I think I see the problem. I'll try and send out a patch > >> later today to test. > > > > Does the attached patch fix it? > > Thanks for the patch, Alex. > > Is it possible for AMD to also reproduce this issue and test this patch on a > Navi23 system? I haven't had a chance to dig into it much due to LPC and thanksgiving and other end of year stuff. > > > From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001 > > From: Alex Deucher > > Date: Wed, 29 Nov 2023 15:44:25 -0500 > > Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend > > > > We need to make disable this after the last eviction > > "make disable" --> "disable" > > > call, but before we disable the SDMA IP. > > > > Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level") > > Link: > > https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html > > Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net > > Let's link the start of the thread. Thanks, I will update the patch. Alex > > Regards, > Luben > > > Signed-off-by: Alex Deucher > > Cc: Phillip Susi > > Cc: Luben Tuikov > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > index b5edf40b5d03..78553e027db4 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > > @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, > > bool fbcon) > > > > amdgpu_ras_suspend(adev); > > > > - amdgpu_ttm_set_buffer_funcs_status(adev, false); > > - > > amdgpu_device_ip_suspend_phase1(adev); > > > > if (!adev->in_s0ix) > > @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, > > bool fbcon) > > if (r) > > return r; > > > > + amdgpu_ttm_set_buffer_funcs_status(adev, false); > > + > > amdgpu_fence_driver_hw_fini(adev); > > > > amdgpu_device_ip_suspend_phase2(adev); > > > > > Alex > > > >> > >> Alex > >> > >> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: > >>> > >>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > On 2023-11-29 10:22, Alex Deucher wrote: > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > > wrote: > >> > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov > >> wrote: > >>> > >>> On 2023-11-28 17:13, Alex Deucher wrote: > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > wrote: > > > > Alex Deucher writes: > > > >>> In that case those are the already known problems with the > >>> scheduler > >>> changes, aren't they? > >> > >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > >> misunderstanding what the original report was actually testing. > >> If it > >> was 6.7, then try reverting: > >> 56e449603f0ac580700621a356d35d5716a62ce5 > >> b70438004a14f4d0f9890b3297cd66248728546c > > > > At some point it was suggested that I file a gitlab issue, but I > > took > > this to mean it was already known and being worked on. -rc3 came > > out > > today and still has the problem. Is there a known issue I could > > track? > > > > At this point, unless there are any objections, I think we should > just > revert the two patches > >>> Uhm, no. > >>> > >>> Why "the two" patches? > >>> > >>> This email, part of this thread, > >>> > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > >>> > >>> clearly states that reverting *only* this commit, > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable > >>> number of run-queues > >>> *does not* mitigate the failed suspend. (Furthermore, this commit > >>> doesn't really change > >>> anything operational, other than using an allocated array, instead of > >>> a static one, in DRM, > >>> while the 2nd patch is solely contained within the amdgpu driver > >>> code.) > >>> > >>> Leaving us with only this change, > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > >>> to be at fault, as the kernel log attached in the linked email above > >>> shows. > >>> > >>> The conclusion is that only b70438004a14f4 needs reverting. > >> > >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > >> 56e449603f0ac5 breaks amdgpu. > > > > We can try and re-enable it in the next kernel. I'm just not sure > > we'll be able to
Re: Radeon regression in 6.6 kernel
On 2023-11-29 22:36, Luben Tuikov wrote: > On 2023-11-29 15:49, Alex Deucher wrote: >> On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: >>> >>> Actually I think I see the problem. I'll try and send out a patch >>> later today to test. >> >> Does the attached patch fix it? > > Thanks for the patch, Alex. > > Is it possible for AMD to also reproduce this issue and test this patch on a > Navi23 system? > >> From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001 >> From: Alex Deucher >> Date: Wed, 29 Nov 2023 15:44:25 -0500 >> Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend >> >> We need to make disable this after the last eviction > > "make disable" --> "disable" > >> call, but before we disable the SDMA IP. >> >> Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level") >> Link: >> https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html > > Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net > > Let's link the start of the thread. > > Regards, > Luben > >> Signed-off-by: Alex Deucher >> Cc: Phillip Susi >> Cc: Luben Tuikov >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> index b5edf40b5d03..78553e027db4 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >> @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool >> fbcon) >> >> amdgpu_ras_suspend(adev); >> >> -amdgpu_ttm_set_buffer_funcs_status(adev, false); >> - >> amdgpu_device_ip_suspend_phase1(adev); >> >> if (!adev->in_s0ix) >> @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool >> fbcon) >> if (r) >> return r; >> >> +amdgpu_ttm_set_buffer_funcs_status(adev, false); >> + If you're moving this past phase 1, there's another instance in amdgpu_device_ip_suspend(), which may need to be moved down. Regards, Luben >> amdgpu_fence_driver_hw_fini(adev); >> >> amdgpu_device_ip_suspend_phase2(adev); > >> >> Alex >> >>> >>> Alex >>> >>> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > On 2023-11-29 10:22, Alex Deucher wrote: >> On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher >> wrote: >>> >>> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov >>> wrote: On 2023-11-28 17:13, Alex Deucher wrote: > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > wrote: >> >> Alex Deucher writes: >> In that case those are the already known problems with the scheduler changes, aren't they? >>> >>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm >>> misunderstanding what the original report was actually testing. If >>> it >>> was 6.7, then try reverting: >>> 56e449603f0ac580700621a356d35d5716a62ce5 >>> b70438004a14f4d0f9890b3297cd66248728546c >> >> At some point it was suggested that I file a gitlab issue, but I took >> this to mean it was already known and being worked on. -rc3 came out >> today and still has the problem. Is there a known issue I could >> track? >> > > At this point, unless there are any objections, I think we should just > revert the two patches Uhm, no. Why "the two" patches? This email, part of this thread, https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ clearly states that reverting *only* this commit, 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues *does not* mitigate the failed suspend. (Furthermore, this commit doesn't really change anything operational, other than using an allocated array, instead of a static one, in DRM, while the 2nd patch is solely contained within the amdgpu driver code.) Leaving us with only this change, b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level to be at fault, as the kernel log attached in the linked email above shows. The conclusion is that only b70438004a14f4 needs reverting. >>> >>> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, >>> 56e449603f0ac5 breaks amdgpu. >> >> We can try and re-enable it in the next kernel. I'm just not sure >> we'll be able to fix this in time for 6.7 with the holidays and all >> and I don't want to cause a lot of scheduler churn at the end of the >> 6.7 cycle if we hold off and try and f
Re: Radeon regression in 6.6 kernel
On 2023-11-29 15:49, Alex Deucher wrote: > On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: >> >> Actually I think I see the problem. I'll try and send out a patch >> later today to test. > > Does the attached patch fix it? Thanks for the patch, Alex. Is it possible for AMD to also reproduce this issue and test this patch on a Navi23 system? > From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001 > From: Alex Deucher > Date: Wed, 29 Nov 2023 15:44:25 -0500 > Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend > > We need to make disable this after the last eviction "make disable" --> "disable" > call, but before we disable the SDMA IP. > > Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level") > Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html Link: https://lore.kernel.org/r/87edgv4x3i@vps.thesusis.net Let's link the start of the thread. Regards, Luben > Signed-off-by: Alex Deucher > Cc: Phillip Susi > Cc: Luben Tuikov > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > index b5edf40b5d03..78553e027db4 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c > @@ -4531,8 +4531,6 @@ int amdgpu_device_suspend(struct drm_device *dev, bool > fbcon) > > amdgpu_ras_suspend(adev); > > - amdgpu_ttm_set_buffer_funcs_status(adev, false); > - > amdgpu_device_ip_suspend_phase1(adev); > > if (!adev->in_s0ix) > @@ -4542,6 +4540,8 @@ int amdgpu_device_suspend(struct drm_device *dev, bool > fbcon) > if (r) > return r; > > + amdgpu_ttm_set_buffer_funcs_status(adev, false); > + > amdgpu_fence_driver_hw_fini(adev); > > amdgpu_device_ip_suspend_phase2(adev); > > Alex > >> >> Alex >> >> On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: >>> >>> On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: On 2023-11-29 10:22, Alex Deucher wrote: > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > wrote: >> >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov >> wrote: >>> >>> On 2023-11-28 17:13, Alex Deucher wrote: On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > Alex Deucher writes: > >>> In that case those are the already known problems with the scheduler >>> changes, aren't they? >> >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm >> misunderstanding what the original report was actually testing. If >> it >> was 6.7, then try reverting: >> 56e449603f0ac580700621a356d35d5716a62ce5 >> b70438004a14f4d0f9890b3297cd66248728546c > > At some point it was suggested that I file a gitlab issue, but I took > this to mean it was already known and being worked on. -rc3 came out > today and still has the problem. Is there a known issue I could > track? > At this point, unless there are any objections, I think we should just revert the two patches >>> Uhm, no. >>> >>> Why "the two" patches? >>> >>> This email, part of this thread, >>> >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ >>> >>> clearly states that reverting *only* this commit, >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number >>> of run-queues >>> *does not* mitigate the failed suspend. (Furthermore, this commit >>> doesn't really change >>> anything operational, other than using an allocated array, instead of a >>> static one, in DRM, >>> while the 2nd patch is solely contained within the amdgpu driver code.) >>> >>> Leaving us with only this change, >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level >>> to be at fault, as the kernel log attached in the linked email above >>> shows. >>> >>> The conclusion is that only b70438004a14f4 needs reverting. >> >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, >> 56e449603f0ac5 breaks amdgpu. > > We can try and re-enable it in the next kernel. I'm just not sure > we'll be able to fix this in time for 6.7 with the holidays and all > and I don't want to cause a lot of scheduler churn at the end of the > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > best short term solution. A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened the opportunity for a 1-to-1 relationship between an entity and a scheduler. (Should've always been the case, from the outset. Not sure why it was coded as a f
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 3:10 PM Alex Deucher wrote: > > Actually I think I see the problem. I'll try and send out a patch > later today to test. Does the attached patch fix it? Alex > > Alex > > On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: > > > > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > > > > > On 2023-11-29 10:22, Alex Deucher wrote: > > > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > > > > wrote: > > > >> > > > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov > > > >> wrote: > > > >>> > > > >>> On 2023-11-28 17:13, Alex Deucher wrote: > > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > > > wrote: > > > > > > > > Alex Deucher writes: > > > > > > > >>> In that case those are the already known problems with the > > > >>> scheduler > > > >>> changes, aren't they? > > > >> > > > >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > > >> misunderstanding what the original report was actually testing. > > > >> If it > > > >> was 6.7, then try reverting: > > > >> 56e449603f0ac580700621a356d35d5716a62ce5 > > > >> b70438004a14f4d0f9890b3297cd66248728546c > > > > > > > > At some point it was suggested that I file a gitlab issue, but I > > > > took > > > > this to mean it was already known and being worked on. -rc3 came > > > > out > > > > today and still has the problem. Is there a known issue I could > > > > track? > > > > > > > > > > At this point, unless there are any objections, I think we should > > > just > > > revert the two patches > > > >>> Uhm, no. > > > >>> > > > >>> Why "the two" patches? > > > >>> > > > >>> This email, part of this thread, > > > >>> > > > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > > >>> > > > >>> clearly states that reverting *only* this commit, > > > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable > > > >>> number of run-queues > > > >>> *does not* mitigate the failed suspend. (Furthermore, this commit > > > >>> doesn't really change > > > >>> anything operational, other than using an allocated array, instead of > > > >>> a static one, in DRM, > > > >>> while the 2nd patch is solely contained within the amdgpu driver > > > >>> code.) > > > >>> > > > >>> Leaving us with only this change, > > > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > > > >>> to be at fault, as the kernel log attached in the linked email above > > > >>> shows. > > > >>> > > > >>> The conclusion is that only b70438004a14f4 needs reverting. > > > >> > > > >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > > > >> 56e449603f0ac5 breaks amdgpu. > > > > > > > > We can try and re-enable it in the next kernel. I'm just not sure > > > > we'll be able to fix this in time for 6.7 with the holidays and all > > > > and I don't want to cause a lot of scheduler churn at the end of the > > > > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > > > > best short term solution. > > > > > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it > > > opened > > > the opportunity for a 1-to-1 relationship between an entity and a > > > scheduler. > > > (Should've always been the case, from the outset. Not sure why it was > > > coded as > > > a fixed-size array.) > > > > > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the > > > problem > > > is wholly contained in amdgpu, and no other driver has this problem, > > > there is > > > no reason to have to "churn", i.e. go back and forth in DRM, only to > > > cover up > > > an init bug in amdgpu. See the response I just sent in @this thread: > > > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com > > > > > > And it's not like this issue is unknown. I first posted about it on > > > 2023-10-16. > > > > > > Ideally, amdgpu would just fix their init code. > > > > You can't make changes to core code that break other drivers. > > Arguably 56e449603f0ac5 should not have gone in in the first place if > > it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init > > code, but as a side effect it seems to have broken suspend for some > > users. > > > > Alex From 96e75b5218f7a124eafa53853681eef8fe567ab8 Mon Sep 17 00:00:00 2001 From: Alex Deucher Date: Wed, 29 Nov 2023 15:44:25 -0500 Subject: [PATCH] drm/amdgpu: fix buffer funcs setting order on suspend We need to make disable this after the last eviction call, but before we disable the SDMA IP. Fixes: b70438004a14 ("drm/amdgpu: move buffer funcs setting up a level") Link: https://lists.freedesktop.org/archives/amd-gfx/2023-November/101197.html Signed-off-by: Alex Deucher Cc: Phillip Susi Cc: Luben Tuikov --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/
Re: Radeon regression in 6.6 kernel
Luben Tuikov writes: > I remember that the problem was really that amdgpu called > drm_sched_entity_init(), > in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized > the scheduler > used therein. For instance, the code before commit b70438004a14f4, looked > like this: > > sched = &ring->sched; <-- LT: No > one has initialized this scheduler > r = drm_sched_entity_init(&adev->mman.entity, <-- Oopses, > now that sched->sched_rq is not > Before commit 56e449603f0ac5, amdgpu was getting away with this, because the > sched->sched_rq > was a static array. > > Ideally, amdgpu code would be fixed. This sounds like an initilization problem that resulted in an OOPS at boot time, but I don't remember seeing that. I just get a failure on system suspend.
Re: Radeon regression in 6.6 kernel
Actually I think I see the problem. I'll try and send out a patch later today to test. Alex On Wed, Nov 29, 2023 at 1:52 PM Alex Deucher wrote: > > On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > > > On 2023-11-29 10:22, Alex Deucher wrote: > > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher > > > wrote: > > >> > > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov > > >> wrote: > > >>> > > >>> On 2023-11-28 17:13, Alex Deucher wrote: > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi > > wrote: > > > > > > Alex Deucher writes: > > > > > >>> In that case those are the already known problems with the scheduler > > >>> changes, aren't they? > > >> > > >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > >> misunderstanding what the original report was actually testing. If > > >> it > > >> was 6.7, then try reverting: > > >> 56e449603f0ac580700621a356d35d5716a62ce5 > > >> b70438004a14f4d0f9890b3297cd66248728546c > > > > > > At some point it was suggested that I file a gitlab issue, but I took > > > this to mean it was already known and being worked on. -rc3 came out > > > today and still has the problem. Is there a known issue I could > > > track? > > > > > > > At this point, unless there are any objections, I think we should just > > revert the two patches > > >>> Uhm, no. > > >>> > > >>> Why "the two" patches? > > >>> > > >>> This email, part of this thread, > > >>> > > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > >>> > > >>> clearly states that reverting *only* this commit, > > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number > > >>> of run-queues > > >>> *does not* mitigate the failed suspend. (Furthermore, this commit > > >>> doesn't really change > > >>> anything operational, other than using an allocated array, instead of a > > >>> static one, in DRM, > > >>> while the 2nd patch is solely contained within the amdgpu driver code.) > > >>> > > >>> Leaving us with only this change, > > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > > >>> to be at fault, as the kernel log attached in the linked email above > > >>> shows. > > >>> > > >>> The conclusion is that only b70438004a14f4 needs reverting. > > >> > > >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > > >> 56e449603f0ac5 breaks amdgpu. > > > > > > We can try and re-enable it in the next kernel. I'm just not sure > > > we'll be able to fix this in time for 6.7 with the holidays and all > > > and I don't want to cause a lot of scheduler churn at the end of the > > > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > > > best short term solution. > > > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it > > opened > > the opportunity for a 1-to-1 relationship between an entity and a scheduler. > > (Should've always been the case, from the outset. Not sure why it was coded > > as > > a fixed-size array.) > > > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the > > problem > > is wholly contained in amdgpu, and no other driver has this problem, there > > is > > no reason to have to "churn", i.e. go back and forth in DRM, only to cover > > up > > an init bug in amdgpu. See the response I just sent in @this thread: > > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com > > > > And it's not like this issue is unknown. I first posted about it on > > 2023-10-16. > > > > Ideally, amdgpu would just fix their init code. > > You can't make changes to core code that break other drivers. > Arguably 56e449603f0ac5 should not have gone in in the first place if > it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init > code, but as a side effect it seems to have broken suspend for some > users. > > Alex
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 11:41 AM Luben Tuikov wrote: > > On 2023-11-29 10:22, Alex Deucher wrote: > > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher wrote: > >> > >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: > >>> > >>> On 2023-11-28 17:13, Alex Deucher wrote: > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > > > Alex Deucher writes: > > > >>> In that case those are the already known problems with the scheduler > >>> changes, aren't they? > >> > >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > >> misunderstanding what the original report was actually testing. If it > >> was 6.7, then try reverting: > >> 56e449603f0ac580700621a356d35d5716a62ce5 > >> b70438004a14f4d0f9890b3297cd66248728546c > > > > At some point it was suggested that I file a gitlab issue, but I took > > this to mean it was already known and being worked on. -rc3 came out > > today and still has the problem. Is there a known issue I could track? > > > > At this point, unless there are any objections, I think we should just > revert the two patches > >>> Uhm, no. > >>> > >>> Why "the two" patches? > >>> > >>> This email, part of this thread, > >>> > >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > >>> > >>> clearly states that reverting *only* this commit, > >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of > >>> run-queues > >>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't > >>> really change > >>> anything operational, other than using an allocated array, instead of a > >>> static one, in DRM, > >>> while the 2nd patch is solely contained within the amdgpu driver code.) > >>> > >>> Leaving us with only this change, > >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > >>> to be at fault, as the kernel log attached in the linked email above > >>> shows. > >>> > >>> The conclusion is that only b70438004a14f4 needs reverting. > >> > >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > >> 56e449603f0ac5 breaks amdgpu. > > > > We can try and re-enable it in the next kernel. I'm just not sure > > we'll be able to fix this in time for 6.7 with the holidays and all > > and I don't want to cause a lot of scheduler churn at the end of the > > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > > best short term solution. > > A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened > the opportunity for a 1-to-1 relationship between an entity and a scheduler. > (Should've always been the case, from the outset. Not sure why it was coded as > a fixed-size array.) > > Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the > problem > is wholly contained in amdgpu, and no other driver has this problem, there is > no reason to have to "churn", i.e. go back and forth in DRM, only to cover up > an init bug in amdgpu. See the response I just sent in @this thread: > https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com > > And it's not like this issue is unknown. I first posted about it on > 2023-10-16. > > Ideally, amdgpu would just fix their init code. You can't make changes to core code that break other drivers. Arguably 56e449603f0ac5 should not have gone in in the first place if it broke amdgpu. b70438004a14f4 was the code to fix amdgpu's init code, but as a side effect it seems to have broken suspend for some users. Alex
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 11:21 AM Luben Tuikov wrote: > > On 2023-11-29 08:50, Alex Deucher wrote: > > On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: > >> > >> On 2023-11-28 17:13, Alex Deucher wrote: > >>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > Alex Deucher writes: > > >> In that case those are the already known problems with the scheduler > >> changes, aren't they? > > > > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > misunderstanding what the original report was actually testing. If it > > was 6.7, then try reverting: > > 56e449603f0ac580700621a356d35d5716a62ce5 > > b70438004a14f4d0f9890b3297cd66248728546c > > At some point it was suggested that I file a gitlab issue, but I took > this to mean it was already known and being worked on. -rc3 came out > today and still has the problem. Is there a known issue I could track? > > >>> > >>> At this point, unless there are any objections, I think we should just > >>> revert the two patches > >> Uhm, no. > >> > >> Why "the two" patches? > >> > >> This email, part of this thread, > >> > >> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > >> > >> clearly states that reverting *only* this commit, > >> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of > >> run-queues > >> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't > >> really change > >> anything operational, other than using an allocated array, instead of a > >> static one, in DRM, > >> while the 2nd patch is solely contained within the amdgpu driver code.) > >> > >> Leaving us with only this change, > >> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > >> to be at fault, as the kernel log attached in the linked email above shows. > >> > >> The conclusion is that only b70438004a14f4 needs reverting. > > > > b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > > 56e449603f0ac5 breaks amdgpu. > > It doesn't "break" it, amdgpu just needs to be fixed. > > I know we put in a Fixes tag in > b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level" > pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable > number of run-queues", > but given the testing Phillip has done, the culprit is wholly contained in > the amdgpu driver code. > > No other driver has this problem since commit 56e449603f0ac5. > > The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a > level" should've ideally > pointed to an amdgpu-driver code commit only (perhaps an old-old commit), and > I was a bit uncomfortable > putting in a Fixes tag which pointed to drm code, but we did it so that the > amdgpu commit follows > the changes in DRM. In retrospect, the Fixes tag should've pointed to and > amdgpu-driver commit when > that the amdgpu code was originally written. > > I remember that the problem was really that amdgpu called > drm_sched_entity_init(), > in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized > the scheduler > used therein. For instance, the code before commit b70438004a14f4, looked > like this: > > void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool > enable) > { > struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev, > TTM_PL_VRAM); > uint64_t size; > int r; > > if (!adev->mman.initialized || amdgpu_in_reset(adev) || > adev->mman.buffer_funcs_enabled == enable) > return; > > if (enable) { > struct amdgpu_ring *ring; > struct drm_gpu_scheduler *sched; > > ring = adev->mman.buffer_funcs_ring; > sched = &ring->sched; <-- LT: No > one has initialized this scheduler > r = drm_sched_entity_init(&adev->mman.entity, <-- Oopses, > now that sched->sched_rq is not a static array > DRM_SCHED_PRIORITY_KERNEL, &sched, > 1, NULL); > if (r) { > DRM_ERROR("Failed setting up TTM BO move entity > (%d)\n", > r); > return; > } > > > Before commit 56e449603f0ac5, amdgpu was getting away with this, because the > sched->sched_rq > was a static array. > > Ideally, amdgpu code would be fixed. b70438004a14f4 was the amdgpu fix for this, but it appears to break suspend for some users. I'm not confident we can fix it in time for 6.7 final. Alex
Re: Radeon regression in 6.6 kernel
On 2023-11-29 10:22, Alex Deucher wrote: > On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher wrote: >> >> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: >>> >>> On 2023-11-28 17:13, Alex Deucher wrote: On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > Alex Deucher writes: > >>> In that case those are the already known problems with the scheduler >>> changes, aren't they? >> >> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm >> misunderstanding what the original report was actually testing. If it >> was 6.7, then try reverting: >> 56e449603f0ac580700621a356d35d5716a62ce5 >> b70438004a14f4d0f9890b3297cd66248728546c > > At some point it was suggested that I file a gitlab issue, but I took > this to mean it was already known and being worked on. -rc3 came out > today and still has the problem. Is there a known issue I could track? > At this point, unless there are any objections, I think we should just revert the two patches >>> Uhm, no. >>> >>> Why "the two" patches? >>> >>> This email, part of this thread, >>> >>> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ >>> >>> clearly states that reverting *only* this commit, >>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of >>> run-queues >>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't >>> really change >>> anything operational, other than using an allocated array, instead of a >>> static one, in DRM, >>> while the 2nd patch is solely contained within the amdgpu driver code.) >>> >>> Leaving us with only this change, >>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level >>> to be at fault, as the kernel log attached in the linked email above shows. >>> >>> The conclusion is that only b70438004a14f4 needs reverting. >> >> b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, >> 56e449603f0ac5 breaks amdgpu. > > We can try and re-enable it in the next kernel. I'm just not sure > we'll be able to fix this in time for 6.7 with the holidays and all > and I don't want to cause a lot of scheduler churn at the end of the > 6.7 cycle if we hold off and try and fix it. Reverting seems like the > best short term solution. A lot of subsequent code has come in since commit 56e449603f0ac5, as it opened the opportunity for a 1-to-1 relationship between an entity and a scheduler. (Should've always been the case, from the outset. Not sure why it was coded as a fixed-size array.) Given that commit 56e449603f0ac5 has nothing to do with amdgpu, and the problem is wholly contained in amdgpu, and no other driver has this problem, there is no reason to have to "churn", i.e. go back and forth in DRM, only to cover up an init bug in amdgpu. See the response I just sent in @this thread: https://lore.kernel.org/r/05007cb0-871e-4dc7-af58-1351f4ba4...@gmail.com And it's not like this issue is unknown. I first posted about it on 2023-10-16. Ideally, amdgpu would just fix their init code. -- Regards, Luben OpenPGP_0x4C15479431A334AF.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: Radeon regression in 6.6 kernel
On 2023-11-29 08:50, Alex Deucher wrote: > On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: >> >> On 2023-11-28 17:13, Alex Deucher wrote: >>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: Alex Deucher writes: >> In that case those are the already known problems with the scheduler >> changes, aren't they? > > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > misunderstanding what the original report was actually testing. If it > was 6.7, then try reverting: > 56e449603f0ac580700621a356d35d5716a62ce5 > b70438004a14f4d0f9890b3297cd66248728546c At some point it was suggested that I file a gitlab issue, but I took this to mean it was already known and being worked on. -rc3 came out today and still has the problem. Is there a known issue I could track? >>> >>> At this point, unless there are any objections, I think we should just >>> revert the two patches >> Uhm, no. >> >> Why "the two" patches? >> >> This email, part of this thread, >> >> https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ >> >> clearly states that reverting *only* this commit, >> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of >> run-queues >> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't >> really change >> anything operational, other than using an allocated array, instead of a >> static one, in DRM, >> while the 2nd patch is solely contained within the amdgpu driver code.) >> >> Leaving us with only this change, >> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level >> to be at fault, as the kernel log attached in the linked email above shows. >> >> The conclusion is that only b70438004a14f4 needs reverting. > > b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > 56e449603f0ac5 breaks amdgpu. It doesn't "break" it, amdgpu just needs to be fixed. I know we put in a Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level" pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable number of run-queues", but given the testing Phillip has done, the culprit is wholly contained in the amdgpu driver code. No other driver has this problem since commit 56e449603f0ac5. The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level" should've ideally pointed to an amdgpu-driver code commit only (perhaps an old-old commit), and I was a bit uncomfortable putting in a Fixes tag which pointed to drm code, but we did it so that the amdgpu commit follows the changes in DRM. In retrospect, the Fixes tag should've pointed to and amdgpu-driver commit when that the amdgpu code was originally written. I remember that the problem was really that amdgpu called drm_sched_entity_init(), in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized the scheduler used therein. For instance, the code before commit b70438004a14f4, looked like this: void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable) { struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM); uint64_t size; int r; if (!adev->mman.initialized || amdgpu_in_reset(adev) || adev->mman.buffer_funcs_enabled == enable) return; if (enable) { struct amdgpu_ring *ring; struct drm_gpu_scheduler *sched; ring = adev->mman.buffer_funcs_ring; sched = &ring->sched; <-- LT: No one has initialized this scheduler r = drm_sched_entity_init(&adev->mman.entity, <-- Oopses, now that sched->sched_rq is not a static array DRM_SCHED_PRIORITY_KERNEL, &sched, 1, NULL); if (r) { DRM_ERROR("Failed setting up TTM BO move entity (%d)\n", r); return; } Before commit 56e449603f0ac5, amdgpu was getting away with this, because the sched->sched_rq was a static array. Ideally, amdgpu code would be fixed. -- Regards, Luben OpenPGP_0x4C15479431A334AF.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: Radeon regression in 6.6 kernel
On Wed, Nov 29, 2023 at 8:50 AM Alex Deucher wrote: > > On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: > > > > On 2023-11-28 17:13, Alex Deucher wrote: > > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > >> > > >> Alex Deucher writes: > > >> > > In that case those are the already known problems with the scheduler > > changes, aren't they? > > >>> > > >>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > >>> misunderstanding what the original report was actually testing. If it > > >>> was 6.7, then try reverting: > > >>> 56e449603f0ac580700621a356d35d5716a62ce5 > > >>> b70438004a14f4d0f9890b3297cd66248728546c > > >> > > >> At some point it was suggested that I file a gitlab issue, but I took > > >> this to mean it was already known and being worked on. -rc3 came out > > >> today and still has the problem. Is there a known issue I could track? > > >> > > > > > > At this point, unless there are any objections, I think we should just > > > revert the two patches > > Uhm, no. > > > > Why "the two" patches? > > > > This email, part of this thread, > > > > https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > > > clearly states that reverting *only* this commit, > > 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of > > run-queues > > *does not* mitigate the failed suspend. (Furthermore, this commit doesn't > > really change > > anything operational, other than using an allocated array, instead of a > > static one, in DRM, > > while the 2nd patch is solely contained within the amdgpu driver code.) > > > > Leaving us with only this change, > > b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > > to be at fault, as the kernel log attached in the linked email above shows. > > > > The conclusion is that only b70438004a14f4 needs reverting. > > b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, > 56e449603f0ac5 breaks amdgpu. We can try and re-enable it in the next kernel. I'm just not sure we'll be able to fix this in time for 6.7 with the holidays and all and I don't want to cause a lot of scheduler churn at the end of the 6.7 cycle if we hold off and try and fix it. Reverting seems like the best short term solution. Alex > > Alex
Re: Radeon regression in 6.6 kernel
On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov wrote: > > On 2023-11-28 17:13, Alex Deucher wrote: > > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > >> > >> Alex Deucher writes: > >> > In that case those are the already known problems with the scheduler > changes, aren't they? > >>> > >>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > >>> misunderstanding what the original report was actually testing. If it > >>> was 6.7, then try reverting: > >>> 56e449603f0ac580700621a356d35d5716a62ce5 > >>> b70438004a14f4d0f9890b3297cd66248728546c > >> > >> At some point it was suggested that I file a gitlab issue, but I took > >> this to mean it was already known and being worked on. -rc3 came out > >> today and still has the problem. Is there a known issue I could track? > >> > > > > At this point, unless there are any objections, I think we should just > > revert the two patches > Uhm, no. > > Why "the two" patches? > > This email, part of this thread, > > https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ > > clearly states that reverting *only* this commit, > 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of > run-queues > *does not* mitigate the failed suspend. (Furthermore, this commit doesn't > really change > anything operational, other than using an allocated array, instead of a > static one, in DRM, > while the 2nd patch is solely contained within the amdgpu driver code.) > > Leaving us with only this change, > b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level > to be at fault, as the kernel log attached in the linked email above shows. > > The conclusion is that only b70438004a14f4 needs reverting. b70438004a14f4 was a fix for 56e449603f0ac5. Without b70438004a14f4, 56e449603f0ac5 breaks amdgpu. Alex
Re: Radeon regression in 6.6 kernel
On 2023-11-28 17:13, Alex Deucher wrote: > On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: >> >> Alex Deucher writes: >> In that case those are the already known problems with the scheduler changes, aren't they? >>> >>> Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm >>> misunderstanding what the original report was actually testing. If it >>> was 6.7, then try reverting: >>> 56e449603f0ac580700621a356d35d5716a62ce5 >>> b70438004a14f4d0f9890b3297cd66248728546c >> >> At some point it was suggested that I file a gitlab issue, but I took >> this to mean it was already known and being worked on. -rc3 came out >> today and still has the problem. Is there a known issue I could track? >> > > At this point, unless there are any objections, I think we should just > revert the two patches Uhm, no. Why "the two" patches? This email, part of this thread, https://lore.kernel.org/all/87r0kircdo@vps.thesusis.net/ clearly states that reverting *only* this commit, 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues *does not* mitigate the failed suspend. (Furthermore, this commit doesn't really change anything operational, other than using an allocated array, instead of a static one, in DRM, while the 2nd patch is solely contained within the amdgpu driver code.) Leaving us with only this change, b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level to be at fault, as the kernel log attached in the linked email above shows. The conclusion is that only b70438004a14f4 needs reverting. -- Regards, Luben OpenPGP_0x4C15479431A334AF.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: Radeon regression in 6.6 kernel
On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi wrote: > > Alex Deucher writes: > > >> In that case those are the already known problems with the scheduler > >> changes, aren't they? > > > > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > misunderstanding what the original report was actually testing. If it > > was 6.7, then try reverting: > > 56e449603f0ac580700621a356d35d5716a62ce5 > > b70438004a14f4d0f9890b3297cd66248728546c > > At some point it was suggested that I file a gitlab issue, but I took > this to mean it was already known and being worked on. -rc3 came out > today and still has the problem. Is there a known issue I could track? > At this point, unless there are any objections, I think we should just revert the two patches. Alex
Re: Radeon regression in 6.6 kernel
Alex Deucher writes: >> In that case those are the already known problems with the scheduler >> changes, aren't they? > > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > misunderstanding what the original report was actually testing. If it > was 6.7, then try reverting: > 56e449603f0ac580700621a356d35d5716a62ce5 > b70438004a14f4d0f9890b3297cd66248728546c At some point it was suggested that I file a gitlab issue, but I took this to mean it was already known and being worked on. -rc3 came out today and still has the problem. Is there a known issue I could track?
Re: Radeon regression in 6.6 kernel
On 2023-11-21 17:05, Phillip Susi wrote: > Alex Deucher writes: > >> Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it? >> Can you also attach your full dmesg log for the failed suspend? > > No, it doesn't. Here is the full syslog from the boot with only that > revert: > Thank you Phillip for verifying this. BTW, luben.tui...@amd.com should absolutely bounce for everyone sending emails to it. Not sure why it is still active. My new email is the one this email is coming from. -- Regards, Luben OpenPGP_0x4C15479431A334AF.asc Description: OpenPGP public key OpenPGP_signature.asc Description: OpenPGP digital signature
Re: Radeon regression in 6.6 kernel
On Mon, Nov 20, 2023 at 5:40 PM Phillip Susi wrote: > > Alex Deucher writes: > > > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > > misunderstanding what the original report was actually testing. If it > > was 6.7, then try reverting: > > 56e449603f0ac580700621a356d35d5716a62ce5 > > b70438004a14f4d0f9890b3297cd66248728546c > > I had been running v6.6-rc5 before pulling. It looks like that got me > somewhere between v6.6 and v6.7-rc1. Reverting those two commits fixes > it. Does reverting 56e449603f0ac580700621a356d35d5716a62ce5 alone fix it? Can you also attach your full dmesg log for the failed suspend? Alex
Re: Radeon regression in 6.6 kernel
Christian König writes: > Well none of the commits mentioned can affect radeon in any way. Radeon > simply doesn't use the scheduler. > > My suspicion is that the user is actually using amdgpu instead of > radeon. The switch potentially occurred accidentally, for example by > compiling amdgpu support for SI/CIK. Indeed, the lspci I originally posted does indicate amdgpu. What is the difference and should I switch it? If so, how? > Those amdgpu problems for older ASIC have already been worked on and > should be fixed by now. I just pulled v6.7-rc2 and it's still broken. I'll see if I can revert those 3 patches.
Re: Radeon regression in 6.6 kernel
Alex Deucher writes: > Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm > misunderstanding what the original report was actually testing. If it > was 6.7, then try reverting: > 56e449603f0ac580700621a356d35d5716a62ce5 > b70438004a14f4d0f9890b3297cd66248728546c I had been running v6.6-rc5 before pulling. It looks like that got me somewhere between v6.6 and v6.7-rc1. Reverting those two commits fixes it.
Re: Radeon regression in 6.6 kernel
On Mon, Nov 20, 2023 at 11:24 AM Christian König wrote: > > Am 20.11.23 um 17:08 schrieb Alex Deucher: > > On Mon, Nov 20, 2023 at 10:57 AM Christian König > > wrote: > >> Am 19.11.23 um 07:47 schrieb Dave Airlie: > On 12.11.23 01:46, Phillip Susi wrote: > > I had been testing some things on a post 6.6-rc5 kernel for a week or > > two and then when I pulled to a post 6.6 release kernel, I found that > > system suspend was broken. It seems that the radeon driver failed to > > suspend, leaving the display dead, the wayland display server hung, and > > the system still running. I have been trying to bisect it for the last > > few days and have only been able to narrow it down to the following 3 > > commits: > > > > There are only 'skip'ped commits left to test. > > The first bad commit could be any of: > > 56e449603f0ac580700621a356d35d5716a62ce5 > > c07bf1636f0005f9eb7956404490672286ea59d3 > > b70438004a14f4d0f9890b3297cd66248728546c > > We cannot bisect more! > Hmm, not a single reply from the amdgpu folks. Wondering how we can > encourage them to look into this. > > Phillip, reporting issues by mail should still work, but you might have > more luck here, as that's where the amdgpu afaics prefer to track bugs: > https://gitlab.freedesktop.org/drm/amd/-/issues > > When you file an issue there, please mention it here. > > Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which > comes out later today) or 6.6.2-rc1 improve things. > >>> It would also be good to test if reverting any of these is possible or > >>> not. > >> Well none of the commits mentioned can affect radeon in any way. Radeon > >> simply doesn't use the scheduler. > >> > >> My suspicion is that the user is actually using amdgpu instead of > >> radeon. The switch potentially occurred accidentally, for example by > >> compiling amdgpu support for SI/CIK. > >> > >> Those amdgpu problems for older ASIC have already been worked on and > >> should be fixed by now. > > In this case it's a navi23 (so radeon in the marketing sense). > > Thanks, couldn't find that in the mail thread. > > In that case those are the already known problems with the scheduler > changes, aren't they? Yes. Those changes went into 6.7 though, not 6.6 AFAIK. Maybe I'm misunderstanding what the original report was actually testing. If it was 6.7, then try reverting: 56e449603f0ac580700621a356d35d5716a62ce5 b70438004a14f4d0f9890b3297cd66248728546c Alex > > Christian. > > > > > Alex > > > >> Regards, > >> Christian. > >> > >>> File the gitlab issue and we should poke amd a but more to take a look. > >>> > >>> Dave. >
Re: Radeon regression in 6.6 kernel
Am 20.11.23 um 17:08 schrieb Alex Deucher: On Mon, Nov 20, 2023 at 10:57 AM Christian König wrote: Am 19.11.23 um 07:47 schrieb Dave Airlie: On 12.11.23 01:46, Phillip Susi wrote: I had been testing some things on a post 6.6-rc5 kernel for a week or two and then when I pulled to a post 6.6 release kernel, I found that system suspend was broken. It seems that the radeon driver failed to suspend, leaving the display dead, the wayland display server hung, and the system still running. I have been trying to bisect it for the last few days and have only been able to narrow it down to the following 3 commits: There are only 'skip'ped commits left to test. The first bad commit could be any of: 56e449603f0ac580700621a356d35d5716a62ce5 c07bf1636f0005f9eb7956404490672286ea59d3 b70438004a14f4d0f9890b3297cd66248728546c We cannot bisect more! Hmm, not a single reply from the amdgpu folks. Wondering how we can encourage them to look into this. Phillip, reporting issues by mail should still work, but you might have more luck here, as that's where the amdgpu afaics prefer to track bugs: https://gitlab.freedesktop.org/drm/amd/-/issues When you file an issue there, please mention it here. Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which comes out later today) or 6.6.2-rc1 improve things. It would also be good to test if reverting any of these is possible or not. Well none of the commits mentioned can affect radeon in any way. Radeon simply doesn't use the scheduler. My suspicion is that the user is actually using amdgpu instead of radeon. The switch potentially occurred accidentally, for example by compiling amdgpu support for SI/CIK. Those amdgpu problems for older ASIC have already been worked on and should be fixed by now. In this case it's a navi23 (so radeon in the marketing sense). Thanks, couldn't find that in the mail thread. In that case those are the already known problems with the scheduler changes, aren't they? Christian. Alex Regards, Christian. File the gitlab issue and we should poke amd a but more to take a look. Dave.
Re: Radeon regression in 6.6 kernel
On Mon, Nov 20, 2023 at 10:57 AM Christian König wrote: > > Am 19.11.23 um 07:47 schrieb Dave Airlie: > >> On 12.11.23 01:46, Phillip Susi wrote: > >>> I had been testing some things on a post 6.6-rc5 kernel for a week or > >>> two and then when I pulled to a post 6.6 release kernel, I found that > >>> system suspend was broken. It seems that the radeon driver failed to > >>> suspend, leaving the display dead, the wayland display server hung, and > >>> the system still running. I have been trying to bisect it for the last > >>> few days and have only been able to narrow it down to the following 3 > >>> commits: > >>> > >>> There are only 'skip'ped commits left to test. > >>> The first bad commit could be any of: > >>> 56e449603f0ac580700621a356d35d5716a62ce5 > >>> c07bf1636f0005f9eb7956404490672286ea59d3 > >>> b70438004a14f4d0f9890b3297cd66248728546c > >>> We cannot bisect more! > >> Hmm, not a single reply from the amdgpu folks. Wondering how we can > >> encourage them to look into this. > >> > >> Phillip, reporting issues by mail should still work, but you might have > >> more luck here, as that's where the amdgpu afaics prefer to track bugs: > >> https://gitlab.freedesktop.org/drm/amd/-/issues > >> > >> When you file an issue there, please mention it here. > >> > >> Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which > >> comes out later today) or 6.6.2-rc1 improve things. > > It would also be good to test if reverting any of these is possible or not. > > Well none of the commits mentioned can affect radeon in any way. Radeon > simply doesn't use the scheduler. > > My suspicion is that the user is actually using amdgpu instead of > radeon. The switch potentially occurred accidentally, for example by > compiling amdgpu support for SI/CIK. > > Those amdgpu problems for older ASIC have already been worked on and > should be fixed by now. In this case it's a navi23 (so radeon in the marketing sense). Alex > > Regards, > Christian. > > > > > File the gitlab issue and we should poke amd a but more to take a look. > > > > Dave. >
Re: Radeon regression in 6.6 kernel
Am 19.11.23 um 07:47 schrieb Dave Airlie: On 12.11.23 01:46, Phillip Susi wrote: I had been testing some things on a post 6.6-rc5 kernel for a week or two and then when I pulled to a post 6.6 release kernel, I found that system suspend was broken. It seems that the radeon driver failed to suspend, leaving the display dead, the wayland display server hung, and the system still running. I have been trying to bisect it for the last few days and have only been able to narrow it down to the following 3 commits: There are only 'skip'ped commits left to test. The first bad commit could be any of: 56e449603f0ac580700621a356d35d5716a62ce5 c07bf1636f0005f9eb7956404490672286ea59d3 b70438004a14f4d0f9890b3297cd66248728546c We cannot bisect more! Hmm, not a single reply from the amdgpu folks. Wondering how we can encourage them to look into this. Phillip, reporting issues by mail should still work, but you might have more luck here, as that's where the amdgpu afaics prefer to track bugs: https://gitlab.freedesktop.org/drm/amd/-/issues When you file an issue there, please mention it here. Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which comes out later today) or 6.6.2-rc1 improve things. It would also be good to test if reverting any of these is possible or not. Well none of the commits mentioned can affect radeon in any way. Radeon simply doesn't use the scheduler. My suspicion is that the user is actually using amdgpu instead of radeon. The switch potentially occurred accidentally, for example by compiling amdgpu support for SI/CIK. Those amdgpu problems for older ASIC have already been worked on and should be fixed by now. Regards, Christian. File the gitlab issue and we should poke amd a but more to take a look. Dave.
Re: Radeon regression in 6.6 kernel
On 11/19/23 20:48, Linux regression tracking (Thorsten Leemhuis) wrote: > On 19.11.23 14:24, Bagas Sanjaya wrote: >> Hi Dave, >> >> AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email") >> doesn't seem to do with this regression as it doesn't change any amdgpu code >> that may introduce the regression. > > Bagas, sorry for being blunt here, I know you mean well. But I feel the > need to say the following in the open, as this otherwise falls back on > me and regression tracking. > > Stating the above is not very helpful, as Dave for sure will know. > Telling Phillip that he likely can skip that commit might have been > something different. But I guess even for most users that are able to do > a bisection it's obvious and maybe not worth pointing out. > I was scratching my itch then when replying. Thanks anyway. -- An old man doll... just what I always wanted! - Clara
Re: Radeon regression in 6.6 kernel
On Sun, Nov 19, 2023 at 04:47:01PM +1000, Dave Airlie wrote: > > > > On 12.11.23 01:46, Phillip Susi wrote: > > > I had been testing some things on a post 6.6-rc5 kernel for a week or > > > two and then when I pulled to a post 6.6 release kernel, I found that > > > system suspend was broken. It seems that the radeon driver failed to > > > suspend, leaving the display dead, the wayland display server hung, and > > > the system still running. I have been trying to bisect it for the last > > > few days and have only been able to narrow it down to the following 3 > > > commits: > > > > > > There are only 'skip'ped commits left to test. > > > The first bad commit could be any of: > > > 56e449603f0ac580700621a356d35d5716a62ce5 > > > c07bf1636f0005f9eb7956404490672286ea59d3 > > > b70438004a14f4d0f9890b3297cd66248728546c > > > We cannot bisect more! > > > > Hmm, not a single reply from the amdgpu folks. Wondering how we can > > encourage them to look into this. > > > > Phillip, reporting issues by mail should still work, but you might have > > more luck here, as that's where the amdgpu afaics prefer to track bugs: > > https://gitlab.freedesktop.org/drm/amd/-/issues > > > > When you file an issue there, please mention it here. > > > > Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which > > comes out later today) or 6.6.2-rc1 improve things. > > It would also be good to test if reverting any of these is possible or not. > Hi Dave, AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email") doesn't seem to do with this regression as it doesn't change any amdgpu code that may introduce the regression. Thanks. -- An old man doll... just what I always wanted! - Clara signature.asc Description: PGP signature
Re: Radeon regression in 6.6 kernel
On 19.11.23 14:24, Bagas Sanjaya wrote: > On Sun, Nov 19, 2023 at 04:47:01PM +1000, Dave Airlie wrote: >>> On 12.11.23 01:46, Phillip Susi wrote: I had been testing some things on a post 6.6-rc5 kernel for a week or two and then when I pulled to a post 6.6 release kernel, I found that system suspend was broken. It seems that the radeon driver failed to suspend, leaving the display dead, the wayland display server hung, and the system still running. I have been trying to bisect it for the last few days and have only been able to narrow it down to the following 3 commits: There are only 'skip'ped commits left to test. The first bad commit could be any of: 56e449603f0ac580700621a356d35d5716a62ce5 c07bf1636f0005f9eb7956404490672286ea59d3 b70438004a14f4d0f9890b3297cd66248728546c We cannot bisect more! >>> >>> Hmm, not a single reply from the amdgpu folks. Wondering how we can >>> encourage them to look into this. >>> >>> Phillip, reporting issues by mail should still work, but you might have >>> more luck here, as that's where the amdgpu afaics prefer to track bugs: >>> https://gitlab.freedesktop.org/drm/amd/-/issues >>> >>> When you file an issue there, please mention it here. >>> >>> Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which >>> comes out later today) or 6.6.2-rc1 improve things. BTW, ignore the "6.6.2-rc1" here, I misunderstood one detail earlier. Sorry. >> It would also be good to test if reverting any of these is possible or not. Good point, sorry, forgot to mention that. > Hi Dave, > > AFAIK commit c07bf1636f0005 ("MAINTAINERS: Update the GPU Scheduler email") > doesn't seem to do with this regression as it doesn't change any amdgpu code > that may introduce the regression. Bagas, sorry for being blunt here, I know you mean well. But I feel the need to say the following in the open, as this otherwise falls back on me and regression tracking. Stating the above is not very helpful, as Dave for sure will know. Telling Phillip that he likely can skip that commit might have been something different. But I guess even for most users that are able to do a bisection it's obvious and maybe not worth pointing out. Ciao, Thorsten
Re: Radeon regression in 6.6 kernel
Lo! On 12.11.23 01:46, Phillip Susi wrote: > I had been testing some things on a post 6.6-rc5 kernel for a week or > two and then when I pulled to a post 6.6 release kernel, I found that > system suspend was broken. It seems that the radeon driver failed to > suspend, leaving the display dead, the wayland display server hung, and > the system still running. I have been trying to bisect it for the last > few days and have only been able to narrow it down to the following 3 > commits: > > There are only 'skip'ped commits left to test. > The first bad commit could be any of: > 56e449603f0ac580700621a356d35d5716a62ce5 > c07bf1636f0005f9eb7956404490672286ea59d3 > b70438004a14f4d0f9890b3297cd66248728546c > We cannot bisect more! Hmm, not a single reply from the amdgpu folks. Wondering how we can encourage them to look into this. Phillip, reporting issues by mail should still work, but you might have more luck here, as that's where the amdgpu afaics prefer to track bugs: https://gitlab.freedesktop.org/drm/amd/-/issues When you file an issue there, please mention it here. Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which comes out later today) or 6.6.2-rc1 improve things. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke > It appears that there was a late merge in the 6.6 window that originally > forked from the -rc2, as many of the later commits that I bisected had > that version number. > > I couldn't get it more narrowed down because I had to skip the > surrounding commits because they wouldn't even boot up to a gui desktop, > let alone try to suspend. > > When system suspend fails, I find the following in my syslog after I > have to magic-sysrq reboot because the the display is dead: > > Nov 11 18:44:39 faldara kernel: PM: suspend entry (deep) > Nov 11 18:44:39 faldara kernel: Filesystems sync: 0.035 seconds > Nov 11 18:44:40 faldara kernel: Freezing user space processes > Nov 11 18:44:40 faldara kernel: Freezing user space processes completed > (elapsed 0.001 seconds) > Nov 11 18:44:40 faldara kernel: OOM killer disabled. > Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks > Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks completed > (elapsed 0.001 seconds) > Nov 11 18:44:40 faldara kernel: printk: Suspending console(s) (use > no_console_suspend to debug) > Nov 11 18:44:40 faldara kernel: serial 00:01: disabled > Nov 11 18:44:40 faldara kernel: e1000e: EEE TX LPI TIMER: 0011 > Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Stopping disk > Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Stopping disk > Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Stopping disk > Nov 11 18:44:40 faldara kernel: amdgpu: Move buffer fallback to memcpy > unavailable > Nov 11 18:44:40 faldara kernel: [TTM] Buffer eviction failed > Nov 11 18:44:40 faldara kernel: [drm] evicting device resources failed > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: pci_pm_suspend(): > amdgpu_pmops_suspend+0x0/0x80 [amdgpu] returns -19 > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: dpm_run_callback(): > pci_pm_suspend+0x0/0x170 returns -19 > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: failed to suspend > async: error -19 > Nov 11 18:44:40 faldara kernel: PM: Some devices failed to suspend, or early > wake event detected > Nov 11 18:44:40 faldara kernel: xhci_hcd :06:00.0: xHC error in resume, > USBSTS 0x401, Reinit > Nov 11 18:44:40 faldara kernel: usb usb3: root hub lost power or was reset > Nov 11 18:44:40 faldara kernel: usb usb4: root hub lost power or was reset > Nov 11 18:44:40 faldara kernel: serial 00:01: activated > Nov 11 18:44:40 faldara kernel: nvme nvme0: 4/0/0 default/read/poll queues > Nov 11 18:44:40 faldara kernel: ata8: SATA link down (SStatus 0 SControl 300) > Nov 11 18:44:40 faldara kernel: ata7: SATA link down (SStatus 0 SControl 300) > Nov 11 18:44:40 faldara kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 > SControl 300) > Nov 11 18:44:40 faldara kernel: ata1: SATA link down (SStatus 4 SControl 300) > Nov 11 18:44:40 faldara kernel: ata3: SATA link down (SStatus 4 SControl 300) > Nov 11 18:44:40 faldara kernel: ata4.00: configured for UDMA/133 > Nov 11 18:44:40 faldara kernel: OOM killer enabled. > Nov 11 18:44:40 faldara kernel: Restarting tasks ... done. > Nov 11 18:44:40 faldara kernel: random: crng reseeded on system resumption > Nov 11 18:44:40 faldara kernel: PM: suspend exit > Nov 11 18:44:40 faldara kernel: PM: suspend entry (s2idle)
Re: Radeon regression in 6.6 kernel
> > On 12.11.23 01:46, Phillip Susi wrote: > > I had been testing some things on a post 6.6-rc5 kernel for a week or > > two and then when I pulled to a post 6.6 release kernel, I found that > > system suspend was broken. It seems that the radeon driver failed to > > suspend, leaving the display dead, the wayland display server hung, and > > the system still running. I have been trying to bisect it for the last > > few days and have only been able to narrow it down to the following 3 > > commits: > > > > There are only 'skip'ped commits left to test. > > The first bad commit could be any of: > > 56e449603f0ac580700621a356d35d5716a62ce5 > > c07bf1636f0005f9eb7956404490672286ea59d3 > > b70438004a14f4d0f9890b3297cd66248728546c > > We cannot bisect more! > > Hmm, not a single reply from the amdgpu folks. Wondering how we can > encourage them to look into this. > > Phillip, reporting issues by mail should still work, but you might have > more luck here, as that's where the amdgpu afaics prefer to track bugs: > https://gitlab.freedesktop.org/drm/amd/-/issues > > When you file an issue there, please mention it here. > > Furthermore it might help if you could verify if 6.7-rc1 (or rc2, which > comes out later today) or 6.6.2-rc1 improve things. It would also be good to test if reverting any of these is possible or not. File the gitlab issue and we should poke amd a but more to take a look. Dave.
Re: Radeon regression in 6.6 kernel
Bagas Sanjaya writes: > Please show the full bisect log, and also tell why these commits are > skipped. Two of them would not compile and one would not boot. Here's the log. # bad: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu # good: [94f6f0550c625fab1f373bb86a6669b45e9748b3] Linux 6.6-rc5 git bisect start 'HEAD' 'v6.6-rc5' # good: [8bc9e6515183935fa0cccaf67455c439afe4982b] Merge tag 'devicetree-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good 8bc9e6515183935fa0cccaf67455c439afe4982b # bad: [431f1051884e38d2a5751e4731d69b2ff289ee56] Merge tag 'leds-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/leds git bisect bad 431f1051884e38d2a5751e4731d69b2ff289ee56 # bad: [0364249d2073c32c5214f02866999ce940bc35a2] Merge tag 'for-6.7/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm git bisect bad 0364249d2073c32c5214f02866999ce940bc35a2 # good: [27442758e9b4e083bef3f164a1739475c01f3202] Merge tag 'amd-drm-next-6.7-2023-10-13' of https://gitlab.freedesktop.org/agd5f/linux into drm-next git bisect good 27442758e9b4e083bef3f164a1739475c01f3202 # bad: [631808095a82e6b6f8410a95f8b12b8d0d38b161] Merge tag 'amd-drm-next-6.7-2023-10-27' of https://gitlab.freedesktop.org/agd5f/linux into drm-next git bisect bad 631808095a82e6b6f8410a95f8b12b8d0d38b161 # skip: [0ecf4aa32b7896b9160688bdbd20153dc06a50fb] Merge tag 'amd-drm-next-6.7-2023-10-20' of https://gitlab.freedesktop.org/agd5f/linux into drm-next git bisect skip 0ecf4aa32b7896b9160688bdbd20153dc06a50fb # good: [74ce0f3873821f12391bcf5469d81583d34f4c6c] accel/ivpu: Fix verbose version of REG_POLL macros git bisect good 74ce0f3873821f12391bcf5469d81583d34f4c6c # skip: [3f5ba636d6987ddffeaa056dea1c524da63912f3] Merge tag 'drm-msm-next-2023-10-17' of https://gitlab.freedesktop.org/drm/msm into drm-next git bisect skip 3f5ba636d6987ddffeaa056dea1c524da63912f3 # good: [a3cd664e7f971b0f33acb3ba790c142669cd34e5] accel/ivpu: Print IPC type string instead of number git bisect good a3cd664e7f971b0f33acb3ba790c142669cd34e5 # good: [b0b0d811eac6b4c52cb9ad632fa6384cf48869e7] drm/mediatek: Fix coverity issue with unintentional integer overflow git bisect good b0b0d811eac6b4c52cb9ad632fa6384cf48869e7 # good: [808b43fa7e56e94563b86af2703ba88ee156e3c2] drm/i915/dp_mst: Set connector DSC capabilities and decompression AUX git bisect good 808b43fa7e56e94563b86af2703ba88ee156e3c2 # skip: [11ae5eb516b656e8a0e4efbea90ea24c152a346d] Merge tag 'topic/vmemdup-user-array-2023-10-24-1' of git://anongit.freedesktop.org/drm/drm into drm-next git bisect skip 11ae5eb516b656e8a0e4efbea90ea24c152a346d # good: [a67f7a0b18c09d5b62eafb6d5c2f54e6f6ea6cf1] drm/amd/display: Update SDP VSC colorimetry from DP test automation request git bisect good a67f7a0b18c09d5b62eafb6d5c2f54e6f6ea6cf1 # good: [dd3dd9829bf9a4ecd55482050745efdd9f7f97fc] drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo git bisect good dd3dd9829bf9a4ecd55482050745efdd9f7f97fc # good: [b1abb484417ec8edd68df0c9bf8cb1c1fc035fd2] drm/ci: force-enable CONFIG_MSM_MMCC_8996 as built-in git bisect good b1abb484417ec8edd68df0c9bf8cb1c1fc035fd2 # good: [5fa8f128462c5b3b20576b12286dca7fe95b3af1] drm/ci: increase i915 job timeout to 1h30m git bisect good 5fa8f128462c5b3b20576b12286dca7fe95b3af1 # good: [3ddba96b0d7e714dee4db5aed4f7d413be43b4ba] MAINTAINERS: drm/ci: add entries for xfail files git bisect good 3ddba96b0d7e714dee4db5aed4f7d413be43b4ba # bad: [b70438004a14f4d0f9890b3297cd66248728546c] drm/amdgpu: move buffer funcs setting up a level git bisect bad b70438004a14f4d0f9890b3297cd66248728546c # skip: [56e449603f0ac580700621a356d35d5716a62ce5] drm/sched: Convert the GPU scheduler to variable number of run-queues git bisect skip 56e449603f0ac580700621a356d35d5716a62ce5 # skip: [c07bf1636f0005f9eb7956404490672286ea59d3] MAINTAINERS: Update the GPU Scheduler email git bisect skip c07bf1636f0005f9eb7956404490672286ea59d3 # only skipped commits left to test # possible first bad commit: [b70438004a14f4d0f9890b3297cd66248728546c] drm/amdgpu: move buffer funcs setting up a level # possible first bad commit: [c07bf1636f0005f9eb7956404490672286ea59d3] MAINTAINERS: Update the GPU Scheduler email # possible first bad commit: [56e449603f0ac580700621a356d35d5716a62ce5] drm/sched: Convert the GPU scheduler to variable number of run-queues
Re: Radeon regression in 6.6 kernel
On Sat, Nov 11, 2023 at 07:46:41PM -0500, Phillip Susi wrote: > I had been testing some things on a post 6.6-rc5 kernel for a week or > two and then when I pulled to a post 6.6 release kernel, I found that > system suspend was broken. It seems that the radeon driver failed to > suspend, leaving the display dead, the wayland display server hung, and > the system still running. I have been trying to bisect it for the last > few days and have only been able to narrow it down to the following 3 > commits: > > There are only 'skip'ped commits left to test. > The first bad commit could be any of: > 56e449603f0ac580700621a356d35d5716a62ce5 > c07bf1636f0005f9eb7956404490672286ea59d3 > b70438004a14f4d0f9890b3297cd66248728546c > We cannot bisect more! Please show the full bisect log, and also tell why these commits are skipped. > > It appears that there was a late merge in the 6.6 window that originally > forked from the -rc2, as many of the later commits that I bisected had > that version number. > > I couldn't get it more narrowed down because I had to skip the > surrounding commits because they wouldn't even boot up to a gui desktop, > let alone try to suspend. > > When system suspend fails, I find the following in my syslog after I > have to magic-sysrq reboot because the the display is dead: > > Nov 11 18:44:39 faldara kernel: PM: suspend entry (deep) > Nov 11 18:44:39 faldara kernel: Filesystems sync: 0.035 seconds > Nov 11 18:44:40 faldara kernel: Freezing user space processes > Nov 11 18:44:40 faldara kernel: Freezing user space processes completed > (elapsed 0.001 seconds) > Nov 11 18:44:40 faldara kernel: OOM killer disabled. > Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks > Nov 11 18:44:40 faldara kernel: Freezing remaining freezable tasks completed > (elapsed 0.001 seconds) > Nov 11 18:44:40 faldara kernel: printk: Suspending console(s) (use > no_console_suspend to debug) > Nov 11 18:44:40 faldara kernel: serial 00:01: disabled > Nov 11 18:44:40 faldara kernel: e1000e: EEE TX LPI TIMER: 0011 > Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Synchronizing SCSI cache > Nov 11 18:44:40 faldara kernel: sd 4:0:0:0: [sdb] Stopping disk > Nov 11 18:44:40 faldara kernel: sd 1:0:0:0: [sda] Stopping disk > Nov 11 18:44:40 faldara kernel: sd 5:0:0:0: [sdc] Stopping disk > Nov 11 18:44:40 faldara kernel: amdgpu: Move buffer fallback to memcpy > unavailable > Nov 11 18:44:40 faldara kernel: [TTM] Buffer eviction failed > Nov 11 18:44:40 faldara kernel: [drm] evicting device resources failed > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: pci_pm_suspend(): > amdgpu_pmops_suspend+0x0/0x80 [amdgpu] returns -19 > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: dpm_run_callback(): > pci_pm_suspend+0x0/0x170 returns -19 > Nov 11 18:44:40 faldara kernel: amdgpu :03:00.0: PM: failed to suspend > async: error -19 > Nov 11 18:44:40 faldara kernel: PM: Some devices failed to suspend, or early > wake event detected > Nov 11 18:44:40 faldara kernel: xhci_hcd :06:00.0: xHC error in resume, > USBSTS 0x401, Reinit > Nov 11 18:44:40 faldara kernel: usb usb3: root hub lost power or was reset > Nov 11 18:44:40 faldara kernel: usb usb4: root hub lost power or was reset > Nov 11 18:44:40 faldara kernel: serial 00:01: activated > Nov 11 18:44:40 faldara kernel: nvme nvme0: 4/0/0 default/read/poll queues > Nov 11 18:44:40 faldara kernel: ata8: SATA link down (SStatus 0 SControl 300) > Nov 11 18:44:40 faldara kernel: ata7: SATA link down (SStatus 0 SControl 300) > Nov 11 18:44:40 faldara kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 > SControl 300) > Nov 11 18:44:40 faldara kernel: ata1: SATA link down (SStatus 4 SControl 300) > Nov 11 18:44:40 faldara kernel: ata3: SATA link down (SStatus 4 SControl 300) > Nov 11 18:44:40 faldara kernel: ata4.00: configured for UDMA/133 > Nov 11 18:44:40 faldara kernel: OOM killer enabled. > Nov 11 18:44:40 faldara kernel: Restarting tasks ... done. > Nov 11 18:44:40 faldara kernel: random: crng reseeded on system resumption > Nov 11 18:44:40 faldara kernel: PM: suspend exit > Nov 11 18:44:40 faldara kernel: PM: suspend entry (s2idle) > Nov 11 18:44:40 faldara systemd-networkd[384]: enp0s31f6: Gained IPv6LL > Nov 11 18:44:40 faldara avahi-daemon[668]: Joining mDNS multicast group on > interface enp0s31f6.IPv6 with address fe80::3ad5:47ff:fe0f:488a. > > My video card is this: > > 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] > Navi 23 (rev c7) (prog-if 00 [VGA controller]) > Subsystem: Gigabyte Technology Co., Ltd Navi 23 > Flags: bus master, fast devsel, latency 0, IRQ 139 > Memory at e000 (64-bit, prefetchable) [size=256M] > Memory at f000 (64-bit, prefetchable) [size=2M] > I/O ports at e000 [size=25