On Tue, Jan 31, 2023 at 5:58 AM Christian König <christian.koe...@amd.com> wrote: > > Am 31.01.23 um 10:17 schrieb Chen, Guchun: > > Hi Piccoli, > > > > Please ignore my request of full dmesg log. I can reproduce the issue and > > get the same failure callstack by returning early with an error code prior > > to amdgpu_device_init_schedulers. > > > > Regards, > > Guchun > > > > -----Original Message----- > > From: Chen, Guchun > > Sent: Tuesday, January 31, 2023 2:37 PM > > To: Alex Deucher <alexdeuc...@gmail.com>; Guilherme G. Piccoli > > <gpicc...@igalia.com> > > Cc: amd-...@lists.freedesktop.org; ker...@gpiccoli.net; Pan, Xinhui > > <xinhui....@amd.com>; dri-devel@lists.freedesktop.org; Tuikov, Luben > > <luben.tui...@amd.com>; Limonciello, Mario <mario.limoncie...@amd.com>; > > kernel-...@igalia.com; Deucher, Alexander <alexander.deuc...@amd.com>; > > Koenig, Christian <christian.koe...@amd.com> > > Subject: RE: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > > drm_sched init/fini > > > > Hi Piccoli, > > > > I agree with Alex's point, using ring->sched.name for such check is not a > > good way. BTW, can you please attach a full dmesg long in bad case to help > > me understand more? > > > > Regards, > > Guchun > > > > -----Original Message----- > > From: Alex Deucher <alexdeuc...@gmail.com> > > Sent: Tuesday, January 31, 2023 6:30 AM > > To: Guilherme G. Piccoli <gpicc...@igalia.com> > > Cc: amd-...@lists.freedesktop.org; ker...@gpiccoli.net; Chen, Guchun > > <guchun.c...@amd.com>; Pan, Xinhui <xinhui....@amd.com>; > > dri-devel@lists.freedesktop.org; Tuikov, Luben <luben.tui...@amd.com>; > > Limonciello, Mario <mario.limoncie...@amd.com>; kernel-...@igalia.com; > > Deucher, Alexander <alexander.deuc...@amd.com>; Koenig, Christian > > <christian.koe...@amd.com> > > Subject: Re: [PATCH] drm/amdgpu/fence: Fix oops due to non-matching > > drm_sched init/fini > > > > On Mon, Jan 30, 2023 at 4:51 PM Guilherme G. Piccoli <gpicc...@igalia.com> > > wrote: > >> + Luben > >> > >> (sorry, missed that in the first submission). > >> > >> On 30/01/2023 18:45, Guilherme G. Piccoli wrote: > >>> Currently amdgpu calls drm_sched_fini() from the fence driver sw > >>> fini routine - such function is expected to be called only after the > >>> respective init function - drm_sched_init() - was executed successfully. > >>> > >>> Happens that we faced a driver probe failure in the Steam Deck > >>> recently, and the function drm_sched_fini() was called even without > >>> its counter-part had been previously called, causing the following oops: > >>> > >>> amdgpu: probe of 0000:04:00.0 failed with error -110 > >>> BUG: kernel NULL pointer dereference, address: 0000000000000090 PGD > >>> 0 P4D 0 > >>> Oops: 0002 [#1] PREEMPT SMP NOPTI > >>> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli > >>> #338 Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022 > >>> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched] [...] Call Trace: > >>> <TASK> > >>> amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu] > >>> amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu] > >>> amdgpu_driver_release_kms+0x16/0x30 [amdgpu] > >>> devm_drm_dev_init_release+0x49/0x70 > >>> [...] > >>> > >>> To prevent that, check if the drm_sched was properly initialized for > >>> a given ring before calling its fini counter-part. > >>> > >>> Notice ideally we'd use sched.ready for that; such field is set as > >>> the latest thing on drm_sched_init(). But amdgpu seems to "override" > >>> the meaning of such field - in the above oops for example, it was a > >>> GFX ring causing the crash, and the sched.ready field was set to > >>> true in the ring init routine, regardless of the state of the DRM > >>> scheduler. Hence, we ended-up using another sched field. > >>>>> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence > >>>>> driver fini in s3 test (v2)") > >>> Cc: Andrey Grodzovsky <andrey.grodzov...@amd.com> > >>> Cc: Guchun Chen <guchun.c...@amd.com> > >>> Cc: Mario Limonciello <mario.limoncie...@amd.com> > >>> Signed-off-by: Guilherme G. Piccoli <gpicc...@igalia.com> > >>> --- > >>> > >>> > >>> Hi folks, first of all thanks in advance for reviews / comments! > >>> Notice that I've used the Fixes tag more in the sense to bring it to > >>> stable, I didn't find a good patch candidate that added the call to > >>> drm_sched_fini(), was reaching way too old commits...so > >>> 067f44c8b459 seems a good candidate - or maybe not? > >>> > >>> Now, with regards sched.ready, spent a bit of time to figure what > >>> was happening...would be feasible maybe to stop using that to mark > >>> some kind ring status? I think it should be possible to add a flag > >>> to the ring structure for that, and free sched.ready from being > >>> manipulate by the amdgpu driver, what's your thoughts on that? > > It's been a while, but IIRC, we used to have a ring->ready field in the > > driver which at some point got migrated out of the driver into the GPU > > scheduler and the driver side code never got cleaned up. I think we should > > probably just drop the driver messing with that field and leave it up to > > the drm scheduler. > > To use ring->ready is not a good idea since this doesn't specify if the > scheduler was initialized, but rather if bringup of the engine worked.
ring->ready no longer exists. It was moved into the scheduler and the driver side never got cleaned up. Also, we set ring->sched.ready = true once the rings are set up, but before we actually test them, so it's currently a proxy for the ring being ready to test/use. Alex > > It's perfectly possible that the scheduler was initialized, but then the > IB test failed later on. > > I strongly suggest to use scheduler->ops instead since those are set > during init and mandatory for correct operation. > > Christian. > > > > > Alex > > > > > >>> I could try myself, but first of course I'd like to raise the > >>> "temperature" on this topic and check if somebody is already working > >>> on that. > >>> > >>> Cheers, > >>> > >>> Guilherme > >>> > >>> > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++- > >>> 1 file changed, 7 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > >>> index 00444203220d..e154eb8241fb 100644 > >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > >>> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct > >>> amdgpu_device *adev) > >>> if (!ring || !ring->fence_drv.initialized) > >>> continue; > >>> > >>> - if (!ring->no_scheduler) > >>> + /* > >>> + * Notice we check for sched.name since there's some > >>> + * override on the meaning of sched.ready by amdgpu. > >>> + * The natural check would be sched.ready, which is > >>> + * set as drm_sched_init() finishes... > >>> + */ > >>> + if (!ring->no_scheduler && ring->sched.name) > >>> drm_sched_fini(&ring->sched); > >>> > >>> for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j) >