Re: [REGRESSION][BISECTED] 5.14.0-rc4 thru rc6 69de4421bb broke
Duncan posted on Mon, 16 Aug 2021 07:58:37 + as excerpted: > Mikael Pettersson posted on Tue, 03 Aug 2021 08:54:18 +0200 as > excerpted: >> On Mon, Aug 2, 2021 at 8:29 PM Duncan wrote: >>> Mikael Pettersson wrote... >>> > Booting 5.14.0-rc4 on my box with Radeon graphics breaks with >>> > >>> > [drm:radeon_ttm_init [radeon]] *ERROR* failed initializing buffer >>> > object driver(-19). >>> > radeon :01:00.0: Fatal error during GPU init >>> >>> Seeing this here too. amdgpu on polaris-11, on an old amd-fx6100 >>> system. >>> >>> > after which the screen goes black for the rest of kernel boot and >>> > early user-space init. >>> >>> *NOT* seeing that. However, I have boot messages turned on by >>> default and I see them as usual, only it stays in vga-console mode >>> instead of switching to framebuffer after early-boot. I'm guessing >>> MP has a high-res boot-splash which doesn't work in vga mode, thus >>> the black-screen until the login shows up. >> >> Yes, I have the Fedora boot splash enabled. >> >>> > Once the console login shows up the screen is in some legacy >>> > low-res mode and Xorg can't be started. >>> > >>> > A git bisect between v5.14-rc3 (good) and v5.14-rc4 (bad) >>> > identified >>> > >>> > # first bad commit: [69de4421bb4c103ef42a32bafc596e23918c106f] >>> > drm/ttm: Initialize debugfs from ttm_global_init() >>> > >>> > Reverting that from 5.14.0-rc4 gives me a working kernel again. >>> > >>> > Note that I have # CONFIG_DEBUG_FS is not set >>> >>> That all matches here, including the unset CONFIG_DEBUG_FS and >>> confirming the revert on 5.14.0-rc4 works. >> >> Thanks for the confirmation. > > 69de44d1bb introduced a regression in rc4, reported to the list on > August 2, that's still there in rc6. It's also reported on kernel > bugzilla as bug #214000. No maintainer response either on-list or to > the bug. The commit was general ttm and the original post went to > dri-devel and kernel, > Jason E. and Daniel V., but all three user reports I've seen so far > (two on-list and the bug reporter) are on amdgpu or radeon, so in an > effort to at least get a response and hopefully a fix before release, > I'm adding the amdgpu/radeon list and maintainers. > > The bugzilla report confirmed that CONFIG_DEBUG_FS=y AND > CONFIG_DEBUG_FS_ALLOW_ALL=y were *both* required to get a working > kernel after that commit. I and I believe the on-list reporter just > reverted the commit in question, and kept our CONFIG_DEBUG_FS=n. Trying again. I apologize if anyone gets this twice but I don't think the first one made it at all (buggy client). -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [PATCH] drm/amd/display: Clear dm_state for fast updates
On Mon, 27 Jul 2020 10:05:01 -0400 "Kazlauskas, Nicholas" wrote: > On 2020-07-27 9:39 a.m., Christian König wrote: > > Am 27.07.20 um 07:40 schrieb Mazin Rezk: > >> This patch fixes a race condition that causes a use-after-free > >> during amdgpu_dm_atomic_commit_tail. This can occur when 2 > >> non-blocking commits are requested and the second one finishes > >> before the first. Essentially, this bug occurs when the following > >> sequence of events happens: > >> > >> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is > >> deferred to the workqueue. > >> > >> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is > >> deferred to the workqueue. > >> > >> 3. Commit #2 starts before commit #1, dm_state #1 is used in the > >> commit_tail and commit #2 completes, freeing dm_state #1. > >> > >> 4. Commit #1 starts after commit #2 completes, uses the freed > >> dm_state 1 and dereferences a freelist pointer while setting the > >> context. > > > > Well I only have a one mile high view on this, but why don't you > > let the work items execute in order? > > > > That would be better anyway cause this way we don't trigger a cache > > line ping pong between CPUs. > > > > Christian. > > We use the DRM helpers for managing drm_atomic_commit_state and those > helpers internally push non-blocking commit work into the system > unbound work queue. > > While we could duplicate a copy of that code with nothing but the > workqueue changed that isn't something I'd really like to maintain > going forward. > > Regards, > Nicholas Kazlauskas Additionally, I don't see mentioned on-thread (it's on the bug and now in the details below) that we're talking multi-monitor, not single-monitor. Presumably that goes some way toward answering the "why not force order?" question, considering the outputs may be running at different refresh frequencies, etc... All the reports on the bug seem to be multi-monitor (after seeing multi-monitor/output in several reports I asked if anyone was seeing it with only one monitor and no answers), and as you commented on the bug for your proposed patch but seems missing from this one here (different author/proposal) ... Commits #1 and #2 don't touch any of the same core DRM objects (CRTCs, Planes, Connectors) so Commit #2 does not stall for Commit #1. DRM Private Objects have always been avoided in stall checks, so we have no safety from DRM core in this regard. > >> > >> Since this bug has only been spotted with fast commits, this patch > >> fixes the bug by clearing the dm_state instead of using the old > >> dc_state for fast updates. In addition, since dm_state is only > >> used for its dc_state and amdgpu_dm_atomic_commit_tail will retain > >> the dc_state if none is found, > >> removing the dm_state should not have any consequences in fast > >> updates. > >> > >> This use-after-free bug has existed for a while now, but only > >> caused a noticeable issue starting from 5.7-rc1 due to 3202fa62f > >> ("slub: relocate freelist pointer to middle of object") moving the > >> freelist pointer from dm_state->base (which was unused) to > >> dm_state->context (which is dereferenced). > >> > >> Bugzilla: > >> https://bugzilla.kernel.org/show_bug.cgi?id=207383 > >> > >> Fixes: bd200d190f45 ("drm/amd/display: Don't replace the dc_state > >> for fast updates") > >> Reported-by: Duncan <1i5t5.dun...@cox.net> > >> Signed-off-by: Mazin Rezk > >> --- > >> .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36 > >> ++- 1 file changed, 27 insertions(+), 9 > >> deletions(-) > >> > >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > >> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > >> index 86ffa0c2880f..710edc70e37e 100644 > >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > >> @@ -8717,20 +8717,38 @@ static int amdgpu_dm_atomic_check(struct > >> drm_device *dev, > >> * the same resource. If we have a new DC context as > >> part of > >> * the DM atomic state from validation we need to free > >> it and > >> * retain the existing one instead. > >> + * > >> + * Furthermore, since the DM atomic state only contains > >>
Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free
On Sat, 25 Jul 2020 03:03:52 + Mazin Rezk wrote: > > Am 24.07.20 um 19:33 schrieb Kees Cook: > > > > > There was a fix to disable the async path for this driver that > > > worked around the bug too, yes? That seems like a safer and more > > > focused change that doesn't revert the SLUB defense for all > > > users, and would actually provide a complete, I think, workaround > > That said, I haven't seen the async disabling patch. If you could > link to it, I'd be glad to test it out and perhaps we can use that > instead. I'm confused. Not to put words in Kees' mouth; /I/ am confused (which admittedly could well be just because I make no claims to be a coder and am simply reading the bug and thread, but I'd appreciate some "unconfusing" anyway). My interpretation of the "async disabling" reference was that it was to comment #30 on the bug: https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30 ... which (if I'm not confused on this point too) appears to be yours. There it was stated... >>>> I've also found that this bug exclusively occurs when commit_work is on the workqueue. After forcing drm_atomic_helper_commit to run all of the commits without adding to the workqueue and running the OS, the issue seems to have disappeared. <<<< Would not forcing all commits to run directly, without placing them on the workqueue, be "async disabling"? That's what I /thought/ he was referencing. OTOH your base/context swap idea sounds like a possibly "less disturbance" workaround, if it works, and given the point in the commit cycle... (But if it's out Sunday it's likely too late to test and get it in now anyway; if it's another week, tho...) -- Duncan - No HTML messages please; they are filtered as spam. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx