Re: [REGRESSION][BISECTED] 5.14.0-rc4 thru rc6 69de4421bb broke

2021-08-16 Thread Duncan
Duncan posted on Mon, 16 Aug 2021 07:58:37 + as excerpted:

> Mikael Pettersson posted on Tue, 03 Aug 2021 08:54:18 +0200 as
> excerpted:
>> On Mon, Aug 2, 2021 at 8:29 PM Duncan  wrote:
>>> Mikael Pettersson  wrote...
>>> > Booting 5.14.0-rc4 on my box with Radeon graphics breaks with
>>> >
>>> > [drm:radeon_ttm_init [radeon]] *ERROR* failed initializing buffer
>>> > object driver(-19).
>>> > radeon :01:00.0: Fatal error during GPU init
>>>
>>> Seeing this here too.  amdgpu on polaris-11, on an old amd-fx6100
>>> system.
>>>
>>> > after which the screen goes black for the rest of kernel boot and
>>> > early user-space init.
>>>
>>> *NOT* seeing that.  However, I have boot messages turned on by
>>> default and I see them as usual, only it stays in vga-console mode
>>> instead of switching to framebuffer after early-boot. I'm guessing
>>> MP has a high-res boot-splash which doesn't work in vga mode, thus
>>> the black-screen until the login shows up.
>> 
>> Yes, I have the Fedora boot splash enabled.
>> 
>>> > Once the console login shows up the screen is in some legacy
>>> > low-res mode and Xorg can't be started.
>>> >
>>> > A git bisect between v5.14-rc3 (good) and v5.14-rc4 (bad)
>>> > identified
>>> >
>>> > # first bad commit: [69de4421bb4c103ef42a32bafc596e23918c106f]
>>> > drm/ttm: Initialize debugfs from ttm_global_init()
>>> >
>>> > Reverting that from 5.14.0-rc4 gives me a working kernel again.
>>> >
>>> > Note that I have # CONFIG_DEBUG_FS is not set
>>>
>>> That all matches here, including the unset CONFIG_DEBUG_FS and
>>> confirming the revert on 5.14.0-rc4 works.
>> 
>> Thanks for the confirmation.
> 
> 69de44d1bb introduced a regression in rc4, reported to the list on
> August 2, that's still there in rc6.  It's also reported on kernel
> bugzilla as bug #214000.  No maintainer response either on-list or to
> the bug.  The commit was general ttm and the original post went to
> dri-devel and kernel,
> Jason E. and Daniel V., but all three user reports I've seen so far
> (two on-list and the bug reporter) are on amdgpu or radeon, so in an
> effort to at least get a response and hopefully a fix before release,
> I'm adding the amdgpu/radeon list and maintainers.
> 
> The bugzilla report confirmed that CONFIG_DEBUG_FS=y AND
> CONFIG_DEBUG_FS_ALLOW_ALL=y were *both* required to get a working
> kernel after that commit.  I and I believe the on-list reporter just
> reverted the commit in question, and kept our CONFIG_DEBUG_FS=n.

Trying again. I apologize if anyone gets this twice but I don't think
the first one made it at all (buggy client).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-27 Thread Duncan
On Mon, 27 Jul 2020 10:05:01 -0400
"Kazlauskas, Nicholas"  wrote:

> On 2020-07-27 9:39 a.m., Christian König wrote:
> > Am 27.07.20 um 07:40 schrieb Mazin Rezk:
> >> This patch fixes a race condition that causes a use-after-free
> >> during amdgpu_dm_atomic_commit_tail. This can occur when 2
> >> non-blocking commits are requested and the second one finishes
> >> before the first. Essentially, this bug occurs when the following
> >> sequence of events happens:
> >>
> >> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> >> deferred to the workqueue.
> >>
> >> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> >> deferred to the workqueue.
> >>
> >> 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> >> commit_tail and commit #2 completes, freeing dm_state #1.
> >>
> >> 4. Commit #1 starts after commit #2 completes, uses the freed
> >> dm_state 1 and dereferences a freelist pointer while setting the
> >> context.
> > 
> > Well I only have a one mile high view on this, but why don't you
> > let the work items execute in order?
> > 
> > That would be better anyway cause this way we don't trigger a cache
> > line ping pong between CPUs.
> > 
> > Christian.
> 
> We use the DRM helpers for managing drm_atomic_commit_state and those 
> helpers internally push non-blocking commit work into the system
> unbound work queue.
> 
> While we could duplicate a copy of that code with nothing but the 
> workqueue changed that isn't something I'd really like to maintain
> going forward.
> 
> Regards,
> Nicholas Kazlauskas

Additionally, I don't see mentioned on-thread (it's on the bug and now
in the details below) that we're talking multi-monitor, not
single-monitor. Presumably that goes some way toward answering the "why
not force order?" question, considering the outputs may be running at
different refresh frequencies, etc...

All the reports on the bug seem to be multi-monitor (after seeing
multi-monitor/output in several reports I asked if anyone was seeing it
with only one monitor and no answers), and as you commented on the bug
for your proposed patch but seems missing from this one here (different
author/proposal) ...

Commits #1 and #2 don't touch any of the same core DRM objects (CRTCs,
Planes, Connectors) so Commit #2 does not stall for Commit #1. DRM
Private Objects have always been avoided in stall checks, so we have no
safety from DRM core in this regard.

> >>
> >> Since this bug has only been spotted with fast commits, this patch
> >> fixes the bug by clearing the dm_state instead of using the old
> >> dc_state for fast updates. In addition, since dm_state is only
> >> used for its dc_state and amdgpu_dm_atomic_commit_tail will retain
> >> the dc_state if none is found,
> >> removing the dm_state should not have any consequences in fast
> >> updates.
> >>
> >> This use-after-free bug has existed for a while now, but only
> >> caused a noticeable issue starting from 5.7-rc1 due to 3202fa62f
> >> ("slub: relocate freelist pointer to middle of object") moving the
> >> freelist pointer from dm_state->base (which was unused) to
> >> dm_state->context (which is dereferenced).
> >>
> >> Bugzilla: 
> >> https://bugzilla.kernel.org/show_bug.cgi?id=207383 
> >>
> >> Fixes: bd200d190f45 ("drm/amd/display: Don't replace the dc_state
> >> for fast updates")
> >> Reported-by: Duncan <1i5t5.dun...@cox.net>
> >> Signed-off-by: Mazin Rezk 
> >> ---
> >>   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36
> >> ++- 1 file changed, 27 insertions(+), 9
> >> deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> >> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> index 86ffa0c2880f..710edc70e37e 100644
> >> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> >> @@ -8717,20 +8717,38 @@ static int amdgpu_dm_atomic_check(struct 
> >> drm_device *dev,
> >>    * the same resource. If we have a new DC context as
> >> part of
> >>    * the DM atomic state from validation we need to free
> >> it and
> >>    * retain the existing one instead.
> >> + *
> >> + * Furthermore, since the DM atomic state only contains
> >> 

Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-26 Thread Duncan
On Sat, 25 Jul 2020 03:03:52 +
Mazin Rezk  wrote:

> > Am 24.07.20 um 19:33 schrieb Kees Cook:
> >  
> > > There was a fix to disable the async path for this driver that
> > > worked around the bug too, yes? That seems like a safer and more
> > > focused change that doesn't revert the SLUB defense for all
> > > users, and would actually provide a complete, I think, workaround
> 
> That said, I haven't seen the async disabling patch. If you could
> link to it, I'd be glad to test it out and perhaps we can use that
> instead.

I'm confused.  Not to put words in Kees' mouth; /I/ am confused (which
admittedly could well be just because I make no claims to be a
coder and am simply reading the bug and thread, but I'd appreciate some
"unconfusing" anyway).

My interpretation of the "async disabling" reference was that it was to
comment #30 on the bug:

https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30

... which (if I'm not confused on this point too) appears to be yours.
There it was stated...

>>>>
I've also found that this bug exclusively occurs when commit_work is on
the workqueue. After forcing drm_atomic_helper_commit to run all of the
commits without adding to the workqueue and running the OS, the issue
seems to have disappeared.
<<<<

Would not forcing all commits to run directly, without placing them on
the workqueue, be "async disabling"?  That's what I /thought/ he was
referencing.

OTOH your base/context swap idea sounds like a possibly "less
disturbance" workaround, if it works, and given the point in the
commit cycle... (But if it's out Sunday it's likely too late to test
and get it in now anyway; if it's another week, tho...)

-- 
Duncan - No HTML messages please; they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx