Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-28 Thread Mazin Rezk
On Monday, July 27, 2020 7:42 PM, Mazin Rezk  wrote:

> On Monday, July 27, 2020 5:32 PM, Daniel Vetter  wrote:
>
> > On Mon, Jul 27, 2020 at 11:11 PM Mazin Rezk  wrote:
> > >
> > > On Monday, July 27, 2020 4:29 PM, Daniel Vetter  wrote:
> > >
> > > > On Mon, Jul 27, 2020 at 9:28 PM Christian König
> > > >  wrote:
> > > > >
> > > > > Am 27.07.20 um 16:05 schrieb Kazlauskas, Nicholas:
> > > > > > On 2020-07-27 9:39 a.m., Christian König wrote:
> > > > > >> Am 27.07.20 um 07:40 schrieb Mazin Rezk:
> > > > > >>> This patch fixes a race condition that causes a use-after-free 
> > > > > >>> during
> > > > > >>> amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking
> > > > > >>> commits
> > > > > >>> are requested and the second one finishes before the first.
> > > > > >>> Essentially,
> > > > > >>> this bug occurs when the following sequence of events happens:
> > > > > >>>
> > > > > >>> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> > > > > >>> deferred to the workqueue.
> > > > > >>>
> > > > > >>> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> > > > > >>> deferred to the workqueue.
> > > > > >>>
> > > > > >>> 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> > > > > >>> commit_tail and commit #2 completes, freeing dm_state #1.
> > > > > >>>
> > > > > >>> 4. Commit #1 starts after commit #2 completes, uses the freed 
> > > > > >>> dm_state
> > > > > >>> 1 and dereferences a freelist pointer while setting the context.
> > > > > >>
> > > > > >> Well I only have a one mile high view on this, but why don't you 
> > > > > >> let
> > > > > >> the work items execute in order?
> > > > > >>
> > > > > >> That would be better anyway cause this way we don't trigger a cache
> > > > > >> line ping pong between CPUs.
> > > > > >>
> > > > > >> Christian.
> > > > > >
> > > > > > We use the DRM helpers for managing drm_atomic_commit_state and 
> > > > > > those
> > > > > > helpers internally push non-blocking commit work into the system
> > > > > > unbound work queue.
> > > > >
> > > > > Mhm, well if you send those helper atomic commits in the order A,B and
> > > > > they execute it in the order B,A I would call that a bug :)
> > > >
> > > > The way it works is it pushes all commits into unbound work queue, but
> > > > then forces serialization as needed. We do _not_ want e.g. updates on
> > > > different CRTC to be serialized, that would result in lots of judder.
> > > > And hw is funny enough that there's all kinds of dependencies.
> > > >
> > > > The way you force synchronization is by adding other CRTC state
> > > > objects. So if DC is busted and can only handle a single update per
> > > > work item, then I guess you always need all CRTC states and everything
> > > > will be run in order. But that also totally kills modern multi-screen
> > > > compositors. Xorg isn't modern, just in case that's not clear :-)
> > > >
> > > > Lucking at the code it seems like you indeed have only a single dm
> > > > state, so yeah global sync is what you'll need as immediate fix, and
> > > > then maybe fix up DM to not be quite so silly ... or at least only do
> > > > the dm state stuff when really needed.
> > > >
> > > > We could also sprinkle the drm_crtc_commit structure around a bit
> > > > (it's the glue that provides the synchronization across commits), but
> > > > since your dm state is global just grabbing all crtc states
> > > > unconditionally as part of that is probably best.
> > > >
> > > > > > While we could duplicate a copy of that code with nothing but the
> > > > > > workqueue changed that isn't something I'd really like to maintain
> > > > > > going forward.
> > > > >
> > > > > I'm not t

Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-28 Thread Mazin Rezk
On Monday, July 27, 2020 4:29 PM, Daniel Vetter  wrote:

> On Mon, Jul 27, 2020 at 9:28 PM Christian König
>  wrote:
> >
> > Am 27.07.20 um 16:05 schrieb Kazlauskas, Nicholas:
> > > On 2020-07-27 9:39 a.m., Christian König wrote:
> > >> Am 27.07.20 um 07:40 schrieb Mazin Rezk:
> > >>> This patch fixes a race condition that causes a use-after-free during
> > >>> amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking
> > >>> commits
> > >>> are requested and the second one finishes before the first.
> > >>> Essentially,
> > >>> this bug occurs when the following sequence of events happens:
> > >>>
> > >>> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> > >>> deferred to the workqueue.
> > >>>
> > >>> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> > >>> deferred to the workqueue.
> > >>>
> > >>> 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> > >>> commit_tail and commit #2 completes, freeing dm_state #1.
> > >>>
> > >>> 4. Commit #1 starts after commit #2 completes, uses the freed dm_state
> > >>> 1 and dereferences a freelist pointer while setting the context.
> > >>
> > >> Well I only have a one mile high view on this, but why don't you let
> > >> the work items execute in order?
> > >>
> > >> That would be better anyway cause this way we don't trigger a cache
> > >> line ping pong between CPUs.
> > >>
> > >> Christian.
> > >
> > > We use the DRM helpers for managing drm_atomic_commit_state and those
> > > helpers internally push non-blocking commit work into the system
> > > unbound work queue.
> >
> > Mhm, well if you send those helper atomic commits in the order A,B and
> > they execute it in the order B,A I would call that a bug :)
>
> The way it works is it pushes all commits into unbound work queue, but
> then forces serialization as needed. We do _not_ want e.g. updates on
> different CRTC to be serialized, that would result in lots of judder.
> And hw is funny enough that there's all kinds of dependencies.
>
> The way you force synchronization is by adding other CRTC state
> objects. So if DC is busted and can only handle a single update per
> work item, then I guess you always need all CRTC states and everything
> will be run in order. But that also totally kills modern multi-screen
> compositors. Xorg isn't modern, just in case that's not clear :-)
>
> Lucking at the code it seems like you indeed have only a single dm
> state, so yeah global sync is what you'll need as immediate fix, and
> then maybe fix up DM to not be quite so silly ... or at least only do
> the dm state stuff when really needed.
>
> We could also sprinkle the drm_crtc_commit structure around a bit
> (it's the glue that provides the synchronization across commits), but
> since your dm state is global just grabbing all crtc states
> unconditionally as part of that is probably best.
>
> > > While we could duplicate a copy of that code with nothing but the
> > > workqueue changed that isn't something I'd really like to maintain
> > > going forward.
> >
> > I'm not talking about duplicating the code, I'm talking about fixing the
> > helpers. I don't know that code well, but from the outside it sounds
> > like a bug there.
> >
> > And executing work items in the order they are submitted is trivial.
> >
> > Had anybody pinged Daniel or other people familiar with the helper code
> > about it?
>
> Yeah something is wrong here, and the fix looks horrible :-)
>
> Aside, I've also seen some recent discussion flare up about
> drm_atomic_state_get/put used to paper over some other use-after-free,
> but this time related to interrupt handlers. Maybe a few rules about
> that:
> - dont
> - especially not when it's interrupt handlers, because you can't call
> drm_atomic_state_put from interrupt handlers.
>
> Instead have an spin_lock_irq to protect the shared date with your
> interrupt handler, and _copy_ the date over. This is e.g. what
> drm_crtc_arm_vblank_event does.

Nicholas wrote a patch that attempted to resolve the issue by adding every
CRTC into the commit to use use the stall checks. [1] While this forces
synchronisation on commits, it's kind of a hacky method that may take a
toll on performance.

Is it possible to have a DRM helper that forces synchronisation on some
commits without having to add every CRTC into the co

Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-28 Thread Mazin Rezk
On Monday, July 27, 2020 5:32 PM, Daniel Vetter  wrote:

> On Mon, Jul 27, 2020 at 11:11 PM Mazin Rezk  wrote:
> >
> > On Monday, July 27, 2020 4:29 PM, Daniel Vetter  wrote:
> >
> > > On Mon, Jul 27, 2020 at 9:28 PM Christian König
> > >  wrote:
> > > >
> > > > Am 27.07.20 um 16:05 schrieb Kazlauskas, Nicholas:
> > > > > On 2020-07-27 9:39 a.m., Christian König wrote:
> > > > >> Am 27.07.20 um 07:40 schrieb Mazin Rezk:
> > > > >>> This patch fixes a race condition that causes a use-after-free 
> > > > >>> during
> > > > >>> amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking
> > > > >>> commits
> > > > >>> are requested and the second one finishes before the first.
> > > > >>> Essentially,
> > > > >>> this bug occurs when the following sequence of events happens:
> > > > >>>
> > > > >>> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> > > > >>> deferred to the workqueue.
> > > > >>>
> > > > >>> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> > > > >>> deferred to the workqueue.
> > > > >>>
> > > > >>> 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> > > > >>> commit_tail and commit #2 completes, freeing dm_state #1.
> > > > >>>
> > > > >>> 4. Commit #1 starts after commit #2 completes, uses the freed 
> > > > >>> dm_state
> > > > >>> 1 and dereferences a freelist pointer while setting the context.
> > > > >>
> > > > >> Well I only have a one mile high view on this, but why don't you let
> > > > >> the work items execute in order?
> > > > >>
> > > > >> That would be better anyway cause this way we don't trigger a cache
> > > > >> line ping pong between CPUs.
> > > > >>
> > > > >> Christian.
> > > > >
> > > > > We use the DRM helpers for managing drm_atomic_commit_state and those
> > > > > helpers internally push non-blocking commit work into the system
> > > > > unbound work queue.
> > > >
> > > > Mhm, well if you send those helper atomic commits in the order A,B and
> > > > they execute it in the order B,A I would call that a bug :)
> > >
> > > The way it works is it pushes all commits into unbound work queue, but
> > > then forces serialization as needed. We do _not_ want e.g. updates on
> > > different CRTC to be serialized, that would result in lots of judder.
> > > And hw is funny enough that there's all kinds of dependencies.
> > >
> > > The way you force synchronization is by adding other CRTC state
> > > objects. So if DC is busted and can only handle a single update per
> > > work item, then I guess you always need all CRTC states and everything
> > > will be run in order. But that also totally kills modern multi-screen
> > > compositors. Xorg isn't modern, just in case that's not clear :-)
> > >
> > > Lucking at the code it seems like you indeed have only a single dm
> > > state, so yeah global sync is what you'll need as immediate fix, and
> > > then maybe fix up DM to not be quite so silly ... or at least only do
> > > the dm state stuff when really needed.
> > >
> > > We could also sprinkle the drm_crtc_commit structure around a bit
> > > (it's the glue that provides the synchronization across commits), but
> > > since your dm state is global just grabbing all crtc states
> > > unconditionally as part of that is probably best.
> > >
> > > > > While we could duplicate a copy of that code with nothing but the
> > > > > workqueue changed that isn't something I'd really like to maintain
> > > > > going forward.
> > > >
> > > > I'm not talking about duplicating the code, I'm talking about fixing the
> > > > helpers. I don't know that code well, but from the outside it sounds
> > > > like a bug there.
> > > >
> > > > And executing work items in the order they are submitted is trivial.
> > > >
> > > > Had anybody pinged Daniel or other people familiar with the helper code
> > > > about it?
> > >
> > > Yeah something is 

Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-27 Thread Mazin Rezk
On Monday, July 27, 2020 9:26 AM, Kazlauskas, Nicholas 
 wrote:

> On 2020-07-27 1:40 a.m., Mazin Rezk wrote:
> > This patch fixes a race condition that causes a use-after-free during
> > amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking commits
> > are requested and the second one finishes before the first. Essentially,
> > this bug occurs when the following sequence of events happens:
> >
> > 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> > deferred to the workqueue.
> >
> > 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> > deferred to the workqueue.
> >
> > 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> > commit_tail and commit #2 completes, freeing dm_state #1.
> >
> > 4. Commit #1 starts after commit #2 completes, uses the freed dm_state
> > 1 and dereferences a freelist pointer while setting the context.
> >
> > Since this bug has only been spotted with fast commits, this patch fixes
> > the bug by clearing the dm_state instead of using the old dc_state for
> > fast updates. In addition, since dm_state is only used for its dc_state
> > and amdgpu_dm_atomic_commit_tail will retain the dc_state if none is found,
> > removing the dm_state should not have any consequences in fast updates.
> >
> > This use-after-free bug has existed for a while now, but only caused a
> > noticeable issue starting from 5.7-rc1 due to 3202fa62f ("slub: relocate
> > freelist pointer to middle of object") moving the freelist pointer from
> > dm_state->base (which was unused) to dm_state->context (which is
> > dereferenced).
> >
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > Fixes: bd200d190f45 ("drm/amd/display: Don't replace the dc_state for fast 
> > updates")
> > Reported-by: Duncan <1i5t5.dun...@cox.net>
> > Signed-off-by: Mazin Rezk 
> > ---
> >   .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36 ++-
> >   1 file changed, 27 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 86ffa0c2880f..710edc70e37e 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -8717,20 +8717,38 @@ static int amdgpu_dm_atomic_check(struct drm_device 
> > *dev,
> >  * the same resource. If we have a new DC context as part of
> >  * the DM atomic state from validation we need to free it and
> >  * retain the existing one instead.
> > +*
> > +* Furthermore, since the DM atomic state only contains the DC
> > +* context and can safely be annulled, we can free the state
> > +* and clear the associated private object now to free
> > +* some memory and avoid a possible use-after-free later.
> >  */
> > -   struct dm_atomic_state *new_dm_state, *old_dm_state;
> >
> > -   new_dm_state = dm_atomic_get_new_state(state);
> > -   old_dm_state = dm_atomic_get_old_state(state);
> > +   for (i = 0; i < state->num_private_objs; i++) {
> > +   struct drm_private_obj *obj = 
> > state->private_objs[i].ptr;
> >
> > -   if (new_dm_state && old_dm_state) {
> > -   if (new_dm_state->context)
> > -   dc_release_state(new_dm_state->context);
> > +   if (obj->funcs == adev->dm.atomic_obj.funcs) {
> > +   int j = state->num_private_objs-1;
> >
> > -   new_dm_state->context = old_dm_state->context;
> > +   dm_atomic_destroy_state(obj,
> > +   state->private_objs[i].state);
> > +
> > +   /* If i is not at the end of the array then the
> > +* last element needs to be moved to where i was
> > +* before the array can safely be truncated.
> > +*/
> > +   if (i != j)
> > +   state->private_objs[i] =
> > +   state->private_objs[j];
> >
> > -   if (old_dm_state->context)
> > -   dc_retain_state(old_dm_state->context);
> > +   

Re: [PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-27 Thread Mazin Rezk
On Monday, July 27, 2020 1:40 AM, Mazin Rezk  wrote:
> This patch fixes a race condition that causes a use-after-free during
> amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking commits
> are requested and the second one finishes before the first. Essentially,
> this bug occurs when the following sequence of events happens:
>
> 1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
> deferred to the workqueue.
>
> 2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
> deferred to the workqueue.
>
> 3. Commit #2 starts before commit #1, dm_state #1 is used in the
> commit_tail and commit #2 completes, freeing dm_state #1.
>
> 4. Commit #1 starts after commit #2 completes, uses the freed dm_state
> 1 and dereferences a freelist pointer while setting the context.
>
> Since this bug has only been spotted with fast commits, this patch fixes
> the bug by clearing the dm_state instead of using the old dc_state for
> fast updates. In addition, since dm_state is only used for its dc_state
> and amdgpu_dm_atomic_commit_tail will retain the dc_state if none is found,
> removing the dm_state should not have any consequences in fast updates.
>
> This use-after-free bug has existed for a while now, but only caused a
> noticeable issue starting from 5.7-rc1 due to 3202fa62f ("slub: relocate
> freelist pointer to middle of object") moving the freelist pointer from
> dm_state->base (which was unused) to dm_state->context (which is
> dereferenced).
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207383
> Fixes: bd200d190f45 ("drm/amd/display: Don't replace the dc_state for fast 
> updates")
> Reported-by: Duncan <1i5t5.dun...@cox.net>
> Signed-off-by: Mazin Rezk 
> ---
>  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36 ++-
>  1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 86ffa0c2880f..710edc70e37e 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -8717,20 +8717,38 @@ static int amdgpu_dm_atomic_check(struct drm_device 
> *dev,
>* the same resource. If we have a new DC context as part of
>* the DM atomic state from validation we need to free it and
>* retain the existing one instead.
> +  *
> +  * Furthermore, since the DM atomic state only contains the DC
> +  * context and can safely be annulled, we can free the state
> +  * and clear the associated private object now to free
> +  * some memory and avoid a possible use-after-free later.
>*/
> - struct dm_atomic_state *new_dm_state, *old_dm_state;
>
> - new_dm_state = dm_atomic_get_new_state(state);
> - old_dm_state = dm_atomic_get_old_state(state);
> + for (i = 0; i < state->num_private_objs; i++) {
> + struct drm_private_obj *obj = 
> state->private_objs[i].ptr;
>
> - if (new_dm_state && old_dm_state) {
> - if (new_dm_state->context)
> - dc_release_state(new_dm_state->context);
> + if (obj->funcs == adev->dm.atomic_obj.funcs) {
> + int j = state->num_private_objs-1;
>
> - new_dm_state->context = old_dm_state->context;
> + dm_atomic_destroy_state(obj,
> + state->private_objs[i].state);
> +
> + /* If i is not at the end of the array then the
> +  * last element needs to be moved to where i was
> +  * before the array can safely be truncated.
> +  */
> + if (i != j)
> + state->private_objs[i] =
> + state->private_objs[j];
>
> - if (old_dm_state->context)
> - dc_retain_state(old_dm_state->context);
> + state->private_objs[j].ptr = NULL;
> + state->private_objs[j].state = NULL;
> + state->private_objs[j].old_state = NULL;
> + state->private_objs[j].new_state = NULL;
> +
> + state->num_private_objs = j;
> + break;
> +

[PATCH] drm/amd/display: Clear dm_state for fast updates

2020-07-27 Thread Mazin Rezk
This patch fixes a race condition that causes a use-after-free during
amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking commits
are requested and the second one finishes before the first. Essentially,
this bug occurs when the following sequence of events happens:

1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
deferred to the workqueue.

2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
deferred to the workqueue.

3. Commit #2 starts before commit #1, dm_state #1 is used in the
commit_tail and commit #2 completes, freeing dm_state #1.

4. Commit #1 starts after commit #2 completes, uses the freed dm_state
1 and dereferences a freelist pointer while setting the context.

Since this bug has only been spotted with fast commits, this patch fixes
the bug by clearing the dm_state instead of using the old dc_state for
fast updates. In addition, since dm_state is only used for its dc_state
and amdgpu_dm_atomic_commit_tail will retain the dc_state if none is found,
removing the dm_state should not have any consequences in fast updates.

This use-after-free bug has existed for a while now, but only caused a
noticeable issue starting from 5.7-rc1 due to 3202fa62f ("slub: relocate
freelist pointer to middle of object") moving the freelist pointer from
dm_state->base (which was unused) to dm_state->context (which is
dereferenced).

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207383
Fixes: bd200d190f45 ("drm/amd/display: Don't replace the dc_state for fast 
updates")
Reported-by: Duncan <1i5t5.dun...@cox.net>
Signed-off-by: Mazin Rezk 
---
 .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 36 ++-
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86ffa0c2880f..710edc70e37e 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -8717,20 +8717,38 @@ static int amdgpu_dm_atomic_check(struct drm_device 
*dev,
 * the same resource. If we have a new DC context as part of
 * the DM atomic state from validation we need to free it and
 * retain the existing one instead.
+*
+* Furthermore, since the DM atomic state only contains the DC
+* context and can safely be annulled, we can free the state
+* and clear the associated private object now to free
+* some memory and avoid a possible use-after-free later.
 */
-   struct dm_atomic_state *new_dm_state, *old_dm_state;

-   new_dm_state = dm_atomic_get_new_state(state);
-   old_dm_state = dm_atomic_get_old_state(state);
+   for (i = 0; i < state->num_private_objs; i++) {
+   struct drm_private_obj *obj = 
state->private_objs[i].ptr;

-   if (new_dm_state && old_dm_state) {
-   if (new_dm_state->context)
-   dc_release_state(new_dm_state->context);
+   if (obj->funcs == adev->dm.atomic_obj.funcs) {
+   int j = state->num_private_objs-1;

-   new_dm_state->context = old_dm_state->context;
+   dm_atomic_destroy_state(obj,
+   state->private_objs[i].state);
+
+   /* If i is not at the end of the array then the
+* last element needs to be moved to where i was
+* before the array can safely be truncated.
+*/
+   if (i != j)
+   state->private_objs[i] =
+   state->private_objs[j];

-   if (old_dm_state->context)
-   dc_retain_state(old_dm_state->context);
+   state->private_objs[j].ptr = NULL;
+   state->private_objs[j].state = NULL;
+   state->private_objs[j].old_state = NULL;
+   state->private_objs[j].new_state = NULL;
+
+   state->num_private_objs = j;
+   break;
+   }
}
}

--
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-26 Thread Mazin Rezk
On Saturday, July 25, 2020 12:59 AM, Duncan <1i5t5.dun...@cox.net> wrote:

> On Sat, 25 Jul 2020 03:03:52 +0000
> Mazin Rezk mn...@protonmail.com wrote:
>
> > > Am 24.07.20 um 19:33 schrieb Kees Cook:
> > >
> > > > There was a fix to disable the async path for this driver that
> > > > worked around the bug too, yes? That seems like a safer and more
> > > > focused change that doesn't revert the SLUB defense for all
> > > > users, and would actually provide a complete, I think, workaround
> >
> > That said, I haven't seen the async disabling patch. If you could
> > link to it, I'd be glad to test it out and perhaps we can use that
> > instead.
>
> I'm confused. Not to put words in Kees' mouth; /I/ am confused (which
> admittedly could well be just because I make no claims to be a
> coder and am simply reading the bug and thread, but I'd appreciate some
> "unconfusing" anyway).
>
> My interpretation of the "async disabling" reference was that it was to
> comment #30 on the bug:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=207383#c30
>
> ... which (if I'm not confused on this point too) appears to be yours.
> There it was stated...
>
> > > > >
>
> I've also found that this bug exclusively occurs when commit_work is on
> the workqueue. After forcing drm_atomic_helper_commit to run all of the
> commits without adding to the workqueue and running the OS, the issue
> seems to have disappeared.
> <<<<
>
> Would not forcing all commits to run directly, without placing them on
> the workqueue, be "async disabling"? That's what I /thought/ he was
> referencing.

Oh, I thought he was referring to a different patch. Kees, could I get
your confirmation on this?

The change I made actually affected all of the DRM code, although this could
easily be changed to be specific to amdgpu. (By forcing blocking on
amdgpu_dm's non-blocking commit code)

That said, I'd still need to test further because I only did test it for a
couple of hours then. Although it should work in theory.

>
> OTOH your base/context swap idea sounds like a possibly "less
> disturbance" workaround, if it works, and given the point in the
> commit cycle... (But if it's out Sunday it's likely too late to test
> and get it in now anyway; if it's another week, tho...)

The base/context swap idea should make the use-after-free behave how it
did in 5.6. Since the bug doesn't cause an issue in 5.6, it's less of a
"less disturbance" workaround and more of a "no disturbance" workaround.

Thanks,
Mazin Rezk

>
> 
>
> Duncan - No HTML messages please; they are filtered as spam.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-26 Thread Mazin Rezk
On Friday, July 24, 2020 5:19 PM, Paul Menzel  wrote:

> Dear Kees,
>
> Am 24.07.20 um 19:33 schrieb Kees Cook:
>
> > On Fri, Jul 24, 2020 at 09:45:18AM +0200, Paul Menzel wrote:
> >
> > > Am 24.07.20 um 00:32 schrieb Kees Cook:
> > >
> > > > On Thu, Jul 23, 2020 at 09:10:15PM +, Mazin Rezk wrote:
> > > > As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if 
> > > > commit
> > > > 3202fa62f ("slub: relocate freelist pointer to middle of object") 
> > > > should be
> > > > reverted for now to fix the regression for the users according to 
> > > > Linux’ no
> > > > regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be
> > > > reapplied. I know it’s not optimal, but as some testing is going to be
> > > > involved for the fix, I’d argue it’s the best option for the users.
> >
> > Well, the SLUB defense was already released in v5.7, so I'm not sure it
> > really helps for amdgpu_dm users seeing it there too.
>
> In my opinion, it would help, as the stable release could pick up the
> revert, ones it’s in Linus’ master branch.
>
> > There was a fix to disable the async path for this driver that worked
> > around the bug too, yes? That seems like a safer and more focused
> > change that doesn't revert the SLUB defense for all users, and would
> > actually provide a complete, I think, workaround whereas reverting
> > the SLUB change means the race still exists. For example, it would be
> > hit with slab poisoning, etc.
>
> I do not know. If there is such a fix, that would be great. But if you
> do not know, how should a normal user? ;-)
>
> Kind regards,
>
> Paul
>
> Kind regards,
>
> Paul

If we're talking about workarounds now, I suggest simply swapping the base
and context variables in struct dm_atomic_state. By that way, we won't need
to change non-amdgpu parts of the code (e.g. by reverting the SLUB patch).

Prior to 3202fa62f, the freelist pointer was stored in dm_state->base which
was never dereferenced and therefore caused no noticeable issue. After
3202fa62f, the freelist pointer is stored in the middle of the struct (i.e.
dm_state->context).

Swapping the position of the base and context variables in dm_atomic_state
should, in theory, revert this code back to it's pre-5.7 state since the
code would be back to overwriting base instead.

If we decide to use this workaround, I can write the patch and do more
extended tests to confirm it works around the issues.

That said, I haven't seen the async disabling patch. If you could link to
it, I'd be glad to test it out and perhaps we can use that instead.

Thanks,
Mazin Rezk

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-24 Thread Mazin Rezk
On Thursday, July 23, 2020 6:57 PM, Mazin Rezk  wrote:

> It seems that I spoke too soon. I ran the system for another hour after
> submitting the patch and the bug just occurred. :/
>
> Sadly, that means the bug isn't really fixed and that I have to go
> investigate further.
>
> At the very least, this patch seems to delay the occurrence of the bug
> significantly which may help in further discovering the cause.
>
> On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas 
> nicholas.kazlaus...@amd.com wrote:
>
> > On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
> >
> > > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > > running, causing a race condition where state (and then dm_state) is
> > > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug 
> > > has
> > > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > > was stored at the beginning of dm_state (base), which was unused. After
> > > changing the freelist pointer to be stored in the middle of the struct, 
> > > the
> > > freelist pointer overwrote the context, causing dc_state to become garbage
> > > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > > a freelist pointer.
> > > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > > Bugzilla [1].
> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > > Reported-by: Duncan 1i5t5.dun...@cox.net
> > > Signed-off-by: Mazin Rezk mn...@protonmail.com
> >
> > Thanks for the investigation and your patch. I appreciate the help in
> > trying to narrow down the root cause as this issue has been difficult to
> > reproduce on my setups.
> > Though I'm not sure this really resolves the issue - we make use of the
> > drm_atomic_helper_commit helper function from DRM which internally does
> > what you're doing with this patch:
> > drm_atomic_state_get(state);
> > if (nonblock)
> > queue_work(system_unbound_wq, >commit_work);
> >
> > else
> > commit_tail(state);
> >
> >
> > So even when it gets queued off to the unbound workqueue we still have a
> > reference on the state.
> > That reference gets dropped as part of commit tail helper in DRM as well:
> > if (funcs && funcs->atomic_commit_tail)
> >
> > funcs->atomic_commit_tail(old_state);
> >
> > else
> > drm_atomic_helper_commit_tail(old_state);
> >
> >
> > commit_time_ms = ktime_ms_delta(ktime_get(), start);
> > if (commit_time_ms > 0)
> >
> > drm_self_refresh_helper_update_avg_times(old_state,
> >  (unsigned long)commit_time_ms,
> >  new_self_refresh_mask);
> >
> >
> > drm_atomic_helper_commit_cleanup_done(old_state);
> > drm_atomic_state_put(old_state);
>
> I initially noticed that right after I wrote this patch so I was expecting
> the patch to fail. However, after several hours of testing, the crash just
> didn't occur so I believed the bug was fixed.
>
> > So instead of a use after free happening when we access the state we get
> > a double-free happening later at the end of commit tail in DRM.
> > What I think would be the right next step here is to actually determine
> > what sequence of IOCTLs and atomic commits are happening under your
> > setup with a very verbose dmesg log. You can set a debug level for DRM
> > in your kernel parameters with something like:
> > drm.debug=0x54
> > I don't see anything in amdgpu_dm.c that looks like it would be freeing
> > the state so I suspect something in the core is this doing this.
>
> Going through the KASAN use-after-free bug report in the Bugzilla
> attachments, it appears that the state is being freed at the end of
> commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
> the same old state twice? I can't quite think of any other possible
> explanation as to why that happens.

I think I've more or less con

Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-23 Thread Mazin Rezk
On Thursday, July 23, 2020 6:32 PM, Kees Cook  wrote:

> On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
>
> Nice work tracking this down!
>
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
>
> I do, however, object to this Fixes tag. :) The flaw appears to have
> been with amdgpu_dm's reference tracking of "state" in the nonblocking
> case. (How this reference counting is supposed to work correctly, though,
> I'm not sure.) If I look at where the drm helper was split from being
> the default callback, it looks like this was what introduced the bug:
>
> da5c47f682ab ("drm/amd/display: Remove acrtc->stream")
>
> ? 3202fa62f certainly exposed it much more quickly, but there was a race
> even without 3202fa62f where something could have realloced the memory
> and written over it.
>
> ---
>
> Kees Cook


Thanks, I'll be sure to avoid using 3202fa62f as the cause next time.
I just thought to do that because it was what made the use-after-free cause
a noticeable bug.

Also, by the way, I just realised the patch didn't completely solve the bug.
Sorry about that, making an LKML thread on this was hasty on my part. Should
I get further confirmation from the Bugzilla thread before submitting a patch
for this bug in the future?

Thanks,
Mazin Rezk
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-23 Thread Mazin Rezk
When amdgpu_dm_atomic_commit_tail is running in the workqueue,
drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
running, causing a race condition where state (and then dm_state) is
sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
occurred since 5.7-rc1 and is well documented among polaris11 users [1].

Prior to 5.7, this was not a noticeable issue since the freelist pointer
was stored at the beginning of dm_state (base), which was unused. After
changing the freelist pointer to be stored in the middle of the struct, the
freelist pointer overwrote the context, causing dc_state to become garbage
data and made the call to dm_enable_per_frame_crtc_master_sync dereference
a freelist pointer.

This patch fixes the aforementioned issue by calling drm_atomic_state_get
in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.

According to my testing on 5.8.0-rc6, this should fix bug 207383 on
Bugzilla [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
Reported-by: Duncan <1i5t5.dun...@cox.net>
Signed-off-by: Mazin Rezk 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 86ffa0c2880f..86d6652872f2 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device *dev,
 * unset legacy_cursor_update
 */

+   drm_atomic_state_get(state);
return drm_atomic_helper_commit(dev, state, nonblock);

/*TODO Handle EINTR, reenable IRQ*/
@@ -7628,6 +7629,8 @@ static void amdgpu_dm_atomic_commit_tail(struct 
drm_atomic_state *state)

if (dc_state_temp)
dc_release_state(dc_state_temp);
+
+   drm_atomic_state_put(state);
 }


--
2.27.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] amdgpu_dm: fix nonblocking atomic commit use-after-free

2020-07-23 Thread Mazin Rezk
It seems that I spoke too soon. I ran the system for another hour after
submitting the patch and the bug just occurred. :/

Sadly, that means the bug isn't really fixed and that I have to go
investigate further.

At the very least, this patch seems to delay the occurrence of the bug
significantly which may help in further discovering the cause.

On Thursday, July 23, 2020 6:16 PM, Kazlauskas, Nicholas 
 wrote:

> On 2020-07-23 5:10 p.m., Mazin Rezk wrote:
>
> > When amdgpu_dm_atomic_commit_tail is running in the workqueue,
> > drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
> > running, causing a race condition where state (and then dm_state) is
> > sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
> > occurred since 5.7-rc1 and is well documented among polaris11 users [1].
> > Prior to 5.7, this was not a noticeable issue since the freelist pointer
> > was stored at the beginning of dm_state (base), which was unused. After
> > changing the freelist pointer to be stored in the middle of the struct, the
> > freelist pointer overwrote the context, causing dc_state to become garbage
> > data and made the call to dm_enable_per_frame_crtc_master_sync dereference
> > a freelist pointer.
> > This patch fixes the aforementioned issue by calling drm_atomic_state_get
> > in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
> > drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.
> > According to my testing on 5.8.0-rc6, this should fix bug 207383 on
> > Bugzilla [1].
> > [1] https://bugzilla.kernel.org/show_bug.cgi?id=207383
> > Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")
> > Reported-by: Duncan 1i5t5.dun...@cox.net
> > Signed-off-by: Mazin Rezk mn...@protonmail.com
>
> Thanks for the investigation and your patch. I appreciate the help in
> trying to narrow down the root cause as this issue has been difficult to
> reproduce on my setups.
>
> Though I'm not sure this really resolves the issue - we make use of the
> drm_atomic_helper_commit helper function from DRM which internally does
> what you're doing with this patch:
>
> drm_atomic_state_get(state);
> if (nonblock)
> queue_work(system_unbound_wq, >commit_work);
>
> else
>   commit_tail(state);
>
>
> So even when it gets queued off to the unbound workqueue we still have a
> reference on the state.
>
> That reference gets dropped as part of commit tail helper in DRM as well:
>
> if (funcs && funcs->atomic_commit_tail)
>
>   funcs->atomic_commit_tail(old_state);
>
> else
>   drm_atomic_helper_commit_tail(old_state);
>
>
> commit_time_ms = ktime_ms_delta(ktime_get(), start);
> if (commit_time_ms > 0)
>
>   drm_self_refresh_helper_update_avg_times(old_state,
>(unsigned long)commit_time_ms,
>new_self_refresh_mask);
>
>
> drm_atomic_helper_commit_cleanup_done(old_state);
>
> drm_atomic_state_put(old_state);
>

I initially noticed that right after I wrote this patch so I was expecting
the patch to fail. However, after several hours of testing, the crash just
didn't occur so I believed the bug was fixed.

> So instead of a use after free happening when we access the state we get
> a double-free happening later at the end of commit tail in DRM.
>
> What I think would be the right next step here is to actually determine
> what sequence of IOCTLs and atomic commits are happening under your
> setup with a very verbose dmesg log. You can set a debug level for DRM
> in your kernel parameters with something like:
>
> drm.debug=0x54
>
> I don't see anything in amdgpu_dm.c that looks like it would be freeing
> the state so I suspect something in the core is this doing this.

Going through the KASAN use-after-free bug report in the Bugzilla
attachments, it appears that the state is being freed at the end of
commit_tail. Perhaps amdgpu_dm_atomic_commit_tail is being called on the
the same old state twice? I can't quite think of any other possible
explanation as to why that happens.

>
> > drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 3 +++
> > 1 file changed, 3 insertions(+)
> > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > index 86ffa0c2880f..86d6652872f2 100644
> > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> > @@ -7303,6 +7303,7 @@ static int amdgpu_dm_atomic_commit(struct drm_device 
> > *dev,
> > * unset legacy_cursor_update