Dear Kees,

Am 24.07.20 um 00:32 schrieb Kees Cook:
On Thu, Jul 23, 2020 at 09:10:15PM +0000, Mazin Rezk wrote:
When amdgpu_dm_atomic_commit_tail is running in the workqueue,
drm_atomic_state_put will get called while amdgpu_dm_atomic_commit_tail is
running, causing a race condition where state (and then dm_state) is
sometimes freed while amdgpu_dm_atomic_commit_tail is running. This bug has
occurred since 5.7-rc1 and is well documented among polaris11 users [1].

Prior to 5.7, this was not a noticeable issue since the freelist pointer
was stored at the beginning of dm_state (base), which was unused. After
changing the freelist pointer to be stored in the middle of the struct, the
freelist pointer overwrote the context, causing dc_state to become garbage
data and made the call to dm_enable_per_frame_crtc_master_sync dereference
a freelist pointer.

This patch fixes the aforementioned issue by calling drm_atomic_state_get
in amdgpu_dm_atomic_commit before drm_atomic_helper_commit is called and
drm_atomic_state_put after amdgpu_dm_atomic_commit_tail is complete.

According to my testing on 5.8.0-rc6, this should fix bug 207383 on
Bugzilla [1].

[1] https://bugzilla.kernel.org/show_bug.cgi?id=207383

Nice work tracking this down!

Fixes: 3202fa62f ("slub: relocate freelist pointer to middle of object")

I do, however, object to this Fixes tag. :) The flaw appears to have
been with amdgpu_dm's reference tracking of "state" in the nonblocking
case. (How this reference counting is supposed to work correctly, though,
I'm not sure.) If I look at where the drm helper was split from being
the default callback, it looks like this was what introduced the bug:

da5c47f682ab ("drm/amd/display: Remove acrtc->stream")

? 3202fa62f certainly exposed it much more quickly, but there was a race
even without 3202fa62f where something could have realloced the memory
and written over it.

I understand the Fixes tag mainly a help when backporting commits.

As Linux 5.8-rc7 is going to be released this Sunday, I wonder, if commit 3202fa62f ("slub: relocate freelist pointer to middle of object") should be reverted for now to fix the regression for the users according to Linux’ no regression policy. Once the AMDGPU/DRM driver issue is fixed, it can be reapplied. I know it’s not optimal, but as some testing is going to be involved for the fix, I’d argue it’s the best option for the users.


Kind regards,

Paul

Reply via email to