On 2021-08-17 11:37 a.m., Lazar, Lijo wrote: > > > On 8/17/2021 2:56 PM, Michel Dänzer wrote: >> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote: >>> >>> >>> On 8/17/2021 1:53 PM, Michel Dänzer wrote: >>>> From: Michel Dänzer <mdaen...@redhat.com> >>>> >>>> schedule_delayed_work does not push back the work if it was already >>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms >>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF >>>> was disabled and re-enabled again during those 100 ms. >>>> >>>> This resulted in frame drops / stutter with the upcoming mutter 41 >>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and >>>> disabling it again (for getting the GPU clock counter). >>>> >>>> To fix this, call cancel_delayed_work_sync when the disable count >>>> transitions from 0 to 1, and only schedule the delayed work on the >>>> reverse transition, not if the disable count was already 0. This makes >>>> sure the delayed work doesn't run at unexpected times, and allows it to >>>> be lock-free. >>>> >>>> v2: >>>> * Use cancel_delayed_work_sync & mutex_trylock instead of >>>> mod_delayed_work. >>>> v3: >>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König) >>>> v4: >>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing >>>> adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off >>>> checking for it to be 0 (Evan Quan) >>>> >>>> Cc: sta...@vger.kernel.org >>>> Reviewed-by: Lijo Lazar <lijo.la...@amd.com> # v3 >>>> Acked-by: Christian König <christian.koe...@amd.com> # v3 >>>> Signed-off-by: Michel Dänzer <mdaen...@redhat.com> >>>> --- >>>> >>>> Alex, probably best to wait a bit longer before picking this up. :) >>>> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++---- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 36 +++++++++++++++------- >>>> 2 files changed, 30 insertions(+), 17 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> index f3fd5ec710b6..f944ed858f3e 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>> @@ -2777,12 +2777,11 @@ static void >>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work) >>>> struct amdgpu_device *adev = >>>> container_of(work, struct amdgpu_device, >>>> gfx.gfx_off_delay_work.work); >>>> - mutex_lock(&adev->gfx.gfx_off_mutex); >>>> - if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) { >>>> - if (!amdgpu_dpm_set_powergating_by_smu(adev, >>>> AMD_IP_BLOCK_TYPE_GFX, true)) >>>> - adev->gfx.gfx_off_state = true; >>>> - } >>>> - mutex_unlock(&adev->gfx.gfx_off_mutex); >>>> + WARN_ON_ONCE(adev->gfx.gfx_off_state); >>>> + WARN_ON_ONCE(adev->gfx.gfx_off_req_count); >>>> + >>>> + if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, >>>> true)) >>>> + adev->gfx.gfx_off_state = true; >>>> } >>>> /** >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>>> index a0be0772c8b3..b4ced45301be 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c >>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, >>>> bool enable) >>>> mutex_lock(&adev->gfx.gfx_off_mutex); >>>> - if (!enable) >>>> - adev->gfx.gfx_off_req_count++; >>>> - else if (adev->gfx.gfx_off_req_count > 0) >>>> + if (enable) { >>>> + /* If the count is already 0, it means there's an imbalance bug >>>> somewhere. >>>> + * Note that the bug may be in a different caller than the one >>>> which triggers the >>>> + * WARN_ON_ONCE. >>>> + */ >>>> + if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0)) >>>> + goto unlock; >>>> + >>>> adev->gfx.gfx_off_req_count--; >>>> - if (enable && !adev->gfx.gfx_off_state && >>>> !adev->gfx.gfx_off_req_count) { >>>> - schedule_delayed_work(&adev->gfx.gfx_off_delay_work, >>>> GFX_OFF_DELAY_ENABLE); >>>> - } else if (!enable && adev->gfx.gfx_off_state) { >>>> - if (!amdgpu_dpm_set_powergating_by_smu(adev, >>>> AMD_IP_BLOCK_TYPE_GFX, false)) { >>>> - adev->gfx.gfx_off_state = false; >>>> + if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state) >>>> + schedule_delayed_work(&adev->gfx.gfx_off_delay_work, >>>> GFX_OFF_DELAY_ENABLE); >>>> + } else { >>>> + if (adev->gfx.gfx_off_req_count == 0) { >>>> + cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work); >>>> + >>>> + if (adev->gfx.gfx_off_state && >>> >>> More of a question which I didn't check last time - Is this expected to be >>> true when the disable call comes in first? >> >> My assumption is that cancel_delayed_work_sync guarantees >> amdgpu_device_delay_enable_gfx_off's assignment is visible here. >> > > To clarify - when nothing is scheduled. If enable() is called when the count > is 0, it goes to unlock. Now the expectation is someone to call Disable first.
Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a bug, which if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0)) will catch. > Let's say Disable() is called first, then the variable will be false, right? Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so it doesn't actually disable GFXOFF in HW. Note that this is a separate pre-existing bug, not a regression of my patch. I wonder what's the best solution for that, move the adev->gfx.gfx_off_state assignments into amdgpu_dpm_set_powergating_by_smu? -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer