On 2021-08-17 11:37 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/17/2021 2:56 PM, Michel Dänzer wrote:
>> On 2021-08-17 11:12 a.m., Lazar, Lijo wrote:
>>>
>>>
>>> On 8/17/2021 1:53 PM, Michel Dänzer wrote:
>>>> From: Michel Dänzer <mdaen...@redhat.com>
>>>>
>>>> schedule_delayed_work does not push back the work if it was already
>>>> scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
>>>> after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
>>>> was disabled and re-enabled again during those 100 ms.
>>>>
>>>> This resulted in frame drops / stutter with the upcoming mutter 41
>>>> release on Navi 14, due to constantly enabling GFXOFF in the HW and
>>>> disabling it again (for getting the GPU clock counter).
>>>>
>>>> To fix this, call cancel_delayed_work_sync when the disable count
>>>> transitions from 0 to 1, and only schedule the delayed work on the
>>>> reverse transition, not if the disable count was already 0. This makes
>>>> sure the delayed work doesn't run at unexpected times, and allows it to
>>>> be lock-free.
>>>>
>>>> v2:
>>>> * Use cancel_delayed_work_sync & mutex_trylock instead of
>>>>     mod_delayed_work.
>>>> v3:
>>>> * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
>>>> v4:
>>>> * Fix race condition between amdgpu_gfx_off_ctrl incrementing
>>>>     adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
>>>>     checking for it to be 0 (Evan Quan)
>>>>
>>>> Cc: sta...@vger.kernel.org
>>>> Reviewed-by: Lijo Lazar <lijo.la...@amd.com> # v3
>>>> Acked-by: Christian König <christian.koe...@amd.com> # v3
>>>> Signed-off-by: Michel Dänzer <mdaen...@redhat.com>
>>>> ---
>>>>
>>>> Alex, probably best to wait a bit longer before picking this up. :)
>>>>
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 11 +++----
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    | 36 +++++++++++++++-------
>>>>    2 files changed, 30 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index f3fd5ec710b6..f944ed858f3e 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2777,12 +2777,11 @@ static void 
>>>> amdgpu_device_delay_enable_gfx_off(struct work_struct *work)
>>>>        struct amdgpu_device *adev =
>>>>            container_of(work, struct amdgpu_device, 
>>>> gfx.gfx_off_delay_work.work);
>>>>    -    mutex_lock(&adev->gfx.gfx_off_mutex);
>>>> -    if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, 
>>>> AMD_IP_BLOCK_TYPE_GFX, true))
>>>> -            adev->gfx.gfx_off_state = true;
>>>> -    }
>>>> -    mutex_unlock(&adev->gfx.gfx_off_mutex);
>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_state);
>>>> +    WARN_ON_ONCE(adev->gfx.gfx_off_req_count);
>>>> +
>>>> +    if (!amdgpu_dpm_set_powergating_by_smu(adev, AMD_IP_BLOCK_TYPE_GFX, 
>>>> true))
>>>> +        adev->gfx.gfx_off_state = true;
>>>>    }
>>>>      /**
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> index a0be0772c8b3..b4ced45301be 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
>>>> @@ -563,24 +563,38 @@ void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, 
>>>> bool enable)
>>>>          mutex_lock(&adev->gfx.gfx_off_mutex);
>>>>    -    if (!enable)
>>>> -        adev->gfx.gfx_off_req_count++;
>>>> -    else if (adev->gfx.gfx_off_req_count > 0)
>>>> +    if (enable) {
>>>> +        /* If the count is already 0, it means there's an imbalance bug 
>>>> somewhere.
>>>> +         * Note that the bug may be in a different caller than the one 
>>>> which triggers the
>>>> +         * WARN_ON_ONCE.
>>>> +         */
>>>> +        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))
>>>> +            goto unlock;
>>>> +
>>>>            adev->gfx.gfx_off_req_count--;
>>>>    -    if (enable && !adev->gfx.gfx_off_state && 
>>>> !adev->gfx.gfx_off_req_count) {
>>>> -        schedule_delayed_work(&adev->gfx.gfx_off_delay_work, 
>>>> GFX_OFF_DELAY_ENABLE);
>>>> -    } else if (!enable && adev->gfx.gfx_off_state) {
>>>> -        if (!amdgpu_dpm_set_powergating_by_smu(adev, 
>>>> AMD_IP_BLOCK_TYPE_GFX, false)) {
>>>> -            adev->gfx.gfx_off_state = false;
>>>> +        if (adev->gfx.gfx_off_req_count == 0 && !adev->gfx.gfx_off_state)
>>>> +            schedule_delayed_work(&adev->gfx.gfx_off_delay_work, 
>>>> GFX_OFF_DELAY_ENABLE);
>>>> +    } else {
>>>> +        if (adev->gfx.gfx_off_req_count == 0) {
>>>> +            cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
>>>> +
>>>> +            if (adev->gfx.gfx_off_state &&
>>>
>>> More of a question which I didn't check last time - Is this expected to be 
>>> true when the disable call comes in first?
>>
>> My assumption is that cancel_delayed_work_sync guarantees 
>> amdgpu_device_delay_enable_gfx_off's assignment is visible here.
>>
> 
> To clarify - when nothing is scheduled. If enable() is called when the count 
> is 0, it goes to unlock. Now the expectation is someone to call Disable first.

Yes, the very first amdgpu_gfx_off_ctrl call must pass enable=false, or it's a 
bug, which

        if (WARN_ON_ONCE(adev->gfx.gfx_off_req_count == 0))

will catch.


> Let's say  Disable() is called first, then the variable will be false, right?

Ohh, I see what you mean. The first time amdgpu_gfx_off_ctrl is called with 
enable=false, adev->gfx.gfx_off_state == false (what it was initialized to), so 
it doesn't actually disable GFXOFF in HW.

Note that this is a separate pre-existing bug, not a regression of my patch.

I wonder what's the best solution for that, move the adev->gfx.gfx_off_state 
assignments into amdgpu_dpm_set_powergating_by_smu?


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

Reply via email to