On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:
> 
> 
> On 8/12/2021 10:24 PM, Michel Dänzer wrote:
>> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
>>> On 8/12/2021 1:41 PM, Michel Dänzer wrote:
>>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
>>>>> Hi James,
>>>>>
>>>>> Evan seems to have understood how this all works together.
>>>>>
>>>>> See while any begin/end use critical section is active the work should 
>>>>> not be active.
>>>>>
>>>>> When you handle only one ring you can just call cancel in begin use and 
>>>>> schedule in end use. But when you have more than one ring you need a lock 
>>>>> or counter to prevent concurrent work items to be started.
>>>>>
>>>>> Michelle's idea to use mod_delayed_work is a bad one because it assumes 
>>>>> that the delayed work is still running.
>>>>
>>>> It merely assumes that the work may already have been scheduled before.
>>>>
>>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While 
>>>> I think it can still have some effect when there's a single work item for 
>>>> multiple rings, as described by James, it's probably negligible, since 
>>>> presumably the time intervals between ring_begin_use and ring_end_use are 
>>>> normally much shorter than a second.
>>>>
>>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same 
>>>> as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine 
>>>> with dropping it.
>>>>
>>>>
>>>>> Something similar applies to the first patch I think,
>>>>
>>>> There are no cancel work calls in that case, so the commit log is accurate 
>>>> TTBOMK.
>>>
>>> Curious -
>>>
>>> For patch 1, does it make a difference if any delayed work scheduled is 
>>> cancelled in the else part before proceeding?
>>>
>>> } else if (!enable && adev->gfx.gfx_off_state) {
>>> cancel_delayed_work();
>>
>> I tried the patch below.
>>
>> While this does seem to fix the problem as well, I see a potential issue:
>>
>> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
>> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
>> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync
>>
>> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain 
>> though)
> 
> Should use the cancel_delayed_work instead of the _sync version.

The thing is, it's not clear to me from cancel_delayed_work's description that 
it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if 
it's already running. If that's not guaranteed, it's prone to the same deadlock.

> As you mentioned - at best work is not scheduled yet and cancelled 
> successfully, or at worst it's waiting for the mutex. In the worst case, if 
> amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl 
> unlocks it, there is an extra check as below.
> 
> if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)
> 
> The count wouldn't be 0 and hence it won't enable GFXOFF.

I'm not sure, but it might also be possible for 
amdgpu_device_delay_enable_gfx_off to get the mutex only after 
amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back 
to 0.


>> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm 
>> not sure how offhand. (With cancel_delayed_work instead, I'm worried 
>> amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW 
>> immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that 
>> might happen with mod_delayed_work as well...)
> 
> As mentioned earlier, cancel_delayed_work won't cause this issue.
> 
> In the mod_delayed_ patch, mod_ version is called only when req_count is 0. 
> While that is a good thing, it keeps alive one more contender for the mutex.

Not sure what you mean. It leaves the possibility of 
amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried 
to postpone it. As discussed above, something similar might be possible with 
cancel_delayed_work as well.

> The cancel_ version eliminates that contender if happens to be called at the 
> right time (more likely if there are multiple requests to disable gfxoff). On 
> the other hand, don't know how costly it is to call cancel_ every time on the 
> else part (or maybe call only once when count increments to 1?).

Sure, why not, though I doubt it matters much — I expect 
adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common 
case by far.


I sent out a v2 patch which should address all these issues.


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer

Reply via email to