On 8/13/2021 4:01 PM, Michel Dänzer wrote:
On 2021-08-13 6:23 a.m., Lazar, Lijo wrote:


On 8/12/2021 10:24 PM, Michel Dänzer wrote:
On 2021-08-12 1:33 p.m., Lazar, Lijo wrote:
On 8/12/2021 1:41 PM, Michel Dänzer wrote:
On 2021-08-12 7:55 a.m., Koenig, Christian wrote:
Hi James,

Evan seems to have understood how this all works together.

See while any begin/end use critical section is active the work should not be 
active.

When you handle only one ring you can just call cancel in begin use and
schedule in end use. But when you have more than one ring you need a lock or
counter to prevent concurrent work items from being started.
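
The counter variant described above could look roughly like the following single-threaded toy model. All names here (toy_device, toy_begin_use, toy_end_use, toy_idle_timer_fires) are hypothetical and are not the amdgpu or workqueue API. A real driver would hold a lock around each function body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state shared by several rings (illustration only). */
struct toy_device {
	unsigned int users;   /* rings currently between begin_use and end_use */
	bool idle_pending;    /* idle work queued, delay not yet expired */
	bool powered;         /* block is powered up */
};

/* Any ring entering its critical section cancels queued idle work. */
static void toy_begin_use(struct toy_device *dev)
{
	dev->users++;
	dev->idle_pending = false;
	dev->powered = true;
}

/* Only the last ring leaving schedules the idle work. */
static void toy_end_use(struct toy_device *dev)
{
	assert(dev->users > 0);
	if (--dev->users == 0)
		dev->idle_pending = true;
}

/* The delay expires: power down only if nobody is using the block. */
static void toy_idle_timer_fires(struct toy_device *dev)
{
	if (dev->idle_pending && dev->users == 0) {
		dev->idle_pending = false;
		dev->powered = false;
	}
}
```

With two rings active, the first toy_end_use() leaves the counter at 1 and schedules nothing, so the block stays powered until the second ring is done as well.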

Michel's idea to use mod_delayed_work is a bad one because it assumes that
the delayed work is still running.

It merely assumes that the work may already have been scheduled before.

Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While I 
think it can still have some effect when there's a single work item for 
multiple rings, as described by James, it's probably negligible, since 
presumably the time intervals between ring_begin_use and ring_end_use are 
normally much shorter than a second.

So, while patch 2 is at worst a no-op (since mod_delayed_work is the same as 
schedule_delayed_work if the work hasn't been scheduled yet), I'm fine with 
dropping it.
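
The difference between the two calls can be illustrated with a small toy model. This is only a sketch of the documented behaviour (schedule leaves an already pending item alone, mod re-arms it). The toy_dwork struct and the toy_schedule and toy_mod helpers are hypothetical stand-ins, not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct delayed_work: a pending flag and a deadline. */
struct toy_dwork {
	bool pending;
	unsigned long expires;
};

/* Models schedule_delayed_work(): does nothing and returns false if pending. */
static bool toy_schedule(struct toy_dwork *w, unsigned long now,
			 unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always (re)arms; returns whether it was pending. */
static bool toy_mod(struct toy_dwork *w, unsigned long now,
		    unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}
```

On an idle item both helpers leave the same state behind, which is the "at worst a no-op" case. They differ only when the item is already pending: toy_schedule() keeps the old deadline, toy_mod() pushes it out.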


Something similar applies to the first patch I think,

There are no cancel work calls in that case, so the commit log is accurate 
TTBOMK.

Curious -

For patch 1, does it make a difference if any delayed work scheduled is 
cancelled in the else part before proceeding?

} else if (!enable && adev->gfx.gfx_off_state) {
cancel_delayed_work();

I tried the patch below.

While this does seem to fix the problem as well, I see a potential issue:

1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex
2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock
3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync

I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain though)

Should use the cancel_delayed_work instead of the _sync version.

The thing is, it's not clear to me from cancel_delayed_work's description that 
it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if 
it's already running. If that's not guaranteed, it's prone to the same deadlock.

From what I understood from the description, cancel_delayed_work only initiates a cancel. If the work has already started, it returns false to say it couldn't cancel; otherwise it cancels the scheduled work and returns true. The note below says to use the _sync version specifically if we need to wait for work that has already started, and that definitely has the deadlock problem you mentioned above.

 * Note:
 * The work callback function may still be running on return, unless
 * it returns %true and the work doesn't re-arm itself. Explicitly flush or
 * use cancel_delayed_work_sync() to wait on it.
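
A toy model of the semantics being discussed, assuming the reading of the note above is right (the non-sync cancel never waits for a running callback). The state enum and both helpers are hypothetical and only illustrate the reasoning; they are not kernel code.

```c
#include <stdbool.h>

/* Hypothetical three-state model of a delayed work item. */
enum toy_state { TOY_IDLE, TOY_PENDING, TOY_RUNNING };

/*
 * Models cancel_delayed_work(): removes a pending item and returns true,
 * otherwise returns false without waiting for a running callback.
 */
static bool toy_cancel(enum toy_state *s)
{
	if (*s == TOY_PENDING) {
		*s = TOY_IDLE;
		return true;
	}
	return false;
}

/*
 * Models the hazard with cancel_delayed_work_sync(): it waits for a running
 * callback, so it cannot make progress if the caller holds a lock that the
 * callback is blocked on.
 */
static bool toy_cancel_sync_would_deadlock(enum toy_state s,
					   bool caller_holds_cb_lock)
{
	return s == TOY_RUNNING && caller_holds_cb_lock;
}
```

In this model the three-step scenario listed earlier corresponds to toy_cancel_sync_would_deadlock(TOY_RUNNING, true), while toy_cancel() on the same state simply returns false.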



As you mentioned - at best work is not scheduled yet and cancelled 
successfully, or at worst it's waiting for the mutex. In the worst case, if 
amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl 
unlocks it, there is an extra check as below.

if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count)

The count wouldn't be 0 and hence it won't enable GFXOFF.

I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off 
to get the mutex only after amdgpu_gfx_off_ctrl was called again and set 
adev->gfx.gfx_off_req_count back to 0.
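
Both outcomes can be walked through with a single-threaded toy model of the check quoted above. The struct and the two helpers are hypothetical simplifications (no mutex, no SMU call) that reuse the field names from this thread; they are not the driver code.

```c
#include <stdbool.h>

/* Hypothetical mirror of the two fields discussed above. */
struct toy_gfx {
	bool gfx_off_state;             /* true once GFXOFF is enabled in HW */
	unsigned int gfx_off_req_count; /* outstanding "disable GFXOFF" requests */
};

/* Models a control call; assumed to run with the mutex held. */
static void toy_gfx_off_ctrl(struct toy_gfx *g, bool enable)
{
	if (!enable) {
		g->gfx_off_req_count++;
		g->gfx_off_state = false;
	} else if (g->gfx_off_req_count > 0) {
		g->gfx_off_req_count--;
	}
}

/* Models the delayed work body once it finally acquires the mutex. */
static void toy_delayed_enable(struct toy_gfx *g)
{
	if (!g->gfx_off_state && !g->gfx_off_req_count)
		g->gfx_off_state = true;
}
```

If the work runs while a disable request is outstanding, the count is 1 and nothing happens. If it only gets the mutex after a later enable call has brought the count back to 0, it does enable GFXOFF, which is the case raised here.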


Yes, this is a case we can't avoid either way. If the work has already started, then mod_delayed_ doesn't have any impact either. Another case is when the work thread has already got the mutex and a disable request comes in just at that time. The request needs to wait until the work releases the mutex, which could mean GFXOFF is enabled and then immediately disabled.


Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm 
not sure how offhand. (With cancel_delayed_work instead, I'm worried 
amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW 
immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that might 
happen with mod_delayed_work as well...)

As mentioned earlier, cancel_delayed_work won't cause this issue.

In the mod_delayed_ patch, mod_ version is called only when req_count is 0. 
While that is a good thing, it keeps alive one more contender for the mutex.

Not sure what you mean. It leaves the possibility of 
amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried 
to postpone it. As discussed above, something similar might be possible with 
cancel_delayed_work as well.


mod_delayed_ is called only when req_count gets back to 0. If another disable request comes after that, it neither cancels the scheduled work nor adjusts the delay.

Ex:
Disable gfxoff -> Enable gfxoff (now the work is scheduled) -> Disable gfxoff (within 5ms or whatever the delay be, but this call won't go to the mod_delayed path to delay it further) -> Work starts after 5ms and creates a contention for the mutex -> Enable gfxoff

When cancel_ is used, the second disable call immediately cancels out any work that is scheduled but not started and it doesn't create an unnecessary contention for the mutex. It's a matter of who gets the mutex first. Cancel has a better chance to eliminate the second thread possibility.
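
The sequence in the example can be replayed in a toy model that only tracks whether the delayed work ends up running (and therefore contending for the mutex). Everything below is a hypothetical simplification of the two approaches, not the patches themselves, and it stops before the final Enable step.

```c
#include <stdbool.h>

/* Hypothetical model: a request counter, a pending flag, and one statistic. */
struct toy_gfxoff {
	unsigned int req_count;  /* outstanding disable requests */
	bool work_pending;       /* delayed enable work queued, not yet started */
	unsigned int work_runs;  /* times the work ran and had to take the mutex */
};

/* Enable request: arm the delayed work once the last disable request is gone. */
static void toy_enable(struct toy_gfxoff *g)
{
	if (g->req_count > 0 && --g->req_count == 0)
		g->work_pending = true;
}

/* Disable request: optionally cancel work that is queued but not started. */
static void toy_disable(struct toy_gfxoff *g, bool cancel_on_disable)
{
	g->req_count++;
	if (cancel_on_disable)
		g->work_pending = false;
}

/* Delay expires: pending work starts and contends for the mutex. */
static void toy_delay_expires(struct toy_gfxoff *g)
{
	if (g->work_pending) {
		g->work_pending = false;
		g->work_runs++;
	}
}

/* Disable -> Enable -> Disable -> delay expires; returns how often work ran. */
static unsigned int toy_run_sequence(bool cancel_on_disable)
{
	struct toy_gfxoff g = {0};

	toy_disable(&g, cancel_on_disable);
	toy_enable(&g);
	toy_disable(&g, cancel_on_disable);
	toy_delay_expires(&g);
	return g.work_runs;
}
```

Without the cancel, the work still fires once during the second disable window; with it, the work never starts in this sequence.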

The cancel_ version eliminates that contender if it happens to be called at the
right time (more likely if there are multiple requests to disable gfxoff). On
the other hand, I don't know how costly it is to call cancel_ every time in the
else part (or maybe call it only once, when the count increments to 1?).

Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count 
transitioning between 0 <-> 1 to be the most common case by far.


I sent out a v2 patch which should address all these issues.


Will check that.

Thanks,
Lijo

