Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-18 Thread Daniel Stone
On Mon, 18 Sept 2023 at 23:02, Helen Koike  wrote:
> On 14/09/2023 05:12, Daniel Vetter wrote:
> > Also yes how that landed without anyone running lockdep is ... not good. I
> > guess we need a lockdep enabled drm ci target that runs vkms tests asap
> > :-)
>
> btw, I just executed a draft version of vkms targed on the ci, we do get
> the warning:
>
> https://gitlab.freedesktop.org/helen.fornazier/linux/-/jobs/49156305#L623
>
> I'm just not sure if tests would fail (since it is a warning) and it has
> a chance to be ignored if people don't look at the logs (unless if igt
> already handles that).

Hmm, dmesg-warn is already a separate igt test status. I guess it just
needs to be handled explicitly.

> I still need to do some adjustments (it seems the tests is hanging
> somewhere and we got a timeout) but we should have vkms in drm ci soon.

Might be due to the locking explosion? The kernels should probably
have soft-lockup detection enabled as well as lockdep.

Cheers,
Daniel


Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-18 Thread Helen Koike




On 14/09/2023 05:12, Daniel Vetter wrote:

On Thu, Sep 14, 2023 at 03:33:41PM +0900, Tetsuo Handa wrote:

On 2023/09/14 6:08, Thomas Gleixner wrote:

Maybe the VKMS people need to understand locking in the first place. The
first thing I saw in this code is:

static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
{
...
mutex_unlock(&output->enabled_lock);

What?

Unlocking a mutex in the context of a hrtimer callback is simply
violating all mutex locking rules.

How has this code ever survived lock debugging without triggering a big
fat warning?


Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer
and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't
tested with the lock debugging yet...


Yeah that needs an immediate revert, there's not much that looks legit in
that patch. I'll chat with Maira.

Also yes how that landed without anyone running lockdep is ... not good. I
guess we need a lockdep enabled drm ci target that runs vkms tests asap
:-)


btw, I just executed a draft version of vkms targed on the ci, we do get 
the warning:


https://gitlab.freedesktop.org/helen.fornazier/linux/-/jobs/49156305#L623

I'm just not sure if tests would fail (since it is a warning) and it has 
a chance to be ignored if people don't look at the logs (unless if igt 
already handles that).


I still need to do some adjustments (it seems the tests is hanging 
somewhere and we got a timeout) but we should have vkms in drm ci soon.


Regards,
Helen





Maíra and Arthur, mutex_unlock() from interrupt context is not permitted.
Please revert that patch immediately.
I guess that a semaphore (down()/up()) could be used instead of a mutex.


 From a quick look this smells like a classic "try to use locking when you
want synchronization primitives", so semaphore here doesn't look any
better. The vkms_set_composer() function was originally for crc
generation, where it's userspace's job to make sure they wait for all the
crc they need to be generated before they shut it down again. But for
writeback the kernel must guarantee that the compositiona actually
happens, and the current function just doesn't make any such guarantees.

Cheers, Daniel


Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-14 Thread Daniel Vetter
On Thu, Sep 14, 2023 at 03:33:41PM +0900, Tetsuo Handa wrote:
> On 2023/09/14 6:08, Thomas Gleixner wrote:
> > Maybe the VKMS people need to understand locking in the first place. The
> > first thing I saw in this code is:
> > 
> > static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> > {
> >...
> >mutex_unlock(&output->enabled_lock);
> > 
> > What?
> > 
> > Unlocking a mutex in the context of a hrtimer callback is simply
> > violating all mutex locking rules.
> > 
> > How has this code ever survived lock debugging without triggering a big
> > fat warning?
> 
> Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer
> and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't
> tested with the lock debugging yet...

Yeah that needs an immediate revert, there's not much that looks legit in
that patch. I'll chat with Maira.

Also yes how that landed without anyone running lockdep is ... not good. I
guess we need a lockdep enabled drm ci target that runs vkms tests asap
:-)

> Maíra and Arthur, mutex_unlock() from interrupt context is not permitted.
> Please revert that patch immediately.
> I guess that a semaphore (down()/up()) could be used instead of a mutex.

>From a quick look this smells like a classic "try to use locking when you
want synchronization primitives", so semaphore here doesn't look any
better. The vkms_set_composer() function was originally for crc
generation, where it's userspace's job to make sure they wait for all the
crc they need to be generated before they shut it down again. But for
writeback the kernel must guarantee that the compositiona actually
happens, and the current function just doesn't make any such guarantees.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-13 Thread Tetsuo Handa
On 2023/09/14 6:08, Thomas Gleixner wrote:
> Maybe the VKMS people need to understand locking in the first place. The
> first thing I saw in this code is:
> 
> static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
> {
>...
>mutex_unlock(&output->enabled_lock);
> 
> What?
> 
> Unlocking a mutex in the context of a hrtimer callback is simply
> violating all mutex locking rules.
> 
> How has this code ever survived lock debugging without triggering a big
> fat warning?

Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer
and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't
tested with the lock debugging yet...

Maíra and Arthur, mutex_unlock() from interrupt context is not permitted.
Please revert that patch immediately.
I guess that a semaphore (down()/up()) could be used instead of a mutex.



Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-13 Thread Thomas Gleixner
On Wed, Sep 13 2023 at 09:47, Linus Torvalds wrote:
> On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa
>  wrote:
>>
>> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ .
>> It looks like this locking pattern remains as of 6.6-rc1. Please fix.
>>
>> void drm_crtc_vblank_off(struct drm_crtc *crtc) {
>>   spin_lock_irq(&dev->event_lock);
>>   drm_vblank_disable_and_save(dev, pipe) {
>> __disable_vblank(dev, pipe) {
>>   crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank {
>> hrtimer_cancel(&out->vblank_hrtimer) { // Retries with 
>> dev->event_lock held until lock_hrtimer_base() succeeds.
>>   ret = hrtimer_try_to_cancel(timer) {
>> base = lock_hrtimer_base(timer, &flags); // Fails forever 
>> because vkms_vblank_simulate() is in progress.
>
> Heh. Ok. This is clearly a bug, but it does seem to be limited to just
> the vkms driver, and literally only to the "simulate vblank" case.
>
> The worst part about it is that it's so subtle and not obvious.
>
> Some light grepping seems to show that amdgpu has almost the exact
> same pattern in its own vkms thing, except it uses
>
> hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);
>
> directly, which presumably fixes the livelock, but means that the
> cancel will fail if it's currently running.
>
> So just doing the same thing in the vkms driver probably fixes things.
>
> Maybe the vkms people need to add a flag to say "it's canceled" so
> that it doesn't then get re-enabled?  Or maybe it doesn't matter
> and/or already happens for some reason I didn't look into.

Maybe the VKMS people need to understand locking in the first place. The
first thing I saw in this code is:

static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer)
{
   ...
   mutex_unlock(&output->enabled_lock);

What?

Unlocking a mutex in the context of a hrtimer callback is simply
violating all mutex locking rules.

How has this code ever survived lock debugging without triggering a big
fat warning?

Thanks,

tglx


Re: drm/vkms: deadlock between dev->event_lock and timer

2023-09-13 Thread Linus Torvalds
On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa
 wrote:
>
> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ .
> It looks like this locking pattern remains as of 6.6-rc1. Please fix.
>
> void drm_crtc_vblank_off(struct drm_crtc *crtc) {
>   spin_lock_irq(&dev->event_lock);
>   drm_vblank_disable_and_save(dev, pipe) {
> __disable_vblank(dev, pipe) {
>   crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank {
> hrtimer_cancel(&out->vblank_hrtimer) { // Retries with 
> dev->event_lock held until lock_hrtimer_base() succeeds.
>   ret = hrtimer_try_to_cancel(timer) {
> base = lock_hrtimer_base(timer, &flags); // Fails forever because 
> vkms_vblank_simulate() is in progress.

Heh. Ok. This is clearly a bug, but it does seem to be limited to just
the vkms driver, and literally only to the "simulate vblank" case.

The worst part about it is that it's so subtle and not obvious.

Some light grepping seems to show that amdgpu has almost the exact
same pattern in its own vkms thing, except it uses

hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer);

directly, which presumably fixes the livelock, but means that the
cancel will fail if it's currently running.

So just doing the same thing in the vkms driver probably fixes things.

Maybe the vkms people need to add a flag to say "it's canceled" so
that it doesn't then get re-enabled?  Or maybe it doesn't matter
and/or already happens for some reason I didn't look into.

   Linus