Re: drm/vkms: deadlock between dev->event_lock and timer
On Mon, 18 Sept 2023 at 23:02, Helen Koike wrote: > On 14/09/2023 05:12, Daniel Vetter wrote: > > Also yes how that landed without anyone running lockdep is ... not good. I > > guess we need a lockdep enabled drm ci target that runs vkms tests asap > > :-) > > btw, I just executed a draft version of vkms targed on the ci, we do get > the warning: > > https://gitlab.freedesktop.org/helen.fornazier/linux/-/jobs/49156305#L623 > > I'm just not sure if tests would fail (since it is a warning) and it has > a chance to be ignored if people don't look at the logs (unless if igt > already handles that). Hmm, dmesg-warn is already a separate igt test status. I guess it just needs to be handled explicitly. > I still need to do some adjustments (it seems the tests is hanging > somewhere and we got a timeout) but we should have vkms in drm ci soon. Might be due to the locking explosion? The kernels should probably have soft-lockup detection enabled as well as lockdep. Cheers, Daniel
Re: drm/vkms: deadlock between dev->event_lock and timer
On 14/09/2023 05:12, Daniel Vetter wrote: On Thu, Sep 14, 2023 at 03:33:41PM +0900, Tetsuo Handa wrote: On 2023/09/14 6:08, Thomas Gleixner wrote: Maybe the VKMS people need to understand locking in the first place. The first thing I saw in this code is: static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) { ... mutex_unlock(&output->enabled_lock); What? Unlocking a mutex in the context of a hrtimer callback is simply violating all mutex locking rules. How has this code ever survived lock debugging without triggering a big fat warning? Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't tested with the lock debugging yet... Yeah that needs an immediate revert, there's not much that looks legit in that patch. I'll chat with Maira. Also yes how that landed without anyone running lockdep is ... not good. I guess we need a lockdep enabled drm ci target that runs vkms tests asap :-) btw, I just executed a draft version of vkms targed on the ci, we do get the warning: https://gitlab.freedesktop.org/helen.fornazier/linux/-/jobs/49156305#L623 I'm just not sure if tests would fail (since it is a warning) and it has a chance to be ignored if people don't look at the logs (unless if igt already handles that). I still need to do some adjustments (it seems the tests is hanging somewhere and we got a timeout) but we should have vkms in drm ci soon. Regards, Helen Maíra and Arthur, mutex_unlock() from interrupt context is not permitted. Please revert that patch immediately. I guess that a semaphore (down()/up()) could be used instead of a mutex. From a quick look this smells like a classic "try to use locking when you want synchronization primitives", so semaphore here doesn't look any better. The vkms_set_composer() function was originally for crc generation, where it's userspace's job to make sure they wait for all the crc they need to be generated before they shut it down again. But for writeback the kernel must guarantee that the compositiona actually happens, and the current function just doesn't make any such guarantees. Cheers, Daniel
Re: drm/vkms: deadlock between dev->event_lock and timer
On Thu, Sep 14, 2023 at 03:33:41PM +0900, Tetsuo Handa wrote: > On 2023/09/14 6:08, Thomas Gleixner wrote: > > Maybe the VKMS people need to understand locking in the first place. The > > first thing I saw in this code is: > > > > static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) > > { > >... > >mutex_unlock(&output->enabled_lock); > > > > What? > > > > Unlocking a mutex in the context of a hrtimer callback is simply > > violating all mutex locking rules. > > > > How has this code ever survived lock debugging without triggering a big > > fat warning? > > Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer > and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't > tested with the lock debugging yet... Yeah that needs an immediate revert, there's not much that looks legit in that patch. I'll chat with Maira. Also yes how that landed without anyone running lockdep is ... not good. I guess we need a lockdep enabled drm ci target that runs vkms tests asap :-) > Maíra and Arthur, mutex_unlock() from interrupt context is not permitted. > Please revert that patch immediately. > I guess that a semaphore (down()/up()) could be used instead of a mutex. >From a quick look this smells like a classic "try to use locking when you want synchronization primitives", so semaphore here doesn't look any better. The vkms_set_composer() function was originally for crc generation, where it's userspace's job to make sure they wait for all the crc they need to be generated before they shut it down again. But for writeback the kernel must guarantee that the compositiona actually happens, and the current function just doesn't make any such guarantees. Cheers, Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Re: drm/vkms: deadlock between dev->event_lock and timer
On 2023/09/14 6:08, Thomas Gleixner wrote: > Maybe the VKMS people need to understand locking in the first place. The > first thing I saw in this code is: > > static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) > { >... >mutex_unlock(&output->enabled_lock); > > What? > > Unlocking a mutex in the context of a hrtimer callback is simply > violating all mutex locking rules. > > How has this code ever survived lock debugging without triggering a big > fat warning? Commit a0e6a017ab56936c ("drm/vkms: Fix race-condition between the hrtimer and the atomic commit") in 6.6-rc1 replaced spinlock with mutex. So we haven't tested with the lock debugging yet... Maíra and Arthur, mutex_unlock() from interrupt context is not permitted. Please revert that patch immediately. I guess that a semaphore (down()/up()) could be used instead of a mutex.
Re: drm/vkms: deadlock between dev->event_lock and timer
On Wed, Sep 13 2023 at 09:47, Linus Torvalds wrote: > On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa > wrote: >> >> Hello. A deadlock was reported in drivers/gpu/drm/vkms/ . >> It looks like this locking pattern remains as of 6.6-rc1. Please fix. >> >> void drm_crtc_vblank_off(struct drm_crtc *crtc) { >> spin_lock_irq(&dev->event_lock); >> drm_vblank_disable_and_save(dev, pipe) { >> __disable_vblank(dev, pipe) { >> crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank { >> hrtimer_cancel(&out->vblank_hrtimer) { // Retries with >> dev->event_lock held until lock_hrtimer_base() succeeds. >> ret = hrtimer_try_to_cancel(timer) { >> base = lock_hrtimer_base(timer, &flags); // Fails forever >> because vkms_vblank_simulate() is in progress. > > Heh. Ok. This is clearly a bug, but it does seem to be limited to just > the vkms driver, and literally only to the "simulate vblank" case. > > The worst part about it is that it's so subtle and not obvious. > > Some light grepping seems to show that amdgpu has almost the exact > same pattern in its own vkms thing, except it uses > > hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); > > directly, which presumably fixes the livelock, but means that the > cancel will fail if it's currently running. > > So just doing the same thing in the vkms driver probably fixes things. > > Maybe the vkms people need to add a flag to say "it's canceled" so > that it doesn't then get re-enabled? Or maybe it doesn't matter > and/or already happens for some reason I didn't look into. Maybe the VKMS people need to understand locking in the first place. The first thing I saw in this code is: static enum hrtimer_restart vkms_vblank_simulate(struct hrtimer *timer) { ... mutex_unlock(&output->enabled_lock); What? Unlocking a mutex in the context of a hrtimer callback is simply violating all mutex locking rules. How has this code ever survived lock debugging without triggering a big fat warning? Thanks, tglx
Re: drm/vkms: deadlock between dev->event_lock and timer
On Wed, 13 Sept 2023 at 07:21, Tetsuo Handa wrote: > > Hello. A deadlock was reported in drivers/gpu/drm/vkms/ . > It looks like this locking pattern remains as of 6.6-rc1. Please fix. > > void drm_crtc_vblank_off(struct drm_crtc *crtc) { > spin_lock_irq(&dev->event_lock); > drm_vblank_disable_and_save(dev, pipe) { > __disable_vblank(dev, pipe) { > crtc->funcs->disable_vblank(crtc) == vkms_disable_vblank { > hrtimer_cancel(&out->vblank_hrtimer) { // Retries with > dev->event_lock held until lock_hrtimer_base() succeeds. > ret = hrtimer_try_to_cancel(timer) { > base = lock_hrtimer_base(timer, &flags); // Fails forever because > vkms_vblank_simulate() is in progress. Heh. Ok. This is clearly a bug, but it does seem to be limited to just the vkms driver, and literally only to the "simulate vblank" case. The worst part about it is that it's so subtle and not obvious. Some light grepping seems to show that amdgpu has almost the exact same pattern in its own vkms thing, except it uses hrtimer_try_to_cancel(&amdgpu_crtc->vblank_timer); directly, which presumably fixes the livelock, but means that the cancel will fail if it's currently running. So just doing the same thing in the vkms driver probably fixes things. Maybe the vkms people need to add a flag to say "it's canceled" so that it doesn't then get re-enabled? Or maybe it doesn't matter and/or already happens for some reason I didn't look into. Linus