On 07/29, Daniel Vetter wrote:
> On Wed, Jul 29, 2020 at 9:09 PM Melissa Wen <melissa....@gmail.com> wrote:
> >
> > Melissa Wen
> >
> > On Sat, Jul 25, 2020 at 3:12 PM Daniel Vetter <dan...@ffwll.ch> wrote:
> > >
> > > On Sat, Jul 25, 2020 at 7:45 PM Melissa Wen <melissa....@gmail.com> wrote:
> > > >
> > > > On 07/25, Daniel Vetter wrote:
> > > > > On Sat, Jul 25, 2020 at 5:12 AM Sidong Yang <realwa...@gmail.com> 
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Jul 22, 2020 at 05:17:05PM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Jul 22, 2020 at 4:06 PM Melissa Wen 
> > > > > > > <melissa....@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On 07/22, dan...@ffwll.ch wrote:
> > > > > > > > > On Wed, Jul 22, 2020 at 08:04:11AM -0300, Melissa Wen wrote:
> > > > > > > > > > This patch adds a missing drm_crtc_vblank_put op to the pair
> > > > > > > > > > drm_crtc_vblank_get/put (inc/decrement counter to guarantee 
> > > > > > > > > > vblanks).
> > > > > > > > > >
> > > > > > > > > > It clears the execution of the following kms_cursor_crc 
> > > > > > > > > > subtests:
> > > > > > > > > > 1. pipe-A-cursor-[size,alpha-opaque, NxN-(on-screen, 
> > > > > > > > > > off-screen, sliding,
> > > > > > > > > >    random, fast-moving])] - successful when running 
> > > > > > > > > > individually.
> > > > > > > > > > 2. pipe-A-cursor-dpms passes again
> > > > > > > > > > 3. pipe-A-cursor-suspend also passes
> > > > > > > > > >
> > > > > > > > > > The issue was initially tracked in the sequential execution 
> > > > > > > > > > of IGT
> > > > > > > > > > kms_cursor_crc subtest: when running the test sequence or 
> > > > > > > > > > one of its
> > > > > > > > > > subtests twice, the odd execs complete and the pairs get 
> > > > > > > > > > stuck in an
> > > > > > > > > > endless wait. In the IGT code, calling a wait_for_vblank 
> > > > > > > > > > before the start
> > > > > > > > > > of CRC capture prevented the busy-wait. But the problem 
> > > > > > > > > > persisted in the
> > > > > > > > > > pipe-A-cursor-dpms and -suspend subtests.
> > > > > > > > > >
> > > > > > > > > > Checking the history, the pipe-A-cursor-dpms subtest was 
> > > > > > > > > > successful when,
> > > > > > > > > > in vkms_atomic_commit_tail, instead of using the flip_done 
> > > > > > > > > > op, it used
> > > > > > > > > > wait_for_vblanks. Another way to prevent blocking was 
> > > > > > > > > > wait_one_vblank when
> > > > > > > > > > enabling crtc. However, in both cases, 
> > > > > > > > > > pipe-A-cursor-suspend persisted
> > > > > > > > > > blocking in the 2nd start of CRC capture, which may 
> > > > > > > > > > indicate that
> > > > > > > > > > something got stuck in the step of CRC setup. Indeed, 
> > > > > > > > > > wait_one_vblank in
> > > > > > > > > > the crc setup was able to sync things and free all 
> > > > > > > > > > kms_cursor_crc
> > > > > > > > > > subtests.
> > > > > > > > > >
> > > > > > > > > > Tracing and comparing a clean run with a blocked one:
> > > > > > > > > > - in a clean one, vkms_crtc_atomic_flush enables vblanks;
> > > > > > > > > > - when blocked, only in next op, vkms_crtc_atomic_enable, 
> > > > > > > > > > the vblanks
> > > > > > > > > > started. Moreover, a series of vkms_vblank_simulate flow 
> > > > > > > > > > out until
> > > > > > > > > > disabling vblanks.
> > > > > > > > > > Also watching the steps of vkms_crtc_atomic_flush, when the 
> > > > > > > > > > very first
> > > > > > > > > > drm_crtc_vblank_get returned an error, the subtest crashed. 
> > > > > > > > > > On the other
> > > > > > > > > > hand, when vblank_get succeeded, the subtest completed. 
> > > > > > > > > > Finally, checking
> > > > > > > > > > the flush steps: it increases counter to hold a vblank 
> > > > > > > > > > reference (get),
> > > > > > > > > > but there isn't a op to decreased it and release vblanks 
> > > > > > > > > > (put).
> > > > > > > > > >
> > > > > > > > > > Cc: Daniel Vetter <dan...@ffwll.ch>
> > > > > > > > > > Cc: Rodrigo Siqueira <rodrigosiqueiram...@gmail.com>
> > > > > > > > > > Cc: Haneen Mohammed <hamohammed...@gmail.com>
> > > > > > > > > > Signed-off-by: Melissa Wen <melissa....@gmail.com>
> > > > > > > > > > ---
> > > > > > > > > >  drivers/gpu/drm/vkms/vkms_crtc.c | 1 +
> > > > > > > > > >  1 file changed, 1 insertion(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/gpu/drm/vkms/vkms_crtc.c 
> > > > > > > > > > b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > > > > > > index ac85e17428f8..a99d6b4a92dd 100644
> > > > > > > > > > --- a/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > > > > > > +++ b/drivers/gpu/drm/vkms/vkms_crtc.c
> > > > > > > > > > @@ -246,6 +246,7 @@ static void 
> > > > > > > > > > vkms_crtc_atomic_flush(struct drm_crtc *crtc,
> > > > > > > > > >
> > > > > > > > > >             spin_unlock(&crtc->dev->event_lock);
> > > > > > > > > >
> > > > > > > > > > +           drm_crtc_vblank_put(crtc);
> > > > > > > > >
> > > > > > > > > Uh so I reviewed this a bit more carefully now, and I dont 
> > > > > > > > > think this is
> > > > > > > > > the correct bugfix. From the kerneldoc of 
> > > > > > > > > drm_crtc_arm_vblank_event():
> > > > > > > > >
> > > > > > > > >  * Caller must hold a vblank reference for the event @e 
> > > > > > > > > acquired by a
> > > > > > > > >  * drm_crtc_vblank_get(), which will be dropped when the next 
> > > > > > > > > vblank arrives.
> > > > > > > > >
> > > > > > > > > So when we call drm_crtc_arm_vblank_event then the vblank_put 
> > > > > > > > > gets called
> > > > > > > > > for us. And that's the only case where we successfully 
> > > > > > > > > acquired a vblank
> > > > > > > > > interrupt reference since on failure of drm_crtc_vblank_get 
> > > > > > > > > (0 indicates
> > > > > > > > > success for that function, failure negative error number) we 
> > > > > > > > > directly send
> > > > > > > > > out the event.
> > > > > > > > >
> > > > > > > > > So something else fishy is going on, and now I'm totally 
> > > > > > > > > confused why this
> > > > > > > > > even happens.
> > > > > > > > >
> > > > > > > > > We also have a pile of WARN_ON checks in drm_crtc_vblank_put 
> > > > > > > > > to make sure
> > > > > > > > > we don't underflow the refcount, so it's also not that I 
> > > > > > > > > think (except if
> > > > > > > > > this patch creates more WARNING backtraces).
> > > > > > > > >
> > > > > > > > > But clearly it changes behaviour somehow ... can you try to 
> > > > > > > > > figure out
> > > > > > > > > what changes? Maybe print out the vblank->refcount at various 
> > > > > > > > > points in
> > > > > > > > > the driver, and maybe also trace when exactly the fake vkms 
> > > > > > > > > vblank hrtimer
> > > > > > > > > is enabled/disabled ...
> > > > > > > >
> > > > > > > > :(
> > > > > > > >
> > > > > > > > I can check these, but I also have other suspicions. When I 
> > > > > > > > place the
> > > > > > > > drm_crct_vblank_put out of the if (at the end of flush), it not 
> > > > > > > > only solve
> > > > > > > > the issue of blocking on kms_cursor_crc, but also the WARN_ON 
> > > > > > > > on kms_flip
> > > > > > > > doesn't appear anymore (a total cleanup). Just after:
> > > > > > > >
> > > > > > > > vkms_output->composer_state = to_vkms_crtc_state(crtc->state);
> > > > > > > >
> > > > > > > > looks like there is something stuck around here.
> > > > > > >
> > > > > > > Hm do you have the full WARNING for this? Maybe this gives me an 
> > > > > > > idea
> > > > > > > what's going wrong.
> > > > > > >
> > > > > > > > Besides, there is a lock at atomic_begin:
> > > > > > > >
> > > > > > > >   /* This lock is held across the atomic commit to block vblank 
> > > > > > > > timer
> > > > > > > >    * from scheduling vkms_composer_worker until the composer is 
> > > > > > > > updated
> > > > > > > >    */
> > > > > > > >   spin_lock_irq(&vkms_output->lock);
> > > > > > > >
> > > > > > > > that seems to be released on atomic_flush and make me suspect 
> > > > > > > > something
> > > > > > > > missing on the composer update.
> > > > > > >
> > > > > > > atomic_begin/atomic_flush are symmetric functions an always called
> > > > > > > around all the plane updates. So having the spin_lock in _begin 
> > > > > > > and
> > > > > > > the spin_unlock in _flush should be symmetric and correct.
> > > > > > >
> > > > > > > If you want to make sure, recompile with CONFIG_PROVE_LOCKING, 
> > > > > > > which
> > > > > > > should immmediately give you a huge splat in dmesg if there's 
> > > > > > > anything
> > > > > > > unbalanced with locking.
> > > > > > >
> > > > > > > > I'll check all these things and come back with news (hope) :)
> > > > > > >
> > > > > > > Have fun chasing stuff :-)
> > > > > > >
> > > > > > > Cheers, Daniel
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Melissa
> > > > > > > > >
> > > > > > > > > I'm totally confused about what's going on here now.
> > > > > > > > > -Daniel
> > > > > >
> > > > > > Hi Daniel, Melissa.
> > > > > > I found something about this problem.
> > > > > > I traced vblank->refcount that it's important in the problem.
> > > > > > In normal case, first test run calls commit_tail() and enable 
> > > > > > vblank in
> > > > > > atomic_flush(). in drm_vblank_get(), it enable vblank when refcount 
> > > > > > was zero.
> > > > > >
> > > > > > in first test run, it disable crtc for cleanup test. 
> > > > > > drm_crtc_vblank_off() was
> > > > > > called by atomic_disable. in this function vblank's refcount was 
> > > > > > increased for
> > > > > > prevent subsequent drm_vblank_get() from re-enabling the vblank 
> > > > > > interrupt.
> > > > > > and refcount goes one not zero for next test run.
> > > > > >
> > > > > > and next test run, drm_vblank_get() was called but it didn't enable 
> > > > > > vblank
> > > > > > because refcount was already one. drm_crtc_vblank_on() was called 
> > > > > > in next. but
> > > > > > it didn't enable vblank but just increase refcount only.
> > > > > >
> > > > > > I think this is why this problem happen. don't know how to fix this 
> > > > > > correctly.
> > > > > > should we force to enable vblank after enabling crtc?
> > > > >
> > > > > Hm, between drm_crtc_vblank_off and drm_crtc_vblank_on
> > > > > drm_crtc_vblank_get should fail (and leave the refcount unchanged).
> > > > > It's convoluted logic, but the check for vblank->enabled should catch
> > > > > that and return -EINVAL for this case. Does that not happen?
> > > > >
> > > > > It would indeed explain the bug (I think, I've been wrong way too many
> > > > > times with this).
> > > > > -Daniel
> > > > >
> > > > Hi Daniel and Sidong,
> > > >
> > > > I don't know if it will be confusing, but I will try to explain in a
> > > > little more detail (and newbie way) what I saw in this behavior of the
> > > > refcount (similar to what Sidong evaluated).
> > > >
> > > > 1. Starting with the loading of vkms is:
> > > > In vkms_init:
> > > > After drm_vblank_init (refcount=0), it calls:
> > > > vkms_modeset_init
> > > > --> vkms_output_init
> > > > ----> drm_mode_config_reset
> > > > -------> vkms_atomic_crtc_reset
> > > > (even more inside)--> drm_crtc_vblank_reset that bumps the refcount to
> > > > prevent vblank_get to enable vblank (refcount=1)
> > > >
> > > > 2. So, when we start a subtest, vblank is still disabled and in
> > > > commit_tail, commit_planes triggers a atomic_begin/flush->vblank_get 
> > > > that
> > > > return -EINVAL because !vblank->enabled (refcount ends 1) and 
> > > > send_vblank;
> > > > however the test fails before atomic_enable decrements refcount to 0 and
> > > > reset timestamp.
> > > > ** This warning also appears in this very first running:
> > > > WARNING: CPU: 0 PID: 708 at drivers/gpu/drm/vkms/vkms_crtc.c:91 
> > > > vkms_get_vblank_timestamp+0x41/0x50 [vkms]
> > >
> > > Hm yeah I guess that's something we should paper over a bit, but maybe
> > > the bugfix will take care of that.
> > >
> > > > In the end, this sequence modeset_disable -> atomic_begin ->
> > > > atomic_flush: refcount going from 0 to 1 and than drm_vblank_enable
> > > > prepares to everything going well in the next subtest (because
> > > > atomic_disable is not called).
> > > >
> > > > 3. It could be nice, but in the next subtest (with success), as 
> > > > refcount +
> > > > vblank_enabled ok, after doind its job, it calls
> > > > atomic_disable->vblank_off and here refcount ends 1 and vblank disabled
> > > > (the problem returns).
> > > > So, we have a kind of good turn and bad turn.
> > > >
> > > > I tried different things, but the only relatively stable result was
> > > > putting the sequence modeset_disable + modeset_enables + commit_planes 
> > > > in
> > > > the commit_tail. That didn't convince me and then I keep trying things.
> > >
> > > This actually sounds like a good idea, I had the same one. Doing it
> > > this way should also resolve the WARNING you've pointed out I think?
> >
> > Hi Daniel,
> >
> > My uncertainty in this idea was related to a subtest, the cursor-suspend.
> > Although the reordering solves most of the blocking in kms-cursor-crc, the
> > suspend subtest fails because when vkms suspends, it disables vblank,
> > and when it resumes, vblank is not enabled in time. In this subtest, there 
> > is
> > a pipe-crc-start and adding a igt_wait_for_vblank solves... but again,
> > I know it is not the real fix.
> >
> > Would be the case to develop a specific feature of suspend/resume in vkms?
> > I mean, something to enable vblank when resume. I am trying to figure out 
> > how
> > to develop it, but still without success.
> 
> Hm since it's all software I expected that the hrtimer will simply
> continue to run as if nothing happened. For real hw we'd need to use
> drm_mode_config_helper_suspend/resume, but for vkms I dont think
> that's required. Is the vblank hrtimer not working after resume? Or is
> it simply reporting a garbage timestamp and that's why the testcase
> fails?

The testcase fails for the same timeout in waiting the first crc
(already applying the change in the commit_tail sequence):

(kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: (power_dir = 
open("/sys/power", O_RDONLY)) >= 0
(kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: 
get_supported_suspend_states(power_dir) & (1 << state)
(kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: test == 
SUSPEND_TEST_NONE || faccessat(power_dir, "pm_test", R_OK | W_OK, 0) == 0
(kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: !(state == 
SUSPEND_STATE_DISK && !intel_get_total_swap_mb())
(kms_cursor_crc:732) igt_aux-DEBUG: Test requirement passed: ret == 0
(kms_cursor_crc:732) igt_core-INFO: [cmd] rtcwake: wakeup from "mem" using 
/dev/rtc0 at Thu Jul 30 09:23:59 2020
(kms_cursor_crc:732) igt_debugfs-DEBUG: Opening debugfs directory 
'/sys/kernel/debug/dri/0'
(kms_cursor_crc:732) igt_core-INFO: Timed out: Opening crc fd, and poll for 
first CRC.

What I could check was, when suspend, vblanks are disabled (calling
vkms_disable_vblank), and when resume, the testcase fails and only after
the failure vblanks are enabled (vkms_enable_vblank) and
hrtimer_init/starts.

If I "force" enabling vblanks via testcase (adding a
igt_wait_for_vblank before igt_pipe_crc_start), things work fine.
This is why I thought about anticipating the restarting of activities by
placing a vblank "wakeup" in a resume function. Or perhaps prevent vblank
from being disabled when suspended, since that last attempt to place a
vblank_put at the end of the flush made this test case (suspend) work,
because it prevented the vblank from being disabled.

This failure in suspend subtest is so closer to the previous ones, that I
was unsure if my attempt in fix by reordering commit_tail would be enough. 
But maybe they are different situations that deserve different treats.
Do you think restarting a vblank in resume can make sense for vkms?

Melissa
> 
> Not sure how to wire it up for fake drivers like vkms, but maybe doing
> the suspend/resume like for real drivers helps. I think ideally we'd
> try to attach a platform driver to our platform device we create (but
> not sure how to do that).
> -Daniel
> 
> >
> > Melissa
> > >
> > > But I'm still wondering why after step 3 we don't get -EINVAL from
> > > vblank_get() - after vblank_off() vblank->enabled should be false
> > > again, getting us back to the same state as after 1. Is that not
> > > happening?
> > > -Daniel
> > >
> > > >
> > > > > >
> > > > > > Thanks
> > > > > > -Sidong
> > > > > >
> > > > > > > > >
> > > > > > > > > >             crtc->state->event = NULL;
> > > > > > > > > >     }
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.27.0
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Daniel Vetter
> > > > > > > > > Software Engineer, Intel Corporation
> > > > > > > > > http://blog.ffwll.ch
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Daniel Vetter
> > > > > > > Software Engineer, Intel Corporation
> > > > > > > http://blog.ffwll.ch
> > > > > > > _______________________________________________
> > > > > > > dri-devel mailing list
> > > > > > > dri-de...@lists.freedesktop.org
> > > > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Daniel Vetter
> > > > > Software Engineer, Intel Corporation
> > > > > http://blog.ffwll.ch
> > >
> > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
> 
> 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Reply via email to