Re: [Intel-gfx] i915 PSR test results and cursor lag

Rodrigo Vivi Mon, 05 Feb 2018 16:27:41 -0800

On Mon, Feb 05, 2018 at 11:54:04PM +0000, Andy Lutomirski wrote:
> 
> 
> > On Feb 5, 2018, at 2:50 PM, Rodrigo Vivi <rodrigo.v...@intel.com> wrote:
> > 
> >> On Sat, Feb 03, 2018 at 05:33:08PM +0000, Andy Lutomirski wrote:
> >>> On Fri, Feb 2, 2018 at 7:18 PM, Andy Lutomirski <l...@kernel.org> wrote:
> >>>> On Fri, Feb 2, 2018 at 1:24 AM, Andy Lutomirski <l...@kernel.org> wrote:
> >>>>> On Thu, Feb 1, 2018 at 9:20 PM, Chris Wilson <ch...@chris-wilson.co.uk> 
> >>>>> wrote:
> >>>>> Quoting Andy Lutomirski (2018-02-01 21:04:30)
> >>>>>> I got this after a recent suspend/resume:
> >>>>>> 
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Lid closed.
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: device-enumerator: scan 
> >>>>>> all dirs
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]:   device-enumerator:
> >>>>>> scanning /sys/bus
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]:   device-enumerator:
> >>>>>> scanning /sys/class
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Failed to open
> >>>>>> configuration file '/etc/systemd/sleep.conf': No such file or
> >>>>>> directory
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Suspending...
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal
> >>>>>> sender=n/a destination=n/a object=/org/freedesktop/login1
> >>>>>> interface=org.freedesktop.login1.Manager member=PrepareForSleep
> >>>>>> cookie=570 reply
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Got message
> >>>>>> type=method_call sender=:1.46 destination=:1.1
> >>>>>> object=/org/freedesktop/login1/session/_32
> >>>>>> interface=org.freedesktop.login1.Session member=ReleaseDevice
> >>>>>> Feb 01 09:44:34 laptop systemd-logind[2412]: Sent message type=signal
> >>>>>> sender=n/a destination=:1.46
> >>>>>> object=/org/freedesktop/login1/session/_32
> >>>>>> interface=org.freedesktop.login1.Session member=PauseDevice cookie
> >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: Failed to apply DRM plane
> >>>>>> transform 0: Permission denied
> >>>>>> Feb 01 09:44:34 laptop gnome-shell[2630]: drmModeSetCursor2 failed
> >>>>>> with (Permission denied), drawing cursor with OpenGL from now on
> >>>>>> 
> >>>>>> But I don't see the word "cursor" in my system logs before the first
> >>>>>> suspend.  What am I looking for?  This is Fedora 27 running a Gnome
> >>>>>> Wayland session, but it hasn't been reinstalled in some time, so it's
> >>>>>> possible that there are some weird settings sitting around.  But I did
> >>>>>> check and I have no weird i915 parameters.
> >>>>> 
> >>>>> You are using gnome-shell as the display server. From that it appears to
> >>>>> have started off with a HW cursor and switched to a SW cursor after
> >>>>> suspend. Did you notice a change in behaviour? After rebooting or just
> >>>>> restarting gnome-shell?
> >>>> 
> >>>> I think it's less consistently bad after a reboot before suspending.
> >>>> 
> >>>>> 
> >>>>>> Also, are these things potentially related:
> >>>>>> 
> >>>>>> [ 3067.702527] [drm:intel_pipe_update_start [i915]] *ERROR* Potential
> >>>>>> atomic update failure on pipe A
> >>>>> 
> >>>>> They are just "missed the immediate vblank for the screen update"
> >>>>> messages. Should not be related to PSR, but may cause jitter by delaying
> >>>>> the odd screen update.
> >>>> 
> >>>> I just got this one, and the timestamp is at least reasonably close to
> >>>> a giant latency spike:
> >>>> 
> >>>> [  288.799654] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic
> >>>> update failure on pipe A (start=31 end=32) time 15 us, min 1073, max
> >>>> 1079, scanline start 1087, end 1088
> >>>> 
> >>>>> 
> >>>>>> As I'm typing this, I've seen a couple instances of what seems like a
> >>>>>> full *second* of cursor latency, but I've only gotten the potential
> >>>>>> atomic update failure once.
> >>>>>> 
> >>>>>> And is there any straightforward tracing to do to distinguish between
> >>>>>> PSR exit latency and other potential sources of latency?
> >>>>> 
> >>>>> It looks plausible that we could at least report how long it takes the
> >>>>> registers to reflect the change in state (but we don't). The best source
> >>>>> of information atm is /sys/kernel/debug/dri/0/i915_edp_psr_status.
> >>>> 
> >>>> Hmm.
> >>>> 
> >>>> I went and looked at the code, and I noticed what could be bugs or
> >>>> could (more likely) be my confusion since I don't know this code at
> >>>> all:
> >>>> 
> >>>> intel_single_frame_update() does something inscrutable to me, but I
> >>>> imagine it does something that causes the next page flip to get
> >>>> noticed by the panel even with PSR on.  But how does the code that
> >>>> calls it know that anything happened?  (Looking at the commit history,
> >>>> maybe this is something special that's only needed on some platforms
> >>>> but doesn't replace the normal PSR exit sequence.)
> >>>> 
> >>>> Perhaps more interestingly, intel_psr_flush() does this:
> >>>> 
> >>>>    /* By definition flush = invalidate + flush */
> >>>>    if (frontbuffer_bits)
> >>>>        intel_psr_exit(dev_priv);
> >>>> 
> >>>>    if (!dev_priv->psr.active && !dev_priv->psr.busy_frontbuffer_bits)
> >>>>        if (!work_busy(&dev_priv->psr.work.work))
> >>>>            schedule_delayed_work(&dev_priv->psr.work,
> >>>>                          msecs_to_jiffies(100));
> >>>> 
> >>>> I'm guessing that the idea is that we're turning off PSR because we
> >>>> want the panel to update and we expect that, in 100ms, the update will
> >>>> have hit the panel and we'll have been idle long enough for it to make
> >>>> sense to re-enter PSR.  IOW, the code wants PSR to be off for at least
> >>>> 100ms and then to turn back on.  But this code actually says "turn PSR
> >>>> back on in at *most* 100ms".  What happens if there are two screen
> >>>> updates 99ms apart?  The first one should work fine, but the next one
> >>>> will hit with 1ms left on the delayed work, and intel_psr_work() will
> >>>> get called in 1ms.  There's some magic with busy_frontbuffer_bits, but
> >>>> it seems questionable to me that intel_psr_flush() clears
> >>>> busy_frontbuffer_bits and *then* calls intel_psr_exit().
> >>>> 
> >>>> Naively, I would expect that PSR needs to be kept off until the vblank
> >>>> following the page flip.
> >>>> 
> >>>> Also, in intel_psr_work(), shouldn't this code:
> >>>> 
> >>>>    /*
> >>>>     * The delayed work can race with an invalidate hence we need to
> >>>>     * recheck. Since psr_flush first clears this and then reschedules we
> >>>>     * won't ever miss a flush when bailing out here.
> >>>>     */
> >>>>    if (dev_priv->psr.busy_frontbuffer_bits)
> >>>>        goto unlock;
> >>>> 
> >>>> re-arm the delayed work?
> >>>> 
> >>>> Anyway, this is all on a 4.14 kernel.  I should update to 4.16 and see
> >>>> what happens.
> >>> 
> >>> I updated to 4.15, and the situation is much worse.  With
> >>> enable_psr=1, the system survives for several seconds and then the
> >>> screen stops updating entirely.  If I boot with i915.enable_psr=1, I
> >>> get to the Fedora login screen and then the system dies.  If I set
> >>> enable_psr=1 using sysfs, it does a bit after the next resume.  It
> >>> seems like it also sometimes hangs even worse a bit after the screen
> >>> stops updating, but it's hard to tell.
> >>> 
> >>> I see this in my logs:
> >>> 
> >>> [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
> >>> [CRTC:37:pipe A] flip_done timed out
> >>> 
> >>> Sometimes I see this a bit later:
> >>> 
> >>> [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR*
> >>> [CRTC:37:pipe A] flip_done timed out
> >>> 
> >>> I'm able to get some debugging out before the system dies.  I see
> >>> intel_psr_flush() getting called a bunch, and I don't see
> >>> intel_psr_invalidate() being called at all.  I also see
> >>> intel_psr_work() activating psr as little as 2ms after
> >>> intel_psr_flush() finishes.  So I think the code is indeed buggy or at
> >>> least questionable.  I'd try to fix it (at least as well as I can
> >>> without knowing anything about how the PSR state machine actually
> >>> works), but the fact that the system hangs would make it very hard to
> >>> test.
> >> 
> >> I filed https://bugs.freedesktop.org/show_bug.cgi?id=104931 for the
> >> questionable reactivation logic issue.  If nothing else, I doubt that
> >> the reactivation timeout logic does what its author intended for it to
> >> do.
> > 
> > Thanks for that. I will check the logic and your patch here.
> > 
> > One thing to consider is that when activating PSR HW engine will
> > consider the number of static frames before transitioning it to
> > active state. So, in general the immediate return to active status
> > shouldn't be a problem.
> > At least not for core platforms. maybe for vlv/chv.
> > 
> 
> Hmm.  Is there some register that can be poked to tell the hardware "hey, I 
> updated the image even if your dirty tracking didn't notice, so turn off PSR 
> if it's on and reset your idle frame counter"?  If so, the logic could be 
> dramatically simplified.


that would be my dream ;)

But the only thing we can do is to disable psr and re-enable it. So the frame 
counter will restart.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Re: [Intel-gfx] i915 PSR test results and cursor lag

Reply via email to