Re: [RFC PATCH] drm/vc4: Add a load tracker to prevent HVS underflow errors

Boris Brezillon Mon, 29 Oct 2018 01:42:01 -0700

On Mon, 29 Oct 2018 09:06:40 +0100
Daniel Vetter <dan...@ffwll.ch> wrote:


> On Fri, Oct 26, 2018 at 06:10:03PM +0300, Ville Syrjälä wrote:
> > On Fri, Oct 26, 2018 at 04:52:33PM +0200, Boris Brezillon wrote:  
> > > On Fri, 26 Oct 2018 16:26:03 +0200
> > > Daniel Vetter <dan...@ffwll.ch> wrote:
> > >   
> > > > On Fri, Oct 26, 2018 at 3:57 PM Boris Brezillon
> > > > <boris.brezil...@bootlin.com> wrote:  
> > > > >
> > > > > On Fri, 26 Oct 2018 15:30:26 +0200
> > > > > Daniel Vetter <dan...@ffwll.ch> wrote:
> > > > >    
> > > > > > On Thu, Oct 25, 2018 at 11:41:14AM +0200, Boris Brezillon wrote:    
> > > > > > > On Thu, 25 Oct 2018 11:33:14 +0200
> > > > > > > Daniel Vetter <dan...@ffwll.ch> wrote:
> > > > > > >    
> > > > > > > > On Thu, Oct 25, 2018 at 10:09:31AM +0200, Boris Brezillon 
> > > > > > > > wrote:    
> > > > > > > > > On Tue, 23 Oct 2018 15:44:43 +0200
> > > > > > > > > Daniel Vetter <dan...@ffwll.ch> wrote:
> > > > > > > > >    
> > > > > > > > > > On Tue, Oct 23, 2018 at 09:55:08AM +0200, Boris Brezillon 
> > > > > > > > > > wrote:    
> > > > > > > > > > > Hi Daniel,
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 16 Oct 2018 14:57:43 +0200
> > > > > > > > > > > Daniel Vetter <dan...@ffwll.ch> wrote:
> > > > > > > > > > >    
> > > > > > > > > > > > On Tue, Oct 16, 2018 at 11:40:45AM +0200, Boris 
> > > > > > > > > > > > Brezillon wrote:    
> > > > > > > > > > > > > The HVS block is supposed to fill the pixelvalve 
> > > > > > > > > > > > > FIFOs fast enough to
> > > > > > > > > > > > > meet the requested framerate. The problem is, the HVS 
> > > > > > > > > > > > > and memory bus
> > > > > > > > > > > > > bandwidths are limited, and if we don't take these 
> > > > > > > > > > > > > limitations into
> > > > > > > > > > > > > account we might end up with HVS underflow errors.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch is trying to model the per-plane HVS and 
> > > > > > > > > > > > > memory bus bandwidth
> > > > > > > > > > > > > consumption and take a decision at atomic_check() 
> > > > > > > > > > > > > time whether the
> > > > > > > > > > > > > estimated load will fit in the HVS and membus budget.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Note that we take an extra margin on the memory bus 
> > > > > > > > > > > > > consumption to let
> > > > > > > > > > > > > the system run smoothly when other blocks are doing 
> > > > > > > > > > > > > heavy use of the
> > > > > > > > > > > > > memory bus. Same goes for the HVS limit, except the 
> > > > > > > > > > > > > margin is smaller in
> > > > > > > > > > > > > this case, since the HVS is not used by external 
> > > > > > > > > > > > > components.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Boris Brezillon 
> > > > > > > > > > > > > <boris.brezil...@bootlin.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > This logic has been validated using a simple shell 
> > > > > > > > > > > > > script and
> > > > > > > > > > > > > some instrumentation in the VC4 driver:
> > > > > > > > > > > > >
> > > > > > > > > > > > > - capture underflow errors at the HVS level and 
> > > > > > > > > > > > > expose a debugfs file
> > > > > > > > > > > > >   reporting those errors
> > > > > > > > > > > > > - add debugfs files to expose when atomic_check fails 
> > > > > > > > > > > > > because of the
> > > > > > > > > > > > >   HVS or membus load limitation or when it fails for 
> > > > > > > > > > > > > other reasons
> > > > > > > > > > > > >
> > > > > > > > > > > > > The branch containing those modification is available 
> > > > > > > > > > > > > here [1], and the
> > > > > > > > > > > > > script (which is internally using modetest) is here 
> > > > > > > > > > > > > [2] (please note
> > > > > > > > > > > > > that I'm bad at writing shell scripts :-)).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Note that those modification tend to over-estimate 
> > > > > > > > > > > > > the load, and thus
> > > > > > > > > > > > > reject setups that might have previously worked, so 
> > > > > > > > > > > > > we might want to
> > > > > > > > > > > > > adjust the limits to avoid that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]https://github.com/bbrezillon/linux/tree/vc4/hvs-bandwidth-eval
> > > > > > > > > > > > > [2]https://github.com/bbrezillon/vc4-hvs-bandwidth-test
> > > > > > > > > > > > >     
> > > > > > > > > > > >
> > > > > > > > > > > > Any interest in using igt to test this stuff? We have 
> > > > > > > > > > > > at least a bunch of
> > > > > > > > > > > > tests already in there that try all kinds of plane 
> > > > > > > > > > > > setups. And we use
> > > > > > > > > > > > those to hunt for underruns on i915 hw.
> > > > > > > > > > > >
> > > > > > > > > > > > Wrt underrun reporting: On i915 we just dump them into 
> > > > > > > > > > > > dmesg at the error
> > > > > > > > > > > > level, using DRM_ERROR,    
> > > > > > > > > > >
> > > > > > > > > > > Are you masking the underrun interrupt after it's been 
> > > > > > > > > > > reported? If we
> > > > > > > > > > > don't do that on VC4 we just end up flooding the 
> > > > > > > > > > > kernel-log buffer until
> > > > > > > > > > > someone comes and update the config.    
> > > > > > > > > >
> > > > > > > > > > Yeah we do that too. Rule is that a full modeset will clear 
> > > > > > > > > > any underrun
> > > > > > > > > > masking (so tests need to make sure they start with a 
> > > > > > > > > > modeset, or it'll be
> > > > > > > > > > for nothing).
> > > > > > > > > >    
> > > > > > > > > > >    
> > > > > > > > > > > > plus a tracepoint. See e.g.
> > > > > > > > > > > > intel_pch_fifo_underrun_irq_handler(). If there's 
> > > > > > > > > > > > interest we could
> > > > > > > > > > > > perhaps extract this into something common, similar to 
> > > > > > > > > > > > what was done with
> > > > > > > > > > > > crc support already.
> > > > > > > > > > > >    
> > > > > > > > > > >
> > > > > > > > > > > I'm not a big fan of hardcoded trace points in general 
> > > > > > > > > > > (because of the
> > > > > > > > > > > whole "it's part of the stable ABI" thing), and in this 
> > > > > > > > > > > case, making the
> > > > > > > > > > > tracepoint generic sounds even more risky to me. Indeed, 
> > > > > > > > > > > how can we know
> > > > > > > > > > > about all the HW specific bits one might want to expose. 
> > > > > > > > > > > For instance,
> > > > > > > > > > > I see the intel underrun tracepoint exposes a struct with 
> > > > > > > > > > > a frame and
> > > > > > > > > > > scanline field, and AFAICT, we don't have such 
> > > > > > > > > > > information in the VC4
> > > > > > > > > > > case.
> > > > > > > > > > >
> > > > > > > > > > > Any opinion on that?    
> > > > > > > > > >
> > > > > > > > > > It's only abi if you're unlucky. If it's just for debugging 
> > > > > > > > > > and
> > > > > > > > > > validation, you can change it again. Tbh, no idea why we 
> > > > > > > > > > even have these
> > > > > > > > > > tracepoints, they're fairly useless imo. CI only relies 
> > > > > > > > > > upon the dmesg
> > > > > > > > > > output. Maybe run git blame and ask the original author, we 
> > > > > > > > > > can probably
> > > > > > > > > > update them to suit our needs.    
> > > > > > > > >
> > > > > > > > > Okay, I think I'll go for a generic debugfs entry that 
> > > > > > > > > returns true
> > > > > > > > > when an underrun error happened since the last modeset, false 
> > > > > > > > > otherwise.
> > > > > > > > >
> > > > > > > > > Next question is: should I attach the underrun status to the 
> > > > > > > > > drm_device
> > > > > > > > > or have one per CRTC? In my case, I only care about the "has 
> > > > > > > > > an
> > > > > > > > > underrun error occurred on any of the active CRTC" case, so 
> > > > > > > > > I'd vote for
> > > > > > > > > a per-device underrun status.    
> > > > > > > >
> > > > > > > > Yeah probably good enough. For our CI all we care about is the 
> > > > > > > > warn/error
> > > > > > > > level dmesg output. Anything at that level is considered a CI 
> > > > > > > > failure.    
> > > > > > >
> > > > > > > So igt is grepping dmesg to detect when an underrun happens?    
> > > > > >
> > > > > > No, but the CI runner is also observing dmesg. Anything in there at
> > > > > > warning or higher level is considered a failure.    
> > > > >
> > > > > Eric, do you do the same when you launch the IGT testsuite?
> > > > >    
> > > > > >    
> > > > > > > > What do you need the debugfs file for?    
> > > > > > >
> > > > > > > I just thought having a debugfs file to expose the underrun 
> > > > > > > information
> > > > > > > would be cleaner than having to grep in dmesg to detect such 
> > > > > > > failures.    
> > > > > >
> > > > > > The issue is that you want to detect underruns everywhere, not just 
> > > > > > in the
> > > > > > specific tests you're checking for it. Anything that does a modeset 
> > > > > > could
> > > > > > cause an underrun (at least we've managed to do so pretty much 
> > > > > > everywhere
> > > > > > on i915 hw, if you misprogram is sufficiently).    
> > > > >
> > > > > In my specific case, I want to have the IGT test check the underrun
> > > > > value while the test is being executed so that I know which exact
> > > > > configuration triggers the underrun situation. At least that's how I
> > > > > did to adjust/debug the HVS load tracking code. Maybe it's not really 
> > > > > a
> > > > > problem when all we do is tracking regressions.    
> > > > 
> > > > Ok, that makes sense, and explains why you want the overall underrun
> > > > counter exposed in debugfs.  
> > > 
> > > Just one tiny detail which is not exactly related to this discussion
> > > but I thought I'd mention it here: underrun is actually not a counter,
> > > it's just a boolean. I used an atomic_t type to avoid having to add a
> > > spinlock to protect the variable (the variable is modified from
> > > an interrupt context and read from a non-atomic context). So, the
> > > acceptable values for underrun are currently 0 or 1. I can make it a
> > > counter if required.  
> > 
> > One idea I had a while back for i915 would be to count the number of
> > frames that had an underrun. So basically something like this:
> > 
> > underrun_irq() {
> >     underrun_frames=1
> >     disable_underrun_irq();
> >     vblank_get();
> > }
> > vblank_irq() {
> >     if (underrun) {
> >             underrun_frames++;
> >     } else if (underrun_frames) {
> >             vblank_put();
> >             enable_underrun_irq();
> >             DEBUG("%d frames had an underrun\n", underrun_frames);
> >             underrun_frames=0;
> >     }
> > }
> > 
> > This avoids drowning in underrun interrupts while still
> > reporting at least how many frames had problems.
> > 
> > But I haven't had time to implement that yet.  

May I ask why you need to know how many frames had underrun issues?

> 
> The other upshot of a counter is that there's no problem with resetting.
> Userspace simply grabs the counter before and after the test and compares.
> If you only have 0/1 you need some way to reset, or otherwise automated
> running in a CI farm isn't possible.

The reset was done during atomic commit in my proposal, so no action
was required on the user side apart from applying a new config.
Anyway, I'll change that for a real counter.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC PATCH] drm/vc4: Add a load tracker to prevent HVS underflow errors

Reply via email to