On Wed, Jun 18, 2014 at 10:49:05AM -0700, Volkin, Bradley D wrote:
> On Wed, Jun 18, 2014 at 09:52:52AM -0700, Chris Wilson wrote:
> > On Wed, Jun 18, 2014 at 09:36:14AM -0700, bradley.d.vol...@intel.com wrote:
> > > From: Brad Volkin <bradley.d.vol...@intel.com>
> > > 
> > > This patch sets up all of the tracking and copying necessary to
> > > use batch pools with the command parser, but does not actually
> > > dispatch the copied (shadow) batch to the hardware yet. We still
> > > aren't quite ready to set the secure bit during dispatch.
> > > 
> > > Note that performance takes a hit from the copy in some cases
> > > and will likely need some work. At a rough pass, the memcpy
> > > appears to be the bottleneck. Without having done a deeper
> > > analysis, two ideas that come to mind are:
> > > 1) Copy sections of the batch at a time, as they are reached
> > >    by parsing. Might improve cache locality.
> > > 2) Copy only up to the userspace-supplied batch length and
> > >    memset the rest of the buffer. Reduces the number of reads.
> > > 
> > > Signed-off-by: Brad Volkin <bradley.d.vol...@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_cmd_parser.c       | 75 ++++++++++++++++++++++------
> > >  drivers/gpu/drm/i915/i915_drv.h              |  7 ++-
> > >  drivers/gpu/drm/i915/i915_gem.c              |  8 ++-
> > >  drivers/gpu/drm/i915/i915_gem_execbuffer.c   | 45 +++++++++++++++--
> > >  drivers/gpu/drm/i915/i915_gem_render_state.c |  2 +-
> > >  drivers/gpu/drm/i915/intel_ringbuffer.c      | 12 +++++
> > >  drivers/gpu/drm/i915/intel_ringbuffer.h      |  7 +++
> > >  7 files changed, 134 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > index dea99d9..669afb0 100644
> > > --- a/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c
> > > @@ -831,6 +831,53 @@ finish:
> > >   return (u32*)addr;
> > >  }
> > >  
> > > +/* Returns a vmap'd pointer to dest_obj, which the caller must unmap */
> > > +static u32 *copy_batch(struct drm_i915_gem_object *dest_obj,
> > > +                struct drm_i915_gem_object *src_obj)
> > > +{
> > > + int ret = 0;
> > > + int needs_clflush = 0;
> > > + u32 *src_addr, *dest_addr = NULL;
> > > +
> > > + ret = i915_gem_obj_prepare_shmem_read(src_obj, &needs_clflush);
> > > + if (ret) {
> > > +         DRM_DEBUG_DRIVER("CMD: failed to prep read\n");
> > > +         return ERR_PTR(ret);
> > > + }
> > > +
> > > + src_addr = vmap_batch(src_obj);
> > > + if (!src_addr) {
> > > +         DRM_DEBUG_DRIVER("CMD: Failed to vmap batch\n");
> > > +         ret = -ENOMEM;
> > > +         goto unpin_src;
> > > + }
> > > +
> > > + if (needs_clflush)
> > > +         drm_clflush_virt_range((char *)src_addr, src_obj->base.size);
> > > +
> > > + ret = i915_gem_object_set_to_cpu_domain(dest_obj, true);
> > > + if (ret) {
> > > +         DRM_DEBUG_DRIVER("CMD: Failed to set batch CPU domain\n");
> > > +         goto unmap_src;
> > > + }
> > > +
> > > + dest_addr = vmap_batch(dest_obj);
> > > + if (!dest_addr) {
> > > +         DRM_DEBUG_DRIVER("CMD: Failed to vmap shadow batch\n");
> > > +         ret = -ENOMEM;
> > > +         goto unmap_src;
> > > + }
> > > +
> > > + memcpy(dest_addr, src_addr, src_obj->base.size);
> > > +
> > > +unmap_src:
> > > + vunmap(src_addr);
> > > +unpin_src:
> > > + i915_gem_object_unpin_pages(src_obj);
> > > +
> > > + return ret ? ERR_PTR(ret) : dest_addr;
> > > +}
> > 
> > src_obj->size? You should perhaps borrow a byt (Baytrail).
> 
> Not sure I completely follow you here. The dest_obj is at least as big
> as the source, so I think copying only as much data as exists in the
> source object is right. I should be writing nops into any extra space
> though. And in parse_cmds, I should update the batch_end calculation.
> Were those the issues that you saw?

Just that generally src->size >> batch len, and the clflush will make it
twice as expensive on byt. (At the extreme, the object may be about 1000
times larger than the batch; libva *cough*.)
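
Something along the lines of idea (2) from the commit message would fix
both (sketch only; assumes batch_start_offset and a userspace-supplied
batch_len are plumbed into copy_batch(), which this patch does not yet
do):

	if (needs_clflush)
		drm_clflush_virt_range((char *)src_addr + batch_start_offset,
				       batch_len);

	memcpy(dest_addr, (char *)src_addr + batch_start_offset, batch_len);

	/* MI_NOOP is 0, so memsetting the tail leaves nothing executable
	 * between batch_len and the end of the shadow bo. */
	memset((char *)dest_addr + batch_len, 0,
	       dest_obj->base.size - batch_len);

That bounds both the clflush and the copy by batch_len rather than by
src_obj->base.size.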

> > >  static int
> > > @@ -1087,6 +1088,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > >   struct drm_i915_private *dev_priv = dev->dev_private;
> > >   struct eb_vmas *eb;
> > >   struct drm_i915_gem_object *batch_obj;
> > > + struct drm_i915_gem_object *shadow_batch_obj = NULL;
> > >   struct drm_clip_rect *cliprects = NULL;
> > >   struct intel_engine_cs *ring;
> > >   struct intel_context *ctx;
> > > @@ -1288,8 +1290,31 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data,
> > >   batch_obj->base.pending_read_domains |= I915_GEM_DOMAIN_COMMAND;
> > >  
> > >   if (i915_needs_cmd_parser(ring)) {
> > > +         shadow_batch_obj =
> > > +                 i915_gem_batch_pool_get(ring->batch_pool,
> > > +                                         batch_obj->base.size);
> > > +         if (IS_ERR(shadow_batch_obj)) {
> > > +                 ret = PTR_ERR(shadow_batch_obj);
> > > +                 /* Don't try to clean up the obj in the error path */
> > > +                 shadow_batch_obj = NULL;
> > > +
> > > +                 /*
> > > +                  * If the pool is at capacity, retiring requests might
> > > +                  * return some buffers.
> > > +                  */
> > > +                 if (ret == -EAGAIN)
> > > +                         i915_gem_retire_requests_ring(ring);
> > > +
> > > +                 goto err;
> > > +         }
> > > +
> > > +         ret = i915_gem_obj_ggtt_pin(shadow_batch_obj, 4096, 0);
> > > +         if (ret)
> > > +                 goto err;
> > > +
> > >           ret = i915_parse_cmds(ring,
> > >                                 batch_obj,
> > > +                               shadow_batch_obj,
> > >                                 args->batch_start_offset,
> > >                                 file->is_master);
> > 
> > You could just allocate the shadow inside parse_cmds and return it. Then
> > replace the conventional batch_obj with the new pointer and add it to
> > the eb list. Similarly, you do not need to track shadow_batch_obj
> > explicitly in the requests; the pool can do its own busy tracking
> > external to the core, which reduces the invasiveness of the patch.
> 
> Overall, I agree that this touched more places than I would have liked.
> 
> I initially thought about replacing batch_obj and hooking into the eb list,
> but that seemed like it would involve trickier code w.r.t. ref counting,
> making up an exec_entry for the vma, etc. Not the end of the world, but
> it felt less obvious. I can look again though if you think it's worth it.

I think so. The request is already complicated enough and I think the
shadow batch can fit neatly inside the rules we already have with a
modicum of stitching here.
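
Roughly, the execbuffer side could then shrink to something like this
(sketch only, not what the patch does today):

	if (i915_needs_cmd_parser(ring)) {
		/* parse_cmds grabs a shadow bo from the pool, copies and
		 * parses into it, and returns it (or an ERR_PTR). */
		shadow_batch_obj = i915_parse_cmds(ring, batch_obj,
						   args->batch_start_offset,
						   file->is_master);
		if (IS_ERR(shadow_batch_obj)) {
			ret = PTR_ERR(shadow_batch_obj);
			goto err;
		}

		/* Dispatch the shadow instead of the user batch; putting
		 * it on the eb list gives the usual ref and busy tracking
		 * for free. */
		batch_obj = shadow_batch_obj;
	}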
 
> What specifically are you thinking in terms of implementing busy tracking
> in the pool? The idea with adding the shadow object to the request was just
> to get it back into the pool and made purgeable asap. I also thought it
> would avoid some additional code, given that we then know buffers in the
> pool have completed any pending work. Maybe the suggested approach would do
> a better job of those things though.

I was thinking that a pool is basically a list of bo, and you simply
query whether the oldest is still busy when you need a new bo. That is
the same as how userspace implements its pool of active/inactive objects.
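
In pseudo-code, with made-up names for the list hooks (obj->active is
the existing busy flag):

	obj = list_first_entry_or_null(&pool->cache_list,
				       struct drm_i915_gem_object,
				       batch_pool_list);
	if (obj && !obj->active && obj->base.size >= size) {
		/* Oldest bo is idle and big enough: recycle it, moving
		 * it to the tail to keep the list in last-used order. */
		list_move_tail(&obj->batch_pool_list, &pool->cache_list);
		return obj;
	}

	/* Oldest bo is still busy (or too small): allocate a fresh one */
	return i915_gem_alloc_object(pool->dev, roundup(size, PAGE_SIZE));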
 
> > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > index e72017b..82aae29 100644
> > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > > @@ -188,6 +188,13 @@ struct  intel_engine_cs {
> > >   bool needs_cmd_parser;
> > >  
> > >   /*
> > > +  * A pool of objects to use as shadow copies of client batch buffers
> > > +  * when the command parser is enabled. Prevents the client from
> > > +  * modifying the batch contents after software parsing.
> > > +  */
> > > + struct i915_gem_batch_pool *batch_pool;
> > 
> > Why are they tied to a ring?
> 
> There was a question as to whether to make a global pool or per-ring
> pools way back when I first sent the parser series. Daniel expressed
> a slight preference for per-ring, but I don't object to a global pool.
> I suppose a potential benefit of per-ring would be that if userspace uses
> different sized buffers on different rings, we'd more easily find a
> suitable buffer. But enhancing the pool management could fix that too.

Batch buffer sizes are not fixed per ring anyway. I guess Daniel thought
it might work out easier, but memory is a global resource and should be
tracked primarily at the device level. Userspace segregates its caches
based on size to speed up searches.
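
(libdrm, for example, keeps roughly power-of-two size buckets, so a
lookup indexes a short per-size list rather than walking one long list;
sketch only:

	static int bucket_for_size(unsigned long size)
	{
		int i = 12;	/* smallest bucket: 4KiB */

		while ((1UL << i) < size)
			i++;
		return i - 12;
	}

the real bucketing also inserts intermediate sizes between the powers
of two.)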
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre