Thank you for all your useful feedback, Konrad.

I have made most of the changes you suggested. I will test them and
repost. I have also made a few points inline below.

On Wed, 2012-09-19 at 15:11 +0100, Konrad Rzeszutek Wilk wrote:
> On Wed, Sep 19, 2012 at 11:51:27AM +0100, Oliver Chick wrote:
> > This patch implements persistent grants for the xen-blk{front,back}
> > mechanism. The effect of this change is to reduce the number of unmap
> > operations performed, since they cause a (costly) TLB shootdown. This
> > allows the I/O performance to scale better when a large number of VMs
> > are performing I/O.
> >
> > Previously, the blkfront driver was supplied a bvec[] from the request
> > queue. This was granted to dom0; dom0 performed the I/O and wrote
> > directly into the grant-mapped memory and unmapped it; blkfront then
> > removed foreign access for that grant. The cost of unmapping scales
> > badly with the number of CPUs in Dom0. An experiment showed that when
> > Dom0 has 24 VCPUs, and guests are performing parallel I/O to a
> > ramdisk, the IPIs from performing unmaps become a bottleneck at 5 guests
> > (at which point 650,000 IOPS are being performed in total). If more
> > than 5 guests are used, the performance declines. By 10 guests, only
> > 400,000 IOPS are being performed.
> >
> > This patch improves performance by only unmapping when the connection
> > between blkfront and back is broken.
> >
> > On startup, blk{front,back} use xenbus to communicate their ability to
> > perform 'feature-persistent'. Iff both ends have this ability,
> > persistent mode is used.
> >
> > To perform a read, in persistent mode, blkfront uses a separate pool
> > of pages that it maps to dom0. When a request comes in, blkfront
> > transmutes the request so that blkback will write into one of these
> > free pages. Blkback keeps note of which grefs it has already
> > mapped. When a new ring request comes to blkback, it looks to see if
> > it has already mapped that page. If so, it will not map it again. If
> > the page hasn't been previously mapped, it is mapped now, and a record
> > is kept of this mapping. Blkback proceeds as usual. When blkfront is
> > notified that blkback has completed a request, it memcpy's from the
> > shared memory, into the bvec supplied. A record that the {gref, page}
> > tuple is mapped, and not inflight is kept.
> >
> > Writes are similar, except that the memcpy is performed from the
> > supplied bvecs, into the shared pages, before the request is put onto
> > the ring.
> >
> > Blkback has to store a mapping of grefs=>{page mapped to by gref} in
> > an array. As the grefs are not known a priori, and come with no
> > guarantee on their ordering, we have to perform a linear search
> > through this array to find the page for every gref we receive. The
> > overhead of this is low, however future work might want to use a more
> > efficient data structure to reduce this O(n) operation.
> >
> > We (ijc, and myself) have introduced a new constant,
> > BLKIF_MAX_PERS_REQUESTS_PER_DEV. This is to prevent a malicious guest
> > from attempting a DoS, by supplying fresh grefs, causing the Dom0
> > kernel to map excessively. This is currently set to 256, the
> > maximum number of entries in the ring. As the number of inflight
> > requests <= size of ring, 256 is also the maximum sensible size. This
> > introduces a maximum overhead of 11MB of mapped memory, per block
> > device. In practice, we typically use only about 60 of these. If the
> > guest exceeds the 256 limit, it is either buggy or malicious. We treat
> > this in one of two ways:
> > 1) If we have mapped <
> > BLKIF_MAX_SEGMENTS_PER_REQUEST * BLKIF_MAX_PERS_REQUESTS_PER_DEV
> > pages, we will persistently map the grefs. This can occur if previous
> > requests have not used all BLKIF_MAX_SEGMENTS_PER_REQUEST segments.
> > 2) Otherwise, we revert to non-persistent grants for all future grefs.
> >
> > In writing this patch, the question arises as to whether the additional
> > cost of performing memcpys in the guest (to/from the pool of granted
> > pages) outweigh the gains of not performing TLB shootdowns. The answer
> > to that question is `no'. There appears to be very little, if any
> > additional cost to the guest of using persistent grants. There is
> > perhaps a small saving, from the reduced number of hypercalls
> > performed in granting, and ending foreign access.
> >
> >
> > Signed-off-by: Oliver Chick <oliver.ch...@citrix.com>
> > ---
> >  drivers/block/xen-blkback/blkback.c |  149 +++++++++++++++++++++++++---
> >  drivers/block/xen-blkback/common.h  |   16 +++
> >  drivers/block/xen-blkback/xenbus.c  |   21 +++-
> >  drivers/block/xen-blkfront.c        |  186 
> > ++++++++++++++++++++++++++++++-----
> >  include/xen/interface/io/blkif.h    |    9 ++
> >  5 files changed, 338 insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/block/xen-blkback/blkback.c 
> > b/drivers/block/xen-blkback/blkback.c
> > index 73f196c..f95dee9 100644
> > --- a/drivers/block/xen-blkback/blkback.c
> > +++ b/drivers/block/xen-blkback/blkback.c
> > @@ -78,6 +78,7 @@ struct pending_req {
> >       unsigned short          operation;
> >       int                     status;
> >       struct list_head        free_list;
> > +     u8                      is_pers;
> 
> Just do what you did in the blkfront.c:
> 
>         unsigned int feature_persistent:1
> 
> or:
>         unsigned int flag_persistent:1
> 
> or:
>         unsigned int flag_pers_gnt:1
> 
> >  };
> >
> >  #define BLKBACK_INVALID_HANDLE (~0)
> > @@ -128,6 +129,24 @@ static int dispatch_rw_block_io(struct xen_blkif 
> > *blkif,
> >  static void make_response(struct xen_blkif *blkif, u64 id,
> >                         unsigned short op, int st);
> >
> > +static void add_pers_gnt(struct pers_gnt *pers_gnt, struct xen_blkif 
> > *blkif)
> > +{
> > +     BUG_ON(blkif->pers_gnt_c >= BLKIF_MAX_PERS_REQUESTS_PER_DEV *
> > +            BLKIF_MAX_SEGMENTS_PER_REQUEST);
> > +     blkif->pers_gnts[blkif->pers_gnt_c++] = pers_gnt;
> > +}
> > +
> > +static struct pers_gnt *get_pers_gnt(struct xen_blkif *blkif,
> > +                                  grant_ref_t gref)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < blkif->pers_gnt_c; i++)
> > +             if (gref == blkif->pers_gnts[i]->gnt)
> > +                     return blkif->pers_gnts[i];
> > +     return NULL;
> > +}
> > +
> >  /*
> >   * Retrieve from the 'pending_reqs' a free pending_req structure to be 
> > used.
> >   */
> > @@ -274,6 +293,12 @@ int xen_blkif_schedule(void *arg)
> >  {
> >       struct xen_blkif *blkif = arg;
> >       struct xen_vbd *vbd = &blkif->vbd;
> > +     struct gnttab_unmap_grant_ref unmap[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct pers_gnt *pers_gnt;
> > +     int i;
> > +     int ret = 0;
> > +     int segs_to_unmap;
> >
> >       xen_blkif_get(blkif);
> >
> > @@ -301,6 +326,29 @@ int xen_blkif_schedule(void *arg)
> >                       print_stats(blkif);
> >       }
> >
> > +     /* Free all persistent grant pages */
> > +
> > +     while (segs_to_unmap = min(BLKIF_MAX_SEGMENTS_PER_REQUEST,
> > +                              blkif->pers_gnt_c)) {
> > +
> > +             for (i = 0; i < segs_to_unmap; i++) {
> > +                     pers_gnt = blkif->pers_gnts[blkif->pers_gnt_c - i - 
> > 1];
> > +
> > +                     gnttab_set_unmap_op(&unmap[i],
> > +                                         pfn_to_kaddr(page_to_pfn(
> > +                                                          pers_gnt->page)),
> > +                                         GNTMAP_host_map,
> > +                                         pers_gnt->handle);
> > +
> > +                     pages[i] = pers_gnt->page;
> > +             }
> > +
> > +             ret = gnttab_unmap_refs(unmap, pages, segs_to_unmap, false);
> > +             BUG_ON(ret);
> > +
> > +             blkif->pers_gnt_c -= segs_to_unmap;
> > +
> > +     }
> > +
> >       if (log_stats)
> >               print_stats(blkif);
> >
> > @@ -343,13 +391,28 @@ static void xen_blkbk_unmap(struct pending_req *req)
> >
> >  static int xen_blkbk_map(struct blkif_request *req,
> >                        struct pending_req *pending_req,
> > -                      struct seg_buf seg[])
> > +                      struct seg_buf seg[],
> > +                      struct page *pages[])
> >  {
> >       struct gnttab_map_grant_ref map[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct pers_gnt *new_pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct pers_gnt *pers_gnts[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct page *pages_to_gnt[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct pers_gnt *pers_gnt;
> > +     phys_addr_t addr;
> >       int i;
> > +     int new_map;
> >       int nseg = req->u.rw.nr_segments;
> > +     int segs_to_init = 0;
> 
> segs_to_map is probably a better name.
> 
> >       int ret = 0;
> > +     int use_pers_gnts;
> >
> > +     use_pers_gnts = (pending_req->blkif->can_grant_persist &&
> > +                      pending_req->blkif->pers_gnt_c <
> > +                      BLKIF_MAX_SEGMENTS_PER_REQUEST *
> > +                      BLKIF_MAX_PERS_REQUESTS_PER_DEV);
> > +
> > +     pending_req->is_pers = use_pers_gnts;
> >       /*
> >        * Fill out preq.nr_sects with proper amount of sectors, and setup
> >        * assign map[..] with the PFN of the page in our domain with the
> > @@ -358,36 +421,87 @@ static int xen_blkbk_map(struct blkif_request *req,
> >       for (i = 0; i < nseg; i++) {
> >               uint32_t flags;
> >
> > +             if (use_pers_gnts) {
> > +                     pers_gnt = get_pers_gnt(pending_req->blkif,
> > +                                             req->u.rw.seg[i].gref);
> > +                     if (!pers_gnt) {
> > +                             new_map = 1;
> > +                             pers_gnt = kmalloc(sizeof(struct pers_gnt),
> > +                                                GFP_KERNEL);
> > +                             if (!pers_gnt)
> > +                                     return -ENOMEM;
> > +                             pers_gnt->page = alloc_page(GFP_KERNEL);
> > +                             if (!pers_gnt->page)
> > +                                     return -ENOMEM;
> 
> You want to kfree pers_gnt here.
> 
> > +                             pers_gnt->gnt = req->u.rw.seg[i].gref;
> > +
> > +                             pages_to_gnt[segs_to_init] = pers_gnt->page;
> > +                             new_pers_gnts[segs_to_init] = pers_gnt;
> > +
> > +                             add_pers_gnt(pers_gnt, pending_req->blkif);
> > +
> > +                     } else {
> > +                             new_map = 0;
> > +                     }
> > +                     pages[i] = pers_gnt->page;
> > +                     addr = (unsigned long) pfn_to_kaddr(
> > +                             page_to_pfn(pers_gnt->page));
> 
> Would it make sense to cache this in the 'pers_gnt' structure?

As far as I can tell, we only need this value when mapping, and
unmapping. So if we cache it, we will use it a maximum of one time. I
think it's cheap to calculate. Am I right?

> 
> > +                     pers_gnts[i] = pers_gnt;
> > +             } else {
> > +                     new_map = 1;
> > +                     pages[i] = blkbk->pending_page(pending_req, i);
> > +                     addr = vaddr(pending_req, i);
> > +                     pages_to_gnt[i] = blkbk->pending_page(pending_req, i);
> > +             }
> > +
> >               flags = GNTMAP_host_map;
> > -             if (pending_req->operation != BLKIF_OP_READ)
> > +             if (!use_pers_gnts && (pending_req->operation != 
> > BLKIF_OP_READ))
> >                       flags |= GNTMAP_readonly;
> > -             gnttab_set_map_op(&map[i], vaddr(pending_req, i), flags,
> > -                               req->u.rw.seg[i].gref,
> > -                               pending_req->blkif->domid);
> > +             if (new_map) {
> > +                     gnttab_set_map_op(&map[segs_to_init++], addr,
> > +                                       flags, req->u.rw.seg[i].gref,
> > +                                       pending_req->blkif->domid);
> > +             }
> >       }
> >
> > -     ret = gnttab_map_refs(map, NULL, &blkbk->pending_page(pending_req, 
> > 0), nseg);
> > -     BUG_ON(ret);
> > +     if (segs_to_init) {
> > +             ret = gnttab_map_refs(map, NULL, pages_to_gnt, segs_to_init);
> > +             BUG_ON(ret);
> > +     }
> >
> >       /*
> >        * Now swizzle the MFN in our domain with the MFN from the other 
> > domain
> >        * so that when we access vaddr(pending_req,i) it has the contents of
> >        * the page from the other domain.
> >        */
> > -     for (i = 0; i < nseg; i++) {
> > +     for (i = 0; i < segs_to_init; i++) {
> >               if (unlikely(map[i].status != 0)) {
> >                       pr_debug(DRV_PFX "invalid buffer -- could not remap 
> > it\n");
> >                       map[i].handle = BLKBACK_INVALID_HANDLE;
> >                       ret |= 1;
> >               }
> >
> > -             pending_handle(pending_req, i) = map[i].handle;
> > +             if (use_pers_gnts) {
> > +                     /* store the `out' values from map */
> > +                 pending_req->blkif->pers_gnts
> > +                     [pending_req->blkif->pers_gnt_c - segs_to_init +
> > +                      i]->handle = map[i].handle;
> > +                     new_pers_gnts[i]->dev_bus_addr = map[i].dev_bus_addr;
> > +             }
> >
> >               if (ret)
> >                       continue;
> > -
> > -             seg[i].buf  = map[i].dev_bus_addr |
> > -                     (req->u.rw.seg[i].first_sect << 9);
> > +     }
> > +     for (i = 0; i < nseg; i++) {
> > +             if (use_pers_gnts) {
> > +                     pending_handle(pending_req, i) = pers_gnts[i]->handle;
> > +                     seg[i].buf = pers_gnts[i]->dev_bus_addr |
> > +                             (req->u.rw.seg[i].first_sect << 9);
> > +             } else {
> > +                     pending_handle(pending_req, i) = map[i].handle;
> > +                     seg[i].buf = map[i].dev_bus_addr |
> > +                             (req->u.rw.seg[i].first_sect << 9);
> > +             }
> >       }
> >       return ret;
> >  }
> > @@ -468,7 +582,8 @@ static void __end_block_io_op(struct pending_req 
> > *pending_req, int error)
> >        * the proper response on the ring.
> >        */
> >       if (atomic_dec_and_test(&pending_req->pendcnt)) {
> > -             xen_blkbk_unmap(pending_req);
> > +             if (!pending_req->is_pers)
> > +                     xen_blkbk_unmap(pending_req);
> >               make_response(pending_req->blkif, pending_req->id,
> >                             pending_req->operation, pending_req->status);
> >               xen_blkif_put(pending_req->blkif);
> > @@ -590,6 +705,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >       int operation;
> >       struct blk_plug plug;
> >       bool drain = false;
> > +     struct page *pages[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> >
> >       switch (req->operation) {
> >       case BLKIF_OP_READ:
> > @@ -676,7 +792,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >        * the hypercall to unmap the grants - that is all done in
> >        * xen_blkbk_unmap.
> >        */
> > -     if (xen_blkbk_map(req, pending_req, seg))
> > +     if (xen_blkbk_map(req, pending_req, seg, pages))
> >               goto fail_flush;
> >
> >       /*
> > @@ -688,7 +804,7 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >       for (i = 0; i < nseg; i++) {
> >               while ((bio == NULL) ||
> >                      (bio_add_page(bio,
> > -                                  blkbk->pending_page(pending_req, i),
> 
> Can we get rid of pending_page macro?

Unfortunately not; it is still used in the non-persistent mode to
populate the pages[].

> 
> > +                                  pages[i],
> >                                    seg[i].nsec << 9,
> >                                    seg[i].buf & ~PAGE_MASK) == 0)) {
> >
> > @@ -743,7 +859,8 @@ static int dispatch_rw_block_io(struct xen_blkif *blkif,
> >       return 0;
> >
> >   fail_flush:
> > -     xen_blkbk_unmap(pending_req);
> > +     if (!blkif->can_grant_persist)
> > +             xen_blkbk_unmap(pending_req);
> >   fail_response:
> >       /* Haven't submitted any bio's yet. */
> >       make_response(blkif, req->u.rw.id, req->operation, BLKIF_RSP_ERROR);
> > diff --git a/drivers/block/xen-blkback/common.h 
> > b/drivers/block/xen-blkback/common.h
> > index 9ad3b5e..1ecb709 100644
> > --- a/drivers/block/xen-blkback/common.h
> > +++ b/drivers/block/xen-blkback/common.h
> > @@ -47,6 +47,8 @@
> >  #define DPRINTK(fmt, args...)                                \
> >       pr_debug(DRV_PFX "(%s:%d) " fmt ".\n",          \
> >                __func__, __LINE__, ##args)
> > +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
> 
> You need a comment explaining why this number.
> 
> > +
> >
> >
> >  /* Not a real protocol.  Used to generate ring structs which contain
> > @@ -164,6 +166,14 @@ struct xen_vbd {
> >
> >  struct backend_info;
> >
> > +
> > +struct pers_gnt {
> > +     struct page *page;
> > +     grant_ref_t gnt;
> > +     uint32_t handle;
> 
> Not grant_handle_t ?
> 
> > +     uint64_t dev_bus_addr;
> > +};
> > +
> >  struct xen_blkif {
> >       /* Unique identifier for this interface. */
> >       domid_t                 domid;
> > @@ -190,6 +200,12 @@ struct xen_blkif {
> >       struct task_struct      *xenblkd;
> >       unsigned int            waiting_reqs;
> >
> > +     /* frontend feature information */
> > +     u8                      can_grant_persist:1;
> 
> Pls move that to the vbd structure, and if you feel
> especially adventourous, you could modify the
> 
> bool flush_support
> bool discard_secure
> 
> to be 'unsigned int feature_flush:1' ,etc.. as its own
> seperate patch and then introduce the
> 
> unsigned int feature_grnt_pers:1
> 
> flag.
> > +     struct pers_gnt         *pers_gnts[BLKIF_MAX_PERS_REQUESTS_PER_DEV *
> > +                                       BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     unsigned int            pers_gnt_c;
> > +
> >       /* statistics */
> >       unsigned long           st_print;
> >       int                     st_rd_req;
> > diff --git a/drivers/block/xen-blkback/xenbus.c 
> > b/drivers/block/xen-blkback/xenbus.c
> > index 4f66171..dc58cc4 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -681,6 +681,13 @@ again:
> >               goto abort;
> >       }
> >
> > +     err = xenbus_printf(xbt, dev->nodename, "feature-persistent-grants",
> > +                         "%d", 1);
> > +     if (err) {
> > +             xenbus_dev_fatal(dev, err, "writing persistent capability");
> 
> It is not fatal. Just do dev_warn, like the xen_blkbk_discard does.
> 
> > +             goto abort;
> > +     }
> > +
> >       /* FIXME: use a typename instead */
> >       err = xenbus_printf(xbt, dev->nodename, "info", "%u",
> >                           be->blkif->vbd.type |
> > @@ -721,6 +728,7 @@ static int connect_ring(struct backend_info *be)
> >       struct xenbus_device *dev = be->dev;
> >       unsigned long ring_ref;
> >       unsigned int evtchn;
> > +     u8 pers_grants;
> >       char protocol[64] = "";
> >       int err;
> >
> > @@ -750,8 +758,17 @@ static int connect_ring(struct backend_info *be)
> >               xenbus_dev_fatal(dev, err, "unknown fe protocol %s", 
> > protocol);
> >               return -1;
> >       }
> > -     pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s)\n",
> > -             ring_ref, evtchn, be->blkif->blk_protocol, protocol);
> > +     err = xenbus_gather(XBT_NIL, dev->otherend,
> > +                         "feature-persistent-grants", "%d",
> > +                         &pers_grants, NULL);
> > +     if (err)
> > +             pers_grants = 0;
> > +
> > +     be->blkif->can_grant_persist = pers_grants;
> 
> We should also have a patch for the Xen blkif.h to document this feature.. 
> but that
> can be done later.
> 
> > +
> > +     pr_info(DRV_PFX "ring-ref %ld, event-channel %d, protocol %d (%s) 
> > persistent %d\n",
> > +             ring_ref, evtchn, be->blkif->blk_protocol, protocol,
> > +             pers_grants);
> >
> >       /* Map the shared frame, irq etc. */
> >       err = xen_blkif_map(be->blkif, ring_ref, evtchn);
> > diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> > index e4fb337..c1cc5fe 100644
> > --- a/drivers/block/xen-blkfront.c
> > +++ b/drivers/block/xen-blkfront.c
> > @@ -68,6 +68,13 @@ struct blk_shadow {
> >       struct blkif_request req;
> >       struct request *request;
> >       unsigned long frame[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct gnt_list *grants_used[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +};
> > +
> > +struct gnt_list {
> > +     grant_ref_t gref;
> > +     unsigned long frame;
> 
> Pls mention what 'frame' is and what the other ones do.
> 
> > +     struct gnt_list *tail;
> >  };
> 
> How did the compiler actually compile this? You are using it in 'struct 
> blk_shadow'
> but it is declared later?
> >
> >  static DEFINE_MUTEX(blkfront_mutex);
> > @@ -97,11 +104,14 @@ struct blkfront_info
> >       struct work_struct work;
> >       struct gnttab_free_callback callback;
> >       struct blk_shadow shadow[BLK_RING_SIZE];
> > +     struct gnt_list *persistent_grants_front;
> 
> I don't think you need the 'front' in there. Its obvious you are
> in the frontend code.
> 
> > +     unsigned int persistent_grants_c;
> >       unsigned long shadow_free;
> >       unsigned int feature_flush;
> >       unsigned int flush_op;
> >       unsigned int feature_discard:1;
> >       unsigned int feature_secdiscard:1;
> > +     unsigned int feature_persistent:1;
> >       unsigned int discard_granularity;
> >       unsigned int discard_alignment;
> >       int is_ready;
> > @@ -286,22 +296,37 @@ static int blkif_queue_request(struct request *req)
> >       struct blkif_request *ring_req;
> >       unsigned long id;
> >       unsigned int fsect, lsect;
> > -     int i, ref;
> > +     int i, ref, use_pers_gnts, new_pers_gnts;
> 
> use_pers_gnts and new_pers_gnt? Can you document what the difference
> is pls?
> 
> Perhaps 'new_pers_gnts' should be called:
> 'got_grant'? or 'using_grant' ?
> 
> >       grant_ref_t gref_head;
> > +     struct bio_vec *bvecs[BLKIF_MAX_SEGMENTS_PER_REQUEST];
> > +     struct bio_vec *bvec;
> > +     struct req_iterator iter;
> > +     char *bvec_data;
> > +     void *shared_data;
> > +     unsigned long flags;
> 
> What kind of flags?
> > +     struct page *shared_page;
> 
> It is not really shared_page. It is a temporary page. Perhaps
> tmp_page ?
> 
> > +     struct gnt_list *gnt_list_entry;
> >       struct scatterlist *sg;
> >
> >       if (unlikely(info->connected != BLKIF_STATE_CONNECTED))
> >               return 1;
> >
> > -     if (gnttab_alloc_grant_references(
> > -             BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> > -             gnttab_request_free_callback(
> > -                     &info->callback,
> > -                     blkif_restart_queue_callback,
> > -                     info,
> > -                     BLKIF_MAX_SEGMENTS_PER_REQUEST);
> > -             return 1;
> > +     use_pers_gnts = info->feature_persistent;
> > +
> > +     if (info->persistent_grants_c < BLKIF_MAX_SEGMENTS_PER_REQUEST) {
> > +             new_pers_gnts = 1;
> > +             if (gnttab_alloc_grant_references(
> > +                         BLKIF_MAX_SEGMENTS_PER_REQUEST, &gref_head) < 0) {
> > +                     gnttab_request_free_callback(
> > +                             &info->callback,
> > +                             blkif_restart_queue_callback,
> > +                             info,
> > +                             BLKIF_MAX_SEGMENTS_PER_REQUEST);
> > +                     return 1;
> > +             }
> >       } else
> > +             new_pers_gnts = 0;
> >
> >       /* Fill out a communications ring structure. */
> >       ring_req = RING_GET_REQUEST(&info->ring, info->ring.req_prod_pvt);
> > @@ -341,20 +366,66 @@ static int blkif_queue_request(struct request *req)
> >                      BLKIF_MAX_SEGMENTS_PER_REQUEST);
> >
> >               for_each_sg(info->sg, sg, ring_req->u.rw.nr_segments, i) {
> > -                     buffer_mfn = pfn_to_mfn(page_to_pfn(sg_page(sg)));
> >                       fsect = sg->offset >> 9;
> >                       lsect = fsect + (sg->length >> 9) - 1;
> > -                     /* install a grant reference. */
> > -                     ref = gnttab_claim_grant_reference(&gref_head);
> > -                     BUG_ON(ref == -ENOSPC);
> >
> > -                     gnttab_grant_foreign_access_ref(
> > +                     if (use_pers_gnts && info->persistent_grants_c) {
> > +                             gnt_list_entry = 
> > info->persistent_grants_front;
> > +
> > +                             info->persistent_grants_front =
> > +                                     info->persistent_grants_front->tail;
> > +                             ref = gnt_list_entry->gref;
> > +                             buffer_mfn = 
> > pfn_to_mfn(gnt_list_entry->frame);
> 
> Ah, so frame is a PFN! Why don't you just call it that?
> 
> > +                             info->persistent_grants_c--;
> > +                     } else {
> > +                             ref = 
> > gnttab_claim_grant_reference(&gref_head);
> > +                             BUG_ON(ref == -ENOSPC);
> > +
> > +                             if (use_pers_gnts) {
> > +                                     gnt_list_entry =
> > +                                             kmalloc(sizeof(struct 
> > gnt_list),
> > +                                                              GFP_ATOMIC);
> > +                                     if (!gnt_list_entry)
> > +                                             return -ENOMEM;
> > +
> > +                                     shared_page = alloc_page(GFP_ATOMIC);
> > +                                     if (!shared_page)
> > +                                             return -ENOMEM;
> 
> Make sure to kfree gnt_list_entry.
> 
> > +
> > +                                     gnt_list_entry->frame =
> > +                                             page_to_pfn(shared_page);
> > +                                     gnt_list_entry->gref = ref;
> > +                             } else
> > +                                     shared_page = sg_page(sg);
> > +
> > +                             buffer_mfn = pfn_to_mfn(page_to_pfn(
> > +                                                             shared_page));
> > +                             gnttab_grant_foreign_access_ref(
> >                                       ref,
> >                                       info->xbdev->otherend_id,
> >                                       buffer_mfn,
> > -                                     rq_data_dir(req));
> > +                                     !use_pers_gnts && rq_data_dir(req));
> > +                     }
> > +
> > +                     if (use_pers_gnts)
> > +                             info->shadow[id].grants_used[i] =
> > +                                     gnt_list_entry;
> > +
> > +                     if (use_pers_gnts && rq_data_dir(req)) {
> 
> You can declare the 'shared_data' here:
> 
>                                 void *shared_data;
> 
> and also bvec_data.
> 
> > +                             shared_data = kmap_atomic(
> > +                                     pfn_to_page(gnt_list_entry->frame));
> > +                             bvec_data = kmap_atomic(sg_page(sg));
> > +
> > +                             memcpy(shared_data + sg->offset,
> > +                                    bvec_data   + sg->offset,
> > +                                    sg->length);
> 
> Do we need to worry about it spilling over a page? Should we check that
> sg>offset + sg->length < PAGE_SIZE ?

I agree, this is probably a worthwhile thing to check.

> 
> Also this would imply that based on the offset (so say it is 3999) that the 
> old data (0->3998)
> might be still there - don't know how important that is?

This is true. I spoke with IanC about this, and we *think* that this is
ok. Any old data that is there will have already been given to blkback,
so we're not really leaking data that we shouldn't be. 

> > +
> > +                             kunmap_atomic(bvec_data);
> > +                             kunmap_atomic(shared_data);
> > +                     }
> >
> >                       info->shadow[id].frame[i] = mfn_to_pfn(buffer_mfn);
> > +
> >                       ring_req->u.rw.seg[i] =
> >                                       (struct blkif_request_segment) {
> >                                               .gref       = ref,
> > @@ -368,7 +439,8 @@ static int blkif_queue_request(struct request *req)
> >       /* Keep a private copy so we can reissue requests when recovering. */
> >       info->shadow[id].req = *ring_req;
> >
> > -     gnttab_free_grant_references(gref_head);
> > +     if (new_pers_gnts)
> > +             gnttab_free_grant_references(gref_head);
> >
> >       return 0;
> >  }
> > @@ -480,12 +552,13 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, 
> > u16 sector_size)
> >  static void xlvbd_flush(struct blkfront_info *info)
> >  {
> >       blk_queue_flush(info->rq, info->feature_flush);
> > -     printk(KERN_INFO "blkfront: %s: %s: %s\n",
> > +     printk(KERN_INFO "blkfront: %s: %s: %s. Persistent=%d\n",
> 
> Just use %s like the rest
> >              info->gd->disk_name,
> >              info->flush_op == BLKIF_OP_WRITE_BARRIER ?
> >               "barrier" : (info->flush_op == BLKIF_OP_FLUSH_DISKCACHE ?
> >               "flush diskcache" : "barrier or flush"),
> > -            info->feature_flush ? "enabled" : "disabled");
> > +            info->feature_flush ? "enabled" : "disabled",
> > +         info->feature_persistent);
> 
> and this can be 'info->feature_persistent ? "persistent" : "".
> 
> >  }
> >
> >  static int xen_translate_vdev(int vdevice, int *minor, unsigned int 
> > *offset)
> > @@ -707,6 +780,8 @@ static void blkif_restart_queue(struct work_struct 
> > *work)
> >
> >  static void blkif_free(struct blkfront_info *info, int suspend)
> >  {
> > +     struct gnt_list *pers_gnt;
> > +
> >       /* Prevent new requests being issued until we fix things up. */
> >       spin_lock_irq(&info->io_lock);
> >       info->connected = suspend ?
> > @@ -714,6 +789,15 @@ static void blkif_free(struct blkfront_info *info, int 
> > suspend)
> >       /* No more blkif_request(). */
> >       if (info->rq)
> >               blk_stop_queue(info->rq);
> > +
> > +     /* Remove all persistent grants */
> > +     while (info->persistent_grants_front) {
> > +             pers_gnt = info->persistent_grants_front;
> > +             info->persistent_grants_front = pers_gnt->tail;
> > +             gnttab_end_foreign_access(pers_gnt->gref, 0, 0UL);
> > +             kfree(pers_gnt);
> > +     }
> > +
> >       /* No more gnttab callback work. */
> >       gnttab_cancel_free_callback(&info->callback);
> >       spin_unlock_irq(&info->io_lock);
> > @@ -734,13 +818,44 @@ static void blkif_free(struct blkfront_info *info, 
> > int suspend)
> >
> >  }
> >
> > -static void blkif_completion(struct blk_shadow *s)
> > +static void blkif_completion(struct blk_shadow *s, struct blkfront_info 
> > *info,
> > +                          struct blkif_response *bret)
> >  {
> >       int i;
> > -     /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same place
> > -      * flag. */
> > -     for (i = 0; i < s->req.u.rw.nr_segments; i++)
> > -             gnttab_end_foreign_access(s->req.u.rw.seg[i].gref, 0, 0UL);
> > +     struct gnt_list *new_gnt_list_entry;
> > +     struct bio_vec *bvec;
> > +     struct req_iterator iter;
> > +     unsigned long flags;
> > +     char *bvec_data;
> > +     void *shared_data;
> > +
> > +     i = 0;
> > +     if (info->feature_persistent == 0) {
> > +             /* Do not let BLKIF_OP_DISCARD as nr_segment is in the same
> > +              * place flag. */
> > +             for (i = 0; i < s->req.u.rw.nr_segments; i++)
> > +                     gnttab_end_foreign_access(s->req.u.rw.seg[i].gref,
> > +                                               0, 0UL);
> > +             return;
> > +     }
> > +
> 
> Move the 'i = 0' to here.
> 
> > +     if (bret->operation == BLKIF_OP_READ)
> > +             rq_for_each_segment(bvec, s->request, iter) {
> > +                     shared_data = kmap_atomic
> > +                             (pfn_to_page(s->grants_used[i++]->frame));
> > +                     bvec_data = bvec_kmap_irq(bvec, &flags);
> > +                     memcpy(bvec_data, shared_data + bvec->bv_offset,
> > +                            bvec->bv_len);
> > +                     bvec_kunmap_irq(bvec_data, &flags);
> > +                     kunmap_atomic(shared_data);
> > +             }
> > +     /* Add the persistent grant into the list of free grants */
> > +     for (i = 0; i < s->req.u.rw.nr_segments; i++) {
> > +             new_gnt_list_entry = s->grants_used[i];
> > +             new_gnt_list_entry->tail = info->persistent_grants_front;
> > +             info->persistent_grants_front = new_gnt_list_entry;
> > +             info->persistent_grants_c++;
> > +     }
> >  }
> >
> >  static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> > @@ -783,7 +898,7 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
> >               req  = info->shadow[id].request;
> >
> >               if (bret->operation != BLKIF_OP_DISCARD)
> > -                     blkif_completion(&info->shadow[id]);
> > +                     blkif_completion(&info->shadow[id], info, bret);
> >
> >               if (add_id_to_freelist(info, id)) {
> >                       WARN(1, "%s: response to %s (id %ld) couldn't be recycled!\n",
> > @@ -943,6 +1058,12 @@ again:
> >               message = "writing protocol";
> >               goto abort_transaction;
> >       }
> > +     err = xenbus_printf(xbt, dev->nodename,
> > +                         "feature-persistent-grants", "%d", 1);
> > +     if (err) {
> > +             message = "Writing persistent grants front feature";
> > +             goto abort_transaction;
> 
> I wouldn't abort. Just continue on using the non-grant feature.
> 
> > +     }
> >
> >       err = xenbus_transaction_end(xbt, 0);
> >       if (err) {
> > @@ -1030,6 +1151,7 @@ static int blkfront_probe(struct xenbus_device *dev,
> >       spin_lock_init(&info->io_lock);
> >       info->xbdev = dev;
> >       info->vdevice = vdevice;
> > +     info->persistent_grants_c = 0;
> >       info->connected = BLKIF_STATE_DISCONNECTED;
> >       INIT_WORK(&info->work, blkif_restart_queue);
> >
> > @@ -1094,6 +1216,7 @@ static int blkif_recover(struct blkfront_info *info)
> >                                       req->u.rw.seg[j].gref,
> >                                       info->xbdev->otherend_id,
> >                                       pfn_to_mfn(info->shadow[req->u.rw.id].frame[j]),
> > +                                     !info->feature_persistent &&
> >                                       rq_data_dir(info->shadow[req->u.rw.id].request));
> >               }
> >               info->shadow[req->u.rw.id].req = *req;
> > @@ -1226,7 +1349,7 @@ static void blkfront_connect(struct blkfront_info *info)
> >       unsigned long sector_size;
> >       unsigned int binfo;
> >       int err;
> > -     int barrier, flush, discard;
> > +     int barrier, flush, discard, persistent;
> >
> >       switch (info->connected) {
> >       case BLKIF_STATE_CONNECTED:
> > @@ -1297,6 +1420,19 @@ static void blkfront_connect(struct blkfront_info *info)
> >               info->flush_op = BLKIF_OP_FLUSH_DISKCACHE;
> >       }
> >
> > +     /*
> > +      * Are we dealing with an old blkback that will unmap
> > +      * all grefs?
> > +      */
> > +     err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> > +                         "feature-persistent-grants", "%d", &persistent,
> > +                         NULL);
> > +
> > +     if (err)
> > +             info->feature_persistent = 0;
> > +     else
> > +             info->feature_persistent = persistent;
> > +
> >       err = xenbus_gather(XBT_NIL, info->xbdev->otherend,
> >                           "feature-discard", "%d", &discard,
> >                           NULL);
> > diff --git a/include/xen/interface/io/blkif.h b/include/xen/interface/io/blkif.h
> > index ee338bf..cd267a9b 100644
> > --- a/include/xen/interface/io/blkif.h
> > +++ b/include/xen/interface/io/blkif.h
> > @@ -109,6 +109,15 @@ typedef uint64_t blkif_sector_t;
> >   */
> >  #define BLKIF_MAX_SEGMENTS_PER_REQUEST 11
> >
> > +/*
> > + * Maximum number of persistent grants that can be mapped by Dom0 for each
> 
> by blkback
> > + * interface. This is set to be the size of the ring, as this is a limit on
> 
> Size of the ring? You sure?
> > + * the number of requests that can be inflight at any one time. 256 imposes
> > + * an overhead of 11MB of mapped kernel space per interface.
> > + */
> > +#define BLKIF_MAX_PERS_REQUESTS_PER_DEV 256
> > +
> > +
> >  struct blkif_request_rw {
> >       uint8_t        nr_segments;  /* number of segments                   */
> >       blkif_vdev_t   handle;       /* only for read/write requests         */
> > --
> > 1.7.9.5

