On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
> This patch introduces bio-based IO path for virtio-blk.
> 
> Compared to request-based IO path, bio-based IO path uses driver
> provided ->make_request_fn() method to bypasses the IO scheduler. It
> handles the bio to device directly without allocating a request in block
> layer. This reduces the IO path in guest kernel to achieve high IOPS
> and lower latency. The downside is that guest can not use the IO
> scheduler to merge and sort requests. However, this is not a big problem
> if the backend disk in host side uses faster disk device.

If this optimization depends on the host, then it
should be reported to the guest using a feature bit,
as opposed to being guest driven.

> When the bio-based IO path is not enabled, virtio-blk still uses the
> original request-based IO path, no performance difference is observed.
> 
> Performance evaluation:
> -----------------------------
> 1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
> kvm tool.
> 
> Short version:
>  With bio-based IO path, sequential read/write, random read/write
>  IOPS boost         : 28%, 24%, 21%, 16%
>  Latency improvement: 32%, 17%, 21%, 16%
> 
> Long version:
>  With bio-based IO path:
>   seq-read  : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
>   seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
>   rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
>   rand-write: io=3095.7MB, bw=96198KB/s,  iops=192396 , runt= 32952msec
>     clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
>     clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
>     clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
>     clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
>   cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
>   cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
>   cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
>   cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
>  With request-based IO path:
>   seq-read  : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
>   seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
>   rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
>   rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
>     clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
>     clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
>     clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
>     clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
>   cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
>   cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
>   cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
>   cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985


So bio based causes cpu to jump up by some 30%?
What happens if you divide IOPS/CPU?
Any improvement that comes from increasing the cpu share
of the given guest on the host will not scale well on
a typical overcommitted host.

> 2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
> kvm tool.
> 
> Short version:
>  With bio-based IO path, sequential read/write, random read/write
>  IOPS boost         : 11%, 11%, 13%, 10%
>  Latency improvement: 10%, 10%, 12%, 10%
> Long Version:
>  With bio-based IO path:
>   read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
>   write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
>   read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
>   write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
>     clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
>     clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
>     clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
>     clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
>   cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
>   cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
>   cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
>   cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
>  With request-based IO path:
>   read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
>   write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
>   read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
>   write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
>     clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
>     clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
>     clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
>     clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
>   cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
>   cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
>   cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
>   cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409


Is this host or guest cpu? We should probably measure host cpu
as this includes device overhead which could vary by load.

> How to use:
> -----------------------------
> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
> use_bio=1' to enable ->make_request_fn() based I/O path.
> 
> Cc: Rusty Russell <ru...@rustcorp.com.au>
> Cc: "Michael S. Tsirkin" <m...@redhat.com>
> Cc: virtualizat...@lists.linux-foundation.org
> Cc: k...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Acked-by: Rusty Russell <ru...@rustcorp.com.au>
> Signed-off-by: Christoph Hellwig <h...@lst.de>
> Signed-off-by: Minchan Kim <minchan....@gmail.com>
> Signed-off-by: Asias He <as...@redhat.com>
> ---
>  drivers/block/virtio_blk.c |  203 
> +++++++++++++++++++++++++++++++++++---------
>  1 file changed, 163 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c0bbeb4..95cfeed 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -14,6 +14,9 @@
>  
>  #define PART_BITS 4
>  
> +static bool use_bio;
> +module_param(use_bio, bool, S_IRUGO);
> +
>  static int major;
>  static DEFINE_IDA(vd_index_ida);
>  
> @@ -23,6 +26,7 @@ struct virtio_blk
>  {
>       struct virtio_device *vdev;
>       struct virtqueue *vq;
> +     wait_queue_head_t queue_wait;
>  
>       /* The disk structure for the kernel. */
>       struct gendisk *disk;
> @@ -51,53 +55,87 @@ struct virtio_blk
>  struct virtblk_req
>  {
>       struct request *req;
> +     struct bio *bio;
>       struct virtio_blk_outhdr out_hdr;
>       struct virtio_scsi_inhdr in_hdr;
>       u8 status;
> +     struct scatterlist sg[];
>  };
>  
> -static void blk_done(struct virtqueue *vq)
> +static inline int virtblk_result(struct virtblk_req *vbr)
> +{
> +     switch (vbr->status) {
> +     case VIRTIO_BLK_S_OK:
> +             return 0;
> +     case VIRTIO_BLK_S_UNSUPP:
> +             return -ENOTTY;
> +     default:
> +             return -EIO;
> +     }
> +}
> +
> +static inline void virtblk_request_done(struct virtio_blk *vblk,
> +                                     struct virtblk_req *vbr)
> +{
> +     struct request *req = vbr->req;
> +     int error = virtblk_result(vbr);
> +
> +     if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
> +             req->resid_len = vbr->in_hdr.residual;
> +             req->sense_len = vbr->in_hdr.sense_len;
> +             req->errors = vbr->in_hdr.errors;
> +     } else if (req->cmd_type == REQ_TYPE_SPECIAL) {
> +             req->errors = (error != 0);
> +     }
> +
> +     __blk_end_request_all(req, error);
> +     mempool_free(vbr, vblk->pool);
> +}
> +
> +static inline void virtblk_bio_done(struct virtio_blk *vblk,
> +                                 struct virtblk_req *vbr)
> +{
> +     bio_endio(vbr->bio, virtblk_result(vbr));
> +     mempool_free(vbr, vblk->pool);
> +}
> +
> +static void virtblk_done(struct virtqueue *vq)
>  {
>       struct virtio_blk *vblk = vq->vdev->priv;
> +     unsigned long bio_done = 0, req_done = 0;
>       struct virtblk_req *vbr;
> -     unsigned int len;
>       unsigned long flags;
> +     unsigned int len;
>  
>       spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
>       while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
> -             int error;
> -
> -             switch (vbr->status) {
> -             case VIRTIO_BLK_S_OK:
> -                     error = 0;
> -                     break;
> -             case VIRTIO_BLK_S_UNSUPP:
> -                     error = -ENOTTY;
> -                     break;
> -             default:
> -                     error = -EIO;
> -                     break;
> -             }
> -
> -             switch (vbr->req->cmd_type) {
> -             case REQ_TYPE_BLOCK_PC:
> -                     vbr->req->resid_len = vbr->in_hdr.residual;
> -                     vbr->req->sense_len = vbr->in_hdr.sense_len;
> -                     vbr->req->errors = vbr->in_hdr.errors;
> -                     break;
> -             case REQ_TYPE_SPECIAL:
> -                     vbr->req->errors = (error != 0);
> -                     break;
> -             default:
> -                     break;
> +             if (vbr->bio) {
> +                     virtblk_bio_done(vblk, vbr);
> +                     bio_done++;
> +             } else {
> +                     virtblk_request_done(vblk, vbr);
> +                     req_done++;
>               }
> -
> -             __blk_end_request_all(vbr->req, error);
> -             mempool_free(vbr, vblk->pool);
>       }
>       /* In case queue is stopped waiting for more buffers. */
> -     blk_start_queue(vblk->disk->queue);
> +     if (req_done)
> +             blk_start_queue(vblk->disk->queue);
>       spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
> +
> +     if (bio_done)
> +             wake_up(&vblk->queue_wait);
> +}
> +
> +static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
> +                                                 gfp_t gfp_mask)
> +{
> +     struct virtblk_req *vbr;
> +
> +     vbr = mempool_alloc(vblk->pool, gfp_mask);
> +     if (vbr && use_bio)
> +             sg_init_table(vbr->sg, vblk->sg_elems);
> +
> +     return vbr;
>  }
>  
>  static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
> @@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct 
> virtio_blk *vblk,
>       unsigned long num, out = 0, in = 0;
>       struct virtblk_req *vbr;
>  
> -     vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
> +     vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
>       if (!vbr)
>               /* When another request finishes we'll try again. */
>               return false;
>  
>       vbr->req = req;
> -
> +     vbr->bio = NULL;
>       if (req->cmd_flags & REQ_FLUSH) {
>               vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
>               vbr->out_hdr.sector = 0;
> @@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct 
> virtio_blk *vblk,
>               }
>       }
>  
> -     if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
> +     if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
> +                           GFP_ATOMIC) < 0) {
>               mempool_free(vbr, vblk->pool);
>               return false;
>       }
> @@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct 
> virtio_blk *vblk,
>       return true;
>  }
>  
> -static void do_virtblk_request(struct request_queue *q)
> +static void virtblk_request(struct request_queue *q)
>  {
>       struct virtio_blk *vblk = q->queuedata;
>       struct request *req;
> @@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
>               virtqueue_kick(vblk->vq);
>  }
>  
> +static void virtblk_add_buf_wait(struct virtio_blk *vblk,
> +                              struct virtblk_req *vbr,
> +                              unsigned long out,
> +                              unsigned long in)
> +{
> +     DEFINE_WAIT(wait);
> +
> +     for (;;) {
> +             prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
> +                                       TASK_UNINTERRUPTIBLE);
> +
> +             spin_lock_irq(vblk->disk->queue->queue_lock);
> +             if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> +                                   GFP_ATOMIC) < 0) {
> +                     spin_unlock_irq(vblk->disk->queue->queue_lock);
> +                     io_schedule();
> +             } else {
> +                     virtqueue_kick(vblk->vq);
> +                     spin_unlock_irq(vblk->disk->queue->queue_lock);
> +                     break;
> +             }
> +
> +     }
> +
> +     finish_wait(&vblk->queue_wait, &wait);
> +}
> +
> +static void virtblk_make_request(struct request_queue *q, struct bio *bio)
> +{
> +     struct virtio_blk *vblk = q->queuedata;
> +     unsigned int num, out = 0, in = 0;
> +     struct virtblk_req *vbr;
> +
> +     BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
> +     BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> +
> +     vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> +     if (!vbr) {
> +             bio_endio(bio, -ENOMEM);
> +             return;
> +     }
> +
> +     vbr->bio = bio;
> +     vbr->req = NULL;
> +     vbr->out_hdr.type = 0;
> +     vbr->out_hdr.sector = bio->bi_sector;
> +     vbr->out_hdr.ioprio = bio_prio(bio);
> +
> +     sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
> +
> +     num = blk_bio_map_sg(q, bio, vbr->sg + out);
> +
> +     sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
> +                sizeof(vbr->status));
> +
> +     if (num) {
> +             if (bio->bi_rw & REQ_WRITE) {
> +                     vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
> +                     out += num;
> +             } else {
> +                     vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
> +                     in += num;
> +             }
> +     }
> +
> +     spin_lock_irq(vblk->disk->queue->queue_lock);
> +     if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> +                                    GFP_ATOMIC) < 0)) {
> +             spin_unlock_irq(vblk->disk->queue->queue_lock);
> +             virtblk_add_buf_wait(vblk, vbr, out, in);
> +             return;
> +     }
> +     virtqueue_kick(vblk->vq);
> +     spin_unlock_irq(vblk->disk->queue->queue_lock);
> +}
> +
>  /* return id (s/n) string for *disk to *id_str
>   */
>  static int virtblk_get_id(struct gendisk *disk, char *id_str)
> @@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
>       int err = 0;
>  
>       /* We expect one virtqueue, for output. */
> -     vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
> +     vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
>       if (IS_ERR(vblk->vq))
>               err = PTR_ERR(vblk->vq);
>  
> @@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct 
> virtio_device *vdev)
>       u8 writeback = virtblk_get_cache_mode(vdev);
>       struct virtio_blk *vblk = vdev->priv;
>  
> -     if (writeback)
> +     if (writeback && !use_bio)
>               blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
>       else
>               blk_queue_flush(vblk->disk->queue, 0);
> @@ -477,6 +592,8 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>       struct virtio_blk *vblk;
>       struct request_queue *q;
>       int err, index;
> +     int pool_size;
> +
>       u64 cap;
>       u32 v, blk_size, sg_elems, opt_io_size;
>       u16 min_io_size;
> @@ -506,10 +623,12 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>               goto out_free_index;
>       }
>  
> +     init_waitqueue_head(&vblk->queue_wait);
>       vblk->vdev = vdev;
>       vblk->sg_elems = sg_elems;
>       sg_init_table(vblk->sg, vblk->sg_elems);
>       mutex_init(&vblk->config_lock);
> +
>       INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
>       vblk->config_enable = true;
>  
> @@ -517,7 +636,10 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>       if (err)
>               goto out_free_vblk;
>  
> -     vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> +     pool_size = sizeof(struct virtblk_req);
> +     if (use_bio)
> +             pool_size += sizeof(struct scatterlist) * sg_elems;
> +     vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
>       if (!vblk->pool) {
>               err = -ENOMEM;
>               goto out_free_vq;
> @@ -530,12 +652,14 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>               goto out_mempool;
>       }
>  
> -     q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
> +     q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
>       if (!q) {
>               err = -ENOMEM;
>               goto out_put_disk;
>       }
>  
> +     if (use_bio)
> +             blk_queue_make_request(q, virtblk_make_request);
>       q->queuedata = vblk;
>  
>       virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
> @@ -620,7 +744,6 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>       if (!err && opt_io_size)
>               blk_queue_io_opt(q, blk_size * opt_io_size);
>  
> -
>       add_disk(vblk->disk);
>       err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
>       if (err)
> -- 
> 1.7.10.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to