Am 30.01.2015 um 18:16 schrieb Max Reitz: > On 2015-01-30 at 09:33, Peter Lieven wrote: >> this patch finally introduces multiread support to virtio-blk. While >> multiwrite support was there for a long time, read support was missing. >> >> The complete merge logic is moved into virtio-blk.c which has >> been the only user of request merging ever since. This is required >> to be able to merge chunks of requests and immediately invoke callbacks >> for those requests. Secondly, this is required to switch to >> direct invocation of coroutines which is planned at a later stage. >> >> The following benchmarks show the performance of running fio with >> 4 worker threads on a local ram disk. The numbers show the average >> of 10 test runs after 1 run as warmup phase. >> >> | 4k | 64k | 4k >> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand >> --------------+--------+---------+--------+---------+--------+-------- >> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213 >> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216 >> >> Signed-off-by: Peter Lieven <p...@kamp.de> >> --- >> hw/block/dataplane/virtio-blk.c | 8 +- >> hw/block/virtio-blk.c | 288 >> +++++++++++++++++++++++++++------------- >> include/hw/virtio/virtio-blk.h | 19 +-- >> trace-events | 1 + >> 4 files changed, 210 insertions(+), 106 deletions(-) >> >> diff --git a/hw/block/dataplane/virtio-blk.c >> b/hw/block/dataplane/virtio-blk.c >> index 39c5d71..93472e9 100644 >> --- a/hw/block/dataplane/virtio-blk.c >> +++ b/hw/block/dataplane/virtio-blk.c >> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e) >> event_notifier_test_and_clear(&s->host_notifier); >> blk_io_plug(s->conf->conf.blk); >> for (;;) { >> - MultiReqBuffer mrb = { >> - .num_writes = 0, >> - }; >> + MultiReqBuffer mrb = {}; >> int ret; >> /* Disable guest->host notifies to avoid unnecessary vmexits */ >> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e) >> virtio_blk_handle_request(req, &mrb); >> } >> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb); >> + if (mrb.num_reqs) { >> + virtio_submit_multireq(s->conf->conf.blk, &mrb); >> + } >> if (likely(ret == -EAGAIN)) { /* vring emptied */ >> /* Re-enable guest->host notifies and stop processing the >> vring. >> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c >> index e04adb8..77890a0 100644 >> --- a/hw/block/virtio-blk.c >> +++ b/hw/block/virtio-blk.c >> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s) >> req->dev = s; >> req->qiov.size = 0; >> req->next = NULL; >> + req->mr_next = NULL; >> return req; >> } >> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq >> *req, int error, >> static void virtio_blk_rw_complete(void *opaque, int ret) >> { >> - VirtIOBlockReq *req = opaque; >> + VirtIOBlockReq *next = opaque; >> - trace_virtio_blk_rw_complete(req, ret); >> + while (next) { >> + VirtIOBlockReq *req = next; >> + next = req->mr_next; >> + trace_virtio_blk_rw_complete(req, ret); >> - if (ret) { >> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type); >> - bool is_read = !(p & VIRTIO_BLK_T_OUT); >> - if (virtio_blk_handle_rw_error(req, -ret, is_read)) >> - return; >> - } >> + if (req->qiov.nalloc != -1) { >> + qemu_iovec_destroy(&req->qiov); >> + } >> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK); >> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct); >> - virtio_blk_free_request(req); >> + if (ret) { >> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type); >> + bool is_read = !(p & VIRTIO_BLK_T_OUT); >> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) { >> + continue; >> + } >> + } >> + >> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK); >> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct); >> + virtio_blk_free_request(req); >> + } >> } >> static void virtio_blk_flush_complete(void *opaque, int ret) >> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq >> *req) >> } >> } >> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb) >> +static inline void > > Is the "inline" really worth it? This function is rather long...
I tried really hard to not add a regression for random reads and the difference between good and bad here is sometimes just a matter of a few small changes. If gcc inlines on its own anyway would you mind to keep the inline? The function is actually only called from 2 places. > > (To my surprise, gcc actually does inline the function) > >> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb, > > Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or > something like that (it's static, so you can drop the virtio_ prefix if you > want to) would be better? Sure, you are right. > >> + int start, int num_reqs, int niov) >> { >> - int i, ret; >> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov; >> + int64_t sector_num = mrb->reqs[start]->sector_num; >> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE; >> + bool is_write = mrb->is_write; >> + >> + if (num_reqs > 1) { >> + int i; >> + struct iovec *_iov = qiov->iov; >> + int _niov = qiov->niov; > > Identifiers with leading underscores are considered reserved, I'd rather > avoid using them. Okay. I will use tmp_iov and tmp_niov instead. > >> + >> + qemu_iovec_init(qiov, niov); >> + >> + for (i = 0; i < _niov; i++) { >> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len); >> + } >> - if (!mrb->num_writes) { >> + for (i = start + 1; i < start + num_reqs; i++) { >> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0, >> + mrb->reqs[i]->qiov.size); >> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i]; >> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE; >> + } >> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE); >> + >> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num, >> + nb_sectors, is_write); >> + block_acct_merge_done(blk_get_stats(blk), >> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ, >> + num_reqs - 1); >> + } >> + >> + if (is_write) { >> + blk_aio_writev(blk, sector_num, qiov, nb_sectors, >> + virtio_blk_rw_complete, mrb->reqs[start]); >> + } else { >> + blk_aio_readv(blk, sector_num, qiov, nb_sectors, >> + virtio_blk_rw_complete, mrb->reqs[start]); >> + } >> +} >> + >> +static int virtio_multireq_compare(const void *a, const void *b) >> +{ >> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a, >> + *req2 = *(VirtIOBlockReq **)b; >> + >> + /* >> + * Note that we can't simply subtract sector_num1 from sector_num2 >> + * here as that could overflow the return value. >> + */ >> + if (req1->sector_num > req2->sector_num) { >> + return 1; >> + } else if (req1->sector_num < req2->sector_num) { >> + return -1; >> + } else { >> + return 0; >> + } >> +} >> + >> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb) >> +{ >> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0, >> + max_xfer_len = 0; > > Albeit not necessary, starting a new "int" line won't take any more space > (and would look nicer to me, probably). okay, I don't mind. > >> + int64_t sector_num = 0; >> + >> + if (mrb->num_reqs == 1) { >> + virtio_submit_multireq2(blk, mrb, 0, 1, -1); >> + mrb->num_reqs = 0; >> return; >> } >> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes); >> - if (ret != 0) { >> - for (i = 0; i < mrb->num_writes; i++) { >> - if (mrb->blkreq[i].error) { >> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO); >> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk); >> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX); >> + >> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs), >> + &virtio_multireq_compare); >> + >> + for (i = 0; i < mrb->num_reqs; i++) { >> + VirtIOBlockReq *req = mrb->reqs[i]; >> + if (num_reqs > 0) { >> + bool merge = true; >> + >> + /* merge would exceed maximum number of IOVs */ >> + if (niov + req->qiov.niov + 1 > IOV_MAX) { > > Hm, why the +1? A really good question. I copied this piece from the old merge routine. It seems definetely wrong. > >> + merge = false; >> + } >> + >> + /* merge would exceed maximum transfer length of backend device >> */ >> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > >> max_xfer_len) { >> + merge = false; >> + } >> + >> + /* requests are not sequential */ >> + if (sector_num + nb_sectors != req->sector_num) { >> + merge = false; >> + } >> + >> + if (!merge) { >> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov); >> + num_reqs = 0; >> } >> } >> + >> + if (num_reqs == 0) { >> + sector_num = req->sector_num; >> + nb_sectors = niov = 0; >> + start = i; >> + } >> + >> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE; >> + niov += req->qiov.niov; >> + num_reqs++; >> } >> - mrb->num_writes = 0; >> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov); >> + mrb->num_reqs = 0; >> } >> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer >> *mrb) >> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, >> MultiReqBuffer *mrb) >> /* >> * Make sure all outstanding writes are posted to the backing device. >> */ >> - virtio_submit_multiwrite(req->dev->blk, mrb); >> + if (mrb->is_write && mrb->num_reqs > 0) { >> + virtio_submit_multireq(req->dev->blk, mrb); >> + } >> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req); >> } >> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev, >> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS; >> uint64_t total_sectors; >> + if (nb_sectors > INT_MAX) { >> + return false; >> + } >> if (sector & dev->sector_mask) { >> return false; >> } >> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev, >> return true; >> } >> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer >> *mrb) >> -{ >> - BlockRequest *blkreq; >> - uint64_t sector; >> - >> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector); >> - >> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512); >> - >> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) { >> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR); >> - virtio_blk_free_request(req); >> - return; >> - } >> - >> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, >> req->qiov.size, >> - BLOCK_ACCT_WRITE); >> - >> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) { >> - virtio_submit_multiwrite(req->dev->blk, mrb); >> - } >> - >> - blkreq = &mrb->blkreq[mrb->num_writes]; >> - blkreq->sector = sector; >> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE; >> - blkreq->qiov = &req->qiov; >> - blkreq->cb = virtio_blk_rw_complete; >> - blkreq->opaque = req; >> - blkreq->error = 0; >> - >> - mrb->num_writes++; >> -} >> - >> -static void virtio_blk_handle_read(VirtIOBlockReq *req) >> -{ >> - uint64_t sector; >> - >> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector); >> - >> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512); >> - >> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) { >> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR); >> - virtio_blk_free_request(req); >> - return; >> - } >> - >> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, >> req->qiov.size, >> - BLOCK_ACCT_READ); >> - blk_aio_readv(req->dev->blk, sector, &req->qiov, >> - req->qiov.size / BDRV_SECTOR_SIZE, >> - virtio_blk_rw_complete, req); >> -} >> - >> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) >> { >> uint32_t type; >> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, >> MultiReqBuffer *mrb) >> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type); >> - if (type & VIRTIO_BLK_T_FLUSH) { >> + switch (type & 0xff) { > > Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too. The problem is that type encodes operations and flags and there is no official mask to distinguish them in the specs defined. If anytime a new flag is defined it will break the switch statement. I could add a macro for the 0xff in the headers if you like? > >> + case VIRTIO_BLK_T_IN: >> + case VIRTIO_BLK_T_OUT: >> + { >> + bool is_write = type & VIRTIO_BLK_T_OUT; >> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev), >> + &req->out.sector); >> + >> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num, >> + req->qiov.size)) { >> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR); >> + virtio_blk_free_request(req); >> + return; >> + } >> + >> + if (is_write) { >> + qemu_iovec_init_external(&req->qiov, iov, out_num); >> + trace_virtio_blk_handle_write(req, req->sector_num, >> + req->qiov.size / >> BDRV_SECTOR_SIZE); >> + } else { >> + qemu_iovec_init_external(&req->qiov, in_iov, in_num); >> + trace_virtio_blk_handle_read(req, req->sector_num, >> + req->qiov.size / BDRV_SECTOR_SIZE); >> + } > > Before this patch, req->qiov (and subsequently req->qiov.size) was > initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size > as an argument). Is it fine to change that? Oops, thats definetly a mistake. Half of the checks in virtio_blk_sect_range_ok won't work with the size set. > >> + >> + block_acct_start(blk_get_stats(req->dev->blk), >> + &req->acct, req->qiov.size, >> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ); >> + >> + /* merge would exceed maximum number of requests or IO direction >> + * changes */ >> + if (mrb->num_reqs > 0 && (mrb->num_reqs == >> VIRTIO_BLK_MAX_MERGE_REQS || >> + is_write != mrb->is_write)) { >> + virtio_submit_multireq(req->dev->blk, mrb); >> + } >> + >> + mrb->reqs[mrb->num_reqs++] = req; > > Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before > this. Feel free to ignore my request. This can't happen and the check for this is right above. Of what scenario do you think of that could overflow num_reqs? > >> + mrb->is_write = is_write; >> + break; >> + } >> + case VIRTIO_BLK_T_FLUSH: > > I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type & > VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_..."). VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT. > >> virtio_blk_handle_flush(req, mrb); >> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) { >> + break; >> + case VIRTIO_BLK_T_SCSI_CMD: > > And VIRTIO_BLK_T_SCSI_CMD_OUT here. See above. I will define them in the headers. > > > The overall logic of this patch seems good to me (although I had some trouble > following which QIOVs had which purpose, which were allocated and which were > not, and so on...). If you tell me at which point it was most difficult I can try to add comments. Thanks for reviewing, Peter