[PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
Problem statement :
Whenever we try to get an outstanding request via scsi_host_find_tag, the block layer returns stale entries instead of the actual outstanding request. The kernel panics if a stale entry is inaccessible or its memory has been reused.

Fix :
Undo the request mapping in blk_mq_put_driver_tag once the request is returned.

More detail :
Whenever an SDEV entry is created, the block layer allocates separate tags and static requests. Those requests are no longer valid after the SDEV is deleted from the system. On the fly, the block layer maps static rqs to rqs as below from blk_mq_get_driver_tag():

	data.hctx->tags->rqs[rq->tag] = rq;

The above mapping is active for in-use requests, and it is the same mapping that is referred to in scsi_host_find_tag(). After running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some entries which are never reset in the block layer. There would be a kernel panic if a request pointed to by "data.hctx->tags->rqs[rq->tag]" is part of an "sdev" which has been removed, since all the request memory associated with that sdev might be reused or inaccessible to the driver.

Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops: [#1] SMP
Call Trace:
[] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
[] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc:
Signed-off-by: Kashyap Desai
Signed-off-by: Sreekanth Reddy
---
 block/blk-mq.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
+	hctx->tags->rqs[rq->tag] = NULL;
 	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = -1;
--
1.8.3.1
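The lifecycle the patch describes (slot written on get_driver_tag, never cleared on put) can be modeled outside the kernel. This is a hypothetical Python sketch, not kernel code: the class names and strings are made up, but it shows why a lookup table that is only ever written on tag acquisition hands back stale entries after the request is freed, and how clearing the slot on put (the patch's one-line fix) avoids that.

```python
class TagSet:
    """Toy model of blk-mq's hctx->tags->rqs[] tag-to-request table."""

    def __init__(self, depth):
        self.rqs = [None] * depth  # tags->rqs: tag -> request mapping

    def get_driver_tag(self, tag, rq):
        # Mirrors blk_mq_get_driver_tag(): data.hctx->tags->rqs[rq->tag] = rq
        self.rqs[tag] = rq

    def put_driver_tag(self, tag, clear=False):
        # Without the patch the slot keeps pointing at the old request.
        if clear:  # the patch: hctx->tags->rqs[rq->tag] = NULL
            self.rqs[tag] = None

    def find_tag(self, tag):
        # scsi_host_find_tag() reads the same table
        return self.rqs[tag]

tags = TagSet(4)
tags.get_driver_tag(0, "rq-from-sdev-A")
tags.put_driver_tag(0)              # request completed; sdev later deleted
stale = tags.find_tag(0)            # still "rq-from-sdev-A": a stale pointer

tags.get_driver_tag(1, "rq-from-sdev-B")
tags.put_driver_tag(1, clear=True)  # with the fix, lookup sees no request
assert stale == "rq-from-sdev-A"
assert tags.find_tag(1) is None
```

Note the model also makes Bart's later objection visible: nothing stops find_tag() from racing with a concurrent get/put on the same slot, which is why clearing alone is not a complete fix.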
Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote: > Problem statement : > Whenever try to get outstanding request via scsi_host_find_tag, > block layer will return stale entries instead of actual outstanding > request. Kernel panic if stale entry is inaccessible or memory is reused. > Fix : > Undo request mapping in blk_mq_put_driver_tag nce request is return. > > More detail : > Whenever each SDEV entry is created, block layer allocate separate tags > and static requestis.Those requests are not valid after SDEV is deleted > from the system. On the fly, block layer maps static rqs to rqs as below > from blk_mq_get_driver_tag() > > data.hctx->tags->rqs[rq->tag] = rq; > > Above mapping is active in-used requests and it is the same mapping which > is referred in function scsi_host_find_tag(). > After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some > entries which will never be reset in block layer. However, if rq & rq->tag is valid, data.hctx->tags->rqs[rq->tag] should have pointed to one active request instead of the stale one, right? Thanks, Ming
Re: [PATCH v5 0/5] lightnvm: Flexible metadata
On Mon, Dec 3, 2018 at 9:51 AM Hans Holmberg wrote:
>
> Great! The tests (rocksdb, pblk recovery and the generic xfs suite)
> completed successfully on one of our disks, so feel free to add:
>
> Tested-by: Hans Holmberg
>

The tests completed fine on our hardware (which has metadata support), but fail when I run the code on a qemu disk without metadata support. I started to look into what goes wrong, and it seems the ppa list gets corrupted during smeta and padding writes. Building the kernel with CONFIG_NVM_PBLK_DEBUG, I see these errors:

[  159.735409] pblk : ppa: (boundary: 1) cache line: 9223372036854775807

I also got this during recovery:

[ 5384.511939] BUG: KASAN: use-after-free in pblk_get_packed_meta+0x5e/0x151
[ 5384.513036] Read of size 8 at addr 8881edc220e0 by task nvme/9147

There might of course be something very wrong with my qemu setup, but it looks more like something went wrong in the later revisions of the patch set (I tested one of the first versions and did not see these issues then).

Igor: If you want it, I can share my qemu environment, more debugging info and a couple of fixups I've done while looking into this.

Thanks,
Hans

> Thanks,
> Hans
> On Fri, Nov 30, 2018 at 2:03 PM Hans Holmberg wrote:
> >
> > I just started a regression test on this patch set that'll run over
> > the weekend. I'll add a tested-by if everything checks out.
> >
> > All the best,
> > Hans
> > On Fri, Nov 30, 2018 at 12:49 PM Igor Konopko wrote:
> > >
> > > This series of patches extends the way how pblk can
> > > store L2P sector metadata. After this set of changes
> > > any size of NVMe metadata is supported in pblk.
> > > Also there is support for the case without NVMe metadata.
> > > > > > Changes v4 --> v5: > > > -rebase on top of ocssd/for-4.21/core > > > > > > Changes v3 --> v4: > > > -rename nvm_alloc_dma_pool() to nvm_create_dma_pool() > > > -split pblk_get_meta() calls and lba setting into > > > two operations for better core readability > > > -fixing compilation with CONFIG_NVM disabled > > > -getting rid of unnecessary memcpy for packed metadata > > > on write path > > > -support for drives with oob size >0 and <16B in packed > > > metadata mode > > > -minor commit message updates > > > > > > Changes v2 --> v3: > > > -Rebase on top of ocssd/for-4.21/core > > > -get/set_meta_lba helpers were removed > > > -dma reallocation was replaced with single allocation > > > -oob metadata size was added to pblk structure > > > -proper checks on pblk creation were added > > > > > > Changes v1 --> v2: > > > -Revert sector meta size back to 16b for pblk > > > -Dma pool for larger oob meta are handled in core instead of pblk > > > -Pblk oob meta helpers uses __le64 as input outpu instead of u64 > > > -Other minor fixes based on v1 patch review > > > > > > Igor Konopko (5): > > > lightnvm: pblk: Move lba list to partial read context > > > lightnvm: pblk: Helpers for OOB metadata > > > lightnvm: Flexible DMA pool entry size > > > lightnvm: Disable interleaved metadata > > > lightnvm: pblk: Support for packed metadata > > > > > > drivers/lightnvm/core.c | 9 -- > > > drivers/lightnvm/pblk-core.c | 61 > > > +++-- > > > drivers/lightnvm/pblk-init.c | 44 +-- > > > drivers/lightnvm/pblk-map.c | 20 +++- > > > drivers/lightnvm/pblk-rb.c | 3 ++ > > > drivers/lightnvm/pblk-read.c | 66 > > > +++- > > > drivers/lightnvm/pblk-recovery.c | 25 +-- > > > drivers/lightnvm/pblk-sysfs.c| 7 + > > > drivers/lightnvm/pblk-write.c| 9 +++--- > > > drivers/lightnvm/pblk.h | 24 +-- > > > drivers/nvme/host/lightnvm.c | 6 ++-- > > > include/linux/lightnvm.h | 3 +- > > > 12 files changed, 209 insertions(+), 68 deletions(-) > > > > > > -- > > > 2.14.5 > > >
Re: [PATCH 02/27] aio: clear IOCB_HIPRI
On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote: > On 11/30/18 10:13 AM, Christoph Hellwig wrote: > > I think we'll need to queue this up for 4.21 ASAP independent of the > > rest, given that with separate poll queues userspace could otherwise > > submit I/O that will never get polled for anywhere. > > Probably a good idea, I can just move it to my 4.21 branch, it's not > strictly dependent on the series. So, can you add it to the 4.21 branch?
Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote: > > Setting REQ_NOWAIT from inside the block layer will make the code that > > submits requests harder to review. Have you considered to make this code > > fail I/O if REQ_NOWAIT has not been set and to require that the context > > that submits I/O sets REQ_NOWAIT? > > It's technically still feasible to do for sync polled IO, it's only > the async case that makes it a potential deadlock. I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets REQ_POLL and REQ_NOWAIT to make this blindly obvious.
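The compound-flag idea floated above can be sketched in a toy model. This is a hypothetical Python illustration with made-up bit values; the real REQ_* flags are C macros in include/linux/blk_types.h, and no REQ_ASYNC_POLL exists at the time of this thread.

```python
# Made-up bit positions for illustration only; the real values are
# defined in include/linux/blk_types.h.
REQ_NOWAIT = 1 << 0   # fail fast instead of blocking during submission
REQ_HIPRI  = 1 << 1   # caller will poll for this request's completion

# The proposed compound flag: asking for async polled IO always implies
# NOWAIT, since blocking inside submission could deadlock the poller.
REQ_ASYNC_POLL = REQ_HIPRI | REQ_NOWAIT

flags = REQ_ASYNC_POLL
assert flags & REQ_HIPRI
assert flags & REQ_NOWAIT
```

The point of the #define is purely documentary: a submitter that sets the compound flag cannot forget the NOWAIT half of the invariant.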
Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
On 12/4/18 2:00 AM, Kashyap Desai wrote:
> Problem statement :
> Whenever try to get outstanding request via scsi_host_find_tag,
> block layer will return stale entries instead of actual outstanding
> request. Kernel panic if stale entry is inaccessible or memory is reused.
>
> Fix :
> Undo request mapping in blk_mq_put_driver_tag nce request is return.
>
> More detail :
> Whenever each SDEV entry is created, block layer allocate separate tags
> and static requestis.Those requests are not valid after SDEV is deleted
> from the system. On the fly, block layer maps static rqs to rqs as below
> from blk_mq_get_driver_tag()
>
> data.hctx->tags->rqs[rq->tag] = rq;
>
> Above mapping is active in-used requests and it is the same mapping which
> is referred in function scsi_host_find_tag().
> After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
> entries which will never be reset in block layer.
>
> There would be a kernel panic, If request pointing to
> “data.hctx->tags->rqs[rq->tag]” is part of “sdev” which is removed and as
> part of that all the memory allocation of request associated with that
> sdev might be reused or inaccessible to the driver.
>
> Kernel panic snippet -
>
> BUG: unable to handle kernel paging request at ff800010
> IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
> PGD aa4414067 PUD 0
> Oops: [#1] SMP
> Call Trace:
> [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
> [] scsih_shutdown+0x55/0x100 [mpt3sas]
>
> Cc:
> Signed-off-by: Kashyap Desai
> Signed-off-by: Sreekanth Reddy
> ---
>  block/blk-mq.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9497b47..57432be 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>  					   struct request *rq)
>  {
> +	hctx->tags->rqs[rq->tag] = NULL;
>  	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
>  	rq->tag = -1;

No SCSI driver should call scsi_host_find_tag() after a request has finished. The above patch introduces yet another race and hence can't be a proper fix.

Bart.
Re: [PATCH 10/27] aio: don't zero entire aio_kiocb in aio_get_req()
> - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO); > - if (unlikely(!req)) > - return NULL; > + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); > + if (req) { > + percpu_ref_get(&ctx->reqs); > + req->ki_ctx = ctx; > + INIT_LIST_HEAD(&req->ki_list); > + refcount_set(&req->ki_refcnt, 0); > + req->ki_eventfd = NULL; > + } > > - percpu_ref_get(&ctx->reqs); > - INIT_LIST_HEAD(&req->ki_list); > - refcount_set(&req->ki_refcnt, 0); > - req->ki_ctx = ctx; > return req; Why the reformatting? Otherwise this looks fine to me: Reviewed-by: Christoph Hellwig
Re: [PATCH 11/27] aio: only use blk plugs for > 2 depth submissions
On Fri, Nov 30, 2018 at 09:56:30AM -0700, Jens Axboe wrote: > Plugging is meant to optimize submission of a string of IOs, if we don't > have more than 2 being submitted, don't bother setting up a plug. > > Signed-off-by: Jens Axboe Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH 12/27] aio: use iocb_put() instead of open coding it
On Fri, Nov 30, 2018 at 09:56:31AM -0700, Jens Axboe wrote: > Replace the percpu_ref_put() + kmem_cache_free() with a call to > iocb_put() instead. > > Signed-off-by: Jens Axboe Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
On Fri, Nov 30, 2018 at 07:21:02PM +, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote:
> > For an ITER_KVEC, we can just iterate the iov and add the pages
> > to the bio directly.
>
> > +		page = virt_to_page(kv->iov_base);
> > +		size = bio_add_page(bio, page, kv->iov_len,
> > +				    offset_in_page(kv->iov_base));
>
> Who said that you *can* do virt_to_page() on those? E.g. vmalloc()'ed
> addresses are fine for ITER_KVEC, etc.

In this particular case it seems the caller knows what kind of pages we have, but we need to properly document this at the very least.

Note that in the completely generic case iov_base could also point to memory without a struct page at all (e.g. ioremap()ed memory).
Re: [PATCH 01/13] block: move queues types to the block layer
On Mon, Dec 03, 2018 at 04:49:56PM -0800, Sagi Grimberg wrote:
>> @@ -103,12 +101,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>>  					     unsigned int flags,
>>  					     unsigned int cpu)
>>  {
>> -	int hctx_type = 0;
>> +	enum hctx_type type = HCTX_TYPE_DEFAULT;
>> +
>> +	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
>> +	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
>> +		type = HCTX_TYPE_POLL;
>> -	if (q->mq_ops->rq_flags_to_type)
>> -		hctx_type = q->mq_ops->rq_flags_to_type(q, flags);
>> +	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
>> +		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
>> +		type = HCTX_TYPE_READ;
>
> Nit, there seems to be an extra newline that can be omitted here before
> the else if statement (if I'm reading this correctly)...

Empty lines can always be omitted, but in this case I actually like it as it seems to help readability..
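The selection logic in the quoted blk_mq_map_queue() hunk can be restated as a small sketch. This is a hypothetical Python model of the decision only, not the kernel function; the enum ordering (DEFAULT=0, READ=1, POLL=2) is what makes the `nr_maps >` comparisons work.

```python
# Mirrors the enum hctx_type ordering assumed by the quoted hunk.
HCTX_TYPE_DEFAULT, HCTX_TYPE_READ, HCTX_TYPE_POLL = 0, 1, 2

def map_queue_type(nr_maps, is_read, hipri, queue_poll_enabled):
    """Pick an hctx type the way the quoted blk_mq_map_queue() hunk does.

    nr_maps > HCTX_TYPE_POLL (i.e. >= 3 maps) means a dedicated poll map
    exists; nr_maps > HCTX_TYPE_READ (>= 2) means a dedicated read map
    exists. Otherwise everything falls through to the default map.
    """
    if nr_maps > HCTX_TYPE_POLL and hipri and queue_poll_enabled:
        return HCTX_TYPE_POLL
    if nr_maps > HCTX_TYPE_READ and is_read:
        return HCTX_TYPE_READ
    return HCTX_TYPE_DEFAULT
```

With three maps a polled read lands on the poll map; with only two maps the same request degrades to the read map, and with one map everything shares the default map.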
Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension
Why does this go to linux-block?
Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
On Mon, Dec 03, 2018 at 04:54:15PM -0800, Sagi Grimberg wrote: > >> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev) >> if (nr_io_queues == 0) >> return 0; >> + >> +clear_bit(NVMEQ_ENABLED, &adminq->flags); >> > > This is a change of behavior, looks correct though as we can fail > nvme_setup_irqs after we freed the admin vector. Needs documentation > though.. I have a hard time parsing the above, can you point to where the problem is?
Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension
On Tue, 4 Dec 2018 at 16:01, Christoph Hellwig wrote: > > Why does this go to linux-block? Because xor_blocks() is part of the RAID driver?
Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
On Mon, Dec 03, 2018 at 04:58:25PM -0800, Sagi Grimberg wrote: > >> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag) > > Do we still need to carry the tag around? Yes, the timeout handler polls for a specific tag.
Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension
On Tue, Dec 04, 2018 at 07:01:24AM -0800, Christoph Hellwig wrote: > Why does this go to linux-block? FWIW, I'll take this via arm64 so feel free to ignore. Will
Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
On Mon, Dec 03, 2018 at 05:11:43PM -0800, Sagi Grimberg wrote:
>> If it really becomes an issue we should rework the nvme code to also
>> skip the multipath code for any private namespace, even if that could
>> mean some trouble when rescanning.
>
> This requires some explanation? skip the multipath code how?

We currently always go through the multipath node as long as the controller is multipath capable. If we care about e.g. polling on a private namespace on a dual ported U.2 drive we'd have to make sure we don't go through the multipath device node for private namespaces that can only have one path, but only for shared namespaces.
Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote: > >> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, >> bool shutdown) >> nvme_stop_queues(&dev->ctrl); >> if (!dead && dev->ctrl.queue_count > 0) { >> -nvme_disable_io_queues(dev); >> +if (nvme_disable_io_queues(dev, nvme_admin_delete_sq)) >> + > > Would be nice if the opcode change would be kept inside but still > split like: I actually like not having another wrapper to stop through.. Keith, Jens, any preference?
Re: [PATCH 24/27] block: implement bio helper to add iter kvec pages to bio
On 12/4/18 7:55 AM, Christoph Hellwig wrote: > On Fri, Nov 30, 2018 at 07:21:02PM +, Al Viro wrote: >> On Fri, Nov 30, 2018 at 09:56:43AM -0700, Jens Axboe wrote: >>> For an ITER_KVEC, we can just iterate the iov and add the pages >>> to the bio directly. >> >>> + page = virt_to_page(kv->iov_base); >>> + size = bio_add_page(bio, page, kv->iov_len, >>> + offset_in_page(kv->iov_base)); >> >> Who said that you *can* do virt_to_page() on those? E.g. vmalloc()'ed >> addresses are fine for ITER_KVEC, etc. > > In this particular case the caller it seems the callers knows what kind > of pages we have, but we need to properly document this at the very least. > > Note that in the completely generic case iov_base could also point to > memory without a struct page at all (e.g. ioremap()ed memory). That's why I went to ITER_BVEC instead, that's much saner for this use case. -- Jens Axboe
Re: [PATCH 10/27] aio: don't zero entire aio_kiocb in aio_get_req()
On 12/4/18 7:49 AM, Christoph Hellwig wrote: >> -req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO); >> -if (unlikely(!req)) >> -return NULL; >> +req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); >> +if (req) { >> +percpu_ref_get(&ctx->reqs); >> +req->ki_ctx = ctx; >> +INIT_LIST_HEAD(&req->ki_list); >> +refcount_set(&req->ki_refcnt, 0); >> +req->ki_eventfd = NULL; >> +} >> >> -percpu_ref_get(&ctx->reqs); >> -INIT_LIST_HEAD(&req->ki_list); >> -refcount_set(&req->ki_refcnt, 0); >> -req->ki_ctx = ctx; >> return req; > > Why the reformatting? Otherwise this looks fine to me: > > Reviewed-by: Christoph Hellwig Probably just the (over) abuse of likely/unlikely in aio.c. I can get rid of it. -- Jens Axboe
Re: [PATCH v6 0/2] arm64: provide a NEON-accelerated XOR algorithm extension
On Tue, Dec 04, 2018 at 04:02:36PM +0100, Ard Biesheuvel wrote:
> On Tue, 4 Dec 2018 at 16:01, Christoph Hellwig wrote:
> >
> > Why does this go to linux-block?
>
> Because xor_blocks() is part of the RAID driver?

The only caller of xor_blocks() seems to be btrfs. And the RAID code has its own list anyway..
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote: > >> > > Yes, I'm very much in favour of this, too. > > We always have this IMO slightly weird notion of stopping the queue, set > > some error flags in the driver, then _restarting_ the queue, just so > > that the driver then sees the error flag and terminates the requests. > > Which I always found quite counter-intuitive. > > What about requests that come in after the iteration runs? how are those > terminated? If we've reached a dead state, I think you'd want to start a queue freeze before running the terminating iterator.
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/4/2018 7:46 AM, Keith Busch wrote:
> On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:
>>> Yes, I'm very much in favour of this, too.
>>> We always have this IMO slightly weird notion of stopping the queue,
>>> set some error flags in the driver, then _restarting_ the queue, just
>>> so that the driver then sees the error flag and terminates the
>>> requests. Which I always found quite counter-intuitive.
>>
>> What about requests that come in after the iteration runs? how are
>> those terminated?
>
> If we've reached a dead state, I think you'd want to start a queue
> freeze before running the terminating iterator.

For the requests that come in after the iterator, the nvmf_check_ready() routine, which validates controller state, will catch and bounce them. Keith states why we froze it in the past. Whatever the iterator is, I'd prefer we not get abort calls on io that has yet to be successfully started.

-- james
Re: [PATCH 02/27] aio: clear IOCB_HIPRI
On 12/4/18 7:46 AM, Christoph Hellwig wrote: > On Fri, Nov 30, 2018 at 10:14:31AM -0700, Jens Axboe wrote: >> On 11/30/18 10:13 AM, Christoph Hellwig wrote: >>> I think we'll need to queue this up for 4.21 ASAP independent of the >>> rest, given that with separate poll queues userspace could otherwise >>> submit I/O that will never get polled for anywhere. >> >> Probably a good idea, I can just move it to my 4.21 branch, it's not >> strictly dependent on the series. > > So, can you add it to the 4.21 branch? Done -- Jens Axboe
RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> > Problem statement :
> > Whenever try to get outstanding request via scsi_host_find_tag,
> > block layer will return stale entries instead of actual outstanding
> > request. Kernel panic if stale entry is inaccessible or memory is
> > reused.
> > Fix :
> > Undo request mapping in blk_mq_put_driver_tag nce request is return.
> >
> > More detail :
> > Whenever each SDEV entry is created, block layer allocate separate tags
> > and static requestis.Those requests are not valid after SDEV is deleted
> > from the system. On the fly, block layer maps static rqs to rqs as below
> > from blk_mq_get_driver_tag()
> >
> > data.hctx->tags->rqs[rq->tag] = rq;
> >
> > Above mapping is active in-used requests and it is the same mapping
> > which is referred in function scsi_host_find_tag().
> > After running some IOs, “data.hctx->tags->rqs[rq->tag]” will have some
> > entries which will never be reset in block layer.
>
> However, if rq & rq->tag is valid, data.hctx->tags->rqs[rq->tag] should
> have pointed to one active request instead of the stale one, right?

Yes, that is my understanding and learning from this issue.

Side note - At driver load, whenever the driver does scsi_add_host_with_dma(), it follows the below code path in the block layer:

scsi_mq_setup_tags
 -> blk_mq_alloc_tag_set
  -> blk_mq_alloc_rq_maps
   -> __blk_mq_alloc_rq_maps

SML creates two sets of request pools. One is per HBA and the other is per SDEV. I was confused why SML creates a request pool per HBA.

> Thanks,
> Ming
Re: [PATCH 01/13] block: move queues types to the block layer
>> Nit, there seems to be an extra newline that can be omitted here before
>> the else if statement (if I'm reading this correctly)...
>
> Empty lines can always be omitted, but in this case I actually like it
> as it seems to help readability..

If you think it's useful I'm fine with it as is...
Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled
>>> @@ -2173,6 +2157,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
>>>  	if (nr_io_queues == 0)
>>>  		return 0;
>>> +
>>> +	clear_bit(NVMEQ_ENABLED, &adminq->flags);
>>
>> This is a change of behavior, looks correct though as we can fail
>> nvme_setup_irqs after we freed the admin vector. Needs documentation
>> though..
>
> I have a hard time parsing the above, can you point to where the
> problem is?

This flag replaces cq_vector = -1 for indicating the queue state, so I'd expect that it would not appear where the former didn't. In this case we clear the flag in a place where we did not set cq_vector before, which seems correct to me though.
Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
>>> +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag)
>>
>> Do we still need to carry the tag around?
>
> Yes, the timeout handler polls for a specific tag.

Does it have to? The documentation suggests that we missed an interrupt, so it is probably waiting on the completion queue. I'd say that it'd be better if the tag search would be implemented on the timeout handler alone so we don't have to pass the tag around for everyone... Thoughts?
Re: [PATCH 10/13] nvme-mpath: remove I/O polling support
>>> If it really becomes an issue we should rework the nvme code to also
>>> skip the multipath code for any private namespace, even if that could
>>> mean some trouble when rescanning.
>>
>> This requires some explanation? skip the multipath code how?
>
> We currently always go through the multipath node as long as the
> controller is multipath capable. If we care about e.g. polling on a
> private namespace on a dual ported U.2 drive we'd have to make sure we
> don't go through the multipath device node for private namespaces that
> can only have one path, but only for shared namespaces.

But we'd still use the multipath node for shared namespaces (and also polling if needed). I agree that private namespaces can skip the multipath node.
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
>>>> Yes, I'm very much in favour of this, too.
>>>> We always have this IMO slightly weird notion of stopping the queue,
>>>> set some error flags in the driver, then _restarting_ the queue, just
>>>> so that the driver then sees the error flag and terminates the
>>>> requests. Which I always found quite counter-intuitive.
>>>
>>> What about requests that come in after the iteration runs? how are
>>> those terminated?
>>
>> If we've reached a dead state, I think you'd want to start a queue
>> freeze before running the terminating iterator.
>
> For the requests that come in after the iterator, the nvmf_check_ready()
> routine, which validates controller state, will catch and bounce them.

The point of this patch was to omit the check in queue_rq like Keith did in patch #2.
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
>>> Yes, I'm very much in favour of this, too.
>>> We always have this IMO slightly weird notion of stopping the queue,
>>> set some error flags in the driver, then _restarting_ the queue, just
>>> so that the driver then sees the error flag and terminates the
>>> requests. Which I always found quite counter-intuitive.
>>
>> What about requests that come in after the iteration runs? how are
>> those terminated?
>
> If we've reached a dead state, I think you'd want to start a queue
> freeze before running the terminating iterator.

It's not necessarily dead; in fabrics we need to handle disconnections that last for a while before we are able to reconnect (for a variety of reasons), and we need a way to fail I/O for failover (or requeue, or block, it's up to the upper layer). It's less of a "last resort" action like in the pci case.

Does this guarantee that after freeze+iter we won't get queued with any other request? If not then we still need to unfreeze and fail at queue_rq.
[PATCH 2/3] blktests: add python scripts for parsing fio json output
I wrote these scripts for xfstests, just copying them over to blktests as well. The terse output fio support that blktests uses doesn't get all of the various fio performance things that we may want to look at; this gives us a lot of flexibility around getting specific values out of the fio results data.

Signed-off-by: Josef Bacik
---
 src/FioResultDecoder.py | 64 +
 src/fio-key-value.py    | 28 ++
 2 files changed, 92 insertions(+)
 create mode 100644 src/FioResultDecoder.py
 create mode 100644 src/fio-key-value.py

diff --git a/src/FioResultDecoder.py b/src/FioResultDecoder.py
new file mode 100644
index ..d004140c0fdf
--- /dev/null
+++ b/src/FioResultDecoder.py
@@ -0,0 +1,64 @@
+# SPDX-License-Identifier: GPL-2.0
+
+import json
+
+class FioResultDecoder(json.JSONDecoder):
+    """Decoder for decoding fio result json to an object for our database
+
+    This decodes the json output from fio into an object that can be directly
+    inserted into our database. This just strips out the fields we don't care
+    about and collapses the read/write/trim classes into a flat value structure
+    inside of the jobs object.
+
+    For example
+        "write" : {
+            "io_bytes" : 313360384,
+            "bw" : 1016,
+        }
+
+    Get's collapsed to
+
+        "write_io_bytes" : 313360384,
+        "write_bw" : 1016,
+
+    Currently any dict under 'jobs' get's dropped, with the exception of
+    'read', 'write', and 'trim'. For those sub sections we drop any dict's
+    under those.
+
+    Attempt to keep this as generic as possible, we don't want to break every
+    time fio changes it's json output format.
+    """
+    _ignore_types = ['dict', 'list']
+    _override_keys = ['lat_ns', 'lat']
+    _io_ops = ['read', 'write', 'trim']
+
+    _transform_keys = { 'lat': 'lat_ns' }
+
+    def decode(self, json_string):
+        """This does the dirty work of converting everything"""
+        default_obj = super(FioResultDecoder, self).decode(json_string)
+        obj = {}
+        obj['global'] = {}
+        obj['global']['time'] = default_obj['time']
+        obj['jobs'] = []
+        for job in default_obj['jobs']:
+            new_job = {}
+            for key,value in job.iteritems():
+                if key not in self._io_ops:
+                    if value.__class__.__name__ in self._ignore_types:
+                        continue
+                    new_job[key] = value
+                    continue
+                for k,v in value.iteritems():
+                    if k in self._override_keys:
+                        if k in self._transform_keys:
+                            k = self._transform_keys[k]
+                        for subk,subv in v.iteritems():
+                            collapsed_key = "{}_{}_{}".format(key, k, subk)
+                            new_job[collapsed_key] = subv
+                        continue
+                    if v.__class__.__name__ in self._ignore_types:
+                        continue
+                    collapsed_key = "{}_{}".format(key, k)
+                    new_job[collapsed_key] = v
+            obj['jobs'].append(new_job)
+        return obj
diff --git a/src/fio-key-value.py b/src/fio-key-value.py
new file mode 100644
index ..208e9a453a19
--- /dev/null
+++ b/src/fio-key-value.py
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
+
+import FioResultDecoder
+import json
+import argparse
+import sys
+import platform
+
+parser = argparse.ArgumentParser()
+parser.add_argument('-j', '--jobname', type=str,
+                    help="The jobname we want our key from.",
+                    required=True)
+parser.add_argument('-k', '--key', type=str,
+                    help="The key we want the value of", required=True)
+parser.add_argument('result', type=str,
+                    help="The result file read")
+args = parser.parse_args()
+
+json_data = open(args.result)
+data = json.load(json_data, cls=FioResultDecoder.FioResultDecoder)
+
+for job in data['jobs']:
+    if job['jobname'] == args.jobname:
+        if args.key not in job:
+            print('')
+        else:
+            print("{}".format(job[args.key]))
+        break
--
2.14.3
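The decoder in the patch targets Python 2 (`iteritems`). A minimal Python 3 sketch of the same flattening idea, written against made-up sample data rather than real fio output, and ignoring the patch's lat/lat_ns renaming for brevity:

```python
import json

IO_OPS = {'read', 'write', 'trim'}

def collapse(job):
    """Flatten read/write/trim sub-dicts into 'write_io_bytes'-style keys,
    dropping any other nested dicts/lists, like FioResultDecoder does."""
    flat = {}
    for key, value in job.items():
        if key in IO_OPS and isinstance(value, dict):
            for k, v in value.items():
                # keep only scalar values inside the IO-op sub-dict
                if not isinstance(v, (dict, list)):
                    flat["{}_{}".format(key, k)] = v
        elif not isinstance(value, (dict, list)):
            flat[key] = value
    return flat

sample = json.loads(
    '{"jobname": "fast", "write": {"io_bytes": 313360384, "bw": 1016}}')
print(collapse(sample))
# {'jobname': 'fast', 'write_io_bytes': 313360384, 'write_bw': 1016}
```

Driving the transform by `isinstance` rather than `__class__.__name__` is the idiomatic Python 3 spelling of the same "ignore dicts and lists" rule.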
[PATCH 1/3] blktests: add cgroup2 infrastructure
In order to test io.latency and other cgroup related things we need some supporting helpers to setup and tear down cgroup2. This adds support for checking that we can even configure cgroup2 things, set them up if need be, and then add the cleanup stuff to the main cleanup function so everything is always in a clean state.

Signed-off-by: Josef Bacik
---
 check     |  2 ++
 common/rc | 48
 2 files changed, 50 insertions(+)

diff --git a/check b/check
index ebd87c097e25..1c9dbc518fa1 100755
--- a/check
+++ b/check
@@ -294,6 +294,8 @@ _cleanup() {
 		done
 		unset RESTORE_CPUS_ONLINE
 	fi
+
+	_cleanup_cgroup2
 }

 _call_test() {
diff --git a/common/rc b/common/rc
index 8a892bcd5fde..a785f2329687 100644
--- a/common/rc
+++ b/common/rc
@@ -202,3 +202,51 @@ _test_dev_in_hotplug_slot() {
 _filter_xfs_io_error() {
 	sed -e 's/^\(.*\)64\(: .*$\)/\1\2/'
 }
+
+_cgroup2_base_dir()
+{
+	grep cgroup2 /proc/mounts | awk '{ print $2 }'
+}
+
+_cleanup_cgroup2()
+{
+	_dir=$(_cgroup2_base_dir)/blktests
+	[ -d "${_dir}" ] || return
+
+	for i in $(find ${_dir} -type d | tac)
+	do
+		rmdir $i
+	done
+}
+
+_have_cgroup2()
+{
+	if ! grep -q 'cgroup2' /proc/mounts; then
+		SKIP_REASON="This test requires cgroup2"
+		return 1
+	fi
+	return 0
+}
+
+_have_cgroup2_controller_file()
+{
+	_have_cgroup2 || return 1
+
+	_controller=$1
+	_file=$2
+	_dir=$(_cgroup2_base_dir)
+
+	if ! grep -q ${_controller} ${_dir}/cgroup.controllers; then
+		SKIP_REASON="No support for ${_controller} cgroup controller"
+		return 1
+	fi
+
+	mkdir ${_dir}/blktests
+	echo "+${_controller}" > ${_dir}/cgroup.subtree_control
+	if [ ! -f ${_dir}/blktests/${_file} ]; then
+		_cleanup_cgroup2
+		SKIP_REASON="Cgroup file ${_file} doesn't exist"
+		return 1
+	fi
+	return 0
+}
--
2.14.3
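The `_cgroup2_base_dir` helper above finds the cgroup2 mount point with `grep cgroup2 /proc/mounts | awk '{ print $2 }'`. The same parse can be sketched in Python against a made-up mounts table (the sample lines below are illustrative, not taken from any real system):

```python
def cgroup2_base_dir(mounts_text):
    """Return the cgroup2 mount point from /proc/mounts-style text.

    Matches on the filesystem-type field (column 3) rather than grepping
    the whole line, which is slightly stricter than the shell helper.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mountpoint, fstype, options, ...
        if len(fields) >= 3 and fields[2] == 'cgroup2':
            return fields[1]
    return None

sample = """\
sysfs /sys sysfs rw,nosuid 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec 0 0
"""
assert cgroup2_base_dir(sample) == '/sys/fs/cgroup/unified'
```

Matching the fstype column avoids a false hit when some unrelated mount merely contains the substring "cgroup2" in its device name or options.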
[PATCH 0/3] io.latency test for blktests
This patchset is to add a test to verify io.latency is working properly, and to add all the supporting code to run that test. First is the cgroup2 infrastructure which is fairly straightforward. Just verifies we have cgroup2, and gives us the helpers to check and make sure we have the right controllers in place for the test. The second patch brings over some python scripts I put in xfstests for parsing the fio json output. I looked at the existing fio performance stuff in blktests, but we only capture bw stuff, which is wonky with this particular test because once the fast group is finished the slow group is allowed to go as fast as it wants. So I needed this to pull out actual jobtime spent. This will give us flexibility to pull out other fio performance data in the future. The final patch is the test itself. It simply runs a job by itself to get a baseline view of the disk performance. Then it creates 2 cgroups, one fast and one slow, and runs the same job simultaneously in both groups. The result should be that the fast group takes just slightly longer time than the baseline (I use a 15% threshold to be safe), and that the slow one takes considerably longer. Thanks, Josef
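The pass/fail criterion described above (fast cgroup within ~15% of the solo baseline, slow cgroup considerably longer) can be sketched as a small checker. The 15% threshold is the cover letter's; the function name and the sample numbers are hypothetical.

```python
def iolat_result_ok(baseline_s, fast_s, slow_s, fast_slack=0.15):
    """Check io.latency protection using per-job runtimes in seconds.

    The protected (fast) cgroup must finish within fast_slack of the
    solo baseline; the throttled (slow) cgroup must take noticeably
    longer than that same bound, since it only runs freely once the
    fast group is done.
    """
    bound = baseline_s * (1.0 + fast_slack)
    fast_ok = fast_s <= bound
    slow_ok = slow_s > bound
    return fast_ok and slow_ok

# Hypothetical runtimes: baseline 100s, fast 110s (within 15%), slow 180s.
assert iolat_result_ok(100.0, 110.0, 180.0)
```

The job runtimes themselves would come out of the fio JSON via the fio-key-value.py script from patch 2.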
[PATCH 3/3] blktests: block/025: an io.latency test
This is a test to verify io.latency is working properly. It does this by first running a fio job by itself to figure out how fast it runs. Then we calculate some thresholds, set up 2 cgroups, a fast and a slow cgroup, and then run the same job in both groups simultaneously. We should see the slow group get throttled until the first cgroup is able to finish, and then the slow cgroup will be allowed to finish.

Signed-off-by: Josef Bacik
---
 tests/block/025     | 133
 tests/block/025.out |   1 +
 2 files changed, 134 insertions(+)
 create mode 100644 tests/block/025
 create mode 100644 tests/block/025.out

diff --git a/tests/block/025 b/tests/block/025
new file mode 100644
index ..88ce7cca944e
--- /dev/null
+++ b/tests/block/025
@@ -0,0 +1,133 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+#
+# Test io.latency to make sure it's protecting the higher priority group
+# properly.
+
+. tests/block/rc
+
+DESCRIPTION="test the io.latency interface to make sure it's working right"
+
+requires() {
+	_have_cgroup2_controller_file io io.latency && _have_fio && \
+		_have_program python
+}
+
+_fio_results_key() {
+	_job=$1
+	_key=$2
+	_resultfile=$3
+
+	python src/fio-key-value.py -k $_key -j $_job $_resultfile
+}
+
+test_device() {
+	local fio_config_single fio_config_double fio_results fio_args qd
+
+	fio_config_single=${TMPDIR}/single.fio
+	fio_config_double=${TMPDIR}/double.fio
+	fio_results=${TMPDIR}/025.json
+	fio_args="--output-format=json --output=${fio_results}"
+	qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests)
+
+	cat << EOF > "${fio_config_single}"
+	[fast]
+	filename=${TEST_DEV}
+	direct=1
+	allrandrepeat=1
+	readwrite=randrw
+	size=4G
+	ioengine=libaio
+	iodepth=${qd}
+	fallocate=none
+	randseed=12345
+EOF
+
+	cat << EOF > "${fio_config_double}"
+	[global]
+	filename=${TEST_DEV}
+	direct=1
+	allrandrepeat=1
+	readwrite=randrw
+	size=4G
+	ioengine=libaio
+	iodepth=${qd}
+	fallocate=none
+	randseed=12345
+
+	[fast]
+	cgroup=blktests/fast
+
+	[slow]
+	cgroup=blktests/slow
+EOF
+	# We run the test once so we have an idea of how fast this workload will
+	# go with nobody else doing IO on the device.
+	if ! fio ${fio_args} ${fio_config_single}; then
+		echo "fio exited with status $?"
+		return 1
+	fi
+
+	_time_taken=$(_fio_results_key fast job_runtime ${fio_results})
+	if [ "${_time_taken}" = "" ]; then
+		echo "fio doesn't report job_runtime"
+		return 1
+	fi
+
+	echo "normal time taken ${_time_taken}" >> $FULL
+
+	# There's no way to predict how the two workloads are going to affect
+	# each other, so we want to set thresholds to something reasonable so
+	# we can verify io.latency is doing something.  This means we set 15%
+	# for the fast cgroup, just to give us enough wiggle room as throttling
+	# doesn't happen immediately.  But if we have a super fast disk we could
+	# run both groups really fast and make it under our fast threshold, so
+	# we need to set a threshold for the slow group at 50%.  We assume that
+	# if it was faster than 50% of the fast threshold then we probably
+	# didn't throttle and we can assume io.latency is broken.
+	_fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100))
+	_slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100))
+	echo "fast threshold time is ${_fast_thresh}" >> $FULL
+	echo "slow threshold time is ${_slow_thresh}" >> $FULL
+
+	# Create the cgroup files
+	_dir=$(_cgroup2_base_dir)/blktests
+	echo "+io" > ${_dir}/cgroup.subtree_control
+	mkdir ${_dir}/fast
+	mkdir ${_dir}/slow
+
+	# We set the target to 1usec because we could have a fast device that is
+	# capable of remarkable IO latencies that would skew the test.  It needs
+	# to be low enough that we do actually throttle the slow group,
+	# otherwise the test will fail when there's nothing wrong.
+	_major=$((0x$(stat -c "%t" ${TEST_DEV})))
+	_minor=$((0x$(stat -c "%T" ${TEST_DEV})))
+	echo "${_major}:${_minor} is our device" >> $FULL
+	if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then
+		echo "Failed to set our latency target"
+		return 1
+	fi
+
+	if ! fio ${fio_args} ${fio_config_double}; then
+		echo "fio exited with status $?"
+		return 1
+	fi
+
+	_fast_time=$(_fio_results_key fast job_runtime ${fio_results})
+	echo "Fast time ${_fast_time}" >> $FULL
+	_slow_time=$(_fio_results_key slow job_runtime ${fio_results})
+	echo "Slow time
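The "${_major}:${_minor}" computation in the test converts stat's hex %t/%T output to the decimal major:minor pair that io.latency expects. For reference, the same line can be built in Python from a device number (illustrative only; the helper name is made up, and os.makedev just constructs a sample device number the way the kernel encodes it):

```python
import os

def cgroup_io_latency_line(st_rdev, target_usec):
    # Decompose a device number into major:minor, as written to
    # ${_dir}/fast/io.latency in the test above.
    major, minor = os.major(st_rdev), os.minor(st_rdev)
    return "%d:%d target=%d" % (major, minor, target_usec)

dev = os.makedev(259, 1)  # e.g. an nvme partition's device number
print(cgroup_io_latency_line(dev, 1))  # 259:1 target=1
```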
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote: > > > > > Yes, I'm very much in favour of this, too. > > > > We always have this IMO slightly weird notion of stopping the queue, set > > > > some error flags in the driver, then _restarting_ the queue, just so > > > > that the driver then sees the error flag and terminates the requests. > > > > Which I always found quite counter-intuitive. > > > > > > What about requests that come in after the iteration runs? how are those > > > terminated? > > > > If we've reached a dead state, I think you'd want to start a queue freeze > > before running the terminating iterator. > > Its not necessarily dead, in fabrics we need to handle disconnections > that last for a while before we are able to reconnect (for a variety of > reasons) and we need a way to fail I/O for failover (or requeue, or > block its up to the upper layer). Its less of a "last resort" action > like in the pci case. > > Does this guarantee that after freeze+iter we won't get queued with any > other request? If not then we still need to unfreeze and fail at > queue_rq. It sounds like there are different scenarios to consider. For the dead controller, we call blk_cleanup_queue() at the end which ends callers who blocked on entering. If you're doing a failover, you'd replace the freeze with a current path update in order to prevent new requests from entering. In either case, you don't need checks in queue_rq. The queue_rq check is redundant with the quiesce state that blk-mq already provides. Once quiesced, the proposed iterator can handle the final termination of the request, perform failover, or some other lld specific action depending on your situation.
[PATCH 3/3] block: convert io-latency to use rq_qos_wait
Now that we have this common helper, convert io-latency over to use it as well.

Signed-off-by: Josef Bacik
---
 block/blk-iolatency.c | 31 ---
 1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index 5f7f1773be61..ac74b41f4131 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -262,15 +262,15 @@ static inline void iolat_update_total_lat_avg(struct iolatency_grp *iolat,
 			stat->rqs.mean);
 }
 
-static inline bool iolatency_may_queue(struct iolatency_grp *iolat,
-				       wait_queue_entry_t *wait,
-				       bool first_block)
+static void iolat_cleanup_cb(struct rq_wait *rqw, void *private_data)
 {
-	struct rq_wait *rqw = &iolat->rq_wait;
+	atomic_dec(&rqw->inflight);
+	wake_up(&rqw->wait);
+}
 
-	if (first_block && waitqueue_active(&rqw->wait) &&
-	    rqw->wait.head.next != &wait->entry)
-		return false;
+static bool iolat_acquire_inflight(struct rq_wait *rqw, void *private_data)
+{
+	struct iolatency_grp *iolat = private_data;
 
 	return rq_wait_inc_below(rqw, iolat->rq_depth.max_depth);
 }
@@ -281,8 +281,6 @@ static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
 {
 	struct rq_wait *rqw = &iolat->rq_wait;
 	unsigned use_delay = atomic_read(&lat_to_blkg(iolat)->use_delay);
-	DEFINE_WAIT(wait);
-	bool first_block = true;
 
 	if (use_delay)
 		blkcg_schedule_throttle(rqos->q, use_memdelay);
@@ -299,20 +297,7 @@ static void __blkcg_iolatency_throttle(struct rq_qos *rqos,
 		return;
 	}
 
-	if (iolatency_may_queue(iolat, &wait, first_block))
-		return;
-
-	do {
-		prepare_to_wait_exclusive(&rqw->wait, &wait,
-					  TASK_UNINTERRUPTIBLE);
-
-		if (iolatency_may_queue(iolat, &wait, first_block))
-			break;
-		first_block = false;
-		io_schedule();
-	} while (1);
-
-	finish_wait(&rqw->wait, &wait);
+	rq_qos_wait(rqw, iolat, iolat_acquire_inflight, iolat_cleanup_cb);
}
 
 #define SCALE_DOWN_FACTOR 2
-- 
2.14.3
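The shape of this conversion — one generic waiter, with per-user acquire/cleanup policy supplied as callbacks — can be sketched outside the kernel. A minimal Python model with hypothetical names (only the fast path is shown; a real implementation sleeps instead of reporting failure):

```python
def qos_wait(state, acquire_inflight_cb, cleanup_cb):
    # Fast path of an rq_qos_wait()-style helper: the generic code only
    # knows how to ask the policy callback for a budget.
    if acquire_inflight_cb(state):
        return "proceed"
    return "would-sleep"

def iolat_acquire(state):
    # io-latency-style policy: a fixed max depth, like
    # rq_wait_inc_below(rqw, iolat->rq_depth.max_depth).
    if state["inflight"] < state["max_depth"]:
        state["inflight"] += 1
        return True
    return False

state = {"inflight": 0, "max_depth": 1}
print(qos_wait(state, iolat_acquire, None))  # proceed
print(qos_wait(state, iolat_acquire, None))  # would-sleep
```

A wbt-style user would pass a different acquire callback (one that computes its limit from write-back state) without touching the generic helper.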
[PATCH 1/3] block: add rq_qos_wait to rq_qos
Originally when I split out the common code from blk-wbt into rq_qos I left the wbt_wait() where it was and simply copied and modified it slightly to work for io-latency. However they are both basically the same thing, and as time has gone on wbt_wait() has ended up much smarter and kinder than it was when I copied it into io-latency, which means io-latency has lost out on these improvements.

Since they are essentially the same thing except for a few minor details, create rq_qos_wait() that replicates what wbt_wait() currently does, with callbacks that can be passed in for the snowflakes to do their own thing as appropriate.

Signed-off-by: Josef Bacik
---
 block/blk-rq-qos.c | 86 ++
 block/blk-rq-qos.h |  6
 2 files changed, 92 insertions(+)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 80f603b76f61..e932ef9d2718 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -176,6 +176,92 @@ void rq_depth_scale_down(struct rq_depth *rqd, bool hard_throttle)
 	rq_depth_calc_max_depth(rqd);
 }
 
+struct rq_qos_wait_data {
+	struct wait_queue_entry wq;
+	struct task_struct *task;
+	struct rq_wait *rqw;
+	acquire_inflight_cb_t *cb;
+	void *private_data;
+	bool got_token;
+};
+
+static int rq_qos_wake_function(struct wait_queue_entry *curr,
+				unsigned int mode, int wake_flags, void *key)
+{
+	struct rq_qos_wait_data *data = container_of(curr,
+						     struct rq_qos_wait_data,
+						     wq);
+
+	/*
+	 * If we fail to get a budget, return -1 to interrupt the wake up loop
+	 * in __wake_up_common.
+	 */
+	if (!data->cb(data->rqw, data->private_data))
+		return -1;
+
+	data->got_token = true;
+	list_del_init(&curr->entry);
+	wake_up_process(data->task);
+	return 1;
+}
+
+/**
+ * rq_qos_wait - throttle on a rqw if we need to
+ * @rqw: the rqw to throttle on
+ * @private_data: caller provided specific data
+ * @acquire_inflight_cb: inc the rqw->inflight counter if we can
+ * @cleanup_cb: the callback to cleanup in case we race with a waker
+ *
+ * This provides a uniform place for the rq_qos users to do their throttling.
+ * Since you can end up with a lot of things sleeping at once, this manages the
+ * waking up based on the resources available. The acquire_inflight_cb should
+ * inc the rqw->inflight if we have the ability to do so, or return false if not
+ * and then we will sleep until the room becomes available.
+ *
+ * cleanup_cb is in case that we race with a waker and need to cleanup the
+ * inflight count accordingly.
+ */
+void rq_qos_wait(struct rq_wait *rqw, void *private_data,
+		 acquire_inflight_cb_t *acquire_inflight_cb,
+		 cleanup_cb_t *cleanup_cb)
+{
+	struct rq_qos_wait_data data = {
+		.wq = {
+			.func	= rq_qos_wake_function,
+			.entry	= LIST_HEAD_INIT(data.wq.entry),
+		},
+		.task = current,
+		.rqw = rqw,
+		.cb = acquire_inflight_cb,
+		.private_data = private_data,
+	};
+	bool has_sleeper;
+
+	has_sleeper = wq_has_sleeper(&rqw->wait);
+	if (!has_sleeper && acquire_inflight_cb(rqw, private_data))
+		return;
+
+	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
+	do {
+		if (data.got_token)
+			break;
+		if (!has_sleeper && acquire_inflight_cb(rqw, private_data)) {
+			finish_wait(&rqw->wait, &data.wq);
+
+			/*
+			 * We raced with rq_qos_wake_function() getting a token,
+			 * which means we now have two. Put our local token
+			 * and wake anyone else potentially waiting for one.
+			 */
+			if (data.got_token)
+				cleanup_cb(rqw, private_data);
+			break;
+		}
+		io_schedule();
+		has_sleeper = false;
+	} while (1);
+	finish_wait(&rqw->wait, &data.wq);
+}
+
 void rq_qos_exit(struct request_queue *q)
 {
 	while (q->rq_qos) {
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 6e09e98b93ea..8678875de420 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -93,6 +93,12 @@ static inline void rq_qos_del(struct request_queue *q, struct rq_qos *rqos)
 	}
 }
 
+typedef bool (acquire_inflight_cb_t)(struct rq_wait *rqw, void *private_data);
+typedef void (cleanup_cb_t)(struct rq_wait *rqw, void *private_data);
+
+void rq_qos_wait(struct rq_wait *rqw, void *private_data,
+		 acquire_inflight_cb_t *acquire_inflight_cb,
+		 cleanup_cb_t *cleanup_cb);
 bool rq_wait_inc_below(struct
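The wake-side protocol in rq_qos_wait() — the custom wake function tries to take the budget on behalf of each sleeper, and returns -1 to stop __wake_up_common's scan once the budget runs out — can be modeled outside the kernel. A deliberately single-threaded Python sketch with hypothetical names (the real code is concurrent and the "token" is an atomic inflight counter):

```python
from collections import deque

class RqWait:
    def __init__(self, max_depth):
        self.inflight = 0
        self.max_depth = max_depth
        self.waiters = deque()   # FIFO, like an exclusive wait queue

    def inc_below(self):
        # rq_wait_inc_below(): take a token only if under the limit.
        if self.inflight < self.max_depth:
            self.inflight += 1
            return True
        return False

    def throttle(self, name):
        # rq_qos_wait() fast path: no sleepers and a free token.
        if not self.waiters and self.inc_below():
            return True      # proceed immediately
        self.waiters.append(name)
        return False         # would sleep

    def complete(self):
        # Completion side: drop a token, then wake sleepers in FIFO order,
        # handing each a token; stop the scan as soon as one acquisition
        # fails (the wake function's -1 return).
        self.inflight -= 1
        woken = []
        while self.waiters:
            if not self.inc_below():
                break
            woken.append(self.waiters.popleft())
        return woken

rqw = RqWait(max_depth=2)
print(rqw.throttle("a"), rqw.throttle("b"), rqw.throttle("c"))  # True True False
print(rqw.complete())  # ['c']
```

The FIFO scan is what keeps woken waiters from being cycled back to the end of the line: a sleeper is only woken when a token has already been acquired for it.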
[PATCH 2/3] block: convert wbt_wait() to use rq_qos_wait()
Now that we have rq_qos_wait() in place, convert wbt_wait() over to using it with its specific callbacks.

Signed-off-by: Josef Bacik
---
 block/blk-wbt.c | 65 ++---
 1 file changed, 11 insertions(+), 54 deletions(-)

diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index d051ebfb4852..40207edd1d89 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -489,31 +489,21 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
 }
 
 struct wbt_wait_data {
-	struct wait_queue_entry wq;
-	struct task_struct *task;
 	struct rq_wb *rwb;
-	struct rq_wait *rqw;
+	enum wbt_flags wb_acct;
 	unsigned long rw;
-	bool got_token;
 };
 
-static int wbt_wake_function(struct wait_queue_entry *curr, unsigned int mode,
-			     int wake_flags, void *key)
+static bool wbt_inflight_cb(struct rq_wait *rqw, void *private_data)
 {
-	struct wbt_wait_data *data = container_of(curr, struct wbt_wait_data,
-						  wq);
-
-	/*
-	 * If we fail to get a budget, return -1 to interrupt the wake up
-	 * loop in __wake_up_common.
-	 */
-	if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
-		return -1;
+	struct wbt_wait_data *data = private_data;
+	return rq_wait_inc_below(rqw, get_limit(data->rwb, data->rw));
+}
 
-	data->got_token = true;
-	list_del_init(&curr->entry);
-	wake_up_process(data->task);
-	return 1;
+static void wbt_cleanup_cb(struct rq_wait *rqw, void *private_data)
+{
+	struct wbt_wait_data *data = private_data;
+	wbt_rqw_done(data->rwb, rqw, data->wb_acct);
 }
 
 /*
@@ -525,45 +515,12 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
 {
 	struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
 	struct wbt_wait_data data = {
-		.wq = {
-			.func	= wbt_wake_function,
-			.entry	= LIST_HEAD_INIT(data.wq.entry),
-		},
-		.task = current,
 		.rwb = rwb,
-		.rqw = rqw,
+		.wb_acct = wb_acct,
 		.rw = rw,
 	};
-	bool has_sleeper;
-
-	has_sleeper = wq_has_sleeper(&rqw->wait);
-	if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-		return;
-
-	prepare_to_wait_exclusive(&rqw->wait, &data.wq, TASK_UNINTERRUPTIBLE);
-	do {
-		if (data.got_token)
-			break;
-
-		if (!has_sleeper &&
-		    rq_wait_inc_below(rqw, get_limit(rwb, rw))) {
-			finish_wait(&rqw->wait, &data.wq);
-
-			/*
-			 * We raced with wbt_wake_function() getting a token,
-			 * which means we now have two. Put our local token
-			 * and wake anyone else potentially waiting for one.
-			 */
-			if (data.got_token)
-				wbt_rqw_done(rwb, rqw, wb_acct);
-			break;
-		}
-
-		io_schedule();
-		has_sleeper = false;
-	} while (1);
-	finish_wait(&rqw->wait, &data.wq);
+	rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb);
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.14.3
[PATCH 0/3] Unify the throttling code for wbt and io-latency
Originally when I wrote io-latency and the rq_qos code to provide a common base between wbt and io-latency, I left out the throttling part. These were basically the same, but slightly different in both cases. The difference was small enough, and the code simple enough, that I just copied it into io-latency, modified it for what I needed, and carried on.

Since then Jens has fixed a few issues with wakeups in the naive approach. Before, you could easily cycle waiters back to the end of the line if they were woken up without the ability to actually do their IO yet. But because this fix was only in wbt, we didn't get it in io-latency.

Resolve this by creating a unified interface for doing the throttling, and then just handle the differences between the two users with user-specific callbacks. This allows us to have one place where we have to mess with wakeups, and gives each user the ability to be their own special snowflake.

Jens, I based this on for-next from 12/03, let me know if you want a different base. I tested this with my blktests test.

Thanks,

Josef
Re: [PATCH 05/27] block: ensure that async polled IO is marked REQ_NOWAIT
On 12/4/18 7:48 AM, Christoph Hellwig wrote: > On Fri, Nov 30, 2018 at 10:17:49AM -0700, Jens Axboe wrote: >>> Setting REQ_NOWAIT from inside the block layer will make the code that >>> submits requests harder to review. Have you considered to make this code >>> fail I/O if REQ_NOWAIT has not been set and to require that the context >>> that submits I/O sets REQ_NOWAIT? >> >> It's technically still feasible to do for sync polled IO, it's only >> the async case that makes it a potential deadlock. > > I wonder if we want a REQ_ASYNC_POLL compound flag #define that sets > REQ_POLL and REQ_NOWAIT to make this blindly obvious. Yeah that might make sense, all the async cases should certainly use it, and sync can keep using REQ_POLL. I'll add that and fold where I can. -- Jens Axboe
Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues
On 12/4/18 8:05 AM, Christoph Hellwig wrote: > On Mon, Dec 03, 2018 at 05:00:59PM -0800, Sagi Grimberg wrote: >> >>> @@ -2428,7 +2426,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, >>> bool shutdown) >>> nvme_stop_queues(&dev->ctrl); >>> if (!dead && dev->ctrl.queue_count > 0) { >>> - nvme_disable_io_queues(dev); >>> + if (nvme_disable_io_queues(dev, nvme_admin_delete_sq)) >>> + >> >> Would be nice if the opcode change would be kept inside but still >> split like: > > I actually like not having another wrapper to stop through.. > > Keith, Jens, any preference? Fine either way, prefer not having a wrapper. -- Jens Axboe
Re: [PATCH 05/13] nvme-pci: consolidate code for polling non-dedicated queues
On 12/4/18 10:13 AM, Sagi Grimberg wrote: > +static int nvme_poll_irqdisable(struct nvme_queue *nvmeq, unsigned int tag) >>> >>> Do we still need to carry the tag around? >> >> Yes, the timeout handler polls for a specific tag. > > Does it have to? the documentation suggests that we missed > an interrupt, so it is probably waiting on the completion queue. > > I'd say that it'd be better if the tag search would be implemented > on the timeout handler alone so we don't have to pass the tag around > for everyone... Thoughts? Without that you don't know if that's the request that completed. You have to be able to look that up from the timeout handler. -- Jens Axboe
Re: block and nvme polling improvements V3
On 12/2/18 9:46 AM, Christoph Hellwig wrote:
> Hi all,
>
> this series optimizes a few bits in the block layer and nvme code
> related to polling.
>
> It starts by moving the queue types recently introduced entirely into
> the block layer instead of requiring an indirect call for them.
>
> It then switches nvme and the block layer to only allow polling
> with separate poll queues, which allows us to realize the following
> benefits:
>
>  - poll queues can safely avoid disabling irqs on any locks
>    (we already do that in NVMe, but it isn't 100% kosher as-is)
>  - regular interrupt driven queues can drop the CQ lock entirely,
>    as we won't race for completing CQs
>
> Then we drop the NVMe RDMA code, as it doesn't follow the new mode,
> and remove the nvme multipath polling code including the block hooks
> for it, which didn't make much sense to start with given that we
> started bypassing the multipath code for single controller subsystems
> early on.  Last but not least we enable polling in the block layer
> by default if the underlying driver has poll queues, as that already
> requires explicit user action.
>
> Note that it would be really nice to have polling back for RDMA with
> dedicated poll queues, but that might take a while.  Also based on
> Jens' polling aio patches we could now implement a model in nvmet
> where we have a thread polling both the backend nvme device and
> the RDMA CQs, which might give us some pretty nice performance
> (I know Sagi looked into something similar a while ago).

Applied, thanks.

-- 
Jens Axboe
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/4/2018 9:23 AM, Sagi Grimberg wrote: Yes, I'm very much in favour of this, too. We always have this IMO slightly weird notion of stopping the queue, set some error flags in the driver, then _restarting_ the queue, just so that the driver then sees the error flag and terminates the requests. Which I always found quite counter-intuitive. What about requests that come in after the iteration runs? how are those terminated? If we've reached a dead state, I think you'd want to start a queue freeze before running the terminating iterator. For the requests that come in after the iterator, the nvmf_check_ready() routine, which validates controller state, will catch and bounce them. The point of this patch was to omit the check in queue_rq like Keith did in patch #2. well - I'm not sure that's possible. The fabrics will have different time constraints vs pci. -- james
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/4/2018 9:48 AM, Keith Busch wrote: On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote: Yes, I'm very much in favour of this, too. We always have this IMO slightly weird notion of stopping the queue, set some error flags in the driver, then _restarting_ the queue, just so that the driver then sees the error flag and terminates the requests. Which I always found quite counter-intuitive. What about requests that come in after the iteration runs? how are those terminated? If we've reached a dead state, I think you'd want to start a queue freeze before running the terminating iterator. Its not necessarily dead, in fabrics we need to handle disconnections that last for a while before we are able to reconnect (for a variety of reasons) and we need a way to fail I/O for failover (or requeue, or block its up to the upper layer). Its less of a "last resort" action like in the pci case. Does this guarantee that after freeze+iter we won't get queued with any other request? If not then we still need to unfreeze and fail at queue_rq. It sounds like there are different scenarios to consider. For the dead controller, we call blk_cleanup_queue() at the end which ends callers who blocked on entering. If you're doing a failover, you'd replace the freeze with a current path update in order to prevent new requests from entering. and if you're not multipath ? I assume you want the io queues to be frozen so they queue there - which can block threads such as ns verification. It's good to have them live, as todays checks bounce the io, letting the thread terminate as its in a reset/reconnect state, which allows those threads to exit out or finish before a new reconnect kicks them off again. We've already been fighting deadlocks with the reset/delete/rescan paths and these io paths. suspending the queues completely over the reconnect will likely create more issues in this area. In either case, you don't need checks in queue_rq. 
The queue_rq check is redundant with the quiesce state that blk-mq already provides. I disagree. The cases I've run into are on the admin queue - where we are sending io to initialize the controller when another error/reset occurs, and the checks are required to identify/reject the "old" initialization commands, with another state check allowing them to proceed on the "new" initialization commands. And there are also cases for ioctls and other things that occur during the middle of those initialization steps that need to be weeded out. The Admin queue has to be kept live to allow the initialization commands on the new controller. state checks are also needed for those namespace validation cases Once quiesced, the proposed iterator can handle the final termination of the request, perform failover, or some other lld specific action depending on your situation. I don't believe they can remain frozen, definitely not for the admin queue. -- james
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote: > > > On 12/4/2018 9:48 AM, Keith Busch wrote: > > On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote: > > > > > > Yes, I'm very much in favour of this, too. > > > > > > We always have this IMO slightly weird notion of stopping the > > > > > > queue, set > > > > > > some error flags in the driver, then _restarting_ the queue, just so > > > > > > that the driver then sees the error flag and terminates the > > > > > > requests. > > > > > > Which I always found quite counter-intuitive. > > > > > What about requests that come in after the iteration runs? how are > > > > > those > > > > > terminated? > > > > If we've reached a dead state, I think you'd want to start a queue > > > > freeze > > > > before running the terminating iterator. > > > Its not necessarily dead, in fabrics we need to handle disconnections > > > that last for a while before we are able to reconnect (for a variety of > > > reasons) and we need a way to fail I/O for failover (or requeue, or > > > block its up to the upper layer). Its less of a "last resort" action > > > like in the pci case. > > > > > > Does this guarantee that after freeze+iter we won't get queued with any > > > other request? If not then we still need to unfreeze and fail at > > > queue_rq. > > It sounds like there are different scenarios to consider. > > > > For the dead controller, we call blk_cleanup_queue() at the end which > > ends callers who blocked on entering. > > > > If you're doing a failover, you'd replace the freeze with a current path > > update in order to prevent new requests from entering. > and if you're not multipath ? I assume you want the io queues to be > frozen so they queue there - which can block threads such as ns > verification. 
It's good to have them live, as todays checks bounce the io, > letting the thread terminate as its in a reset/reconnect state, which allows > those threads to exit out or finish before a new reconnect kicks them off > again. We've already been fighting deadlocks with the reset/delete/rescan > paths and these io paths. suspending the queues completely over the > reconnect will likely create more issues in this area. > > > > In either case, you don't need checks in queue_rq. The queue_rq check > > is redundant with the quiesce state that blk-mq already provides. > > I disagree. The cases I've run into are on the admin queue - where we are > sending io to initialize the controller when another error/reset occurs, and > the checks are required to identify/reject the "old" initialization > commands, with another state check allowing them to proceed on the "new" > initialization commands. And there are also cases for ioctls and other > things that occur during the middle of those initialization steps that need > to be weeded out. The Admin queue has to be kept live to allow the > initialization commands on the new controller. > > state checks are also needed for those namespace validation cases > > > > > Once quiesced, the proposed iterator can handle the final termination > > of the request, perform failover, or some other lld specific action > > depending on your situation. > > I don't believe they can remain frozen, definitely not for the admin queue. > -- james Quiesced and frozen carry different semantics. My understanding of the nvme-fc implementation is that it returns BLK_STS_RESOURCE in the scenario you've described where the admin command can't be executed at the moment. That just has the block layer requeue it for later resubmission 3 milliseconds later, which will continue to return the same status code until you're really ready for it. 
What I'm proposing is that instead of using that return code, you may have nvme-fc control when to dispatch those queued requests by utilizing the blk-mq quiesce on/off states. Is there a reason that wouldn't work?
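The freeze-versus-quiesce distinction this thread keeps returning to can be summarized with a toy model (not kernel code; names are made up): freezing gates entry into the queue, so new submitters block, while quiescing lets requests enter but holds them back from being dispatched to the driver's queue_rq.

```python
class ToyQueue:
    """Toy model of the two blk-mq gating states discussed above:
    'frozen' blocks submitters at entry, 'quiesced' defers dispatch."""

    def __init__(self):
        self.frozen = False
        self.quiesced = False
        self.queued = []       # accepted but not yet dispatched
        self.dispatched = []   # handed to the driver

    def submit(self, req):
        if self.frozen:
            return False       # submitter blocks at queue entry
        self.queued.append(req)
        self.run()
        return True

    def run(self):
        if self.quiesced:
            return             # requests stay queued; queue_rq never called
        self.dispatched.extend(self.queued)
        self.queued.clear()

    def unquiesce(self):
        self.quiesced = False
        self.run()

q = ToyQueue()
q.quiesced = True
q.submit("admin-cmd")              # accepted, but held back
print(q.queued, q.dispatched)      # ['admin-cmd'] []
q.unquiesce()
print(q.queued, q.dispatched)      # [] ['admin-cmd']
```

In these terms, Keith's suggestion is to hold requests in the quiesced state (so the driver controls when they dispatch) rather than bouncing them out of queue_rq with BLK_STS_RESOURCE.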
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On Tue, Dec 04, 2018 at 02:21:17PM -0700, Keith Busch wrote: > On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote: > > On 12/4/2018 9:48 AM, Keith Busch wrote: > > > Once quiesced, the proposed iterator can handle the final termination > > > of the request, perform failover, or some other lld specific action > > > depending on your situation. > > > > I don't believe they can remain frozen, definitely not for the admin queue. > > -- james > > Quiesced and frozen carry different semantics. > > My understanding of the nvme-fc implementation is that it returns > BLK_STS_RESOURCE in the scenario you've described where the admin > command can't be executed at the moment. That just has the block layer > requeue it for later resubmission 3 milliseconds later, which will > continue to return the same status code until you're really ready for > it. > > What I'm proposing is that instead of using that return code, you may > have nvme-fc control when to dispatch those queued requests by utilizing > the blk-mq quiesce on/off states. Is there a reason that wouldn't work? BTW, this is digressing from this patch's use case. The proposed iteration doesn't come into play for the above scenario, which can be handled with existing interfaces.
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On 12/4/2018 1:21 PM, Keith Busch wrote: On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote: I disagree. The cases I've run into are on the admin queue - where we are sending io to initialize the controller when another error/reset occurs, and the checks are required to identify/reject the "old" initialization commands, with another state check allowing them to proceed on the "new" initialization commands. And there are also cases for ioctls and other things that occur during the middle of those initialization steps that need to be weeded out. The Admin queue has to be kept live to allow the initialization commands on the new controller. state checks are also needed for those namespace validation cases Once quiesced, the proposed iterator can handle the final termination of the request, perform failover, or some other lld specific action depending on your situation. I don't believe they can remain frozen, definitely not for the admin queue. -- james Quiesced and frozen carry different semantics. My understanding of the nvme-fc implementation is that it returns BLK_STS_RESOURCE in the scenario you've described where the admin command can't be executed at the moment. That just has the block layer requeue it for later resubmission 3 milliseconds later, which will continue to return the same status code until you're really ready for it. BLK_STS_RESOURCE is correct - but for "normal" io, which comes from the filesystem, etc and are mostly on the io queues. But if the io originated from other sources, like the core layer via nvme_alloc_request() - used by lots of service routines for the transport to initialize the controller, core routines to talk to the controller, and ioctls from user space - then they are failed with a PATHING ERROR status. The status doesn't mean much to these other places which mainly care if they succeed or not, and if not, they fail and unwind. 
It's pretty critical for these paths to get that error status as many of those threads do synchronous io. And this is not just for nvme-fc. Any transport initializing the controller and getting half way through it when an error occurs that kills the association will depend on this behavior. PCI is a large exception as interaction with a pci function is very different from sending packets over a network and detecting network errors. Io requests, on the io queues, that are flagged as multipath, also are failed this way rather than requeued. We would need some iterator here to classify the type of io (one valid to go down another path) and move it to another path (but the transport doesn't want to know about other paths). What I'm proposing is that instead of using that return code, you may have nvme-fc control when to dispatch those queued requests by utilizing the blk-mq quiesce on/off states. Is there a reason that wouldn't work? and quiesce on/off isn't sufficient to do this. -- james
Re: [PATCH 3/3] blktests: block/025: an io.latency test
On 12/4/18 6:47 PM, Josef Bacik wrote: > This is a test to verify io.latency is working properly. It does this > by first running a fio job by itself to figure out how fast it runs. > Then we calculate some thresholds, set up 2 cgroups, a fast and a slow > cgroup, and then run the same job in both groups simultaneously. We > should see the slow group get throttled until the first cgroup is able > to finish, and then the slow cgroup will be allowed to finish. > Hi, two small typos below. > Signed-off-by: Josef Bacik > --- > tests/block/025 | 133 > > tests/block/025.out | 1 + > 2 files changed, 134 insertions(+) > create mode 100644 tests/block/025 > create mode 100644 tests/block/025.out > > diff --git a/tests/block/025 b/tests/block/025 > new file mode 100644 > index ..88ce7cca944e > --- /dev/null > +++ b/tests/block/025 > @@ -0,0 +1,133 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-3.0+ > +# > +# Test io.latency to make sure it's protecting the higher priority group > +# properly. > + > +. 
tests/block/rc > + > +DESCRIPTION="test the io.latency interface to make sure it's working right" > + > +requires() { > + _have_cgroup2_controller_file io io.latency && _have_fio && \ > + _have_program python > +} > + > +_fio_results_key() { > + _job=$1 > + _key=$2 > + _resultfile=$3 > + > + python src/fio-key-value.py -k $_key -j $_job $_resultfile > +} > + > +test_device() { > + local fio_config_single fio_config_double fio_results fio_args qd > + > + fio_config_single=${TMPDIR}/single.fio > + fio_config_double=${TMPDIR}/double.fio > + fio_results=${TMPDIR}/025.json > + fio_args="--output-format=json --output=${fio_results}" > + qd=$(cat ${TEST_DEV_SYSFS}/queue/nr_requests) > + > + cat << EOF > "${fio_config_single}" > + [fast] > + filename=${TEST_DEV} > + direct=1 > + allrandrepeat=1 > + readwrite=randrw > + size=4G > + ioengine=libaio > + iodepth=${qd} > + fallocate=none > + randseed=12345 > +EOF > + > + cat << EOF > "${fio_config_double}" > + [global] > + filename=${TEST_DEV} > + direct=1 > + allrandrepeat=1 > + readwrite=randrw > + size=4G > + ioengine=libaio > + iodepth=${qd} > + fallocate=none > + randseed=12345 > + > + [fast] > + cgroup=blktests/fast > + > + [slow] > + cgroup=blktests/slow > +EOF > + # We run the test once so we have an idea of how fast this workload will > + # go with nobody else doing IO on the device. > + if ! fio ${fio_args} ${fio_config_single}; then > + echo "fio exited with status $?" > + return 1 > + fi > + > + _time_taken=$(_fio_results_key fast job_runtime ${fio_results}) > + if [ "${_time_taken}" = "" ]; then > + echo "fio doesn't report job_runtime" > + return 1 > + fi > + > + echo "normal time taken ${_time_taken}" >> $FULL > + > + # There's no way to predict how the two workloads are going to affect > + # each other, so we weant to set thresholds to something reasonable so > + # we can verify io.latency is doing something. 
This means we set 15% > + # for the fast cgroup, just to give us enough wiggle room as throttling > + # doesn't happen immediately. But if we have a super fast disk we could > + # run both groups really fast and make it under our fast threshold, so > + # we need to set a threshold for the slow group at 50%. We assume that > + # if it was faster than 50% of the fast threshold then we probably > + # didn't throttle and we can assume io.latency is broken. > + _fast_thresh=$((${_time_taken} + ${_time_taken} * 15 / 100)) > + _slow_thresh=$((${_time_taken} + ${_time_taken} * 50 / 100)) > + echo "fast threshold time is ${_fast_thresh}" >> $FULL > + echo "slow threshold time is ${_slow_thresh}" >> $FULL > + > + # Create teh cgroup files Typo, "the". ^ > + _dir=$(_cgroup2_base_dir)/blktests > + echo "+io" > ${_dir}/cgroup.subtree_control > + mkdir ${_dir}/fast > + mkdir ${_dir}/slow > + > + # We set teh target to 1usec because we could have a fast device that is Also here, "the". ^ > + # capable of remarkable IO latencies that would skew the test. It needs > + # to be low enough that we do actually throttle the slow group, > + # otherwise the test will fail when there's nothing wrong. > + _major=$((0x$(stat -c "%t" ${TEST_DEV}))) > + _minor=$((0x$(stat -c "%T" ${TEST_DEV}))) > + echo "${_major}:${_minor} is our device" >> $FULL > + if ! echo "${_major}:${_minor} target=1" > ${_dir}/fast/io.latency; then > + echo "Failed to set our latency target" > + return 1 > + fi > + > + if ! fio ${fio_args} ${fio_config_double}; then > + echo "fio exited with status $?" > + return 1 > + fi
Re: [PATCH 2/3] blktests: add python scripts for parsing fio json output
On 12/4/18 6:47 PM, Josef Bacik wrote: > I wrote these scripts for xfstests, just copying them over to blktests > as well. The terse output fio support that blktests doesn't get all of > the various fio performance things that we may want to look at, this > gives us a lot of flexibility around getting specific values out of the > fio results data. > Hi, some suggestions below :) > Signed-off-by: Josef Bacik > --- > src/FioResultDecoder.py | 64 > + > src/fio-key-value.py| 28 ++ > 2 files changed, 92 insertions(+) > create mode 100644 src/FioResultDecoder.py > create mode 100644 src/fio-key-value.py > > diff --git a/src/FioResultDecoder.py b/src/FioResultDecoder.py > new file mode 100644 > index ..d004140c0fdf > --- /dev/null > +++ b/src/FioResultDecoder.py > @@ -0,0 +1,64 @@ I think a shebang could clarify this runs under python2. > +# SPDX-License-Identifier: GPL-2.0 > + > +import json > + > +class FioResultDecoder(json.JSONDecoder): > +"""Decoder for decoding fio result json to an object for our database > + > +This decodes the json output from fio into an object that can be directly > +inserted into our database. This just strips out the fields we don't > care > +about and collapses the read/write/trim classes into a flat value > structure > +inside of the jobs object. > + > +For example > +"write" : { > +"io_bytes" : 313360384, > +"bw" : 1016, > +} > + > +Get's collapsed to > + > +"write_io_bytes" : 313360384, > +"write_bw": 1016, > + > +Currently any dict under 'jobs' get's dropped, with the exception of > 'read', > +'write', and 'trim'. For those sub sections we drop any dict's under > those. > + > +Attempt to keep this as generic as possible, we don't want to break every > +time fio changes it's json output format. 
> +""" > +_ignore_types = ['dict', 'list'] > +_override_keys = ['lat_ns', 'lat'] > +_io_ops = ['read', 'write', 'trim'] > + > +_transform_keys = { 'lat': 'lat_ns' } > + > +def decode(self, json_string): > +"""This does the dirty work of converting everything""" > +default_obj = super(FioResultDecoder, self).decode(json_string) > +obj = {} > +obj['global'] = {} > +obj['global']['time'] = default_obj['time'] > +obj['jobs'] = [] > +for job in default_obj['jobs']: > +new_job = {} > +for key,value in job.iteritems(): > +if key not in self._io_ops: > +if value.__class__.__name__ in self._ignore_types: > +continue > +new_job[key] = value > +continue > +for k,v in value.iteritems(): > +if k in self._override_keys: > +if k in self._transform_keys: > +k = self._transform_keys[k] > +for subk,subv in v.iteritems(): > +collapsed_key = "{}_{}_{}".format(key, k, subk) > +new_job[collapsed_key] = subv > +continue > +if v.__class__.__name__ in self._ignore_types: > +continue > +collapsed_key = "{}_{}".format(key, k) > +new_job[collapsed_key] = v > +obj['jobs'].append(new_job) > +return obj > diff --git a/src/fio-key-value.py b/src/fio-key-value.py > new file mode 100644 > index ..208e9a453a19 > --- /dev/null > +++ b/src/fio-key-value.py > @@ -0,0 +1,28 @@ Also here a shebang could be fine. 
> +# SPDX-License-Identifier: GPL-2.0 > + > +import FioResultDecoder > +import json > +import argparse > +import sys > +import platform > + > +parser = argparse.ArgumentParser() > +parser.add_argument('-j', '--jobname', type=str, > +help="The jobname we want our key from.", > +required=True) > +parser.add_argument('-k', '--key', type=str, > +help="The key we want the value of", required=True) > +parser.add_argument('result', type=str, If here you set type=open ^ > +help="The result file read") > +args = parser.parse_args() > + > +json_data = open(args.result) you could avoid this line ^ > +data = json.load(json_data, cls=FioResultDecoder.FioResultDecoder) and substitute json_data with args.result which would return a file object. In short the above piece of patch would become something like: parser.add_argument('result', type=open, help="The result file read") args = parser.parse_args() data = json.load(args.result, cls=FioResultDecoder.FioResultDecoder) Which still raises IOError if bad arguments are passed. > + > +for job in data['jobs']: > +if job['jobname'] == args.jobname: > +if args.key not in job: > +
[PATCH] blk-mq: fix corruption with direct issue
If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Signed-off-by: Jens Axboe --- diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..d8f518c6ea38 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, break; case BLK_STS_RESOURCE: case BLK_STS_DEV_RESOURCE: + /* +* If direct dispatch fails, we cannot allow any merging on +* this IO. Drivers (like SCSI) may have set up permanent state +* for this request, like SG tables and mappings, and if we +* merge to it later on then we'll still only do IO to the +* original part. +*/ + rq->cmd_flags |= REQ_NOMERGE; + blk_mq_update_dispatch_busy(hctx, true); __blk_mq_requeue_request(rq); break; -- Jens Axboe
[PATCHSET v5] Support for polled aio
For the grand introduction to this feature, see my original posting here: https://lore.kernel.org/linux-block/20181117235317.7366-1-ax...@kernel.dk/ and refer to the previous postings of this patchset for whatever features were added there. Particularly v4 has some performance results: https://lore.kernel.org/linux-block/20181130165646.27341-1-ax...@kernel.dk/ New in this version is io_ring_enter(2) and a ring based SQ/CQ interface. The rings are mapped from user space, an application can submit IO by writing to its own SQ ring of struct iocbs, and get completions by reading the CQ ring of io_events. This eliminates the need to copy iocbs and io_events completely. It also opens up the possibility of doing polled IO without any system calls at all, if we add a thread on the kernel side... I've done experiments with that, but not ready yet. Doesn't require any changes to the API. I can trivially do 1.2M IOPS with a single thread through this, and 173K IOPS with QD=1/sync. You can also find the patches in my aio-poll branch: http://git.kernel.dk/cgit/linux-block/log/?h=aio-poll or by cloning: git://git.kernel.dk/linux-block aio-poll Patches are against for-4.21/block. Since v4 - Switch to using ITER_BVEC for user mapped buffers. - Drop unneeded import_kvec(). - Use RLIMIT_MEMLOCK as a cap on total memory pinned. 
- Fix poll check with min_events == 0 actually checking for events - Add REQ_HIPRI_ASYNC - Add ring interface and io_ring_enter(2) - Change io_setup2() system call to accommodate the new ring interface Documentation/filesystems/vfs.txt |3 + arch/x86/entry/syscalls/syscall_64.tbl |2 + block/bio.c| 33 +- fs/aio.c | 1440 +--- fs/block_dev.c | 34 +- fs/file.c | 15 +- fs/file_table.c| 10 +- fs/gfs2/file.c |2 + fs/iomap.c | 52 +- fs/xfs/xfs_file.c |1 + include/linux/bio.h|1 + include/linux/blk_types.h |2 + include/linux/file.h |2 + include/linux/fs.h |5 +- include/linux/iomap.h |1 + include/linux/syscalls.h |5 + include/uapi/asm-generic/unistd.h |4 +- include/uapi/linux/aio_abi.h | 32 + kernel/sys_ni.c|1 + 19 files changed, 1450 insertions(+), 195 deletions(-) -- Jens Axboe
[PATCH 05/26] iomap: wire up the iopoll method
From: Christoph Hellwig Store the request queue the last bio was submitted to in the iocb private data in addition to the cookie so that we find the right block device. Also refactor the common direct I/O bio submission code into a nice little helper. Signed-off-by: Christoph Hellwig Modified to use REQ_HIPRI_ASYNC for async polled IO. Signed-off-by: Jens Axboe --- fs/gfs2/file.c| 2 ++ fs/iomap.c| 47 +-- fs/xfs/xfs_file.c | 1 + include/linux/iomap.h | 1 + 4 files changed, 36 insertions(+), 15 deletions(-) diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 45a17b770d97..358157efc5b7 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1280,6 +1280,7 @@ const struct file_operations gfs2_file_fops = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, @@ -1310,6 +1311,7 @@ const struct file_operations gfs2_file_fops_nolock = { .llseek = gfs2_llseek, .read_iter = gfs2_file_read_iter, .write_iter = gfs2_file_write_iter, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = gfs2_ioctl, .mmap = gfs2_mmap, .open = gfs2_open, diff --git a/fs/iomap.c b/fs/iomap.c index d094e5688bd3..bd483fcb7b5a 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1441,6 +1441,32 @@ struct iomap_dio { }; }; +int iomap_dio_iopoll(struct kiocb *kiocb, bool spin) +{ + struct request_queue *q = READ_ONCE(kiocb->private); + + if (!q) + return 0; + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), spin); +} +EXPORT_SYMBOL_GPL(iomap_dio_iopoll); + +static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap, + struct bio *bio) +{ + atomic_inc(&dio->ref); + + if (dio->iocb->ki_flags & IOCB_HIPRI) { + if (!dio->wait_for_completion) + bio->bi_opf |= REQ_HIPRI_ASYNC; + else + bio->bi_opf |= REQ_HIPRI; + } + + dio->submit.last_queue = bdev_get_queue(iomap->bdev); + dio->submit.cookie = submit_bio(bio); +} + static ssize_t iomap_dio_complete(struct iomap_dio *dio) { 
struct kiocb *iocb = dio->iocb; @@ -1553,7 +1579,7 @@ static void iomap_dio_bio_end_io(struct bio *bio) } } -static blk_qc_t +static void iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, unsigned len) { @@ -1567,15 +1593,10 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - if (dio->iocb->ki_flags & IOCB_HIPRI) - flags |= REQ_HIPRI; - get_page(page); __bio_add_page(bio, page, len, 0); bio_set_op_attrs(bio, REQ_OP_WRITE, flags); - - atomic_inc(&dio->ref); - return submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } static loff_t @@ -1678,9 +1699,6 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio_set_pages_dirty(bio); } - if (dio->iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; - iov_iter_advance(dio->submit.iter, n); dio->size += n; @@ -1688,11 +1706,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, copied += n; nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES); - - atomic_inc(&dio->ref); - - dio->submit.last_queue = bdev_get_queue(iomap->bdev); - dio->submit.cookie = submit_bio(bio); + iomap_dio_submit_bio(dio, iomap, bio); } while (nr_pages); /* @@ -1912,6 +1926,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, if (dio->flags & IOMAP_DIO_WRITE_FUA) dio->flags &= ~IOMAP_DIO_NEED_SYNC; + WRITE_ONCE(iocb->ki_cookie, dio->submit.cookie); + WRITE_ONCE(iocb->private, dio->submit.last_queue); + if (!atomic_dec_and_test(&dio->ref)) { if (!dio->wait_for_completion) return -EIOCBQUEUED; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index e47425071e65..60c2da41f0fc 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1203,6 +1203,7 @@ const struct file_operations xfs_file_operations = { .write_iter = xfs_file_write_iter, .splice_read= generic_file_splice_read, .splice_write = iter_file_splice_write, + .iopoll = iomap_dio_iopoll, .unlocked_ioctl = xfs_file_ioctl, #ifdef CONFIG_COMPAT 
.compat_ioctl = xfs_file_compat_ioctl, diff --git a/inc
[PATCH 07/26] aio: separate out ring reservation from req allocation
From: Christoph Hellwig This is in preparation for certain types of IO not needing a ring reserveration. Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/aio.c | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index cf0de61743e8..eaceb40e6cf5 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -901,7 +901,7 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr) local_irq_restore(flags); } -static bool get_reqs_available(struct kioctx *ctx) +static bool __get_reqs_available(struct kioctx *ctx) { struct kioctx_cpu *kcpu; bool ret = false; @@ -993,6 +993,14 @@ static void user_refill_reqs_available(struct kioctx *ctx) spin_unlock_irq(&ctx->completion_lock); } +static bool get_reqs_available(struct kioctx *ctx) +{ + if (__get_reqs_available(ctx)) + return true; + user_refill_reqs_available(ctx); + return __get_reqs_available(ctx); +} + /* aio_get_req * Allocate a slot for an aio request. * Returns NULL if no requests are free. @@ -1001,24 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) { struct aio_kiocb *req; - if (!get_reqs_available(ctx)) { - user_refill_reqs_available(ctx); - if (!get_reqs_available(ctx)) - return NULL; - } - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO); if (unlikely(!req)) - goto out_put; + return NULL; percpu_ref_get(&ctx->reqs); INIT_LIST_HEAD(&req->ki_list); refcount_set(&req->ki_refcnt, 0); req->ki_ctx = ctx; return req; -out_put: - put_reqs_available(ctx, 1); - return NULL; } static struct kioctx *lookup_ioctx(unsigned long ctx_id) @@ -1805,9 +1804,13 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, return -EINVAL; } + if (!get_reqs_available(ctx)) + return -EAGAIN; + + ret = -EAGAIN; req = aio_get_req(ctx); if (unlikely(!req)) - return -EAGAIN; + goto out_put_reqs_available; if (iocb.aio_flags & IOCB_FLAG_RESFD) { /* @@ -1870,11 +1873,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user 
*user_iocb, goto out_put_req; return 0; out_put_req: - put_reqs_available(ctx, 1); percpu_ref_put(&ctx->reqs); if (req->ki_eventfd) eventfd_ctx_put(req->ki_eventfd); kmem_cache_free(kiocb_cachep, req); +out_put_reqs_available: + put_reqs_available(ctx, 1); return ret; } -- 2.17.1
[PATCH 18/26] aio: use fget/fput_many() for file references
On the submission side, add file reference batching to the aio_submit_state. We get as many references as the number of iocbs we are submitting, and drop unused ones if we end up switching files. The assumption here is that we're usually only dealing with one fd, and if there are multiple, hopefuly they are at least somewhat ordered. Could trivially be extended to cover multiple fds, if needed. On the completion side we do the same thing, except this is trivially done just locally in aio_iopoll_reap(). Signed-off-by: Jens Axboe --- fs/aio.c | 106 +++ 1 file changed, 91 insertions(+), 15 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index ae0805dc814c..634b540b0c92 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -253,6 +253,15 @@ struct aio_submit_state { */ struct list_head req_list; unsigned int req_count; + + /* +* File reference cache +*/ + struct file *file; + unsigned int fd; + unsigned int has_refs; + unsigned int used_refs; + unsigned int ios_left; }; /*-- sysctl variables*/ @@ -1355,7 +1364,8 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs, { void *iocbs[AIO_IOPOLL_BATCH]; struct aio_kiocb *iocb, *n; - int to_free = 0, ret = 0; + int file_count, to_free = 0, ret = 0; + struct file *file = NULL; /* Shouldn't happen... */ if (*nr_events >= max) @@ -1372,7 +1382,20 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs, list_del(&iocb->ki_list); iocbs[to_free++] = iocb; - fput(iocb->rw.ki_filp); + /* +* Batched puts of the same file, to avoid dirtying the +* file usage count multiple times, if avoidable. 
+*/ + if (!file) { + file = iocb->rw.ki_filp; + file_count = 1; + } else if (file == iocb->rw.ki_filp) { + file_count++; + } else { + fput_many(file, file_count); + file = iocb->rw.ki_filp; + file_count = 1; + } if (evs && copy_to_user(evs + *nr_events, &iocb->ki_ev, sizeof(iocb->ki_ev))) { @@ -1382,6 +1405,9 @@ static long aio_iopoll_reap(struct kioctx *ctx, struct io_event __user *evs, (*nr_events)++; } + if (file) + fput_many(file, file_count); + if (to_free) iocb_put_many(ctx, iocbs, &to_free); @@ -1804,13 +1830,58 @@ static void aio_complete_rw_poll(struct kiocb *kiocb, long res, long res2) } } -static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb) +static void aio_file_put(struct aio_submit_state *state) +{ + if (state->file) { + int diff = state->has_refs - state->used_refs; + + if (diff) + fput_many(state->file, diff); + state->file = NULL; + } +} + +/* + * Get as many references to a file as we have IOs left in this submission, + * assuming most submissions are for one file, or at least that each file + * has more than one submission. 
+ */ +static struct file *aio_file_get(struct aio_submit_state *state, int fd) +{ + if (!state) + return fget(fd); + + if (!state->file) { +get_file: + state->file = fget_many(fd, state->ios_left); + if (!state->file) + return NULL; + + state->fd = fd; + state->has_refs = state->ios_left; + state->used_refs = 1; + state->ios_left--; + return state->file; + } + + if (state->fd == fd) { + state->used_refs++; + state->ios_left--; + return state->file; + } + + aio_file_put(state); + goto get_file; +} + +static int aio_prep_rw(struct aio_kiocb *kiocb, const struct iocb *iocb, + struct aio_submit_state *state) { struct kioctx *ctx = kiocb->ki_ctx; struct kiocb *req = &kiocb->rw; int ret; - req->ki_filp = fget(iocb->aio_fildes); + req->ki_filp = aio_file_get(state, iocb->aio_fildes); if (unlikely(!req->ki_filp)) return -EBADF; req->ki_pos = iocb->aio_offset; @@ -1960,7 +2031,8 @@ static void aio_iopoll_iocb_issued(struct aio_submit_state *state, } static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb, - bool vectored, bool compat) + struct aio_submit_state *state, bool vectored, + bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct kiocb *req = &kiocb->rw; @@ -1968,7 +2040,7 @@ s
[PATCH 17/26] fs: add fget_many() and fput_many()
Some uses cases repeatedly get and put references to the same file, but the only exposed interface is doing these one at the time. As each of these entail an atomic inc or dec on a shared structure, that cost can add up. Add fget_many(), which works just like fget(), except it takes an argument for how many references to get on the file. Ditto fput_many(), which can drop an arbitrary number of references to a file. Signed-off-by: Jens Axboe --- fs/file.c| 15 ++- fs/file_table.c | 10 -- include/linux/file.h | 2 ++ include/linux/fs.h | 3 ++- 4 files changed, 22 insertions(+), 8 deletions(-) diff --git a/fs/file.c b/fs/file.c index 7ffd6e9d103d..ad9870edfd51 100644 --- a/fs/file.c +++ b/fs/file.c @@ -676,7 +676,7 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static struct file *__fget(unsigned int fd, fmode_t mask) +static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs) { struct files_struct *files = current->files; struct file *file; @@ -691,7 +691,7 @@ static struct file *__fget(unsigned int fd, fmode_t mask) */ if (file->f_mode & mask) file = NULL; - else if (!get_file_rcu(file)) + else if (!get_file_rcu_many(file, refs)) goto loop; } rcu_read_unlock(); @@ -699,15 +699,20 @@ static struct file *__fget(unsigned int fd, fmode_t mask) return file; } +struct file *fget_many(unsigned int fd, unsigned int refs) +{ + return __fget(fd, FMODE_PATH, refs); +} + struct file *fget(unsigned int fd) { - return __fget(fd, FMODE_PATH); + return fget_many(fd, 1); } EXPORT_SYMBOL(fget); struct file *fget_raw(unsigned int fd) { - return __fget(fd, 0); + return __fget(fd, 0, 1); } EXPORT_SYMBOL(fget_raw); @@ -738,7 +743,7 @@ static unsigned long __fget_light(unsigned int fd, fmode_t mask) return 0; return (unsigned long)file; } else { - file = __fget(fd, mask); + file = __fget(fd, mask, 1); if (!file) return 0; return FDPUT_FPUT | (unsigned long)file; diff --git a/fs/file_table.c b/fs/file_table.c index 
e49af4caf15d..6a3964df33e4 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -326,9 +326,9 @@ void flush_delayed_fput(void) static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); -void fput(struct file *file) +void fput_many(struct file *file, unsigned int refs) { - if (atomic_long_dec_and_test(&file->f_count)) { + if (atomic_long_sub_and_test(refs, &file->f_count)) { struct task_struct *task = current; if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { @@ -347,6 +347,12 @@ void fput(struct file *file) } } +void fput(struct file *file) +{ + fput_many(file, 1); +} + + /* * synchronous analog of fput(); for kernel threads that might be needed * in some umount() (and thus can't use flush_delayed_fput() without diff --git a/include/linux/file.h b/include/linux/file.h index 6b2fb032416c..3fcddff56bc4 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -13,6 +13,7 @@ struct file; extern void fput(struct file *); +extern void fput_many(struct file *, unsigned int); struct file_operations; struct vfsmount; @@ -44,6 +45,7 @@ static inline void fdput(struct fd fd) } extern struct file *fget(unsigned int fd); +extern struct file *fget_many(unsigned int fd, unsigned int refs); extern struct file *fget_raw(unsigned int fd); extern unsigned long __fdget(unsigned int fd); extern unsigned long __fdget_raw(unsigned int fd); diff --git a/include/linux/fs.h b/include/linux/fs.h index 6a5f71f8ae06..dc54a65c401a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -952,7 +952,8 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) +#define get_file_rcu_many(x, cnt) atomic_long_add_unless(&(x)->f_count, (cnt), 0) +#define get_file_rcu(x) get_file_rcu_many((x), 1) #define fput_atomic(x) atomic_long_add_unless(&(x)->f_count, -1, 1) #define file_count(x) atomic_long_read(&(x)->f_count) -- 2.17.1
[PATCH 06/26] aio: use assigned completion handler
We know this is a read/write request, but in preparation for having different kinds of those, ensure that we call the assigned handler instead of assuming it's aio_complete_rq(). Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/aio.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/aio.c b/fs/aio.c index 05647d352bf3..cf0de61743e8 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1490,7 +1490,7 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret) ret = -EINTR; /*FALLTHRU*/ default: - aio_complete_rw(req, ret, 0); + req->ki_complete(req, ret, 0); } } -- 2.17.1
[PATCH 10/26] aio: use iocb_put() instead of open coding it
Replace the percpu_ref_put() + kmem_cache_free() with a call to iocb_put() instead. Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/aio.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index ed6c3914477a..cf93b92bfb1e 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1884,10 +1884,9 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, goto out_put_req; return 0; out_put_req: - percpu_ref_put(&ctx->reqs); if (req->ki_eventfd) eventfd_ctx_put(req->ki_eventfd); - kmem_cache_free(kiocb_cachep, req); + iocb_put(req); out_put_reqs_available: put_reqs_available(ctx, 1); return ret; -- 2.17.1
[PATCH 09/26] aio: only use blk plugs for > 2 depth submissions
Plugging is meant to optimize submission of a string of IOs, if we don't have more than 2 being submitted, don't bother setting up a plug. Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/aio.c | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 522c04864d82..ed6c3914477a 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -69,6 +69,12 @@ struct aio_ring { struct io_event io_events[0]; }; /* 128 bytes + ring size */ +/* + * Plugging is meant to work with larger batches of IOs. If we don't + * have more than the below, then don't bother setting up a plug. + */ +#define AIO_PLUG_THRESHOLD 2 + #define AIO_RING_PAGES 8 struct kioctx_table { @@ -1919,7 +1925,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr, if (nr > ctx->nr_events) nr = ctx->nr_events; - blk_start_plug(&plug); + if (nr > AIO_PLUG_THRESHOLD) + blk_start_plug(&plug); for (i = 0; i < nr; i++) { struct iocb __user *user_iocb; @@ -1932,7 +1939,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr, if (ret) break; } - blk_finish_plug(&plug); + if (nr > AIO_PLUG_THRESHOLD) + blk_finish_plug(&plug); percpu_ref_put(&ctx->users); return i ? i : ret; @@ -1959,7 +1967,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, if (nr > ctx->nr_events) nr = ctx->nr_events; - blk_start_plug(&plug); + if (nr > AIO_PLUG_THRESHOLD) + blk_start_plug(&plug); for (i = 0; i < nr; i++) { compat_uptr_t user_iocb; @@ -1972,7 +1981,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id, if (ret) break; } - blk_finish_plug(&plug); + if (nr > AIO_PLUG_THRESHOLD) + blk_finish_plug(&plug); percpu_ref_put(&ctx->users); return i ? i : ret; -- 2.17.1
[PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()
It's 192 bytes, fairly substantial. Most items don't need to be cleared, especially not upfront. Clear the ones we do need to clear, and leave the other ones for setup when the iocb is prepared and submitted. Reviewed-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/aio.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index eaceb40e6cf5..522c04864d82 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1009,14 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) { struct aio_kiocb *req; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO); + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); if (unlikely(!req)) return NULL; percpu_ref_get(&ctx->reqs); + req->ki_ctx = ctx; INIT_LIST_HEAD(&req->ki_list); refcount_set(&req->ki_refcnt, 0); - req->ki_ctx = ctx; + req->ki_eventfd = NULL; return req; } @@ -1730,6 +1731,10 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb) if (unlikely(!req->file)) return -EBADF; + req->head = NULL; + req->woken = false; + req->cancelled = false; + apt.pt._qproc = aio_poll_queue_proc; apt.pt._key = req->events; apt.iocb = aiocb; -- 2.17.1
[PATCH 12/26] aio: abstract out io_event filler helper
Signed-off-by: Jens Axboe --- fs/aio.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 06c8bcc72496..173f1f79dc8f 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1063,6 +1063,15 @@ static inline void iocb_put(struct aio_kiocb *iocb) } } +static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, + long res, long res2) +{ + ev->obj = (u64)(unsigned long)iocb->ki_user_iocb; + ev->data = iocb->ki_user_data; + ev->res = res; + ev->res2 = res2; +} + /* aio_complete * Called when the io request on the given iocb is complete. */ @@ -1090,10 +1099,7 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); event = ev_page + pos % AIO_EVENTS_PER_PAGE; - event->obj = (u64)(unsigned long)iocb->ki_user_iocb; - event->data = iocb->ki_user_data; - event->res = res; - event->res2 = res2; + aio_fill_event(event, iocb, res, res2); kunmap_atomic(ev_page); flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]); -- 2.17.1
[PATCH 22/26] block: implement bio helper to add iter bvec pages to bio
For an ITER_BVEC, we can just iterate the iov and add the pages to the bio directly. Signed-off-by: Jens Axboe --- block/bio.c | 27 +++ include/linux/bio.h | 1 + 2 files changed, 28 insertions(+) diff --git a/block/bio.c b/block/bio.c index ab174bce5436..0e466d06f748 100644 --- a/block/bio.c +++ b/block/bio.c @@ -903,6 +903,33 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter) } EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages); +/** + * bio_iov_bvec_add_pages - add pages from an ITER_BVEC to a bio + * @bio: bio to add pages to + * @iter: iov iterator describing the region to be added + * + * Iterate pages in the @iter and add them to the bio. We flag the + * @bio with BIO_HOLD_PAGES, telling IO completion not to free them. + */ +int bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter) +{ + unsigned short orig_vcnt = bio->bi_vcnt; + const struct bio_vec *bv; + + do { + size_t size; + + bv = iter->bvec + iter->iov_offset; + size = bio_add_page(bio, bv->bv_page, bv->bv_len, bv->bv_offset); + if (size != bv->bv_len) + break; + iov_iter_advance(iter, size); + } while (iov_iter_count(iter) && !bio_full(bio)); + + bio_set_flag(bio, BIO_HOLD_PAGES); + return bio->bi_vcnt > orig_vcnt ? 0 : -EINVAL; +} + static void submit_bio_wait_endio(struct bio *bio) { complete(bio->bi_private); diff --git a/include/linux/bio.h b/include/linux/bio.h index 056fb627edb3..39ddb53e7e7f 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -434,6 +434,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, void __bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int off); int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter); +int bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter); struct rq_map_data; extern struct bio *bio_map_user_iov(struct request_queue *, struct iov_iter *, gfp_t); -- 2.17.1
[PATCH 02/26] block: add REQ_HIPRI_ASYNC
For the upcoming async polled IO, we can't sleep allocating requests. If we do, then we introduce a deadlock where the submitter already has async polled IO in-flight, but can't wait for them to complete since polled requests must be actively found and reaped. Signed-off-by: Jens Axboe --- include/linux/blk_types.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index c0ba1a038ff3..1d9c3da0f2a1 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -346,6 +346,7 @@ enum req_flag_bits { #define REQ_NOWAIT (1ULL << __REQ_NOWAIT) #define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP) #define REQ_HIPRI (1ULL << __REQ_HIPRI) +#define REQ_HIPRI_ASYNC (REQ_HIPRI | REQ_NOWAIT) #define REQ_DRV (1ULL << __REQ_DRV) #define REQ_SWAP (1ULL << __REQ_SWAP) -- 2.17.1
[PATCH 21/26] block: add BIO_HOLD_PAGES flag
For user mapped IO, we do get_user_pages() upfront, and then do a put_page() on each page at end_io time to release the page reference. In preparation for having permanently mapped pages, add a BIO_HOLD_PAGES flag that tells us not to release the pages, the caller will do that. Signed-off-by: Jens Axboe --- block/bio.c | 6 -- include/linux/blk_types.h | 1 + 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/block/bio.c b/block/bio.c index 03895cc0d74a..ab174bce5436 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1635,7 +1635,8 @@ static void bio_dirty_fn(struct work_struct *work) next = bio->bi_private; bio_set_pages_dirty(bio); - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); } } @@ -1651,7 +1652,8 @@ void bio_check_pages_dirty(struct bio *bio) goto defer; } - bio_release_pages(bio); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_release_pages(bio); bio_put(bio); return; defer: diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 1d9c3da0f2a1..344e5b61aa4e 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -227,6 +227,7 @@ struct bio { #define BIO_TRACE_COMPLETION 10/* bio_endio() should trace the final completion * of this bio. */ #define BIO_QUEUE_ENTERED 11 /* can use blk_queue_enter_live() */ +#define BIO_HOLD_PAGES 12 /* don't put O_DIRECT pages */ /* See BVEC_POOL_OFFSET below before adding new flags */ -- 2.17.1
[PATCH 19/26] aio: split iocb init from allocation
In preparation for having pre-allocated requests that we then just need to initialize before use. Signed-off-by: Jens Axboe --- fs/aio.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 634b540b0c92..416bb2e365e0 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1099,6 +1099,16 @@ static bool get_reqs_available(struct kioctx *ctx) return __get_reqs_available(ctx); } +static void aio_iocb_init(struct kioctx *ctx, struct aio_kiocb *req) +{ + percpu_ref_get(&ctx->reqs); + req->ki_ctx = ctx; + INIT_LIST_HEAD(&req->ki_list); + req->ki_flags = 0; + refcount_set(&req->ki_refcnt, 0); + req->ki_eventfd = NULL; +} + /* aio_get_req * Allocate a slot for an aio request. * Returns NULL if no requests are free. @@ -,12 +1121,7 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) if (unlikely(!req)) return NULL; - percpu_ref_get(&ctx->reqs); - req->ki_ctx = ctx; - INIT_LIST_HEAD(&req->ki_list); - req->ki_flags = 0; - refcount_set(&req->ki_refcnt, 0); - req->ki_eventfd = NULL; + aio_iocb_init(ctx, req); return req; } -- 2.17.1
[PATCH 20/26] aio: batch aio_kiocb allocation
Similarly to how we use the state->ios_left to know how many references to get to a file, we can use it to allocate the aio_kiocb's we need in bulk. Signed-off-by: Jens Axboe --- fs/aio.c | 47 +++ 1 file changed, 39 insertions(+), 8 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 416bb2e365e0..1735ec556089 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -240,6 +240,8 @@ struct aio_kiocb { }; }; +#define AIO_IOPOLL_BATCH 8 + struct aio_submit_state { struct kioctx *ctx; @@ -254,6 +256,13 @@ struct aio_submit_state { struct list_head req_list; unsigned int req_count; + /* +* aio_kiocb alloc cache +*/ + void *iocbs[AIO_IOPOLL_BATCH]; + unsigned int free_iocbs; + unsigned int cur_iocb; + /* * File reference cache */ @@ -1113,15 +1122,35 @@ static void aio_iocb_init(struct kioctx *ctx, struct aio_kiocb *req) * Allocate a slot for an aio request. * Returns NULL if no requests are free. */ -static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx) +static struct aio_kiocb *aio_get_req(struct kioctx *ctx, +struct aio_submit_state *state) { struct aio_kiocb *req; - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); - if (unlikely(!req)) - return NULL; + if (!state) + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); + else if (!state->free_iocbs) { + size_t size; + + size = min_t(size_t, state->ios_left, ARRAY_SIZE(state->iocbs)); + size = kmem_cache_alloc_bulk(kiocb_cachep, GFP_KERNEL, size, + state->iocbs); + if (size < 0) + return ERR_PTR(size); + else if (!size) + return ERR_PTR(-ENOMEM); + state->free_iocbs = size - 1; + state->cur_iocb = 1; + req = state->iocbs[0]; + } else { + req = state->iocbs[state->cur_iocb]; + state->free_iocbs--; + state->cur_iocb++; + } + + if (req) + aio_iocb_init(ctx, req); - aio_iocb_init(ctx, req); return req; } @@ -1359,8 +1388,6 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr, return ret < 0 || *i >= min_nr; } -#define AIO_IOPOLL_BATCH 8 - /* * Process completed iocb iopoll entries, copying the result to 
userspace. */ @@ -2357,7 +2384,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, return -EAGAIN; ret = -EAGAIN; - req = aio_get_req(ctx); + req = aio_get_req(ctx, state); if (unlikely(!req)) goto out_put_reqs_available; @@ -2489,6 +2516,9 @@ static void aio_submit_state_end(struct aio_submit_state *state) if (!list_empty(&state->req_list)) aio_flush_state_reqs(state->ctx, state); aio_file_put(state); + if (state->free_iocbs) + kmem_cache_free_bulk(kiocb_cachep, state->free_iocbs, + &state->iocbs[state->cur_iocb]); } /* @@ -2500,6 +2530,7 @@ static void aio_submit_state_start(struct aio_submit_state *state, state->ctx = ctx; INIT_LIST_HEAD(&state->req_list); state->req_count = 0; + state->free_iocbs = 0; state->file = NULL; state->ios_left = max_ios; #ifdef CONFIG_BLOCK -- 2.17.1
[PATCH 13/26] aio: add io_setup2() system call
This is just like io_setup(), except add a flags argument to let the caller control/define some of the io_context behavior. Outside of the flags, we add an iocb array and two user pointers for future use. Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/aio.c | 69 -- include/linux/syscalls.h | 3 ++ include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c| 1 + 5 files changed, 52 insertions(+), 26 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index f0b1709a5ffb..67c357225fb0 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -343,6 +343,7 @@ 332common statx __x64_sys_statx 333common io_pgetevents __x64_sys_io_pgetevents 334common rseq__x64_sys_rseq +335common io_setup2 __x64_sys_io_setup2 # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/aio.c b/fs/aio.c index 173f1f79dc8f..26631d6872d2 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -100,6 +100,8 @@ struct kioctx { unsigned long user_id; + unsigned intflags; + struct __percpu kioctx_cpu *cpu; /* @@ -686,10 +688,8 @@ static void aio_nr_sub(unsigned nr) spin_unlock(&aio_nr_lock); } -/* ioctx_alloc - * Allocates and initializes an ioctx. Returns an ERR_PTR if it failed. - */ -static struct kioctx *ioctx_alloc(unsigned nr_events) +static struct kioctx *io_setup_flags(unsigned long ctxid, +unsigned int nr_events, unsigned int flags) { struct mm_struct *mm = current->mm; struct kioctx *ctx; @@ -701,6 +701,12 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) */ unsigned int max_reqs = nr_events; + if (unlikely(ctxid || nr_events == 0)) { + pr_debug("EINVAL: ctx %lu nr_events %u\n", +ctxid, nr_events); + return ERR_PTR(-EINVAL); + } + /* * We keep track of the number of available ringbuffer slots, to prevent * overflow (reqs_available), and we also use percpu counters for this. 
@@ -726,6 +732,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) if (!ctx) return ERR_PTR(-ENOMEM); + ctx->flags = flags; ctx->max_reqs = max_reqs; spin_lock_init(&ctx->ctx_lock); @@ -1281,6 +1288,34 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr, return ret; } +SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *, + iocbs, void __user *, user1, void __user *, user2, + aio_context_t __user *, ctxp) +{ + struct kioctx *ioctx; + unsigned long ctx; + long ret; + + if (flags || user1 || user2) + return -EINVAL; + + ret = get_user(ctx, ctxp); + if (unlikely(ret)) + goto out; + + ioctx = io_setup_flags(ctx, nr_events, flags); + ret = PTR_ERR(ioctx); + if (IS_ERR(ioctx)) + goto out; + + ret = put_user(ioctx->user_id, ctxp); + if (ret) + kill_ioctx(current->mm, ioctx, NULL); + percpu_ref_put(&ioctx->users); +out: + return ret; +} + /* sys_io_setup: * Create an aio_context capable of receiving at least nr_events. * ctxp must not point to an aio_context that already exists, and @@ -1296,7 +1331,7 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr, */ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) { - struct kioctx *ioctx = NULL; + struct kioctx *ioctx; unsigned long ctx; long ret; @@ -1304,14 +1339,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) if (unlikely(ret)) goto out; - ret = -EINVAL; - if (unlikely(ctx || nr_events == 0)) { - pr_debug("EINVAL: ctx %lu nr_events %u\n", -ctx, nr_events); - goto out; - } - - ioctx = ioctx_alloc(nr_events); + ioctx = io_setup_flags(ctx, nr_events, 0); ret = PTR_ERR(ioctx); if (!IS_ERR(ioctx)) { ret = put_user(ioctx->user_id, ctxp); @@ -1327,7 +1355,7 @@ SYSCALL_DEFINE2(io_setup, unsigned, nr_events, aio_context_t __user *, ctxp) #ifdef CONFIG_COMPAT COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p) { - struct kioctx *ioctx = NULL; + struct kioctx *ioctx; unsigned long ctx; long 
ret; @@ -1335,23 +1363,14 @@ COMPAT_SYSCALL_DEFINE2(io_setup, unsigned, nr_events, u32 __user *, ctx32p) if (unlikely(ret)) goto out; - ret = -EINVAL; - if
[PATCH 11/26] aio: split out iocb copy from io_submit_one()
In preparation of handing in iocbs in a different fashion as well. Also make it clear that the iocb being passed in isn't modified, by marking it const throughout. Signed-off-by: Jens Axboe --- fs/aio.c | 68 +++- 1 file changed, 38 insertions(+), 30 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index cf93b92bfb1e..06c8bcc72496 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1420,7 +1420,7 @@ static void aio_complete_rw(struct kiocb *kiocb, long res, long res2) aio_complete(iocb, res, res2); } -static int aio_prep_rw(struct kiocb *req, struct iocb *iocb) +static int aio_prep_rw(struct kiocb *req, const struct iocb *iocb) { int ret; @@ -1461,7 +1461,7 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb) return ret; } -static int aio_setup_rw(int rw, struct iocb *iocb, struct iovec **iovec, +static int aio_setup_rw(int rw, const struct iocb *iocb, struct iovec **iovec, bool vectored, bool compat, struct iov_iter *iter) { void __user *buf = (void __user *)(uintptr_t)iocb->aio_buf; @@ -1500,8 +1500,8 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret) } } -static ssize_t aio_read(struct kiocb *req, struct iocb *iocb, bool vectored, - bool compat) +static ssize_t aio_read(struct kiocb *req, const struct iocb *iocb, + bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; @@ -1533,8 +1533,8 @@ static ssize_t aio_read(struct kiocb *req, struct iocb *iocb, bool vectored, return ret; } -static ssize_t aio_write(struct kiocb *req, struct iocb *iocb, bool vectored, - bool compat) +static ssize_t aio_write(struct kiocb *req, const struct iocb *iocb, +bool vectored, bool compat) { struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; struct iov_iter iter; @@ -1589,7 +1589,8 @@ static void aio_fsync_work(struct work_struct *work) aio_complete(container_of(req, struct aio_kiocb, fsync), ret, 0); } -static int aio_fsync(struct fsync_iocb *req, struct iocb *iocb, bool datasync) +static int 
aio_fsync(struct fsync_iocb *req, const struct iocb *iocb, +bool datasync) { if (unlikely(iocb->aio_buf || iocb->aio_offset || iocb->aio_nbytes || iocb->aio_rw_flags)) @@ -1717,7 +1718,7 @@ aio_poll_queue_proc(struct file *file, struct wait_queue_head *head, add_wait_queue(head, &pt->iocb->poll.wait); } -static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb) +static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) { struct kioctx *ctx = aiocb->ki_ctx; struct poll_iocb *req = &aiocb->poll; @@ -1789,27 +1790,23 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb) return 0; } -static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, -bool compat) +static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, + struct iocb __user *user_iocb, bool compat) { struct aio_kiocb *req; - struct iocb iocb; ssize_t ret; - if (unlikely(copy_from_user(&iocb, user_iocb, sizeof(iocb - return -EFAULT; - /* enforce forwards compatibility on users */ - if (unlikely(iocb.aio_reserved2)) { + if (unlikely(iocb->aio_reserved2)) { pr_debug("EINVAL: reserve field set\n"); return -EINVAL; } /* prevent overflows */ if (unlikely( - (iocb.aio_buf != (unsigned long)iocb.aio_buf) || - (iocb.aio_nbytes != (size_t)iocb.aio_nbytes) || - ((ssize_t)iocb.aio_nbytes < 0) + (iocb->aio_buf != (unsigned long)iocb->aio_buf) || + (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) || + ((ssize_t)iocb->aio_nbytes < 0) )) { pr_debug("EINVAL: overflow check\n"); return -EINVAL; @@ -1823,14 +1820,14 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, if (unlikely(!req)) goto out_put_reqs_available; - if (iocb.aio_flags & IOCB_FLAG_RESFD) { + if (iocb->aio_flags & IOCB_FLAG_RESFD) { /* * If the IOCB_FLAG_RESFD flag of aio_flags is set, get an * instance of the file* now. The file descriptor must be * an eventfd() fd, and will be signaled for each completed * event using the eventfd_signal() function. 
*/ - req->ki_eventfd = eventfd_ctx_fdget((int) iocb.aio_resfd); + req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); if (IS_ERR(req->ki_eventfd)) { ret = PTR_ERR(req->ki_eventf
[PATCH 04/26] block: use REQ_HIPRI_ASYNC for non-sync polled IO
Tell the block layer if it's a sync or async polled request, so it can do the right thing. Signed-off-by: Jens Axboe --- fs/block_dev.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index 6de8d35f6e41..b8f574615792 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -402,8 +402,12 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) nr_pages = iov_iter_npages(iter, BIO_MAX_PAGES); if (!nr_pages) { - if (iocb->ki_flags & IOCB_HIPRI) - bio->bi_opf |= REQ_HIPRI; + if (iocb->ki_flags & IOCB_HIPRI) { + if (!is_sync) + bio->bi_opf |= REQ_HIPRI_ASYNC; + else + bio->bi_opf |= REQ_HIPRI; + } qc = submit_bio(bio); WRITE_ONCE(iocb->ki_cookie, qc); -- 2.17.1
[PATCH 15/26] aio: support for IO polling
Add polled variants of PREAD/PREADV and PWRITE/PWRITEV. These act like their non-polled counterparts, except we expect to poll for completion of them. The polling happens at io_getevent() time, and works just like non-polled IO. To setup an io_context for polled IO, the application must call io_setup2() with IOCTX_FLAG_IOPOLL as one of the flags. It is illegal to mix and match polled and non-polled IO on an io_context. Polled IO doesn't support the user mapped completion ring. Events must be reaped through the io_getevents() system call. For non-irq driven poll devices, there's no way to support completion reaping from userspace by just looking at the ring. The application itself is the one that pulls completion entries. Signed-off-by: Jens Axboe --- fs/aio.c | 393 +++ include/uapi/linux/aio_abi.h | 3 + 2 files changed, 360 insertions(+), 36 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index bb6f07ca6940..5d34317c2929 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -153,6 +153,18 @@ struct kioctx { atomic_treqs_available; } cacheline_aligned_in_smp; + /* iopoll submission state */ + struct { + spinlock_t poll_lock; + struct list_head poll_submitted; + } cacheline_aligned_in_smp; + + /* iopoll completion state */ + struct { + struct list_head poll_completing; + struct mutex getevents_lock; + } cacheline_aligned_in_smp; + struct { spinlock_t ctx_lock; struct list_head active_reqs; /* used for cancellation */ @@ -205,14 +217,27 @@ struct aio_kiocb { __u64 ki_user_data; /* user's data for completion */ struct list_headki_list;/* the aio core uses this -* for cancellation */ +* for cancellation, or for +* polled IO */ + + unsigned long ki_flags; +#define IOCB_POLL_COMPLETED0 /* polled IO has completed */ +#define IOCB_POLL_EAGAIN 1 /* polled submission got EAGAIN */ + refcount_t ki_refcnt; - /* -* If the aio_resfd field of the userspace iocb is not zero, -* this is the underlying eventfd context to deliver events to. 
-*/ - struct eventfd_ctx *ki_eventfd; + union { + /* +* If the aio_resfd field of the userspace iocb is not zero, +* this is the underlying eventfd context to deliver events to. +*/ + struct eventfd_ctx *ki_eventfd; + + /* +* For polled IO, stash completion info here +*/ + struct io_event ki_ev; + }; }; /*-- sysctl variables*/ @@ -233,6 +258,7 @@ static const unsigned int iocb_page_shift = ilog2(PAGE_SIZE / sizeof(struct iocb)); static void aio_useriocb_unmap(struct kioctx *); +static void aio_iopoll_reap_events(struct kioctx *); static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) { @@ -471,11 +497,15 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events) int i; struct file *file; - /* Compensate for the ring buffer's head/tail overlap entry */ - nr_events += 2; /* 1 is required, 2 for good luck */ - + /* +* Compensate for the ring buffer's head/tail overlap entry. +* IO polling doesn't require any io event entries +*/ size = sizeof(struct aio_ring); - size += sizeof(struct io_event) * nr_events; + if (!(ctx->flags & IOCTX_FLAG_IOPOLL)) { + nr_events += 2; /* 1 is required, 2 for good luck */ + size += sizeof(struct io_event) * nr_events; + } nr_pages = PFN_UP(size); if (nr_pages < 0) @@ -758,6 +788,11 @@ static struct kioctx *io_setup_flags(unsigned long ctxid, INIT_LIST_HEAD(&ctx->active_reqs); + spin_lock_init(&ctx->poll_lock); + INIT_LIST_HEAD(&ctx->poll_submitted); + INIT_LIST_HEAD(&ctx->poll_completing); + mutex_init(&ctx->getevents_lock); + if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL)) goto err; @@ -829,11 +864,15 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx, { struct kioctx_table *table; + mutex_lock(&ctx->getevents_lock); spin_lock(&mm->ioctx_lock); if (atomic_xchg(&ctx->dead, 1)) { spin_unlock(&mm->ioctx_lock); + mutex_unlock(&ctx->getevents_lock); return -EINVAL; } + aio_iopoll_reap_events(ctx); + mutex_unlock(&ctx->getevents_lock); table = rcu_dereference_raw(mm->ioctx_
[PATCH 14/26] aio: add support for having user mapped iocbs
For io_submit(), we have to first copy each pointer to an iocb, then copy the iocb. The latter is 64 bytes in size, and that's a lot of copying for a single IO. Add support for setting IOCTX_FLAG_USERIOCB through the new io_setup2() system call, which allows the iocbs to reside in userspace. If this flag is used, then io_submit() doesn't take pointers to iocbs anymore, it takes an index value into the array of iocbs instead. Similary, for io_getevents(), the iocb ->obj will be the index, not the pointer to the iocb. See the change made to fio to support this feature, it's pretty trivialy to adapt to. For applications, like fio, that previously embedded the iocb inside a application private structure, some sort of lookup table/structure is needed to find the private IO structure from the index at io_getevents() time. http://git.kernel.dk/cgit/fio/commit/?id=3c3168e91329c83880c91e5abc28b9d6b940fd95 Signed-off-by: Jens Axboe --- fs/aio.c | 126 +++ include/uapi/linux/aio_abi.h | 2 + 2 files changed, 116 insertions(+), 12 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 26631d6872d2..bb6f07ca6940 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -92,6 +92,11 @@ struct ctx_rq_wait { atomic_t count; }; +struct aio_mapped_range { + struct page **pages; + long nr_pages; +}; + struct kioctx { struct percpu_ref users; atomic_tdead; @@ -127,6 +132,8 @@ struct kioctx { struct page **ring_pages; longnr_pages; + struct aio_mapped_range iocb_range; + struct rcu_work free_rwork; /* see free_ioctx() */ /* @@ -222,6 +229,11 @@ static struct vfsmount *aio_mnt; static const struct file_operations aio_ring_fops; static const struct address_space_operations aio_ctx_aops; +static const unsigned int iocb_page_shift = + ilog2(PAGE_SIZE / sizeof(struct iocb)); + +static void aio_useriocb_unmap(struct kioctx *); + static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages) { struct file *file; @@ -578,6 +590,7 @@ static void free_ioctx(struct work_struct *work) free_rwork); 
pr_debug("freeing %p\n", ctx); + aio_useriocb_unmap(ctx); aio_free_ring(ctx); free_percpu(ctx->cpu); percpu_ref_exit(&ctx->reqs); @@ -1288,6 +1301,70 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr, return ret; } +static struct iocb *aio_iocb_from_index(struct kioctx *ctx, int index) +{ + struct iocb *iocb; + + iocb = page_address(ctx->iocb_range.pages[index >> iocb_page_shift]); + index &= ((1 << iocb_page_shift) - 1); + return iocb + index; +} + +static void aio_unmap_range(struct aio_mapped_range *range) +{ + int i; + + if (!range->nr_pages) + return; + + for (i = 0; i < range->nr_pages; i++) + put_page(range->pages[i]); + + kfree(range->pages); + range->pages = NULL; + range->nr_pages = 0; +} + +static int aio_map_range(struct aio_mapped_range *range, void __user *uaddr, +size_t size, int gup_flags) +{ + int nr_pages, ret; + + if ((unsigned long) uaddr & ~PAGE_MASK) + return -EINVAL; + + nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; + + range->pages = kzalloc(nr_pages * sizeof(struct page *), GFP_KERNEL); + if (!range->pages) + return -ENOMEM; + + down_write(¤t->mm->mmap_sem); + ret = get_user_pages((unsigned long) uaddr, nr_pages, gup_flags, + range->pages, NULL); + up_write(¤t->mm->mmap_sem); + + if (ret < nr_pages) { + kfree(range->pages); + return -ENOMEM; + } + + range->nr_pages = nr_pages; + return 0; +} + +static void aio_useriocb_unmap(struct kioctx *ctx) +{ + aio_unmap_range(&ctx->iocb_range); +} + +static int aio_useriocb_map(struct kioctx *ctx, struct iocb __user *iocbs) +{ + size_t size = sizeof(struct iocb) * ctx->max_reqs; + + return aio_map_range(&ctx->iocb_range, iocbs, size, 0); +} + SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *, iocbs, void __user *, user1, void __user *, user2, aio_context_t __user *, ctxp) @@ -1296,7 +1373,9 @@ SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *, unsigned long ctx; long ret; - if (flags || user1 || user2) + if (user1 || user2) + 
return -EINVAL; + if (flags & ~IOCTX_FLAG_USERIOCB) return -EINVAL; ret = get_user(ctx, ctxp); @@ -1308,9 +1387,17 @@ SYSCALL_DEFINE6(io_setup2, u32, nr_events, u32, flags, struct iocb __user *, if (IS_ERR(ioctx)) goto out; + if (flags
[PATCH 16/26] aio: add submission side request cache
We have to add each submitted polled request to the io_context poll_submitted list, which means we have to grab the poll_lock. We already use the block plug to batch submissions if we're doing a batch of IO submissions, extend that to cover the poll requests internally as well. Signed-off-by: Jens Axboe --- fs/aio.c | 136 +-- 1 file changed, 113 insertions(+), 23 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 5d34317c2929..ae0805dc814c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -240,6 +240,21 @@ struct aio_kiocb { }; }; +struct aio_submit_state { + struct kioctx *ctx; + + struct blk_plug plug; +#ifdef CONFIG_BLOCK + struct blk_plug_cb plug_cb; +#endif + + /* +* Polled iocbs that have been submitted, but not added to the ctx yet +*/ + struct list_head req_list; + unsigned int req_count; +}; + /*-- sysctl variables*/ static DEFINE_SPINLOCK(aio_nr_lock); unsigned long aio_nr; /* current system wide number of aio requests */ @@ -257,6 +272,15 @@ static const struct address_space_operations aio_ctx_aops; static const unsigned int iocb_page_shift = ilog2(PAGE_SIZE / sizeof(struct iocb)); +/* + * We rely on block level unplugs to flush pending requests, if we schedule + */ +#ifdef CONFIG_BLOCK +static const bool aio_use_state_req_list = true; +#else +static const bool aio_use_state_req_list = false; +#endif + static void aio_useriocb_unmap(struct kioctx *); static void aio_iopoll_reap_events(struct kioctx *); @@ -1887,13 +1911,28 @@ static inline void aio_rw_done(struct kiocb *req, ssize_t ret) } } +/* + * Called either at the end of IO submission, or through a plug callback + * because we're going to schedule. Moves out local batch of requests to + * the ctx poll list, so they can be found for polling + reaping. 
+ */ +static void aio_flush_state_reqs(struct kioctx *ctx, +struct aio_submit_state *state) +{ + spin_lock(&ctx->poll_lock); + list_splice_tail_init(&state->req_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + state->req_count = 0; +} + /* * After the iocb has been issued, it's safe to be found on the poll list. * Adding the kiocb to the list AFTER submission ensures that we don't * find it from a io_getevents() thread before the issuer is done accessing * the kiocb cookie. */ -static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb) +static void aio_iopoll_iocb_issued(struct aio_submit_state *state, + struct aio_kiocb *kiocb) { /* * For fast devices, IO may have already completed. If it has, add @@ -1903,12 +1942,21 @@ static void aio_iopoll_iocb_issued(struct aio_kiocb *kiocb) const int front_add = test_bit(IOCB_POLL_COMPLETED, &kiocb->ki_flags); struct kioctx *ctx = kiocb->ki_ctx; - spin_lock(&ctx->poll_lock); - if (front_add) - list_add(&kiocb->ki_list, &ctx->poll_submitted); - else - list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); - spin_unlock(&ctx->poll_lock); + if (!state || !aio_use_state_req_list) { + spin_lock(&ctx->poll_lock); + if (front_add) + list_add(&kiocb->ki_list, &ctx->poll_submitted); + else + list_add_tail(&kiocb->ki_list, &ctx->poll_submitted); + spin_unlock(&ctx->poll_lock); + } else { + if (front_add) + list_add(&kiocb->ki_list, &state->req_list); + else + list_add_tail(&kiocb->ki_list, &state->req_list); + if (++state->req_count >= AIO_IOPOLL_BATCH) + aio_flush_state_reqs(ctx, state); + } } static ssize_t aio_read(struct aio_kiocb *kiocb, const struct iocb *iocb, @@ -2204,7 +2252,8 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, const struct iocb *iocb) } static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, - struct iocb __user *user_iocb, bool compat) + struct iocb __user *user_iocb, + struct aio_submit_state *state, bool compat) { struct aio_kiocb *req; ssize_t ret; @@ -2308,7 +2357,7 @@ 
static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, ret = -EAGAIN; goto out_put_req; } - aio_iopoll_iocb_issued(req); + aio_iopoll_iocb_issued(state, req); } return 0; out_put_req: @@ -2322,7 +2371,7 @@ static int __io_submit_one(struct kioctx *ctx, const struct iocb *iocb, } static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, -bool compat) +struct a
[PATCH 03/26] block: wire up block device iopoll method
From: Christoph Hellwig Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- fs/block_dev.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/block_dev.c b/fs/block_dev.c index e1886cc7048f..6de8d35f6e41 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -281,6 +281,14 @@ struct blkdev_dio { static struct bio_set blkdev_dio_pool; +static int blkdev_iopoll(struct kiocb *kiocb, bool wait) +{ + struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host); + struct request_queue *q = bdev_get_queue(bdev); + + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); +} + static void blkdev_bio_end_io(struct bio *bio) { struct blkdev_dio *dio = bio->bi_private; @@ -398,6 +406,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_opf |= REQ_HIPRI; qc = submit_bio(bio); + WRITE_ONCE(iocb->ki_cookie, qc); break; } @@ -2070,6 +2079,7 @@ const struct file_operations def_blk_fops = { .llseek = block_llseek, .read_iter = blkdev_read_iter, .write_iter = blkdev_write_iter, + .iopoll = blkdev_iopoll, .mmap = generic_file_mmap, .fsync = blkdev_fsync, .unlocked_ioctl = block_ioctl, -- 2.17.1
[PATCH 01/26] fs: add an iopoll method to struct file_operations
From: Christoph Hellwig This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct kiocb to store the polling cookie. TODO: we can probably union ki_cookie with the existing hint and I/O priority fields to avoid struct kiocb growth. Reviewed-by: Johannes Thumshirn Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe --- Documentation/filesystems/vfs.txt | 3 +++ include/linux/fs.h | 2 ++ 2 files changed, 5 insertions(+) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 5f71a252e2e0..d9dc5e4d82b9 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -857,6 +857,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll)(struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -902,6 +903,8 @@ otherwise noted.
write_iter: possibly asynchronous write with iov_iter as source + iopoll: called when aio wants to poll for completions on HIPRI iocbs + iterate: called when the VFS needs to read the directory contents iterate_shared: called when the VFS needs to read the directory contents diff --git a/include/linux/fs.h b/include/linux/fs.h index a1ab233e6469..6a5f71f8ae06 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -310,6 +310,7 @@ struct kiocb { int ki_flags; u16 ki_hint; u16 ki_ioprio; /* See linux/ioprio.h */ + unsigned int ki_cookie; /* for ->iopoll */ } __randomize_layout; static inline bool is_sync_kiocb(struct kiocb *kiocb) @@ -1781,6 +1782,7 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll)(struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); -- 2.17.1
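[Editor's illustration] The contract above — the submission path records a per-iocb polling cookie, and ->iopoll later uses that cookie to ask the device queue whether the matching I/O has completed — can be sketched as a minimal userspace model. All names here are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace model of the ->iopoll contract: submission stores a
 * polling cookie in the kiocb (mirroring WRITE_ONCE(iocb->ki_cookie, qc)
 * in the block_dev.c patch), and the poll side reads it back to find the
 * request. Names prefixed model_ are hypothetical. */

struct model_kiocb {
	unsigned int ki_cookie;	/* set at submission, read at poll time */
	bool submitted;
};

/* Fake "hardware queue": completion state indexed by cookie/tag. */
#define MODEL_QUEUE_DEPTH 8
static bool model_completed[MODEL_QUEUE_DEPTH];

/* Submission path: pick a tag and record it as the polling cookie. */
static void model_submit(struct model_kiocb *kiocb, unsigned int tag)
{
	kiocb->ki_cookie = tag;
	kiocb->submitted = true;
}

/* Device-side completion, normally driven by polling the hardware. */
static void model_complete(unsigned int tag)
{
	model_completed[tag % MODEL_QUEUE_DEPTH] = true;
}

/* The ->iopoll analogue: poll the queue for the stored cookie.
 * Returns nonzero if the identified request has completed. */
static int model_iopoll(const struct model_kiocb *kiocb)
{
	if (!kiocb->submitted)
		return 0;
	return model_completed[kiocb->ki_cookie % MODEL_QUEUE_DEPTH] ? 1 : 0;
}
```

This is only the shape of the handshake; in the real patch the cookie is a blk_qc_t returned by submit_bio() and the poll side is blk_poll() on the bdev's request queue.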
[PATCH 25/26] aio: split old ring complete out from aio_complete()
Signed-off-by: Jens Axboe --- fs/aio.c | 17 - 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 1c8a8bb631a9..39aaffd6d436 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1218,12 +1218,9 @@ static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, ev->res2 = res2; } -/* aio_complete - * Called when the io request on the given iocb is complete. - */ -static void aio_complete(struct aio_kiocb *iocb, long res, long res2) +static void aio_ring_complete(struct kioctx *ctx, struct aio_kiocb *iocb, + long res, long res2) { - struct kioctx *ctx = iocb->ki_ctx; struct aio_ring *ring; struct io_event *ev_page, *event; unsigned tail, pos, head; @@ -1273,6 +1270,16 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) spin_unlock_irqrestore(&ctx->completion_lock, flags); pr_debug("added to ring %p at [%u]\n", iocb, tail); +} + +/* aio_complete + * Called when the io request on the given iocb is complete. + */ +static void aio_complete(struct aio_kiocb *iocb, long res, long res2) +{ + struct kioctx *ctx = iocb->ki_ctx; + + aio_ring_complete(ctx, iocb, res, res2); /* * Check if the user asked us to deliver the result through an -- 2.17.1
[PATCH 23/26] fs: add support for mapping an ITER_BVEC for O_DIRECT
This adds support for sync/async O_DIRECT to make a bvec type iter for bdev access, as well as iomap. Signed-off-by: Jens Axboe --- fs/block_dev.c | 16 fs/iomap.c | 5 - 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/fs/block_dev.c b/fs/block_dev.c index b8f574615792..236c6abe649d 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -219,7 +219,10 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct iov_iter *iter, bio.bi_end_io = blkdev_bio_end_io_simple; bio.bi_ioprio = iocb->ki_ioprio; - ret = bio_iov_iter_get_pages(&bio, iter); + if (iov_iter_is_bvec(iter)) + ret = bio_iov_bvec_add_pages(&bio, iter); + else + ret = bio_iov_iter_get_pages(&bio, iter); if (unlikely(ret)) goto out; ret = bio.bi_iter.bi_size; @@ -326,8 +329,9 @@ static void blkdev_bio_end_io(struct bio *bio) struct bio_vec *bvec; int i; - bio_for_each_segment_all(bvec, bio, i) - put_page(bvec->bv_page); + if (!bio_flagged(bio, BIO_HOLD_PAGES)) + bio_for_each_segment_all(bvec, bio, i) + put_page(bvec->bv_page); bio_put(bio); } } @@ -381,7 +385,11 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages) bio->bi_end_io = blkdev_bio_end_io; bio->bi_ioprio = iocb->ki_ioprio; - ret = bio_iov_iter_get_pages(bio, iter); + if (iov_iter_is_bvec(iter)) + ret = bio_iov_bvec_add_pages(bio, iter); + else + ret = bio_iov_iter_get_pages(bio, iter); + if (unlikely(ret)) { bio->bi_status = BLK_STS_IOERR; bio_endio(bio); diff --git a/fs/iomap.c b/fs/iomap.c index bd483fcb7b5a..f00e5e198c57 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -1673,7 +1673,10 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length, bio->bi_private = dio; bio->bi_end_io = iomap_dio_bio_end_io; - ret = bio_iov_iter_get_pages(bio, &iter); + if (iov_iter_is_bvec(&iter)) + ret = bio_iov_bvec_add_pages(bio, &iter); + else + ret = bio_iov_iter_get_pages(bio, &iter); if (unlikely(ret)) { /* * We have to stop part way through an IO. We must fall -- 2.17.1
[PATCH 24/26] aio: add support for pre-mapped user IO buffers
If we have fixed user buffers, we can map them into the kernel when we setup the io_context. That avoids the need to do get_user_pages() for each and every IO. To utilize this feature, the application must set both IOCTX_FLAG_USERIOCB, to provide iocb's in userspace, and then IOCTX_FLAG_FIXEDBUFS. The latter tells aio that the iocbs that are mapped already contain valid destination and sizes. These buffers can then be mapped into the kernel for the lifetime of the io_context, as opposed to just the duration of each single IO. Only works with non-vectored read/write commands for now, not with PREADV/PWRITEV. A limit of 4M is imposed as the largest buffer we currently support. There's nothing preventing us from going larger, but we need some cap, and 4M seemed like it would definitely be big enough. RLIMIT_MEMLOCK is used to cap the total amount of memory pinned. See the fio change for how to utilize this feature: http://git.kernel.dk/cgit/fio/commit/?id=2041bd343da1c1e955253f62374588718c64f0f3 Signed-off-by: Jens Axboe --- fs/aio.c | 193 --- include/uapi/linux/aio_abi.h | 1 + 2 files changed, 177 insertions(+), 17 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 1735ec556089..1c8a8bb631a9 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -97,6 +98,11 @@ struct aio_mapped_range { long nr_pages; }; +struct aio_mapped_ubuf { + struct bio_vec *bvec; + unsigned int nr_bvecs; +}; + struct kioctx { struct percpu_ref users; atomic_t dead; @@ -132,6 +138,8 @@ struct kioctx { struct page **ring_pages; long nr_pages; + struct aio_mapped_ubuf *user_bufs; + struct aio_mapped_range iocb_range; struct rcu_work free_rwork; /* see free_ioctx() */ @@ -301,6 +309,7 @@ static const bool aio_use_state_req_list = false; static void aio_useriocb_unmap(struct kioctx *); static void aio_iopoll_reap_events(struct kioctx *); +static void aio_iocb_buffer_unmap(struct kioctx *); static struct file *aio_private_file(struct 
kioctx *ctx, loff_t nr_pages) { @@ -662,6 +671,7 @@ static void free_ioctx(struct work_struct *work) free_rwork); pr_debug("freeing %p\n", ctx); + aio_iocb_buffer_unmap(ctx); aio_useriocb_unmap(ctx); aio_free_ring(ctx); free_percpu(ctx->cpu); @@ -1672,6 +1682,122 @@ static int aio_useriocb_map(struct kioctx *ctx, struct iocb __user *iocbs) return aio_map_range(&ctx->iocb_range, iocbs, size, 0); } +static void aio_iocb_buffer_unmap(struct kioctx *ctx) +{ + int i, j; + + if (!ctx->user_bufs) + return; + + for (i = 0; i < ctx->max_reqs; i++) { + struct aio_mapped_ubuf *amu = &ctx->user_bufs[i]; + + for (j = 0; j < amu->nr_bvecs; j++) + put_page(amu->bvec[j].bv_page); + + kfree(amu->bvec); + amu->nr_bvecs = 0; + } + + kfree(ctx->user_bufs); + ctx->user_bufs = NULL; +} + +static int aio_iocb_buffer_map(struct kioctx *ctx) +{ + unsigned long total_pages, page_limit; + struct page **pages = NULL; + int i, j, got_pages = 0; + struct iocb *iocb; + int ret = -EINVAL; + + ctx->user_bufs = kzalloc(ctx->max_reqs * sizeof(struct aio_mapped_ubuf), + GFP_KERNEL); + if (!ctx->user_bufs) + return -ENOMEM; + + /* Don't allow more pages than we can safely lock */ + total_pages = 0; + page_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + + for (i = 0; i < ctx->max_reqs; i++) { + struct aio_mapped_ubuf *amu = &ctx->user_bufs[i]; + unsigned long off, start, end, ubuf; + int pret, nr_pages; + size_t size; + + iocb = aio_iocb_from_index(ctx, i); + + /* +* Don't impose further limits on the size and buffer +* constraints here, we'll -EINVAL later when IO is +* submitted if they are wrong. 
+*/ + ret = -EFAULT; + if (!iocb->aio_buf) + goto err; + + /* arbitrary limit, but we need something */ + if (iocb->aio_nbytes > SZ_4M) + goto err; + + ubuf = iocb->aio_buf; + end = (ubuf + iocb->aio_nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = ubuf >> PAGE_SHIFT; + nr_pages = end - start; + + ret = -ENOMEM; + if (total_pages + nr_pages > page_limit) + goto err; + + if (!pages || nr_pages > got_pages) { + kfree(pages); +
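[Editor's illustration] The page accounting in aio_iocb_buffer_map() above — round the buffer end up and the start down to page boundaries, then charge the difference against the RLIMIT_MEMLOCK-derived page limit — can be checked in isolation. This is a userspace sketch assuming 4K pages (PAGE_SHIFT = 12); the model_ names are hypothetical:

```c
#include <assert.h>

/* Sketch of the pinned-buffer page math from aio_iocb_buffer_map():
 *   start = ubuf >> PAGE_SHIFT;
 *   end   = (ubuf + nbytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
 *   nr_pages = end - start;
 * assuming 4K pages for this userspace model. */

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

static unsigned long model_buf_nr_pages(unsigned long ubuf, unsigned long nbytes)
{
	unsigned long start = ubuf >> MODEL_PAGE_SHIFT;
	unsigned long end = (ubuf + nbytes + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

	return end - start;
}

/* The RLIMIT_MEMLOCK check is then a running total: reject a buffer if
 * total_pages + nr_pages would exceed the limit (the kernel returns
 * -ENOMEM; -1 stands in for that here). */
static int model_charge_pages(unsigned long *total, unsigned long nr,
			      unsigned long limit)
{
	if (*total + nr > limit)
		return -1;
	*total += nr;
	return 0;
}
```

Note how a buffer that merely straddles a page boundary costs two pages even when it is much smaller than one page, which is exactly why the running total is charged in pages rather than bytes.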
[PATCH 26/26] aio: add support for submission/completion rings
Experimental support for submitting and completing IO through rings shared between the application and kernel. The submission rings are struct iocb, like we would submit through io_submit(), and the completion rings are struct io_event, like we would pass in (and copy back) from io_getevents(). A new system call is added for this, io_ring_enter(). This system call submits IO that is queued in the SQ ring, and/or completes IO and stores the results in the CQ ring. This could be augmented with a kernel thread that does the submission and polling, then the application would never have to enter the kernel to do IO. Sample application: http://brick.kernel.dk/snaps/aio-ring.c Signed-off-by: Jens Axboe --- arch/x86/entry/syscalls/syscall_64.tbl | 1 + fs/aio.c | 312 +++-- include/linux/syscalls.h | 4 +- include/uapi/linux/aio_abi.h | 26 +++ 4 files changed, 323 insertions(+), 20 deletions(-) diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 67c357225fb0..55a26700a637 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -344,6 +344,7 @@ 333common io_pgetevents __x64_sys_io_pgetevents 334common rseq__x64_sys_rseq 335common io_setup2 __x64_sys_io_setup2 +336common io_ring_enter __x64_sys_io_ring_enter # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/fs/aio.c b/fs/aio.c index 39aaffd6d436..6024c6943d7d 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -142,6 +142,10 @@ struct kioctx { struct aio_mapped_range iocb_range; + /* if used, completion and submission rings */ + struct aio_mapped_range sq_ring; + struct aio_mapped_range cq_ring; + struct rcu_work free_rwork; /* see free_ioctx() */ /* @@ -297,6 +301,8 @@ static const struct address_space_operations aio_ctx_aops; static const unsigned int iocb_page_shift = ilog2(PAGE_SIZE / sizeof(struct iocb)); +static const unsigned int event_page_shift = + ilog2(PAGE_SIZE / sizeof(struct io_event)); /* * We rely on 
block level unplugs to flush pending requests, if we schedule @@ -307,6 +313,7 @@ static const bool aio_use_state_req_list = true; static const bool aio_use_state_req_list = false; #endif +static void aio_scqring_unmap(struct kioctx *); static void aio_useriocb_unmap(struct kioctx *); static void aio_iopoll_reap_events(struct kioctx *); static void aio_iocb_buffer_unmap(struct kioctx *); @@ -673,6 +680,7 @@ static void free_ioctx(struct work_struct *work) aio_iocb_buffer_unmap(ctx); aio_useriocb_unmap(ctx); + aio_scqring_unmap(ctx); aio_free_ring(ctx); free_percpu(ctx->cpu); percpu_ref_exit(&ctx->reqs); @@ -1218,6 +1226,47 @@ static void aio_fill_event(struct io_event *ev, struct aio_kiocb *iocb, ev->res2 = res2; } +static struct io_event *__aio_get_cqring_ev(struct aio_io_event_ring *ring, + struct aio_mapped_range *range, + unsigned *next_tail) +{ + struct io_event *ev; + unsigned tail; + + smp_rmb(); + tail = READ_ONCE(ring->tail); + *next_tail = tail + 1; + if (*next_tail == ring->nr_events) + *next_tail = 0; + if (*next_tail == READ_ONCE(ring->head)) + return NULL; + + /* io_event array starts offset one into the mapped range */ + tail++; + ev = page_address(range->pages[tail >> event_page_shift]); + tail &= ((1 << event_page_shift) - 1); + return ev + tail; +} + +static void aio_commit_cqring(struct kioctx *ctx, unsigned next_tail) +{ + struct aio_io_event_ring *ring; + + ring = page_address(ctx->cq_ring.pages[0]); + if (next_tail != ring->tail) { + ring->tail = next_tail; + smp_wmb(); + } +} + +static struct io_event *aio_peek_cqring(struct kioctx *ctx, unsigned *ntail) +{ + struct aio_io_event_ring *ring; + + ring = page_address(ctx->cq_ring.pages[0]); + return __aio_get_cqring_ev(ring, &ctx->cq_ring, ntail); +} + static void aio_ring_complete(struct kioctx *ctx, struct aio_kiocb *iocb, long res, long res2) { @@ -1279,7 +1328,17 @@ static void aio_complete(struct aio_kiocb *iocb, long res, long res2) { struct kioctx *ctx = iocb->ki_ctx; - 
aio_ring_complete(ctx, iocb, res, res2); + if (ctx->flags & IOCTX_FLAG_SCQRING) { + struct io_event *ev; + unsigned int tail; + + /* Can't fail, we have a ring reservation */ + ev = aio_peek_cqring(ctx, &tail); + aio_fill_event(ev, iocb, res, res2); + ai
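[Editor's illustration] The space check in __aio_get_cqring_ev() quoted above follows the classic single-producer ring convention: advance the tail with wraparound, and treat the ring as full when the advanced tail would collide with the head, so one slot always stays unused. A minimal userspace model of just that index logic (names are illustrative, not the kernel structs):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the CQ-ring space check in __aio_get_cqring_ev():
 * compute next_tail with wraparound and refuse a slot when it equals
 * head. The page_address()/event_page_shift indexing that locates the
 * actual io_event is elided here. */

struct model_ring {
	unsigned int head;	/* consumer index */
	unsigned int tail;	/* producer index */
	unsigned int nr_events;	/* ring size */
};

/* Returns true and yields the advanced tail if there is room for one
 * more completion event, mirroring the *next_tail logic in the patch. */
static bool model_ring_get_slot(const struct model_ring *ring,
				unsigned int *next_tail)
{
	unsigned int nt = ring->tail + 1;

	if (nt == ring->nr_events)
		nt = 0;			/* wrap */
	if (nt == ring->head)
		return false;		/* full: would collide with head */
	*next_tail = nt;
	return true;
}
```

The producer only publishes the new tail afterwards (aio_commit_cqring() with an smp_wmb()), so the consumer never observes a slot before its event has been filled in.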
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm > dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of > 'none'") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct > blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > + /* > + * If direct dispatch fails, we cannot allow any merging on > + * this IO. Drivers (like SCSI) may have set up permanent state > + * for this request, like SG tables and mappings, and if we > + * merge to it later on then we'll still only do IO to the > + * original part. 
> + */ > + rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break; > Not sure it is enough to just mark it as NOMERGE, for example, driver may have setup the .special_vec for discard, and NOMERGE may not prevent request from entering elevator queue completely. Cause 'rq.rb_node' and 'rq.special_vec' share same space. So how about inserting this request via blk_mq_request_bypass_insert() in case that direct issue returns BUSY? Then it is invariant that any request queued via .queue_rq() won't enter scheduler queue. -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..4b2db0b2909e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (bypass_insert) return BLK_STS_RESOURCE; - blk_mq_sched_insert_request(rq, false, run_queue, false); + blk_mq_request_bypass_insert(rq, run_queue); return BLK_STS_OK; } @@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) - blk_mq_sched_insert_request(rq, false, true, false); + blk_mq_request_bypass_insert(rq, true); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); Thanks, Ming
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm > dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of > 'none'") > Signed-off-by: Jens Axboe Tested-by: Guenter Roeck ... on two systems affected by the problem. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct > blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > + /* > + * If direct dispatch fails, we cannot allow any merging on > + * this IO. Drivers (like SCSI) may have set up permanent state > + * for this request, like SG tables and mappings, and if we > + * merge to it later on then we'll still only do IO to the > + * original part. 
> + */ > + rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break;
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 6:37 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: >> comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case >> of 'none'") >> Signed-off-by: Jens Axboe >> >> --- >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 3f91c6e5b17a..d8f518c6ea38 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct >> blk_mq_hw_ctx *hctx, >> break; >> case BLK_STS_RESOURCE: >> case BLK_STS_DEV_RESOURCE: >> +/* >> + * If direct dispatch fails, we cannot allow any merging on >> + * this IO. Drivers (like SCSI) may have set up permanent state >> + * for this request, like SG tables and mappings, and if we >> + * merge to it later on then we'll still only do IO to the >> + * original part. 
>> + */ >> +rq->cmd_flags |= REQ_NOMERGE; >> + >> blk_mq_update_dispatch_busy(hctx, true); >> __blk_mq_requeue_request(rq); >> break; >> > > Not sure it is enough to just mark it as NOMERGE, for example, driver > may have setup the .special_vec for discard, and NOMERGE may not prevent > request from entering elevator queue completely. Cause 'rq.rb_node' and > 'rq.special_vec' share same space. We should rather limit the scope of the direct dispatch instead. It doesn't make sense to do for anything but read/write anyway. > So how about inserting this request via blk_mq_request_bypass_insert() > in case that direct issue returns BUSY? Then it is invariant that > any request queued via .queue_rq() won't enter scheduler queue. I did consider this, but I didn't want to experiment with exercising a new path for an important bug fix. You do realize that your original patch has been corrupting data for months? I think a little caution is in order here. -- Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 7:16 PM, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>> we queue the request up normally. However, the SCSI layer may have >>> already setup SG tables etc for this particular command. If we later >>> merge with this request, then the old tables are no longer valid. Once >>> we issue the IO, we only read/write the original part of the request, >>> not the new state of it. >>> >>> This causes data corruption, and is most often noticed with the file >>> system complaining about the just read data being invalid: >>> >>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: >>> comm dpkg-query: bad extra_isize 24937 (inode size 256) >>> >>> because most of it is garbage... >>> >>> This doesn't happen from the normal issue path, as we will simply defer >>> the request to the hardware queue dispatch list if we fail. Once it's on >>> the dispatch list, we never merge with it. >>> >>> Fix this from the direct issue path by flagging the request as >>> REQ_NOMERGE so we don't change the size of it before issue. >>> >>> See also: >>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>> >>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case >>> of 'none'") >>> Signed-off-by: Jens Axboe >>> >>> --- >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct >>> blk_mq_hw_ctx *hctx, >>> break; >>> case BLK_STS_RESOURCE: >>> case BLK_STS_DEV_RESOURCE: >>> + /* >>> +* If direct dispatch fails, we cannot allow any merging on >>> +* this IO. 
Drivers (like SCSI) may have set up permanent state >>> +* for this request, like SG tables and mappings, and if we >>> +* merge to it later on then we'll still only do IO to the >>> +* original part. >>> +*/ >>> + rq->cmd_flags |= REQ_NOMERGE; >>> + >>> blk_mq_update_dispatch_busy(hctx, true); >>> __blk_mq_requeue_request(rq); >>> break; >>> >> >> Not sure it is enough to just mark it as NOMERGE, for example, driver >> may have setup the .special_vec for discard, and NOMERGE may not prevent >> request from entering elevator queue completely. Cause 'rq.rb_node' and >> 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. > >> So how about inserting this request via blk_mq_request_bypass_insert() >> in case that direct issue returns BUSY? Then it is invariant that >> any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. Here's a further limiting version. And we seriously need to clean up the direct issue paths, it's ridiculous. diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..3262d83b9e07 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, break; case BLK_STS_RESOURCE: case BLK_STS_DEV_RESOURCE: + /* +* If direct dispatch fails, we cannot allow any merging on +* this IO. Drivers (like SCSI) may have set up permanent state +* for this request, like SG tables and mappings, and if we +* merge to it later on then we'll still only do IO to the +* original part. 
+*/ + rq->cmd_flags |= REQ_NOMERGE; + blk_mq_update_dispatch_busy(hctx, true); __blk_mq_requeue_request(rq); break; @@ -1727,6 +1736,18 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, return ret; } +/* + * Don't allow direct dispatch of anything but regular reads/writes, + * as some of the other commands can potentially share request space + * with data we need for the IO scheduler. If we attempt a direct dispatch + * on those and fail, we can't safely add it to the scheduler afterwards + * without potentially overwriting data that the driver has already written. + */ +static bool blk_rq_can_direct_dispatch(struct request *rq) +{ + return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE; +} + static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, struct request *rq, blk_qc_t *cookie,
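[Editor's illustration] The blk_rq_can_direct_dispatch() helper in the patch above is a one-line op filter: only regular reads and writes may be issued directly, because other ops (discard, per Ming's observation) can carry state such as .special_vec that overlaps the fields the I/O scheduler uses on re-insert. Restated as a self-contained userspace model, with stand-in op codes rather than the kernel's REQ_OP_* values:

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace restatement of blk_rq_can_direct_dispatch(): permit direct
 * dispatch only for plain reads and writes. The enum values here are
 * hypothetical stand-ins for the kernel's REQ_OP_* constants. */

enum model_req_op {
	MODEL_OP_READ,
	MODEL_OP_WRITE,
	MODEL_OP_FLUSH,
	MODEL_OP_DISCARD,
};

static bool model_can_direct_dispatch(enum model_req_op op)
{
	return op == MODEL_OP_READ || op == MODEL_OP_WRITE;
}
```

Anything the filter rejects takes the normal insert path instead, so a request that might share space with scheduler bookkeeping is never requeued after a failed direct issue.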
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 6:38 PM, Guenter Roeck wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: >> comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case >> of 'none'") >> Signed-off-by: Jens Axboe > > Tested-by: Guenter Roeck > > ... on two systems affected by the problem. Thanks for testing! And for being persistent in reproducing and providing clues for getting this nailed. -- Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >> we queue the request up normally. However, the SCSI layer may have > >> already setup SG tables etc for this particular command. If we later > >> merge with this request, then the old tables are no longer valid. Once > >> we issue the IO, we only read/write the original part of the request, > >> not the new state of it. > >> > >> This causes data corruption, and is most often noticed with the file > >> system complaining about the just read data being invalid: > >> > >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: > >> comm dpkg-query: bad extra_isize 24937 (inode size 256) > >> > >> because most of it is garbage... > >> > >> This doesn't happen from the normal issue path, as we will simply defer > >> the request to the hardware queue dispatch list if we fail. Once it's on > >> the dispatch list, we never merge with it. > >> > >> Fix this from the direct issue path by flagging the request as > >> REQ_NOMERGE so we don't change the size of it before issue. > >> > >> See also: > >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >> > >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case > >> of 'none'") > >> Signed-off-by: Jens Axboe > >> > >> --- > >> > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index 3f91c6e5b17a..d8f518c6ea38 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct > >> blk_mq_hw_ctx *hctx, > >>break; > >>case BLK_STS_RESOURCE: > >>case BLK_STS_DEV_RESOURCE: > >> + /* > >> + * If direct dispatch fails, we cannot allow any merging on > >> + * this IO. 
Drivers (like SCSI) may have set up permanent state > >> + * for this request, like SG tables and mappings, and if we > >> + * merge to it later on then we'll still only do IO to the > >> + * original part. > >> + */ > >> + rq->cmd_flags |= REQ_NOMERGE; > >> + > >>blk_mq_update_dispatch_busy(hctx, true); > >>__blk_mq_requeue_request(rq); > >>break; > >> > > > > Not sure it is enough to just mark it as NOMERGE, for example, driver > > may have setup the .special_vec for discard, and NOMERGE may not prevent > > request from entering elevator queue completely. Cause 'rq.rb_node' and > > 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. discard is kind of write, and it isn't treated very specially in make request path, except for multi-range discard. > > > So how about inserting this request via blk_mq_request_bypass_insert() > > in case that direct issue returns BUSY? Then it is invariant that > > any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. But marking NOMERGE still may have a hole on re-insert discard request as mentioned above. Given we never allow to re-insert queued request to scheduler queue except for 6ce3dd6eec1, I think it is the correct thing to do, and the fix is simple too. Thanks, Ming
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 7:27 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: >> On 12/4/18 6:37 PM, Ming Lei wrote: >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: If we attempt a direct issue to a SCSI device, and it returns BUSY, then we queue the request up normally. However, the SCSI layer may have already setup SG tables etc for this particular command. If we later merge with this request, then the old tables are no longer valid. Once we issue the IO, we only read/write the original part of the request, not the new state of it. This causes data corruption, and is most often noticed with the file system complaining about the just read data being invalid: [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) because most of it is garbage... This doesn't happen from the normal issue path, as we will simply defer the request to the hardware queue dispatch list if we fail. Once it's on the dispatch list, we never merge with it. Fix this from the direct issue path by flagging the request as REQ_NOMERGE so we don't change the size of it before issue. See also: https://bugzilla.kernel.org/show_bug.cgi?id=201685 Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Signed-off-by: Jens Axboe --- diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..d8f518c6ea38 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, break; case BLK_STS_RESOURCE: case BLK_STS_DEV_RESOURCE: + /* + * If direct dispatch fails, we cannot allow any merging on + * this IO. Drivers (like SCSI) may have set up permanent state + * for this request, like SG tables and mappings, and if we + * merge to it later on then we'll still only do IO to the + * original part. 
+ */ + rq->cmd_flags |= REQ_NOMERGE; + blk_mq_update_dispatch_busy(hctx, true); __blk_mq_requeue_request(rq); break; >>> >>> Not sure it is enough to just mark it as NOMERGE, for example, driver >>> may have setup the .special_vec for discard, and NOMERGE may not prevent >>> request from entering elevator queue completely. Cause 'rq.rb_node' and >>> 'rq.special_vec' share same space. >> >> We should rather limit the scope of the direct dispatch instead. It >> doesn't make sense to do for anything but read/write anyway. > > discard is kind of write, and it isn't treated very specially in make > request path, except for multi-range discard. The point of direct dispatch is to reduce latencies for requests, discards are so damn slow on ALL devices anyway that it doesn't make any sense to try direct dispatch to begin with, regardless of whether it possible or not. >>> So how about inserting this request via blk_mq_request_bypass_insert() >>> in case that direct issue returns BUSY? Then it is invariant that >>> any request queued via .queue_rq() won't enter scheduler queue. >> >> I did consider this, but I didn't want to experiment with exercising >> a new path for an important bug fix. You do realize that your original >> patch has been corrupting data for months? I think a little caution >> is in order here. > > But marking NOMERGE still may have a hole on re-insert discard request as > mentioned above. What I said was further limit the scope of direct dispatch, which means not allowing anything that isn't a read/write. > Given we never allow to re-insert queued request to scheduler queue > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > fix is simple too. As I said, it's not the time to experiment. This issue has been there since 4.19-rc1. The alternative is yanking both those patches, and then looking at it later when the direct issue path has been cleaned up first. -- Jens Axboe
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: > On 12/4/18 7:27 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > >> On 12/4/18 6:37 PM, Ming Lei wrote: > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: > comm dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in > case of 'none'") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t > __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > +/* > + * If direct dispatch fails, we cannot allow any > merging on > + * this IO. 
Drivers (like SCSI) may have set up > permanent state > + * for this request, like SG tables and mappings, and > if we > + * merge to it later on then we'll still only do IO to > the > + * original part. > + */ > +rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break; > > >>> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and > >>> 'rq.special_vec' share same space. > >> > >> We should rather limit the scope of the direct dispatch instead. It > >> doesn't make sense to do for anything but read/write anyway. > > > > discard is kind of write, and it isn't treated very specially in make > > request path, except for multi-range discard. > > The point of direct dispatch is to reduce latencies for requests, > discards are so damn slow on ALL devices anyway that it doesn't make any > sense to try direct dispatch to begin with, regardless of whether it > possible or not. SCSI MQ device may benefit from direct dispatch from reduced lock contention. > > >>> So how about inserting this request via blk_mq_request_bypass_insert() > >>> in case that direct issue returns BUSY? Then it is invariant that > >>> any request queued via .queue_rq() won't enter scheduler queue. > >> > >> I did consider this, but I didn't want to experiment with exercising > >> a new path for an important bug fix. You do realize that your original > >> patch has been corrupting data for months? I think a little caution > >> is in order here. > > > > But marking NOMERGE still may have a hole on re-insert discard request as > > mentioned above. > > What I said was further limit the scope of direct dispatch, which means > not allowing anything that isn't a read/write. 
IMO, the conservative approach is to take the one used in legacy io path, in which it is never allowed to re-insert queued request to scheduler queue except for requeue, however RQF_DONTPREP is cleared before requeuing request to scheduler. > > > Given we never allow to re-insert queued request to scheduler queue > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > > fix is simple too. > > As I said, it's not the time to experiment. This issue has been there > since 4.19-rc1. The alternative is yanking both those patches, and then > looking at it later when the direct issue path has been cleaned up > first. The issue should have been there from v4.1, especially after commit f984
Re: [PATCH] blk-mq: fix corruption with direct issue
On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: > > On 12/4/18 7:27 PM, Ming Lei wrote: > > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > > >> On 12/4/18 6:37 PM, Ming Lei wrote: > > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > > If we attempt a direct issue to a SCSI device, and it returns BUSY, > > then > > we queue the request up normally. However, the SCSI layer may have > > already setup SG tables etc for this particular command. If we later > > merge with this request, then the old tables are no longer valid. Once > > we issue the IO, we only read/write the original part of the request, > > not the new state of it. > > > > This causes data corruption, and is most often noticed with the file > > system complaining about the just read data being invalid: > > > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode > > #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) > > > > because most of it is garbage... > > > > This doesn't happen from the normal issue path, as we will simply defer > > the request to the hardware queue dispatch list if we fail. Once it's > > on > > the dispatch list, we never merge with it. > > > > Fix this from the direct issue path by flagging the request as > > REQ_NOMERGE so we don't change the size of it before issue. 
> > > > See also: > > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in > > case of 'none'") > > Signed-off-by: Jens Axboe > > > > --- > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 3f91c6e5b17a..d8f518c6ea38 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -1715,6 +1715,15 @@ static blk_status_t > > __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > > break; > > case BLK_STS_RESOURCE: > > case BLK_STS_DEV_RESOURCE: > > + /* > > + * If direct dispatch fails, we cannot allow any > > merging on > > + * this IO. Drivers (like SCSI) may have set up > > permanent state > > + * for this request, like SG tables and mappings, and > > if we > > + * merge to it later on then we'll still only do IO to > > the > > + * original part. > > + */ > > + rq->cmd_flags |= REQ_NOMERGE; > > + > > blk_mq_update_dispatch_busy(hctx, true); > > __blk_mq_requeue_request(rq); > > break; > > > > >>> > > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver > > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent > > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and > > >>> 'rq.special_vec' share same space. > > >> > > >> We should rather limit the scope of the direct dispatch instead. It > > >> doesn't make sense to do for anything but read/write anyway. > > > > > > discard is kind of write, and it isn't treated very specially in make > > > request path, except for multi-range discard. > > > > The point of direct dispatch is to reduce latencies for requests, > > discards are so damn slow on ALL devices anyway that it doesn't make any > > sense to try direct dispatch to begin with, regardless of whether it > > possible or not. > > SCSI MQ device may benefit from direct dispatch from reduced lock contention. 
> > > > > >>> So how about inserting this request via blk_mq_request_bypass_insert() > > >>> in case that direct issue returns BUSY? Then it is invariant that > > >>> any request queued via .queue_rq() won't enter scheduler queue. > > >> > > >> I did consider this, but I didn't want to experiment with exercising > > >> a new path for an important bug fix. You do realize that your original > > >> patch has been corrupting data for months? I think a little caution > > >> is in order here. > > > > > > But marking NOMERGE still may have a hole on re-insert discard request as > > > mentioned above. > > > > What I said was further limit the scope of direct dispatch, which means > > not allowing anything that isn't a read/write. > > IMO, the conservative approach is to take the one used in legacy io > path, in which it is never allowed to re-insert queued request to > scheduler queue except for requeue, however RQF_DONTPREP is cleared > before requeuing request to scheduler. > > > > > > Given we never allow to re-insert queued request to scheduler queue > > > except for 6ce3dd6eec1, I think it is the correct thing to do, and the > > > fix is simple too. > > > > As I said, it's not the time
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 7:58 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: >> On 12/4/18 7:27 PM, Ming Lei wrote: >>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: On 12/4/18 6:37 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: >> comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in >> case of 'none'") >> Signed-off-by: Jens Axboe >> >> --- >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 3f91c6e5b17a..d8f518c6ea38 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -1715,6 +1715,15 @@ static blk_status_t >> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >> break; >> case BLK_STS_RESOURCE: >> case BLK_STS_DEV_RESOURCE: >> +/* >> + * If direct dispatch fails, we cannot allow any >> merging on >> + * this IO. 
Drivers (like SCSI) may have set up >> permanent state >> + * for this request, like SG tables and mappings, and >> if we >> + * merge to it later on then we'll still only do IO to >> the >> + * original part. >> + */ >> +rq->cmd_flags |= REQ_NOMERGE; >> + >> blk_mq_update_dispatch_busy(hctx, true); >> __blk_mq_requeue_request(rq); >> break; >> > > Not sure it is enough to just mark it as NOMERGE, for example, driver > may have setup the .special_vec for discard, and NOMERGE may not prevent > request from entering elevator queue completely. Cause 'rq.rb_node' and > 'rq.special_vec' share same space. We should rather limit the scope of the direct dispatch instead. It doesn't make sense to do for anything but read/write anyway. >>> >>> discard is kind of write, and it isn't treated very specially in make >>> request path, except for multi-range discard. >> >> The point of direct dispatch is to reduce latencies for requests, >> discards are so damn slow on ALL devices anyway that it doesn't make any >> sense to try direct dispatch to begin with, regardless of whether it >> possible or not. > > SCSI MQ device may benefit from direct dispatch from reduced lock contention. > >> > So how about inserting this request via blk_mq_request_bypass_insert() > in case that direct issue returns BUSY? Then it is invariant that > any request queued via .queue_rq() won't enter scheduler queue. I did consider this, but I didn't want to experiment with exercising a new path for an important bug fix. You do realize that your original patch has been corrupting data for months? I think a little caution is in order here. >>> >>> But marking NOMERGE still may have a hole on re-insert discard request as >>> mentioned above. >> >> What I said was further limit the scope of direct dispatch, which means >> not allowing anything that isn't a read/write. 
> > IMO, the conservative approach is to take the one used in legacy io > path, in which it is never allowed to re-insert queued request to > scheduler queue except for requeue, however RQF_DONTPREP is cleared > before requeuing request to scheduler. I don't agree. Whether you agree or not, I'm doing the simple fix for 4.20 and queueing it up for 4.21 as well. Then I hope Jianchao can rework his direct dispatch cleanup patches for 4.21 on top of that, and then we can experiment on top of that with other solutions. >>> Given we never allow to re-insert queued request to scheduler queue >>> except for 6ce3dd6eec1, I think it is the correct thing to do, and t
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/4/18 8:03 PM, Ming Lei wrote: > On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: >>> On 12/4/18 7:27 PM, Ming Lei wrote: On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: >> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >>> we queue the request up normally. However, the SCSI layer may have >>> already setup SG tables etc for this particular command. If we later >>> merge with this request, then the old tables are no longer valid. Once >>> we issue the IO, we only read/write the original part of the request, >>> not the new state of it. >>> >>> This causes data corruption, and is most often noticed with the file >>> system complaining about the just read data being invalid: >>> >>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode >>> #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >>> >>> because most of it is garbage... >>> >>> This doesn't happen from the normal issue path, as we will simply defer >>> the request to the hardware queue dispatch list if we fail. Once it's on >>> the dispatch list, we never merge with it. >>> >>> Fix this from the direct issue path by flagging the request as >>> REQ_NOMERGE so we don't change the size of it before issue. 
>>> >>> See also: >>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >>> >>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in >>> case of 'none'") >>> Signed-off-by: Jens Axboe >>> >>> --- >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 3f91c6e5b17a..d8f518c6ea38 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -1715,6 +1715,15 @@ static blk_status_t >>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >>> break; >>> case BLK_STS_RESOURCE: >>> case BLK_STS_DEV_RESOURCE: >>> + /* >>> +* If direct dispatch fails, we cannot allow any >>> merging on >>> +* this IO. Drivers (like SCSI) may have set up >>> permanent state >>> +* for this request, like SG tables and mappings, and >>> if we >>> +* merge to it later on then we'll still only do IO to >>> the >>> +* original part. >>> +*/ >>> + rq->cmd_flags |= REQ_NOMERGE; >>> + >>> blk_mq_update_dispatch_busy(hctx, true); >>> __blk_mq_requeue_request(rq); >>> break; >>> >> >> Not sure it is enough to just mark it as NOMERGE, for example, driver >> may have setup the .special_vec for discard, and NOMERGE may not prevent >> request from entering elevator queue completely. Cause 'rq.rb_node' and >> 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. discard is kind of write, and it isn't treated very specially in make request path, except for multi-range discard. >>> >>> The point of direct dispatch is to reduce latencies for requests, >>> discards are so damn slow on ALL devices anyway that it doesn't make any >>> sense to try direct dispatch to begin with, regardless of whether it >>> possible or not. >> >> SCSI MQ device may benefit from direct dispatch from reduced lock contention. >> >>> >> So how about inserting this request via blk_mq_request_bypass_insert() >> in case that direct issue returns BUSY? 
Then it is invariant that >> any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. But marking NOMERGE still may have a hole on re-insert discard request as mentioned above. >>> >>> What I said was further limit the scope of direct dispatch, which means >>> not allowing anything that isn't a read/write. >> >> IMO, the conservative approach is to take the one used in legacy io >> path, in which it is never allowed to re-insert queued request to >> scheduler queue except for requeue, however RQF_DONTPREP is cleared >> before requeuing request to scheduler. >> >>> Given we never allow to re-insert queued request to scheduler queue except for 6ce3dd6eec1, I think it is the correct thing to do, and the fix is simple too. >>> >>> As I said, it's not the time to experiment. This issue has been there >>> since 4.1
RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
+ Linux-scsi

> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 9497b47..57432be 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> >  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> > struct request *rq)
> > {
> > +	hctx->tags->rqs[rq->tag] = NULL;
> > 	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > 	rq->tag = -1;
>
> No SCSI driver should call scsi_host_find_tag() after a request has
> finished. The above patch introduces yet another race and hence can't be
> a proper fix.

Bart, many scsi drivers use scsi_host_find_tag() to traverse up to the max tag_id to find pending IO in firmware.
One of the use cases is HBA firmware recovery: during firmware recovery, the driver may need to traverse the list and return pending scsi commands to SML for retry.
I quickly grepped the scsi code and found that snic_scsi, qla4xxx, fnic and mpt3sas use the scsi_host_find_tag() API for the same purpose.

Without this patch, we hit a very basic kernel panic due to a page fault. This is not an issue in the non-mq code path. The non-mq path uses blk_map_queue_find_tag(), and that particular API does not return stale requests.

Kashyap

>
> Bart.
Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote: > + Linux-scsi > > > > diff --git a/block/blk-mq.h b/block/blk-mq.h > > > index 9497b47..57432be 100644 > > > --- a/block/blk-mq.h > > > +++ b/block/blk-mq.h > > > @@ -175,6 +175,7 @@ static inline bool > > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx) > > > static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, > > > struct request *rq) > > > { > > > +hctx->tags->rqs[rq->tag] = NULL; > > > blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag); > > > rq->tag = -1; > > > > No SCSI driver should call scsi_host_find_tag() after a request has > > finished. The above patch introduces yet another race and hence can't be > > a proper fix. > > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id to > find out pending IO in firmware. > One of the use case is - HBA firmware recovery. In case of firmware > recovery, driver may require to traverse the list and return back pending > scsi command to SML for retry. > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic, > mpt3sas are using API scsi_host_find_tag for the same purpose. > > Without this patch, we hit very basic kernel panic due to page fault. This > is not an issue in non-mq code path. Non-mq path use > blk_map_queue_find_tag() and that particular API does not provide stale > requests. As I wrote before, your patch doesn't fix the race you described but only makes the race window smaller. If you want an example of how to use scsi_host_find_tag() properly, have a look at the SRP initiator driver (drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag() without triggering any NULL pointer dereferences. The approach used in that driver also works when having to support HBA firmware recovery. Bart.
RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
> -----Original Message-----
> From: Bart Van Assche [mailto:bvanass...@acm.org]
> Sent: Tuesday, December 4, 2018 10:45 PM
> To: Kashyap Desai; linux-block; Jens Axboe; Ming Lei; linux-scsi
> Cc: Suganath Prabu Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
>
> On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> > + Linux-scsi
> >
> > > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > > index 9497b47..57432be 100644
> > > > --- a/block/blk-mq.h
> > > > +++ b/block/blk-mq.h
> > > > @@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > > >  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> > > > struct request *rq)
> > > > {
> > > > +	hctx->tags->rqs[rq->tag] = NULL;
> > > > 	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > > > 	rq->tag = -1;
> > >
> > > No SCSI driver should call scsi_host_find_tag() after a request has
> > > finished. The above patch introduces yet another race and hence can't
> > > be a proper fix.
> >
> > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id
> > to find out pending IO in firmware.
> > One of the use case is - HBA firmware recovery. In case of firmware
> > recovery, driver may require to traverse the list and return back
> > pending scsi command to SML for retry.
> > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> > mpt3sas are using API scsi_host_find_tag for the same purpose.
> >
> > Without this patch, we hit very basic kernel panic due to page fault.
> > This is not an issue in non-mq code path. Non-mq path use
> > blk_map_queue_find_tag() and that particular API does not provide stale
> > requests.
>
> As I wrote before, your patch doesn't fix the race you described but only
> makes the race window smaller.

Hi Bart,

Let me explain the issue. It is not a race, but a very straightforward issue.
Let's say we have one scsi_device /dev/sda, and the total IO submitted and completed is some number, say 100. All 100 IOs are *completed*. Now, as part of firmware recovery, the driver tries to find outstanding IOs using scsi_host_find_tag(). The block layer will return all 100 commands to the driver, but those 100 commands are not really outstanding. This patch makes it return the *actual* outstanding commands.

If scsi_device /dev/sda has not been removed from the OS, the driver accessing the scmd of those 100 commands is a safe memory access. Now consider a case where scsi_device /dev/sda is removed and the driver performs firmware recovery. This time the driver will crash while accessing the scmd (randomly, depending on how the memory is reused). Along with this patch, the low level driver should make sure that all request queues at the block layer are quiesced.

> If you want an example of how to use
> scsi_host_find_tag() properly, have a look at the SRP initiator driver
> (drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag()
> without triggering any NULL pointer dereferences.

I am not able to find the right context in srp, but I checked the srp code and it looks like that driver is getting the scmd using scsi_host_find_tag() for live commands.

> The approach used in that driver
> also works when having to support HBA firmware recovery.
>
> Bart.
[PATCH v5 00/14] block: always associate blkg and refcount cleanup
Hi everyone,

A special case with dm is the flush bio, which is statically initialized before the block device is opened and associated with a disk. This caused blkg association to throw an NPE. 0005 addresses this case by moving association onto the flush path.

With v4 moving association to piggyback off of bio_set_dev(), an NPE was thrown by the special case above. I was overly cautious with v4 and added the bio_has_queue() check, which is now removed in v5. Also, the addition of bio_set_dev_only() wasn't quite right due to writeback and swap sharing the same bio init paths in many places. The safer thing to do is double association for those paths and, in a follow-up series, to split out the bio init paths.

Changes in v5:
All:  Fixed minor grammar and syntactic issues.
0004: Removed bio_has_queue() for being overly cautious.
0005: New, properly addresses the static flush_bio in dm.
0006: Removed the rcu lock in blkcg_bio_issue_check() as the bio will own a ref on the blkg, so it is unnecessary.
0011: Consolidated bio_associate_blkg_from_css() (removed the __ version).

From v4:

This is a respin of v3 [1] with fixes for the errors reported in [2] and [3]. v3 was reverted in [4].

The issue in [3] was that bio->bi_disk->queue and blkg->q were out of sync. So when I changed blk_get_rl() to use blkg->q, the wrong queue was returned, and the elevator from q->elevator->type threw an NPE. Note: with v4.21, the old block stack was removed and so this patch was dropped. I did backport this to v4.20 and verified that this series does not encounter the error.

The biggest changes in v4 are when association occurs and clearly defining the cases where association should happen:
1. Association is now done when the device is set, to keep blkg->q and bio->bi_disk->queue in sync.
2. When a bio is submitted directly to the device, it will not be associated with a blkg. This is because a blkg represents the relationship between a blkcg and a request_queue.
Going directly to the device means the request_queue may not exist, meaning no blkg will exist. The patch updating blk_get_rl() was dropped (v3 10/12). The patch to always associate a blkg from v3 (v3 04/12) was fixed and split into patches 0004 and 0005. 0011 is new, removing bio_disassociate_task().

Summarizing the ideas of this series:
1. Gracefully handle blkg creation failure by walking up the blkg tree rather than falling through to root.
2. Associate a bio with a blkg in core logic rather than per-controller logic.
3. Rather than holding both a css and a blkg reference, hold just a blkg ref, as it also holds a css ref.
4. Switch to percpu ref counting for blkg.

[1] https://lore.kernel.org/lkml/20180911184137.35897-1-dennissz...@gmail.com/
[2] https://lore.kernel.org/lkml/13987.1539646...@turing-police.cc.vt.edu/
[3] https://marc.info/?l=linux-cgroups&m=154110436103723
[4] https://lore.kernel.org/lkml/20181101212410.47569-1-den...@kernel.org/

This patchset contains the following 14 patches:
0001-blkcg-fix-ref-count-issue-with-bio_blkcg-using-task_.patch
0002-blkcg-update-blkg_lookup_create-to-do-locking.patch
0003-blkcg-convert-blkg_lookup_create-to-find-closest-blk.patch
0004-blkcg-introduce-common-blkg-association-logic.patch
0005-dm-set-flush-bio-device-on-demand.patch
0006-blkcg-associate-blkg-when-associating-a-device.patch
0007-blkcg-consolidate-bio_issue_init-to-be-a-part-of-cor.patch
0008-blkcg-associate-a-blkg-for-pages-being-evicted-by-sw.patch
0009-blkcg-associate-writeback-bios-with-a-blkg.patch
0010-blkcg-remove-bio-bi_css-and-instead-use-bio-bi_blkg.patch
0011-blkcg-remove-additional-reference-to-the-css.patch
0012-blkcg-remove-bio_disassociate_task.patch
0013-blkcg-change-blkg-reference-counting-to-use-percpu_r.patch
0014-blkcg-rename-blkg_try_get-to-blkg_tryget.patch

This patchset is on top of linux-block#for-4.21/block 154989e45fd8.
diffstats below: Dennis Zhou (14): blkcg: fix ref count issue with bio_blkcg() using task_css blkcg: update blkg_lookup_create() to do locking blkcg: convert blkg_lookup_create() to find closest blkg blkcg: introduce common blkg association logic dm: set the static flush bio device on demand blkcg: associate blkg when associating a device blkcg: consolidate bio_issue_init() to be a part of core blkcg: associate a blkg for pages being evicted by swap blkcg: associate writeback bios with a blkg blkcg: remove bio->bi_css and instead use bio->bi_blkg blkcg: remove additional reference to the css blkcg: remove bio_disassociate_task() blkcg: change blkg reference counting to use percpu_ref blkcg: rename blkg_try_get() to blkg_tryget() Documentation/admin-guide/cgroup-v2.rst | 8 +- block/bfq-cgroup.c | 4 +- block/bfq-iosched.c | 2 +- block/bio.c | 156 +++- block/blk-cgroup.c | 95 --- block/bl
[PATCH 08/14] blkcg: associate a blkg for pages being evicted by swap
A prior patch in this series added blkg association to bios issued by cgroups. There are two other paths that we want to attribute work back to the appropriate cgroup: swap and writeback. Here we modify the way swap tags bios to include the blkg. Writeback will be tackle in the next patch. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Acked-by: Tejun Heo --- block/bio.c | 62 +++-- include/linux/bio.h | 6 ++--- mm/page_io.c| 2 +- 3 files changed, 42 insertions(+), 28 deletions(-) diff --git a/block/bio.c b/block/bio.c index 90089124b512..f0f069c1823c 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1957,30 +1957,6 @@ EXPORT_SYMBOL(bioset_init_from_src); #ifdef CONFIG_BLK_CGROUP -#ifdef CONFIG_MEMCG -/** - * bio_associate_blkcg_from_page - associate a bio with the page's blkcg - * @bio: target bio - * @page: the page to lookup the blkcg from - * - * Associate @bio with the blkcg from @page's owning memcg. This works like - * every other associate function wrt references. - */ -int bio_associate_blkcg_from_page(struct bio *bio, struct page *page) -{ - struct cgroup_subsys_state *blkcg_css; - - if (unlikely(bio->bi_css)) - return -EBUSY; - if (!page->mem_cgroup) - return 0; - blkcg_css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, -&io_cgrp_subsys); - bio->bi_css = blkcg_css; - return 0; -} -#endif /* CONFIG_MEMCG */ - /** * bio_associate_blkcg - associate a bio with the specified blkcg * @bio: target bio @@ -2045,6 +2021,44 @@ static void __bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg) bio->bi_blkg = blkg_try_get_closest(blkg); } +static void __bio_associate_blkg_from_css(struct bio *bio, + struct cgroup_subsys_state *css) +{ + struct blkcg_gq *blkg; + + rcu_read_lock(); + + blkg = blkg_lookup_create(css_to_blkcg(css), bio->bi_disk->queue); + __bio_associate_blkg(bio, blkg); + + rcu_read_unlock(); +} + +#ifdef CONFIG_MEMCG +/** + * bio_associate_blkg_from_page - associate a bio with the page's blkg + * @bio: target bio + * @page: the page to lookup 
the blkcg from + * + * Associate @bio with the blkg from @page's owning memcg and the respective + * request_queue. This works like every other associate function wrt + * references. + */ +void bio_associate_blkg_from_page(struct bio *bio, struct page *page) +{ + struct cgroup_subsys_state *css; + + if (unlikely(bio->bi_css)) + return; + if (!page->mem_cgroup) + return; + + css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); + bio->bi_css = css; + __bio_associate_blkg_from_css(bio, css); +} +#endif /* CONFIG_MEMCG */ + /** * bio_associate_blkg - associate a bio with a blkg * @bio: target bio diff --git a/include/linux/bio.h b/include/linux/bio.h index 6ee2ea8b378a..f13572c254a7 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -505,10 +505,10 @@ do { \ disk_devt((bio)->bi_disk) #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) -int bio_associate_blkcg_from_page(struct bio *bio, struct page *page); +void bio_associate_blkg_from_page(struct bio *bio, struct page *page); #else -static inline int bio_associate_blkcg_from_page(struct bio *bio, - struct page *page) { return 0; } +static inline void bio_associate_blkg_from_page(struct bio *bio, + struct page *page) { } #endif #ifdef CONFIG_BLK_CGROUP diff --git a/mm/page_io.c b/mm/page_io.c index 5bdfd21c1bd9..3475733b1926 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -339,7 +339,7 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc, goto out; } bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc); - bio_associate_blkcg_from_page(bio, page); + bio_associate_blkg_from_page(bio, page); count_swpout_vm_event(page); set_page_writeback(page); unlock_page(page); -- 2.17.1
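The new `bio_associate_blkg_from_page()` above associates at most once and silently skips pages with no owning memcg. As a rough illustration of that guard logic, here is a minimal userspace sketch; all type and field names (`css`, `page_owner`, `swap_bio`) are illustrative stand-ins, not the kernel structs, and the refcounting is reduced to a plain counter:

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for cgroup_subsys_state, struct page, and struct bio. */
struct css { int refs; };
struct page_owner { struct css *memcg_css; };
struct swap_bio { struct css *css; };

static void css_take(struct css *css)
{
	css->refs++;	/* models cgroup_get_e_css() taking a css ref */
}

static void associate_from_page(struct swap_bio *bio, struct page_owner *page)
{
	if (bio->css)		/* already associated: keep the first owner */
		return;
	if (!page->memcg_css)	/* no owning memcg: leave default attribution */
		return;
	css_take(page->memcg_css);	/* ref dropped when the bio is freed */
	bio->css = page->memcg_css;
}
```

The point mirrored from the patch: association is idempotent, and the css reference taken here is paired with a put at bio release time.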
[PATCH 14/14] blkcg: rename blkg_try_get() to blkg_tryget()
blkg reference counting now uses percpu_ref rather than atomic_t. Let's make this consistent with css_tryget. This renames blkg_try_get to blkg_tryget, and it now returns a bool rather than the blkg or %NULL. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Acked-by: Tejun Heo --- block/bio.c| 2 +- block/blk-cgroup.c | 3 +-- block/blk-iolatency.c | 2 +- include/linux/blk-cgroup.h | 12 +--- 4 files changed, 8 insertions(+), 11 deletions(-) diff --git a/block/bio.c b/block/bio.c index 7ec5316e6ecc..06760543ec81 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1990,7 +1990,7 @@ static void __bio_associate_blkg(struct bio *bio, struct blkcg_gq *blkg) { bio_disassociate_blkg(bio); - bio->bi_blkg = blkg_try_get_closest(blkg); + bio->bi_blkg = blkg_tryget_closest(blkg); } /** diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 2ca7611fe274..6bd0619a7d6e 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1736,8 +1736,7 @@ void blkcg_maybe_throttle_current(void) blkg = blkg_lookup(blkcg, q); if (!blkg) goto out; - blkg = blkg_try_get(blkg); - if (!blkg) + if (!blkg_tryget(blkg)) goto out; rcu_read_unlock(); diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c index 5a79f06a730d..0b14c3d57769 100644 --- a/block/blk-iolatency.c +++ b/block/blk-iolatency.c @@ -698,7 +698,7 @@ static void blkiolatency_timer_fn(struct timer_list *t) * We could be exiting, don't access the pd unless we have a * ref on the blkg. */ - if (!blkg_try_get(blkg)) + if (!blkg_tryget(blkg)) continue; iolat = blkg_to_lat(blkg); diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 05909cf31ed1..64ac89130ae8 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -490,27 +490,25 @@ static inline void blkg_get(struct blkcg_gq *blkg) } /** - * blkg_try_get - try and get a blkg reference + * blkg_tryget - try and get a blkg reference * @blkg: blkg to get * * This is for use when doing an RCU lookup of the blkg. 
We may be in the midst * of freeing this blkg, so we can only use it if the refcnt is not zero. */ -static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg) +static inline bool blkg_tryget(struct blkcg_gq *blkg) { - if (percpu_ref_tryget(&blkg->refcnt)) - return blkg; - return NULL; + return percpu_ref_tryget(&blkg->refcnt); } /** - * blkg_try_get_closest - try and get a blkg ref on the closet blkg + * blkg_tryget_closest - try and get a blkg ref on the closet blkg * @blkg: blkg to get * * This walks up the blkg tree to find the closest non-dying blkg and returns * the blkg that it did association with as it may not be the passed in blkg. */ -static inline struct blkcg_gq *blkg_try_get_closest(struct blkcg_gq *blkg) +static inline struct blkcg_gq *blkg_tryget_closest(struct blkcg_gq *blkg) { while (!percpu_ref_tryget(&blkg->refcnt)) blkg = blkg->parent; -- 2.17.1
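The semantics being renamed here, `blkg_tryget()` failing once the refcount has dropped to zero and `blkg_tryget_closest()` walking up the parent chain to pin the nearest live ancestor, can be modeled in userspace with a plain atomic counter. This is a sketch under that assumption (the real code uses percpu_ref, and `node` here stands in for `struct blkcg_gq`):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy analogue of the blkg refcount. */
struct node {
	atomic_int refcnt;
	struct node *parent;
};

/* Like blkg_tryget(): succeed only while the count is still nonzero. */
static bool node_tryget(struct node *n)
{
	int v = atomic_load(&n->refcnt);

	while (v > 0) {
		if (atomic_compare_exchange_weak(&n->refcnt, &v, v + 1))
			return true;
	}
	return false;	/* refcnt hit zero: the node is being freed */
}

/* Like blkg_tryget_closest(): skip dying nodes, pin the nearest live
 * ancestor.  Assumes the root is always alive, as the root blkg is. */
static struct node *node_tryget_closest(struct node *n)
{
	while (!node_tryget(n))
		n = n->parent;
	return n;
}

/* fixtures: a live root and a dying child */
static struct node root = { .refcnt = 1, .parent = NULL };
static struct node dying = { .refcnt = 0, .parent = &root };
```

This also shows why the bool return of `blkg_tryget()` is enough for callers like `blkcg_maybe_throttle_current()`: they already hold the pointer and only need to know whether the pin succeeded, whereas `_closest` must still return a pointer because the pinned node may differ from the one passed in.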
[PATCH 13/14] blkcg: change blkg reference counting to use percpu_ref
Every bio is now associated with a blkg putting blkg_get, blkg_try_get, and blkg_put on the hot path. Switch over the refcnt in blkg to use percpu_ref. Signed-off-by: Dennis Zhou Acked-by: Tejun Heo --- block/blk-cgroup.c | 41 -- include/linux/blk-cgroup.h | 15 +- 2 files changed, 44 insertions(+), 12 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 120f2e2835fb..2ca7611fe274 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -81,6 +81,37 @@ static void blkg_free(struct blkcg_gq *blkg) kfree(blkg); } +static void __blkg_release(struct rcu_head *rcu) +{ + struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head); + + percpu_ref_exit(&blkg->refcnt); + + /* release the blkcg and parent blkg refs this blkg has been holding */ + css_put(&blkg->blkcg->css); + if (blkg->parent) + blkg_put(blkg->parent); + + wb_congested_put(blkg->wb_congested); + + blkg_free(blkg); +} + +/* + * A group is RCU protected, but having an rcu lock does not mean that one + * can access all the fields of blkg and assume these are valid. For + * example, don't try to follow throtl_data and request queue links. + * + * Having a reference to blkg under an rcu allows accesses to only values + * local to groups like group stats and group rate limits. 
+ */ +static void blkg_release(struct percpu_ref *ref) +{ + struct blkcg_gq *blkg = container_of(ref, struct blkcg_gq, refcnt); + + call_rcu(&blkg->rcu_head, __blkg_release); +} + /** * blkg_alloc - allocate a blkg * @blkcg: block cgroup the new blkg is associated with @@ -107,7 +138,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q, blkg->q = q; INIT_LIST_HEAD(&blkg->q_node); blkg->blkcg = blkcg; - atomic_set(&blkg->refcnt, 1); for (i = 0; i < BLKCG_MAX_POLS; i++) { struct blkcg_policy *pol = blkcg_policy[i]; @@ -207,6 +237,11 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, blkg_get(blkg->parent); } + ret = percpu_ref_init(&blkg->refcnt, blkg_release, 0, + GFP_NOWAIT | __GFP_NOWARN); + if (ret) + goto err_cancel_ref; + /* invoke per-policy init */ for (i = 0; i < BLKCG_MAX_POLS; i++) { struct blkcg_policy *pol = blkcg_policy[i]; @@ -239,6 +274,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, blkg_put(blkg); return ERR_PTR(ret); +err_cancel_ref: + percpu_ref_exit(&blkg->refcnt); err_put_congested: wb_congested_put(wb_congested); err_put_css: @@ -367,7 +404,7 @@ static void blkg_destroy(struct blkcg_gq *blkg) * Put the reference taken at the time of creation so that when all * queues are gone, group can be destroyed. */ - blkg_put(blkg); + percpu_ref_kill(&blkg->refcnt); } /** diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h index 68cae7961060..05909cf31ed1 100644 --- a/include/linux/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -123,7 +123,7 @@ struct blkcg_gq { struct blkcg_gq *parent; /* reference count */ - atomic_trefcnt; + struct percpu_ref refcnt; /* is this blkg online? 
protected by both blkcg and q locks */ boolonline; @@ -486,8 +486,7 @@ static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen) */ static inline void blkg_get(struct blkcg_gq *blkg) { - WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0); - atomic_inc(&blkg->refcnt); + percpu_ref_get(&blkg->refcnt); } /** @@ -499,7 +498,7 @@ static inline void blkg_get(struct blkcg_gq *blkg) */ static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg) { - if (atomic_inc_not_zero(&blkg->refcnt)) + if (percpu_ref_tryget(&blkg->refcnt)) return blkg; return NULL; } @@ -513,23 +512,19 @@ static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg) */ static inline struct blkcg_gq *blkg_try_get_closest(struct blkcg_gq *blkg) { - while (!atomic_inc_not_zero(&blkg->refcnt)) + while (!percpu_ref_tryget(&blkg->refcnt)) blkg = blkg->parent; return blkg; } -void __blkg_release_rcu(struct rcu_head *rcu); - /** * blkg_put - put a blkg reference * @blkg: blkg to put */ static inline void blkg_put(struct blkcg_gq *blkg) { - WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0); - if (atomic_dec_and_test(&blkg->refcnt)) - call_rcu(&blkg->rcu_head, __blkg_release_rcu); + percpu_ref_put(&blkg->refcnt); } /** -- 2.17.1
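The lifecycle this patch moves to percpu_ref, init with one creation reference, get/put on the hot path, and a kill that drops the creation reference so the release callback fires once all references are gone, can be sketched with a plain atomic counter. This is a simplified model, not the percpu_ref implementation (the real one also switches from percpu to atomic mode on kill, and the release path goes through call_rcu()):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the blkg refcnt lifecycle. */
struct ref {
	atomic_int count;
	void (*release)(struct ref *ref);
	bool released;
};

static void ref_init(struct ref *r, void (*release)(struct ref *))
{
	atomic_store(&r->count, 1);	/* the creation reference */
	r->release = release;
	r->released = false;
}

static void ref_get(struct ref *r)
{
	atomic_fetch_add(&r->count, 1);
}

static void ref_put(struct ref *r)
{
	if (atomic_fetch_sub(&r->count, 1) == 1)
		r->release(r);	/* last ref gone: free the object */
}

/* Analogue of percpu_ref_kill() in blkg_destroy(): drop the creation ref
 * so the object dies once in-flight users put their references. */
static void ref_kill(struct ref *r)
{
	ref_put(r);
}

static void mark_released(struct ref *r)
{
	r->released = true;	/* stands in for call_rcu() + blkg_free() */
}
```

Note how this matches `blkg_destroy()` above: killing the ref does not free the blkg immediately; a holder that did `ref_get()` keeps it alive until its own `ref_put()`.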
[PATCH 12/14] blkcg: remove bio_disassociate_task()
Now that a bio only holds a blkg reference, cleanup is simply putting back that reference. Remove bio_disassociate_task() as it just calls bio_disassociate_blkg() and call the latter directly. Signed-off-by: Dennis Zhou Acked-by: Tejun Heo --- block/bio.c | 11 +-- include/linux/bio.h | 2 -- 2 files changed, 1 insertion(+), 12 deletions(-) diff --git a/block/bio.c b/block/bio.c index ce1e512dca5a..7ec5316e6ecc 100644 --- a/block/bio.c +++ b/block/bio.c @@ -244,7 +244,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx, void bio_uninit(struct bio *bio) { - bio_disassociate_task(bio); + bio_disassociate_blkg(bio); } EXPORT_SYMBOL(bio_uninit); @@ -2073,15 +2073,6 @@ void bio_associate_blkg(struct bio *bio) } EXPORT_SYMBOL_GPL(bio_associate_blkg); -/** - * bio_disassociate_task - undo bio_associate_current() - * @bio: target bio - */ -void bio_disassociate_task(struct bio *bio) -{ - bio_disassociate_blkg(bio); -} - /** * bio_clone_blkg_association - clone blkg association from src to dst bio * @dst: destination bio diff --git a/include/linux/bio.h b/include/linux/bio.h index 84e1c4dc703a..7380b094dcca 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -516,7 +516,6 @@ void bio_disassociate_blkg(struct bio *bio); void bio_associate_blkg(struct bio *bio); void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css); -void bio_disassociate_task(struct bio *bio); void bio_clone_blkg_association(struct bio *dst, struct bio *src); #else /* CONFIG_BLK_CGROUP */ static inline void bio_disassociate_blkg(struct bio *bio) { } @@ -524,7 +523,6 @@ static inline void bio_associate_blkg(struct bio *bio) { } static inline void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { } -static inline void bio_disassociate_task(struct bio *bio) { } static inline void bio_clone_blkg_association(struct bio *dst, struct bio *src) { } #endif /* CONFIG_BLK_CGROUP */ -- 2.17.1
[PATCH 10/14] blkcg: remove bio->bi_css and instead use bio->bi_blkg
Prior patches ensured that any bio that interacts with a request_queue is properly associated with a blkg. This makes bio->bi_css unnecessary as blkg maintains a reference to blkcg already. This removes the bio field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Acked-by: Tejun Heo --- block/bio.c| 59 +- block/bounce.c | 2 +- drivers/block/loop.c | 5 ++-- drivers/md/raid0.c | 2 +- include/linux/bio.h| 11 +++ include/linux/blk-cgroup.h | 8 +++--- include/linux/blk_types.h | 7 +++-- kernel/trace/blktrace.c| 4 +-- 8 files changed, 32 insertions(+), 66 deletions(-) diff --git a/block/bio.c b/block/bio.c index b42477b6a225..2b6bc7b805ec 100644 --- a/block/bio.c +++ b/block/bio.c @@ -610,7 +610,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src) bio->bi_iter = bio_src->bi_iter; bio->bi_io_vec = bio_src->bi_io_vec; - bio_clone_blkcg_association(bio, bio_src); + bio_clone_blkg_association(bio, bio_src); blkcg_bio_issue_init(bio); } EXPORT_SYMBOL(__bio_clone_fast); @@ -1957,34 +1957,6 @@ EXPORT_SYMBOL(bioset_init_from_src); #ifdef CONFIG_BLK_CGROUP -/** - * bio_associate_blkcg - associate a bio with the specified blkcg - * @bio: target bio - * @blkcg_css: css of the blkcg to associate - * - * Associate @bio with the blkcg specified by @blkcg_css. Block layer will - * treat @bio as if it were issued by a task which belongs to the blkcg. - * - * This function takes an extra reference of @blkcg_css which will be put - * when @bio is released. The caller must own @bio and is responsible for - * synchronizing calls to this function. If @blkcg_css is %NULL, a call to - * blkcg_get_css() finds the current css from the kthread or task. 
- */ -int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css) -{ - if (unlikely(bio->bi_css)) - return -EBUSY; - - if (blkcg_css) - css_get(blkcg_css); - else - blkcg_css = blkcg_get_css(); - - bio->bi_css = blkcg_css; - return 0; -} -EXPORT_SYMBOL_GPL(bio_associate_blkcg); - /** * bio_disassociate_blkg - puts back the blkg reference if associated * @bio: target bio @@ -1994,6 +1966,8 @@ EXPORT_SYMBOL_GPL(bio_associate_blkcg); void bio_disassociate_blkg(struct bio *bio) { if (bio->bi_blkg) { + /* a ref is always taken on css */ + css_put(&bio_blkcg(bio)->css); blkg_put(bio->bi_blkg); bio->bi_blkg = NULL; } @@ -2047,7 +2021,6 @@ void bio_associate_blkg_from_css(struct bio *bio, struct cgroup_subsys_state *css) { css_get(css); - bio->bi_css = css; __bio_associate_blkg_from_css(bio, css); } EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); @@ -2066,13 +2039,10 @@ void bio_associate_blkg_from_page(struct bio *bio, struct page *page) { struct cgroup_subsys_state *css; - if (unlikely(bio->bi_css)) - return; if (!page->mem_cgroup) return; css = cgroup_get_e_css(page->mem_cgroup->css.cgroup, &io_cgrp_subsys); - bio->bi_css = css; __bio_associate_blkg_from_css(bio, css); } #endif /* CONFIG_MEMCG */ @@ -2094,8 +2064,10 @@ void bio_associate_blkg(struct bio *bio) rcu_read_lock(); - bio_associate_blkcg(bio, NULL); - blkcg = bio_blkcg(bio); + if (bio->bi_blkg) + blkcg = bio->bi_blkg->blkcg; + else + blkcg = css_to_blkcg(blkcg_get_css()); if (!blkcg->css.parent) { __bio_associate_blkg(bio, q->root_blkg); @@ -2115,27 +2087,22 @@ EXPORT_SYMBOL_GPL(bio_associate_blkg); */ void bio_disassociate_task(struct bio *bio) { - if (bio->bi_css) { - css_put(bio->bi_css); - bio->bi_css = NULL; - } bio_disassociate_blkg(bio); } /** - * bio_clone_blkcg_association - clone blkcg association from src to dst bio + * bio_clone_blkg_association - clone blkg association from src to dst bio * @dst: destination bio * @src: source bio */ -void bio_clone_blkcg_association(struct bio 
*dst, struct bio *src) +void bio_clone_blkg_association(struct bio *dst, struct bio *src) { - if (src->bi_css) - WARN_ON(bio_associate_blkcg(dst, src->bi_css)); - - if (src->bi_blkg) + if (src->bi_blkg) { + css_get(&bio_blkcg(src)->css); __bio_associate_blkg(dst, src->bi_blkg); + } } -EXPORT_SYMBOL_GPL(bio_clone_blkcg_association); +EXPORT_SYMBOL_GPL(bio_clone_blkg_association); #endif /* CONFIG_BLK_CGROUP */ static void __init biovec_init_slabs(void) diff --git a/block/bounce.c b/block/bounce.c index cfb96d5170d0..ffb9e9ecfa7e 100644 --- a/block/bounce.c +++ b/block/bounce.c @@
[PATCH 09/14] blkcg: associate writeback bios with a blkg
One of the goals of this series is to remove a separate reference to the css of the bio. This can and should be accessed via bio_blkcg(). In this patch, wbc_init_bio() now requires a bio to have a device associated with it. Signed-off-by: Dennis Zhou Reviewed-by: Josef Bacik Acked-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 8 +--- block/bio.c | 18 ++ fs/buffer.c | 10 +- fs/ext4/page-io.c | 2 +- include/linux/bio.h | 5 + include/linux/writeback.h | 5 +++-- 6 files changed, 37 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 476722b7b636..baf19bf28385 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1879,8 +1879,10 @@ following two functions. wbc_init_bio(@wbc, @bio) Should be called for each bio carrying writeback data and - associates the bio with the inode's owner cgroup. Can be - called anytime between bio allocation and submission. + associates the bio with the inode's owner cgroup and the + corresponding request queue. This must be called after + a queue (device) has been associated with the bio and + before submission. wbc_account_io(@wbc, @page, @bytes) Should be called for each data segment being written out. @@ -1899,7 +1901,7 @@ the configuration, the bio may be executed at a lower priority and if the writeback session is holding shared resources, e.g. a journal entry, may lead to priority inversion. There is no one easy solution for the problem. Filesystems can try to work around specific problem -cases by skipping wbc_init_bio() or using bio_associate_blkcg() +cases by skipping wbc_init_bio() and using bio_associate_blkg() directly. 
diff --git a/block/bio.c b/block/bio.c index f0f069c1823c..b42477b6a225 100644 --- a/block/bio.c +++ b/block/bio.c @@ -2034,6 +2034,24 @@ static void __bio_associate_blkg_from_css(struct bio *bio, rcu_read_unlock(); } +/** + * bio_associate_blkg_from_css - associate a bio with a specified css + * @bio: target bio + * @css: target css + * + * Associate @bio with the blkg found by combining the css's blkg and the + * request_queue of the @bio. This takes a reference on the css that will + * be put upon freeing of @bio. + */ +void bio_associate_blkg_from_css(struct bio *bio, +struct cgroup_subsys_state *css) +{ + css_get(css); + bio->bi_css = css; + __bio_associate_blkg_from_css(bio, css); +} +EXPORT_SYMBOL_GPL(bio_associate_blkg_from_css); + #ifdef CONFIG_MEMCG /** * bio_associate_blkg_from_page - associate a bio with the page's blkg diff --git a/fs/buffer.c b/fs/buffer.c index 1286c2b95498..d60d61e8ed7d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -3060,11 +3060,6 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, */ bio = bio_alloc(GFP_NOIO, 1); - if (wbc) { - wbc_init_bio(wbc, bio); - wbc_account_io(wbc, bh->b_page, bh->b_size); - } - bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio_set_dev(bio, bh->b_bdev); bio->bi_write_hint = write_hint; @@ -3084,6 +3079,11 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, op_flags |= REQ_PRIO; bio_set_op_attrs(bio, op, op_flags); + if (wbc) { + wbc_init_bio(wbc, bio); + wbc_account_io(wbc, bh->b_page, bh->b_size); + } + submit_bio(bio); return 0; } diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index db7590178dfc..2aa62d58d8dd 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -374,13 +374,13 @@ static int io_submit_init_bio(struct ext4_io_submit *io, bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES); if (!bio) return -ENOMEM; - wbc_init_bio(io->io_wbc, bio); bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio_set_dev(bio, bh->b_bdev); bio->bi_end_io 
= ext4_end_bio; bio->bi_private = ext4_get_io_end(io->io_end); io->io_bio = bio; io->io_next_block = bh->b_blocknr; + wbc_init_bio(io->io_wbc, bio); return 0; } diff --git a/include/linux/bio.h b/include/linux/bio.h index f13572c254a7..f0438061a5a3 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -515,6 +515,8 @@ static inline void bio_associate_blkg_from_page(struct bio *bio, int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css); void bio_disassociate_blkg(struct bio *bio); void bio_associate_blkg(struct bio *bio); +void bio_associate_blkg_from_css(struct bio *bio, +struct cgroup_subsys_state *css); void bio_disassoci