Re: [PATCH rfc 01/30] nvme: Add admin connect request queue
On Sun, Jun 18, 2017 at 06:21:35PM +0300, Sagi Grimberg wrote:
> In case we reconnect with inflight admin IO we
> need to make sure that the connect comes before
> the admin command. This can be only achieved by
> using a separate request queue for admin connects.

Use up a few more lines of the available space for your lines? :)

Wouldn't a head insertion also solve the problem?
Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset
> +static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
> +		bool remove)
> {
> +	nvme_rdma_stop_queue(&ctrl->queues[0]);
> +	if (remove) {
> +		blk_cleanup_queue(ctrl->ctrl.admin_connect_q);
> +		blk_cleanup_queue(ctrl->ctrl.admin_q);
> +		blk_mq_free_tag_set(&ctrl->admin_tag_set);
> +		nvme_rdma_dev_put(ctrl->device);
> +	}
> +
> 	nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
> 		sizeof(struct nvme_command), DMA_TO_DEVICE);
> +	nvme_rdma_free_queue(&ctrl->queues[0]);

I don't like the calling convention.  We only have two callers anyway,
so I'd much rather only keep the code inside the if above in the new
nvme_rdma_destroy_admin_queue that is only called at shutdown time,
and open-code the calls to nvme_rdma_stop_queue, nvme_rdma_free_qe and
nvme_rdma_free_queue in the callers.

> -static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
> +static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
> +		bool new)

PCIe is just checking for a non-null admin_q.  But I think we should
just split this into two functions, one for the shared code at the end
and one just for the first-time setup, with the nvme_rdma_init_queue
call open-coded.

> 	error = nvmf_connect_admin_queue(&ctrl->ctrl);
> @@ -1596,6 +1601,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>
> 	set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>
> +	blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +

Where does this come from?
Re: [PATCH rfc 03/30] nvme-rdma: reuse configure/destroy admin queue
On Sun, Jun 18, 2017 at 06:21:37PM +0300, Sagi Grimberg wrote:
> We have all we need in these functions now that these
> are aware if we are doing a full instantiation/removal.
>
> For that we move nvme_rdma_configure_admin_queue to avoid
> a forward declaration, and we add blk_mq_ops forward declaration.
>
> Signed-off-by: Sagi Grimberg
> ---
>  drivers/nvme/host/rdma.c | 253 ++-
>  1 file changed, 119 insertions(+), 134 deletions(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 3e4c6aa119ee..5fef5545e365 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -140,6 +140,9 @@ static DEFINE_MUTEX(device_list_mutex);
>  static LIST_HEAD(nvme_rdma_ctrl_list);
>  static DEFINE_MUTEX(nvme_rdma_ctrl_mutex);
>
> +static const struct blk_mq_ops nvme_rdma_mq_ops;
> +static const struct blk_mq_ops nvme_rdma_admin_mq_ops;
> +
>  /*
>   * Disabling this option makes small I/O goes faster, but is fundamentally
>   * unsafe.  With it turned off we will have to register a global rkey that
> @@ -562,20 +565,22 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>
>  static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
>  {
> +	if (test_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> +		return;
>  	rdma_disconnect(queue->cm_id);
>  	ib_drain_qp(queue->qp);
>  }
>
>  static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>  {
> +	if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> +		return;
>  	nvme_rdma_destroy_queue_ib(queue);
>  	rdma_destroy_id(queue->cm_id);
>  }
>
>  static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>  {
> -	if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> -		return;
>  	nvme_rdma_stop_queue(queue);
>  	nvme_rdma_free_queue(queue);
>  }
> @@ -671,6 +676,116 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remo
>  	nvme_rdma_free_queue(&ctrl->queues[0]);
>  }
>  static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, bool new)

Too long line..  Also what about moving the helpers into the right
place in the previous patch that already re-indented most of this?
Re: [PATCH rfc 24/30] nvme-pci: rename to nvme_pci_configure_admin_queue
On Sun, Jun 18, 2017 at 06:21:58PM +0300, Sagi Grimberg wrote:
> we are going to need the name for the core routine...

I think we should just pick this up ASAP as a prep patch..

Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues
On Sun, Jun 18, 2017 at 06:22:02PM +0300, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg

Could use a changelog.

Ming: does this solve your problem of not seeing the new queues
after a qemu CPU hotplug + reset?

> ---
>  drivers/nvme/host/core.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6937ba26ff2c..476c49c0601f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2776,6 +2776,9 @@ int nvme_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
> 		ret = blk_mq_reinit_tagset(ctrl->tagset);
> 		if (ret)
> 			goto out_free_io_queues;
> +
> +		blk_mq_update_nr_hw_queues(ctrl->tagset,
> +			ctrl->queue_count - 1);
> 	}
>
> 	ret = nvme_start_io_queues(ctrl);
> --
> 2.7.4

---end quoted text---
Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration
On Sun, Jun 18, 2017 at 06:22:03PM +0300, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg

The subject sounds odd and it could use a changelog.  But I'd love to
pick this change up ASAP as it's the right thing to do..
[PATCH 1/2] block: remove the unused bio_to_phys macro
Signed-off-by: Christoph Hellwig
---
 include/linux/bio.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 40d054185277..83f9edb39d3f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -118,7 +118,6 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)	(page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
 #define bvec_to_phys(bv)	(page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)

 /*
--
2.11.0
[PATCH 2/2] block: stop using bio_data() in blk_write_same_mergeable
While the Write Same page currently always is in lowmem, it is just as
easy and safer to compare the page and offset directly.

Signed-off-by: Christoph Hellwig
---
 include/linux/blkdev.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db2b7a7450bd..a61907438139 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -815,7 +815,8 @@ static inline bool rq_mergeable(struct request *rq)
 static inline bool blk_write_same_mergeable(struct bio *a,
 					    struct bio *b)
 {
-	if (bio_data(a) == bio_data(b))
+	if (bio_page(a) == bio_page(b) &&
+	    bio_offset(a) == bio_offset(b))
 		return true;
 	return false;
--
2.11.0
two small cleanups
Stop relying on bio_data() and thus page_address() in the Write Same code, and remove an unused macro.
move bounce limits settings into the drivers
Currently we still default to a bounce-all-highmem setting for block drivers. This series defaults to no bouncing and instead adds calls to blk_queue_bounce_limit to those drivers that need it. It also has a few cleanups in that area.
[PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
pktcdvd is a make_request based stacking driver and thus doesn't have
any addressing limits of its own.  It also doesn't use bio_data() or
page_address(), so it doesn't need a lowmem bounce either.

Signed-off-by: Christoph Hellwig
---
 drivers/block/pktcdvd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 26c04baae967..7f2eca31eb6a 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2413,8 +2413,6 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
 	char b[BDEVNAME_SIZE];
 	struct bio *split;

-	blk_queue_bounce(q, &bio);
-
 	blk_queue_split(q, &bio);

 	pd = q->queuedata;
--
2.11.0
[PATCH 02/10] blk-map: call blk_queue_bounce from blk_rq_append_bio
This moves the knowledge about bouncing out of the callers and into the
block core (just like we do for the normal I/O path), and allows us to
unexport blk_queue_bounce.

Signed-off-by: Christoph Hellwig
---
 block/blk-map.c                  | 7 +++----
 block/bounce.c                   | 2 --
 drivers/scsi/osd/osd_initiator.c | 5 +----
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 3b5cb863318f..2547016aa7aa 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -16,6 +16,8 @@
  */
 int blk_rq_append_bio(struct request *rq, struct bio *bio)
 {
+	blk_queue_bounce(rq->q, &bio);
+
 	if (!rq->bio) {
 		blk_rq_bio_prep(rq->q, rq, bio);
 	} else {
@@ -72,15 +74,13 @@ static int __blk_rq_map_user_iov(struct request *rq,
 		map_data->offset += bio->bi_iter.bi_size;

 	orig_bio = bio;
-	blk_queue_bounce(q, &bio);

 	/*
 	 * We link the bounce buffer in and could have to traverse it
 	 * later so we have to get a ref to prevent it from being freed
 	 */
-	bio_get(bio);
-
 	ret = blk_rq_append_bio(rq, bio);
+	bio_get(bio);
 	if (ret) {
 		bio_endio(bio);
 		__blk_rq_unmap_user(orig_bio);
@@ -249,7 +249,6 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
 		return ret;
 	}

-	blk_queue_bounce(q, &rq->bio);
 	return 0;
 }
 EXPORT_SYMBOL(blk_rq_map_kern);
diff --git a/block/bounce.c b/block/bounce.c
index 17d77613c471..b7e44bd59d7c 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -284,5 +284,3 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
 	 */
 	__blk_queue_bounce(q, bio_orig, pool);
 }
-
-EXPORT_SYMBOL(blk_queue_bounce);
diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c
index 1e69a43b279d..3754888dc7dd 100644
--- a/drivers/scsi/osd/osd_initiator.c
+++ b/drivers/scsi/osd/osd_initiator.c
@@ -1577,10 +1577,7 @@ static struct request *_make_request(struct request_queue *q, bool has_write,
 	scsi_req_init(req);

 	for_each_bio(bio) {
-		struct bio *bounce_bio = bio;
-
-		blk_queue_bounce(req->q, &bounce_bio);
-		ret = blk_rq_append_bio(req, bounce_bio);
+		ret = blk_rq_append_bio(req, bio);
 		if (ret)
 			return ERR_PTR(ret);
 	}
--
2.11.0
[PATCH 06/10] blk-mq: don't bounce by default
For historical reasons we default to bouncing highmem pages for all
block queues.  But the blk-mq drivers are easy to audit to ensure that
we don't need this - scsi and mtip32xx set explicit limits and everyone
else doesn't have any particular ones.

Signed-off-by: Christoph Hellwig
---
 block/blk-mq.c               | 5 -----
 drivers/block/virtio_blk.c   | 3 ---
 drivers/block/xen-blkfront.c | 3 ---
 3 files changed, 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e26b7435e245..dd276a9e138e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2300,11 +2300,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_queue_make_request(q, blk_mq_make_request);

 	/*
-	 * by default assume old behaviour and bounce for any highmem page
-	 */
-	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
-	/*
 	 * Do this after blk_queue_make_request() overrides it...
 	 */
 	q->nr_requests = set->queue_depth;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e59bd4549a8a..0297ad7c1452 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -720,9 +720,6 @@ static int virtblk_probe(struct virtio_device *vdev)
 	/* We can handle whatever the host told us to handle. */
 	blk_queue_max_segments(q, vblk->sg_elems-2);

-	/* No need to bounce any requests */
-	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
-
 	/* No real sector limit. */
 	blk_queue_max_hw_sectors(q, -1U);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ac90093fcb25..c852ed3c01d5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -953,9 +953,6 @@ static void blkif_set_queue_limits(struct blkfront_info *info)

 	/* Make sure buffer addresses are sector-aligned. */
 	blk_queue_dma_alignment(rq, 511);
-
-	/* Make sure we don't use bounce buffers. */
-	blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
 }

 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
--
2.11.0
[PATCH 04/10] block: remove the queue_bounce_pfn helper
Only used inside the bounce code, and open coding it makes it more
obvious what is going on.

Signed-off-by: Christoph Hellwig
---
 block/bounce.c         | 6 +++---
 include/linux/blkdev.h | 5 -----
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index fc79dd3e0f20..f2b4361fff81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -203,7 +203,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bio_for_each_segment(from, *bio_orig, iter) {
 		if (i++ < BIO_MAX_PAGES)
 			sectors += from.bv_len >> 9;
-		if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
+		if (page_to_pfn(from.bv_page) > q->limits.bounce_pfn)
 			bounce = true;
 	}
 	if (!bounce)
@@ -220,7 +220,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
 	bio_for_each_segment_all(to, bio, i) {
 		struct page *page = to->bv_page;

-		if (page_to_pfn(page) <= queue_bounce_pfn(q))
+		if (page_to_pfn(page) <= q->limits.bounce_pfn)
 			continue;

 		to->bv_page = mempool_alloc(pool, q->bounce_gfp);
@@ -272,7 +272,7 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
 	 * don't waste time iterating over bio segments
 	 */
 	if (!(q->bounce_gfp & GFP_DMA)) {
-		if (queue_bounce_pfn(q) >= blk_max_pfn)
+		if (q->limits.bounce_pfn >= blk_max_pfn)
 			return;
 		pool = page_pool;
 	} else {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f1a5f82a4d97..db2b7a7450bd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1363,11 +1363,6 @@ enum blk_default_limits {
 #define blkdev_entry_to_request(entry) list_entry((entry), struct request, queuelist)

-static inline unsigned long queue_bounce_pfn(struct request_queue *q)
-{
-	return q->limits.bounce_pfn;
-}
-
 static inline unsigned long queue_segment_boundary(struct request_queue *q)
 {
 	return q->limits.seg_boundary_mask;
--
2.11.0
[PATCH 05/10] block: don't bother with bounce limits for make_request drivers
We only call blk_queue_bounce for request-based drivers, so stop
messing with it for make_request based drivers.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c               | 5 +++++
 block/blk-mq.c                 | 5 +++++
 block/blk-settings.c           | 5 -----
 drivers/block/brd.c            | 1 -
 drivers/block/drbd/drbd_main.c | 1 -
 drivers/block/rsxx/dev.c       | 1 -
 drivers/nvdimm/blk.c           | 1 -
 drivers/nvdimm/btt.c           | 1 -
 drivers/nvdimm/pmem.c          | 1 -
 9 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 62cf92550512..07b26853ce1e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -954,6 +954,11 @@ int blk_init_allocated_queue(struct request_queue *q)
 	 */
 	blk_queue_make_request(q, blk_queue_bio);

+	/*
+	 * by default assume old behaviour and bounce for any highmem page
+	 */
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
 	q->sg_reserved_size = INT_MAX;

 	/* Protect q->elevator from elevator_change */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index dd276a9e138e..e26b7435e245 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2300,6 +2300,11 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_queue_make_request(q, blk_mq_make_request);

 	/*
+	 * by default assume old behaviour and bounce for any highmem page
+	 */
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
+	/*
 	 * Do this after blk_queue_make_request() overrides it...
 	 */
 	q->nr_requests = set->queue_depth;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4fa81ed383ca..be1f115b538b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -172,11 +172,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
 	q->nr_batching = BLK_BATCH_REQ;

 	blk_set_default_limits(&q->limits);
-
-	/*
-	 * by default assume old behaviour and bounce for any highmem page
-	 */
-	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 }
 EXPORT_SYMBOL(blk_queue_make_request);
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 57b574f2f66a..6112e99bedf7 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -418,7 +418,6 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
-	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);

 	/* This is so fdisk will align partitions on 4k, because of
 	 * direct_access API needing 4k alignment, returning a PFN
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 90680034ef57..5fb99e06ebe4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2850,7 +2850,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 	/* Setting the max_hw_sectors to an odd value of 8kibyte here
 	   This triggers a max_bio_size message upon first attach or connect */
 	blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
-	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
 	q->queue_lock = &resource->req_lock;

 	device->md_io.page = alloc_page(GFP_KERNEL);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 4e8bdfa0aa31..7f4acebf4657 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -284,7 +284,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
 	}

 	blk_queue_make_request(card->queue, rsxx_make_request);
-	blk_queue_bounce_limit(card->queue, BLK_BOUNCE_ANY);
 	blk_queue_max_hw_sectors(card->queue, blkdev_max_hw_sectors);
 	blk_queue_physical_block_size(card->queue, RSXX_HW_BLK_SIZE);
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 79eb9fb358d5..f12d23c49771 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -273,7 +273,6 @@ static int nsblk_attach_disk(struct nd_namespace_blk *nsblk)
 	blk_queue_make_request(q, nd_blk_make_request);
 	blk_queue_max_hw_sectors(q, UINT_MAX);
-	blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
 	blk_queue_logical_block_size(q, nsblk_sector_size(nsblk));
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
 	q->queuedata = nsblk;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 31b2d14e210d..b6ba0618ea46 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1297,7 +1297,6 @@ static int btt_blk_init(struct btt *btt)
 	blk_queue_make_request(btt->btt_queue, btt_make_request);
 	blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
 	blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX);
-	blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue);
 	btt->btt_queue
[PATCH 03/10] block: move bounce declarations to block/blk.h
Signed-off-by: Christoph Hellwig
---
 block/blk.h            | 13 +++++++++++++
 block/bounce.c         |  1 +
 include/linux/blkdev.h | 13 -------------
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 83c8e1100525..4576fb5a8006 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -334,4 +334,17 @@ static inline void blk_throtl_bio_endio(struct bio *bio) { }
 static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
 #endif

+#ifdef CONFIG_BOUNCE
+extern int init_emergency_isa_pool(void);
+extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
+#else
+static inline int init_emergency_isa_pool(void)
+{
+	return 0;
+}
+static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
+{
+}
+#endif /* CONFIG_BOUNCE */
+
 #endif /* BLK_INTERNAL_H */
diff --git a/block/bounce.c b/block/bounce.c
index b7e44bd59d7c..fc79dd3e0f20 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -22,6 +22,7 @@
 #include
 #include

+#include "blk.h"

 #define POOL_SIZE	64
 #define ISA_POOL_SIZE	16
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22cfba64ce81..f1a5f82a4d97 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -863,19 +863,6 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
 #define BLK_DEFAULT_SG_TIMEOUT	(60 * HZ)
 #define BLK_MIN_SG_TIMEOUT	(7 * HZ)

-#ifdef CONFIG_BOUNCE
-extern int init_emergency_isa_pool(void);
-extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
-#else
-static inline int init_emergency_isa_pool(void)
-{
-	return 0;
-}
-static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
-{
-}
-#endif /* CONFIG_MMU */
-
 struct rq_map_data {
 	struct page **pages;
 	int page_order;
--
2.11.0
[PATCH 09/10] dm: don't set bounce limit
Now all queue allocators come without a bounce limit by default, so dm
doesn't have to override this anymore.

Signed-off-by: Christoph Hellwig
---
 drivers/md/dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fbd06b9f9467..402946035308 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1534,7 +1534,6 @@ void dm_init_normal_md_queue(struct mapped_device *md)
 	 * Initialize aspects of queue that aren't relevant for blk-mq
 	 */
 	md->queue->backing_dev_info->congested_fn = dm_any_congested;
-	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 }

 static void cleanup_mapped_device(struct mapped_device *md)
--
2.11.0
[PATCH 08/10] block: don't set bounce limit in blk_init_queue
Instead move it to the callers.  Those that either don't use bio_data()
or page_address() or are specific to architectures that do not support
highmem are skipped.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c            | 5 -----
 drivers/block/aoe/aoeblk.c  | 1 +
 drivers/block/floppy.c      | 1 +
 drivers/block/paride/pcd.c  | 1 +
 drivers/block/paride/pd.c   | 1 +
 drivers/block/paride/pf.c   | 1 +
 drivers/block/skd_main.c    | 1 +
 drivers/block/swim.c        | 2 ++
 drivers/block/swim3.c       | 1 +
 drivers/block/xsysace.c     | 1 +
 drivers/cdrom/gdrom.c       | 1 +
 drivers/mtd/mtd_blkdevs.c   | 1 +
 drivers/sbus/char/jsflash.c | 1 +
 13 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ef82164c6d82..62cf92550512 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,11 +927,6 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}

-	/*
-	 * by default assume old behaviour and bounce for any highmem page
-	 */
-	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
 	return q;
 }
 EXPORT_SYMBOL(blk_init_queue_node);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 027b876370bc..6797e6c23c8a 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -388,6 +388,7 @@ aoeblk_gdalloc(void *vp)
 			d->aoemajor, d->aoeminor);
 		goto err_mempool;
 	}
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);

 	spin_lock_irqsave(&d->lock, flags);
 	WARN_ON(!(d->flags & DEVFL_GD_NOW));
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 9e3cb32e365d..ce823647a9c4 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4203,6 +4203,7 @@ static int __init do_floppy_init(void)
 			goto out_put_disk;
 		}

+		blk_queue_bounce_limit(disks[drive]->queue, BLK_BOUNCE_HIGH);
 		blk_queue_max_hw_sectors(disks[drive]->queue, 64);
 		disks[drive]->major = FLOPPY_MAJOR;
 		disks[drive]->first_minor = TOMINOR(drive);
diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c
index cffe42d80ce9..7b8c6368beb7 100644
--- a/drivers/block/paride/pcd.c
+++ b/drivers/block/paride/pcd.c
@@ -305,6 +305,7 @@ static void pcd_init_units(void)
 			put_disk(disk);
 			continue;
 		}
+		blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
 		cd->disk = disk;
 		cd->pi = &cd->pia;
 		cd->present = 0;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index c98983be4f9c..27a44b97393a 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -863,6 +863,7 @@ static void pd_probe_drive(struct pd_unit *disk)
 		return;
 	}
 	blk_queue_max_hw_sectors(p->queue, cluster);
+	blk_queue_bounce_limit(p->queue, BLK_BOUNCE_HIGH);

 	if (disk->drive == -1) {
 		for (disk->drive = 0; disk->drive <= 1; disk->drive++)
diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c
index 5f46da8d05cd..eef7a91f667d 100644
--- a/drivers/block/paride/pf.c
+++ b/drivers/block/paride/pf.c
@@ -293,6 +293,7 @@ static void __init pf_init_units(void)
 			return;
 		}
 		blk_queue_max_segments(disk->queue, cluster);
+		blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
 		pf->disk = disk;
 		pf->pi = &pf->pia;
 		pf->media_status = PF_NM;
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index e6c526861703..d0368682bd43 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -4273,6 +4273,7 @@ static int skd_cons_disk(struct skd_device *skdev)
 		rc = -ENOMEM;
 		goto err_out;
 	}
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);

 	skdev->queue = q;
 	disk->queue = q;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 1633aaf24060..84434d3ea19b 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -864,6 +864,8 @@ static int swim_floppy_init(struct swim_priv *swd)
 			put_disk(swd->unit[drive].disk);
 			goto exit_put_disks;
 		}
+		blk_queue_bounce_limit(swd->unit[drive].disk->queue,
+				BLK_BOUNCE_HIGH);
 		swd->unit[drive].disk->queue->queuedata = swd;
 		swd->unit[drive].swd = swd;
 	}
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index e3399a138335..9f931f8f6b4c 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1223,6 +1223,7 @@ static int swim3_attach(struct macio_dev *mdev,
 		put_disk(disk);
 		return -ENOMEM;
 	}
+
[PATCH 07/10] block: don't set bounce limit in blk_init_allocated_queue
And just move it into scsi_transport_sas, which needs it due to
low-level drivers directly dereferencing bio_data, and into
blk_init_queue_node, which will need a further push into the callers.

Signed-off-by: Christoph Hellwig
---
 block/blk-core.c                  | 10 +++++-----
 drivers/scsi/scsi_transport_sas.c |  5 +++++
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 07b26853ce1e..ef82164c6d82 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,6 +927,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 		return NULL;
 	}

+	/*
+	 * by default assume old behaviour and bounce for any highmem page
+	 */
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
 	return q;
 }
 EXPORT_SYMBOL(blk_init_queue_node);
@@ -954,11 +959,6 @@ int blk_init_allocated_queue(struct request_queue *q)
 	 */
 	blk_queue_make_request(q, blk_queue_bio);

-	/*
-	 * by default assume old behaviour and bounce for any highmem page
-	 */
-	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
 	q->sg_reserved_size = INT_MAX;

 	/* Protect q->elevator from elevator_change */
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index cc970c811bcb..87ea904f0700 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -249,6 +249,11 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, struct sas_rphy *rphy)
 	if (error)
 		goto out_cleanup_queue;

+	/*
+	 * by default assume old behaviour and bounce for any highmem page
+	 */
+	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
 	error = bsg_register_queue(q, dev, name, release);
 	if (error)
 		goto out_cleanup_queue;
--
2.11.0
[PATCH 10/10] mmc/block: remove a call to blk_queue_bounce_limit
BLK_BOUNCE_ANY is the default now, so the call is superfluous.

Signed-off-by: Christoph Hellwig
---
 drivers/mmc/core/queue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 7f20298d892b..b659a28c8018 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -388,7 +388,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card,
 	mmc_queue_setup_discard(mq->queue, card);

 	if (card->bouncesz) {
-		blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_ANY);
 		blk_queue_max_hw_sectors(mq->queue, card->bouncesz / 512);
 		blk_queue_max_segments(mq->queue, card->bouncesz / 512);
 		blk_queue_max_segment_size(mq->queue, card->bouncesz);
--
2.11.0
Re: [PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split
Hi Paolo,

[auto build test WARNING on v4.12-rc5]
[also build test WARNING on next-20170616]
[cannot apply to block/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-update-wr_busy_queues-if-needed-on-a-queue-split/20170619-145003
config: i386-randconfig-x000-201725 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   block/bfq-iosched.c: In function 'bfq_get_rq_private':
>> block/bfq-iosched.c:770:10: warning: 'old_wr_coeff' may be used uninitialized in this function [-Wmaybe-uninitialized]
      else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
             ^
   block/bfq-iosched.c:731:15: note: 'old_wr_coeff' was declared here
     unsigned int old_wr_coeff;
                  ^~~~

vim +/old_wr_coeff +770 block/bfq-iosched.c

   754  				time_is_before_jiffies(bfqq->last_wr_start_finish +
   755  						       bfqq->wr_cur_max_time))) {
   756  		bfq_log_bfqq(bfqq->bfqd, bfqq,
   757  			     "resume state: switching off wr");
   758  
   759  		bfqq->wr_coeff = 1;
   760  	}
   761  
   762  	/* make sure weight will be updated, however we got here */
   763  	bfqq->entity.prio_changed = 1;
   764  
   765  	if (likely(!busy))
   766  		return;
   767  
   768  	if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
   769  		bfqd->wr_busy_queues++;
 > 770  	else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
   771  		bfqd->wr_busy_queues--;
   772  }
   773  
   774  static int bfqq_process_refs(struct bfq_queue *bfqq)
   775  {
   776  	return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
   777  }
   778  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: [PATCH rfc 01/30] nvme: Add admin connect request queue
>> In case we reconnect with inflight admin IO we
>> need to make sure that the connect comes before
>> the admin command. This can be only achieved by
>> using a separate request queue for admin connects.
>
> Use up a few more lines of the available space for your lines? :)

I warned in the cover-letter that the change logs are pure negligence
at the moment :)

> Wouldn't a head insertion also solve the problem?

A head insertion will not protect against it, because we must invoke
blk_mq_start_stopped_hw_queues on the admin_q request queue so the
admin connect can make progress, and at this point (before we actually
queue up the connect) pending admin commands can sneak in...

However, you raise a valid point. I think I added this before we had
the queue_is_ready protection, which will reject the command if the
queue is not LIVE (unless it's a connect). I think the reason it's
still in is that I tested this with loop, which doesn't have a
per-queue state machine.

I'm still wondering if it's a good idea to rely on the transport queue
state to reject non-connect requests on non-LIVE queues. If/when we
introduce a queue representation to the core and drive the state
machine there, then we could actually rely on it (I do have some code
for it, but it's a pretty massive change which cannot be added in an
incremental fashion).

Thoughts?
Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset
On 19/06/17 10:18, Christoph Hellwig wrote:
>> +static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
>> +		bool remove)
>> {
>> +	nvme_rdma_stop_queue(&ctrl->queues[0]);
>> +	if (remove) {
>> +		blk_cleanup_queue(ctrl->ctrl.admin_connect_q);
>> +		blk_cleanup_queue(ctrl->ctrl.admin_q);
>> +		blk_mq_free_tag_set(&ctrl->admin_tag_set);
>> +		nvme_rdma_dev_put(ctrl->device);
>> +	}
>> +
>> 	nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
>> 		sizeof(struct nvme_command), DMA_TO_DEVICE);
>> +	nvme_rdma_free_queue(&ctrl->queues[0]);
>
> I don't like the calling convention.  We only have two callers anyway,
> so I'd much rather only keep the code inside the if above in the new
> nvme_rdma_destroy_admin_queue that is only called at shutdown time,
> and open-code the calls to nvme_rdma_stop_queue, nvme_rdma_free_qe
> and nvme_rdma_free_queue in the callers.

We can do that, but this tries to eliminate duplicate code as much as
possible. It's not like the convention is unprecedented...

>> -static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>> +static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>> +		bool new)
>
> PCIe is just checking for a non-null admin_q.

Which I don't like very much :)

> But I think we should just split this into two functions, one for the
> shared code at the end and one just for the first-time setup, with the
> nvme_rdma_init_queue call open-coded.

We can split, but I like the idea of open-coding nvme_rdma_init_queue
at the call-sites even less.

>> 	error = nvmf_connect_admin_queue(&ctrl->ctrl);
>> @@ -1596,6 +1601,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>>
>> 	set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>
>> +	blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> +
>
> Where does this come from?

Spilled in, I guess...
Re: [PATCH rfc 03/30] nvme-rdma: reuse configure/destroy admin queue
>> static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>> 		bool new)
>
> Too long line.. Also what about moving the helpers into the right
> place in the previous patch that already re-indented most of this?

Sure.
Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration
> The subject sounds odd and it could use a changelog. But I'd love to
> pick this change up ASAP as it's the right thing to do..

How? Where would you place it? There is no nvme_configure_admin_queue
in nvme-core. I suppose we could get it in nvme_init_ctrl?
Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues
On Mon, Jun 19, 2017 at 09:21:48AM +0200, Christoph Hellwig wrote:
> On Sun, Jun 18, 2017 at 06:22:02PM +0300, Sagi Grimberg wrote:
>> Signed-off-by: Sagi Grimberg
>
> Could use a changelog.
>
> Ming: does this solve your problem of not seeing the new queues
> after a qemu CPU hotplug + reset?

The issue I observed is that there isn't an NVMe reset triggered after
a CPU becomes online. It may take a while since the test needs to apply
the whole patchset. Or is it fine to figure out one fix for this issue?
Sagi? Once the test is done, I will update you.

Thanks,
Ming
Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression
On Mon, Jun 19, 2017 at 03:49:43PM +0800, Ye Xiaolong wrote:
> On 06/19, Christoph Hellwig wrote:
>> On Mon, Jun 19, 2017 at 02:03:18PM +0800, kernel test robot wrote:
>>>
>>> Greeting,
>>>
>>> FYI, we noticed a -4% regression of fio.write_bw_MBps due to commit:
>>
>> What does cat /sys/block/<device>/queue/scheduler say for your test?
>> Can you try with mq-deadline or bfq and see what the results are?
>
> Could you tell me how to do it? via changing kernel config?

BFQ doesn't seem to be enabled in the config, so it would have to be
enabled.

I've not seen a compile-time option for the MQ I/O scheduler (unlike
the legacy one), so the way to change it would be to echo the name to
/sys/block/<device>/queue/scheduler
Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression
On 06/19, Christoph Hellwig wrote:
> On Mon, Jun 19, 2017 at 03:49:43PM +0800, Ye Xiaolong wrote:
>> On 06/19, Christoph Hellwig wrote:
>>> On Mon, Jun 19, 2017 at 02:03:18PM +0800, kernel test robot wrote:
>>>>
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -4% regression of fio.write_bw_MBps due to commit:
>>>
>>> What does cat /sys/block/<device>/queue/scheduler say for your test?
>>> Can you try with mq-deadline or bfq and see what the results are?
>>
>> Could you tell me how to do it? via changing kernel config?
>
> BFQ doesn't seem to be enabled in the config, so it would have to be
> enabled.
>
> I've not seen a compile-time option for the MQ I/O scheduler (unlike
> the legacy one), so the way to change it would be to echo the name to
> /sys/block/<device>/queue/scheduler

echo mq-deadline > /sys/block/<device>/queue/scheduler ?

I'll try and post the results once I get it.

Thanks,
Xiaolong
Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression
On Mon, Jun 19, 2017 at 04:52:36PM +0800, Ye Xiaolong wrote:
>> I've not seen a compile-time option for the MQ I/O scheduler (unlike
>> the legacy one), so the way to change it would be to echo the name to
>> /sys/block/<device>/queue/scheduler
>
> echo mq-deadline > /sys/block/<device>/queue/scheduler ?
>
> I'll try and post the results once I get it.

Thanks!
[PATCH] btrfs: use new block error code
This function is supposed to return blk_status_t error codes now, but
there was a stray -ENOMEM left behind.

Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e536e98fe351..2c0b7b57fcd5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -584,7 +584,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 						 __GFP_HIGHMEM);
 		if (!cb->compressed_pages[pg_index]) {
 			faili = pg_index - 1;
-			ret = -ENOMEM;
+			ret = BLK_STS_RESOURCE;
 			goto fail2;
 		}
 	}
Re: [PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split
> Il giorno 19 giu 2017, alle ore 09:38, kbuild test robot ha scritto:
>
> Hi Paolo,
>
> [auto build test WARNING on v4.12-rc5]
> [also build test WARNING on next-20170616]
> [cannot apply to block/for-next]
> [if your patch is applied to the wrong git tree, please drop us a note
> to help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-update-wr_busy_queues-if-needed-on-a-queue-split/20170619-145003
> config: i386-randconfig-x000-201725 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
>	# save the attached .config to linux build tree
>	make ARCH=i386
>
> Note: it may well be a FALSE warning. FWIW you are at least aware of
> it now.
> http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
>
> All warnings (new ones prefixed by >>):
>
>    block/bfq-iosched.c: In function 'bfq_get_rq_private':
>>> block/bfq-iosched.c:770:10: warning: 'old_wr_coeff' may be used
>>> uninitialized in this function [-Wmaybe-uninitialized]
>      else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
>           ^
>    block/bfq-iosched.c:731:15: note: 'old_wr_coeff' was declared here
>      unsigned int old_wr_coeff;
>                   ^~~~
>

I'm sending a V2, (probably imperceptibly) slower on average, but not
confusing the compiler.
Thanks,
Paolo

> vim +/old_wr_coeff +770 block/bfq-iosched.c
>
>    754		    time_is_before_jiffies(bfqq->last_wr_start_finish +
>    755					   bfqq->wr_cur_max_time))) {
>    756			bfq_log_bfqq(bfqq->bfqd, bfqq,
>    757				     "resume state: switching off wr");
>    758
>    759			bfqq->wr_coeff = 1;
>    760		}
>    761
>    762		/* make sure weight will be updated, however we got here */
>    763		bfqq->entity.prio_changed = 1;
>    764
>    765		if (likely(!busy))
>    766			return;
>    767
>    768		if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
>    769			bfqd->wr_busy_queues++;
>  > 770		else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
>    771			bfqd->wr_busy_queues--;
>    772	}
>    773
>    774	static int bfqq_process_refs(struct bfq_queue *bfqq)
>    775	{
>    776		return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
>    777	}
>    778
>
> ---
> 0-DAY kernel test infrastructure            Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> <.config.gz>
[PATCH BUGFIX V2] block, bfq: update wr_busy_queues if needed on a queue split
This commit fixes a bug triggered by a non-trivial sequence of events.
These events are briefly described in the next two paragraphs. The
impatient, or those who are familiar with queue merging and splitting,
can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non-rotational, then random requests must also be small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to
non-shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
   bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
   to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update was
missing. This commit adds the missing update.
Reported-by: Luca Miccio
Signed-off-by: Paolo Valente
---
 block/bfq-iosched.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ed93da2..bbeaf52 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -725,8 +725,12 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 }
 
 static void
-bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
+		      struct bfq_io_cq *bic, bool bfq_already_existing)
 {
+	unsigned int old_wr_coeff = bfqq->wr_coeff;
+	bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
+
 	if (bic->saved_idle_window)
 		bfq_mark_bfqq_idle_window(bfqq);
 	else
@@ -754,6 +758,14 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 
 	/* make sure weight will be updated, however we got here */
 	bfqq->entity.prio_changed = 1;
+
+	if (likely(!busy))
+		return;
+
+	if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
+		bfqd->wr_busy_queues++;
+	else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
+		bfqd->wr_busy_queues--;
 }
 
 static int bfqq_process_refs(struct bfq_queue *bfqq)
@@ -4402,7 +4414,7 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
 	const int is_sync = rq_is_sync(rq);
 	struct bfq_queue *bfqq;
 	bool new_queue = false;
-	bool split = false;
+	bool bfqq_already_existing = false, split = false;
 
 	spin_lock_irq(&bfqd->lock);
 
@@ -4432,6 +,8 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
 				bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 								 true, is_sync,
 								 NULL);
+			else
+				bfqq_already_existing = true;
 		}
 	}
 
@@ -4457,7 +4471,8 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
 			 * queue: restore the idle window and the
 			 * possible weight raising period.
 			 */
-			bfq_bfqq_resume_state(bfqq, bic);
+			bfq_bfqq_resume_state(bfqq, bfqd, bic,
+					      bfqq_already_existing);
 		}
 	}
 
-- 
2.10.0
dm integrity tests crash kernel (4.12-rc5)
Hi,

the cryptsetup testsuite easily triggers the following crash. I can
provide more info on demand, but currently the most straightforward way
to trigger it is:

1) check out the cryptsetup master branch
   (https://gitlab.com/cryptsetup/cryptsetup.git)
2) ./autogen.sh --disable-python --enable-integritysetup
3) compile
4) run tests/integrity-compat-test in a loop for a while (it's not 100%
   reproducible)

[  330.980914] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
[  330.980923] [ cut here ]
[  330.982627] WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 trace_hardirqs_on_caller+0x107/0x180
[  330.984340] Modules linked in: dm_integrity async_xor xor async_tx dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd
[  330.989205] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-rc5 #1
[  330.990645] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  330.992616] task: 88007c838000 task.stack: c9368000
[  330.994084] RIP: 0010:trace_hardirqs_on_caller+0x107/0x180
[  330.995361] RSP: 0018:88007fd03e38 EFLAGS: 00010092
[  330.996594] RAX: 002d RBX: 8800322404e0 RCX:
[  330.998194] RDX: 810badd8 RSI: 0001 RDI: 810badf2
[  330.999801] RBP: 88007fd03e48 R08: 0001 R09:
[  331.001427] R10: R11: 810badbe R12: 81809ff7
[  331.003045] R13: 880032c4e700 R14: 880077cd9bc0 R15:
[  331.005616] FS: () GS:88007fd0() knlGS:
[  331.008308] CS: 0010 DS: ES: CR0: 80050033
[  331.010107] CR2: 025e7f50 CR3: 39c7c000 CR4: 000406e0
[  331.011968] Call Trace:
[  331.012806]
[  331.013684]  trace_hardirqs_on+0xd/0x10
[  331.014732]  _raw_spin_unlock_irq+0x27/0x30
[  331.015840]  submit_flush_bio+0x4e/0x80 [dm_integrity]
[  331.017157]  do_endio_flush+0x41/0x70 [dm_integrity]
[  331.018476]  dec_in_flight+0x59/0x110 [dm_integrity]
[  331.019767]  integrity_end_io+0x5e/0x70 [dm_integrity]
[  331.020965]  bio_endio+0x7c/0x1a0
[  331.021917]  blk_update_request+0x9f/0x3d0
[  331.023050]  blk_mq_end_request+0x15/0x60
[  331.024224]  lo_complete_rq+0x2b/0x80
[  331.025406]  __blk_mq_complete_request_remote+0xe/0x10
[  331.026813]  flush_smp_call_function_queue+0x4f/0x110
[  331.028173]  generic_smp_call_function_single_interrupt+0xe/0x20
[  331.029616]  smp_call_function_single_interrupt+0x22/0x30
[  331.031125]  call_function_single_interrupt+0x90/0xa0
[  331.032819] RIP: 0010:default_idle+0x1b/0x180
[  331.035740] RSP: 0018:c936beb0 EFLAGS: 0206 ORIG_RAX: ff04
[  331.039007] RAX: 88007c838000 RBX: 88007c838000 RCX: 0001
[  331.041332] RDX: RSI: 0001 RDI: 88007c838000
[  331.043227] RBP: c936bec0 R08: R09: 0001
[  331.045066] R10: R11: R12: 0001
[  331.047367] R13: 88007c838000 R14: R15:
[  331.049232]
[  331.050297]  arch_cpu_idle+0xa/0x10
[  331.051505]  default_idle_call+0x1e/0x30
[  331.052874]  do_idle+0x15a/0x1c0
[  331.053836]  cpu_startup_entry+0x18/0x20
[  331.054997]  start_secondary+0xed/0xf0
[  331.056183]  secondary_startup_64+0x9f/0x9f
[  331.057503] Code: 41 5c 5d f3 c3 e8 ea 75 29 00 85 c0 74 f1 8b 35 b0 92 ca 01 85 f6 75 e7 48 c7 c6 67 8e c1 81 48 c7 c7 43 6d c1 81 e8 dc b4 0b 00 <0f> ff eb d0 be 01 00 00 00 48 89 df e8 48 fe ff ff 85 c0 75 90
[  331.062261] ---[ end trace fd21f79668c6a046 ]---
[  331.063745] [ cut here ]
[  331.066503] kernel BUG at kernel/irq_work.c:135!
[  331.068374] invalid opcode: [#1] SMP
[  331.070144] Modules linked in: dm_integrity async_xor xor async_tx dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd
[  331.076302] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW 4.12.0-rc5 #1
[  331.078639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  331.081124] task: 88007c838000 task.stack: c9368000
[  331.082646] RIP: 0010:irq_work_run_list+0x62/0x70
[  331.083892] RSP: 0018:88007fd03f70 EFLAGS: 00010206
[  331.085469] RAX: 88007c838000 RBX: 0200 RCX: 0002
[  331.087449] RDX: 813097bc RSI: 0001 RDI: 88007fd14cf8
[  331.089271] RBP: 88007fd03f90 R08: R09:
[  331.091093] R10: R11: 81309760 R12: [ 3
Re: [dm-devel] dm integrity tests crash kernel (4.12-rc5)
same log again, hopefully prettier format. Sorry:
[PATCH] sd: add support for TCG OPAL self encrypting disks
Just wire up the generic TCG OPAL infrastructure to the SCSI disk
driver and the Security In/Out commands. Note that I don't know of any
actual SCSI disks that do support TCG OPAL, but this is required to
support ATA disks through libata.

Signed-off-by: Christoph Hellwig
---
 drivers/ata/libata-scsi.c  |  3 +++
 drivers/scsi/sd.c          | 53 +-
 drivers/scsi/sd.h          |  2 ++
 include/scsi/scsi_device.h |  1 +
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0f788ad6f2f6..3e5ca2e894a4 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1321,6 +1321,9 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
 		blk_queue_flush_queueable(q, false);
 
+	if (dev->flags & ATA_DFLAG_TRUSTED)
+		sdev->security_supported = 1;
+
 	dev->sdev = sdev;
 	return 0;
 }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f9d1432d7cc5..5d32fd7d3a3e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -50,6 +50,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -643,6 +644,26 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
 	mutex_unlock(&sd_ref_mutex);
 }
 
+#ifdef CONFIG_BLK_SED_OPAL
+static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
+		size_t len, bool send)
+{
+	struct scsi_device *sdev = data;
+	u8 cdb[12] = { 0, };
+	int ret;
+
+	cdb[0] = send ? SECURITY_PROTOCOL_OUT : SECURITY_PROTOCOL_IN;
+	cdb[1] = secp;
+	put_unaligned_be16(spsp, &cdb[2]);
+	put_unaligned_be32(len, &cdb[6]);
+
+	ret = scsi_execute_req(sdev, cdb,
+			send ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
+			buffer, len, NULL, SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+	return ret <= 0 ? ret : -EIO;
+}
+#endif /* CONFIG_BLK_SED_OPAL */
+
 static unsigned char sd_setup_protect_cmnd(struct scsi_cmnd *scmd,
 					   unsigned int dix, unsigned int dif)
 {
@@ -1439,6 +1460,9 @@ static int sd_ioctl(struct block_device *bdev, fmode_t mode,
 	if (error)
 		goto out;
 
+	if (is_sed_ioctl(cmd))
+		return sed_ioctl(sdkp->opal_dev, cmd, p);
+
 	/*
 	 * Send SCSI addressing ioctls directly to mid level, send other
 	 * ioctls to block level and then onto mid level if they can't be
@@ -2994,6 +3018,20 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 		sdkp->ws10 = 1;
 }
 
+static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+	struct scsi_device *sdev = sdkp->device;
+
+	if (!sdev->security_supported)
+		return;
+
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
+			SECURITY_PROTOCOL_IN) == 1 &&
+	    scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
+			SECURITY_PROTOCOL_OUT) == 1)
+		sdkp->security = 1;
+}
+
 /**
  *	sd_revalidate_disk - called the first time a new disk is seen,
  *	performs disk spin up, read_capacity, etc.
@@ -3047,6 +3085,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_read_cache_type(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
+		sd_read_security(sdkp, buffer);
 	}
 
 	sdkp->first_scan = 0;
@@ -3207,6 +3246,12 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 
 	sd_revalidate_disk(gd);
 
+	if (sdkp->security) {
+		sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);
+		if (sdkp->opal_dev)
+			sd_printk(KERN_NOTICE, sdkp, "supports TCG Opal\n");
+	}
+
 	sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
 		  sdp->removable ? "removable " : "");
 
 	scsi_autopm_put_device(sdp);
@@ -3356,6 +3401,8 @@ static int sd_remove(struct device *dev)
 
 	sd_zbc_remove(sdkp);
 
+	free_opal_dev(sdkp->opal_dev);
+
 	blk_register_region(devt, SD_MINORS, NULL,
 			    sd_default_probe, NULL, NULL);
 
@@ -3497,6 +3544,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
+	int ret;
 
 	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
 		return 0;
@@ -3505,7 +3553,10 @@ static int sd_resume(struct device *dev)
 		return 0;
 
 	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-	return sd_start_stop_device(sdkp, 1);
+	ret = sd_start_stop_device(sdkp, 1);
+	if (!ret)
+		opal_unlock_from_suspend(sdkp->opal_dev);
+	return ret;
}
 
 /**
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 61d02e
TCG Opal support for sd.c
Hi all,

this patch adds TCG Opal support to the SCSI disk driver. As far as I
know only SATA disks actually support OPAL, and as Martin fears
RSOC-related regressions, the support is conditional on a flag in
struct scsi_device, which so far only libata sets. Because of that we
should merge the patch through the libata tree with ACKs from the SCSI
side.

Changes from the previous version:
 - add the new security_supported flag
 - add a call to opal_unlock_from_suspend
Re: [PATCH rfc 01/30] nvme: Add admin connect request queue
On Mon, Jun 19, 2017 at 10:49:15AM +0300, Sagi Grimberg wrote:
> However, you raise a valid point, I think I added this before we
> had the queue_is_ready protection, which will reject the command
> if the queue is not LIVE (unless it's a connect). I think the reason
> it's still in is that I tested this with loop which doesn't have
> a per-queue state machine.

Yeah.

> I'm still wondering if it's a good idea to rely on the transport
> queue state to reject non-connect requests on non-LIVE queues.
> If/when we introduce a queue representation to the core and we
> drive the state machine there, then we could actually rely on it
> (I do have some code for it, but it's a pretty massive change which
> cannot be added in an incremental fashion).

I suspect moving the state machine to the core is a good idea. Note
that the current nvme_rdma_queue_is_ready hack actually seems a bit
too simple - even after the connect we should only allow get/set
Property. Never mind the additional complications if/when
authentication is implemented.
Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset
> We can do that, but this tries to eliminate duplicate code as
> much as possible. It's not like the convention is unprecedented...

It's fairly nasty to follow. OTOH I like your overall cleanup, so I
guess I shouldn't complain about the initial patches too much but just
possibly do another pass after you are done..
Re: [PATCH rfc 04/30] nvme-rdma: introduce configure/destroy io queues
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 05/30] nvme-rdma: introduce nvme_rdma_start_queue
On Sun, Jun 18, 2017 at 06:21:39PM +0300, Sagi Grimberg wrote:
> This should pair with nvme_rdma_stop_queue. While this
> is not a complete 1x1 reverse, it still pairs up pretty
> well because in fabrics we don't have a disconnect capsule
> but we simply teardown the transport association.

Looks like we are going to get one sooner or later. Not that I'd like
to send them if we can avoid it. But the patch itself looks fine to me.
Re: [PATCH rfc 06/30] nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 07/30] nvme-rdma: make stop/free queue receive a ctrl and qid struct
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 08/30] nvme-rdma: cleanup error path in controller reset
On Sun, Jun 18, 2017 at 06:21:42PM +0300, Sagi Grimberg wrote:
> No need to queue an extra work to indirect controller
> uninit and put the final reference.

Maybe my memory is a little vague, but didn't we need the work_struct
for something? At least it would serialize all the removals, for
example.
Re: [PATCH rfc 09/30] nvme: Move queue_count to the nvme_ctrl
Looks fine. I'd be happy to take this as an early cleanup. Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 10/30] nvme: Add admin_tagset pointer to nvme_ctrl
On Sun, Jun 18, 2017 at 06:21:44PM +0300, Sagi Grimberg wrote:
> Will be used when we centralize control flows. only
> rdma for now.

Should we at some point move the tag_sets themselves to the generic
ctrl instead of just pointers?
Re: [PATCH rfc 11/30] nvme: move controller cap to struct nvme_ctrl
On Sun, Jun 18, 2017 at 06:21:45PM +0300, Sagi Grimberg wrote:
> Will be used in centralized code later. only rdma
> for now.

It would be great to initialize it early on for all transports, and
then just use the stored field instead of re-reading CAP in various
places.
Re: [PATCH rfc 12/30] nvme-rdma: disable controller in reset instead of shutdown
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 13/30] nvme-rdma: move queue LIVE/DELETING flags settings to queue routines
Looks good, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 14/30] nvme-rdma: stop queues instead of simply flipping their state
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 15/30] nvme-rdma: don't check queue state for shutdown/disable
Looks fine, Reviewed-by: Christoph Hellwig
Re: [PATCH rfc 16/30] nvme-rdma: move tagset allocation to a dedicated routine
Looks fine, but how about doing this early in the series? There's quite a bit of churn around this code.
Re: [PATCH rfc 17/30] nvme-rdma: move admin specific resources to alloc_queue
On Sun, Jun 18, 2017 at 06:21:51PM +0300, Sagi Grimberg wrote:
> We're trying to make admin queue configuration generic, so
> move the rdma specifics to the queue allocation (based on
> the queue index passed).

Needs at least a comment, and probably factoring into a little helper.
And once we have that helper it sounds like this might be a good
callout from the core?
Re: [PATCH rfc 20/30] nvme: add err, reconnect and delete work items to nvme core
On Sun, Jun 18, 2017 at 06:21:54PM +0300, Sagi Grimberg wrote:
> We intend for these handlers to become generic, thus, add them to
> the nvme core controller struct.

Do you remember why we actually need all the different work items? We
need err_work to recover from RDMA QP-level errors. But how is it so
different from a reset in that respect? Similarly, why do we need reset
to be different from reconnect? Especially as reconnect sort of is the
reset of fabrics.
Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration
On Mon, Jun 19, 2017 at 11:03:36AM +0300, Sagi Grimberg wrote:
>>> The subject sounds odd and it could use a changelog. But I'd love to
>>> pick this change up ASAP as it's the right thing to do..
>
> How? Where would you place it? There is no nvme_configure_admin_queue
> in nvme-core.

Doh. Yeah, let's keep it where it is.
Re: [PATCH rfc 25/30] nvme: move control plane handling to nvme core
> +static void nvme_free_io_queues(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 1; i < ctrl->queue_count; i++)
> +		ctrl->ops->free_hw_queue(ctrl, i);
> +}
> +
> +void nvme_stop_io_queues(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 1; i < ctrl->queue_count; i++)
> +		ctrl->ops->stop_hw_queue(ctrl, i);
> +}
> +EXPORT_SYMBOL_GPL(nvme_stop_io_queues);

At least for PCIe this is going to work very differently, so I'm not
sure this part makes so much sense in the core. Maybe in fabrics? Or at
least make the callouts operate on all I/O queues, which would suit
PCIe a lot more.

> +	error = ctrl->ops->start_hw_queue(ctrl, 0);
> +	if (error)
> +		goto out_cleanup_connect_queue;
> +
> +	error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
> +	if (error) {
> +		dev_err(ctrl->device,
> +			"prop_get NVME_REG_CAP failed\n");
> +		goto out_cleanup_connect_queue;
> +	}
> +
> +	ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
> +
> +	error = nvme_enable_ctrl(ctrl, ctrl->cap);
> +	if (error)
> +		goto out_cleanup_connect_queue;

I'm not sure this ordering is going to work for PCIe..
Re: [PATCH 04/10] blk-mq-sched: unify request finished methods
> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig ha > scritto: > > No need to have two different callouts of bfq vs kyber. > > Signed-off-by: Christoph Hellwig > --- > block/bfq-iosched.c | 6 +++--- > block/blk-mq.c | 11 --- > block/kyber-iosched.c| 8 +++- > include/linux/elevator.h | 3 +-- > 4 files changed, 11 insertions(+), 17 deletions(-) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index ed93da2462ab..4f69e39c2f89 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4290,7 +4290,7 @@ static void bfq_put_rq_priv_body(struct bfq_queue *bfqq) > bfq_put_queue(bfqq); > } > > -static void bfq_put_rq_private(struct request_queue *q, struct request *rq) > +static void bfq_finish_request(struct request *rq) > { > struct bfq_queue *bfqq = RQ_BFQQ(rq); > struct bfq_data *bfqd = bfqq->bfqd; > @@ -4324,7 +4324,7 @@ static void bfq_put_rq_private(struct request_queue *q, > struct request *rq) >*/ > > if (!RB_EMPTY_NODE(&rq->rb_node)) > - bfq_remove_request(q, rq); > + bfq_remove_request(rq->q, rq); > bfq_put_rq_priv_body(bfqq); > } > > @@ -4951,7 +4951,7 @@ static struct elv_fs_entry bfq_attrs[] = { > static struct elevator_type iosched_bfq_mq = { > .ops.mq = { > .get_rq_priv= bfq_get_rq_private, > - .put_rq_priv= bfq_put_rq_private, > + .finish_request = bfq_finish_request, > .exit_icq = bfq_exit_icq, > .insert_requests= bfq_insert_requests, > .dispatch_request = bfq_dispatch_request, > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 1a45c287db64..9df7e0394a48 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -437,19 +437,16 @@ void blk_mq_free_request(struct request *rq) > struct request_queue *q = rq->q; > struct elevator_queue *e = q->elevator; > > - if (rq->rq_flags & RQF_ELVPRIV) { > - if (e && e->type->ops.mq.put_rq_priv) > - e->type->ops.mq.put_rq_priv(q, rq); > + if (rq->rq_flags & (RQF_ELVPRIV | RQF_QUEUED)) { > + if (e && e->type->ops.mq.finish_request) > + e->type->ops.mq.finish_request(rq); > if 
(rq->elv.icq) { > put_io_context(rq->elv.icq->ioc); > rq->elv.icq = NULL; > } > } > > - if ((rq->rq_flags & RQF_QUEUED) && e && e->type->ops.mq.put_request) > - e->type->ops.mq.put_request(rq); > - else > - blk_mq_finish_request(rq); > + blk_mq_finish_request(rq); > } My only concern is that this last change is, and will remain, functionally equivalent to the previous version of the code (choice between put_request and finish_request). In fact, bfq doesn't use put_request at all, and, in former bfq_put_rq_private assumes that rq->elv.priv is still properly set. If this change enjoys, as it seems to me, this invariance, then, for the part related to bfq, Acked-by: Paolo Valente > EXPORT_SYMBOL_GPL(blk_mq_free_request); > > diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c > index b9faabc75fdb..2557b399f0a8 100644 > --- a/block/kyber-iosched.c > +++ b/block/kyber-iosched.c > @@ -446,13 +446,11 @@ static struct request *kyber_get_request(struct > request_queue *q, > return rq; > } > > -static void kyber_put_request(struct request *rq) > +static void kyber_finish_request(struct request *rq) > { > - struct request_queue *q = rq->q; > - struct kyber_queue_data *kqd = q->elevator->elevator_data; > + struct kyber_queue_data *kqd = rq->q->elevator->elevator_data; > > rq_clear_domain_token(kqd, rq); > - blk_mq_finish_request(rq); > } > > static void kyber_completed_request(struct request *rq) > @@ -816,7 +814,7 @@ static struct elevator_type kyber_sched = { > .init_hctx = kyber_init_hctx, > .exit_hctx = kyber_exit_hctx, > .get_request = kyber_get_request, > - .put_request = kyber_put_request, > + .finish_request = kyber_finish_request, > .completed_request = kyber_completed_request, > .dispatch_request = kyber_dispatch_request, > .has_work = kyber_has_work, > diff --git a/include/linux/elevator.h b/include/linux/elevator.h > index 0e306c5a86d6..4acea351d43f 100644 > --- a/include/linux/elevator.h > +++ b/include/linux/elevator.h > @@ -105,7 +105,7 @@ struct 
elevator_mq_ops { > void (*request_merged)(struct request_queue *, struct request *, enum > elv_merge); > void (*requests_merged)(struct request_queue *, struct request *, > struct request *); > struct request *(*get_request)(struct request_queue *, unsigned int, > struct blk_mq_alloc_data *); > - void (*put_request)(struct request *); > + void (*finish_request)(struct request *);
Re: [PATCH 07/10] bfq-iosched: fix NULL ioc check in bfq_get_rq_private
> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig ha > scritto: > > icq_to_bic is a container_of operation, so we need to check for NULL > before it. Also move the check outside the spinlock while we're at > it. > > Signed-off-by: Christoph Hellwig > --- > block/bfq-iosched.c | 15 +-- > 1 file changed, 5 insertions(+), 10 deletions(-) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index 4f69e39c2f89..f037b005faa1 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4398,16 +4398,17 @@ static int bfq_get_rq_private(struct request_queue > *q, struct request *rq, > struct bio *bio) > { > struct bfq_data *bfqd = q->elevator->elevator_data; > - struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq); > + struct bfq_io_cq *bic; > const int is_sync = rq_is_sync(rq); > struct bfq_queue *bfqq; > bool new_queue = false; > bool split = false; > > - spin_lock_irq(&bfqd->lock); Thanks for taking a step I feared here (basically for cowardice) ... > + if (!rq->elv.icq) > + return 1; > + bic = icq_to_bic(rq->elv.icq); > > - if (!bic) > - goto queue_fail; > + spin_lock_irq(&bfqd->lock); > > bfq_check_ioprio_change(bic, bio); > > @@ -4465,13 +4466,7 @@ static int bfq_get_rq_private(struct request_queue *q, > struct request *rq, > bfq_handle_burst(bfqd, bfqq); > > spin_unlock_irq(&bfqd->lock); > - > return 0; > - > -queue_fail: > - spin_unlock_irq(&bfqd->lock); > - > - return 1; > } > > static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq) > -- > 2.11.0 > Acked-by: Paolo Valente
Re: [PATCH 09/10] blk-mq-sched: unify request prepare methods
> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig ha > scritto: > > This patch makes sure we always allocate requests in the core blk-mq > code and use a common prepare_request method to initialize them for > both mq I/O schedulers. For Kyber an additional limit_depth method > is added that is called before allocating the request. > > Also, because none of the initializations can really fail, the new method > does not return an error - instead the bfq finish method is hardened > to deal with the no-IOC case. > > Last but not least this removes the abuse of RQF_QUEUED by the blk-mq > scheduling code as RQF_ELVPRIV is all that is needed now. > > Signed-off-by: Christoph Hellwig > --- > block/bfq-iosched.c | 19 --- > block/blk-mq.c | 22 ++ > block/kyber-iosched.c| 23 +++ > include/linux/elevator.h | 4 ++-- > 4 files changed, 31 insertions(+), 37 deletions(-) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index f037b005faa1..60d32700f104 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -4292,8 +4292,14 @@ static void bfq_put_rq_priv_body(struct bfq_queue > *bfqq) > > static void bfq_finish_request(struct request *rq) > { > - struct bfq_queue *bfqq = RQ_BFQQ(rq); > - struct bfq_data *bfqd = bfqq->bfqd; > + struct bfq_queue *bfqq; > + struct bfq_data *bfqd; > + > + if (!rq->elv.icq) > + return; > + If this is a rq dispatched from a bfqq (or even a request still in the scheduler), then just exiting here will break bfq state seriously. However, I guess that this case can never occur. If this is a rq dispatched from the bfqd->dispatch list, then exiting here should only unbalance the bfqd->rq_in_driver counter. I guess that in both cases there wouldn't be problems for the missing reset of the content of rq->elv.priv (if the function terminates here). If I can be of further help in some way, I'm willing to help. 
Thanks, Paolo > + bfqq = RQ_BFQQ(rq); > + bfqd = bfqq->bfqd; > > if (rq->rq_flags & RQF_STARTED) > bfqg_stats_update_completion(bfqq_group(bfqq), > @@ -4394,9 +4400,9 @@ static struct bfq_queue > *bfq_get_bfqq_handle_split(struct bfq_data *bfqd, > /* > * Allocate bfq data structures associated with this request. > */ > -static int bfq_get_rq_private(struct request_queue *q, struct request *rq, > - struct bio *bio) > +static void bfq_prepare_request(struct request *rq, struct bio *bio) > { > + struct request_queue *q = rq->q; > struct bfq_data *bfqd = q->elevator->elevator_data; > struct bfq_io_cq *bic; > const int is_sync = rq_is_sync(rq); > @@ -4405,7 +4411,7 @@ static int bfq_get_rq_private(struct request_queue *q, > struct request *rq, > bool split = false; > > if (!rq->elv.icq) > - return 1; > + return; > bic = icq_to_bic(rq->elv.icq); > > spin_lock_irq(&bfqd->lock); > @@ -4466,7 +4472,6 @@ static int bfq_get_rq_private(struct request_queue *q, > struct request *rq, > bfq_handle_burst(bfqd, bfqq); > > spin_unlock_irq(&bfqd->lock); > - return 0; > } > > static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq) > @@ -4945,7 +4950,7 @@ static struct elv_fs_entry bfq_attrs[] = { > > static struct elevator_type iosched_bfq_mq = { > .ops.mq = { > - .get_rq_priv= bfq_get_rq_private, > + .prepare_request= bfq_prepare_request, > .finish_request = bfq_finish_request, > .exit_icq = bfq_exit_icq, > .insert_requests= bfq_insert_requests, > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 2f380ab7a603..81d05c19d4b3 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -298,16 +298,11 @@ static struct request *blk_mq_get_request(struct > request_queue *q, >* Flush requests are special and go directly to the >* dispatch list. 
>*/ > - if (!op_is_flush(op) && e->type->ops.mq.get_request) { > - rq = e->type->ops.mq.get_request(q, op, data); > - if (rq) > - rq->rq_flags |= RQF_QUEUED; > - goto allocated; > - } > + if (!op_is_flush(op) && e->type->ops.mq.limit_depth) > + e->type->ops.mq.limit_depth(op, data); > } > > rq = __blk_mq_alloc_request(data, op); > -allocated: > if (!rq) { > blk_queue_exit(q); > return NULL; > @@ -315,17 +310,12 @@ static struct request *blk_mq_get_request(struct > request_queue *q, > > if (!op_is_flush(op)) { > rq->elv.icq = NULL; > - if (e && e->type->ops.mq.get_rq_priv) { > + if (e && e->type->ops.mq.prepare_request) { > if (e->type->icq_cache &
Re: [PATCH rfc 10/30] nvme: Add admin_tagset pointer to nvme_ctrl
Will be used when we centralize control flows. Only rdma for now. Should we at some point move the tag_sets themselves to the generic ctrl instead of just pointers? We can easily do that, but the tagsets are heavily read in the hot path, so I was careful not to completely move them to nvme_ctrl, which is not arranged for that at all (and transports place it far back in their struct). Once we actually get some of this merged we should look into arranging the transport controllers to be:

struct transport_ctrl {
	/* transport specific accessed in the hot path */
	...
	struct nvme_ctrl ctrl; /* hot members first */
	/* transport specific bookkeeping */
	...
};
Re: [PATCH rfc 20/30] nvme: add err, reconnect and delete work items to nvme core
We intend for these handlers to become generic, thus we add them to the nvme core controller struct. Do you remember why we actually need all the different work items? I remember documenting it at some point, but either it got lost somewhere or I don't remember... We need err_work to recover from RDMA QP-level errors. Transport errors are usually detected in soft-irq context, so we queue up err_work to:

1. stop + drain queues
2. fail inflight I/O
3. queue delayed reconnect (reconnect_work)

But how is it so different from a reset in that respect? Similarly, why do we need reset to be different from reconnect? Especially as reconnect sort of is the reset of fabrics. Hmm, resets and reconnects are indeed similar, but one difference is that in resets we do not fast-fail inflight I/O, as the expectation is that the recovery should be immediate (it also matches pci in that respect), while we consider reconnect something that can last for a while, so we fail fast to allow failover and continue reconnect attempts quietly. Another difference is that a failed reset results in controller removal, while in reconnects we have to exhaust ctrl_loss_tmo. We could change things, like merging reconnect and reset and introducing a concept of "on-host reset". Not sure it will be any less confusing though...
Re: dm integrity tests crash kernel (4.12-rc5)
On Mon, Jun 19 2017 at 8:11am -0400, Ondrej Kozina wrote: > same log again, hopefully prettier format. Sorry: > > [ 330.980914] DEBUG_LOCKS_WARN_ON(current->hardirq_context) > [ 330.980923] [ cut here ] > [ 330.982627] WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 > trace_hardirqs_on_caller+0x107/0x180 ... > [ 331.011968] Call Trace: > [ 331.012806] > [ 331.013684] trace_hardirqs_on+0xd/0x10 > [ 331.014732] _raw_spin_unlock_irq+0x27/0x30 > [ 331.015840] submit_flush_bio+0x4e/0x80 [dm_integrity] > [ 331.017157] do_endio_flush+0x41/0x70 [dm_integrity] > [ 331.018476] dec_in_flight+0x59/0x110 [dm_integrity] > [ 331.019767] integrity_end_io+0x5e/0x70 [dm_integrity] > [ 331.020965] bio_endio+0x7c/0x1a0 > [ 331.021917] blk_update_request+0x9f/0x3d0 > [ 331.023050] blk_mq_end_request+0x15/0x60 > [ 331.024224] lo_complete_rq+0x2b/0x80 > [ 331.025406] __blk_mq_complete_request_remote+0xe/0x10 > [ 331.026813] flush_smp_call_function_queue+0x4f/0x110 > [ 331.028173] generic_smp_call_function_single_interrupt+0xe/0x20 > [ 331.029616] smp_call_function_single_interrupt+0x22/0x30 > [ 331.031125] call_function_single_interrupt+0x90/0xa0 > [ 331.032819] RIP: 0010:default_idle+0x1b/0x180 Looks like submit_flush_bio() is disabling/enabling interrupts from interrupt context. Ondrej, does this patch fix the issue? 
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 4ab10cf..93b1810 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1105,10 +1105,13 @@ static void schedule_autocommit(struct dm_integrity_c *ic)
 
 static void submit_flush_bio(struct dm_integrity_c *ic, struct dm_integrity_io *dio)
 {
 	struct bio *bio;
-	spin_lock_irq(&ic->endio_wait.lock);
+	unsigned long flags;
+
+	spin_lock_irqsave(&ic->endio_wait.lock, flags);
 	bio = dm_bio_from_per_bio_data(dio, sizeof(struct dm_integrity_io));
 	bio_list_add(&ic->flush_bio_list, bio);
-	spin_unlock_irq(&ic->endio_wait.lock);
+	spin_unlock_irqrestore(&ic->endio_wait.lock, flags);
+
 	queue_work(ic->commit_wq, &ic->commit_work);
 }
Re: [PATCH 11/11] nvme: add support for streams and directives
On 06/19/2017 12:25 AM, Christoph Hellwig wrote: > On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote: >> I have two samples here, and I just tested, and both of them want it >> assigned with nsid=0x or they will fail the writes... So I'd say >> we're better off ensuring we do allocate them globally. > > That's clearly against the spec. I'd say go to your vendor and get > a refund, as we Linux folks (Martin and I) fought for this so that > we would not have to do the explicit allocations. > > Another quote from the spec: > > "Streams are opened by the controller when the host issues a write > command that specifies a stream identifier that is not currently open. > While a stream is open the controller maintains context for that stream > (e.g., buffers for associated data). The host may determine the streams > that are open using the Get Status operation." > > And I think this is very important - otherwise you need to either > allocate the streams beforehand as your earlier patches (and we take > away the resources from the 99% of the users not using write life > hints), or we need the lazy allocation scheme. And for that to be > efficient it probably needs to be lazy per-stream allocation. That's > why we got the implicit open in after all. These are just samples, so no refund possible! As you might remember, I was pretty adamant on not wanting explicit open/close as well, back in those days. I'll check with the vendor and see what's going on. -- Jens Axboe
Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
On Mon, Jun 19, 2017 at 3:26 PM, Christoph Hellwig wrote: > pktcdvd is a make_request based stacking driver and thus doesn't have any > addressing limits on its own. It also doesn't use bio_data() or > page_address(), so it doesn't need a lowmem bounce either. > > Signed-off-by: Christoph Hellwig > --- > drivers/block/pktcdvd.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c > index 26c04baae967..7f2eca31eb6a 100644 > --- a/drivers/block/pktcdvd.c > +++ b/drivers/block/pktcdvd.c > @@ -2413,8 +2413,6 @@ static blk_qc_t pkt_make_request(struct request_queue > *q, struct bio *bio) > char b[BDEVNAME_SIZE]; > struct bio *split; > > - blk_queue_bounce(q, &bio); > - > blk_queue_split(q, &bio); > > pd = q->queuedata; > -- > 2.11.0 > blk_queue_make_request() has set bounce for any highmem page for a long time, and in theory this patch might cause a regression on 32-bit arches when the controller can't really handle highmem pages (especially in case of PAE), so we may need to be careful about this change, especially on some old hardware. thanks, Ming Lei
Re: [PATCH 11/11] nvme: add support for streams and directives
On 06/19/2017 08:31 AM, Jens Axboe wrote: > On 06/19/2017 12:25 AM, Christoph Hellwig wrote: >> On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote: >>> I have two samples here, and I just tested, and both of them want it >>> assigned with nsid=0x or they will fail the writes... So I'd say >>> we're better off ensuring we do allocate them globally. >> >> That's clearly against the spec. I'd say go to your vendor and get >> a refund, as we Linux folks (Martin and I) fought for this so that >> we would not have to do the explicit allocations. >> >> Another quote for from the spec: >> >> "Streams are opened by the controller when the host issues a write >> command that specifies a stream identifier that is not currently open. >> While a stream is open the controller maintains context for that stream >> (e.g., buffers for associated data). The host may determine the streams >> that are open using the Get Status operation." >> >> And I think this is very important - otherwise you need to either >> allocate the stremas beforehand as your earlier patches (and we take >> away the resources from the 99% of the users not using write life >> hints), or we need the lazy allocation scheme. And for that to be >> efficient it probably needs to be lazy per-stream allocation. That's >> why we got the implicit open in after all. > > These are just samples, so no refund possible! As you might remember, > I was pretty adamant on not wanting explicit open/close as well, back > in those days. I'll check with the vendor and see what's going on. Looking at it a bit more closely - there's a difference between assigning X number of streams (allocating) for use by the subsystem or per-ns, and having to manually open them. So I don't necessarily think there's a problem here, neither for us or on the device. 
If I load the driver as it currently stands, and get streams params after load:

Number of Open Streams in NVM subsystem (NOSNS): 0
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 0

4 streams allocated, 0 currently open. Then I generate a single write to some stream ID:

Number of Open Streams in NVM subsystem (NOSNS): 1
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 1

and we now see a single stream actually open, implicitly, by the device. Fire off another single stream write, this time to another stream, and we get:

Number of Open Streams in NVM subsystem (NOSNS): 2
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace (NOSN): 2

and stream status correctly reflects that I did a WRITE_HINT_SHORT and WRITE_HINT_LONG write:

Open Stream Count : 2
Stream Identifier 01 : 1
Stream Identifier 02 : 3

Awaiting clarification from the vendor on what their position/view of this is. Reading the spec, I do agree that it leans towards only needing allocation for a specific namespace, but it doesn't explicitly say that you can use any of the available streams, without allocation, if they haven't been assigned to a specific namespace. I would interpret it that way, though. -- Jens Axboe
Re: [PATCH 01/11] fs: add support for an inode to carry write hint related data
On 06/19/2017 12:26 AM, Christoph Hellwig wrote: >> +/* >> + * Write life time hint values. >> + */ >> +enum rw_hint { >> +WRITE_LIFE_NONE = 0, >> +WRITE_LIFE_SHORT, >> +WRITE_LIFE_MEDIUM, >> +WRITE_LIFE_LONG, >> +WRITE_LIFE_EXTREME, >> +}; >> + >> +#define RW_HINT_MASK0x7 /* 3 bits */ > > FYI, exposing enums in a uapi is always a bit problematic, due to > different ABI rules. It might be better to make these explicit defines > at least for the uapi. > > Btw, I think it might make sense to merge this with patch 5. OK, I'll change the uapi to be defines. -- Jens Axboe
Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
On 06/19/2017 12:27 AM, Christoph Hellwig wrote: > On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote: >> Add four flags for the pwritev2(2) system call, allowing an application >> to give the kernel a hint about what on-media life times can be >> expected from a given write. >> >> The intent is for these values to be relative to each other, no >> absolute meaning should be attached to these flag names. >> >> Set aside 3 bits in the iocb flags structure to carry this information >> over from the pwritev2 RWF_WRITE_LIFE_* flags. > > What is the strong use case for the per-I/O flags? I'd much rather > stick to fcntl only for now if we can. Fine, I guess I should just have dusted off the 2 year old patchset, as that was _exactly_ what that did. -- Jens Axboe
Re: [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints
On 06/19/2017 12:28 AM, Christoph Hellwig wrote: > On Sat, Jun 17, 2017 at 01:59:48PM -0600, Jens Axboe wrote: >> We have a pwritev2(2) interface based on passing in flags. Add an >> fcntl interface for querying these flags, and also for setting them >> as well: >> >> F_GET_RW_HINT Returns the read/write hint set. Right now it >> will be one of the WRITE_LIFE_* values. >> >> F_SET_RW_HINT Pass in rw_hint type to set the read/write hint. >> Only WRITE_LIFE_* hints are currently supported. >> Returns 0 on success, -1 otherwise. >> >> Sample program testing/implementing basic setting/getting of write >> hints is below. >> >> /* >> * writehint.c: check or set a file/inode write hint >> */ >> >> static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT", >> "WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG", >> "WRITE_LIFE_EXTREME" }; >> >> int main(int argc, char *argv[]) >> { >> int hint = -1, fd, ret; >> >> if (argc < 2) { >> fprintf(stderr, "%s: dev \n", argv[0]); >> return 1; >> } >> >> fd = open(argv[1], O_RDONLY); >> if (fd < 0) { >> perror("open"); >> return 2; >> } >> >> if (argc > 2) >> hint = atoi(argv[2]); >> >> if (hint == -1) { >> ret = fcntl(fd, F_GET_RW_HINT); >> if (ret < 0) { >> perror("fcntl: F_GET_RW_HINT"); >> return 3; >> } >> hint = ret; >> } else { >> ret = fcntl(fd, F_SET_RW_HINT, hint); >> if (ret < 0) { >> perror("fcntl: F_SET_RW_HINT"); >> return 4; >> } >> } >> >> printf("%s: %shint %s\n", argv[1], hint != -1 ? 
"set " : "", str[hint]); >> close(fd); >> return 0; >> } >> >> Signed-off-by: Jens Axboe >> --- >> fs/fcntl.c | 38 ++ >> include/uapi/linux/fcntl.h | 6 ++ >> 2 files changed, 44 insertions(+) >> >> diff --git a/fs/fcntl.c b/fs/fcntl.c >> index f4e7267d117f..417ce336c875 100644 >> --- a/fs/fcntl.c >> +++ b/fs/fcntl.c >> @@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned >> long arg) >> } >> #endif >> >> +long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg) > > The unsigned long arg is a little annoying because it will be a different > size for 32-bit vs 64-bit architectures. > > How about just doing a get_user and use a fixed-size 64-bit value? It's just passed in from fcntl(); my intent was for it to be a u32 on the uapi side. But we can make it a 64-bit type, since we're going away from the enum and making it a proper type. -- Jens Axboe
Re: [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints
On 06/19/2017 12:28 AM, Christoph Hellwig wrote: > On Sat, Jun 17, 2017 at 01:59:49PM -0600, Jens Axboe wrote: >> Reviewed-by: Andreas Dilger >> Signed-off-by: Jens Axboe >> --- >> fs/block_dev.c | 2 ++ >> fs/direct-io.c | 2 ++ >> fs/iomap.c | 1 + >> 3 files changed, 5 insertions(+) >> >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 51959936..b9222a9d285e 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct >> iov_iter *iter, >> should_dirty = true; >> } else { >> bio.bi_opf = dio_bio_write_op(iocb); >> +bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb)); > > Move this into dio_bio_write_op instead of duplicating it? Good cleanup, will do. -- Jens Axboe
Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote: > blk_queue_make_request() sets bounce for any highmem page for long time, > and in theory this patch might cause regression on 32bit arch, when > the controller can't handle higmem page really(especially in case of PAE), so > we may need to be careful about this change, especially on some old hardware. Which controller?
Re: [PATCH 11/11] nvme: add support for streams and directives
On 06/19/2017 12:35 AM, Christoph Hellwig wrote: > Can you add linux-nvme for the next repost? > > As said before I think we should rely on implicit streams allocation, > as that will make the whole patch a lot simpler, and it solves the issue > that your current patch will take away your 4 streams from the general > pool on every controller that supports streams, which isn't optimal. The only thing it'll change in the patch is the removal of nvme_streams_allocate(), which is 20 lines of code. So I don't think it'll simplify things a lot. The patch is already very simple. But if we don't need to allocate the streams, then of course it should just go. >> +streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK) >> +>> __REQ_WRITE_HINT_SHIFT; > > Can we add a helper to convert the REQ_* flags back to the hint next > to where we have the helper to do the reverse one? OK, will do. >> +if (streamid == WRITE_LIFE_NONE) >> +return; >> + >> +/* for now just round-robin, do something more clever later */ >> +if (streamid > ctrl->nr_streams) >> +streamid = (streamid % ctrl->nr_streams) + 1; > > I thought you were going to fix that to do more intelligent collapsing? Sure, if you want me to do that now, I can. I propose:

- With 4 streams, direct mapping.
- With 3 streams, collapse the two longest life times.
- With 2 streams, collapse short+medium, and long+extreme.
- With 1 stream, don't use streams.

We could potentially still use streams with just 1 stream, since we have the default of no stream as well. But I don't think it makes sense at that point. > >> -ns->queue->limits.discard_granularity = logical_block_size; >> +if (ctrl->nr_streams && ns->sws && ns->sgs) { >> +unsigned int sz = logical_block_size * ns->sws * ns->sgs; >> + >> +ns->queue->limits.discard_alignment = sz; > > I don't think this is correct: > > "Data that is aligned to and in multiples of the Stream Write Size (SWS) > provides optimal performance of the write commands to the controller. 
> The Stream Granularity Size indicates the size of the media that is prepared > as a unit for future allocation for write commands and is a multiple of the > Stream Write Size. The controller may allocate and group together a stream > in Stream Granularity Size (SGS) units. Refer to Figure 293." > > So I think the ns->sws should go away. The SGS value is in units of SWS, however. >> +ns->queue->limits.discard_granularity = sz; >> +} else { >> +u32 logical_block_size = queue_logical_block_size(ns->queue); > > I think we already have a logical_block_size with the same value in > scope here. Oops yes. >> +ns->sws = le32_to_cpu(s.sws); >> +ns->sgs = le16_to_cpu(s.sgs); >> + >> +if (ns->sws) { >> +unsigned int bs = 1 << ns->lba_shift; >> + >> +blk_queue_io_min(ns->queue, bs * ns->sws); >> +if (ns->sgs) >> +blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs); > > drop the multiplication with ns->sws here as well. See above. -- Jens Axboe
Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig wrote: > On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote: >> blk_queue_make_request() sets bounce for any highmem page for long time, >> and in theory this patch might cause regression on 32bit arch, when >> the controller can't handle higmem page really(especially in case of PAE), so >> we may need to be careful about this change, especially on some old hardware. > > Which controller? I don't know. IMO, we can't just remove it because it is a bio based driver; whether bounce is needed is not a matter of a driver being bio or request based, it depends on whether the controller can handle highmem pages. Actually the bounce for highmem has been here since v2.6.12, and it looks like we need to investigate why it was introduced in the first place before removing it. :-) Thanks, Ming Lei
Re: [dm-devel] dm integrity tests crash kernel (4.12-rc5)
On 06/19/2017 04:20 PM, Mike Snitzer wrote:
> Looks like submit_flush_bio() is disabling/enabling interrupts from
> interrupt context. Ondrej, does this patch fix the issue?

I let it spin for 30 minutes on patched dm-integrity and everything seems ok now. The moment I loaded back the old one it fell on its face almost immediately again. Thanks!

> diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
> index 4ab10cf..93b1810 100644
> --- a/drivers/md/dm-integrity.c
> +++ b/drivers/md/dm-integrity.c
> @@ -1105,10 +1105,13 @@ static void schedule_autocommit(struct dm_integrity_c *ic)
>  
>  static void submit_flush_bio(struct dm_integrity_c *ic, struct dm_integrity_io *dio)
>  {
>  	struct bio *bio;
> -	spin_lock_irq(&ic->endio_wait.lock);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ic->endio_wait.lock, flags);
>  	bio = dm_bio_from_per_bio_data(dio, sizeof(struct dm_integrity_io));
>  	bio_list_add(&ic->flush_bio_list, bio);
> -	spin_unlock_irq(&ic->endio_wait.lock);
> +	spin_unlock_irqrestore(&ic->endio_wait.lock, flags);
> +
>  	queue_work(ic->commit_wq, &ic->commit_work);
>  }

--
dm-devel mailing list
dm-de...@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
On Mon, Jun 19, 2017 at 11:13:46PM +0800, Ming Lei wrote: > On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig wrote: > > On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote: > >> blk_queue_make_request() sets bounce for any highmem page for long time, > >> and in theory this patch might cause regression on 32bit arch, when > >> the controller can't handle higmem page really(especially in case of PAE), > >> so > >> we may need to be careful about this change, especially on some old > >> hardware. > > > > Which controller? > > I don't know. > > IMO, we can't just remove it because it is a bio based driver, the > bounce is not related with > a bio or req based driver, and it depends if the controller can handle > highmem page. pktcdvd is a stacking driver and forwards all data through either generic_make_request or blk_rq_map_kern + blk_execute_rq, both of which call blk_queue_bounce internally.
Re: dm integrity tests crash kernel (4.12-rc5)
On Mon, Jun 19 2017 at 11:16am -0400, Ondrej Kozina wrote: > On 06/19/2017 04:20 PM, Mike Snitzer wrote: > > > >Looks like submit_flush_bio() is disabling/enabling interrupts from > >interrupt context. Ondrej, does this patch fix the issue? > > I let it spin for 30 minutes on patched dm-integrity and everything > seems ok now. The moment I loaded back the old one it fell on face > almost immediately again. OK, I'll get it staged as a fix for 4.12. Thanks, Mike
Re: [PATCH rfc 01/30] nvme: Add admin connect request queue
On 06/19/2017 09:49 AM, Sagi Grimberg wrote: > >>> In case we reconnect with inflight admin IO we >>> need to make sure that the connect comes before >>> the admin command. This can be only achieved by >>> using a seperate request queue for admin connects. >> >> Use up a few more lines of the available space for your lines? :) > > I warned in the cover-letter that the change logs are pure > negligence at the moment :) > >> Wouldn't a head insertation also solve the problem? > > the head insertion will not protect against it because > we must invoke blk_mq_start_stopped_hw_queues on the admin_q > request queue so the admin connect can make progress, at this > point (and before we actually queue up connect) pending admin > commands can sneak in... > > However, you raise a valid point, I think I added this before we > had the queue_is_ready protection, which will reject the command > if the queue is not LIVE (unless its a connect). I think the reason > its still in is that I tested this with loop which doesn't have > a per-queue state machine. > > I'm still wandering if its a good idea to rely on the transport > queue state to reject non-connect requests on non-LIVE queues. > if/when we introduce a queue representation to the core and we > drive the state machine there, then we could actually rely on it > (I do have some code for it, but its a pretty massive change which > cannot be added in an incremental fashion). > > Thoughts? I very much prefer this solution, i.e. rejecting all commands if the queue state is not 'LIVE', and making the 'connect' command the only one being accepted in that state. (Actually, we would need a state 'READY_TO_CONNECT' or somesuch to make this work properly.) Cheers, Hannes
--
Dr. Hannes Reinecke		Teamlead Storage & Networking
h...@suse.de			+49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce
On Mon, Jun 19, 2017 at 11:18 PM, Christoph Hellwig wrote:
> On Mon, Jun 19, 2017 at 11:13:46PM +0800, Ming Lei wrote:
>> On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig wrote:
>> > On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote:
>> >> blk_queue_make_request() has set bounce for any highmem page for a
>> >> long time, and in theory this patch might cause a regression on
>> >> 32-bit arches when the controller can't really handle highmem
>> >> pages (especially in case of PAE), so we may need to be careful
>> >> about this change, especially on some old hardware.
>> >
>> > Which controller?
>>
>> I don't know.
>>
>> IMO, we can't just remove it because it is a bio based driver; the
>> bounce is not tied to whether the driver is bio or request based, it
>> depends on whether the controller can handle highmem pages.
>
> pktcdvd is a stacking driver and forwards all data through either
> generic_make_request or blk_rq_map_kern + blk_execute_rq, both of
> which call blk_queue_bounce internally.

OK, thanks for explaining; then it looks like it wasn't needed from the
start.

Thanks,
Ming Lei
Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints
On 06/19/2017 08:56 AM, Jens Axboe wrote:
> On 06/19/2017 12:27 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
>>> Add four flags for the pwritev2(2) system call, allowing an
>>> application to give the kernel a hint about what on-media life
>>> times can be expected from a given write.
>>>
>>> The intent is for these values to be relative to each other; no
>>> absolute meaning should be attached to these flag names.
>>>
>>> Set aside 3 bits in the iocb flags structure to carry this
>>> information over from the pwritev2 RWF_WRITE_LIFE_* flags.
>>
>> What is the strong use case for the per-I/O flags? I'd much rather
>> stick to fcntl only for now if we can.
>
> Fine, I guess I should just have dusted off the 2 year old patchset,
> as that was _exactly_ what it did.

Actually, one good use case is O_DIRECT on a block device. Since I'm
not a huge fan of having per-call hints that are only useful for a
single case, how about we add the hints to the struct file as well? For
buffered IO, just grab it from the inode. If we have a file available,
then that overrides the per-inode setting.

--
Jens Axboe
Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues
> Signed-off-by: Sagi Grimberg

Could use a changelog.

Ming: does this solve your problem of not seeing the new queues after a
qemu CPU hotplug + reset?

The issue I observed is that there isn't an NVMe reset triggered after
the CPU becomes online, so this won't help with that.

It may take a while since the test needs the whole patchset applied. Or
is it fine to figure out a separate fix for this issue? Sagi?

No need, all this doesn't even touch pci for now...
Re: [PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile #1)
On Fri, 2017-06-16 at 15:34 -0400, Jeff Layton wrote: > v7: > === > This is the seventh posting of the patchset to revamp the way writeback > errors are tracked and reported. > > The main difference from the v6 posting is the removal of the > FS_WB_ERRSEQ flag. That requires a few other incremental patches in the > writeback code to ensure that both error tracking models are handled > in a suitable way. > > Also, a bit more cleanup of the metadata writeback codepaths, and some > documentation updates. > > Some of these patches have been posted separately, but I'm re-posting > them here to make it clear that they're prerequisites of the later > patches in the series. > > This series is based on top of linux-next from a day or so ago. I'd like > to have this picked up by linux-next in the near future so we can get > some more extensive testing with it. Should I just plan to maintain a > topic branch on top of -next and ask Stephen to pick it up? > > Background: > === > The basic problem is that we have (for a very long time) tracked and > reported writeback errors based on two flags in the address_space: > AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked, > so only the first caller to check them is able to consume them. > > That model is quite unreliable, for several related reasons: > > * only the first fsync caller on the inode will see the error. In a > world of containerized setups, that's no longer viable. Applications > need to know that their writes are safely stored, and they can > currently miss seeing errors that they should be aware of when > they're not. > > * there are a lot of internal callers to filemap_fdatawait* and > filemap_write_and_wait* that clear these errors but then never report > them to userland in any fashion. > > * Some internal callers report writeback errors, but can do so at > non-sensical times. For instance, we might want to truncate a file, > which triggers a pagecache flush. 
If that writeback fails, we might > report that error to the truncate caller, but a subsequent fsync > will likely not see it. > > * Some internal callers try to reset the error flags after clearing > them, but that's racy. Another task could check the flags between > those two events. > > Solution: > = > This patchset adds a new datatype called an errseq_t that represents a > sequence of errors. It's a u32, with a field for a POSIX-flavor error > and a counter, managed with atomics. We can sample that value at a > particular point in time, and can later tell whether there have been any > errors since that point. > > That allows us to provide traditional check-and-clear fsync semantics > on every open file description in a lightweight fashion. fsync callers > no longer need to coordinate between one another in order to ensure > that errors at fsync time are handled correctly. > > Strategy: > = > The aim with this pile is to do the minimum possible to support > reliable reporting of errors on fsync, without substantially changing > the internals of the filesystems themselves. > > Most of the internal calls to filemap_fdatawait are left alone, so all > of the internal error checkers are using the same error handling they > always have. The only real difference here is that we're better > reporting errors at fsync. > > I think that we probably will want to eventually convert all of those > internal callers to use errseq_t based reporting, but that can be done > in an incremental fashion in follow-on patchsets. > > Testing: > > I've primarily been testing this with some new xfstests that I will post > in a separate series. These tests use dm-error fault injection to make the underlying block device start throwing I/O errors, and then test how the filesystem layer reports errors after that. 
> > Jeff Layton (22): > fs: remove call_fsync helper function > buffer: use mapping_set_error instead of setting the flag > fs: check for writeback errors after syncing out buffers in > generic_file_fsync > buffer: set errors in mapping at the time that the error occurs > jbd2: don't clear and reset errors after waiting on writeback > mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails > mm: don't TestClearPageError in __filemap_fdatawait_range > mm: clean up error handling in write_one_page > fs: always sync metadata in __generic_file_fsync > lib: add errseq_t type and infrastructure for handling it > fs: new infrastructure for writeback error handling and reporting > mm: tracepoints for writeback error events > mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error > Documentation: flesh out the section in vfs.txt on storing and > reporting writeback errors > dax: set errors in mapping when writeback fails > block: convert to errseq_t based writeback error tracking > ext4: use errseq_t based error handling for reporting data writeback > errors > fs: add
Re: [PATCH rfc 25/30] nvme: move control plane handling to nvme core
+static void nvme_free_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		ctrl->ops->free_hw_queue(ctrl, i);
+}
+
+void nvme_stop_io_queues(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 1; i < ctrl->queue_count; i++)
+		ctrl->ops->stop_hw_queue(ctrl, i);
+}
+EXPORT_SYMBOL_GPL(nvme_stop_io_queues);

At least for PCIe this is going to work very differently, so I'm not
sure this part makes so much sense in the core. Maybe in Fabrics? Or at
least make the callouts operate on all I/O queues, which would suit
PCIe a lot more.

Yea, I spent some time thinking on the async nature of queue removal
for pci... I started from ->stop/free_io_queues callouts but hated the
fact that we need to iterate exactly the same way in every driver... We
could have an optional stop/free_io_queues that the core will call
instead if implemented?

+	error = ctrl->ops->start_hw_queue(ctrl, 0);
+	if (error)
+		goto out_cleanup_connect_queue;
+
+	error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
+	if (error) {
+		dev_err(ctrl->device,
+			"prop_get NVME_REG_CAP failed\n");
+		goto out_cleanup_connect_queue;
+	}
+
+	ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
+
+	error = nvme_enable_ctrl(ctrl, ctrl->cap);
+	if (error)
+		goto out_cleanup_connect_queue;

I'm not sure this ordering is going to work for PCIe..

This one is easy to reverse...
[PATCH 0/10 v13] merge request: No wait AIO
Jens,

As Christoph suggested, I am sending the patches against the block tree
for merge since the block layer changes had the most conflicts. My tree
is at https://github.com/goldwynr/linux/tree/nowait-block

This series adds a nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed for a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping while waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if any
of these conditions are met. This way userspace can push most of the
write()s to the kernel to the best of its ability to complete, and if
the kernel returns -EAGAIN, can defer the I/O to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT
for iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity in
the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I
have tested it against xfs, ext4, and btrfs, and I intend to add more
filesystems. The nowait feature is for request based devices. In the
future, I intend to add support to stacked devices such as md.

Applications will have to check supportability by sending an async
direct write; any error besides -EAGAIN would mean it is not supported.

First two patches are prep patches into nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved closer to (just before) calling
   filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes
   - included reflink
   - use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
   - Translate the flag through IOMAP_NOWAIT (iomap) to check for block
     allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

Changes since v3:
 + Added FS_NOWAIT, which is set if the filesystem supports the NOWAIT
   feature.
 + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
   for async direct writes only.
 + Added QUEUE_FLAG_NOWAIT, which is set if the device supports
   BIO_NOWAIT. This is added (rather not set) to block devices such as
   dm/md currently.

Changes since v4:
 + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
   generic_file_write_iter().
 + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

Changes since v5:
 + BIO_NOWAIT to REQ_NOWAIT
 + Common helper for RWF flags.

Changes since v6:
 + REQ_NOWAIT will be ignored for request based devices since they
   cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not required
   in the current implementation. It will be resurrected when we
   program for stacked devices.
 + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to
   accommodate for errors. Moved checks into the function.

Changes since v7:
 + split patches into prep so the main patches are smaller and easier
   to understand
 + All patches are reviewed or acked!

Changes since v8:
 + Err out AIO reads with -EINVAL flagged as RWF_NOWAIT

Changes since v9:
 + Retract - Err out AIO reads with -EINVAL flagged as RWF_NOWAIT
 + XFS returns EAGAIN if extent list is not in memory
 + Man page updates to io_submit with iocb description and nowait
   features.

Changes since v10:
 + Corrected comment and subject in "return on congested block device"

Changes since v11:
 + FMODE_AIO_NOWAIT to show AIO NOWAIT support. This is to block
   non-supported filesystems instead of returning ENOTSUPP in each
   individual filesystem

Changes since v12:
 + Updated with respect to linux-next. Primarily changes the patch in
   the block layer, which changes bio.bi_error to bio.bi_status.
 + Fix FMODE_AIO_NOWAIT spelling
 + Added RWF_SUPPORTED to encompass all flags
 + Added checks on the read side, namely generic_file_read_iter() and
   iomap_dio_rw()

--
Goldwyn
[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags
From: Goldwyn Rodrigues Also added RWF_SUPPORTED to encompass all flags. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- fs/read_write.c | 12 +++- include/linux/fs.h | 14 ++ include/uapi/linux/fs.h | 2 ++ 3 files changed, 19 insertions(+), 9 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 47c1d4484df9..53c816c61122 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, struct iov_iter *iter, struct kiocb kiocb; ssize_t ret; - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC)) - return -EOPNOTSUPP; - init_sync_kiocb(&kiocb, filp); - if (flags & RWF_HIPRI) - kiocb.ki_flags |= IOCB_HIPRI; - if (flags & RWF_DSYNC) - kiocb.ki_flags |= IOCB_DSYNC; - if (flags & RWF_SYNC) - kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC); + ret = kiocb_set_rw_flags(&kiocb, flags); + if (ret) + return ret; kiocb.ki_pos = *ppos; if (type == READ) diff --git a/include/linux/fs.h b/include/linux/fs.h index 023f0324762b..96a1a1fa54a9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3057,6 +3057,20 @@ static inline int iocb_flags(struct file *file) return res; } +static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags) +{ + if (unlikely(flags & ~RWF_SUPPORTED)) + return -EOPNOTSUPP; + + if (flags & RWF_HIPRI) + ki->ki_flags |= IOCB_HIPRI; + if (flags & RWF_DSYNC) + ki->ki_flags |= IOCB_DSYNC; + if (flags & RWF_SYNC) + ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC); + return 0; +} + static inline ino_t parent_ino(struct dentry *dentry) { ino_t res; diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 24e61a54feaa..937c3e39650a 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -361,4 +361,6 @@ struct fscrypt_key { #define RWF_DSYNC 0x0002 /* per-IO O_DSYNC */ #define RWF_SYNC 0x0004 /* per-IO O_SYNC */ +#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC) + #endif /* _UAPI_LINUX_FS_H */ -- 2.12.0
[PATCH 07/10] block: return on congested block device
From: Goldwyn Rodrigues A new bio operation flag REQ_NOWAIT is introduced to identify bio's orignating from iocb with IOCB_NOWAIT. This flag indicates to return immediately if a request cannot be made instead of retrying. Stacked devices such as md (the ones with make_request_fn hooks) currently are not supported because it may block for housekeeping. For example, an md can have a part of the device suspended. For this reason, only request based devices are supported. In the future, this feature will be expanded to stacked devices by teaching them how to handle the REQ_NOWAIT flags. Reviewed-by: Christoph Hellwig Reviewed-by: Jens Axboe Signed-off-by: Goldwyn Rodrigues --- block/blk-core.c | 22 -- block/blk-mq-sched.c | 3 +++ block/blk-mq.c| 2 ++ fs/direct-io.c| 10 -- include/linux/bio.h | 6 ++ include/linux/blk_types.h | 4 6 files changed, 43 insertions(+), 4 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 8592409db272..3999e6ac288d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -143,6 +143,7 @@ static const struct { [BLK_STS_MEDIUM]= { -ENODATA, "critical medium" }, [BLK_STS_PROTECTION]= { -EILSEQ,"protection" }, [BLK_STS_RESOURCE] = { -ENOMEM,"kernel resource" }, + [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" }, /* device mapper special case, should not leak out: */ [BLK_STS_DM_REQUEUE]= { -EREMCHG, "dm internal retry" }, @@ -1314,6 +1315,11 @@ static struct request *get_request(struct request_queue *q, unsigned int op, if (!IS_ERR(rq)) return rq; + if (op & REQ_NOWAIT) { + blk_put_rl(rl); + return ERR_PTR(-EAGAIN); + } + if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) { blk_put_rl(rl); return rq; @@ -1961,6 +1967,14 @@ generic_make_request_checks(struct bio *bio) goto end_io; } + /* +* For a REQ_NOWAIT based request, return -EOPNOTSUPP +* if queue is not a request based queue. 
+*/ + + if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q)) + goto not_supported; + part = bio->bi_bdev->bd_part; if (should_fail_request(part, bio->bi_iter.bi_size) || should_fail_request(&part_to_disk(part)->part0, @@ -2118,7 +2132,7 @@ blk_qc_t generic_make_request(struct bio *bio) do { struct request_queue *q = bdev_get_queue(bio->bi_bdev); - if (likely(blk_queue_enter(q, false) == 0)) { + if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) { struct bio_list lower, same; /* Create a fresh bio_list for all subordinate requests */ @@ -2143,7 +2157,11 @@ blk_qc_t generic_make_request(struct bio *bio) bio_list_merge(&bio_list_on_stack[0], &same); bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); } else { - bio_io_error(bio); + if (unlikely(!blk_queue_dying(q) && + (bio->bi_opf & REQ_NOWAIT))) + bio_wouldblock_error(bio); + else + bio_io_error(bio); } bio = bio_list_pop(&bio_list_on_stack[0]); } while (bio); diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index c4e2afb9d12d..ab14295d8468 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -83,6 +83,9 @@ struct request *blk_mq_sched_get_request(struct request_queue *q, if (likely(!data->hctx)) data->hctx = blk_mq_map_queue(q, data->ctx->cpu); + if (op & REQ_NOWAIT) + data->flags |= BLK_MQ_REQ_NOWAIT; + if (e) { data->flags |= BLK_MQ_REQ_INTERNAL; diff --git a/block/blk-mq.c b/block/blk-mq.c index 359d2dc0d414..387fc8e8a2ba 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1498,6 +1498,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data); if (unlikely(!rq)) { __wbt_done(q->rq_wb, wb_acct); + if (bio->bi_opf & REQ_NOWAIT) + bio_wouldblock_error(bio); return BLK_QC_T_NONE; } diff --git a/fs/direct-io.c b/fs/direct-io.c index e8baaabebf13..c87077d1dc33 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -479,8 +479,12 @@ static blk_status_t dio_bio_complete(struct dio *dio, struct 
bio *bio) unsigned i; blk_status_t err = bio->bi_status; - if (err) - dio->io_error = -EIO; + if (err) { + if (err == BLK_STS_AGAI
[PATCH 03/10] fs: Use RWF_* flags for AIO operations
From: Goldwyn Rodrigues aio_rw_flags is introduced in struct iocb (using aio_reserved1) which will carry the RWF_* flags. We cannot use aio_flags because they are not checked for validity which may break existing applications. Note, the only place RWF_HIPRI comes in effect is dio_await_one(). All the rest of the locations, aio code return -EIOCBQUEUED before the checks for RWF_HIPRI. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- fs/aio.c | 8 +++- include/uapi/linux/aio_abi.h | 2 +- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index f52d925ee259..020fa0045e3c 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, ssize_t ret; /* enforce forwards compatibility on users */ - if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) { + if (unlikely(iocb->aio_reserved2)) { pr_debug("EINVAL: reserve field set\n"); return -EINVAL; } @@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, req->common.ki_flags |= IOCB_EVENTFD; } + ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags); + if (unlikely(ret)) { + pr_debug("EINVAL: aio_rw_flags\n"); + goto out_put_req; + } + ret = put_user(KIOCB_KEY, &user_iocb->aio_key); if (unlikely(ret)) { pr_debug("EFAULT: aio_key\n"); diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h index bb2554f7fbd1..a2d4a8ac94ca 100644 --- a/include/uapi/linux/aio_abi.h +++ b/include/uapi/linux/aio_abi.h @@ -79,7 +79,7 @@ struct io_event { struct iocb { /* these are internal to the kernel/libc. */ __u64 aio_data; /* data to be returned in event's data */ - __u32 PADDED(aio_key, aio_reserved1); + __u32 PADDED(aio_key, aio_rw_flags); /* the kernel sets aio_key to the req # */ /* common fields */ -- 2.12.0
[PATCH 05/10] fs: return if direct I/O will trigger writeback
From: Goldwyn Rodrigues Find out if the I/O will trigger a wait due to writeback. If yes, return -EAGAIN. Return -EINVAL for buffered AIO: there are multiple causes of delay such as page locks, dirty throttling logic, page loading from disk etc. which cannot be taken care of. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- mm/filemap.c | 32 +--- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9b39a2390b9e..742034e56100 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2070,10 +2070,17 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter) loff_t size; size = i_size_read(inode); - retval = filemap_write_and_wait_range(mapping, iocb->ki_pos, - iocb->ki_pos + count - 1); - if (retval < 0) - goto out; + if (iocb->ki_flags & IOCB_NOWAIT) { + if (filemap_range_has_page(mapping, iocb->ki_pos, + iocb->ki_pos + count - 1)) + return -EAGAIN; + } else { + retval = filemap_write_and_wait_range(mapping, + iocb->ki_pos, + iocb->ki_pos + count - 1); + if (retval < 0) + goto out; + } file_accessed(file); @@ -2674,6 +2681,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from) pos = iocb->ki_pos; + if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT)) + return -EINVAL; + if (limit != RLIM_INFINITY) { if (iocb->ki_pos >= limit) { send_sig(SIGXFSZ, current, 0); @@ -2742,9 +2752,17 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from) write_len = iov_iter_count(from); end = (pos + write_len - 1) >> PAGE_SHIFT; - written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); - if (written) - goto out; + if (iocb->ki_flags & IOCB_NOWAIT) { + /* If there are pages to writeback, return */ + if (filemap_range_has_page(inode->i_mapping, pos, + pos + iov_iter_count(from))) + return -EAGAIN; + } else { + written = filemap_write_and_wait_range(mapping, pos, + pos + write_len - 1); + if (written) + goto out; + } /* * 
After a write we want buffered reads to be sure to go to disk to get -- 2.12.0
[PATCH 02/10] fs: Introduce filemap_range_has_page()
From: Goldwyn Rodrigues filemap_range_has_page() return true if the file's mapping has a page within the range mentioned. This function will be used to check if a write() call will cause a writeback of previous writes. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- include/linux/fs.h | 2 ++ mm/filemap.c | 32 2 files changed, 34 insertions(+) diff --git a/include/linux/fs.h b/include/linux/fs.h index 96a1a1fa54a9..0d34f5b5a6b0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2518,6 +2518,8 @@ extern int filemap_fdatawait(struct address_space *); extern void filemap_fdatawait_keep_errors(struct address_space *); extern int filemap_fdatawait_range(struct address_space *, loff_t lstart, loff_t lend); +extern bool filemap_range_has_page(struct address_space *, loff_t lstart, + loff_t lend); extern int filemap_write_and_wait(struct address_space *mapping); extern int filemap_write_and_wait_range(struct address_space *mapping, loff_t lstart, loff_t lend); diff --git a/mm/filemap.c b/mm/filemap.c index 6f1be573a5e6..9b39a2390b9e 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -376,6 +376,38 @@ int filemap_flush(struct address_space *mapping) } EXPORT_SYMBOL(filemap_flush); +/** + * filemap_range_has_page - check if a page exists in range. + * @mapping: address space within which to check + * @start_byte:offset in bytes where the range starts + * @end_byte: offset in bytes where the range ends (inclusive) + * + * Find at least one page in the range supplied, usually used to check if + * direct writing in this range will trigger a writeback. 
+ */ +bool filemap_range_has_page(struct address_space *mapping, + loff_t start_byte, loff_t end_byte) +{ + pgoff_t index = start_byte >> PAGE_SHIFT; + pgoff_t end = end_byte >> PAGE_SHIFT; + struct pagevec pvec; + bool ret; + + if (end_byte < start_byte) + return false; + + if (mapping->nrpages == 0) + return false; + + pagevec_init(&pvec, 0); + if (!pagevec_lookup(&pvec, mapping, index, 1)) + return false; + ret = (pvec.pages[0]->index <= end); + pagevec_release(&pvec); + return ret; +} +EXPORT_SYMBOL(filemap_range_has_page); + static int __filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte, loff_t end_byte) { -- 2.12.0
[PATCH 04/10] fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT
From: Goldwyn Rodrigues RWF_NOWAIT informs kernel to bail out if an AIO request will block for reasons such as file allocations, or a writeback triggered, or would block while allocating requests while performing direct I/O. RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags. FMODE_AIO_NOWAIT is a flag which identifies the file opened is capable of returning -EAGAIN if the AIO call will block. This must be set by supporting filesystems in the ->open() call. Filesystems xfs, btrfs and ext4 would be supported in the following patches. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- fs/aio.c| 6 ++ include/linux/fs.h | 9 + include/uapi/linux/fs.h | 4 +++- 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/fs/aio.c b/fs/aio.c index 020fa0045e3c..34027b67e2f4 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb, goto out_put_req; } + if ((req->common.ki_flags & IOCB_NOWAIT) && + !(req->common.ki_flags & IOCB_DIRECT)) { + ret = -EOPNOTSUPP; + goto out_put_req; + } + ret = put_user(KIOCB_KEY, &user_iocb->aio_key); if (unlikely(ret)) { pr_debug("EFAULT: aio_key\n"); diff --git a/include/linux/fs.h b/include/linux/fs.h index 0d34f5b5a6b0..4574121f4746 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -143,6 +143,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File was opened by fanotify and shouldn't generate fanotify events */ #define FMODE_NONOTIFY ((__force fmode_t)0x400) +/* File is capable of returning -EAGAIN if AIO will block */ +#define FMODE_AIO_NOWAIT ((__force fmode_t)0x800) + /* * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector * that indicates that they should check the contents of the iovec are @@ -269,6 +272,7 @@ struct writeback_control; #define IOCB_DSYNC (1 << 4) #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) +#define IOCB_NOWAIT(1 << 7) struct kiocb { struct 
file *ki_filp; @@ -3064,6 +3068,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags) if (unlikely(flags & ~RWF_SUPPORTED)) return -EOPNOTSUPP; + if (flags & RWF_NOWAIT) { + if (!(ki->ki_filp->f_mode & FMODE_AIO_NOWAIT)) + return -EOPNOTSUPP; + ki->ki_flags |= IOCB_NOWAIT; + } if (flags & RWF_HIPRI) ki->ki_flags |= IOCB_HIPRI; if (flags & RWF_DSYNC) diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 937c3e39650a..27d8c36c04af 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -360,7 +360,9 @@ struct fscrypt_key { #define RWF_HIPRI 0x0001 /* high priority request, poll if possible */ #define RWF_DSYNC 0x0002 /* per-IO O_DSYNC */ #define RWF_SYNC 0x0004 /* per-IO O_SYNC */ +#define RWF_NOWAIT 0x0008 /* per-IO, return -EAGAIN if operation would block */ -#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC) +#define RWF_SUPPORTED (RWF_HIPRI | RWF_DSYNC | RWF_SYNC |\ +RWF_NOWAIT) #endif /* _UAPI_LINUX_FS_H */ -- 2.12.0
[PATCH 10/10] btrfs: nowait aio support
From: Goldwyn Rodrigues Return EAGAIN if any of the following checks fail + i_rwsem is not lockable + NODATACOW or PREALLOC is not set + Cannot nocow at the desired location + Writing beyond end of file which is not allocated Acked-by: David Sterba Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/file.c | 33 +++-- fs/btrfs/inode.c | 3 +++ 2 files changed, 30 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index da1096eb1a40..59e2dccdf75b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1875,12 +1875,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, ssize_t num_written = 0; bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host); ssize_t err; - loff_t pos; - size_t count; + loff_t pos = iocb->ki_pos; + size_t count = iov_iter_count(from); loff_t oldsize; int clean_page = 0; - inode_lock(inode); + if ((iocb->ki_flags & IOCB_NOWAIT) && + (iocb->ki_flags & IOCB_DIRECT)) { + /* Don't sleep on inode rwsem */ + if (!inode_trylock(inode)) + return -EAGAIN; + /* +* We will allocate space in case nodatacow is not set, +* so bail +*/ + if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | + BTRFS_INODE_PREALLOC)) || + check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) { + inode_unlock(inode); + return -EAGAIN; + } + } else + inode_lock(inode); + err = generic_write_checks(iocb, from); if (err <= 0) { inode_unlock(inode); @@ -1914,8 +1931,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb, */ update_time_for_write(inode); - pos = iocb->ki_pos; - count = iov_iter_count(from); start_pos = round_down(pos, fs_info->sectorsize); oldsize = i_size_read(inode); if (start_pos > oldsize) { @@ -3071,13 +3086,19 @@ static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int whence) return offset; } +static int btrfs_file_open(struct inode *inode, struct file *filp) +{ + filp->f_mode |= FMODE_AIO_NOWAIT; + return generic_file_open(inode, filp); +} + const struct file_operations btrfs_file_operations = { 
.llseek = btrfs_file_llseek, .read_iter = generic_file_read_iter, .splice_read= generic_file_splice_read, .write_iter = btrfs_file_write_iter, .mmap = btrfs_file_mmap, - .open = generic_file_open, + .open = btrfs_file_open, .release= btrfs_release_file, .fsync = btrfs_sync_file, .fallocate = btrfs_fallocate, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f942293dd7e7..556c93060606 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -8754,6 +8754,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) dio_data.overwrite = 1; inode_unlock(inode); relock = true; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; } ret = btrfs_delalloc_reserve_space(inode, offset, count); if (ret) -- 2.12.0
[PATCH 06/10] fs: Introduce IOMAP_NOWAIT
From: Goldwyn Rodrigues IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps. This is used by XFS in the XFS patch. Reviewed-by: Christoph Hellwig Reviewed-by: Jan Kara Signed-off-by: Goldwyn Rodrigues --- fs/iomap.c| 8 include/linux/iomap.h | 1 + 2 files changed, 9 insertions(+) diff --git a/fs/iomap.c b/fs/iomap.c index 18f2f2b8ba2c..c71a64b97fba 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -881,6 +881,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, flags |= IOMAP_WRITE; } + if (iocb->ki_flags & IOCB_NOWAIT) { + if (filemap_range_has_page(mapping, start, end)) { + ret = -EAGAIN; + goto out_free_dio; + } + flags |= IOMAP_NOWAIT; + } + ret = filemap_write_and_wait_range(mapping, start, end); if (ret) goto out_free_dio; diff --git a/include/linux/iomap.h b/include/linux/iomap.h index f753e788da31..69f4e9470084 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -52,6 +52,7 @@ struct iomap { #define IOMAP_REPORT (1 << 2) /* report extent status, e.g. FIEMAP */ #define IOMAP_FAULT(1 << 3) /* mapping for page fault */ #define IOMAP_DIRECT (1 << 4) /* direct I/O */ +#define IOMAP_NOWAIT (1 << 5) /* Don't wait for writeback */ struct iomap_ops { /* -- 2.12.0
[PATCH 09/10] xfs: nowait aio support
From: Goldwyn Rodrigues If IOCB_NOWAIT is set, bail if the i_rwsem is not lockable immediately. IF IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin if it needs allocation either due to file extension, writing to a hole, or COW or waiting for other DIOs to finish. Return -EAGAIN if we don't have extent list in memory. Signed-off-by: Goldwyn Rodrigues Reviewed-by: Christoph Hellwig Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_file.c | 20 +++- fs/xfs/xfs_iomap.c | 22 ++ 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5fb5a0958a14..e159eb381d9f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -541,8 +541,11 @@ xfs_file_dio_aio_write( iolock = XFS_IOLOCK_SHARED; } - xfs_ilock(ip, iolock); - + if (!xfs_ilock_nowait(ip, iolock)) { + if (iocb->ki_flags & IOCB_NOWAIT) + return -EAGAIN; + xfs_ilock(ip, iolock); + } ret = xfs_file_aio_write_checks(iocb, from, &iolock); if (ret) goto out; @@ -553,9 +556,15 @@ xfs_file_dio_aio_write( * otherwise demote the lock if we had to take the exclusive lock * for other reasons in xfs_file_aio_write_checks. 
*/ - if (unaligned_io) - inode_dio_wait(inode); - else if (iolock == XFS_IOLOCK_EXCL) { + if (unaligned_io) { + /* If we are going to wait for other DIO to finish, bail */ + if (iocb->ki_flags & IOCB_NOWAIT) { + if (atomic_read(&inode->i_dio_count)) + return -EAGAIN; + } else { + inode_dio_wait(inode); + } + } else if (iolock == XFS_IOLOCK_EXCL) { xfs_ilock_demote(ip, XFS_IOLOCK_EXCL); iolock = XFS_IOLOCK_SHARED; } @@ -892,6 +901,7 @@ xfs_file_open( return -EFBIG; if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb))) return -EIO; + file->f_mode |= FMODE_AIO_NOWAIT; return 0; } diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 94e5bdf7304c..05dc87e8c1f5 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -995,6 +995,11 @@ xfs_file_iomap_begin( lockmode = xfs_ilock_data_map_shared(ip); } + if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) { + error = -EAGAIN; + goto out_unlock; + } + ASSERT(offset <= mp->m_super->s_maxbytes); if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes) length = mp->m_super->s_maxbytes - offset; @@ -1016,6 +1021,15 @@ xfs_file_iomap_begin( if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) { if (flags & IOMAP_DIRECT) { + /* +* A reflinked inode will result in CoW alloc. +* FIXME: It could still overwrite on unshared extents +* and not need allocation. +*/ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } /* may drop and re-acquire the ilock */ error = xfs_reflink_allocate_cow(ip, &imap, &shared, &lockmode); @@ -1033,6 +1047,14 @@ xfs_file_iomap_begin( if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) { /* +* If nowait is set bail since we are going to make +* allocations. +*/ + if (flags & IOMAP_NOWAIT) { + error = -EAGAIN; + goto out_unlock; + } + /* * We cap the maximum length we map here to MAX_WRITEBACK_PAGES * pages to keep the chunks of work done where somewhat symmetric * with the work writeback does. This is a completely arbitrary -- 2.12.0
[PATCH 08/10] ext4: nowait aio support
From: Goldwyn Rodrigues Return EAGAIN if any of the following checks fail for direct I/O: + i_rwsem is lockable + Writing beyond end of file (will trigger allocation) + Blocks are not allocated at the write location Signed-off-by: Goldwyn Rodrigues Reviewed-by: Jan Kara --- fs/ext4/file.c | 24 1 file changed, 20 insertions(+), 4 deletions(-) diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 02ce7e7bbdf5..25cd42baadea 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) return ext4_dax_write_iter(iocb, from); #endif - inode_lock(inode); + if (iocb->ki_flags & IOCB_NOWAIT) { + if (!inode_trylock(inode)) + return -EAGAIN; + } else { + inode_lock(inode); + } + ret = ext4_write_checks(iocb, from); if (ret <= 0) goto out; @@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) iocb->private = &overwrite; /* Check whether we do a DIO overwrite or not */ - if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio && - ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) - overwrite = 1; + if (o_direct && !unaligned_aio) { + if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) { + if (ext4_should_dioread_nolock(inode)) + overwrite = 1; + } else if (iocb->ki_flags & IOCB_NOWAIT) { + ret = -EAGAIN; + goto out; + } + } ret = __generic_file_write_iter(iocb, from); inode_unlock(inode); @@ -435,6 +447,10 @@ static int ext4_file_open(struct inode * inode, struct file * filp) if (ret < 0) return ret; } + + /* Set the flags to support nowait AIO */ + filp->f_mode |= FMODE_AIO_NOWAIT; + return dquot_file_open(inode, filp); } -- 2.12.0
Re: [PATCH 05/10] fs: return if direct I/O will trigger writeback
On Mon, Jun 19, 2017 at 11:33:43AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> Find out if the I/O will trigger a wait due to writeback. If yes,
> return -EAGAIN.
>
> Return -EINVAL for buffered AIO: there are multiple causes of
> delay such as page locks, dirty throttling logic, page loading
> from disk etc. which cannot be taken care of.

I don't believe that generic_file_read_iter() is sufficient. Consider
ext4_file_read_iter() -> ext4_dax_read_iter():

static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	inode_lock_shared(inode);

IOW, your ext4 patch is missing a chunk. I hadn't checked the xfs one...
[PATCH 8/9] btrfs: add support for passing in write hints for buffered writes
Reviewed-by: Andreas Dilger
Signed-off-by: Chris Mason
Signed-off-by: Jens Axboe
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 19eedf2e630b..3e57cfaa6dd6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2830,6 +2830,7 @@ static int submit_extent_page(int op, int op_flags, struct extent_io_tree *tree,
 	bio_add_page(bio, page, page_size, offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
+	op_flags |= write_hint_to_opf(inode_write_hint(page->mapping->host));
 	bio_set_op_attrs(bio, op, op_flags);
 	if (wbc) {
 		wbc_init_bio(wbc, bio);
-- 
2.7.4
[PATCHSET v8] Add support for write life time hints
A new iteration of this patchset, previously known as write streams.

As before, this patchset aims at enabling applications to split up
writes into separate streams, based on the perceived life time of the
data written. This is useful for a variety of reasons:

- For NVMe, this feature is ratified and released with the NVMe 1.3
  spec. Devices implementing Directives can expose multiple streams.
  Separating data written into streams based on life time can
  drastically reduce the write amplification. This helps device
  endurance, and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25%
  reduction in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values
anymore. I've previously advocated for just doing a set of hints that
makes sense instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We add an fcntl(2) interface to
get/set these types of hints. We define 4 hints that pertain to data
write life times:

RWH_WRITE_LIFE_SHORT	Data written with this flag is expected to have
			a high overwrite rate, or short life time.
RWH_WRITE_LIFE_MEDIUM	Longer life time than SHORT
RWH_WRITE_LIFE_LONG	Longer life time than MEDIUM
RWH_WRITE_LIFE_EXTREME	Longer life time than LONG

The idea is that these are relative values, so an application can use
them as they see fit. The underlying device can then place data
appropriately, or be free to ignore the hint. It's just a hint.

A branch based on current master can be pulled from here:

git://git.kernel.dk/linux-block write-stream.8

Changes since v7:
- NVMe: change 'streams' parameter to be a bool enable/disable. We
  hardwire the number of streams anyway and use the appropriate
  amount, so no point in exposing this value.
- NVMe: collapse stream values appropriately, instead of just doing
  a basic MOD.
- Get rid of pwritev2(2) flags. Just use the fcntl(2) interface.
- Collapse some patches.
- Change fcntl(2) interface to get/set values from a user supplied
  64-bit pointer.
- Move inode-to-iocb mask setting to iocb_flags().

Changes since v6:
- Rewrite NVMe write stream assignment.
- Change NVMe stream assignment to be per-controller, not per-ns.
  Then we can use the same IDs across name spaces, and we don't have
  to do lazy setup of streams.
- If streams are enabled on nvme, set io min/opt and discard
  granularity based on the stream params reported.
- Fixup F_SET_RW_HINT definition, it was 20, should have been 12.

Changes since v5:
- Change enum write_hint to enum rw_hint.
- Change fcntl() interface to be read/write generic.
- Bring enum rw_hint all the way to bio/request.
- Change references to streams in changelogs and debugfs interface.
- Rebase to master to resolve blkdev.h conflict.
- Reshuffle patches so the WRITE_LIFE_* hints and type come first.
  Allowed me to merge two block patches as well.

Changes since v4:
- Add enum write_hint and the WRITE_HINT_* values. This is what we
  use internally (until transformed to req/bio flags), and what is
  exposed to user space with the fcntl() interface. Maps directly to
  the RWF_WRITE_LIFE_* values.
- Add fcntl() interface for getting/setting hint values.
- Get rid of inode ->i_write_hint, encode the 3 bits of hint info in
  the inode flags instead.
- Allow a write with no hint to clear the old hint. Previously we
  only changed the hint if a new valid hint was given, not if no hint
  was passed in.
- Shrink flag space grabbed from 4 to 3 bits for RWF_* and the inode
  flags.

Changes since v3:
- Change any naming of stream ID to write hint.
- Various little API changes, suggested by Christoph.
- Cleanup the NVMe bits, dump the debug info.
- Change NVMe to lazily allocate the streams.
- Various NVMe error handling improvements and command checking.

Changes since v2:
- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:
- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans
  it up in general.
- Kill the block internal definitions of the stream type, we don't
  need them anymore. See above.

-- 
Jens Axboe
[PATCH 4/9] fs: add O_DIRECT support for sending down write life time hints
Reviewed-by: Andreas Dilger
Signed-off-by: Jens Axboe
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c     | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index dd91c99e9ba0..30e1fb65c2fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -183,6 +183,8 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
 	/* avoid the need for a I/O completion work item */
 	if (iocb->ki_flags & IOCB_DSYNC)
 		op |= REQ_FUA;
+
+	op |= write_hint_to_opf(iocb_write_hint(iocb));
 	return op;
 }
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e8baaabebf13..9e9adca0c592 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -385,6 +385,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	else
 		bio->bi_end_io = dio_bio_end_io;
 
+	bio->bi_opf |= write_hint_to_opf(iocb_write_hint(dio->iocb));
+
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
 
diff --git a/fs/iomap.c b/fs/iomap.c
index 18f2f2b8ba2c..63b9f87d9461 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
 	if (dio->flags & IOMAP_DIO_WRITE) {
 		bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | REQ_IDLE);
+		bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
 		task_io_account_write(bio->bi_iter.bi_size);
 	} else {
 		bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4
[PATCH 9/9] nvme: add support for streams and directives
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and life
time of the device.

We default to allocating 4 streams, controller wide, so we can use
them on all name spaces. This is configurable with the 'streams'
module parameter. If a write stream is set in a write, flag it as such
before sending it to the device.

Signed-off-by: Jens Axboe
---
 drivers/nvme/host/core.c | 189 +++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h |   4 +
 include/linux/nvme.h     |  48 ++++++++++
 3 files changed, 236 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index aee37b73231d..fa1acfa27fa4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -68,6 +68,10 @@ MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if qu
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static bool streams = true;
+module_param(streams, bool, 0644);
+MODULE_PARM_DESC(streams, "use streams, if available");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -297,6 +301,141 @@ struct request *nvme_alloc_request(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(nvme_alloc_request);
 
+/*
+ * Returns number of streams allocated for use by, or -1 on error.
+ */ +static int nvme_streams_allocate(struct nvme_ctrl *ctrl, unsigned int nstreams) +{ + struct nvme_command c; + union nvme_result res; + int ret; + + memset(&c, 0, sizeof(c)); + + c.directive.opcode = nvme_admin_directive_recv; + c.directive.nsid = cpu_to_le32(0x); + c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE; + c.directive.dtype = NVME_DIR_STREAMS; + c.directive.endir = nstreams; + + ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res, NULL, 0, 0, + NVME_QID_ANY, 0, 0); + if (ret) + return -1; + + return le32_to_cpu(res.u32) & 0x; +} + +static int nvme_enable_streams(struct nvme_ctrl *ctrl) +{ + struct nvme_command c; + + memset(&c, 0, sizeof(c)); + + c.directive.opcode = nvme_admin_directive_send; + c.directive.nsid = cpu_to_le32(0x); + c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE; + c.directive.dtype = NVME_DIR_IDENTIFY; + c.directive.tdtype = NVME_DIR_STREAMS; + c.directive.endir = NVME_DIR_ENDIR; + + return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0); +} + +static int nvme_get_stream_params(struct nvme_ctrl *ctrl, + struct streams_directive_params *s, u32 nsid) +{ + struct nvme_command c; + + memset(&c, 0, sizeof(c)); + memset(s, 0, sizeof(*s)); + + c.directive.opcode = nvme_admin_directive_recv; + c.directive.nsid = cpu_to_le32(nsid); + c.directive.numd = sizeof(*s); + c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM; + c.directive.dtype = NVME_DIR_STREAMS; + + return nvme_submit_sync_cmd(ctrl->admin_q, &c, s, sizeof(*s)); +} + +static int nvme_setup_directives(struct nvme_ctrl *ctrl) +{ + struct streams_directive_params s; + unsigned int nstreams; + int ret; + + if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES)) + return 0; + if (!streams) + return 0; + + ret = nvme_enable_streams(ctrl); + if (ret) + return ret; + + ret = nvme_get_stream_params(ctrl, &s, 0x); + if (ret) + return ret; + + ctrl->nssa = le16_to_cpu(s.nssa); + + nstreams = min_t(unsigned int, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1); + ret = nvme_streams_allocate(ctrl, nstreams); + if (ret < 
0) + return ret; + + /* require at least 2 streams to use them effectively */ + if (ret > 1) { + ret = min(ret, BLK_MAX_WRITE_HINTS - 1); + ctrl->nr_streams = ret; + dev_info(ctrl->device, "successfully enabled %d streams\n", ret); + } + + return 0; +} + +/* + * Write hint number to stream mappings + */ +static const unsigned int stream_mappings[BLK_MAX_WRITE_HINTS][BLK_MAX_WRITE_HINTS] = { + /* 0 or 1 stream, we don't use streams */ + { 0, }, + { 0, }, + /* collapse short+medium to short, and long+extreme to medium */ + { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_SHORT, + WRITE_LIFE_MEDIUM, WRITE_LIFE_MEDIUM }, + /* collapse long+extreme to long */ + { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_MED
[PATCH 1/9] fs: add fcntl() interface for setting/getting write life time hints
Define a set of write life time hints: and add an fcntl interface for querying these flags, and also for setting them as well: F_GET_RW_HINT Returns the read/write hint set. F_SET_RW_HINT Pass one of the above write hints. The user passes in a 64-bit pointer to get/set these values, and the interface returns 0/-1 on success/error. Sample program testing/implementing basic setting/getting of write hints is below. Add support for storing the write life time hint in the inode flags, and pass them to the kiocb flags as well. This is in preparation for utilizing these hints in the block layer, to guide on-media data placement. /* * writehint.c: check or set a file/inode write hint */ static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT", "WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG", "WRITE_LIFE_EXTREME" }; int main(int argc, char *argv[]) { uint64_t hint = -1ULL; int fd, ret; if (argc < 2) { fprintf(stderr, "%s: dev \n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 2; } if (argc > 2) hint = atoi(argv[2]); if (hint == -1ULL) { ret = fcntl(fd, F_RW_GET_HINT, &hint); if (ret < 0) { perror("fcntl: F_RW_GET_HINT"); return 3; } } else { ret = fcntl(fd, F_RW_SET_HINT, &hint); if (ret < 0) { perror("fcntl: F_RW_SET_HINT"); return 4; } } printf("%s: %shint %s\n", argv[1], hint != -1ULL ? 
"set " : "", str[hint]); close(fd); return 0; } Signed-off-by: Jens Axboe --- fs/fcntl.c | 43 fs/inode.c | 11 + include/linux/fs.h | 61 ++ include/uapi/linux/fcntl.h | 15 4 files changed, 130 insertions(+) diff --git a/fs/fcntl.c b/fs/fcntl.c index f4e7267d117f..113b78c11631 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -243,6 +243,45 @@ static int f_getowner_uids(struct file *filp, unsigned long arg) } #endif +static long fcntl_rw_hint(struct file *file, unsigned int cmd, + u64 __user *ptr) +{ + struct inode *inode = file_inode(file); + long ret = 0; + u64 hint; + + switch (cmd) { + case F_GET_RW_HINT: + hint = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT); + if (put_user(hint, ptr)) + ret = -EFAULT; + break; + case F_SET_RW_HINT: + if (get_user(hint, ptr)) { + ret = -EFAULT; + break; + } + switch (hint) { + case WRITE_LIFE_NONE: + case WRITE_LIFE_SHORT: + case WRITE_LIFE_MEDIUM: + case WRITE_LIFE_LONG: + case WRITE_LIFE_EXTREME: + inode_set_write_hint(inode, hint); + ret = 0; + break; + default: + ret = -EINVAL; + } + break; + default: + ret = -EINVAL; + break; + } + + return ret; +} + static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, struct file *filp) { @@ -337,6 +376,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg, case F_GET_SEALS: err = shmem_fcntl(filp, cmd, arg); break; + case F_GET_RW_HINT: + case F_SET_RW_HINT: + err = fcntl_rw_hint(filp, cmd, (u64 __user *) arg); + break; default: break; } diff --git a/fs/inode.c b/fs/inode.c index db5914783a71..defb015a2c6d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -2120,3 +2120,14 @@ struct timespec current_time(struct inode *inode) return timespec_trunc(now, inode->i_sb->s_time_gran); } EXPORT_SYMBOL(current_time); + +void inode_set_write_hint(struct inode *inode, enum rw_hint hint) +{ + unsigned int flags = write_hint_to_mask(hint, S_WRITE_LIFE_SHIFT); + + if (flags != mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) { + inode_lock(inode); + 
inode_set_flags(inode, flags, S_WRITE_LIFE_MASK); + inode_unlock(inode); + } +} diff --git a/include/linux/fs.h b/include/linux/fs.h index 023f0324762b..8720251cc153 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -270,6 +270,12 @@ struct writeback_control; #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) +/* + * Steal 3 bits for stream information, this
[PATCH 5/9] fs: add support for buffered writeback to pass down write hints
Reviewed-by: Andreas Dilger Signed-off-by: Jens Axboe --- fs/buffer.c | 14 +- fs/mpage.c | 1 + 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 306b720f7383..1259524715c8 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -49,7 +49,7 @@ static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, -struct writeback_control *wbc); +unsigned int stream, struct writeback_control *wbc); #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) @@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct page *page, do { struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { - submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc); + submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, + inode_write_hint(inode), wbc); nr_underway++; } bh = next; @@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct page *page, struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { clear_buffer_dirty(bh); - submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc); + submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, + inode_write_hint(inode), wbc); nr_underway++; } bh = next; @@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio) } static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, -struct writeback_control *wbc) +unsigned int write_hint, struct writeback_control *wbc) { struct bio *bio; @@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, op_flags |= REQ_META; if (buffer_prio(bh)) op_flags |= REQ_PRIO; + + op_flags |= write_hint_to_opf(write_hint); bio_set_op_attrs(bio, op, op_flags); submit_bio(bio); @@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh, int submit_bh(int op, int op_flags, struct buffer_head *bh) { - return submit_bh_wbc(op, op_flags, bh, NULL); + return 
submit_bh_wbc(op, op_flags, bh, 0, NULL); } EXPORT_SYMBOL(submit_bh); diff --git a/fs/mpage.c b/fs/mpage.c index 9524fdde00c2..d8a750873bf4 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -615,6 +615,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, goto confused; wbc_init_bio(wbc, bio); + bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode)); } /* -- 2.7.4
[PATCH 6/9] ext4: add support for passing in write hints for buffered writes
Reviewed-by: Andreas Dilger
Signed-off-by: Jens Axboe
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 930ca0fc9a0f..92834b702728 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
 	if (bio) {
 		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
 				  REQ_SYNC : 0;
+		io_op_flags |= write_hint_to_opf(inode_write_hint(io->io_end->inode));
 		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
 		submit_bio(io->io_bio);
 	}
@@ -397,6 +398,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;
+		io->io_bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
 	}
 	ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
 	if (ret != bh->b_size)
-- 
2.7.4
[PATCH 3/9] blk-mq: expose stream write hints through debugfs
Useful to verify that things are working the way they should. Reading the file will return number of kb written with each write hint. Writing the file will reset the statistics. No care is taken to ensure that we don't race on updates. Drivers will write to q->write_hints[] if they handle a given write hint. Reviewed-by: Andreas Dilger Signed-off-by: Jens Axboe --- block/blk-mq-debugfs.c | 24 include/linux/blkdev.h | 3 +++ 2 files changed, 27 insertions(+) diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 9edebbdce0bd..9ebc2945f991 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -135,6 +135,29 @@ static void print_stat(struct seq_file *m, struct blk_rq_stat *stat) } } +static int queue_write_hint_show(void *data, struct seq_file *m) +{ + struct request_queue *q = data; + int i; + + for (i = 0; i < BLK_MAX_WRITE_HINTS; i++) + seq_printf(m, "hint%d: %llu\n", i, q->write_hints[i]); + + return 0; +} + +static ssize_t queue_write_hint_store(void *data, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct request_queue *q = data; + int i; + + for (i = 0; i < BLK_MAX_WRITE_HINTS; i++) + q->write_hints[i] = 0; + + return count; +} + static int queue_poll_stat_show(void *data, struct seq_file *m) { struct request_queue *q = data; @@ -730,6 +753,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = { {"poll_stat", 0400, queue_poll_stat_show}, {"requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops}, {"state", 0600, queue_state_show, queue_state_write}, + {"write_hints", 0600, queue_write_hint_show, queue_write_hint_store}, {}, }; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 22cfba64ce81..687394b70924 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -586,6 +586,9 @@ struct request_queue { size_t cmd_size; void*rq_alloc_data; + +#define BLK_MAX_WRITE_HINTS5 + u64 write_hints[BLK_MAX_WRITE_HINTS]; }; #define QUEUE_FLAG_QUEUED 1 /* uses generic tag queueing */ 
-- 2.7.4