Re: [PATCH rfc 01/30] nvme: Add admin connect request queue

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:35PM +0300, Sagi Grimberg wrote:
> In case we reconnect with inflight admin IO we
> need to make sure that the connect comes before
> the admin command. This can only be achieved by
> using a separate request queue for admin connects.

Use up a few more lines of the available space for your lines? :)

Wouldn't a head insertion also solve the problem?
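
(For reference, "head insertion" here would mean executing the connect at
the front of the existing admin queue instead of on a separate one.  A
minimal sketch, assuming the 4.12-era block API; the command setup is
elided and this is illustrative only, not the actual fabrics code:

	struct request *rq;

	rq = blk_mq_alloc_request(ctrl->admin_q, REQ_OP_DRV_OUT,
			BLK_MQ_REQ_RESERVED);
	if (IS_ERR(rq))
		return PTR_ERR(rq);
	/* at_head = true: skip ahead of anything already queued on admin_q */
	blk_execute_rq(ctrl->admin_q, NULL, rq, true);
	blk_mq_free_request(rq);

Sagi's reply later in the thread explains why this alone doesn't close
the race.)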


Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset

2017-06-19 Thread Christoph Hellwig
> +static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remove)
>  {
> + nvme_rdma_stop_queue(&ctrl->queues[0]);
> + if (remove) {
> + blk_cleanup_queue(ctrl->ctrl.admin_connect_q);
> + blk_cleanup_queue(ctrl->ctrl.admin_q);
> + blk_mq_free_tag_set(&ctrl->admin_tag_set);
> + nvme_rdma_dev_put(ctrl->device);
> + }
> +
>   nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
>   sizeof(struct nvme_command), DMA_TO_DEVICE);
> + nvme_rdma_free_queue(&ctrl->queues[0]);

I don't like the calling convention.  We only have two callers
anyway.  So I'd much rather only keep the code inside the if above
in the new nvme_rdma_destroy_admin_queue that is only called at shutdown
time, and opencode the calls to nvme_rdma_stop_queue, nvme_rdma_free_qe
and nvme_rdma_free_queue in the callers.

> -static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
> +static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, bool new)

PCIe is just checking for a non-null admin_q.  But I think we should
just split this into two functions, one for the shared code at the end
and one just for the first-time setup, with the nvme_rdma_init_queue
call open coded.
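
(A sketch of the suggested shape; the two function names are hypothetical,
not from the patch:

	/* first-time setup only: init queue, alloc tag set and admin_q */
	static int nvme_rdma_alloc_admin_queue(struct nvme_rdma_ctrl *ctrl);

	/* shared with reset: connect the queue and read the controller caps */
	static int nvme_rdma_start_admin_queue(struct nvme_rdma_ctrl *ctrl);

with the nvme_rdma_init_queue call open coded in the create-controller
path.)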

>   error = nvmf_connect_admin_queue(&ctrl->ctrl);
> @@ -1596,6 +1601,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>  
>   set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>  
> + blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
> +

Where does this come from?


Re: [PATCH rfc 03/30] nvme-rdma: reuse configure/destroy admin queue

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:37PM +0300, Sagi Grimberg wrote:
> We have all we need in these functions now that these
> are aware if we are doing a full instantiation/removal.
> 
> For that we move nvme_rdma_configure_admin_queue to avoid
> a forward declaration, and we add blk_mq_ops forward declaration.
> 
> Signed-off-by: Sagi Grimberg 
> ---
>  drivers/nvme/host/rdma.c | 253 ++-
>  1 file changed, 119 insertions(+), 134 deletions(-)
> 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 3e4c6aa119ee..5fef5545e365 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -140,6 +140,9 @@ static DEFINE_MUTEX(device_list_mutex);
>  static LIST_HEAD(nvme_rdma_ctrl_list);
>  static DEFINE_MUTEX(nvme_rdma_ctrl_mutex);
>  
> +static const struct blk_mq_ops nvme_rdma_mq_ops;
> +static const struct blk_mq_ops nvme_rdma_admin_mq_ops;
> +
>  /*
>   * Disabling this option makes small I/O goes faster, but is fundamentally
>   * unsafe.  With it turned off we will have to register a global rkey that
> @@ -562,20 +565,22 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>  
>  static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
>  {
> + if (test_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> + return;
>   rdma_disconnect(queue->cm_id);
>   ib_drain_qp(queue->qp);
>  }
>  
>  static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>  {
> + if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> + return;
>   nvme_rdma_destroy_queue_ib(queue);
>   rdma_destroy_id(queue->cm_id);
>  }
>  
>  static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>  {
> - if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
> - return;
>   nvme_rdma_stop_queue(queue);
>   nvme_rdma_free_queue(queue);
>  }
> @@ -671,6 +676,116 @@ static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remo
>   nvme_rdma_free_queue(&ctrl->queues[0]);
>  }
>  
static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, bool new)

Too long line..

Also what about moving the helpers into the right place in the
previous patch that already re-indented most of this?


Re: [PATCH rfc 24/30] nvme-pci: rename to nvme_pci_configure_admin_queue

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:58PM +0300, Sagi Grimberg wrote:
> we are going to need the name for the core routine...

I think we should just pick this up ASAP as a prep patch..

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:22:02PM +0300, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg 

Could use a changelog.

Ming: does this solve your problem of not seeing the new queues
after a qemu CPU hotplug + reset?

> ---
>  drivers/nvme/host/core.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6937ba26ff2c..476c49c0601f 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2776,6 +2776,9 @@ int nvme_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
>   ret = blk_mq_reinit_tagset(ctrl->tagset);
>   if (ret)
>   goto out_free_io_queues;
> +
> + blk_mq_update_nr_hw_queues(ctrl->tagset,
> + ctrl->queue_count - 1);
> }
>  
>   ret = nvme_start_io_queues(ctrl);
> -- 
> 2.7.4
---end quoted text---


Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:22:03PM +0300, Sagi Grimberg wrote:
> Signed-off-by: Sagi Grimberg 

The subject sounds odd and it could use a changelog.  But I'd love to
pick this change up ASAP as it's the right thing to do..


[PATCH 1/2] block: remove the unused bio_to_phys macro

2017-06-19 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 include/linux/bio.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 40d054185277..83f9edb39d3f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -118,7 +118,6 @@ static inline void *bio_data(struct bio *bio)
 /*
  * will die
  */
-#define bio_to_phys(bio)   (page_to_phys(bio_page((bio))) + (unsigned long) bio_offset((bio)))
 #define bvec_to_phys(bv)   (page_to_phys((bv)->bv_page) + (unsigned long) (bv)->bv_offset)
 
 /*
-- 
2.11.0



[PATCH 2/2] block: stop using bio_data() in blk_write_same_mergeable

2017-06-19 Thread Christoph Hellwig
While the Write Same page currently always is in low memory, it is just
as easy and safer to compare the page and offset directly.

Signed-off-by: Christoph Hellwig 
---
 include/linux/blkdev.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index db2b7a7450bd..a61907438139 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -815,7 +815,8 @@ static inline bool rq_mergeable(struct request *rq)
 
 static inline bool blk_write_same_mergeable(struct bio *a, struct bio *b)
 {
-   if (bio_data(a) == bio_data(b))
+   if (bio_page(a) == bio_page(b) &&
+   bio_offset(a) == bio_offset(b))
return true;
 
return false;
-- 
2.11.0



two small cleanups

2017-06-19 Thread Christoph Hellwig
Stop relying on bio_page and page_address in the Write Same code,
and remove an unused macro.



move bounce limits settings into the drivers

2017-06-19 Thread Christoph Hellwig
Currently we still default to a bounce-all-highmem setting for block
drivers.  This series defaults to no bouncing and instead adds calls
to blk_queue_bounce_limit to those drivers that need it.  It also
has a few cleanups in that area.
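
(After this series a legacy driver that still touches bio data through the
kernel mapping opts back in explicitly.  A minimal sketch for a request_fn
based driver; my_request_fn and my_lock are placeholders:

	struct request_queue *q;

	q = blk_init_queue(my_request_fn, &my_lock);
	if (!q)
		return -ENOMEM;
	/* this driver uses bio_data()/page_address(), so keep bouncing
	 * highmem pages */
	blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);

which matches what the per-driver patches in this series add.)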



[PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Christoph Hellwig
pktcdvd is a make_request based stacking driver and thus doesn't have any
addressing limits of its own.  It also doesn't use bio_data() or
page_address(), so it doesn't need a lowmem bounce either.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/pktcdvd.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 26c04baae967..7f2eca31eb6a 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2413,8 +2413,6 @@ static blk_qc_t pkt_make_request(struct request_queue *q, struct bio *bio)
char b[BDEVNAME_SIZE];
struct bio *split;
 
-   blk_queue_bounce(q, &bio);
-
blk_queue_split(q, &bio);
 
pd = q->queuedata;
-- 
2.11.0



[PATCH 02/10] blk-map: call blk_queue_bounce from blk_rq_append_bio

2017-06-19 Thread Christoph Hellwig
This moves the knowledge about bouncing out of the callers and into the
block core (just like we do for the normal I/O path), and allows us to
unexport blk_queue_bounce.

Signed-off-by: Christoph Hellwig 
---
 block/blk-map.c  | 7 +++
 block/bounce.c   | 2 --
 drivers/scsi/osd/osd_initiator.c | 5 +
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 3b5cb863318f..2547016aa7aa 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -16,6 +16,8 @@
  */
 int blk_rq_append_bio(struct request *rq, struct bio *bio)
 {
+   blk_queue_bounce(rq->q, &bio);
+
if (!rq->bio) {
blk_rq_bio_prep(rq->q, rq, bio);
} else {
@@ -72,15 +74,13 @@ static int __blk_rq_map_user_iov(struct request *rq,
map_data->offset += bio->bi_iter.bi_size;
 
orig_bio = bio;
-   blk_queue_bounce(q, &bio);
 
/*
 * We link the bounce buffer in and could have to traverse it
 * later so we have to get a ref to prevent it from being freed
 */
-   bio_get(bio);
-
ret = blk_rq_append_bio(rq, bio);
+   bio_get(bio);
if (ret) {
bio_endio(bio);
__blk_rq_unmap_user(orig_bio);
@@ -249,7 +249,6 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
return ret;
}
 
-   blk_queue_bounce(q, &rq->bio);
return 0;
 }
 EXPORT_SYMBOL(blk_rq_map_kern);
diff --git a/block/bounce.c b/block/bounce.c
index 17d77613c471..b7e44bd59d7c 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -284,5 +284,3 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
 */
__blk_queue_bounce(q, bio_orig, pool);
 }
-
-EXPORT_SYMBOL(blk_queue_bounce);
diff --git a/drivers/scsi/osd/osd_initiator.c b/drivers/scsi/osd/osd_initiator.c
index 1e69a43b279d..3754888dc7dd 100644
--- a/drivers/scsi/osd/osd_initiator.c
+++ b/drivers/scsi/osd/osd_initiator.c
@@ -1577,10 +1577,7 @@ static struct request *_make_request(struct request_queue *q, bool has_write,
scsi_req_init(req);
 
for_each_bio(bio) {
-   struct bio *bounce_bio = bio;
-
-   blk_queue_bounce(req->q, &bounce_bio);
-   ret = blk_rq_append_bio(req, bounce_bio);
+   ret = blk_rq_append_bio(req, bio);
if (ret)
return ERR_PTR(ret);
}
-- 
2.11.0



[PATCH 06/10] blk-mq: don't bounce by default

2017-06-19 Thread Christoph Hellwig
For historical reasons we default to bouncing highmem pages for all block
queues.  But the blk-mq drivers are easy to audit to ensure that we don't
need this - scsi and mtip32xx set explicit limits and everyone else doesn't
have any particular ones.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq.c   | 5 -
 drivers/block/virtio_blk.c   | 3 ---
 drivers/block/xen-blkfront.c | 3 ---
 3 files changed, 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e26b7435e245..dd276a9e138e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2300,11 +2300,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
blk_queue_make_request(q, blk_mq_make_request);
 
/*
-* by default assume old behaviour and bounce for any highmem page
-*/
-   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
-   /*
 * Do this after blk_queue_make_request() overrides it...
 */
q->nr_requests = set->queue_depth;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index e59bd4549a8a..0297ad7c1452 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -720,9 +720,6 @@ static int virtblk_probe(struct virtio_device *vdev)
/* We can handle whatever the host told us to handle. */
blk_queue_max_segments(q, vblk->sg_elems-2);
 
-   /* No need to bounce any requests */
-   blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
-
/* No real sector limit. */
blk_queue_max_hw_sectors(q, -1U);
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ac90093fcb25..c852ed3c01d5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -953,9 +953,6 @@ static void blkif_set_queue_limits(struct blkfront_info *info)
 
/* Make sure buffer addresses are sector-aligned. */
blk_queue_dma_alignment(rq, 511);
-
-   /* Make sure we don't use bounce buffers. */
-   blk_queue_bounce_limit(rq, BLK_BOUNCE_ANY);
 }
 
 static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
-- 
2.11.0



[PATCH 04/10] block: remove the queue_bounce_pfn helper

2017-06-19 Thread Christoph Hellwig
Only used inside the bounce code, and opencoding it makes it more obvious
what is going on.

Signed-off-by: Christoph Hellwig 
---
 block/bounce.c | 6 +++---
 include/linux/blkdev.h | 5 -
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/block/bounce.c b/block/bounce.c
index fc79dd3e0f20..f2b4361fff81 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -203,7 +203,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bio_for_each_segment(from, *bio_orig, iter) {
if (i++ < BIO_MAX_PAGES)
sectors += from.bv_len >> 9;
-   if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
+   if (page_to_pfn(from.bv_page) > q->limits.bounce_pfn)
bounce = true;
}
if (!bounce)
@@ -220,7 +220,7 @@ static void __blk_queue_bounce(struct request_queue *q, struct bio **bio_orig,
bio_for_each_segment_all(to, bio, i) {
struct page *page = to->bv_page;
 
-   if (page_to_pfn(page) <= queue_bounce_pfn(q))
+   if (page_to_pfn(page) <= q->limits.bounce_pfn)
continue;
 
to->bv_page = mempool_alloc(pool, q->bounce_gfp);
@@ -272,7 +272,7 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
 * don't waste time iterating over bio segments
 */
if (!(q->bounce_gfp & GFP_DMA)) {
-   if (queue_bounce_pfn(q) >= blk_max_pfn)
+   if (q->limits.bounce_pfn >= blk_max_pfn)
return;
pool = page_pool;
} else {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f1a5f82a4d97..db2b7a7450bd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1363,11 +1363,6 @@ enum blk_default_limits {
 
#define blkdev_entry_to_request(entry) list_entry((entry), struct request, queuelist)
 
-static inline unsigned long queue_bounce_pfn(struct request_queue *q)
-{
-   return q->limits.bounce_pfn;
-}
-
 static inline unsigned long queue_segment_boundary(struct request_queue *q)
 {
return q->limits.seg_boundary_mask;
-- 
2.11.0



[PATCH 05/10] block: don't bother with bounce limits for make_request drivers

2017-06-19 Thread Christoph Hellwig
We only call blk_queue_bounce for request-based drivers, so stop messing
with it for make_request based drivers.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c   | 5 +
 block/blk-mq.c | 5 +
 block/blk-settings.c   | 5 -
 drivers/block/brd.c| 1 -
 drivers/block/drbd/drbd_main.c | 1 -
 drivers/block/rsxx/dev.c   | 1 -
 drivers/nvdimm/blk.c   | 1 -
 drivers/nvdimm/btt.c   | 1 -
 drivers/nvdimm/pmem.c  | 1 -
 9 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 62cf92550512..07b26853ce1e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -954,6 +954,11 @@ int blk_init_allocated_queue(struct request_queue *q)
 */
blk_queue_make_request(q, blk_queue_bio);
 
+   /*
+* by default assume old behaviour and bounce for any highmem page
+*/
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
q->sg_reserved_size = INT_MAX;
 
/* Protect q->elevator from elevator_change */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index dd276a9e138e..e26b7435e245 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2300,6 +2300,11 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
blk_queue_make_request(q, blk_mq_make_request);
 
/*
+* by default assume old behaviour and bounce for any highmem page
+*/
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
+   /*
 * Do this after blk_queue_make_request() overrides it...
 */
q->nr_requests = set->queue_depth;
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4fa81ed383ca..be1f115b538b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -172,11 +172,6 @@ void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
q->nr_batching = BLK_BATCH_REQ;
 
blk_set_default_limits(&q->limits);
-
-   /*
-* by default assume old behaviour and bounce for any highmem page
-*/
-   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 }
 EXPORT_SYMBOL(blk_queue_make_request);
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 57b574f2f66a..6112e99bedf7 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -418,7 +418,6 @@ static struct brd_device *brd_alloc(int i)
 
blk_queue_make_request(brd->brd_queue, brd_make_request);
blk_queue_max_hw_sectors(brd->brd_queue, 1024);
-   blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
 
/* This is so fdisk will align partitions on 4k, because of
 * direct_access API needing 4k alignment, returning a PFN
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 90680034ef57..5fb99e06ebe4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2850,7 +2850,6 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
/* Setting the max_hw_sectors to an odd value of 8kibyte here
   This triggers a max_bio_size message upon first attach or connect */
blk_queue_max_hw_sectors(q, DRBD_MAX_BIO_SIZE_SAFE >> 8);
-   blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
q->queue_lock = &resource->req_lock;
 
device->md_io.page = alloc_page(GFP_KERNEL);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 4e8bdfa0aa31..7f4acebf4657 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -284,7 +284,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
}
 
blk_queue_make_request(card->queue, rsxx_make_request);
-   blk_queue_bounce_limit(card->queue, BLK_BOUNCE_ANY);
blk_queue_max_hw_sectors(card->queue, blkdev_max_hw_sectors);
blk_queue_physical_block_size(card->queue, RSXX_HW_BLK_SIZE);
 
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 79eb9fb358d5..f12d23c49771 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -273,7 +273,6 @@ static int nsblk_attach_disk(struct nd_namespace_blk *nsblk)
 
blk_queue_make_request(q, nd_blk_make_request);
blk_queue_max_hw_sectors(q, UINT_MAX);
-   blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);
blk_queue_logical_block_size(q, nsblk_sector_size(nsblk));
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
q->queuedata = nsblk;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 31b2d14e210d..b6ba0618ea46 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1297,7 +1297,6 @@ static int btt_blk_init(struct btt *btt)
blk_queue_make_request(btt->btt_queue, btt_make_request);
blk_queue_logical_block_size(btt->btt_queue, btt->sector_size);
blk_queue_max_hw_sectors(btt->btt_queue, UINT_MAX);
-   blk_queue_bounce_limit(btt->btt_queue, BLK_BOUNCE_ANY);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, btt->btt_queue);
btt->btt_queue

[PATCH 03/10] block: move bounce declarations to block/blk.h

2017-06-19 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
---
 block/blk.h| 13 +
 block/bounce.c |  1 +
 include/linux/blkdev.h | 13 -
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 83c8e1100525..4576fb5a8006 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -334,4 +334,17 @@ static inline void blk_throtl_bio_endio(struct bio *bio) { }
 static inline void blk_throtl_stat_add(struct request *rq, u64 time) { }
 #endif
 
+#ifdef CONFIG_BOUNCE
+extern int init_emergency_isa_pool(void);
+extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
+#else
+static inline int init_emergency_isa_pool(void)
+{
+   return 0;
+}
+static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
+{
+}
+#endif /* CONFIG_BOUNCE */
+
 #endif /* BLK_INTERNAL_H */
diff --git a/block/bounce.c b/block/bounce.c
index b7e44bd59d7c..fc79dd3e0f20 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -22,6 +22,7 @@
 #include 
 
 #include 
+#include "blk.h"
 
 #define POOL_SIZE  64
 #define ISA_POOL_SIZE  16
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22cfba64ce81..f1a5f82a4d97 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -863,19 +863,6 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
 #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
 #define BLK_MIN_SG_TIMEOUT (7 * HZ)
 
-#ifdef CONFIG_BOUNCE
-extern int init_emergency_isa_pool(void);
-extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
-#else
-static inline int init_emergency_isa_pool(void)
-{
-   return 0;
-}
-static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
-{
-}
-#endif /* CONFIG_MMU */
-
 struct rq_map_data {
struct page **pages;
int page_order;
-- 
2.11.0



[PATCH 09/10] dm: don't set bounce limit

2017-06-19 Thread Christoph Hellwig
Now that all queue allocators come without a bounce limit by default,
dm doesn't have to override it anymore.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index fbd06b9f9467..402946035308 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1534,7 +1534,6 @@ void dm_init_normal_md_queue(struct mapped_device *md)
 * Initialize aspects of queue that aren't relevant for blk-mq
 */
md->queue->backing_dev_info->congested_fn = dm_any_congested;
-   blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 }
 
 static void cleanup_mapped_device(struct mapped_device *md)
-- 
2.11.0



[PATCH 08/10] block: don't set bounce limit in blk_init_queue

2017-06-19 Thread Christoph Hellwig
Instead move it to the callers.  Those that either don't use bio_data()
or page_address(), or are specific to architectures that do not support
highmem, are skipped.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c| 5 -
 drivers/block/aoe/aoeblk.c  | 1 +
 drivers/block/floppy.c  | 1 +
 drivers/block/paride/pcd.c  | 1 +
 drivers/block/paride/pd.c   | 1 +
 drivers/block/paride/pf.c   | 1 +
 drivers/block/skd_main.c| 1 +
 drivers/block/swim.c| 2 ++
 drivers/block/swim3.c   | 1 +
 drivers/block/xsysace.c | 1 +
 drivers/cdrom/gdrom.c   | 1 +
 drivers/mtd/mtd_blkdevs.c   | 1 +
 drivers/sbus/char/jsflash.c | 1 +
 13 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ef82164c6d82..62cf92550512 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,11 +927,6 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
return NULL;
}
 
-   /*
-* by default assume old behaviour and bounce for any highmem page
-*/
-   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
return q;
 }
 EXPORT_SYMBOL(blk_init_queue_node);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 027b876370bc..6797e6c23c8a 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -388,6 +388,7 @@ aoeblk_gdalloc(void *vp)
d->aoemajor, d->aoeminor);
goto err_mempool;
}
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 
spin_lock_irqsave(&d->lock, flags);
WARN_ON(!(d->flags & DEVFL_GD_NOW));
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 9e3cb32e365d..ce823647a9c4 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4203,6 +4203,7 @@ static int __init do_floppy_init(void)
goto out_put_disk;
}
 
+   blk_queue_bounce_limit(disks[drive]->queue, BLK_BOUNCE_HIGH);
blk_queue_max_hw_sectors(disks[drive]->queue, 64);
disks[drive]->major = FLOPPY_MAJOR;
disks[drive]->first_minor = TOMINOR(drive);
diff --git a/drivers/block/paride/pcd.c b/drivers/block/paride/pcd.c
index cffe42d80ce9..7b8c6368beb7 100644
--- a/drivers/block/paride/pcd.c
+++ b/drivers/block/paride/pcd.c
@@ -305,6 +305,7 @@ static void pcd_init_units(void)
put_disk(disk);
continue;
}
+   blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
cd->disk = disk;
cd->pi = &cd->pia;
cd->present = 0;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index c98983be4f9c..27a44b97393a 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -863,6 +863,7 @@ static void pd_probe_drive(struct pd_unit *disk)
return;
}
blk_queue_max_hw_sectors(p->queue, cluster);
+   blk_queue_bounce_limit(p->queue, BLK_BOUNCE_HIGH);
 
if (disk->drive == -1) {
for (disk->drive = 0; disk->drive <= 1; disk->drive++)
diff --git a/drivers/block/paride/pf.c b/drivers/block/paride/pf.c
index 5f46da8d05cd..eef7a91f667d 100644
--- a/drivers/block/paride/pf.c
+++ b/drivers/block/paride/pf.c
@@ -293,6 +293,7 @@ static void __init pf_init_units(void)
return;
}
blk_queue_max_segments(disk->queue, cluster);
+   blk_queue_bounce_limit(disk->queue, BLK_BOUNCE_HIGH);
pf->disk = disk;
pf->pi = &pf->pia;
pf->media_status = PF_NM;
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index e6c526861703..d0368682bd43 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -4273,6 +4273,7 @@ static int skd_cons_disk(struct skd_device *skdev)
rc = -ENOMEM;
goto err_out;
}
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
 
skdev->queue = q;
disk->queue = q;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 1633aaf24060..84434d3ea19b 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -864,6 +864,8 @@ static int swim_floppy_init(struct swim_priv *swd)
put_disk(swd->unit[drive].disk);
goto exit_put_disks;
}
+   blk_queue_bounce_limit(swd->unit[drive].disk->queue,
+   BLK_BOUNCE_HIGH);
swd->unit[drive].disk->queue->queuedata = swd;
swd->unit[drive].swd = swd;
}
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index e3399a138335..9f931f8f6b4c 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1223,6 +1223,7 @@ static int swim3_attach(struct macio_dev *mdev,
put_disk(disk);
return -ENOMEM;
}
+  

[PATCH 07/10] block: don't set bounce limit in blk_init_allocated_queue

2017-06-19 Thread Christoph Hellwig
And just move it into scsi_transport_sas, which needs it due to low-level
drivers directly dereferencing bio_data, and into blk_init_queue_node,
which will need a further push into the callers.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c  | 10 +-
 drivers/scsi/scsi_transport_sas.c |  5 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 07b26853ce1e..ef82164c6d82 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -927,6 +927,11 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
return NULL;
}
 
+   /*
+* by default assume old behaviour and bounce for any highmem page
+*/
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
return q;
 }
 EXPORT_SYMBOL(blk_init_queue_node);
@@ -954,11 +959,6 @@ int blk_init_allocated_queue(struct request_queue *q)
 */
blk_queue_make_request(q, blk_queue_bio);
 
-   /*
-* by default assume old behaviour and bounce for any highmem page
-*/
-   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
-
q->sg_reserved_size = INT_MAX;
 
/* Protect q->elevator from elevator_change */
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index cc970c811bcb..87ea904f0700 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -249,6 +249,11 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, struct sas_rphy *rphy)
if (error)
goto out_cleanup_queue;
 
+   /*
+* by default assume old behaviour and bounce for any highmem page
+*/
+   blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+
error = bsg_register_queue(q, dev, name, release);
if (error)
goto out_cleanup_queue;
-- 
2.11.0



[PATCH 10/10] mmc/block: remove a call to blk_queue_bounce_limit

2017-06-19 Thread Christoph Hellwig
BLK_BOUNCE_ANY is the default now, so the call is superfluous.

Signed-off-by: Christoph Hellwig 
---
 drivers/mmc/core/queue.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 7f20298d892b..b659a28c8018 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -388,7 +388,6 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card,
mmc_queue_setup_discard(mq->queue, card);
 
if (card->bouncesz) {
-   blk_queue_bounce_limit(mq->queue, BLK_BOUNCE_ANY);
blk_queue_max_hw_sectors(mq->queue, card->bouncesz / 512);
blk_queue_max_segments(mq->queue, card->bouncesz / 512);
blk_queue_max_segment_size(mq->queue, card->bouncesz);
-- 
2.11.0



Re: [PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-19 Thread kbuild test robot
Hi Paolo,

[auto build test WARNING on v4.12-rc5]
[also build test WARNING on next-20170616]
[cannot apply to block/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-update-wr_busy_queues-if-needed-on-a-queue-split/20170619-145003
config: i386-randconfig-x000-201725 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   block/bfq-iosched.c: In function 'bfq_get_rq_private':
>> block/bfq-iosched.c:770:10: warning: 'old_wr_coeff' may be used uninitialized in this function [-Wmaybe-uninitialized]
 else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
 ^
   block/bfq-iosched.c:731:15: note: 'old_wr_coeff' was declared here
 unsigned int old_wr_coeff;
  ^~~~

vim +/old_wr_coeff +770 block/bfq-iosched.c

   754  time_is_before_jiffies(bfqq->last_wr_start_finish +
   755 bfqq->wr_cur_max_time))) {
   756  bfq_log_bfqq(bfqq->bfqd, bfqq,
   757  "resume state: switching off wr");
   758  
   759  bfqq->wr_coeff = 1;
   760  }
   761  
   762  /* make sure weight will be updated, however we got here */
   763  bfqq->entity.prio_changed = 1;
   764  
   765  if (likely(!busy))
   766  return;
   767  
   768  if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
   769  bfqd->wr_busy_queues++;
 > 770  else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
   771  bfqd->wr_busy_queues--;
   772  }
   773  
   774  static int bfqq_process_refs(struct bfq_queue *bfqq)
   775  {
   776  return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
   777  }
   778  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH rfc 01/30] nvme: Add admin connect request queue

2017-06-19 Thread Sagi Grimberg



>> In case we reconnect with inflight admin IO we
>> need to make sure that the connect comes before
>> the admin command. This can only be achieved by
>> using a separate request queue for admin connects.
>
> Use up a few more lines of the available space for your lines? :)

I warned in the cover-letter that the change logs are pure
negligence at the moment :)

> Wouldn't a head insertion also solve the problem?


the head insertion will not protect against it because
we must invoke blk_mq_start_stopped_hw_queues on the admin_q
request queue so the admin connect can make progress, at this
point (and before we actually queue up connect) pending admin
commands can sneak in...

However, you raise a valid point, I think I added this before we
had the queue_is_ready protection, which will reject the command
if the queue is not LIVE (unless it's a connect). I think the reason
it's still in is that I tested this with loop which doesn't have
a per-queue state machine.

I'm still wondering if it's a good idea to rely on the transport
queue state to reject non-connect requests on non-LIVE queues.
if/when we introduce a queue representation to the core and we
drive the state machine there, then we could actually rely on it
(I do have some code for it, but its a pretty massive change which
cannot be added in an incremental fashion).

Thoughts?
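
(The queue_is_ready protection referred to above is the check nvme-rdma
makes before queueing a request, roughly the following; simplified from
the 4.12-era drivers/nvme/host/rdma.c, so treat it as a sketch:

	static inline int nvme_rdma_queue_is_ready(struct nvme_rdma_queue *queue,
			struct request *rq)
	{
		if (unlikely(!test_bit(NVME_RDMA_Q_LIVE, &queue->flags))) {
			struct nvme_command *cmd = nvme_req(rq)->cmd;

			/* only a fabrics Connect may pass on a non-LIVE queue */
			if (!blk_rq_is_passthrough(rq) ||
			    cmd->common.opcode != nvme_fabrics_command ||
			    cmd->fabrics.fctype != nvme_fabrics_type_connect)
				return -EIO;
		}
		return 0;
	}

Anything else queued before the queue goes LIVE is failed instead of being
sent ahead of the connect.)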


Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset

2017-06-19 Thread Sagi Grimberg



On 19/06/17 10:18, Christoph Hellwig wrote:
>> +static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl, bool remove)
>>  {
>> +	nvme_rdma_stop_queue(&ctrl->queues[0]);
>> +	if (remove) {
>> +		blk_cleanup_queue(ctrl->ctrl.admin_connect_q);
>> +		blk_cleanup_queue(ctrl->ctrl.admin_q);
>> +		blk_mq_free_tag_set(&ctrl->admin_tag_set);
>> +		nvme_rdma_dev_put(ctrl->device);
>> +	}
>> +
>> 	nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
>> 		sizeof(struct nvme_command), DMA_TO_DEVICE);
>> +	nvme_rdma_free_queue(&ctrl->queues[0]);
>
> I don't like the calling convention.  We only have two callers
> anyway.  So I'd much rather only keep the code inside the if above
> in the new nvme_rdma_destroy_admin_queue that is only called at shutdown
> time, and opencode the calls to nvme_rdma_stop_queue, nvme_rdma_free_qe
> and nvme_rdma_free_queue in the callers.

We can do that, but this tries to eliminate duplicate code as
much as possible. It's not like the convention is unprecedented...


>> -static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>> +static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, bool new)
>
> PCIe is just checking for a non-null admin_q.

Which I don't like very much :)

> But I think we should
> just split this into two functions, one for the shared code at the end
> and one just for the first-time setup, with the nvme_rdma_init_queue
> call open coded.

We can split, but I less like the idea of open-coding
nvme_rdma_init_queue at the call-sites.


>> 	error = nvmf_connect_admin_queue(&ctrl->ctrl);
>> @@ -1596,6 +1601,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl)
>>  
>>  	set_bit(NVME_RDMA_Q_LIVE, &ctrl->queues[0].flags);
>>  
>> +	blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);
>> +
>
> Where does this come from?

Spilled in I guess...


Re: [PATCH rfc 03/30] nvme-rdma: reuse configure/destroy admin queue

2017-06-19 Thread Sagi Grimberg



> static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, bool new)
>
> Too long line..
>
> Also what about moving the helpers into the right place in the
> previous patch that already re-indented most of this?


Sure.


Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration

2017-06-19 Thread Sagi Grimberg



> The subject sounds odd and it could use a changelog.  But I'd love to
> pick this change up ASAP as it's the right thing to do..


How? Where would you place it? There is no nvme_configure_admin_queue in
nvme-core.

I suppose we could get it in nvme_init_ctrl?


Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues

2017-06-19 Thread Ming Lei
On Mon, Jun 19, 2017 at 09:21:48AM +0200, Christoph Hellwig wrote:
> On Sun, Jun 18, 2017 at 06:22:02PM +0300, Sagi Grimberg wrote:
> > Signed-off-by: Sagi Grimberg 
> 
> Could use a changelog.
> 
> Ming: does this solve your problem of not seeing the new queues
> after a qemu CPU hotplug + reset?

The issue I observed is that there is no NVMe reset triggered after a
CPU becomes online.

It may take a while since the test needs the whole patchset applied.
Or is it fine to come up with a standalone fix for this issue? Sagi?

Once the test is done, I will update you.

Thanks,
Ming


Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 03:49:43PM +0800, Ye Xiaolong wrote:
> On 06/19, Christoph Hellwig wrote:
> >On Mon, Jun 19, 2017 at 02:03:18PM +0800, kernel test robot wrote:
> >> 
> >> Greeting,
> >> 
> >> FYI, we noticed a -4% regression of fio.write_bw_MBps due to commit:
> >
> >What does cat /sys/block/<dev>/queue/scheduler say for your test?
> >Can you try with mq-deadline or bfq and see what the results are?
> 
> Could you tell me how to do it? via changing kernel config?

BFQ doesn't seem to be enabled in the config, so it would have to be
enabled.

I've not seen a compile-time option for the MQ I/O scheduler (unlike
the legacy one), so the way to change it would be to echo the name to
/sys/block/<dev>/queue/scheduler


Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression

2017-06-19 Thread Ye Xiaolong
On 06/19, Christoph Hellwig wrote:
>On Mon, Jun 19, 2017 at 03:49:43PM +0800, Ye Xiaolong wrote:
>> On 06/19, Christoph Hellwig wrote:
>> >On Mon, Jun 19, 2017 at 02:03:18PM +0800, kernel test robot wrote:
>> >> 
>> >> Greeting,
>> >> 
>> >> FYI, we noticed a -4% regression of fio.write_bw_MBps due to commit:
>> >
>> >What does cat /sys/block/<dev>/queue/scheduler say for your test?
>> >Can you try with mq-deadline or bfq and see what the results are?
>> 
>> Could you tell me how to do it? via changing kernel config?
>
>BFQ doesn't seem to be enabled in the config, so it would have to be
>enabled.
>
>I've not seen a compile-time option for the MQ I/O scheduler (unlike
>the legacy one), so the way to change it would be to echo the name to
>/sys/block/<dev>/queue/scheduler

echo mq-deadline > /sys/block/<dev>/queue/scheduler ?

I'll try and post the results once I get it.

Thanks,
Xiaolong



Re: [lkp-robot] [scsi] ebc76736f2: fio.write_bw_MBps -4% regression

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 04:52:36PM +0800, Ye Xiaolong wrote:
> >I've not seen a compile-time option for the MQ I/O scheduler (unlike
> >the legacy one), so the way to change it would be to echo the name to
> >/sys/block/<dev>/queue/scheduler
> 
> echo mq-deadline > /sys/block/<dev>/queue/scheduler ?
> 
> I'll try and post the results once I get it.

Thanks!


[PATCH] btrfs: use new block error code

2017-06-19 Thread Dan Carpenter
This function is supposed to return blk_status_t error codes now but
there was a stray -ENOMEM left behind.

Fixes: 4e4cbee93d56 ("block: switch bios to blk_status_t")
Signed-off-by: Dan Carpenter 

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e536e98fe351..2c0b7b57fcd5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -584,7 +584,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
  __GFP_HIGHMEM);
if (!cb->compressed_pages[pg_index]) {
faili = pg_index - 1;
-   ret = -ENOMEM;
+   ret = BLK_STS_RESOURCE;
goto fail2;
}
}


Re: [PATCH BUGFIX] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-19 Thread Paolo Valente

> On 19 Jun 2017, at 09:38, kbuild test robot wrote:
> 
> Hi Paolo,
> 
> [auto build test WARNING on v4.12-rc5]
> [also build test WARNING on next-20170616]
> [cannot apply to block/for-next]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Paolo-Valente/block-bfq-update-wr_busy_queues-if-needed-on-a-queue-split/20170619-145003
> config: i386-randconfig-x000-201725 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
># save the attached .config to linux build tree
>make ARCH=i386 
> 
> Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
> http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings
> 
> All warnings (new ones prefixed by >>):
> 
>   block/bfq-iosched.c: In function 'bfq_get_rq_private':
>>> block/bfq-iosched.c:770:10: warning: 'old_wr_coeff' may be used uninitialized in this function [-Wmaybe-uninitialized]
> else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
> ^
>   block/bfq-iosched.c:731:15: note: 'old_wr_coeff' was declared here
> unsigned int old_wr_coeff;
>  ^~~~
> 

I'm sending a V2, (probably imperceptibly) slower on average, but not
confusing the compiler.

Thanks,
Paolo

> vim +/old_wr_coeff +770 block/bfq-iosched.c
> 
>   754 time_is_before_jiffies(bfqq->last_wr_start_finish +
>   755bfqq->wr_cur_max_time))) {
>   756 bfq_log_bfqq(bfqq->bfqd, bfqq,
>   757 "resume state: switching off wr");
>   758 
>   759 bfqq->wr_coeff = 1;
>   760 }
>   761 
>   762 /* make sure weight will be updated, however we got here */
>   763 bfqq->entity.prio_changed = 1;
>   764 
>   765 if (likely(!busy))
>   766 return;
>   767 
>   768 if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
>   769 bfqd->wr_busy_queues++;
>> 770  else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
>   771 bfqd->wr_busy_queues--;
>   772 }
>   773 
>   774 static int bfqq_process_refs(struct bfq_queue *bfqq)
>   775 {
>   776 return bfqq->ref - bfqq->allocated - bfqq->entity.on_st;
>   777 }
>   778 
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> <.config.gz>



[PATCH BUGFIX V2] block, bfq: update wr_busy_queues if needed on a queue split

2017-06-19 Thread Paolo Valente
This commit fixes a bug triggered by a non-trivial sequence of
events. These events are briefly described in the next two
paragraphs. The impatient, or those who are familiar with queue
merging and splitting, can jump directly to the last paragraph.

On each I/O-request arrival for a shared bfq_queue, i.e., for a
bfq_queue that is the result of the merge of two or more bfq_queues,
BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
many random I/O requests have arrived for the bfq_queue; if the device
is non-rotational, then random requests must also be small for the
bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
detected as seeky, then a split occurs: the bfq I/O context of the
process that has issued the request is redirected from the shared
bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
shared bfq_queue actually happens to be shared only by one process
(because of previous splits), then no new bfq_queue is created: the
state of the shared bfq_queue is just changed from shared to non
shared.

Regardless of whether a brand new non-shared bfq_queue is created, or
the pre-existing shared bfq_queue is just turned into a non-shared
bfq_queue, several parameters of the non-shared bfq_queue are set
(restored) to the original values they had when the bfq_queue
associated with the bfq I/O context of the process (that has just
issued an I/O request) was merged with the shared bfq_queue. One of
these parameters is the weight-raising state.

If, on the split of a shared bfq_queue,
1) a pre-existing shared bfq_queue is turned into a non-shared
bfq_queue;
2) the previously shared bfq_queue happens to be busy;
3) the weight-raising state of the previously shared bfq_queue happens
to change;
the number of weight-raised busy queues changes. The field
wr_busy_queues must then be updated accordingly, but such an update
was missing. This commit adds the missing update.

Reported-by: Luca Miccio 
Signed-off-by: Paolo Valente 
---
 block/bfq-iosched.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ed93da2..bbeaf52 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -725,8 +725,12 @@ static void bfq_updated_next_req(struct bfq_data *bfqd,
 }
 
 static void
-bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
+bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_data *bfqd,
+ struct bfq_io_cq *bic, bool bfq_already_existing)
 {
+   unsigned int old_wr_coeff = bfqq->wr_coeff;
+   bool busy = bfq_already_existing && bfq_bfqq_busy(bfqq);
+
if (bic->saved_idle_window)
bfq_mark_bfqq_idle_window(bfqq);
else
@@ -754,6 +758,14 @@ bfq_bfqq_resume_state(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 
/* make sure weight will be updated, however we got here */
bfqq->entity.prio_changed = 1;
+
+   if (likely(!busy))
+   return;
+
+   if (old_wr_coeff == 1 && bfqq->wr_coeff > 1)
+   bfqd->wr_busy_queues++;
+   else if (old_wr_coeff > 1 && bfqq->wr_coeff == 1)
+   bfqd->wr_busy_queues--;
 }
 
 static int bfqq_process_refs(struct bfq_queue *bfqq)
@@ -4402,7 +4414,7 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
const int is_sync = rq_is_sync(rq);
struct bfq_queue *bfqq;
bool new_queue = false;
-   bool split = false;
+   bool bfqq_already_existing = false, split = false;
 
spin_lock_irq(&bfqd->lock);
 
@@ -4432,6 +4444,8 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
bfqq = bfq_get_bfqq_handle_split(bfqd, bic, bio,
 true, is_sync,
 NULL);
+   else
+   bfqq_already_existing = true;
}
}
 
@@ -4457,7 +4471,8 @@ static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
 * queue: restore the idle window and the
 * possible weight raising period.
 */
-   bfq_bfqq_resume_state(bfqq, bic);
+   bfq_bfqq_resume_state(bfqq, bfqd, bic,
+ bfqq_already_existing);
}
}
 
-- 
2.10.0



dm integrity tests crash kernel (4.12-rc5)

2017-06-19 Thread Ondrej Kozina

Hi,

The cryptsetup testsuite easily triggers the following crash. I can provide
more info on demand, but currently the most straightforward way to trigger
it is:


1) checkout cryptsetup master branch
   (https://gitlab.com/cryptsetup/cryptsetup.git)
2) ./autogen.sh --disable-python --enable-integritysetup
3) compile
4) run tests/integrity-compat-test in a loop for a while (it's not 100%
   reproducible)


[  330.980914] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
[  330.980923] [ cut here ]
[  330.982627] WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 trace_hardirqs_on_caller+0x107/0x180
[  330.984340] Modules linked in: dm_integrity async_xor xor async_tx dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd
[  330.989205] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-rc5 #1
[  330.990645] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  330.992616] task: 88007c838000 task.stack: c9368000
[  330.994084] RIP: 0010:trace_hardirqs_on_caller+0x107/0x180
[  330.995361] RSP: 0018:88007fd03e38 EFLAGS: 00010092
[  330.996594] RAX: 002d RBX: 8800322404e0 RCX: 
[  330.998194] RDX: 810badd8 RSI: 0001 RDI: 810badf2
[  330.999801] RBP: 88007fd03e48 R08: 0001 R09: 
[  331.001427] R10:  R11: 810badbe R12: 81809ff7
[  331.003045] R13: 880032c4e700 R14: 880077cd9bc0 R15: 
[  331.005616] FS:  () GS:88007fd0() knlGS:
[  331.008308] CS:  0010 DS:  ES:  CR0: 80050033
[  331.010107] CR2: 025e7f50 CR3: 39c7c000 CR4: 000406e0
[  331.011968] Call Trace:
[  331.012806]  
[  331.013684]  trace_hardirqs_on+0xd/0x10
[  331.014732]  _raw_spin_unlock_irq+0x27/0x30
[  331.015840]  submit_flush_bio+0x4e/0x80 [dm_integrity]
[  331.017157]  do_endio_flush+0x41/0x70 [dm_integrity]
[  331.018476]  dec_in_flight+0x59/0x110 [dm_integrity]
[  331.019767]  integrity_end_io+0x5e/0x70 [dm_integrity]
[  331.020965]  bio_endio+0x7c/0x1a0
[  331.021917]  blk_update_request+0x9f/0x3d0
[  331.023050]  blk_mq_end_request+0x15/0x60
[  331.024224]  lo_complete_rq+0x2b/0x80
[  331.025406]  __blk_mq_complete_request_remote+0xe/0x10
[  331.026813]  flush_smp_call_function_queue+0x4f/0x110
[  331.028173]  generic_smp_call_function_single_interrupt+0xe/0x20
[  331.029616]  smp_call_function_single_interrupt+0x22/0x30
[  331.031125]  call_function_single_interrupt+0x90/0xa0
[  331.032819] RIP: 0010:default_idle+0x1b/0x180
[  331.035740] RSP: 0018:c936beb0 EFLAGS: 0206 ORIG_RAX: ff04
[  331.039007] RAX: 88007c838000 RBX: 88007c838000 RCX: 0001
[  331.041332] RDX:  RSI: 0001 RDI: 88007c838000
[  331.043227] RBP: c936bec0 R08:  R09: 0001
[  331.045066] R10:  R11:  R12: 0001
[  331.047367] R13: 88007c838000 R14:  R15: 
[  331.049232]  
[  331.050297]  arch_cpu_idle+0xa/0x10
[  331.051505]  default_idle_call+0x1e/0x30
[  331.052874]  do_idle+0x15a/0x1c0
[  331.053836]  cpu_startup_entry+0x18/0x20
[  331.054997]  start_secondary+0xed/0xf0
[  331.056183]  secondary_startup_64+0x9f/0x9f
[  331.057503] Code: 41 5c 5d f3 c3 e8 ea 75 29 00 85 c0 74 f1 8b 35 b0 92 ca 01 85 f6 75 e7 48 c7 c6 67 8e c1 81 48 c7 c7 43 6d c1 81 e8 dc b4 0b 00 <0f> ff eb d0 be 01 00 00 00 48 89 df e8 48 fe ff ff 85 c0 75 90
[  331.062261] ---[ end trace fd21f79668c6a046 ]---
[  331.063745] [ cut here ]
[  331.066503] kernel BUG at kernel/irq_work.c:135!
[  331.068374] invalid opcode:  [#1] SMP
[  331.070144] Modules linked in: dm_integrity async_xor xor async_tx dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd
[  331.076302] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW 4.12.0-rc5 #1
[  331.078639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[  331.081124] task: 88007c838000 task.stack: c9368000
[  331.082646] RIP: 0010:irq_work_run_list+0x62/0x70
[  331.083892] RSP: 0018:88007fd03f70 EFLAGS: 00010206
[  331.085469] RAX: 88007c838000 RBX: 0200 RCX: 0002
[  331.087449] RDX: 813097bc RSI: 0001 RDI: 88007fd14cf8
[  331.089271] RBP: 88007fd03f90 R08:  R09: 
[  331.091093] R10:  R11: 81309760 R12: 
[  3

Re: [dm-devel] dm integrity tests crash kernel (4.12-rc5)

2017-06-19 Thread Ondrej Kozina

Same log again, hopefully in a prettier format. Sorry:

[  330.980914] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
[  330.980923] [ cut here ]
[  330.982627] WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 
trace_hardirqs_on_caller+0x107/0x180
[  330.984340] Modules linked in: dm_integrity async_xor xor async_tx 
dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs 
lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc 
aesni_intel aes_x86_64 crypto_simd glue_helper cryptd

[  330.989205] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.12.0-rc5 #1
[  330.990645] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.1-1.fc24 04/01/2014

[  330.992616] task: 88007c838000 task.stack: c9368000
[  330.994084] RIP: 0010:trace_hardirqs_on_caller+0x107/0x180
[  330.995361] RSP: 0018:88007fd03e38 EFLAGS: 00010092
[  330.996594] RAX: 002d RBX: 8800322404e0 RCX: 

[  330.998194] RDX: 810badd8 RSI: 0001 RDI: 
810badf2
[  330.999801] RBP: 88007fd03e48 R08: 0001 R09: 

[  331.001427] R10:  R11: 810badbe R12: 
81809ff7
[  331.003045] R13: 880032c4e700 R14: 880077cd9bc0 R15: 

[  331.005616] FS:  () GS:88007fd0() 
knlGS:

[  331.008308] CS:  0010 DS:  ES:  CR0: 80050033
[  331.010107] CR2: 025e7f50 CR3: 39c7c000 CR4: 
000406e0

[  331.011968] Call Trace:
[  331.012806]  
[  331.013684]  trace_hardirqs_on+0xd/0x10
[  331.014732]  _raw_spin_unlock_irq+0x27/0x30
[  331.015840]  submit_flush_bio+0x4e/0x80 [dm_integrity]
[  331.017157]  do_endio_flush+0x41/0x70 [dm_integrity]
[  331.018476]  dec_in_flight+0x59/0x110 [dm_integrity]
[  331.019767]  integrity_end_io+0x5e/0x70 [dm_integrity]
[  331.020965]  bio_endio+0x7c/0x1a0
[  331.021917]  blk_update_request+0x9f/0x3d0
[  331.023050]  blk_mq_end_request+0x15/0x60
[  331.024224]  lo_complete_rq+0x2b/0x80
[  331.025406]  __blk_mq_complete_request_remote+0xe/0x10
[  331.026813]  flush_smp_call_function_queue+0x4f/0x110
[  331.028173]  generic_smp_call_function_single_interrupt+0xe/0x20
[  331.029616]  smp_call_function_single_interrupt+0x22/0x30
[  331.031125]  call_function_single_interrupt+0x90/0xa0
[  331.032819] RIP: 0010:default_idle+0x1b/0x180
[  331.035740] RSP: 0018:c936beb0 EFLAGS: 0206 ORIG_RAX: 
ff04
[  331.039007] RAX: 88007c838000 RBX: 88007c838000 RCX: 
0001
[  331.041332] RDX:  RSI: 0001 RDI: 
88007c838000
[  331.043227] RBP: c936bec0 R08:  R09: 
0001
[  331.045066] R10:  R11:  R12: 
0001
[  331.047367] R13: 88007c838000 R14:  R15: 


[  331.049232]  
[  331.050297]  arch_cpu_idle+0xa/0x10
[  331.051505]  default_idle_call+0x1e/0x30
[  331.052874]  do_idle+0x15a/0x1c0
[  331.053836]  cpu_startup_entry+0x18/0x20
[  331.054997]  start_secondary+0xed/0xf0
[  331.056183]  secondary_startup_64+0x9f/0x9f
[  331.057503] Code: 41 5c 5d f3 c3 e8 ea 75 29 00 85 c0 74 f1 8b 35 b0 
92 ca 01 85 f6 75 e7 48 c7 c6 67 8e c1 81 48 c7 c7 43 6d c1 81 e8 dc b4 
0b 00 <0f> ff eb d0 be 01 00 00 00 48 89 df e8 48 fe ff ff 85 c0 75 90

[  331.062261] ---[ end trace fd21f79668c6a046 ]---
[  331.063745] [ cut here ]
[  331.066503] kernel BUG at kernel/irq_work.c:135!
[  331.068374] invalid opcode:  [#1] SMP
[  331.070144] Modules linked in: dm_integrity async_xor xor async_tx 
dm_bufio dm_mod dax auth_rpcgss oid_registry nfsv4 dns_resolver nfs 
lockd grace sunrpc crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc 
aesni_intel aes_x86_64 crypto_simd glue_helper cryptd
[  331.076302] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW 
4.12.0-rc5 #1
[  331.078639] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
BIOS 1.9.1-1.fc24 04/01/2014

[  331.081124] task: 88007c838000 task.stack: c9368000
[  331.082646] RIP: 0010:irq_work_run_list+0x62/0x70
[  331.083892] RSP: 0018:88007fd03f70 EFLAGS: 00010206
[  331.085469] RAX: 88007c838000 RBX: 0200 RCX: 
0002
[  331.087449] RDX: 813097bc RSI: 0001 RDI: 
88007fd14cf8
[  331.089271] RBP: 88007fd03f90 R08:  R09: 

[  331.091093] R10:  R11: 81309760 R12: 

[  331.093018] R13: 0001 R14:  R15: 

[  331.094834] FS:  () GS:88007fd0() 
knlGS:

[  331.099328] CS:  0010 DS:  ES:  CR0: 80050033
[  331.101766] CR2: 025e7f50 CR3: 39c7c000 CR4: 
000406e0

[  331.103660] Call Trace:
[  331.104600]  
[  331.105484]  irq_work_run+0x18/0x40
[  331.106885]  flus

[PATCH] sd: add support for TCG OPAL self encrypting disks

2017-06-19 Thread Christoph Hellwig
Just wire up the generic TCG OPAL infrastructure to the SCSI disk driver
and the Security In/Out commands.

Note that I don't know of any actual SCSI disks that do support TCG OPAL,
but this is required to support ATA disks through libata.

Signed-off-by: Christoph Hellwig 
---
 drivers/ata/libata-scsi.c  |  3 +++
 drivers/scsi/sd.c  | 53 +-
 drivers/scsi/sd.h  |  2 ++
 include/scsi/scsi_device.h |  1 +
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0f788ad6f2f6..3e5ca2e894a4 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1321,6 +1321,9 @@ static int ata_scsi_dev_config(struct scsi_device *sdev,
 
blk_queue_flush_queueable(q, false);
 
+   if (dev->flags & ATA_DFLAG_TRUSTED)
+   sdev->security_supported = 1;
+
dev->sdev = sdev;
return 0;
 }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f9d1432d7cc5..5d32fd7d3a3e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -643,6 +644,26 @@ static void scsi_disk_put(struct scsi_disk *sdkp)
mutex_unlock(&sd_ref_mutex);
 }
 
+#ifdef CONFIG_BLK_SED_OPAL
+static int sd_sec_submit(void *data, u16 spsp, u8 secp, void *buffer,
+   size_t len, bool send)
+{
+   struct scsi_device *sdev = data;
+   u8 cdb[12] = { 0, };
+   int ret;
+
+   cdb[0] = send ? SECURITY_PROTOCOL_OUT : SECURITY_PROTOCOL_IN;
+   cdb[1] = secp;
+   put_unaligned_be16(spsp, &cdb[2]);
+   put_unaligned_be32(len, &cdb[6]);
+
+   ret = scsi_execute_req(sdev, cdb,
+   send ? DMA_TO_DEVICE : DMA_FROM_DEVICE,
+   buffer, len, NULL, SD_TIMEOUT, SD_MAX_RETRIES, NULL);
+   return ret <= 0 ? ret : -EIO;
+}
+#endif /* CONFIG_BLK_SED_OPAL */
+
 static unsigned char sd_setup_protect_cmnd(struct scsi_cmnd *scmd,
   unsigned int dix, unsigned int dif)
 {
@@ -1439,6 +1460,9 @@ static int sd_ioctl(struct block_device *bdev, fmode_t mode,
if (error)
goto out;
 
+   if (is_sed_ioctl(cmd))
+   return sed_ioctl(sdkp->opal_dev, cmd, p);
+
/*
 * Send SCSI addressing ioctls directly to mid level, send other
 * ioctls to block level and then onto mid level if they can't be
@@ -2994,6 +3018,20 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
sdkp->ws10 = 1;
 }
 
+static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+   struct scsi_device *sdev = sdkp->device;
+
+   if (!sdev->security_supported)
+   return;
+
+   if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
+   SECURITY_PROTOCOL_IN) == 1 &&
+   scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
+   SECURITY_PROTOCOL_OUT) == 1)
+   sdkp->security = 1;
+}
+
 /**
  * sd_revalidate_disk - called the first time a new disk is seen,
  * performs disk spin up, read_capacity, etc.
@@ -3047,6 +3085,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
sd_read_cache_type(sdkp, buffer);
sd_read_app_tag_own(sdkp, buffer);
sd_read_write_same(sdkp, buffer);
+   sd_read_security(sdkp, buffer);
}
 
sdkp->first_scan = 0;
@@ -3207,6 +3246,12 @@ static void sd_probe_async(void *data, async_cookie_t cookie)
 
sd_revalidate_disk(gd);
 
+   if (sdkp->security) {
+   sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);
+   if (sdkp->opal_dev)
+   sd_printk(KERN_NOTICE, sdkp, "supports TCG Opal\n");
+   }
+
sd_printk(KERN_NOTICE, sdkp, "Attached SCSI %sdisk\n",
  sdp->removable ? "removable " : "");
scsi_autopm_put_device(sdp);
@@ -3356,6 +3401,8 @@ static int sd_remove(struct device *dev)
 
sd_zbc_remove(sdkp);
 
+   free_opal_dev(sdkp->opal_dev);
+
blk_register_region(devt, SD_MINORS, NULL,
sd_default_probe, NULL, NULL);
 
@@ -3497,6 +3544,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
struct scsi_disk *sdkp = dev_get_drvdata(dev);
+   int ret;
 
if (!sdkp)  /* E.g.: runtime resume at the start of sd_probe() */
return 0;
@@ -3505,7 +3553,10 @@ static int sd_resume(struct device *dev)
return 0;
 
sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-   return sd_start_stop_device(sdkp, 1);
+   ret = sd_start_stop_device(sdkp, 1);
+   if (!ret)
+   opal_unlock_from_suspend(sdkp->opal_dev);
+   return ret;
 }
 
 /**
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 61d02e

TCG Opal support for sd.c

2017-06-19 Thread Christoph Hellwig
Hi all,

this patch adds TCG Opal support to the scsi disk driver.  As far as I know
only SATA disks actually support OPAL, and as Martin fears RSOC-related
regressions, the support is conditional on a flag in struct scsi_device,
which so far only libata sets.

Because of that we should merge the patch through the libata tree with ACKs
from the SCSI side.

Changes from the previous version:
  - add the new security_supported flag
  - add a call to opal_unlock_from_suspend


Re: [PATCH rfc 01/30] nvme: Add admin connect request queue

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 10:49:15AM +0300, Sagi Grimberg wrote:
> However, you raise a valid point, I think I added this before we
> had the queue_is_ready protection, which will reject the command
> if the queue is not LIVE (unless it's a connect). I think the reason
> it's still in is that I tested this with loop, which doesn't have
> a per-queue state machine.

Yeah.

> I'm still wondering if it's a good idea to rely on the transport
> queue state to reject non-connect requests on non-LIVE queues.
> if/when we introduce a queue representation to the core and we
> drive the state machine there, then we could actually rely on it
> (I do have some code for it, but it's a pretty massive change which
> cannot be added in an incremental fashion).

I suspect moving the state machine to the core is a good idea.  Note that
the current nvme_rdma_queue_is_ready hack actually seems a bit too simple -
even after the connect we should only allow get/set Property.  Nevermind
the additional complications if/when authentication is implemented.


Re: [PATCH rfc 02/30] nvme-rdma: Don't alloc/free the tagset on reset

2017-06-19 Thread Christoph Hellwig
> We can do that, but this tries to eliminate duplicate code as
> much as possible. It's not like the convention is unprecedented...

It's fairly nasty to follow.  OTOH I like your overall cleanup,
so I guess I shouldn't complain about the initial patches too much
but just possibly do another pass after you are done..


Re: [PATCH rfc 04/30] nvme-rdma: introduce configure/destroy io queues

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 05/30] nvme-rdma: introduce nvme_rdma_start_queue

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:39PM +0300, Sagi Grimberg wrote:
> This should pair with nvme_rdma_stop_queue. While this
> is not a complete 1:1 reverse, it still pairs up pretty
> well because in fabrics we don't have a disconnect capsule
> but simply tear down the transport association.

Looks like we are going to get one sooner or later.  Not that I'd
like to send them if we can avoid it.

But the patch itself looks fine to me.


Re: [PATCH rfc 06/30] nvme-rdma: rename nvme_rdma_init_queue to nvme_rdma_alloc_queue

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 07/30] nvme-rdma: make stop/free queue receive a ctrl and qid struct

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 08/30] nvme-rdma: cleanup error path in controller reset

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:42PM +0300, Sagi Grimberg wrote:
> No need to queue an extra work item to indirect the controller
> uninit and put the final reference.

Maybe my memory is a little vague, but didn't we need the work_struct
for something?  At least it would serialize all the removals for example.


Re: [PATCH rfc 09/30] nvme: Move queue_count to the nvme_ctrl

2017-06-19 Thread Christoph Hellwig
Looks fine.  I'd be happy to take this as an early cleanup.

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 10/30] nvme: Add admin_tagset pointer to nvme_ctrl

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:44PM +0300, Sagi Grimberg wrote:
> Will be used when we centralize control flows. only
> rdma for now.

Should we at some point move the tag_sets themselves to the generic
ctrl instead of just pointers?


Re: [PATCH rfc 11/30] nvme: move controller cap to struct nvme_ctrl

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:45PM +0300, Sagi Grimberg wrote:
> Will be used in centralized code later. only rdma
> for now.

It would be great to initialize it early on for all transports, and
then just use the stored field instead of re-reading CAP in various
places.


Re: [PATCH rfc 12/30] nvme-rdma: disable controller in reset instead of shutdown

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 13/30] nvme-rdma: move queue LIVE/DELETING flags settings to queue routines

2017-06-19 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 14/30] nvme-rdma: stop queues instead of simply flipping their state

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 15/30] nvme-rdma: don't check queue state for shutdown/disable

2017-06-19 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH rfc 16/30] nvme-rdma: move tagset allocation to a dedicated routine

2017-06-19 Thread Christoph Hellwig
Looks fine, but how about doing this early in the series?  There's
quite a bit of churn around this code.


Re: [PATCH rfc 17/30] nvme-rdma: move admin specific resources to alloc_queue

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:51PM +0300, Sagi Grimberg wrote:
> We're trying to make admin queue configuration generic, so
> move the rdma specifics to the queue allocation (based on
> the queue index passed).

Needs at least a comment, and probably factoring into a little
helper.  And once we have that helper it sounds like this might
be a good callout from the core?


Re: [PATCH rfc 20/30] nvme: add err, reconnect and delete work items to nvme core

2017-06-19 Thread Christoph Hellwig
On Sun, Jun 18, 2017 at 06:21:54PM +0300, Sagi Grimberg wrote:
> We intend for these handlers to become generic, thus, add them to
> the nvme core controller struct.

Do you remember why we actually need all the different work items?

We need err_work to recover from RDMA QP-level errors.  But how
is it so different from a reset in that respect?  Similarly why
do we need reset to be different from reconnect?  Especially as
reconnect sort of is the reset of fabrics.


Re: [PATCH rfc 29/30] nvme: add sed-opal ctrl manipulation in admin configuration

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 11:03:36AM +0300, Sagi Grimberg wrote:
>
>> The subject sounds odd and it could use a changelog.  But I'd love to
>> pick this change up ASAP as it's the right thing to do..
>
> How? Where would you place it? There is no nvme_configure_admin_queue in
> nvme-core.

Doh.  Yeah, let's keep it where it is.


Re: [PATCH rfc 25/30] nvme: move control plane handling to nvme core

2017-06-19 Thread Christoph Hellwig
> +static void nvme_free_io_queues(struct nvme_ctrl *ctrl)
> +{
> + int i;
> +
> + for (i = 1; i < ctrl->queue_count; i++)
> + ctrl->ops->free_hw_queue(ctrl, i);
> +}
> +
> +void nvme_stop_io_queues(struct nvme_ctrl *ctrl)
> +{
> + int i;
> +
> + for (i = 1; i < ctrl->queue_count; i++)
> + ctrl->ops->stop_hw_queue(ctrl, i);
> +}
> +EXPORT_SYMBOL_GPL(nvme_stop_io_queues);

At least for PCIe this is going to work very differently, so I'm not
sure this part makes so much sense in the core.  Maybe in Fabrics?
Or at least make the callouts operate on all I/O queues, which would
suit PCIe a lot more.

> + error = ctrl->ops->start_hw_queue(ctrl, 0);
> + if (error)
> + goto out_cleanup_connect_queue;
> +
> + error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
> + if (error) {
> + dev_err(ctrl->device,
> + "prop_get NVME_REG_CAP failed\n");
> + goto out_cleanup_connect_queue;
> + }
> +
> + ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
> +
> + error = nvme_enable_ctrl(ctrl, ctrl->cap);
> + if (error)
> + goto out_cleanup_connect_queue;

I'm not sure this ordering is going to work for PCIe..


Re: [PATCH 04/10] blk-mq-sched: unify request finished methods

2017-06-19 Thread Paolo Valente

> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig  ha 
> scritto:
> 
> No need to have two different callouts of bfq vs kyber.
> 
> Signed-off-by: Christoph Hellwig 
> ---
> block/bfq-iosched.c  |  6 +++---
> block/blk-mq.c   | 11 ---
> block/kyber-iosched.c|  8 +++-
> include/linux/elevator.h |  3 +--
> 4 files changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index ed93da2462ab..4f69e39c2f89 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4290,7 +4290,7 @@ static void bfq_put_rq_priv_body(struct bfq_queue *bfqq)
>   bfq_put_queue(bfqq);
> }
> 
> -static void bfq_put_rq_private(struct request_queue *q, struct request *rq)
> +static void bfq_finish_request(struct request *rq)
> {
>   struct bfq_queue *bfqq = RQ_BFQQ(rq);
>   struct bfq_data *bfqd = bfqq->bfqd;
> @@ -4324,7 +4324,7 @@ static void bfq_put_rq_private(struct request_queue *q, 
> struct request *rq)
>*/
> 
>   if (!RB_EMPTY_NODE(&rq->rb_node))
> - bfq_remove_request(q, rq);
> + bfq_remove_request(rq->q, rq);
>   bfq_put_rq_priv_body(bfqq);
>   }
> 
> @@ -4951,7 +4951,7 @@ static struct elv_fs_entry bfq_attrs[] = {
> static struct elevator_type iosched_bfq_mq = {
>   .ops.mq = {
>   .get_rq_priv= bfq_get_rq_private,
> - .put_rq_priv= bfq_put_rq_private,
> + .finish_request = bfq_finish_request,
>   .exit_icq   = bfq_exit_icq,
>   .insert_requests= bfq_insert_requests,
>   .dispatch_request   = bfq_dispatch_request,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1a45c287db64..9df7e0394a48 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -437,19 +437,16 @@ void blk_mq_free_request(struct request *rq)
>   struct request_queue *q = rq->q;
>   struct elevator_queue *e = q->elevator;
> 
> - if (rq->rq_flags & RQF_ELVPRIV) {
> - if (e && e->type->ops.mq.put_rq_priv)
> - e->type->ops.mq.put_rq_priv(q, rq);
> + if (rq->rq_flags & (RQF_ELVPRIV | RQF_QUEUED)) {
> + if (e && e->type->ops.mq.finish_request)
> + e->type->ops.mq.finish_request(rq);
>   if (rq->elv.icq) {
>   put_io_context(rq->elv.icq->ioc);
>   rq->elv.icq = NULL;
>   }
>   }
> 
> - if ((rq->rq_flags & RQF_QUEUED) && e && e->type->ops.mq.put_request)
> - e->type->ops.mq.put_request(rq);
> - else
> - blk_mq_finish_request(rq);
> + blk_mq_finish_request(rq);
> }

My only concern is that this last change is, and will remain,
functionally equivalent to the previous version of the code (the choice
between put_request and finish_request).  In fact, bfq doesn't use
put_request at all, and the former bfq_put_rq_private assumed that
rq->elv.priv was still properly set.  If this change preserves that
invariant, as it seems to me it does, then, for the part related to bfq,

Acked-by: Paolo Valente 


> EXPORT_SYMBOL_GPL(blk_mq_free_request);
> 
> diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
> index b9faabc75fdb..2557b399f0a8 100644
> --- a/block/kyber-iosched.c
> +++ b/block/kyber-iosched.c
> @@ -446,13 +446,11 @@ static struct request *kyber_get_request(struct 
> request_queue *q,
>   return rq;
> }
> 
> -static void kyber_put_request(struct request *rq)
> +static void kyber_finish_request(struct request *rq)
> {
> - struct request_queue *q = rq->q;
> - struct kyber_queue_data *kqd = q->elevator->elevator_data;
> + struct kyber_queue_data *kqd = rq->q->elevator->elevator_data;
> 
>   rq_clear_domain_token(kqd, rq);
> - blk_mq_finish_request(rq);
> }
> 
> static void kyber_completed_request(struct request *rq)
> @@ -816,7 +814,7 @@ static struct elevator_type kyber_sched = {
>   .init_hctx = kyber_init_hctx,
>   .exit_hctx = kyber_exit_hctx,
>   .get_request = kyber_get_request,
> - .put_request = kyber_put_request,
> + .finish_request = kyber_finish_request,
>   .completed_request = kyber_completed_request,
>   .dispatch_request = kyber_dispatch_request,
>   .has_work = kyber_has_work,
> diff --git a/include/linux/elevator.h b/include/linux/elevator.h
> index 0e306c5a86d6..4acea351d43f 100644
> --- a/include/linux/elevator.h
> +++ b/include/linux/elevator.h
> @@ -105,7 +105,7 @@ struct elevator_mq_ops {
>   void (*request_merged)(struct request_queue *, struct request *, enum 
> elv_merge);
>   void (*requests_merged)(struct request_queue *, struct request *, 
> struct request *);
>   struct request *(*get_request)(struct request_queue *, unsigned int, 
> struct blk_mq_alloc_data *);
> - void (*put_request)(struct request *);
> + vo

Re: [PATCH 07/10] bfq-iosched: fix NULL ioc check in bfq_get_rq_private

2017-06-19 Thread Paolo Valente

> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig  ha 
> scritto:
> 
> icq_to_bic is a container_of operation, so we need to check for NULL
> before it.  Also move the check outside the spinlock while we're at
> it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
> block/bfq-iosched.c | 15 +--
> 1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 4f69e39c2f89..f037b005faa1 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4398,16 +4398,17 @@ static int bfq_get_rq_private(struct request_queue 
> *q, struct request *rq,
> struct bio *bio)
> {
>   struct bfq_data *bfqd = q->elevator->elevator_data;
> - struct bfq_io_cq *bic = icq_to_bic(rq->elv.icq);
> + struct bfq_io_cq *bic;
>   const int is_sync = rq_is_sync(rq);
>   struct bfq_queue *bfqq;
>   bool new_queue = false;
>   bool split = false;
> 
> - spin_lock_irq(&bfqd->lock);

Thanks for taking a step I feared to take here (basically out of cowardice) ...

> + if (!rq->elv.icq)
> + return 1;
> + bic = icq_to_bic(rq->elv.icq);
> 
> - if (!bic)
> - goto queue_fail;
> + spin_lock_irq(&bfqd->lock);
> 
>   bfq_check_ioprio_change(bic, bio);
> 
> @@ -4465,13 +4466,7 @@ static int bfq_get_rq_private(struct request_queue *q, 
> struct request *rq,
>   bfq_handle_burst(bfqd, bfqq);
> 
>   spin_unlock_irq(&bfqd->lock);
> -
>   return 0;
> -
> -queue_fail:
> - spin_unlock_irq(&bfqd->lock);
> -
> - return 1;
> }
> 
> static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq)
> -- 
> 2.11.0
> 

Acked-by: Paolo Valente 



Re: [PATCH 09/10] blk-mq-sched: unify request prepare methods

2017-06-19 Thread Paolo Valente

> Il giorno 16 giu 2017, alle ore 18:15, Christoph Hellwig  ha 
> scritto:
> 
> This patch makes sure we always allocate requests in the core blk-mq
> code and use a common prepare_request method to initialize them for
> both mq I/O schedulers.  For Kyber an additional limit_depth method
> is added that is called before allocating the request.
> 
> Also because none of the initializations can really fail, the new method
> does not return an error - instead the bfq finish method is hardened
> to deal with the no-IOC case.
> 
> Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
> scheduling code, as RQF_ELVPRIV is all that is needed now.
> 
> Signed-off-by: Christoph Hellwig 
> ---
> block/bfq-iosched.c  | 19 ---
> block/blk-mq.c   | 22 ++
> block/kyber-iosched.c| 23 +++
> include/linux/elevator.h |  4 ++--
> 4 files changed, 31 insertions(+), 37 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index f037b005faa1..60d32700f104 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4292,8 +4292,14 @@ static void bfq_put_rq_priv_body(struct bfq_queue 
> *bfqq)
> 
> static void bfq_finish_request(struct request *rq)
> {
> - struct bfq_queue *bfqq = RQ_BFQQ(rq);
> - struct bfq_data *bfqd = bfqq->bfqd;
> + struct bfq_queue *bfqq;
> + struct bfq_data *bfqd;
> +
> + if (!rq->elv.icq)
> + return;
> +

If this is a rq dispatched from a bfqq (or even a request still in the
scheduler), then just exiting here will break bfq state seriously.
However, I guess that this case can never occur.

If this is a rq dispatched from the bfqd->dispatch list, then exiting
here should only unbalance the bfqd->rq_in_driver counter.

I guess that in both cases the missing reset of the content of
rq->elv.priv (if the function returns here) wouldn't cause problems.

If I can be of further help in some way, I'm willing to help.

Thanks,
Paolo

> + bfqq = RQ_BFQQ(rq);
> + bfqd = bfqq->bfqd;
> 
>   if (rq->rq_flags & RQF_STARTED)
>   bfqg_stats_update_completion(bfqq_group(bfqq),
> @@ -4394,9 +4400,9 @@ static struct bfq_queue 
> *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
> /*
>  * Allocate bfq data structures associated with this request.
>  */
> -static int bfq_get_rq_private(struct request_queue *q, struct request *rq,
> -   struct bio *bio)
> +static void bfq_prepare_request(struct request *rq, struct bio *bio)
> {
> + struct request_queue *q = rq->q;
>   struct bfq_data *bfqd = q->elevator->elevator_data;
>   struct bfq_io_cq *bic;
>   const int is_sync = rq_is_sync(rq);
> @@ -4405,7 +4411,7 @@ static int bfq_get_rq_private(struct request_queue *q, 
> struct request *rq,
>   bool split = false;
> 
>   if (!rq->elv.icq)
> - return 1;
> + return;
>   bic = icq_to_bic(rq->elv.icq);
> 
>   spin_lock_irq(&bfqd->lock);
> @@ -4466,7 +4472,6 @@ static int bfq_get_rq_private(struct request_queue *q, 
> struct request *rq,
>   bfq_handle_burst(bfqd, bfqq);
> 
>   spin_unlock_irq(&bfqd->lock);
> - return 0;
> }
> 
> static void bfq_idle_slice_timer_body(struct bfq_queue *bfqq)
> @@ -4945,7 +4950,7 @@ static struct elv_fs_entry bfq_attrs[] = {
> 
> static struct elevator_type iosched_bfq_mq = {
>   .ops.mq = {
> - .get_rq_priv= bfq_get_rq_private,
> + .prepare_request= bfq_prepare_request,
>   .finish_request = bfq_finish_request,
>   .exit_icq   = bfq_exit_icq,
>   .insert_requests= bfq_insert_requests,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2f380ab7a603..81d05c19d4b3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -298,16 +298,11 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
>* Flush requests are special and go directly to the
>* dispatch list.
>*/
> - if (!op_is_flush(op) && e->type->ops.mq.get_request) {
> - rq = e->type->ops.mq.get_request(q, op, data);
> - if (rq)
> - rq->rq_flags |= RQF_QUEUED;
> - goto allocated;
> - }
> + if (!op_is_flush(op) && e->type->ops.mq.limit_depth)
> + e->type->ops.mq.limit_depth(op, data);
>   }
> 
>   rq = __blk_mq_alloc_request(data, op);
> -allocated:
>   if (!rq) {
>   blk_queue_exit(q);
>   return NULL;
> @@ -315,17 +310,12 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
> 
>   if (!op_is_flush(op)) {
>   rq->elv.icq = NULL;
> - if (e && e->type->ops.mq.get_rq_priv) {
> + if (e && e->type->ops.mq.prepare_request) {
>   if (e->type->icq_cache &

Re: [PATCH rfc 10/30] nvme: Add admin_tagset pointer to nvme_ctrl

2017-06-19 Thread Sagi Grimberg



>> Will be used when we centralize control flows. only
>> rdma for now.
>
> Should we at some point move the tag_sets themselves to the generic
> ctrl instead of just pointers?

We can easily do that, but the tagsets are heavily read in the hot path,
so I was careful not to completely move them to nvme_ctrl, which is not
arranged for it at all (and transports keep it far back in their
struct).

Once we actually get some of this merged we should look into arranging
the transport controllers to be:

struct transport_ctrl {
	/* transport specific, accessed in the hot path */
	...

	struct nvme_ctrl ctrl; /* hot members first */

	/* transport specific bookkeeping */
	...
};


Re: [PATCH rfc 20/30] nvme: add err, reconnect and delete work items to nvme core

2017-06-19 Thread Sagi Grimberg



>> We intend for these handlers to become generic, thus, add them to
>> the nvme core controller struct.
>
> Do you remember why we actually need all the different work items?

I remember documenting it at some point, but either it got lost
somewhere or I don't remember...

> We need err_work to recover from RDMA QP-level errors.

Transport errors are usually detected in soft-irq context, so we queue
up err_work to:
1. stop + drain queues
2. fail inflight I/O.
3. queue delayed reconnect (reconnect_work)

> But how is it so different from a reset in that respect?  Similarly why
> do we need reset to be different from reconnect?  Especially as
> reconnect sort of is the reset of fabrics.

Hmm, resets and reconnects are indeed similar, but one difference
is that in resets we do not fast-fail inflight I/O, as the expectation
is that the recovery should be immediate (this also matches pci in
that respect), while we consider reconnect something that can last
for a while, so we fail fast to allow failover and continue reconnect
attempts quietly. Another difference is that a failed reset results in
controller removal, while in reconnects we have to exhaust
ctrl_loss_tmo.

We could change things, like merging reconnect and reset and introducing
a concept of "on-host reset". Not sure it will be any less confusing
though...


Re: dm integrity tests crash kernel (4.12-rc5)

2017-06-19 Thread Mike Snitzer
On Mon, Jun 19 2017 at  8:11am -0400,
Ondrej Kozina  wrote:

> same log again, hopefully prettier format. Sorry:
> 
> [  330.980914] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
> [  330.980923] [ cut here ]
> [  330.982627] WARNING: CPU: 1 PID: 0 at kernel/locking/lockdep.c:2748 
> trace_hardirqs_on_caller+0x107/0x180
...
> [  331.011968] Call Trace:
> [  331.012806]  
> [  331.013684]  trace_hardirqs_on+0xd/0x10
> [  331.014732]  _raw_spin_unlock_irq+0x27/0x30
> [  331.015840]  submit_flush_bio+0x4e/0x80 [dm_integrity]
> [  331.017157]  do_endio_flush+0x41/0x70 [dm_integrity]
> [  331.018476]  dec_in_flight+0x59/0x110 [dm_integrity]
> [  331.019767]  integrity_end_io+0x5e/0x70 [dm_integrity]
> [  331.020965]  bio_endio+0x7c/0x1a0
> [  331.021917]  blk_update_request+0x9f/0x3d0
> [  331.023050]  blk_mq_end_request+0x15/0x60
> [  331.024224]  lo_complete_rq+0x2b/0x80
> [  331.025406]  __blk_mq_complete_request_remote+0xe/0x10
> [  331.026813]  flush_smp_call_function_queue+0x4f/0x110
> [  331.028173]  generic_smp_call_function_single_interrupt+0xe/0x20
> [  331.029616]  smp_call_function_single_interrupt+0x22/0x30
> [  331.031125]  call_function_single_interrupt+0x90/0xa0
> [  331.032819] RIP: 0010:default_idle+0x1b/0x180

Looks like submit_flush_bio() is disabling/enabling interrupts from
interrupt context.  Ondrej, does this patch fix the issue?

diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 4ab10cf..93b1810 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1105,10 +1105,13 @@ static void schedule_autocommit(struct dm_integrity_c *ic)
 static void submit_flush_bio(struct dm_integrity_c *ic, struct dm_integrity_io *dio)
 {
struct bio *bio;
-   spin_lock_irq(&ic->endio_wait.lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&ic->endio_wait.lock, flags);
bio = dm_bio_from_per_bio_data(dio, sizeof(struct dm_integrity_io));
bio_list_add(&ic->flush_bio_list, bio);
-   spin_unlock_irq(&ic->endio_wait.lock);
+   spin_unlock_irqrestore(&ic->endio_wait.lock, flags);
+
queue_work(ic->commit_wq, &ic->commit_work);
 }
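
For reference, this is the general pattern the fix applies; a minimal
self-contained sketch with a hypothetical context struct, not the
dm-integrity code itself:

#include <linux/spinlock.h>
#include <linux/list.h>

/* hypothetical example structures, for illustration only */
struct work_ctx {
	spinlock_t lock;
	struct list_head items;
};

struct item {
	struct list_head node;
};

static void queue_item_irq_safe(struct work_ctx *ctx, struct item *it)
{
	unsigned long flags;

	/* save the current interrupt state before disabling interrupts */
	spin_lock_irqsave(&ctx->lock, flags);
	list_add_tail(&it->node, &ctx->items);
	/*
	 * restore the saved state; spin_unlock_irq() would unconditionally
	 * re-enable interrupts, which is wrong when the caller may itself
	 * run in hard-irq context, as the bio endio path above does
	 */
	spin_unlock_irqrestore(&ctx->lock, flags);
}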
 


Re: [PATCH 11/11] nvme: add support for streams and directives

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:25 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote:
>> I have two samples here, and I just tested, and both of them want it
>> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
>> we're better off ensuring we do allocate them globally.
> 
> That's clearly against the spec.  I'd say go to your vendor and get
> a refund, as we Linux folks (Martin and I) fought for this so that
> we would not have to do the explicit allocations.
> 
> Another quote from the spec:
> 
> "Streams are opened by the controller when the host issues a write
>  command that specifies a stream identifier that is not currently open.
>  While a stream is open the controller maintains context for that stream
>  (e.g., buffers for associated data). The host may determine the streams 
>  that are open using the Get Status operation."
> 
> And I think this is very important - otherwise you need to either
> allocate the streams beforehand as your earlier patches did (and we take
> away the resources from the 99% of the users not using write life
> hints), or we need the lazy allocation scheme.  And for that to be
> efficient it probably needs to be lazy per-stream allocation.  That's
> why we got the implicit open in after all.

These are just samples, so no refund possible! As you might remember,
I was pretty adamant on not wanting explicit open/close as well, back
in those days. I'll check with the vendor and see what's going on.

-- 
Jens Axboe



Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Ming Lei
On Mon, Jun 19, 2017 at 3:26 PM, Christoph Hellwig  wrote:
> pktcdvd is a make_request based stacking driver and thus doesn't have any
> addressing limits on it's own.  It also doesn't use bio_data() or
> page_address(), so it doesn't need a lowmem bounce either.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/pktcdvd.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 26c04baae967..7f2eca31eb6a 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2413,8 +2413,6 @@ static blk_qc_t pkt_make_request(struct request_queue 
> *q, struct bio *bio)
> char b[BDEVNAME_SIZE];
> struct bio *split;
>
> -   blk_queue_bounce(q, &bio);
> -
> blk_queue_split(q, &bio);
>
> pd = q->queuedata;
> --
> 2.11.0
>

blk_queue_make_request() has set bounce for any highmem page for a long
time, and in theory this patch might cause a regression on 32-bit arches
when the controller can't really handle highmem pages (especially in the
case of PAE), so we may need to be careful about this change, especially
on some old hardware.

thanks,
Ming Lei


Re: [PATCH 11/11] nvme: add support for streams and directives

2017-06-19 Thread Jens Axboe
On 06/19/2017 08:31 AM, Jens Axboe wrote:
> On 06/19/2017 12:25 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 09:11:30AM -0600, Jens Axboe wrote:
>>> I have two samples here, and I just tested, and both of them want it
>>> assigned with nsid=0xffffffff or they will fail the writes... So I'd say
>>> we're better off ensuring we do allocate them globally.
>>
>> That's clearly against the spec.  I'd say go to your vendor and get
>> a refund, as we Linux folks (Martin and I) fought for this so that
>> we would not have to do the explicit allocations.
>>
>> Another quote from the spec:
>>
>> "Streams are opened by the controller when the host issues a write
>>  command that specifies a stream identifier that is not currently open.
>>  While a stream is open the controller maintains context for that stream
>>  (e.g., buffers for associated data). The host may determine the streams 
>>  that are open using the Get Status operation."
>>
>> And I think this is very important - otherwise you need to either
>>  allocate the streams beforehand as your earlier patches did (and we take
>> away the resources from the 99% of the users not using write life
>> hints), or we need the lazy allocation scheme.  And for that to be
>> efficient it probably needs to be lazy per-stream allocation.  That's
>> why we got the implicit open in after all.
> 
> These are just samples, so no refund possible! As you might remember,
> I was pretty adamant on not wanting explicit open/close as well, back
> in those days. I'll check with the vendor and see what's going on.

Looking at it a bit more closely - there's a difference between
assigning X number of streams (allocating) for use by the subsystem or
per-ns, and having to manually open them. So I don't necessarily think
there's a problem here, neither for us nor for the device.

If I load the driver as it currently stands, and get streams params
after load:

Number of Open Streams in NVM subsystem  (NOSNS): 0
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace   (NOSN): 0

4 streams allocated, 0 currently open. Then I generate a single write to
some stream ID:

Number of Open Streams in NVM subsystem  (NOSNS): 1
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace   (NOSN): 1

and we now see a single stream actually open, implicitly, by the
device. Fire off another single stream write, this time to another
stream, and we get:

Number of Open Streams in NVM subsystem  (NOSNS): 2
Allocated Stream Resources (ASR): 4
Number of Open Streams in Namespace   (NOSN): 2

and stream status correctly reflects that I did a WRITE_HINT_SHORT and
WRITE_HINT_LONG write:

Open Stream Count  : 2
Stream Identifier 01 : 1
Stream Identifier 02 : 3

Awaiting clarification from the vendor on what their position/view of this
is. Reading the spec, I do agree that it leans towards only needing
allocation for a specific name space, but it doesn't explicitly say that
you can use any of the available streams, without allocation, if they
haven't been assigned to a specific name space. I would interpret it
that way, though.

-- 
Jens Axboe



Re: [PATCH 01/11] fs: add support for an inode to carry write hint related data

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:26 AM, Christoph Hellwig wrote:
>> +/*
>> + * Write life time hint values.
>> + */
>> +enum rw_hint {
>> +	WRITE_LIFE_NONE = 0,
>> +	WRITE_LIFE_SHORT,
>> +	WRITE_LIFE_MEDIUM,
>> +	WRITE_LIFE_LONG,
>> +	WRITE_LIFE_EXTREME,
>> +};
>> +
>> +#define RW_HINT_MASK	0x7	/* 3 bits */
> 
> FYI, exposing enums in a uapi is always a bit problematic, due to
> different ABI rules.  It might be better to make these explicit defines
> at least for the uapi.
> 
> Btw, I think it might make sense to merge this with patch 5.

OK, I'll change the uapi to be defines.
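
For illustration, the uapi side could then mirror the enum with explicit
defines, roughly like this (a sketch only; final names and placement to
be decided):

/* sketch: uapi defines mirroring enum rw_hint, values fixed by the ABI */
#define WRITE_LIFE_NONE		0
#define WRITE_LIFE_SHORT	1
#define WRITE_LIFE_MEDIUM	2
#define WRITE_LIFE_LONG		3
#define WRITE_LIFE_EXTREME	4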

-- 
Jens Axboe



Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:27 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
>> Add four flags for the pwritev2(2) system call, allowing an application
>> to give the kernel a hint about what on-media life times can be
>> expected from a given write.
>>
>> The intent is for these values to be relative to each other, no
>> absolute meaning should be attached to these flag names.
>>
>> Set aside 3 bits in the iocb flags structure to carry this information
>> over from the pwritev2 RWF_WRITE_LIFE_* flags.
> 
> What is the strong use case for the per-I/O flags?  I'd much rather
> stick to fcntl only for now if we can.

Fine, I guess I should just have dusted off the 2 year old patchset,
as that was _exactly_ what that did.

-- 
Jens Axboe



Re: [PATCH 05/11] fs: add fcntl() interface for setting/getting write life time hints

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:28 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:48PM -0600, Jens Axboe wrote:
>> We have a pwritev2(2) interface based on passing in flags. Add an
>> fcntl interface for querying these flags, and also for setting them
>> as well:
>>
>> F_GET_RW_HINTReturns the read/write hint set. Right now it
>>  will be one of the WRITE_LIFE_* values.
>>
>> F_SET_RW_HINTPass in rw_hint type to set the read/write hint.
>>  Only WRITE_LIFE_* hints are currently supported.
>>  Returns 0 on succes, -1 otherwise.
>>
>> Sample program testing/implementing basic setting/getting of write
>> hints is below.
>>
>> /*
>>  * writehint.c: check or set a file/inode write hint
>>  */
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>>
>> static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
>>  "WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG",
>>  "WRITE_LIFE_EXTREME" };
>>
>> int main(int argc, char *argv[])
>> {
>>  int hint = -1, fd, ret;
>>
>>  if (argc < 2) {
>>  fprintf(stderr, "%s: dev \n", argv[0]);
>>  return 1;
>>  }
>>
>>  fd = open(argv[1], O_RDONLY);
>>  if (fd < 0) {
>>  perror("open");
>>  return 2;
>>  }
>>
>>  if (argc > 2)
>>  hint = atoi(argv[2]);
>>
>>  if (hint == -1) {
>>  ret = fcntl(fd, F_GET_RW_HINT);
>>  if (ret < 0) {
>>  perror("fcntl: F_GET_RW_HINT");
>>  return 3;
>>  }
>>  hint = ret;
>>  } else {
>>  ret = fcntl(fd, F_SET_RW_HINT, hint);
>>  if (ret < 0) {
>>  perror("fcntl: F_SET_RW_HINT");
>>  return 4;
>>  }
>>  }
>>
>>  printf("%s: %shint %s\n", argv[1], hint != -1 ? "set " : "", str[hint]);
>>  close(fd);
>>  return 0;
>> }
>>
>> Signed-off-by: Jens Axboe 
>> ---
>>  fs/fcntl.c | 38 ++
>>  include/uapi/linux/fcntl.h |  6 ++
>>  2 files changed, 44 insertions(+)
>>
>> diff --git a/fs/fcntl.c b/fs/fcntl.c
>> index f4e7267d117f..417ce336c875 100644
>> --- a/fs/fcntl.c
>> +++ b/fs/fcntl.c
>> @@ -243,6 +243,40 @@ static int f_getowner_uids(struct file *filp, unsigned 
>> long arg)
>>  }
>>  #endif
>>  
>> +long fcntl_rw_hint(struct file *file, unsigned int cmd, unsigned long arg)
> 
> The unsigned long arg is a little annoying because it will de different
> size for 32-bit vs 64-vit architectures.
> 
> How about just doing a get_user and use a fixed-size 64-bit value?

It's just passing it in from fcntl(); my intent was for it to be a u32 on the
uapi side. But we can make it a 64-bit type, since we're going away from
the enum and making it a proper type.
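
A rough sketch of the get_user variant being suggested (hypothetical
code; assumes an i_write_hint field on the inode from this series,
value validation omitted):

static long fcntl_rw_hint(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
	u64 h;

	switch (cmd) {
	case F_GET_RW_HINT:
		h = inode->i_write_hint;
		/* a fixed-size value, identical on 32- and 64-bit ABIs */
		return put_user(h, argp);
	case F_SET_RW_HINT:
		if (get_user(h, argp))
			return -EFAULT;
		inode->i_write_hint = h;
		return 0;
	default:
		return -EINVAL;
	}
}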

-- 
Jens Axboe



Re: [PATCH 06/11] fs: add O_DIRECT support for sending down write life time hints

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:28 AM, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 01:59:49PM -0600, Jens Axboe wrote:
>> Reviewed-by: Andreas Dilger 
>> Signed-off-by: Jens Axboe 
>> ---
>>  fs/block_dev.c | 2 ++
>>  fs/direct-io.c | 2 ++
>>  fs/iomap.c | 1 +
>>  3 files changed, 5 insertions(+)
>>
>> diff --git a/fs/block_dev.c b/fs/block_dev.c
>> index 51959936..b9222a9d285e 100644
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -239,6 +239,7 @@ __blkdev_direct_IO_simple(struct kiocb *iocb, struct 
>> iov_iter *iter,
>>  should_dirty = true;
>>  } else {
>>  bio.bi_opf = dio_bio_write_op(iocb);
>> +bio.bi_opf |= write_hint_to_opf(iocb_write_hint(iocb));
> 
> Move this into dio_bio_write_op instead of duplicating it?

Good cleanup, will do.


-- 
Jens Axboe



Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote:
> blk_queue_make_request() has set bounce for any highmem page for a long
> time, and in theory this patch might cause a regression on 32-bit arches
> when the controller can't really handle highmem pages (especially in the
> case of PAE), so we may need to be careful about this change, especially
> on some old hardware.

Which controller?


Re: [PATCH 11/11] nvme: add support for streams and directives

2017-06-19 Thread Jens Axboe
On 06/19/2017 12:35 AM, Christoph Hellwig wrote:
> Can you add linux-nvme for the next repost?
> 
> As said before I think we should rely on implicit streams allocation,
> as that will make the whole patch a lot simpler, and it solves the issue
> that your current patch will take away your 4 streams from the general
> pool on every controller that supports streams, which isn't optimal.

The only thing it'll change in the patch is the removal of
nvme_streams_allocate(), which is 20 lines of code. So I don't
think it'll simplify things a lot. The patch is already very simple.
But if we don't need to allocate the streams, then of course it should
just go.

>> +	streamid = (req->cmd_flags & REQ_WRITE_LIFE_MASK)
>> +			>> __REQ_WRITE_HINT_SHIFT;
> 
> Can we add a helper to convert the REQ_* flags back to the hint next
> to where we have the helper to do the reverse one?

OK, will do.

>> +if (streamid == WRITE_LIFE_NONE)
>> +return;
>> +
>> +/* for now just round-robin, do something more clever later */
>> +if (streamid > ctrl->nr_streams)
>> +streamid = (streamid % ctrl->nr_streams) + 1;
> 
> I thought you were going to fix that to do more intelligent collapsing?

Sure, if you want me to do that now, I can. I propose:

- With 4 streams, direct mapping.
- With 3 streams, collapse two longest life times.
- With 2 streams, collapse short+medium, and long+extreme.
- With 1 stream, don't use streams.

We could potentially still use streams with just 1 stream, since we
have the default of no stream as well. But I don't think it makes
sense at that point.
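
A sketch of that collapsing, assuming the WRITE_LIFE_* values 1..4 from
the fs patches quoted earlier; illustrative only, not the actual driver
code:

static unsigned int nvme_hint_to_stream(unsigned int hint,
					unsigned int nr_streams)
{
	switch (nr_streams) {
	case 4:		/* direct mapping */
		return hint;
	case 3:		/* collapse the two longest life times */
		return min(hint, 3u);
	case 2:		/* short+medium -> 1, long+extreme -> 2 */
		return hint <= WRITE_LIFE_MEDIUM ? 1 : 2;
	default:	/* 0 or 1 stream: don't use streams at all */
		return 0;
	}
}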

> 
>> -ns->queue->limits.discard_granularity = logical_block_size;
>> +if (ctrl->nr_streams && ns->sws && ns->sgs) {
>> +unsigned int sz = logical_block_size * ns->sws * ns->sgs;
>> +
>> +ns->queue->limits.discard_alignment = sz;
> 
> I don't think this is correct:
> 
> "Data that is aligned to and in multiples of the Stream Write Size (SWS)
>  provides optimal performance of the write commands to the controller.
>  The Stream Granularity Size indicates the size of the media that is prepared
>  as a unit for future allocation for write commands and is a multiple of the
>  Stream Write Size. The controller may allocate and group together a stream
>  in Stream Granularity Size (SGS) units. Refer to Figure 293."
> 
> So I think the ns->sws should go away.

The SGS value is in units of SWS, however.

>> +ns->queue->limits.discard_granularity = sz;
>> +} else {
>> +u32 logical_block_size = queue_logical_block_size(ns->queue);
> 
> I think we already have a logical_block_size with the same value in
> scope here.

Oops yes.

>> +ns->sws = le32_to_cpu(s.sws);
>> +ns->sgs = le16_to_cpu(s.sgs);
>> +
>> +if (ns->sws) {
>> +unsigned int bs = 1 << ns->lba_shift;
>> +
>> +blk_queue_io_min(ns->queue, bs * ns->sws);
>> +if (ns->sgs)
>> +blk_queue_io_opt(ns->queue, bs * ns->sws * ns->sgs);
> 
> drop the multiplication with ns->sws here as well.

See above.

-- 
Jens Axboe



Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Ming Lei
On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig  wrote:
> On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote:
>> blk_queue_make_request() has set bounce for any highmem page for a long
>> time, and in theory this patch might cause a regression on 32-bit arches
>> when the controller can't really handle highmem pages (especially in the
>> case of PAE), so we may need to be careful about this change, especially
>> on some old hardware.
>
> Which controller?

I don't know.

IMO, we can't just remove it because it is a bio-based driver; the
bounce is not related to whether a driver is bio- or request-based,
it depends on whether the controller can handle highmem pages.

Actually the bounce for highmem has been here since v2.6.12, and it
looks like you might need to investigate why it was introduced in the
first place before removing it, :-)



Thanks,
Ming Lei


Re: [dm-devel] dm integrity tests crash kernel (4.12-rc5)

2017-06-19 Thread Ondrej Kozina

On 06/19/2017 04:20 PM, Mike Snitzer wrote:


Looks like submit_flush_bio() is disabling/enabling interrupts from
interrupt context.  Ondrej, does this patch fix the issue?


I let it spin for 30 minutes on patched dm-integrity and everything 
seems ok now. The moment I loaded back the old one it fell on its face
almost immediately again.


Thanks!



diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 4ab10cf..93b1810 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1105,10 +1105,13 @@ static void schedule_autocommit(struct dm_integrity_c *ic)
 static void submit_flush_bio(struct dm_integrity_c *ic, struct dm_integrity_io *dio)
 {
struct bio *bio;
-   spin_lock_irq(&ic->endio_wait.lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&ic->endio_wait.lock, flags);
bio = dm_bio_from_per_bio_data(dio, sizeof(struct dm_integrity_io));
bio_list_add(&ic->flush_bio_list, bio);
-   spin_unlock_irq(&ic->endio_wait.lock);
+   spin_unlock_irqrestore(&ic->endio_wait.lock, flags);
+
queue_work(ic->commit_wq, &ic->commit_work);
 }







Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Christoph Hellwig
On Mon, Jun 19, 2017 at 11:13:46PM +0800, Ming Lei wrote:
> On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig  wrote:
> > On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote:
>> blk_queue_make_request() has set bounce for any highmem page for a long
>> time, and in theory this patch might cause a regression on 32-bit arches
>> when the controller can't really handle highmem pages (especially in the
>> case of PAE), so we may need to be careful about this change, especially
>> on some old hardware.
> >
> > Which controller?
> 
> I don't know.
> 
> IMO, we can't just remove it because it is a bio based driver, the
> bounce is not related with
> a bio or req based driver, and it depends if the controller can handle
> highmem page.

pktcdvd is a stacking driver and forwards all data through either
generic_make_request or blk_rq_map_kern + blk_execute_rq,
both of which call blk_queue_bounce internally.


Re: dm integrity tests crash kernel (4.12-rc5)

2017-06-19 Thread Mike Snitzer
On Mon, Jun 19 2017 at 11:16am -0400,
Ondrej Kozina  wrote:

> On 06/19/2017 04:20 PM, Mike Snitzer wrote:
> >
> >Looks like submit_flush_bio() is disabling/enabling interrupts from
> >interrupt context.  Ondrej, does this patch fix the issue?
> 
> I let it spin for 30 minutes on patched dm-integrity and everything
> seems ok now. The moment I loaded back the old one it fell on face
> almost immediately again.

OK, I'll get it staged as a fix for 4.12.

Thanks,
Mike


Re: [PATCH rfc 01/30] nvme: Add admin connect request queue

2017-06-19 Thread Hannes Reinecke
On 06/19/2017 09:49 AM, Sagi Grimberg wrote:
> 
>>> In case we reconnect with inflight admin IO we
>>> need to make sure that the connect comes before
>>> the admin command. This can be only achieved by
>>> using a separate request queue for admin connects.
>>
>> Use up a few more lines of the available space for your lines? :)
> 
> I warned in the cover-letter that the change logs are pure
> negligence at the moment :)
> 
>> Wouldn't a head insertation also solve the problem?
> 
> the head insertion will not protect against it because
> we must invoke blk_mq_start_stopped_hw_queues on the admin_q
> request queue so the admin connect can make progress; at this
> point (and before we actually queue up the connect) pending admin
> commands can sneak in...
> 
> However, you raise a valid point, I think I added this before we
> had the queue_is_ready protection, which will reject the command
> if the queue is not LIVE (unless it's a connect). I think the reason
> it's still in is that I tested this with loop, which doesn't have
> a per-queue state machine.
> 
> I'm still wondering if it's a good idea to rely on the transport
> queue state to reject non-connect requests on non-LIVE queues.
> if/when we introduce a queue representation to the core and we
> drive the state machine there, then we could actually rely on it
> (I do have some code for it, but it's a pretty massive change which
> cannot be added in an incremental fashion).
> 
> Thoughts?

I very much prefer this solution, ie rejecting all commands if the queue
state is not 'LIVE', and make the 'connect' command the only one being
accepted in that state.
(Actually, we would need a state 'READY_TO_CONNECT' or somesuch to make
this work properly.)
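
For illustration, such a check could look roughly like this; a sketch
modelled on the existing nvme_rdma check, where the queue struct and the
LIVE flag name are stand-ins, not real core code yet:

static bool nvme_queue_is_ready(struct nvme_queue *queue, struct request *rq)
{
	struct nvme_command *cmd = nvme_req(rq)->cmd;

	if (test_bit(NVME_Q_LIVE, &queue->flags))
		return true;

	/* on a non-LIVE queue, only let the fabrics connect command through */
	return blk_rq_is_passthrough(rq) &&
	       cmd->common.opcode == nvme_fabrics_command &&
	       cmd->fabrics.fctype == nvme_fabrics_type_connect;
}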

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 01/10] pktcdvd: remove the call to blk_queue_bounce

2017-06-19 Thread Ming Lei
On Mon, Jun 19, 2017 at 11:18 PM, Christoph Hellwig  wrote:
> On Mon, Jun 19, 2017 at 11:13:46PM +0800, Ming Lei wrote:
>> On Mon, Jun 19, 2017 at 11:00 PM, Christoph Hellwig  wrote:
>> > On Mon, Jun 19, 2017 at 10:34:36PM +0800, Ming Lei wrote:
>> >> blk_queue_make_request() has set bounce for any highmem page for a long
>> >> time, and in theory this patch might cause a regression on 32-bit arches
>> >> when the controller can't really handle highmem pages (especially in the
>> >> case of PAE), so we may need to be careful about this change, especially
>> >> on some old hardware.
>> >
>> > Which controller?
>>
>> I don't know.
>>
>> IMO, we can't just remove it because it is a bio based driver, the
>> bounce is not related with
>> a bio or req based driver, and it depends if the controller can handle
>> highmem page.
>
> pktcdvd is a stacking driver and forwards all data through either
> generic_make_request or blk_rq_map_kern + blk_execute_rq,
> both of which call blk_queue_bounce internally.

OK, thanks for explaining; then it looks like it wasn't needed from the start.

Thanks,
Ming Lei


Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-19 Thread Jens Axboe
On 06/19/2017 08:56 AM, Jens Axboe wrote:
> On 06/19/2017 12:27 AM, Christoph Hellwig wrote:
>> On Sat, Jun 17, 2017 at 01:59:47PM -0600, Jens Axboe wrote:
>>> Add four flags for the pwritev2(2) system call, allowing an application
>>> to give the kernel a hint about what on-media life times can be
>>> expected from a given write.
>>>
>>> The intent is for these values to be relative to each other, no
>>> absolute meaning should be attached to these flag names.
>>>
>>> Set aside 3 bits in the iocb flags structure to carry this information
>>> over from the pwritev2 RWF_WRITE_LIFE_* flags.
>>
>> What is the strong use case for the per-I/O flags?  I'd much rather
>> stick to fcntl only for now if we can.
> 
> Fine, I guess I should just have dusted off the 2 year old patchset,
> as that was _exactly_ what that did.

Actually, one good use case is O_DIRECT on a block device. Since I'm
not a huge fan of having per-call hints that is only useful for a
single case, how about we add the hints to the struct file as well?
For buffered IO, just grab it from the inode. If we have a file
available, then that overrides the per-inode setting.
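
A sketch of that precedence (the f_write_hint/i_write_hint field names
are hypothetical here):

static inline enum rw_hint file_write_hint(struct file *file)
{
	/* a per-file hint, if set, overrides the per-inode one */
	if (file->f_write_hint != WRITE_LIFE_NONE)
		return file->f_write_hint;

	return file_inode(file)->i_write_hint;
}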


-- 
Jens Axboe



Re: [PATCH rfc 28/30] nvme: update tagset nr_hw_queues when reallocating io queues

2017-06-19 Thread Sagi Grimberg



>>>>> Signed-off-by: Sagi Grimberg
>>>>
>>>> Could use a changelog.
>>>>
>>>> Ming: does this solve your problem of not seeing the new queues
>>>> after a qemu CPU hotplug + reset?
>>>
>>> The issue I observed is that there isn't an NVMe reset triggered after
>>> a CPU becomes online.
>>
>> This won't help with that.
>
> It may take a while since the test needs the whole patchset applied.
> Or is it fine to figure out one fix for this issue? Sagi?

No need, all this doesn't even touch pci for now...


Re: [PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-19 Thread Jeff Layton
On Fri, 2017-06-16 at 15:34 -0400, Jeff Layton wrote:
> v7:
> ===
> This is the seventh posting of the patchset to revamp the way writeback
> errors are tracked and reported.
> 
> The main difference from the v6 posting is the removal of the
> FS_WB_ERRSEQ flag. That requires a few other incremental patches in the
> writeback code to ensure that both error tracking models are handled
> in a suitable way.
> 
> Also, a bit more cleanup of the metadata writeback codepaths, and some
> documentation updates.
> 
> Some of these patches have been posted separately, but I'm re-posting
> them here to make it clear that they're prerequisites of the later
> patches in the series.
> 
> This series is based on top of linux-next from a day or so ago. I'd like
> to have this picked up by linux-next in the near future so we can get
> some more extensive testing with it. Should I just plan to maintain a
> topic branch on top of -next and ask Stephen to pick it up?
> 
> Background:
> ===
> The basic problem is that we have (for a very long time) tracked and
> reported writeback errors based on two flags in the address_space:
> AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked,
> so only the first caller to check them is able to consume them.
> 
> That model is quite unreliable, for several related reasons:
> 
> * only the first fsync caller on the inode will see the error. In a
>   world of containerized setups, that's no longer viable. Applications
>   need to know that their writes are safely stored, and they can
>   currently miss seeing errors that they should be aware of when
>   they're not.
> 
> * there are a lot of internal callers to filemap_fdatawait* and
>   filemap_write_and_wait* that clear these errors but then never report
>   them to userland in any fashion.
> 
> * Some internal callers report writeback errors, but can do so at
>   non-sensical times. For instance, we might want to truncate a file,
>   which triggers a pagecache flush. If that writeback fails, we might
>   report that error to the truncate caller, but a subsequent fsync
>   will likely not see it.
> 
> * Some internal callers try to reset the error flags after clearing
>   them, but that's racy. Another task could check the flags between
>   those two events.
> 
> Solution:
> =
> This patchset adds a new datatype called an errseq_t that represents a
> sequence of errors. It's a u32, with a field for a POSIX-flavor error
> and a counter, managed with atomics. We can sample that value at a
> particular point in time, and can later tell whether there have been any
> errors since that point.
> 
> That allows us to provide traditional check-and-clear fsync semantics
> on every open file description in a lightweight fashion. fsync callers
> no longer need to coordinate between one another in order to ensure
> that errors at fsync time are handled correctly.
> 
> Strategy:
> =
> The aim with this pile is to do the minimum possible to support for
> reliable reporting of errors on fsync, without substantially changing
> the internals of the filesystems themselves.
> 
> Most of the internal calls to filemap_fdatawait are left alone, so all
> of the internal error checkers are using the same error handling they
> always have. The only real difference here is that we're better
> reporting errors at fsync.
> 
> I think that we probably will want to eventually convert all of those
> internal callers to use errseq_t based reporting, but that can be done
> in an incremental fashion in follow-on patchsets.
> 
> Testing:
> 
> I've primarily been testing this with some new xfstests that I will post
> in a separate series. These tests use dm-error fault injection to make
> the underlying block device start throwing I/O errors, and then test the
> how the filesystem layer reports errors after that.
> 
> Jeff Layton (22):
>   fs: remove call_fsync helper function
>   buffer: use mapping_set_error instead of setting the flag
>   fs: check for writeback errors after syncing out buffers in
> generic_file_fsync
>   buffer: set errors in mapping at the time that the error occurs
>   jbd2: don't clear and reset errors after waiting on writeback
>   mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
>   mm: don't TestClearPageError in __filemap_fdatawait_range
>   mm: clean up error handling in write_one_page
>   fs: always sync metadata in __generic_file_fsync
>   lib: add errseq_t type and infrastructure for handling it
>   fs: new infrastructure for writeback error handling and reporting
>   mm: tracepoints for writeback error events
>   mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
>   Documentation: flesh out the section in vfs.txt on storing and
> reporting writeback errors
>   dax: set errors in mapping when writeback fails
>   block: convert to errseq_t based writeback error tracking
>   ext4: use errseq_t based error handling for reporting data writeback
> errors
>   fs: add

Re: [PATCH rfc 25/30] nvme: move control plane handling to nvme core

2017-06-19 Thread Sagi Grimberg

>> +static void nvme_free_io_queues(struct nvme_ctrl *ctrl)
>> +{
>> +   int i;
>> +
>> +   for (i = 1; i < ctrl->queue_count; i++)
>> +   ctrl->ops->free_hw_queue(ctrl, i);
>> +}
>> +
>> +void nvme_stop_io_queues(struct nvme_ctrl *ctrl)
>> +{
>> +   int i;
>> +
>> +   for (i = 1; i < ctrl->queue_count; i++)
>> +   ctrl->ops->stop_hw_queue(ctrl, i);
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_stop_io_queues);


> At least for PCIe this is going to work very differently, so I'm not
> sure this part makes so much sense in the core.  Maybe in Fabrics?
> Or at least make the callouts operate on all I/O queues, which would
> suit PCIe a lot more.


Yea, I spent some time thinking about the async nature of queue
removal for pci... I started from ->stop/free_io_queues callouts
but hated the fact that we need to iterate exactly the same way
in every driver...

We could have an optional stop/free_io_queues that the core
will call instead if implemented?
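
Roughly like this, perhaps (a sketch; the bulk stop_io_queues op is
hypothetical):

void nvme_stop_io_queues(struct nvme_ctrl *ctrl)
{
	int i;

	/* optional bulk callout, e.g. PCIe taking down all queues at once */
	if (ctrl->ops->stop_io_queues) {
		ctrl->ops->stop_io_queues(ctrl);
		return;
	}

	/* default: the per-queue iteration used by the fabrics drivers */
	for (i = 1; i < ctrl->queue_count; i++)
		ctrl->ops->stop_hw_queue(ctrl, i);
}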


>> +   error = ctrl->ops->start_hw_queue(ctrl, 0);
>> +   if (error)
>> +   goto out_cleanup_connect_queue;
>> +
>> +   error = ctrl->ops->reg_read64(ctrl, NVME_REG_CAP, &ctrl->cap);
>> +   if (error) {
>> +   dev_err(ctrl->device,
>> +   "prop_get NVME_REG_CAP failed\n");
>> +   goto out_cleanup_connect_queue;
>> +   }
>> +
>> +   ctrl->sqsize = min_t(int, NVME_CAP_MQES(ctrl->cap), ctrl->sqsize);
>> +
>> +   error = nvme_enable_ctrl(ctrl, ctrl->cap);
>> +   if (error)
>> +   goto out_cleanup_connect_queue;


> I'm not sure this ordering is going to work for PCIe..


This one is easy to reverse...


[PATCH 0/10 v13] merge request: No wait AIO

2017-06-19 Thread Goldwyn Rodrigues
Jens,

As Christoph suggested, I am sending the patches against the block
tree for merge since the block layer changes had the most conflicts.
My tree is at https://github.com/goldwynr/linux/tree/nowait-block

This series adds nonblocking feature to asynchronous I/O writes.
io_submit() can be delayed because of a number of reasons:
 - Block allocation for files
 - Data writebacks for direct I/O
 - Sleeping because of waiting to acquire i_rwsem
 - Congested block device

The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
any of these conditions are met. This way userspace can push most
of the write()s to the kernel, which completes them to the best of its
ability, and if a write returns -EAGAIN, defer it to another thread.

In order to enable this, IOCB_RW_FLAG_NOWAIT is introduced in
uapi/linux/aio_abi.h. If set for aio_rw_flags, it translates to
IOCB_NOWAIT for struct iocb, REQ_NOWAIT for bio.bi_opf and IOMAP_NOWAIT for
iomap. aio_rw_flags is a new flag replacing aio_reserved1. We could
not use aio_flags because it is not currently checked for invalidity
in the kernel.

This feature is provided for direct I/O of asynchronous I/O only. I have
tested it against xfs, ext4, and btrfs, and I intend to add support for
more filesystems.
The nowait feature is for request based devices. In the future, I intend to
add support to stacked devices such as md.

Applications will have to check supportability by sending an async direct
write; any error besides -EAGAIN means the feature is not supported.
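
For instance, a probe could look roughly like this (a sketch; assumes the
aio_rw_flags field and RWF_NOWAIT flag introduced by this series, and an
aio context previously set up with io_setup()):

#include <linux/aio_abi.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

/* returns 1 if RWF_NOWAIT is supported, 0 if not */
static int probe_nowait(aio_context_t ctx, int fd, void *buf, size_t len)
{
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;			/* opened with O_DIRECT */
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_buf = (unsigned long)buf;	/* suitably aligned buffer */
	cb.aio_nbytes = len;
	cb.aio_rw_flags = RWF_NOWAIT;		/* new in this series */

	if (syscall(SYS_io_submit, ctx, 1, cbs) == 1)
		return 1;
	/* EAGAIN still means supported, the write just would have blocked */
	return errno == EAGAIN;
}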

The first two patches are prep patches for nowait I/O.

Changes since v1:
 + changed name from _NONBLOCKING to *_NOWAIT
 + filemap_range_has_page call moved to closer to (just before) calling 
filemap_write_and_wait_range().
 + BIO_NOWAIT limited to get_request()
 + XFS fixes 
- included reflink 
- use of xfs_ilock_nowait() instead of a XFS_IOLOCK_NONBLOCKING flag
- Translate the flag through IOMAP_NOWAIT (iomap) to check for
  block allocation for the file.
 + ext4 coding style

Changes since v2:
 + Using aio_reserved1 as aio_rw_flags instead of aio_flags
 + blk-mq support
 + xfs uptodate with kernel and reflink changes

 Changes since v3:
  + Added FS_NOWAIT, which is set if the filesystem supports the NOWAIT feature.
  + Checks in generic_make_request() to make sure BIO_NOWAIT comes in
for async direct writes only.
  + Added QUEUE_FLAG_NOWAIT, which is set if the device supports BIO_NOWAIT.
The flag is currently left unset for stacked block devices such as dm/md.

 Changes since v4:
  + Ported AIO code to use RWF_* flags. Check for RWF_* flags in
generic_file_write_iter().
  + Changed IOCB_RW_FLAGS_NOWAIT to RWF_NOWAIT.

 Changes since v5:
  + BIO_NOWAIT to REQ_NOWAIT
  + Common helper for RWF flags.

 Changes since v6:
  + REQ_NOWAIT will be ignored for request-based devices since they
cannot block. So, removed QUEUE_FLAG_NOWAIT since it is not
required in the current implementation. It will be resurrected
when we add support for stacked devices.
  + changed kiocb_rw_flags() to kiocb_set_rw_flags() in order to accommodate
for errors. Moved the checks into the function.

 Changes since v7:
  + split patches into prep so the main patches are smaller and easier
to understand
  + All patches are reviewed or acked!
 
 Changes since v8:
 + Err out AIO reads flagged RWF_NOWAIT with -EINVAL

 Changes since v9:
 + Retract: err out AIO reads flagged RWF_NOWAIT with -EINVAL
 + XFS returns EAGAIN if the extent list is not in memory
 + Man page updates to io_submit with iocb description and nowait features.

 Changes since v10:
 + Corrected comment and subject in "return on congested block device"

 Changes since v11:
 + FMODE_AIO_NOWAIT to indicate AIO NOWAIT support. This blocks
   non-supporting filesystems centrally instead of returning ENOTSUPP
   in each individual filesystem

 Changes since v12:
 + Updated with respect to linux-next. Primarily affects the block
   patch, since bio.bi_error changed to bio.bi_status.
 + Fix FMODE_AIO_NOWAIT spelling
 + Added RWF_SUPPORTED to encompass all flags
 + Added checks on the read side, namely generic_file_read_iter() and
   iomap_dio_rw()

-- 
Goldwyn




[PATCH 01/10] fs: Separate out kiocb flags setup based on RWF_* flags

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Separate the kiocb flag setup based on RWF_* flags out into a new
helper, kiocb_set_rw_flags(). Also add RWF_SUPPORTED to encompass all
supported flags.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/read_write.c | 12 +++-
 include/linux/fs.h  | 14 ++
 include/uapi/linux/fs.h |  2 ++
 3 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 47c1d4484df9..53c816c61122 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -678,16 +678,10 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
struct iov_iter *iter,
struct kiocb kiocb;
ssize_t ret;
 
-   if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
-   return -EOPNOTSUPP;
-
init_sync_kiocb(&kiocb, filp);
-   if (flags & RWF_HIPRI)
-   kiocb.ki_flags |= IOCB_HIPRI;
-   if (flags & RWF_DSYNC)
-   kiocb.ki_flags |= IOCB_DSYNC;
-   if (flags & RWF_SYNC)
-   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   ret = kiocb_set_rw_flags(&kiocb, flags);
+   if (ret)
+   return ret;
kiocb.ki_pos = *ppos;
 
if (type == READ)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 023f0324762b..96a1a1fa54a9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3057,6 +3057,20 @@ static inline int iocb_flags(struct file *file)
return res;
 }
 
+static inline int kiocb_set_rw_flags(struct kiocb *ki, int flags)
+{
+   if (unlikely(flags & ~RWF_SUPPORTED))
+   return -EOPNOTSUPP;
+
+   if (flags & RWF_HIPRI)
+   ki->ki_flags |= IOCB_HIPRI;
+   if (flags & RWF_DSYNC)
+   ki->ki_flags |= IOCB_DSYNC;
+   if (flags & RWF_SYNC)
+   ki->ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
+   return 0;
+}
+
 static inline ino_t parent_ino(struct dentry *dentry)
 {
ino_t res;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 24e61a54feaa..937c3e39650a 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -361,4 +361,6 @@ struct fscrypt_key {
 #define RWF_DSYNC  0x0002 /* per-IO O_DSYNC */
 #define RWF_SYNC   0x0004 /* per-IO O_SYNC */
 
+#define RWF_SUPPORTED  (RWF_HIPRI | RWF_DSYNC | RWF_SYNC)
+
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.12.0



[PATCH 07/10] block: return on congested block device

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

A new bio operation flag REQ_NOWAIT is introduced to identify bios
originating from an iocb with IOCB_NOWAIT. This flag indicates
that we should return immediately if a request cannot be made instead
of retrying.

Stacked devices such as md (the ones with make_request_fn hooks)
are currently not supported because they may block for housekeeping.
For example, an md device can have a part of itself suspended.
For this reason, only request-based devices are supported.
In the future, this feature will be expanded to stacked devices
by teaching them how to handle the REQ_NOWAIT flag.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jens Axboe 
Signed-off-by: Goldwyn Rodrigues 
---
 block/blk-core.c  | 22 --
 block/blk-mq-sched.c  |  3 +++
 block/blk-mq.c|  2 ++
 fs/direct-io.c| 10 --
 include/linux/bio.h   |  6 ++
 include/linux/blk_types.h |  4 
 6 files changed, 43 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8592409db272..3999e6ac288d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -143,6 +143,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
[BLK_STS_DM_REQUEUE]= { -EREMCHG, "dm internal retry" },
@@ -1314,6 +1315,11 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
if (!IS_ERR(rq))
return rq;
 
+   if (op & REQ_NOWAIT) {
+   blk_put_rl(rl);
+   return ERR_PTR(-EAGAIN);
+   }
+
if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
blk_put_rl(rl);
return rq;
@@ -1961,6 +1967,14 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   /*
+* For a REQ_NOWAIT based request, return -EOPNOTSUPP
+* if queue is not a request based queue.
+*/
+
+   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
+   goto not_supported;
+
part = bio->bi_bdev->bd_part;
if (should_fail_request(part, bio->bi_iter.bi_size) ||
should_fail_request(&part_to_disk(part)->part0,
@@ -2118,7 +2132,7 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-   if (likely(blk_queue_enter(q, false) == 0)) {
+   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
@@ -2143,7 +2157,11 @@ blk_qc_t generic_make_request(struct bio *bio)
bio_list_merge(&bio_list_on_stack[0], &same);
bio_list_merge(&bio_list_on_stack[0], 
&bio_list_on_stack[1]);
} else {
-   bio_io_error(bio);
+   if (unlikely(!blk_queue_dying(q) &&
+   (bio->bi_opf & REQ_NOWAIT)))
+   bio_wouldblock_error(bio);
+   else
+   bio_io_error(bio);
}
bio = bio_list_pop(&bio_list_on_stack[0]);
} while (bio);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c4e2afb9d12d..ab14295d8468 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -83,6 +83,9 @@ struct request *blk_mq_sched_get_request(struct request_queue 
*q,
if (likely(!data->hctx))
data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
 
+   if (op & REQ_NOWAIT)
+   data->flags |= BLK_MQ_REQ_NOWAIT;
+
if (e) {
data->flags |= BLK_MQ_REQ_INTERNAL;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 359d2dc0d414..387fc8e8a2ba 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1498,6 +1498,8 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
rq = blk_mq_sched_get_request(q, bio, bio->bi_opf, &data);
if (unlikely(!rq)) {
__wbt_done(q->rq_wb, wb_acct);
+   if (bio->bi_opf & REQ_NOWAIT)
+   bio_wouldblock_error(bio);
return BLK_QC_T_NONE;
}
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e8baaabebf13..c87077d1dc33 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -479,8 +479,12 @@ static blk_status_t dio_bio_complete(struct dio *dio, 
struct bio *bio)
unsigned i;
blk_status_t err = bio->bi_status;
 
-   if (err)
-   dio->io_error = -EIO;
+   if (err) {
+   if (err == BLK_STS_AGAIN && (bio->bi_opf & REQ_NOWAIT))
+   dio->io_error = -EAGAIN;
+   else
+   dio->io_error = -EIO;
+   }

[PATCH 03/10] fs: Use RWF_* flags for AIO operations

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

aio_rw_flags is introduced in struct iocb (replacing aio_reserved1) to
carry the RWF_* flags. We cannot use aio_flags because it is not
checked for validity, so repurposing it may break existing applications.

Note that the only place RWF_HIPRI takes effect is dio_await_one();
everywhere else, the aio code returns -EIOCBQUEUED before the
RWF_HIPRI checks.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/aio.c | 8 +++-
 include/uapi/linux/aio_abi.h | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index f52d925ee259..020fa0045e3c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1541,7 +1541,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
ssize_t ret;
 
/* enforce forwards compatibility on users */
-   if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
+   if (unlikely(iocb->aio_reserved2)) {
pr_debug("EINVAL: reserve field set\n");
return -EINVAL;
}
@@ -1586,6 +1586,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
req->common.ki_flags |= IOCB_EVENTFD;
}
 
+   ret = kiocb_set_rw_flags(&req->common, iocb->aio_rw_flags);
+   if (unlikely(ret)) {
+   pr_debug("EINVAL: aio_rw_flags\n");
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index bb2554f7fbd1..a2d4a8ac94ca 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -79,7 +79,7 @@ struct io_event {
 struct iocb {
/* these are internal to the kernel/libc. */
__u64   aio_data;   /* data to be returned in event's data */
-   __u32   PADDED(aio_key, aio_reserved1);
+   __u32   PADDED(aio_key, aio_rw_flags);
/* the kernel sets aio_key to the req # */
 
/* common fields */
-- 
2.12.0



[PATCH 05/10] fs: return if direct I/O will trigger writeback

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Find out if the I/O will trigger a wait due to writeback. If yes,
return -EAGAIN.

Return -EINVAL for buffered AIO: there are multiple causes of
delay such as page locks, dirty throttling logic, and page loading
from disk, none of which can be taken care of.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 mm/filemap.c | 32 +---
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9b39a2390b9e..742034e56100 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2070,10 +2070,17 @@ generic_file_read_iter(struct kiocb *iocb, struct 
iov_iter *iter)
loff_t size;
 
size = i_size_read(inode);
-   retval = filemap_write_and_wait_range(mapping, iocb->ki_pos,
-   iocb->ki_pos + count - 1);
-   if (retval < 0)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (filemap_range_has_page(mapping, iocb->ki_pos,
+  iocb->ki_pos + count - 1))
+   return -EAGAIN;
+   } else {
+   retval = filemap_write_and_wait_range(mapping,
+   iocb->ki_pos,
+   iocb->ki_pos + count - 1);
+   if (retval < 0)
+   goto out;
+   }
 
file_accessed(file);
 
@@ -2674,6 +2681,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
 
pos = iocb->ki_pos;
 
+   if ((iocb->ki_flags & IOCB_NOWAIT) && !(iocb->ki_flags & IOCB_DIRECT))
+   return -EINVAL;
+
if (limit != RLIM_INFINITY) {
if (iocb->ki_pos >= limit) {
send_sig(SIGXFSZ, current, 0);
@@ -2742,9 +2752,17 @@ generic_file_direct_write(struct kiocb *iocb, struct 
iov_iter *from)
write_len = iov_iter_count(from);
end = (pos + write_len - 1) >> PAGE_SHIFT;
 
-   written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 
1);
-   if (written)
-   goto out;
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   /* If there are pages to writeback, return */
+   if (filemap_range_has_page(inode->i_mapping, pos,
+  pos + iov_iter_count(from)))
+   return -EAGAIN;
+   } else {
+   written = filemap_write_and_wait_range(mapping, pos,
+   pos + write_len - 1);
+   if (written)
+   goto out;
+   }
 
/*
 * After a write we want buffered reads to be sure to go to disk to get
-- 
2.12.0



[PATCH 02/10] fs: Introduce filemap_range_has_page()

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

filemap_range_has_page() returns true if the file's mapping has
a page within the range mentioned. This function will be used
to check whether a write() call will cause a writeback of previous
writes.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 include/linux/fs.h |  2 ++
 mm/filemap.c   | 32 
 2 files changed, 34 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 96a1a1fa54a9..0d34f5b5a6b0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2518,6 +2518,8 @@ extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern bool filemap_range_has_page(struct address_space *, loff_t lstart,
+ loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..9b39a2390b9e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,6 +376,38 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
+/**
+ * filemap_range_has_page - check if a page exists in range.
+ * @mapping:   address space within which to check
+ * @start_byte:offset in bytes where the range starts
+ * @end_byte:  offset in bytes where the range ends (inclusive)
+ *
+ * Find at least one page in the range supplied, usually used to check if
+ * direct writing in this range will trigger a writeback.
+ */
+bool filemap_range_has_page(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte)
+{
+   pgoff_t index = start_byte >> PAGE_SHIFT;
+   pgoff_t end = end_byte >> PAGE_SHIFT;
+   struct pagevec pvec;
+   bool ret;
+
+   if (end_byte < start_byte)
+   return false;
+
+   if (mapping->nrpages == 0)
+   return false;
+
+   pagevec_init(&pvec, 0);
+   if (!pagevec_lookup(&pvec, mapping, index, 1))
+   return false;
+   ret = (pvec.pages[0]->index <= end);
+   pagevec_release(&pvec);
+   return ret;
+}
+EXPORT_SYMBOL(filemap_range_has_page);
+
 static int __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
-- 
2.12.0



[PATCH 04/10] fs: Introduce RWF_NOWAIT and FMODE_AIO_NOWAIT

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

RWF_NOWAIT informs the kernel to bail out if an AIO request would
block, for reasons such as file block allocations, triggered
writeback, or waiting for request allocation while performing
direct I/O.

RWF_NOWAIT is translated to IOCB_NOWAIT for iocb->ki_flags.

FMODE_AIO_NOWAIT is a flag which indicates that the opened file is
capable of returning -EAGAIN if the AIO call would block. It must be
set by supporting filesystems in their ->open() method.

xfs, btrfs and ext4 gain support in the following patches.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/aio.c| 6 ++
 include/linux/fs.h  | 9 +
 include/uapi/linux/fs.h | 4 +++-
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index 020fa0045e3c..34027b67e2f4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1592,6 +1592,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb 
__user *user_iocb,
goto out_put_req;
}
 
+   if ((req->common.ki_flags & IOCB_NOWAIT) &&
+   !(req->common.ki_flags & IOCB_DIRECT)) {
+   ret = -EOPNOTSUPP;
+   goto out_put_req;
+   }
+
ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0d34f5b5a6b0..4574121f4746 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -143,6 +143,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
offset,
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY ((__force fmode_t)0x400)
 
+/* File is capable of returning -EAGAIN if AIO will block */
+#define FMODE_AIO_NOWAIT   ((__force fmode_t)0x800)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
@@ -269,6 +272,7 @@ struct writeback_control;
 #define IOCB_DSYNC (1 << 4)
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
+#define IOCB_NOWAIT(1 << 7)
 
 struct kiocb {
struct file *ki_filp;
@@ -3064,6 +3068,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, 
int flags)
if (unlikely(flags & ~RWF_SUPPORTED))
return -EOPNOTSUPP;
 
+   if (flags & RWF_NOWAIT) {
+   if (!(ki->ki_filp->f_mode & FMODE_AIO_NOWAIT))
+   return -EOPNOTSUPP;
+   ki->ki_flags |= IOCB_NOWAIT;
+   }
if (flags & RWF_HIPRI)
ki->ki_flags |= IOCB_HIPRI;
if (flags & RWF_DSYNC)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 937c3e39650a..27d8c36c04af 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -360,7 +360,9 @@ struct fscrypt_key {
 #define RWF_HIPRI  0x0001 /* high priority request, 
poll if possible */
 #define RWF_DSYNC  0x0002 /* per-IO O_DSYNC */
 #define RWF_SYNC   0x0004 /* per-IO O_SYNC */
+#define RWF_NOWAIT 0x0008 /* per-IO, return -EAGAIN if 
operation would block */
 
-#define RWF_SUPPORTED  (RWF_HIPRI | RWF_DSYNC | RWF_SYNC)
+#define RWF_SUPPORTED  (RWF_HIPRI | RWF_DSYNC | RWF_SYNC |\
+RWF_NOWAIT)
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.12.0



[PATCH 10/10] btrfs: nowait aio support

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN if any of the following is true:
 + i_rwsem is not immediately lockable
 + NODATACOW or PREALLOC is not set
 + we cannot nocow at the desired location
 + the write extends beyond end of file into an unallocated region

Acked-by: David Sterba 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/file.c  | 33 +++--
 fs/btrfs/inode.c |  3 +++
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..59e2dccdf75b 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1875,12 +1875,29 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
ssize_t num_written = 0;
bool sync = (file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host);
ssize_t err;
-   loff_t pos;
-   size_t count;
+   loff_t pos = iocb->ki_pos;
+   size_t count = iov_iter_count(from);
loff_t oldsize;
int clean_page = 0;
 
-   inode_lock(inode);
+   if ((iocb->ki_flags & IOCB_NOWAIT) &&
+   (iocb->ki_flags & IOCB_DIRECT)) {
+   /* Don't sleep on inode rwsem */
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   /*
+* We will allocate space in case nodatacow is not set,
+* so bail
+*/
+   if (!(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+ BTRFS_INODE_PREALLOC)) ||
+   check_can_nocow(BTRFS_I(inode), pos, &count) <= 0) {
+   inode_unlock(inode);
+   return -EAGAIN;
+   }
+   } else
+   inode_lock(inode);
+
err = generic_write_checks(iocb, from);
if (err <= 0) {
inode_unlock(inode);
@@ -1914,8 +1931,6 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 */
update_time_for_write(inode);
 
-   pos = iocb->ki_pos;
-   count = iov_iter_count(from);
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
@@ -3071,13 +3086,19 @@ static loff_t btrfs_file_llseek(struct file *file, 
loff_t offset, int whence)
return offset;
 }
 
+static int btrfs_file_open(struct inode *inode, struct file *filp)
+{
+   filp->f_mode |= FMODE_AIO_NOWAIT;
+   return generic_file_open(inode, filp);
+}
+
 const struct file_operations btrfs_file_operations = {
.llseek = btrfs_file_llseek,
.read_iter  = generic_file_read_iter,
.splice_read= generic_file_splice_read,
.write_iter = btrfs_file_write_iter,
.mmap   = btrfs_file_mmap,
-   .open   = generic_file_open,
+   .open   = btrfs_file_open,
.release= btrfs_release_file,
.fsync  = btrfs_sync_file,
.fallocate  = btrfs_fallocate,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f942293dd7e7..556c93060606 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8754,6 +8754,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter)
dio_data.overwrite = 1;
inode_unlock(inode);
relock = true;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
}
ret = btrfs_delalloc_reserve_space(inode, offset, count);
if (ret)
-- 
2.12.0



[PATCH 06/10] fs: Introduce IOMAP_NOWAIT

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

IOCB_NOWAIT translates to IOMAP_NOWAIT for iomaps.
This is used by XFS in the following XFS patch.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Jan Kara 
Signed-off-by: Goldwyn Rodrigues 
---
 fs/iomap.c| 8 
 include/linux/iomap.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index 18f2f2b8ba2c..c71a64b97fba 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -881,6 +881,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
flags |= IOMAP_WRITE;
}
 
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (filemap_range_has_page(mapping, start, end)) {
+   ret = -EAGAIN;
+   goto out_free_dio;
+   }
+   flags |= IOMAP_NOWAIT;
+   }
+
ret = filemap_write_and_wait_range(mapping, start, end);
if (ret)
goto out_free_dio;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f753e788da31..69f4e9470084 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -52,6 +52,7 @@ struct iomap {
 #define IOMAP_REPORT   (1 << 2) /* report extent status, e.g. FIEMAP */
 #define IOMAP_FAULT(1 << 3) /* mapping for page fault */
 #define IOMAP_DIRECT   (1 << 4) /* direct I/O */
+#define IOMAP_NOWAIT   (1 << 5) /* Don't wait for writeback */
 
 struct iomap_ops {
/*
-- 
2.12.0



[PATCH 09/10] xfs: nowait aio support

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

If IOCB_NOWAIT is set, bail if the i_rwsem is not immediately
lockable.

If IOMAP_NOWAIT is set, return EAGAIN in xfs_file_iomap_begin
if the request needs allocation, either due to file extension, writing
to a hole, or COW, or if it would wait for other DIOs to finish.

Return -EAGAIN if we don't have the extent list in memory.

Signed-off-by: Goldwyn Rodrigues 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c  | 20 +++-
 fs/xfs/xfs_iomap.c | 22 ++
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..e159eb381d9f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -541,8 +541,11 @@ xfs_file_dio_aio_write(
iolock = XFS_IOLOCK_SHARED;
}
 
-   xfs_ilock(ip, iolock);
-
+   if (!xfs_ilock_nowait(ip, iolock)) {
+   if (iocb->ki_flags & IOCB_NOWAIT)
+   return -EAGAIN;
+   xfs_ilock(ip, iolock);
+   }
ret = xfs_file_aio_write_checks(iocb, from, &iolock);
if (ret)
goto out;
@@ -553,9 +556,15 @@ xfs_file_dio_aio_write(
 * otherwise demote the lock if we had to take the exclusive lock
 * for other reasons in xfs_file_aio_write_checks.
 */
-   if (unaligned_io)
-   inode_dio_wait(inode);
-   else if (iolock == XFS_IOLOCK_EXCL) {
+   if (unaligned_io) {
+   /* If we are going to wait for other DIO to finish, bail */
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (atomic_read(&inode->i_dio_count))
+   return -EAGAIN;
+   } else {
+   inode_dio_wait(inode);
+   }
+   } else if (iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
iolock = XFS_IOLOCK_SHARED;
}
@@ -892,6 +901,7 @@ xfs_file_open(
return -EFBIG;
if (XFS_FORCED_SHUTDOWN(XFS_M(inode->i_sb)))
return -EIO;
+   file->f_mode |= FMODE_AIO_NOWAIT;
return 0;
 }
 
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 94e5bdf7304c..05dc87e8c1f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -995,6 +995,11 @@ xfs_file_iomap_begin(
lockmode = xfs_ilock_data_map_shared(ip);
}
 
+   if ((flags & IOMAP_NOWAIT) && !(ip->i_df.if_flags & XFS_IFEXTENTS)) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+
ASSERT(offset <= mp->m_super->s_maxbytes);
if ((xfs_fsize_t)offset + length > mp->m_super->s_maxbytes)
length = mp->m_super->s_maxbytes - offset;
@@ -1016,6 +1021,15 @@ xfs_file_iomap_begin(
 
if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
if (flags & IOMAP_DIRECT) {
+   /*
+* A reflinked inode will result in CoW alloc.
+* FIXME: It could still overwrite on unshared extents
+* and not need allocation.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
/* may drop and re-acquire the ilock */
error = xfs_reflink_allocate_cow(ip, &imap, &shared,
&lockmode);
@@ -1033,6 +1047,14 @@ xfs_file_iomap_begin(
 
if ((flags & IOMAP_WRITE) && imap_needs_alloc(inode, &imap, nimaps)) {
/*
+* If nowait is set bail since we are going to make
+* allocations.
+*/
+   if (flags & IOMAP_NOWAIT) {
+   error = -EAGAIN;
+   goto out_unlock;
+   }
+   /*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat 
symmetric
 * with the work writeback does. This is a completely arbitrary
-- 
2.12.0



[PATCH 08/10] ext4: nowait aio support

2017-06-19 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Return EAGAIN for direct I/O if any of the following is true:
  + i_rwsem is not immediately lockable
  + the write extends beyond end of file (will trigger allocation)
  + blocks are not already allocated at the write location

Signed-off-by: Goldwyn Rodrigues 
Reviewed-by: Jan Kara 
---
 fs/ext4/file.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 02ce7e7bbdf5..25cd42baadea 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -216,7 +216,13 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
return ext4_dax_write_iter(iocb, from);
 #endif
 
-   inode_lock(inode);
+   if (iocb->ki_flags & IOCB_NOWAIT) {
+   if (!inode_trylock(inode))
+   return -EAGAIN;
+   } else {
+   inode_lock(inode);
+   }
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
@@ -235,9 +241,15 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter 
*from)
 
iocb->private = &overwrite;
/* Check whether we do a DIO overwrite or not */
-   if (o_direct && ext4_should_dioread_nolock(inode) && !unaligned_aio &&
-   ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from)))
-   overwrite = 1;
+   if (o_direct && !unaligned_aio) {
+   if (ext4_overwrite_io(inode, iocb->ki_pos, 
iov_iter_count(from))) {
+   if (ext4_should_dioread_nolock(inode))
+   overwrite = 1;
+   } else if (iocb->ki_flags & IOCB_NOWAIT) {
+   ret = -EAGAIN;
+   goto out;
+   }
+   }
 
ret = __generic_file_write_iter(iocb, from);
inode_unlock(inode);
@@ -435,6 +447,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
+
+   /* Set the flags to support nowait AIO */
+   filp->f_mode |= FMODE_AIO_NOWAIT;
+
return dquot_file_open(inode, filp);
 }
 
-- 
2.12.0



Re: [PATCH 05/10] fs: return if direct I/O will trigger writeback

2017-06-19 Thread Al Viro
On Mon, Jun 19, 2017 at 11:33:43AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Find out if the I/O will trigger a wait due to writeback. If yes,
> return -EAGAIN.
> 
> Return -EINVAL for buffered AIO: there are multiple causes of
> delay such as page locks, dirty throttling logic, and page loading
> from disk, none of which can be taken care of.
> 

I don't believe that generic_file_read_iter() is sufficient.  Consider
ext4_file_read_iter() -> ext4_dax_read_iter():
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;

inode_lock_shared(inode);

IOW, your ext4 patch is missing a chunk.  I hadn't checked the xfs one...
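
A hedged sketch of the kind of chunk Al is pointing at: the DAX read
path would also need to honor IOCB_NOWAIT before taking the shared
lock, along these lines (illustration only, not part of the series):

	if (iocb->ki_flags & IOCB_NOWAIT) {
		if (!inode_trylock_shared(inode))
			return -EAGAIN;
	} else {
		inode_lock_shared(inode);
	}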


[PATCH 8/9] btrfs: add support for passing in write hints for buffered writes

2017-06-19 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Chris Mason 
Signed-off-by: Jens Axboe 
---
 fs/btrfs/extent_io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 19eedf2e630b..3e57cfaa6dd6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2830,6 +2830,7 @@ static int submit_extent_page(int op, int op_flags, 
struct extent_io_tree *tree,
bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;
bio->bi_private = tree;
+   op_flags |= write_hint_to_opf(inode_write_hint(page->mapping->host));
bio_set_op_attrs(bio, op, op_flags);
if (wbc) {
wbc_init_bio(wbc, bio);
-- 
2.7.4



[PATCHSET v8] Add support for write life time hints

2017-06-19 Thread Jens Axboe
A new iteration of this patchset, previously known as write streams.
As before, this patchset aims at enabling applications to split up
writes into separate streams, based on the perceived life time
of the data written. This is useful for a variety of reasons:

- For NVMe, this feature is ratified and released with the NVMe 1.3
  spec. Devices implementing Directives can expose multiple streams.
  Separating data written into streams based on life time can
  drastically reduce write amplification. This helps device
  endurance and increases performance. Testing just performed
  internally at Facebook with these patches showed up to a 25% reduction
  in NAND writes in a RocksDB setup.

- Software caching solutions can make more intelligent decisions
  on how and where to place data.

Contrary to previous patches, we're not exposing numeric stream values anymore.
I've previously advocated for just doing a set of hints that makes sense
instead. See the coverage from the LSFMM summit this year:

https://lwn.net/Articles/717755/

This patchset attempts to do that. We add an fcntl(2) interface to
get/set these types of hints. We define 4 hints that pertain to
data write life times:

RWH_WRITE_LIFE_SHORT	Data written with this flag is expected to have
			a high overwrite rate, i.e. a short life time.

RWH_WRITE_LIFE_MEDIUM   Longer life time than SHORT

RWH_WRITE_LIFE_LONG Longer life time than MEDIUM

RWH_WRITE_LIFE_EXTREME  Longer life time than LONG

The idea is that these are relative values, so an application can
use them as it sees fit. The underlying device can then place
data appropriately, or is free to ignore the hint. It's just a hint.
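
For instance, tagging a file's writes as short-lived might look like
the minimal sketch below, assuming the uapi from patch 1/9 exposes
F_SET_RW_HINT and the RWH_WRITE_LIFE_* values:

	#include <fcntl.h>
	#include <stdint.h>

	/* tag fd so subsequent writes carry a short life time hint */
	static int set_short_hint(int fd)
	{
		uint64_t hint = RWH_WRITE_LIFE_SHORT;

		return fcntl(fd, F_SET_RW_HINT, &hint);
	}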

A branch based on current master can be pulled
from here:

git://git.kernel.dk/linux-block write-stream.8

Changes since v7:

- NVMe: change 'streams' parameter to be a bool enable/disable. We
  hardwire the number of streams anyway and use the appropriate amount,
  so no point in exposing this value.
- NVMe: collapse stream values appropriately, instead of just doing
  a basic MOD.
- Get rid of pwritev2(2) flags. Just use the fcntl(2) interface.
- Collapse some patches
- Change fcntl(2) interface to get/set values from a user supplied
  64-bit pointer.
- Move inode-to-iocb mask setting to iocb_flags().

Changes since v6:

- Rewrite NVMe write stream assignment
- Change NVMe stream assignment to be per-controller, not per-ns. Then
  we can use the same IDs across name spaces, and we don't have to do
  lazy setup of streams.
- If streams are enabled on nvme, set io min/opt and discard
  granularity based on the stream params reported.
- Fixup F_SET_RW_HINT definition, it was 20, should have been 12.

Changes since v5:

- Change enum write_hint to enum rw_hint.
- Change fcntl() interface to be read/write generic
- Bring enum rw_hint all the way to bio/request
- Change references to streams in changelogs and debugfs interface
- Rebase to master to resolve blkdev.h conflict
- Reshuffle patches so the WRITE_LIFE_* hints and type come first. Allowed
  me to merge two block patches as well.

Changes since v4:

- Add enum write_hint and the WRITE_HINT_* values. This is what we
  use internally (until transformed to req/bio flags), and what is
  exposed to user space with the fcntl() interface. Maps directly
  to the RWF_WRITE_LIFE_* values.
- Add fcntl() interface for getting/setting hint values.
- Get rid of inode ->i_write_hint, encode the 3 bits of hint info
  in the inode flags intead.
- Allow a write with no hint to clear the old hint. Previously we
  only changed the hint if a new valid hint was given, not if no
  hint was passed in.
- Shrink flag space grabbed from 4 to 3 bits for RWF_* and the inode
  flags.

Changes since v3:

- Change any naming of stream ID to write hint.
- Various little API changes, suggested by Christoph
- Cleanup the NVMe bits, dump the debug info.
- Change NVMe to lazily allocate the streams.
- Various NVMe error handling improvements and command checking.

Changes since v2:

- Get rid of bio->bi_stream and replace with four request/bio flags.
  These map directly to the RWF_WRITE_* flags that the user passes in.
- Cleanup the NVMe stream setting.
- Drivers now responsible for updating the queue stream write counter,
  as they determine what stream to map a given flag to.

Changes since v1:

- Guard queue stream stats to ensure we don't mess up memory, if
  bio_stream() ever were to return a larger value than we support.
- NVMe: ensure we set the stream modulo the name space defined count.
- Cleanup the RWF_ and IOCB_ flags. Set aside 4 bits, and just store
  the stream value in there. This makes the passing of stream ID from
  RWF_ space to IOCB_ (and IOCB_ to bio) more efficient, and cleans it
  up in general.
- Kill the block internal definitions of the stream type, we don't need
  them anymore. See above.

-- 
Jens Axboe



[PATCH 4/9] fs: add O_DIRECT support for sending down write life time hints

2017-06-19 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/block_dev.c | 2 ++
 fs/direct-io.c | 2 ++
 fs/iomap.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index dd91c99e9ba0..30e1fb65c2fa 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -183,6 +183,8 @@ static unsigned int dio_bio_write_op(struct kiocb *iocb)
/* avoid the need for a I/O completion work item */
if (iocb->ki_flags & IOCB_DSYNC)
op |= REQ_FUA;
+
+   op |= write_hint_to_opf(iocb_write_hint(iocb));
return op;
 }
 
diff --git a/fs/direct-io.c b/fs/direct-io.c
index e8baaabebf13..9e9adca0c592 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -385,6 +385,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
else
bio->bi_end_io = dio_bio_end_io;
 
+   bio->bi_opf |= write_hint_to_opf(iocb_write_hint(dio->iocb));
+
sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap.c b/fs/iomap.c
index 18f2f2b8ba2c..63b9f87d9461 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -804,6 +804,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
 
if (dio->flags & IOMAP_DIO_WRITE) {
bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC | 
REQ_IDLE);
+   bio->bi_opf |= 
write_hint_to_opf(inode_write_hint(inode));
task_io_account_write(bio->bi_iter.bi_size);
} else {
bio_set_op_attrs(bio, REQ_OP_READ, 0);
-- 
2.7.4



[PATCH 9/9] nvme: add support for streams and directives

2017-06-19 Thread Jens Axboe
This adds support for Directives in NVMe, in particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that the device can do so most efficiently. If an application is
managing and writing data with different life times, mixing data with
different retention characteristics onto the same flash locations can
cause write amplification to grow. This, in turn, will reduce
performance and the life time of the device.

We default to allocating 4 streams, controller wide, so we can use them
on all namespaces. This is configurable with the 'streams' module
parameter. If a write hint is set on a write, flag it as such before
sending it to the device.

Signed-off-by: Jens Axboe 
---
 drivers/nvme/host/core.c | 189 +--
 drivers/nvme/host/nvme.h |   4 +
 include/linux/nvme.h |  48 
 3 files changed, 236 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index aee37b73231d..fa1acfa27fa4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -68,6 +68,10 @@ MODULE_PARM_DESC(force_apst, "allow APST for newly 
enumerated devices even if qu
 struct workqueue_struct *nvme_wq;
 EXPORT_SYMBOL_GPL(nvme_wq);
 
+static bool streams = true;
+module_param(streams, bool, 0644);
+MODULE_PARM_DESC(streams, "use streams, if available");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -297,6 +301,141 @@ struct request *nvme_alloc_request(struct request_queue 
*q,
 }
 EXPORT_SYMBOL_GPL(nvme_alloc_request);
 
+/*
+ * Returns the number of streams allocated for use by the host, or
+ * -1 on error.
+ */
+static int nvme_streams_allocate(struct nvme_ctrl *ctrl, unsigned int nstreams)
+{
+   struct nvme_command c;
+   union nvme_result res;
+   int ret;
+
+   memset(&c, 0, sizeof(c));
+
+   c.directive.opcode = nvme_admin_directive_recv;
+   c.directive.nsid = cpu_to_le32(0xffffffff);
+   c.directive.doper = NVME_DIR_RCV_ST_OP_RESOURCE;
+   c.directive.dtype = NVME_DIR_STREAMS;
+   c.directive.endir = nstreams;
+
+   ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res, NULL, 0, 0,
+   NVME_QID_ANY, 0, 0);
+   if (ret)
+   return -1;
+
+   return le32_to_cpu(res.u32) & 0xffff;
+}
+
+static int nvme_enable_streams(struct nvme_ctrl *ctrl)
+{
+   struct nvme_command c;
+
+   memset(&c, 0, sizeof(c));
+
+   c.directive.opcode = nvme_admin_directive_send;
+   c.directive.nsid = cpu_to_le32(0xffffffff);
+   c.directive.doper = NVME_DIR_SND_ID_OP_ENABLE;
+   c.directive.dtype = NVME_DIR_IDENTIFY;
+   c.directive.tdtype = NVME_DIR_STREAMS;
+   c.directive.endir = NVME_DIR_ENDIR;
+
+   return nvme_submit_sync_cmd(ctrl->admin_q, &c, NULL, 0);
+}
+
+static int nvme_get_stream_params(struct nvme_ctrl *ctrl,
+ struct streams_directive_params *s, u32 nsid)
+{
+   struct nvme_command c;
+
+   memset(&c, 0, sizeof(c));
+   memset(s, 0, sizeof(*s));
+
+   c.directive.opcode = nvme_admin_directive_recv;
+   c.directive.nsid = cpu_to_le32(nsid);
+   c.directive.numd = sizeof(*s);
+   c.directive.doper = NVME_DIR_RCV_ST_OP_PARAM;
+   c.directive.dtype = NVME_DIR_STREAMS;
+
+   return nvme_submit_sync_cmd(ctrl->admin_q, &c, s, sizeof(*s));
+}
+
+static int nvme_setup_directives(struct nvme_ctrl *ctrl)
+{
+   struct streams_directive_params s;
+   unsigned int nstreams;
+   int ret;
+
+   if (!(ctrl->oacs & NVME_CTRL_OACS_DIRECTIVES))
+   return 0;
+   if (!streams)
+   return 0;
+
+   ret = nvme_enable_streams(ctrl);
+   if (ret)
+   return ret;
+
+   ret = nvme_get_stream_params(ctrl, &s, 0xffffffff);
+   if (ret)
+   return ret;
+
+   ctrl->nssa = le16_to_cpu(s.nssa);
+
+   nstreams = min_t(unsigned int, ctrl->nssa, BLK_MAX_WRITE_HINTS - 1);
+   ret = nvme_streams_allocate(ctrl, nstreams);
+   if (ret < 0)
+   return ret;
+
+   /* require at least 2 streams to use them effectively */
+   if (ret > 1) {
+   ret = min(ret, BLK_MAX_WRITE_HINTS - 1);
+   ctrl->nr_streams = ret;
+   dev_info(ctrl->device, "successfully enabled %d streams\n", 
ret);
+   }
+
+   return 0;
+}
+
+/*
+ * Write hint number to stream mappings
+ */
+static const unsigned int
+stream_mappings[BLK_MAX_WRITE_HINTS][BLK_MAX_WRITE_HINTS] = {
+   /* 0 or 1 stream, we don't use streams */
+   { 0, },
+   { 0, },
+   /* collapse short+medium to short, and long+extreme to medium */
+   { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_SHORT,
+   WRITE_LIFE_MEDIUM, WRITE_LIFE_MEDIUM },
+   /* collapse long+extreme to long */
+   { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_MEDIUM,
+   WRITE_LIFE_LONG, WRITE_LIFE_LONG },
+   /* 4 streams, a 1:1 mapping */
+   { WRITE_LIFE_NONE, WRITE_LIFE_SHORT, WRITE_LIFE_MEDIUM,
+   WRITE_LIFE_LONG, WRITE_LIFE_EXTREME },
+};

[PATCH 1/9] fs: add fcntl() interface for setting/getting write life time hints

2017-06-19 Thread Jens Axboe
Define a set of write life time hints (WRITE_LIFE_NONE, WRITE_LIFE_SHORT,
WRITE_LIFE_MEDIUM, WRITE_LIFE_LONG, WRITE_LIFE_EXTREME) and add an fcntl
interface for querying these hints, and also for setting them:

F_GET_RW_HINT   Returns the read/write hint set.

F_SET_RW_HINT   Pass one of the above write hints.

The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags,
and pass them to the kiocb flags as well. This is in preparation
for utilizing these hints in the block layer, to guide on-media
data placement.

/*
 * writehint.c: check or set a file/inode write hint
 */

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static char *str[] = { "WRITE_LIFE_NONE", "WRITE_LIFE_SHORT",
"WRITE_LIFE_MEDIUM", "WRITE_LIFE_LONG",
"WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
uint64_t hint = -1ULL;
int fd, ret;

if (argc < 2) {
fprintf(stderr, "%s: dev <hint>\n", argv[0]);
return 1;
}

fd = open(argv[1], O_RDONLY);
if (fd < 0) {
perror("open");
return 2;
}

if (argc > 2)
hint = atoi(argv[2]);

if (hint == -1ULL) {
ret = fcntl(fd, F_GET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_GET_RW_HINT");
return 3;
}
} else {
ret = fcntl(fd, F_SET_RW_HINT, &hint);
if (ret < 0) {
perror("fcntl: F_SET_RW_HINT");
return 4;
}
}

printf("%s: %shint %s\n", argv[1], hint != -1ULL ? "set " : "", 
str[hint]);
close(fd);
return 0;
}

Signed-off-by: Jens Axboe 
---
 fs/fcntl.c | 43 
 fs/inode.c | 11 +
 include/linux/fs.h | 61 ++
 include/uapi/linux/fcntl.h | 15 
 4 files changed, 130 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f4e7267d117f..113b78c11631 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -243,6 +243,45 @@ static int f_getowner_uids(struct file *filp, unsigned 
long arg)
 }
 #endif
 
+static long fcntl_rw_hint(struct file *file, unsigned int cmd,
+ u64 __user *ptr)
+{
+   struct inode *inode = file_inode(file);
+   long ret = 0;
+   u64 hint;
+
+   switch (cmd) {
+   case F_GET_RW_HINT:
+   hint = mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT);
+   if (put_user(hint, ptr))
+   ret = -EFAULT;
+   break;
+   case F_SET_RW_HINT:
+   if (get_user(hint, ptr)) {
+   ret = -EFAULT;
+   break;
+   }
+   switch (hint) {
+   case WRITE_LIFE_NONE:
+   case WRITE_LIFE_SHORT:
+   case WRITE_LIFE_MEDIUM:
+   case WRITE_LIFE_LONG:
+   case WRITE_LIFE_EXTREME:
+   inode_set_write_hint(inode, hint);
+   ret = 0;
+   break;
+   default:
+   ret = -EINVAL;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   break;
+   }
+
+   return ret;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
 {
@@ -337,6 +376,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned 
long arg,
case F_GET_SEALS:
err = shmem_fcntl(filp, cmd, arg);
break;
+   case F_GET_RW_HINT:
+   case F_SET_RW_HINT:
+   err = fcntl_rw_hint(filp, cmd, (u64 __user *) arg);
+   break;
default:
break;
}
diff --git a/fs/inode.c b/fs/inode.c
index db5914783a71..defb015a2c6d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2120,3 +2120,14 @@ struct timespec current_time(struct inode *inode)
return timespec_trunc(now, inode->i_sb->s_time_gran);
 }
 EXPORT_SYMBOL(current_time);
+
+void inode_set_write_hint(struct inode *inode, enum rw_hint hint)
+{
+   unsigned int flags = write_hint_to_mask(hint, S_WRITE_LIFE_SHIFT);
+
+   if (flags != mask_to_write_hint(inode->i_flags, S_WRITE_LIFE_SHIFT)) {
+   inode_lock(inode);
+   inode_set_flags(inode, flags, S_WRITE_LIFE_MASK);
+   inode_unlock(inode);
+   }
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 023f0324762b..8720251cc153 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -270,6 +270,12 @@ struct writeback_control;
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
 
+/*
+ * Steal 3 bits for stream information, this allows 8 valid hints
+ */

[PATCH 5/9] fs: add support for buffered writeback to pass down write hints

2017-06-19 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/buffer.c | 14 +-
 fs/mpage.c  |  1 +
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 306b720f7383..1259524715c8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -49,7 +49,7 @@
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc);
+unsigned int write_hint, struct writeback_control *wbc);
 
 #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)
 
@@ -1829,7 +1829,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -1883,7 +1884,8 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
-   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh, wbc);
+   submit_bh_wbc(REQ_OP_WRITE, write_flags, bh,
+   inode_write_hint(inode), wbc);
nr_underway++;
}
bh = next;
@@ -3091,7 +3093,7 @@ void guard_bio_eod(int op, struct bio *bio)
 }
 
 static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
-struct writeback_control *wbc)
+unsigned int write_hint, struct writeback_control *wbc)
 {
struct bio *bio;
 
@@ -3134,6 +3136,8 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
op_flags |= REQ_META;
if (buffer_prio(bh))
op_flags |= REQ_PRIO;
+
+   op_flags |= write_hint_to_opf(write_hint);
bio_set_op_attrs(bio, op, op_flags);
 
submit_bio(bio);
@@ -3142,7 +3146,7 @@ static int submit_bh_wbc(int op, int op_flags, struct 
buffer_head *bh,
 
 int submit_bh(int op, int op_flags, struct buffer_head *bh)
 {
-   return submit_bh_wbc(op, op_flags, bh, NULL);
+   return submit_bh_wbc(op, op_flags, bh, 0, NULL);
 }
 EXPORT_SYMBOL(submit_bh);
 
diff --git a/fs/mpage.c b/fs/mpage.c
index 9524fdde00c2..d8a750873bf4 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -615,6 +615,7 @@ static int __mpage_writepage(struct page *page, struct 
writeback_control *wbc,
goto confused;
 
wbc_init_bio(wbc, bio);
+   bio->bi_opf |= write_hint_to_opf(inode_write_hint(inode));
}
 
/*
-- 
2.7.4



[PATCH 6/9] ext4: add support for passing in write hints for buffered writes

2017-06-19 Thread Jens Axboe
Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 fs/ext4/page-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 930ca0fc9a0f..92834b702728 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -350,6 +350,7 @@ void ext4_io_submit(struct ext4_io_submit *io)
if (bio) {
int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
  REQ_SYNC : 0;
+   io_op_flags |= 
write_hint_to_opf(inode_write_hint(io->io_end->inode));
bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
submit_bio(io->io_bio);
}
@@ -397,6 +398,7 @@ static int io_submit_add_bh(struct ext4_io_submit *io,
ret = io_submit_init_bio(io, bh);
if (ret)
return ret;
+   io->io_bio->bi_opf |= 
write_hint_to_opf(inode_write_hint(inode));
}
ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh));
if (ret != bh->b_size)
-- 
2.7.4



[PATCH 3/9] blk-mq: expose stream write hints through debugfs

2017-06-19 Thread Jens Axboe
Useful to verify that things are working the way they should.
Reading the file will return the number of kb written with each
write hint. Writing the file will reset the statistics. No care
is taken to ensure that we don't race on updates.

Drivers will write to q->write_hints[] if they handle a given
write hint.

Reviewed-by: Andreas Dilger 
Signed-off-by: Jens Axboe 
---
 block/blk-mq-debugfs.c | 24 
 include/linux/blkdev.h |  3 +++
 2 files changed, 27 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9edebbdce0bd..9ebc2945f991 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -135,6 +135,29 @@ static void print_stat(struct seq_file *m, struct 
blk_rq_stat *stat)
}
 }
 
+static int queue_write_hint_show(void *data, struct seq_file *m)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+   seq_printf(m, "hint%d: %llu\n", i, q->write_hints[i]);
+
+   return 0;
+}
+
+static ssize_t queue_write_hint_store(void *data, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+   struct request_queue *q = data;
+   int i;
+
+   for (i = 0; i < BLK_MAX_WRITE_HINTS; i++)
+   q->write_hints[i] = 0;
+
+   return count;
+}
+
 static int queue_poll_stat_show(void *data, struct seq_file *m)
 {
struct request_queue *q = data;
@@ -730,6 +753,7 @@ static const struct blk_mq_debugfs_attr 
blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
{"requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops},
{"state", 0600, queue_state_show, queue_state_write},
+   {"write_hints", 0600, queue_write_hint_show, queue_write_hint_store},
{},
 };
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 22cfba64ce81..687394b70924 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -586,6 +586,9 @@ struct request_queue {
 
size_t  cmd_size;
void*rq_alloc_data;
+
+#define BLK_MAX_WRITE_HINTS5
+   u64 write_hints[BLK_MAX_WRITE_HINTS];
 };
 
 #define QUEUE_FLAG_QUEUED  1   /* uses generic tag queueing */
-- 
2.7.4


