Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better
On 2017/9/30 11:17 AM, Michael Lyle wrote:
> Coly--
>
> What you say is correct-- it has a few changes from current behavior.
>
> - When writeback rate is low, it is more willing to do contiguous
> I/Os. This provides an opportunity for the IO scheduler to combine
> operations together. The cost of doing 5 contiguous I/Os and 1 I/O is
> usually about the same on spinning disks, because most of the cost is
> seeking and rotational latency-- the actual sequential I/O bandwidth
> is very high. This is a benefit.

Hi Mike,

Yes I can see it.

> - When writeback rate is medium, it does I/O more efficiently. e.g.
> if the current writeback rate is 10MB/sec, and there are two
> contiguous 1MB segments, they would not presently be combined. A 1MB
> write would occur, then we would increase the delay counter by 100ms,
> and then the next write would wait; this new code would issue 2 1MB
> writes one after the other, and then sleep 200ms. On a disk that does
> 150MB/sec sequential, and has a 7ms seek time, this uses the disk for
> 13ms + 7ms, compared to the old code that does 13ms + 7ms * 2. This
> is the difference between using 10% of the disk's I/O throughput and
> 13% of the disk's throughput to do the same work.

If writeback_rate is not at its minimum value, it means there are
front-end write requests in flight. In that case, back-end writeback
I/O should yield throughput to the front-end I/O; otherwise
applications will observe increased I/O latency, especially when the
dirty percentage is not very high. For enterprise workloads, this
change hurts performance.

A desired behavior for low-latency enterprise workloads is: when the
dirty percentage is low, as soon as there is front-end I/O, back-end
writeback should run at the minimum rate. This patch will introduce
unstable and unpredictable I/O latency. Unless writeback seeking is a
real performance bottleneck, enterprise users at least will care more
about front-end I/O latency.

> - When writeback rate is very high (e.g. can't be obtained), there is
> not much difference currently, BUT:
>
> Patch 5 is very important. Right now, if there are many writebacks
> happening at once, the cached blocks can be read in any order. This
> means that if we want to writeback blocks 1,2,3,4,5 we could actually
> end up issuing the write I/Os to the backing device as 3,1,4,2,5, with
> delays between them. This is likely to make the disk seek a lot.
> Patch 5 provides an ordering property to ensure that the writes get
> issued in LBA order to the backing device.

This method is helpful only when the writeback I/Os are not issued
continuously; otherwise, if they are issued within slice_idle, the
underlying elevator will reorder or merge the I/Os into larger
requests anyway.

> ***The next step in this line of development (patch 6 ;) is to link
> groups of contiguous I/Os into a list in the dirty_io structure. To
> know whether the "next I/Os" will be contiguous, we need to scan ahead
> like the new code in patch 4 does. Then, in turn, we can plug the
> block device, and issue the contiguous writes together. This allows
> us to guarantee that the I/Os will be properly merged and optimized by
> the underlying block IO scheduler. Even with patch 5, currently the
> I/Os end up imperfectly combined, and the block layer ends up issuing
> writes 1, then 2,3, then 4,5. This is great that things are combined
> some, but it could be combined into one big request.*** To get this
> benefit, it requires something like what was done in patch 4.

Hmm, if you move the dirty I/O from the btree into a dirty_io list and
then perform the I/O, there is a risk that dirty data might be lost if
the machine powers down during writeback. And if you continuously
issue dirty I/O while removing it from the btree at the same time, you
will introduce more latency to front-end I/O...

Also, the plug list will be unplugged automatically by default when a
context switch happens. If you perform read I/Os on the btree, a
context switch is likely to happen, so you won't keep a large bio
list anyway...

IMHO when the writeback rate is low, especially when the backing hard
disk is not the bottleneck, grouping contiguous I/Os in the bcache code
does not help writeback performance much. The only benefit is that
fewer I/Os are issued when front-end I/O is low or idle, but most
users do not care about that, especially enterprise users.

> I believe patch 4 is useful on its own, but I have this and other
> pieces of development that depend upon it.

Current bcache code works well in most writeback workloads. I just
worry that implementing an elevator in bcache's writeback logic is a
big investment with little return.

-- 
Coly Li
[PATCH v2] blk-throttle: fix possible io stall when upgrade to max
From: Joseph Qi 

There is a case which will lead to io stall. The case is described as
follows.

/test1
  |-subtest1
/test2
  |-subtest2

And subtest1 and subtest2 each has 32 queued bios already.

Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
bios as follows:

1) tg=subtest1, do nothing;
2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
   left, no need to schedule next dispatch;
3) tg=subtest2, do nothing;
4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
   left, no need to schedule next dispatch;
5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
   test2 to /, 8 queued bios from test1 to /, and 8 queued bios from
   test2 to /; note that test1 and test2 each still has 16 queued bios
   left;
6) tg=/, try to schedule next dispatch, but since disptime is now
   (updated in tg_update_disptime, wait=0), the pending timer is not
   scheduled in fact;
7) In throtl_upgrade_state it totally dispatches 32 queued bios, with
   32 left over: test1 and test2 each has 16 queued bios;
8) throtl_pending_timer_fn sees the left-over bios, but can do nothing,
   because throtl_select_dispatch returns 0, and test1/test2 has no
   pending tg.

The blktrace shows the following:

8,32 0 0 2.539007641 0 m N throtl upgrade to max
8,32 0 0 2.539072267 0 m N throtl /test2 dispatch nr_queued=16 read=0 write=16
8,32 7 0 2.539077142 0 m N throtl /test1 dispatch nr_queued=16 read=0 write=16

So force schedule dispatch if there are pending children.

Signed-off-by: Joseph Qi 
---
 block/blk-throttle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0fea76a..17816a0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1911,11 +1911,11 @@ static void throtl_upgrade_state(struct throtl_data *td)
 		tg->disptime = jiffies - 1;
 		throtl_select_dispatch(sq);
-		throtl_schedule_next_dispatch(sq, false);
+		throtl_schedule_next_dispatch(sq, true);
 	}
 	rcu_read_unlock();
 	throtl_select_dispatch(&td->service_queue);
-	throtl_schedule_next_dispatch(&td->service_queue, false);
+	throtl_schedule_next_dispatch(&td->service_queue, true);
 	queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
--
1.9.4
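For context, the scheduling helper whose "force" argument the patch flips
looks roughly like the sketch below (paraphrased from block/blk-throttle.c
of that era; line breaks and comments are mine, so treat it as an
approximation rather than the exact source). With force == false and a
disptime that is already due, no pending timer is armed and the left-over
bios in steps 7)-8) are never dispatched; force == true arms the timer
whenever children are still pending.

    static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq,
                                              bool force)
    {
            /* nothing queued below this service queue, nothing to schedule */
            if (!sq->nr_pending)
                    return true;

            update_min_dispatch_time(sq);

            /* arm the timer if forced, or if the dispatch time is in the future */
            if (force || time_after(sq->first_pending_disptime, jiffies)) {
                    throtl_schedule_pending_timer(sq, sq->first_pending_disptime);
                    return true;
            }

            /* otherwise tell the caller to keep dispatching itself */
            return false;
    }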
[PATCH V7 3/6] block: pass flags to blk_queue_enter()
We need to pass PREEMPT flags to blk_queue_enter() for allocating request with RQF_PREEMPT in the following patch. Tested-by: Oleksandr Natalenko Tested-by: Martin Steigerwald Cc: Bart Van Assche Signed-off-by: Ming Lei --- block/blk-core.c | 10 ++ block/blk-mq.c | 5 +++-- block/blk-timeout.c| 2 +- fs/block_dev.c | 4 ++-- include/linux/blkdev.h | 7 ++- 5 files changed, 18 insertions(+), 10 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index a5011c824ac6..7d5040a6d5a4 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -766,7 +766,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) } EXPORT_SYMBOL(blk_alloc_queue); -int blk_queue_enter(struct request_queue *q, bool nowait) +int blk_queue_enter(struct request_queue *q, unsigned flags) { while (true) { int ret; @@ -774,7 +774,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait) if (percpu_ref_tryget_live(&q->q_usage_counter)) return 0; - if (nowait) + if (flags & BLK_REQ_NOWAIT) return -EBUSY; /* @@ -1408,7 +1408,8 @@ static struct request *blk_old_get_request(struct request_queue *q, /* create ioc upfront */ create_io_context(gfp_mask, q->node); - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM)); + ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ? + BLK_REQ_NOWAIT : 0); if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); @@ -2215,7 +2216,8 @@ blk_qc_t generic_make_request(struct bio *bio) do { struct request_queue *q = bio->bi_disk->queue; - if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) { + if (likely(blk_queue_enter(q, (bio->bi_opf & REQ_NOWAIT) ? + BLK_REQ_NOWAIT : 0) == 0)) { struct bio_list lower, same; /* Create a fresh bio_list for all subordinate requests */ diff --git a/block/blk-mq.c b/block/blk-mq.c index 10c1f49f663d..45bff90e08f7 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -384,7 +384,8 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *rq; int ret; - ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT); + ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ? 
+ BLK_REQ_NOWAIT : 0); if (ret) return ERR_PTR(ret); @@ -423,7 +424,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, if (hctx_idx >= q->nr_hw_queues) return ERR_PTR(-EIO); - ret = blk_queue_enter(q, true); + ret = blk_queue_enter(q, BLK_REQ_NOWAIT); if (ret) return ERR_PTR(ret); diff --git a/block/blk-timeout.c b/block/blk-timeout.c index 17ec83bb0900..e803106a5e5b 100644 --- a/block/blk-timeout.c +++ b/block/blk-timeout.c @@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work) struct request *rq, *tmp; int next_set = 0; - if (blk_queue_enter(q, true)) + if (blk_queue_enter(q, BLK_REQ_NOWAIT)) return; spin_lock_irqsave(q->queue_lock, flags); diff --git a/fs/block_dev.c b/fs/block_dev.c index 93d088ffc05c..98cf2d7ee9d3 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return result; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false); @@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector, if (!ops->rw_page || bdev_get_integrity(bdev)) return -EOPNOTSUPP; - result = blk_queue_enter(bdev->bd_queue, false); + result = blk_queue_enter(bdev->bd_queue, 0); if (result) return result; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 02fa42d24b52..127f64c7012c 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -858,6 +858,11 @@ enum { BLKPREP_INVALID,/* invalid command, kill, return -EREMOTEIO */ }; +/* passed to blk_queue_enter */ +enum { + BLK_REQ_NOWAIT = (1 << 0), +}; + extern unsigned long blk_max_low_pfn, blk_max_pfn; /* @@ -963,7 +968,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, struct scsi_ioctl_command __user *); -extern int blk_queue_enter(struct request_queue *q, boo
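As a quick illustration of the new flag-based interface (a sketch added
here for clarity, not code taken from the patch): a caller that must not
sleep passes BLK_REQ_NOWAIT and handles -EBUSY, while a blocking caller
passes 0 and may wait on mq_freeze_wq; either way a successful enter is
paired with blk_queue_exit().

    /* non-blocking attempt to enter the queue (illustrative only) */
    if (blk_queue_enter(q, BLK_REQ_NOWAIT))
            return -EBUSY;          /* queue is frozen or dying, don't wait */

    /* ... allocate and submit work against q ... */

    blk_queue_exit(q);              /* drop the q_usage_counter reference */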
[PATCH V7 4/6] block: prepare for passing RQF_PREEMPT to request allocation
REQF_PREEMPT is a bit special because the request is required to be dispatched to lld even when SCSI device is quiesced. So this patch introduces __blk_get_request() and allows users to pass RQF_PREEMPT flag in, then we can allow to allocate request of RQF_PREEMPT when queue is in mode of PREEMPT ONLY which will be introduced in the following patch. Tested-by: Oleksandr Natalenko Tested-by: Martin Steigerwald Cc: Bart Van Assche Signed-off-by: Ming Lei --- block/blk-core.c | 19 +-- block/blk-mq.c | 3 +-- include/linux/blk-mq.h | 7 --- include/linux/blkdev.h | 17 ++--- 4 files changed, 28 insertions(+), 18 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 7d5040a6d5a4..95b1c5e50be3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1398,7 +1398,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op, } static struct request *blk_old_get_request(struct request_queue *q, - unsigned int op, gfp_t gfp_mask) + unsigned int op, gfp_t gfp_mask, + unsigned int flags) { struct request *rq; int ret = 0; @@ -1408,8 +1409,7 @@ static struct request *blk_old_get_request(struct request_queue *q, /* create ioc upfront */ create_io_context(gfp_mask, q->node); - ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ? - BLK_REQ_NOWAIT : 0); + ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK); if (ret) return ERR_PTR(ret); spin_lock_irq(q->queue_lock); @@ -1427,26 +1427,25 @@ static struct request *blk_old_get_request(struct request_queue *q, return rq; } -struct request *blk_get_request(struct request_queue *q, unsigned int op, - gfp_t gfp_mask) +struct request *__blk_get_request(struct request_queue *q, unsigned int op, + gfp_t gfp_mask, unsigned int flags) { struct request *req; + flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT; if (q->mq_ops) { - req = blk_mq_alloc_request(q, op, - (gfp_mask & __GFP_DIRECT_RECLAIM) ? - 0 : BLK_MQ_REQ_NOWAIT); + req = blk_mq_alloc_request(q, op, flags); if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn) q->mq_ops->initialize_rq_fn(req); } else { - req = blk_old_get_request(q, op, gfp_mask); + req = blk_old_get_request(q, op, gfp_mask, flags); if (!IS_ERR(req) && q->initialize_rq_fn) q->initialize_rq_fn(req); } return req; } -EXPORT_SYMBOL(blk_get_request); +EXPORT_SYMBOL(__blk_get_request); /** * blk_requeue_request - put a request back on queue diff --git a/block/blk-mq.c b/block/blk-mq.c index 45bff90e08f7..90b43f607e3c 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -384,8 +384,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, struct request *rq; int ret; - ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ? 
- BLK_REQ_NOWAIT : 0); + ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK); if (ret) return ERR_PTR(ret); diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 50c6485cb04f..066a676d7749 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -197,9 +197,10 @@ void blk_mq_free_request(struct request *rq); bool blk_mq_can_queue(struct blk_mq_hw_ctx *); enum { - BLK_MQ_REQ_NOWAIT = (1 << 0), /* return when out of requests */ - BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */ - BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */ + BLK_MQ_REQ_NOWAIT = BLK_REQ_NOWAIT, /* return when out of requests */ + BLK_MQ_REQ_PREEMPT = BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */ + BLK_MQ_REQ_RESERVED = (1 << BLK_REQ_MQ_START_BIT), /* allocate from reserved pool */ + BLK_MQ_REQ_INTERNAL = (1 << (BLK_REQ_MQ_START_BIT + 1)), /* allocate internal/sched tag */ }; struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op, diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 127f64c7012c..68445adc8765 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -860,7 +860,10 @@ enum { /* passed to blk_queue_enter */ enum { - BLK_REQ_NOWAIT = (1 << 0), + BLK_REQ_NOWAIT = (1 << 0), + BLK_REQ_PREEMPT = (1 << 1), + BLK_REQ_MQ_START_BIT= 2, + BLK_REQ_BITS_MASK = (1U << BLK_REQ_MQ_START_BIT) - 1, }; extern unsigned long blk_max_low_pfn, blk_max_pfn; @@ -945,8 +948,9 @@ extern vo
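To show how the new entry point is meant to be used (illustration only;
the concrete caller appears in patch 6/6, where scsi_execute() is
converted): a request that must survive PREEMPT_ONLY mode is allocated
with BLK_REQ_PREEMPT, while everything else keeps using plain
blk_get_request(). Here "q" stands for the target request_queue and the
error handling is schematic.

    /* sketch of a caller that needs an RQF_PREEMPT request */
    struct request *req;

    req = __blk_get_request(q, REQ_OP_SCSI_IN, __GFP_RECLAIM,
                            BLK_REQ_PREEMPT);
    if (IS_ERR(req))
            return PTR_ERR(req);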
[PATCH V7 6/6] SCSI: set block queue at preempt only when SCSI device is put into quiesce
Simply quiescing the SCSI device and waiting for completion of the I/O
already dispatched to the SCSI queue isn't safe: it is easy to use up
the request pool, because all requests allocated beforehand can't be
dispatched while the device is put in QUIESCE. Then no request can be
allocated for RQF_PREEMPT, and the system may hang somewhere, such as
when sending sync_cache or start_stop commands in the system suspend
path.

Before quiescing SCSI, this patch sets the block queue to preempt-only
mode first, so no new normal request can enter the queue any more, and
all pending requests are drained once blk_set_preempt_only(true)
returns. Then RQF_PREEMPT requests can be allocated successfully during
SCSI quiescing.

This patch fixes a long-term I/O hang issue, in both the block legacy
path and blk-mq.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: sta...@vger.kernel.org
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..82c51619f1b7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,10 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
+	req = __blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+			BLK_REQ_PREEMPT);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -2928,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
 	int err;
 
+	/*
+	 * Simply quiesing SCSI device isn't safe, it is easy
+	 * to use up requests because all these allocated requests
+	 * can't be dispatched when device is put in QIUESCE.
+	 * Then no request can be allocated and we may hang
+	 * somewhere, such as system suspend/resume.
+	 *
+	 * So we set block queue in preempt only first, no new
+	 * normal request can enter queue any more, and all pending
+	 * requests are drained once blk_set_preempt_only()
+	 * returns. Only RQF_PREEMPT is allowed in preempt only mode.
+	 */
+	blk_set_preempt_only(sdev->request_queue, true);
+
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
 	mutex_unlock(&sdev->state_mutex);
 
-	if (err)
+	if (err) {
+		blk_set_preempt_only(sdev->request_queue, false);
 		return err;
+	}
 
 	scsi_run_queue(sdev->request_queue);
 	while (atomic_read(&sdev->device_busy)) {
@@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
 	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
 		scsi_run_queue(sdev->request_queue);
 	mutex_unlock(&sdev->state_mutex);
+
+	blk_set_preempt_only(sdev->request_queue, false);
 }
 EXPORT_SYMBOL(scsi_device_resume);
-- 
2.9.5
[PATCH V7 5/6] block: support PREEMPT_ONLY
When queue is in PREEMPT_ONLY mode, only RQF_PREEMPT request can be allocated and dispatched, other requests won't be allowed to enter I/O path. This is useful for supporting safe SCSI quiesce. Part of this patch is from Bart's '[PATCH v4 4∕7] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag'. Tested-by: Oleksandr Natalenko Tested-by: Martin Steigerwald Cc: Bart Van Assche Signed-off-by: Ming Lei --- block/blk-core.c | 26 -- include/linux/blkdev.h | 5 + 2 files changed, 29 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 95b1c5e50be3..bb683bfe37b2 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -346,6 +346,17 @@ void blk_sync_queue(struct request_queue *q) } EXPORT_SYMBOL(blk_sync_queue); +void blk_set_preempt_only(struct request_queue *q, bool preempt_only) +{ + blk_mq_freeze_queue(q); + if (preempt_only) + queue_flag_set_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q); + else + queue_flag_clear_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q); + blk_mq_unfreeze_queue(q); +} +EXPORT_SYMBOL(blk_set_preempt_only); + /** * __blk_run_queue_uncond - run a queue whether or not it has been stopped * @q: The queue to run @@ -771,9 +782,18 @@ int blk_queue_enter(struct request_queue *q, unsigned flags) while (true) { int ret; + /* +* preempt_only flag has to be set after queue is frozen, +* so it can be checked here lockless and safely +*/ + if (blk_queue_preempt_only(q)) { + if (!(flags & BLK_REQ_PREEMPT)) + goto slow_path; + } + if (percpu_ref_tryget_live(&q->q_usage_counter)) return 0; - + slow_path: if (flags & BLK_REQ_NOWAIT) return -EBUSY; @@ -787,7 +807,9 @@ int blk_queue_enter(struct request_queue *q, unsigned flags) smp_rmb(); ret = wait_event_interruptible(q->mq_freeze_wq, - !atomic_read(&q->mq_freeze_depth) || + (!atomic_read(&q->mq_freeze_depth) && + ((flags & BLK_REQ_PREEMPT) || +!blk_queue_preempt_only(q))) || blk_queue_dying(q)); if (blk_queue_dying(q)) return -ENODEV; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 68445adc8765..b01a0c6bb1f0 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -631,6 +631,7 @@ struct request_queue { #define QUEUE_FLAG_REGISTERED 26 /* queue has been registered to a disk */ #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED28 /* queue has been quiesced */ +#define QUEUE_FLAG_PREEMPT_ONLY29 /* only process REQ_PREEMPT requests */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\ (1 << QUEUE_FLAG_STACKABLE)| \ @@ -735,6 +736,10 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q) ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ REQ_FAILFAST_DRIVER)) #define blk_queue_quiesced(q) test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags) +#define blk_queue_preempt_only(q) \ + test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags) + +extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only); static inline bool blk_account_rq(struct request *rq) { -- 2.9.5
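Restating the admission rule that blk_queue_enter() implements after this
patch, as a simplified sketch (my own summary for readability; the dying
check and error paths are left out, and the names are the ones used in the
diff above):

    /* simplified decision loop; not an extract from the patch */
    for (;;) {
            if (!blk_queue_preempt_only(q) || (flags & BLK_REQ_PREEMPT)) {
                    if (percpu_ref_tryget_live(&q->q_usage_counter))
                            return 0;               /* entered the queue */
            }
            if (flags & BLK_REQ_NOWAIT)
                    return -EBUSY;
            /* otherwise sleep on q->mq_freeze_wq until the queue is
             * unfrozen and either preempt-only is cleared or we carry
             * BLK_REQ_PREEMPT; -ENODEV if the queue dies meanwhile */
    }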
[PATCH V7 1/6] blk-mq: only run hw queues for blk-mq
This patch just makes it explicit that hardware queues are only run for
blk-mq, by checking q->mq_ops in blk_freeze_queue_start(); this matters
once later patches in this series start freezing legacy request_queues
as well.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Reviewed-by: Johannes Thumshirn 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..6fd9f86fc86d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5
[PATCH V7 0/6] block/scsi: safe SCSI quiescing
Hi Jens,

Please consider this patchset for V4.15; it fixes one kind of long-term
I/O hang issue in both the block legacy path and blk-mq.

The current SCSI quiesce isn't safe and easily triggers I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() just simply waits for completion of the I/Os
already dispatched to the SCSI stack. That isn't enough at all.

Because new requests can still be coming in, but none of the allocated
requests can be dispatched successfully, the request pool can be
consumed up easily. Then a request with RQF_PREEMPT can't be allocated
and waits forever, and the system hangs forever, such as during system
suspend or while sending SCSI domain validation in the case of
transport_spi.

Both the I/O hang inside system suspend[1] and the one during SCSI
domain validation were reported before.

This patchset introduces a preempt-only mode and solves the issue by
allowing only RQF_PREEMPT requests during SCSI quiesce.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
them all.

V7:
	- add Reviewed-by & Tested-by
	- one line change in patch 5 for checking preempt request

V6:
	- borrow Bart's idea of preempt only, with clean
	  implementation(patch 5/patch 6)
	- needn't any external driver's dependency, such as MD's change

V5:
	- fix one tiny race by introducing blk_queue_enter_preempt_freeze()
	  given this change is small enough compared with V4, I added
	  tested-by directly

V4:
	- reorganize patch order to make it more reasonable
	- support nested preempt freeze, as required by SCSI transport spi
	- check preempt freezing in slow path of blk_queue_enter()
	- add "SCSI: transport_spi: resume a quiesced device"
	- wake up freeze queue in setting dying for both blk-mq and legacy
	- rename blk_mq_[freeze|unfreeze]_queue() in one patch
	- rename .mq_freeze_wq and .mq_freeze_depth
	- improve comment

V3:
	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
	- call blk_queue_enter_live() only when queue is preempt frozen
	- cleanup a bit on the implementation of preempt freeze
	- only patch 6 and 7 are changed

V2:
	- drop the 1st patch in V1 because percpu_ref_is_dying() is enough
	  as pointed by Tejun
	- introduce preempt version of blk_[freeze|unfreeze]_queue
	- sync between preempt freeze and normal freeze
	- fix warning from percpu-refcount as reported by Oleksandr

[1] https://marc.info/?t=150340250100013&r=3&w=2

Thanks,
Ming

Ming Lei (6):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  block: pass flags to blk_queue_enter()
  block: prepare for passing RQF_PREEMPT to request allocation
  block: support PREEMPT_ONLY
  SCSI: set block queue at preempt only when SCSI device is put into
    quiesce

 block/blk-core.c        | 63 +++--
 block/blk-mq.c          | 14 ---
 block/blk-timeout.c     |  2 +-
 drivers/scsi/scsi_lib.c | 25 +---
 fs/block_dev.c          |  4 ++--
 include/linux/blk-mq.h  |  7 +++---
 include/linux/blkdev.h  | 27 ++---
 7 files changed, 107 insertions(+), 35 deletions(-)

-- 
2.9.5
[PATCH V7 2/6] block: tracking request allocation with q_usage_counter
This usage is basically same with blk-mq, so that we can support to freeze legacy queue easily. Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved into blk_set_queue_dying() since both legacy and blk-mq may wait on the wait queue of .mq_freeze_wq. Tested-by: Oleksandr Natalenko Tested-by: Martin Steigerwald Reviewed-by: Hannes Reinecke Cc: Bart Van Assche Signed-off-by: Ming Lei --- block/blk-core.c | 14 ++ block/blk-mq.c | 7 --- 2 files changed, 14 insertions(+), 7 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 048be4aa6024..a5011c824ac6 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q) } spin_unlock_irq(q->queue_lock); } + + /* +* We need to ensure that processes currently waiting on +* the queue are notified as well. +*/ + wake_up_all(&q->mq_freeze_wq); } EXPORT_SYMBOL_GPL(blk_set_queue_dying); @@ -1395,16 +1401,21 @@ static struct request *blk_old_get_request(struct request_queue *q, unsigned int op, gfp_t gfp_mask) { struct request *rq; + int ret = 0; WARN_ON_ONCE(q->mq_ops); /* create ioc upfront */ create_io_context(gfp_mask, q->node); + ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM)); + if (ret) + return ERR_PTR(ret); spin_lock_irq(q->queue_lock); rq = get_request(q, op, NULL, gfp_mask); if (IS_ERR(rq)) { spin_unlock_irq(q->queue_lock); + blk_queue_exit(q); return rq; } @@ -1576,6 +1587,7 @@ void __blk_put_request(struct request_queue *q, struct request *req) blk_free_request(rl, req); freed_request(rl, sync, rq_flags); blk_put_rl(rl); + blk_queue_exit(q); } } EXPORT_SYMBOL_GPL(__blk_put_request); @@ -1857,8 +1869,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) * Grab a free request. This is might sleep but can not fail. * Returns with the queue unlocked. */ + blk_queue_enter_live(q); req = get_request(q, bio->bi_opf, bio, GFP_NOIO); if (IS_ERR(req)) { + blk_queue_exit(q); __wbt_done(q->rq_wb, wb_acct); if (PTR_ERR(req) == -ENOMEM) bio->bi_status = BLK_STS_RESOURCE; diff --git a/block/blk-mq.c b/block/blk-mq.c index 6fd9f86fc86d..10c1f49f663d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q) queue_for_each_hw_ctx(q, hctx, i) if (blk_mq_hw_queue_mapped(hctx)) blk_mq_tag_wakeup_all(hctx->tags, true); - - /* -* If we are called because the queue has now been marked as -* dying, we need to ensure that processes currently waiting on -* the queue are notified as well. -*/ - wake_up_all(&q->mq_freeze_wq); } bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx) -- 2.9.5
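The net effect for the legacy path, condensed into a sketch (my own
summary of the hunks above, not additional code): every legacy request
now pins q_usage_counter from allocation to free, which is what lets a
later freeze wait for legacy requests the same way it already waits for
blk-mq ones.

    /* allocation side (blk_old_get_request(), and blk_queue_enter_live()
     * in blk_queue_bio()): every request takes a q_usage_counter ref */
    ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
    if (ret)
            return ERR_PTR(ret);
    rq = get_request(q, op, NULL, gfp_mask);

    /* free side (__blk_put_request()): the reference is dropped again */
    blk_free_request(rl, req);
    blk_queue_exit(q);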
Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better
Coly-- What you say is correct-- it has a few changes from current behavior. - When writeback rate is low, it is more willing to do contiguous I/Os. This provides an opportunity for the IO scheduler to combine operations together. The cost of doing 5 contiguous I/Os and 1 I/O is usually about the same on spinning disks, because most of the cost is seeking and rotational latency-- the actual sequential I/O bandwidth is very high. This is a benefit. - When writeback rate is medium, it does I/O more efficiently. e.g. if the current writeback rate is 10MB/sec, and there are two contiguous 1MB segments, they would not presently be combined. A 1MB write would occur, then we would increase the delay counter by 100ms, and then the next write would wait; this new code would issue 2 1MB writes one after the other, and then sleep 200ms. On a disk that does 150MB/sec sequential, and has a 7ms seek time, this uses the disk for 13ms + 7ms, compared to the old code that does 13ms + 7ms * 2. This is the difference between using 10% of the disk's I/O throughput and 13% of the disk's throughput to do the same work. - When writeback rate is very high (e.g. can't be obtained), there is not much difference currently, BUT: Patch 5 is very important. Right now, if there are many writebacks happening at once, the cached blocks can be read in any order. This means that if we want to writeback blocks 1,2,3,4,5 we could actually end up issuing the write I/Os to the backing device as 3,1,4,2,5, with delays between them. This is likely to make the disk seek a lot. Patch 5 provides an ordering property to ensure that the writes get issued in LBA order to the backing device. ***The next step in this line of development (patch 6 ;) is to link groups of contiguous I/Os into a list in the dirty_io structure. To know whether the "next I/Os" will be contiguous, we need to scan ahead like the new code in patch 4 does. Then, in turn, we can plug the block device, and issue the contiguous writes together. This allows us to guarantee that the I/Os will be properly merged and optimized by the underlying block IO scheduler. Even with patch 5, currently the I/Os end up imperfectly combined, and the block layer ends up issuing writes 1, then 2,3, then 4,5. This is great that things are combined some, but it could be combined into one big request.*** To get this benefit, it requires something like what was done in patch 4. I believe patch 4 is useful on its own, but I have this and other pieces of development that depend upon it. Thanks, Mike
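The 10%-vs-13% claim above can be checked with a quick back-of-envelope
model (added purely to illustrate the arithmetic in the mail; the
numbers are the ones Michael quotes, nothing here is measured):

    #include <stdio.h>

    int main(void)
    {
            double seek_ms = 7.0, seq_mb_s = 150.0, chunk_mb = 1.0;
            /* two 1MB segments transfer in ~13.3ms at 150MB/s */
            double xfer_ms = 2.0 * chunk_mb / seq_mb_s * 1000.0;
            double separate = xfer_ms + 2.0 * seek_ms;      /* two seeks */
            double combined = xfer_ms + 1.0 * seek_ms;      /* one seek  */

            /* both cases move 2MB per 200ms of wall time at 10MB/s */
            printf("separate: %.1f%% busy\n", separate / 200.0 * 100.0); /* ~13.7% */
            printf("combined: %.1f%% busy\n", combined / 200.0 * 100.0); /* ~10.2% */
            return 0;
    }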
Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better
On 2017/9/27 3:32 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
>
> Hello Mike:
>
> For the second question, I thinks this modification is somewhat complex,
> cannot we do something simple to resolve it? I remember there were some
> patches trying to avoid too small writeback rate, Coly, is there any
> progress now?
>

Junhui,

That patch works well, but before I solve the latency of calculating
dirty stripe numbers, I won't push it upstream for now.

This patch does not conflict with my max-writeback-rate-when-idle
patch: this patch tries to fetch more dirty keys from the cache device
that are contiguous on the cached device, and assumes they can be
written back to the cached device contiguously.

For that purpose, if writeback_rate is high, dc->last_read already
works well. But when dc->writeback_rate is low, e.g. 8, even if
KEY_START(&w->key) == dc->last_read, the contiguous key will only be
submitted in the next delay cycle. I feel Michael wants to make larger
writeback I/Os and delay longer, so the backing cached device may be
woken up less often. This policy only works better than the current
dc->last_read behavior when writeback_rate is low, that is to say, when
front-end write I/O is low or there is no front-end write at all. I am
hesitant about whether it is worth modifying the general writeback
logic for this.

> ---
> Tang Junhui
>
>> Ah-- re #1 -- I was investigating earlier why not as much was combined
>> as I thought should be when idle. This is surely a factor. Thanks
>> for the catch-- KEY_OFFSET is correct. I will fix and retest.
>>
>> (Under heavy load, the correct thing still happens, but not under
>> light or intermediate load0.
>>
>> About #2-- I wanted to attain a bounded amount of "combining" of
>> operations. If we have 5 4k extents in a row to dispatch, it seems
>> really wasteful to issue them as 5 IOs 60ms apart, which the existing
>> code would be willing to do-- I'd rather do a 20k write IO (basically
>> the same cost as a 4k write IO) and then sleep 300ms. It is dependent
>> on the elevator/IO scheduler merging the requests. At the same time,
>> I'd rather not combine a really large request.
>>
>> It would be really neat to blk_plug the backing device during the
>> write issuance, but that requires further work.
>>
>> Thanks
>>
>> Mike
>>
>> On Tue, Sep 26, 2017 at 11:51 PM, wrote:
>>> From: Tang Junhui 
>>>
>>> Hello Lyle:
>>>
>>> Two questions:
>>> 1) In keys_contiguous(), you judge I/O contiguous in cache device, but not
>>> in backing device. I think you should judge it by backing device (remove
>>> PTR_CACHE() and use KEY_OFFSET() instead of PTR_OFFSET()?).
>>>
>>> 2) I did not see you combine samll contiguous I/Os to big I/O, so I think
>>> it is useful when writeback rate was low by avoiding single I/O write, but
>>> have no sense in high writeback rate, since previously it is also write
>>> I/Os asynchronously.
>>>
>>> ---
>>> Tang Junhui
>>>
Previously, there was some logic that attempted to immediately issue
writeback of backing-contiguous blocks when the writeback rate was
fast.

The previous logic did not have any limits on the aggregate size it
would issue, nor the number of keys it would combine at once. It would
also discard the chance to do a contiguous write when the writeback
rate was low-- e.g. at "background" writeback of target rate = 8, it
would not combine two adjacent 4k writes and would instead seek the
disk twice.

This patch imposes limits and explicitly understands the size of
contiguous I/O during issue. It also will combine contiguous I/O in all
circumstances, not just when writeback is requested to be relatively
fast.

It is a win on its own, but also lays the groundwork for skip writes to
short keys to make the I/O more sequential/contiguous.

Signed-off-by: Michael Lyle 

[snip code]

-- 
Coly Li
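For reference, the plugging idea mentioned in the quoted text above ("it
would be really neat to blk_plug the backing device") would look roughly
like the sketch below; this is an assumption about the future work being
discussed, not code from any of the posted patches.

    struct blk_plug plug;

    blk_start_plug(&plug);
    /* submit_bio() each bio of the contiguous writeback run, back to back */
    blk_finish_plug(&plug);     /* hands the whole batch to the elevator at once */

As noted in the reply above, the plug is also flushed implicitly on a
context switch, which is why interleaving btree reads with the
submission would defeat the batching.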
[PATCH v2] null_blk: add "no_sched" module parameter
Add an option that disables the io scheduler for the null block device.

Signed-off-by: weiping zhang 
---
 drivers/block/null_blk.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index bd92286..38f4a8c 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -154,6 +154,10 @@ enum {
 	NULL_Q_MQ	= 2,
 };
 
+static int g_no_sched;
+module_param_named(no_sched, g_no_sched, int, S_IRUGO);
+MODULE_PARM_DESC(no_sched, "No io scheduler");
+
 static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
@@ -1754,6 +1758,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
 	set->numa_node = nullb ? nullb->dev->home_node : g_home_node;
 	set->cmd_size = sizeof(struct nullb_cmd);
 	set->flags = BLK_MQ_F_SHOULD_MERGE;
+	if (g_no_sched)
+		set->flags |= BLK_MQ_F_NO_SCHED;
 	set->driver_data = NULL;
 
 	if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.4
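For completeness, the expected usage would be something like
"modprobe null_blk queue_mode=2 no_sched=1", after which
/sys/block/nullb0/queue/scheduler should report only "none" because
BLK_MQ_F_NO_SCHED keeps an elevator from being attached; this is an
illustrative example added here, not taken from the patch or its test
notes.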
Re: [PATCH] null_blk: add "no_sched" module parameter
On Fri, Sep 29, 2017 at 11:39:03PM +0200, Jens Axboe wrote: > On 09/29/2017 07:09 PM, weiping zhang wrote: > > add an option that disable io scheduler for null block device. > > > > Signed-off-by: weiping zhang > > --- > > drivers/block/null_blk.c | 6 +- > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c > > index bd92286..3c63863 100644 > > --- a/drivers/block/null_blk.c > > +++ b/drivers/block/null_blk.c > > @@ -154,6 +154,10 @@ enum { > > NULL_Q_MQ = 2, > > }; > > > > +static int g_no_sched; > > +module_param_named(no_sched, g_no_sched, int, S_IRUGO); > > +MODULE_PARM_DESC(no_sched, "No io scheduler"); > > + > > static int g_submit_queues = 1; > > module_param_named(submit_queues, g_submit_queues, int, S_IRUGO); > > MODULE_PARM_DESC(submit_queues, "Number of submission queues"); > > @@ -1753,7 +1757,7 @@ static int null_init_tag_set(struct nullb *nullb, > > struct blk_mq_tag_set *set) > > g_hw_queue_depth; > > set->numa_node = nullb ? nullb->dev->home_node : g_home_node; > > set->cmd_size = sizeof(struct nullb_cmd); > > - set->flags = BLK_MQ_F_SHOULD_MERGE; > > + set->flags = g_no_sched ? BLK_MQ_F_NO_SCHED : BLK_MQ_F_SHOULD_MERGE; > > This should be: > > set->flags = BLK_MQ_F_SHOULD_MERGE; > if (g_no_sched) > set->flags |= BLK_MQ_F_NO_SCHED; > That's right, I go through these two flags, if no io scheduler, BLK_MQ_F_SHOULD_MERGE can make sw ctx merge happen. I will send V2. Thanks weiping
How to enable multi-path on kernel 4.8.17
Hi all,

Because of my environment's requirements, the kernel must stay at
4.8.17. I would like to ask: how can I use NVMe multi-path with kernel
4.8.17? As far as I can see, multi-path support only exists in versions
above 4.13.

I'd appreciate everyone's help, thank you very much.
Re: [PATCH 1/2] block: genhd: add device_add_disk_with_groups
On Thu, Sep 28, 2017 at 09:36:36PM +0200, Martin Wilck wrote:
> In the NVME subsystem, we're seeing a race condition with udev where
> device_add_disk() is called (which triggers an "add" uevent), and a
> sysfs attribute group is added to the disk device afterwards.
> If udev rules access these attributes before they are created,
> udev processing of the device is incomplete, in particular, device
> WWIDs may not be determined correctly.
>
> To fix this, this patch introduces a new function
> device_add_disk_with_groups(), which takes a list of attribute groups
> and adds them to the device before sending out uevents.
>
> Signed-off-by: Martin Wilck 

Is NVMe the only one having this problem? Was putting our attributes in
the disk's kobj a bad choice?

Anyway, looks fine to me.

Reviewed-by: Keith Busch 
Re: [PATCH 2/2] nvme: use device_add_disk_with_groups()
On Thu, Sep 28, 2017 at 09:36:37PM +0200, Martin Wilck wrote: > By using device_add_disk_with_groups(), we can avoid the race > condition with udev rule processing, because no udev event will > be triggered before all attributes are available. > > Signed-off-by: Martin Wilck Looks good. Reviewed-by: Keith Busch
Re: [PATCH] null_blk: add "no_sched" module parameter
On 09/29/2017 07:09 PM, weiping zhang wrote: > add an option that disable io scheduler for null block device. > > Signed-off-by: weiping zhang > --- > drivers/block/null_blk.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c > index bd92286..3c63863 100644 > --- a/drivers/block/null_blk.c > +++ b/drivers/block/null_blk.c > @@ -154,6 +154,10 @@ enum { > NULL_Q_MQ = 2, > }; > > +static int g_no_sched; > +module_param_named(no_sched, g_no_sched, int, S_IRUGO); > +MODULE_PARM_DESC(no_sched, "No io scheduler"); > + > static int g_submit_queues = 1; > module_param_named(submit_queues, g_submit_queues, int, S_IRUGO); > MODULE_PARM_DESC(submit_queues, "Number of submission queues"); > @@ -1753,7 +1757,7 @@ static int null_init_tag_set(struct nullb *nullb, > struct blk_mq_tag_set *set) > g_hw_queue_depth; > set->numa_node = nullb ? nullb->dev->home_node : g_home_node; > set->cmd_size = sizeof(struct nullb_cmd); > - set->flags = BLK_MQ_F_SHOULD_MERGE; > + set->flags = g_no_sched ? BLK_MQ_F_NO_SCHED : BLK_MQ_F_SHOULD_MERGE; This should be: set->flags = BLK_MQ_F_SHOULD_MERGE; if (g_no_sched) set->flags |= BLK_MQ_F_NO_SCHED; -- Jens Axboe
RE: [PATCH 2/2] nvme: use device_add_disk_with_groups()
> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf Of > Martin Wilck > Sent: Thursday, September 28, 2017 2:37 PM > To: Jens Axboe ; Christoph Hellwig ; Johannes > Thumshirn > Cc: linux-block@vger.kernel.org; Martin Wilck ; > linux-ker...@vger.kernel.org; linux-n...@lists.infradead.org; > Hannes Reinecke > Subject: [PATCH 2/2] nvme: use device_add_disk_with_groups() > Tested-by: Steve Schremmer
RE: [PATCH 1/2] block: genhd: add device_add_disk_with_groups
> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf Of > Martin Wilck > Sent: Thursday, September 28, 2017 2:37 PM > To: Jens Axboe ; Christoph Hellwig ; Johannes > Thumshirn > Cc: linux-block@vger.kernel.org; Martin Wilck ; > linux-ker...@vger.kernel.org; linux-n...@lists.infradead.org; > Hannes Reinecke > Subject: [PATCH 1/2] block: genhd: add device_add_disk_with_groups > Tested-by: Steve Schremmer
Re: [PATCH V6 0/6] block/scsi: safe SCSI quiescing
Ming Lei - 27.09.17, 16:27: > On Wed, Sep 27, 2017 at 09:57:37AM +0200, Martin Steigerwald wrote: > > Hi Ming. > > > > Ming Lei - 27.09.17, 13:48: > > > Hi, > > > > > > The current SCSI quiesce isn't safe and easy to trigger I/O deadlock. > > > > > > Once SCSI device is put into QUIESCE, no new request except for > > > RQF_PREEMPT can be dispatched to SCSI successfully, and > > > scsi_device_quiesce() just simply waits for completion of I/Os > > > dispatched to SCSI stack. It isn't enough at all. > > > > > > Because new request still can be comming, but all the allocated > > > requests can't be dispatched successfully, so request pool can be > > > consumed up easily. > > > > > > Then request with RQF_PREEMPT can't be allocated and wait forever, > > > meantime scsi_device_resume() waits for completion of RQF_PREEMPT, > > > then system hangs forever, such as during system suspend or > > > sending SCSI domain alidation. > > > > > > Both IO hang inside system suspend[1] or SCSI domain validation > > > were reported before. > > > > > > This patch introduces preempt only mode, and solves the issue > > > by allowing RQF_PREEMP only during SCSI quiesce. > > > > > > Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes > > > them all. > > > > > > V6: > > > - borrow Bart's idea of preempt only, with clean > > > > > > implementation(patch 5/patch 6) > > > > > > - needn't any external driver's dependency, such as MD's > > > change > > > > Do you want me to test with v6 of the patch set? If so, it would be nice > > if > > you´d make a v6 branch in your git repo. > > Hi Martin, > > I appreciate much if you may run V6 and provide your test result, > follows the branch: > > https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V6 > > https://github.com/ming1/linux.git #blk_safe_scsi_quiesce_V6 > > > After an uptime of almost 6 days I am pretty confident that the V5 one > > fixes the issue for me. So > > > > Tested-by: Martin Steigerwald > > > > for V5. > > Thanks for your test! Two days and almost 6 hours, no hang yet. I bet the whole thing works. (3e45474d7df3bfdabe4801b5638d197df9810a79) Tested-By: Martin Steigerwald (It could still hang after three days, but usually I got the first hang within the first two days.) Thanks, -- Martin
[PATCH] null_blk: add "no_sched" module parameter
add an option that disable io scheduler for null block device. Signed-off-by: weiping zhang --- drivers/block/null_blk.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c index bd92286..3c63863 100644 --- a/drivers/block/null_blk.c +++ b/drivers/block/null_blk.c @@ -154,6 +154,10 @@ enum { NULL_Q_MQ = 2, }; +static int g_no_sched; +module_param_named(no_sched, g_no_sched, int, S_IRUGO); +MODULE_PARM_DESC(no_sched, "No io scheduler"); + static int g_submit_queues = 1; module_param_named(submit_queues, g_submit_queues, int, S_IRUGO); MODULE_PARM_DESC(submit_queues, "Number of submission queues"); @@ -1753,7 +1757,7 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set) g_hw_queue_depth; set->numa_node = nullb ? nullb->dev->home_node : g_home_node; set->cmd_size = sizeof(struct nullb_cmd); - set->flags = BLK_MQ_F_SHOULD_MERGE; + set->flags = g_no_sched ? BLK_MQ_F_NO_SCHED : BLK_MQ_F_SHOULD_MERGE; set->driver_data = NULL; if ((nullb && nullb->dev->blocking) || g_blocking) -- 2.9.4
Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems
Hi all,

Because of my environment's requirements, the kernel must stay at
4.8.17. I would like to ask: how can I use NVMe multi-path with kernel
4.8.17? As far as I can see, multi-path support only exists in versions
above 4.13.

I'd appreciate everyone's help, thank you very much.

2017-09-28 23:53 GMT+08:00 Keith Busch :
> On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
>> The new block devices nodes for multipath access will show up as
>>
>> /dev/nvm-subXnZ
>
> Just thinking ahead ... Once this goes in, someone will want to boot their
> OS from a multipath target. It was a pain getting installers to recognize
> /dev/nvmeXnY as an install destination. I'm not sure if installers have
> gotten any better in the last 5 years about recognizing new block names.
>
> ___
> Linux-nvme mailing list
> linux-n...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme