Re: [PATCH 05/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-07-31 Thread Bart Van Assche
On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> During dispatch, we move all requests from hctx->dispatch to
> a temporary list, then dispatch them one by one from this list.
> Unfortunately, during this period, a queue run from another context
> may think the queue is idle and start to dequeue from the sw/scheduler
> queue and try to dispatch, because ->dispatch is empty.
> 
> This hurts sequential I/O performance because requests are
> dequeued while the queue is busy.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq-sched.c   | 24 ++--
>  include/linux/blk-mq.h |  1 +
>  2 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 3510c01cb17b..eb638063673f 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -112,8 +112,15 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>*/
>   if (!list_empty_careful(&hctx->dispatch)) {
>   spin_lock(&hctx->lock);
> - if (!list_empty(&hctx->dispatch))
> + if (!list_empty(&hctx->dispatch)) {
>   list_splice_init(&hctx->dispatch, &rq_list);
> +
> + /*
> +  * BUSY won't be cleared until all requests
> +  * in hctx->dispatch are dispatched successfully
> +  */
> + set_bit(BLK_MQ_S_BUSY, &hctx->state);
> + }
>   spin_unlock(&hctx->lock);
>   }
>  
> @@ -129,15 +136,20 @@ void blk_mq_sched_dispatch_requests(struct 
> blk_mq_hw_ctx *hctx)
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
>   can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> - } else if (!has_sched_dispatch && !q->queue_depth) {
> - blk_mq_flush_busy_ctxs(hctx, &rq_list);
> - blk_mq_dispatch_rq_list(q, &rq_list);
> - can_go = false;
> + if (can_go)
> + clear_bit(BLK_MQ_S_BUSY, &hctx->state);
>   }
>  
> - if (!can_go)
> + /* can't go until ->dispatch is flushed */
> + if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
>   return;
>  
> + if (!has_sched_dispatch && !q->queue_depth) {
> + blk_mq_flush_busy_ctxs(hctx, &rq_list);
> + blk_mq_dispatch_rq_list(q, &rq_list);
> + return;
> + }

Hello Ming,

Since setting, clearing and testing of BLK_MQ_S_BUSY can happen concurrently,
and since the clearing and testing happen without any lock held, I'm afraid
this patch introduces the following race conditions (a toy model of the second
one follows below):
* Clearing of BLK_MQ_S_BUSY immediately after that bit has been set, resulting
  in the bit not being set although there are requests on the dispatch list.
* Checking BLK_MQ_S_BUSY after requests have been added to the dispatch list
  but before the bit is set, resulting in test_bit(BLK_MQ_S_BUSY, &hctx->state)
  reporting that BLK_MQ_S_BUSY has not been set although there are requests
  on the dispatch list.
* Checking BLK_MQ_S_BUSY after requests have been removed from the dispatch
  list but before the bit is cleared, resulting in test_bit(BLK_MQ_S_BUSY,
  &hctx->state) reporting that BLK_MQ_S_BUSY has been set although there are
  no requests on the dispatch list.
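
A toy userspace model of the second window (plain C with made-up names, not
kernel code; it only spells out the interleaving):

/* toy_busy_race.c - illustrative only, not kernel code */
#include <stdbool.h>
#include <stdio.h>

static bool dispatch_has_requests;	/* models !list_empty(&hctx->dispatch) */
static bool busy_bit;			/* models BLK_MQ_S_BUSY */

/* context A: requests land on ->dispatch, BUSY is set afterwards */
static void ctx_a_add_requests(void) { dispatch_has_requests = true; }
static void ctx_a_set_busy(void)     { busy_bit = true; }

/* context B: unlocked test_bit(BLK_MQ_S_BUSY, &hctx->state) */
static bool ctx_b_sees_not_busy(void) { return !busy_bit; }

int main(void)
{
	ctx_a_add_requests();			/* step 1: ->dispatch is non-empty */
	bool raced = ctx_b_sees_not_busy();	/* step 2: concurrent check slips in */
	ctx_a_set_busy();			/* step 3: the bit is set too late */

	printf("B saw \"not busy\" although ->dispatch held requests: %s\n",
	       raced && dispatch_has_requests ? "yes" : "no");
	return 0;
}

Built with any C99 compiler this prints "yes", i.e. context B concludes the
hctx is idle while dispatch work is actually pending.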

Bart.

Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-07-31 Thread Bart Van Assche
On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> SCSI devices use a host-wide tagset, and the shared
> driver tag space is often quite big. Meanwhile
> there is also a queue depth for each LUN (.cmd_per_lun),
> which is often small.
> 
> So lots of requests may stay in the sw queue, and we
> always flush all of them belonging to the same hw queue and
> dispatch them all to the driver. Unfortunately it is
> easy to cause queue busy because of the small
> per-LUN queue depth. Once these requests are flushed
> out, they have to stay in hctx->dispatch, no bio
> merge can happen against these requests, and
> sequential IO performance is hurt.
> 
> This patch improves dispatching from the sw queue when
> there is a per-request-queue queue depth by taking
> requests one by one from the sw queue, just like the way
> an IO scheduler does.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq-sched.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 47a25333a136..3510c01cb17b 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -96,6 +96,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
>   bool can_go = true;
>   LIST_HEAD(rq_list);
> + struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
> + has_sched_dispatch ? e->type->ops.mq.dispatch_request :
> + blk_mq_dispatch_rq_from_ctxs;
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */
>   if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
> @@ -126,26 +129,28 @@ void blk_mq_sched_dispatch_requests(struct 
> blk_mq_hw_ctx *hctx)
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
>   can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> - } else if (!has_sched_dispatch) {
> + } else if (!has_sched_dispatch && !q->queue_depth) {
>   blk_mq_flush_busy_ctxs(hctx, &rq_list);
>   blk_mq_dispatch_rq_list(q, &rq_list);
> + can_go = false;
>   }
>  
> + if (!can_go)
> + return;
> +
>   /*
>* We want to dispatch from the scheduler if we had no work left
>* on the dispatch list, OR if we did have work but weren't able
>* to make progress.
>*/
> - if (can_go && has_sched_dispatch) {
> - do {
> - struct request *rq;
> + do {
> + struct request *rq;
>  
> - rq = e->type->ops.mq.dispatch_request(hctx);
> - if (!rq)
> - break;
> - list_add(&rq->queuelist, &rq_list);
> - } while (blk_mq_dispatch_rq_list(q, &rq_list));
> - }
> + rq = dispatch_fn(hctx);
> + if (!rq)
> + break;
> + list_add(&rq->queuelist, &rq_list);
> + } while (blk_mq_dispatch_rq_list(q, &rq_list));
>  }

Hello Ming,

Although I like the idea behind this patch, I'm afraid that it will cause a
performance regression for high-performance SCSI LLDs, e.g. ib_srp. Have you
considered reworking this patch as follows:
* Remove the code under "else if (!has_sched_dispatch && !q->queue_depth) {".
* Modify all blk_mq_dispatch_rq_list() functions such that they dispatch at
  most cmd_per_lun - (number of requests in progress) requests at once (a
  rough sketch follows below).
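
A rough sketch of that budget calculation (hypothetical helper name, plain C
only to show the arithmetic, not a proposal for the actual code):

/* budget_sketch.c - hypothetical helper, illustrative only */
#include <stdio.h>

static unsigned int dispatch_budget(unsigned int cmd_per_lun,
				    unsigned int inflight)
{
	/* never hand more than cmd_per_lun - inflight requests to the LLD */
	return inflight >= cmd_per_lun ? 0 : cmd_per_lun - inflight;
}

int main(void)
{
	/* e.g. cmd_per_lun = 32 with 29 commands already queued to the LUN */
	printf("budget = %u\n", dispatch_budget(32, 29));	/* prints 3 */
	return 0;
}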

Thanks,

Bart.

Re: [PATCH 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctxs()

2017-07-31 Thread Bart Van Assche
On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> @@ -810,7 +810,11 @@ static void blk_mq_timeout_work(struct work_struct *work)
>  
>  struct ctx_iter_data {
>   struct blk_mq_hw_ctx *hctx;
> - struct list_head *list;
> +
> + union {
> + struct list_head *list;
> + struct request *rq;
> + };
>  };

Hello Ming,

Please introduce a new data structure for dispatch_rq_from_ctx() /
blk_mq_dispatch_rq_from_ctxs() instead of introducing a union in struct
ctx_iter_data. That will prevent .list from being used in a context where
a struct request * pointer has been stored in the structure, and vice versa.
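
I.e. something along these lines (the structure name is purely illustrative):

/* hypothetical dedicated structure, no union needed */
struct blk_mq_hw_ctx;
struct request;

struct dispatch_rq_data {
	struct blk_mq_hw_ctx	*hctx;
	struct request		*rq;
};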
 
>  static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void 
> *data)
> @@ -826,6 +830,26 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned 
> int bitnr, void *data)
>   return true;
>  }
>  
> +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, 
> void *data)
> +{
> + struct ctx_iter_data *dispatch_data = data;
> + struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
> + struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
> + bool empty = true;
> +
> + spin_lock(&ctx->lock);
> + if (unlikely(!list_empty(&ctx->rq_list))) {
> + dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
> + list_del_init(&dispatch_data->rq->queuelist);
> + empty = list_empty(&ctx->rq_list);
> + }
> + spin_unlock(&ctx->lock);
> + if (empty)
> + sbitmap_clear_bit(sb, bitnr);

This sbitmap_clear_bit() occurs without holding blk_mq_ctx.lock. Sorry but
I don't think this is safe. Please either remove this sbitmap_clear_bit() call
or make sure that it happens with blk_mq_ctx.lock held.
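
For example (untested, only to show the locking I mean), the clear could be
done while ctx->lock is still held:

	spin_lock(&ctx->lock);
	if (!list_empty(&ctx->rq_list)) {
		dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
		list_del_init(&dispatch_data->rq->queuelist);
		/* clear the ctx bit only while still holding ctx->lock */
		if (list_empty(&ctx->rq_list))
			sbitmap_clear_bit(sb, bitnr);
	}
	spin_unlock(&ctx->lock);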

Thanks,

Bart.

Re: [PATCH 02/14] blk-mq: rename flush_busy_ctx_data as ctx_iter_data

2017-07-31 Thread Bart Van Assche
On Tue, 2017-08-01 at 00:50 +0800, Ming Lei wrote:
> The following patch needs to reuse this data structure,
> so rename it to a generic name.

Hello Ming,

Please drop this patch (see also my comments on the next patch).

Thanks,

Bart.

Re: [PATCH 01/14] blk-mq-sched: fix scheduler bad performance

2017-07-31 Thread Bart Van Assche
On Tue, 2017-08-01 at 00:50 +0800, Ming Lei wrote:
> When the hw queue is busy, we shouldn't take requests from
> the scheduler queue any more, otherwise IO merging will be
> difficult to do.
> 
> This patch fixes the awful IO performance on some
> SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
> is used, by not taking requests if the hw queue is busy.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq-sched.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 4ab69435708c..47a25333a136 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>   struct request_queue *q = hctx->queue;
>   struct elevator_queue *e = q->elevator;
>   const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> - bool did_work = false;
> + bool can_go = true;
>   LIST_HEAD(rq_list);
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */
> @@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>*/
>   if (!list_empty(&rq_list)) {
>   blk_mq_sched_mark_restart_hctx(hctx);
> - did_work = blk_mq_dispatch_rq_list(q, &rq_list);
> + can_go = blk_mq_dispatch_rq_list(q, &rq_list);
>   } else if (!has_sched_dispatch) {
>   blk_mq_flush_busy_ctxs(hctx, &rq_list);
>   blk_mq_dispatch_rq_list(q, &rq_list);
> @@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
> *hctx)
>* on the dispatch list, OR if we did have work but weren't able
>* to make progress.
>*/
> - if (!did_work && has_sched_dispatch) {
> + if (can_go && has_sched_dispatch) {
>   do {
>   struct request *rq;

Hello Ming,

Please choose a better name for the new variable, e.g. "do_sched_dispatch".
Otherwise this patch looks fine to me. Hence:

Reviewed-by: Bart Van Assche 

Bart.

Re: [patch 3/5] scsi/bnx2i: Prevent recursive cpuhotplug locking

2017-07-31 Thread Steven Rostedt
On Mon, 24 Jul 2017 12:52:58 +0200
Thomas Gleixner  wrote:

> The BNX2I module init/exit code installs/removes the hotplug callbacks with
> the cpu hotplug lock held. This worked with the old CPU locking
> implementation which allowed recursive locking, but with the new percpu
> rwsem based mechanism this is no longer allowed.
> 
> Use the _cpuslocked() variants to fix this.
> 
> Reported-by: Steven Rostedt 

Tested-by: Steven Rostedt (VMware) 

(makes the lockdep splat go away)

-- Steve

> Signed-off-by: Thomas Gleixner 
> ---
>  drivers/scsi/bnx2i/bnx2i_init.c |   15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> --- a/drivers/scsi/bnx2i/bnx2i_init.c
> +++ b/drivers/scsi/bnx2i/bnx2i_init.c
> @@ -516,15 +516,16 @@ static int __init bnx2i_mod_init(void)
>   for_each_online_cpu(cpu)
>   bnx2i_percpu_thread_create(cpu);
>  
> - err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> -"scsi/bnx2i:online",
> -bnx2i_cpu_online, NULL);
> + err = cpuhp_setup_state_nocalls_cpuslocked(CPUHP_AP_ONLINE_DYN,
> +"scsi/bnx2i:online",
> +bnx2i_cpu_online, NULL);
>   if (err < 0)
>   goto remove_threads;
>   bnx2i_online_state = err;
>  
> - cpuhp_setup_state_nocalls(CPUHP_SCSI_BNX2I_DEAD, "scsi/bnx2i:dead",
> -   NULL, bnx2i_cpu_dead);
> + cpuhp_setup_state_nocalls_cpuslocked(CPUHP_SCSI_BNX2I_DEAD,
> +  "scsi/bnx2i:dead",
> +  NULL, bnx2i_cpu_dead);
>   put_online_cpus();
>   return 0;
>  
> @@ -574,8 +575,8 @@ static void __exit bnx2i_mod_exit(void)
>   for_each_online_cpu(cpu)
>   bnx2i_percpu_thread_destroy(cpu);
>  
> - cpuhp_remove_state_nocalls(bnx2i_online_state);
> - cpuhp_remove_state_nocalls(CPUHP_SCSI_BNX2I_DEAD);
> + cpuhp_remove_state_nocalls_cpuslocked(bnx2i_online_state);
> + cpuhp_remove_state_nocalls_cpuslocked(CPUHP_SCSI_BNX2I_DEAD);
>   put_online_cpus();
>  
>   iscsi_unregister_transport(&bnx2i_iscsi_transport);
> 



[PATCH 1/1] qla2xxx: Fix system crash while triggering FW dump

2017-07-31 Thread Himanshu Madhani
From: Michael Hernandez 

This patch fixes a system hang/crash seen when a firmware dump is attempted
with Block MQ enabled in the qla2xxx driver. The fix is to remove the check in
the fw dump template entries for existing request and response queues so that
the full buffer size is calculated during template size calculation.

The following stack trace is seen during the firmware dump capture process:

[  694.390588] qla2xxx [:81:00.0]-5003:11: ISP System Error - mbx1=4b1fh 
mbx2=10h mbx3=2ah mbx7=0h.
[  694.402336] BUG: unable to handle kernel paging request at c90008c7b000
[  694.402372] IP: memcpy_erms+0x6/0x10
[  694.402386] PGD 105f01a067
[  694.402386] PUD 85f89c067
[  694.402398] PMD 10490cb067
[  694.402409] PTE 0
[  694.402421]
[  694.402437] Oops: 0002 [#1] PREEMPT SMP
[  694.402452] Modules linked in: netconsole configfs qla2xxx scsi_transport_fc
nvme_fc nvme_fabrics bnep bluetooth rfkill xt_tcpudp unix_diag xt_multiport
ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet
iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ipmi_ssif sb_edac edac_core
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass igb
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel iTCO_wdt
aes_x86_64 crypto_simd ptp iTCO_vendor_support glue_helper cryptd lpc_ich joydev
i2c_i801 pcspkr ioatdma mei_me pps_core tpm_tis mei mfd_core acpi_power_meter
tpm_tis_core ipmi_si ipmi_devintf tpm ipmi_msghandler shpchp wmi dca button
acpi_pad btrfs xor uas usb_storage hid_generic usbhid raid6_pq crc32c_intel ast
i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
[  694.402692]  sysimgblt fb_sys_fops xhci_pci ttm ehci_pci sr_mod xhci_hcd
cdrom ehci_hcd drm usbcore sg
[  694.402730] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-1-default+ #19
[  694.402753] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
[  694.402776] task: 81c0e4c0 task.stack: 81c0
[  694.402798] RIP: 0010:memcpy_erms+0x6/0x10
[  694.402813] RSP: 0018:88085fc03cd0 EFLAGS: 00210006
[  694.402832] RAX: c90008c7ae0c RBX: 0004 RCX: 0001fe0c
[  694.402856] RDX: 0002 RSI: 8810332c01f4 RDI: c90008c7b000
[  694.402879] RBP: 88085fc03d18 R08: 0002 R09: 00279e0a
[  694.402903] R10:  R11: f000 R12: 88085fc03d80
[  694.402927] R13: c90008a01000 R14: c90008a056d4 R15: 881052ef17e0
[  694.402951] FS:  () GS:88085fc0() 
knlGS:
[  694.402977] CS:  0010 DS:  ES:  CR0: 80050033
[  694.403012] CR2: c90008c7b000 CR3: 01c09000 CR4: 001406f0
[  694.403036] Call Trace:
[  694.403047]  
[  694.403072]  ? qla27xx_fwdt_entry_t263+0x18e/0x380 [qla2xxx]
[  694.403099]  qla27xx_walk_template+0x9d/0x1a0 [qla2xxx]
[  694.403124]  qla27xx_fwdump+0x1f3/0x272 [qla2xxx]
[  694.403149]  qla2x00_async_event+0xb08/0x1a50 [qla2xxx]
[  694.403169]  ? enqueue_task_fair+0xa2/0x9d0

Signed-off-by: Mike Hernandez 
Signed-off-by: Joe Carnuccio 
Signed-off-by: Himanshu Madhani 
---
Hi Martin, 

Please apply this patch to 4.13.0-rc4. Without this patch our capability
to collect and analyze firmware dumps in a customer environment will be
greatly affected.

Thanks,
Himanshu
---
 drivers/scsi/qla2xxx/qla_tmpl.c | 12 
 1 file changed, 12 deletions(-)

diff --git a/drivers/scsi/qla2xxx/qla_tmpl.c b/drivers/scsi/qla2xxx/qla_tmpl.c
index 33142610882f..b18646d6057f 100644
--- a/drivers/scsi/qla2xxx/qla_tmpl.c
+++ b/drivers/scsi/qla2xxx/qla_tmpl.c
@@ -401,9 +401,6 @@ qla27xx_fwdt_entry_t263(struct scsi_qla_host *vha,
for (i = 0; i < vha->hw->max_req_queues; i++) {
struct req_que *req = vha->hw->req_q_map[i];
 
-   if (!test_bit(i, vha->hw->req_qid_map))
-   continue;
-
if (req || !buf) {
length = req ?
req->length : REQUEST_ENTRY_CNT_24XX;
@@ -418,9 +415,6 @@ qla27xx_fwdt_entry_t263(struct scsi_qla_host *vha,
for (i = 0; i < vha->hw->max_rsp_queues; i++) {
struct rsp_que *rsp = vha->hw->rsp_q_map[i];
 
-   if (!test_bit(i, vha->hw->rsp_qid_map))
-   continue;
-
if (rsp || !buf) {
length = rsp ?
rsp->length : RESPONSE_ENTRY_CNT_MQ;
@@ -660,9 +654,6 @@ qla27xx_fwdt_entry_t274(struct scsi_qla_host *vha,
for (i = 0; i < vha->hw->max_req_queues; i++) {
struct req_que *req = vha->hw->req_q_map[i];
 
-   if (!test_bit(i, vha->hw->req_qid_map))
-   continue;
-
if (req || !buf) {
qla27xx_insert16(i, buf, len);
qla27xx_insert16(1, buf, len);
@@ -67

[PATCH 14/14] blk-mq-sched: improve IO scheduling on SCSI device

2017-07-31 Thread Ming Lei
A SCSI device often has a per-request_queue queue depth
(.cmd_per_lun), which is actually applied across all hw
queues; this patchset calls it the shared queue depth.

One principle of the scheduler is that we shouldn't dequeue
a request from the sw/scheduler queue and dispatch it to the
driver when the low level queue is busy.

For a SCSI device, being busy depends on the
per-request_queue limit, so we should hold all
hw queues if the request queue is busy.

This patch introduces a per-request_queue dispatch
list for this purpose, and only when all requests
in this list are dispatched out successfully can we
restart dequeuing requests from the sw/scheduler
queue and dispatching them to the lld.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c |  8 +++-
 block/blk-mq.h | 14 +++---
 include/linux/blkdev.h |  5 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b8b3a740d18..6d02901d798e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2667,8 +2667,14 @@ int blk_mq_update_sched_queue_depth(struct request_queue 
*q)
 * this queue depth limit
 */
if (q->queue_depth) {
-   queue_for_each_hw_ctx(q, hctx, i)
+   queue_for_each_hw_ctx(q, hctx, i) {
hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
+   hctx->dispatch_lock = &q->__mq_dispatch_lock;
+   hctx->dispatch_list = &q->__mq_dispatch_list;
+
+   spin_lock_init(hctx->dispatch_lock);
+   INIT_LIST_HEAD(hctx->dispatch_list);
+   }
}
 
if (!q->elevator)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index a8788058da56..4853d422836f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -138,19 +138,27 @@ static inline bool blk_mq_hw_queue_mapped(struct 
blk_mq_hw_ctx *hctx)
 static inline bool blk_mq_hctx_is_busy(struct request_queue *q,
struct blk_mq_hw_ctx *hctx)
 {
-   return test_bit(BLK_MQ_S_BUSY, &hctx->state);
+   if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+   return test_bit(BLK_MQ_S_BUSY, &hctx->state);
+   return q->mq_dispatch_busy;
 }
 
 static inline void blk_mq_hctx_set_busy(struct request_queue *q,
struct blk_mq_hw_ctx *hctx)
 {
-   set_bit(BLK_MQ_S_BUSY, &hctx->state);
+   if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+   set_bit(BLK_MQ_S_BUSY, &hctx->state);
+   else
+   q->mq_dispatch_busy = 1;
 }
 
 static inline void blk_mq_hctx_clear_busy(struct request_queue *q,
struct blk_mq_hw_ctx *hctx)
 {
-   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
+   if (!(hctx->flags & BLK_MQ_F_SHARED_DEPTH))
+   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
+   else
+   q->mq_dispatch_busy = 0;
 }
 
 static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..bc0e607710f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -395,6 +395,11 @@ struct request_queue {
 
atomic_tshared_hctx_restart;
 
+   /* blk-mq dispatch list and lock for shared queue depth case */
+   struct list_head__mq_dispatch_list;
+   spinlock_t  __mq_dispatch_lock;
+   unsigned intmq_dispatch_busy;
+
struct blk_queue_stats  *stats;
struct rq_wb*rq_wb;
 
-- 
2.9.4



[PATCH 13/14] blk-mq: pass 'request_queue *' to several helpers of operating BUSY

2017-07-31 Thread Ming Lei
We need to support a per-request_queue dispatch list to avoid
early dispatch in the shared queue depth case.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c |  6 +++---
 block/blk-mq.h   | 15 +--
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 8ff74efe4172..37702786c6d1 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -132,7 +132,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * more fair dispatch.
 */
if (blk_mq_has_dispatch_rqs(hctx))
-   blk_mq_take_list_from_dispatch(hctx, &rq_list);
+   blk_mq_take_list_from_dispatch(q, hctx, &rq_list);
 
/*
 * Only ask the scheduler for requests, if we didn't have residual
@@ -147,11 +147,11 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
blk_mq_sched_mark_restart_hctx(hctx);
can_go = blk_mq_dispatch_rq_list(q, &rq_list);
if (can_go)
-   blk_mq_hctx_clear_busy(hctx);
+   blk_mq_hctx_clear_busy(q, hctx);
}
 
/* can't go until ->dispatch is flushed */
-   if (!can_go || blk_mq_hctx_is_busy(hctx))
+   if (!can_go || blk_mq_hctx_is_busy(q, hctx))
return;
 
/*
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d9795cbba1bb..a8788058da56 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -135,17 +135,20 @@ static inline bool blk_mq_hw_queue_mapped(struct 
blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
 }
 
-static inline bool blk_mq_hctx_is_busy(struct blk_mq_hw_ctx *hctx)
+static inline bool blk_mq_hctx_is_busy(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx)
 {
return test_bit(BLK_MQ_S_BUSY, &hctx->state);
 }
 
-static inline void blk_mq_hctx_set_busy(struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_hctx_set_busy(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx)
 {
set_bit(BLK_MQ_S_BUSY, &hctx->state);
 }
 
-static inline void blk_mq_hctx_clear_busy(struct blk_mq_hw_ctx *hctx)
+static inline void blk_mq_hctx_clear_busy(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx)
 {
clear_bit(BLK_MQ_S_BUSY, &hctx->state);
 }
@@ -179,8 +182,8 @@ static inline void blk_mq_add_list_to_dispatch_tail(struct 
blk_mq_hw_ctx *hctx,
spin_unlock(hctx->dispatch_lock);
 }
 
-static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
-   struct list_head *list)
+static inline void blk_mq_take_list_from_dispatch(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
spin_lock(hctx->dispatch_lock);
list_splice_init(hctx->dispatch_list, list);
@@ -190,7 +193,7 @@ static inline void blk_mq_take_list_from_dispatch(struct 
blk_mq_hw_ctx *hctx,
 * in hctx->dispatch are dispatched successfully
 */
if (!list_empty(list))
-   blk_mq_hctx_set_busy(hctx);
+   blk_mq_hctx_set_busy(q, hctx);
spin_unlock(hctx->dispatch_lock);
 }
 
-- 
2.9.4



[PATCH 12/14] blk-mq: introduce pointers to dispatch lock & list

2017-07-31 Thread Ming Lei
Prepare to support a per-request-queue dispatch list,
so introduce pointers to the dispatch lock and list,
which avoids runtime checks.

Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c | 10 +-
 block/blk-mq.c |  7 +--
 block/blk-mq.h | 26 +-
 include/linux/blk-mq.h |  3 +++
 4 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c4f70b453c76..4f8cddb8505f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -370,23 +370,23 @@ static void *hctx_dispatch_start(struct seq_file *m, 
loff_t *pos)
 {
struct blk_mq_hw_ctx *hctx = m->private;
 
-   spin_lock(&hctx->lock);
-   return seq_list_start(&hctx->dispatch, *pos);
+   spin_lock(hctx->dispatch_lock);
+   return seq_list_start(hctx->dispatch_list, *pos);
 }
 
 static void *hctx_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
 {
struct blk_mq_hw_ctx *hctx = m->private;
 
-   return seq_list_next(v, &hctx->dispatch, pos);
+   return seq_list_next(v, hctx->dispatch_list, pos);
 }
 
 static void hctx_dispatch_stop(struct seq_file *m, void *v)
-   __releases(&hctx->lock)
+   __releases(hctx->dispatch_lock)
 {
struct blk_mq_hw_ctx *hctx = m->private;
 
-   spin_unlock(&hctx->lock);
+   spin_unlock(hctx->dispatch_lock);
 }
 
 static const struct seq_operations hctx_dispatch_seq_ops = {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 785145f60c1d..9b8b3a740d18 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1925,8 +1925,11 @@ static void blk_mq_exit_hw_queues(struct request_queue 
*q,
 static void blk_mq_init_dispatch(struct request_queue *q,
struct blk_mq_hw_ctx *hctx)
 {
-   spin_lock_init(&hctx->lock);
-   INIT_LIST_HEAD(&hctx->dispatch);
+   hctx->dispatch_lock = &hctx->lock;
+   hctx->dispatch_list = &hctx->dispatch;
+
+   spin_lock_init(hctx->dispatch_lock);
+   INIT_LIST_HEAD(hctx->dispatch_list);
 }
 
 static int blk_mq_init_hctx(struct request_queue *q,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2ed355881996..d9795cbba1bb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -152,38 +152,38 @@ static inline void blk_mq_hctx_clear_busy(struct 
blk_mq_hw_ctx *hctx)
 
 static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
 {
-   return !list_empty_careful(&hctx->dispatch);
+   return !list_empty_careful(hctx->dispatch_list);
 }
 
 static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
struct request *rq)
 {
-   spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   spin_lock(hctx->dispatch_lock);
+   list_add(&rq->queuelist, hctx->dispatch_list);
+   spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
struct list_head *list)
 {
-   spin_lock(&hctx->lock);
-   list_splice_init(list, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   spin_lock(hctx->dispatch_lock);
+   list_splice_init(list, hctx->dispatch_list);
+   spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
struct list_head *list)
 {
-   spin_lock(&hctx->lock);
-   list_splice_tail_init(list, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   spin_lock(hctx->dispatch_lock);
+   list_splice_tail_init(list, hctx->dispatch_list);
+   spin_unlock(hctx->dispatch_lock);
 }
 
 static inline void blk_mq_take_list_from_dispatch(struct blk_mq_hw_ctx *hctx,
struct list_head *list)
 {
-   spin_lock(&hctx->lock);
-   list_splice_init(&hctx->dispatch, list);
+   spin_lock(hctx->dispatch_lock);
+   list_splice_init(hctx->dispatch_list, list);
 
/*
 * BUSY won't be cleared until all requests
@@ -191,7 +191,7 @@ static inline void blk_mq_take_list_from_dispatch(struct 
blk_mq_hw_ctx *hctx,
 */
if (!list_empty(list))
blk_mq_hctx_set_busy(hctx);
-   spin_unlock(&hctx->lock);
+   spin_unlock(hctx->dispatch_lock);
 }
 
 #endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14f2ad3af31f..016f16c48f72 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,9 @@ struct blk_mq_hw_ctx {
 
unsigned long   flags;  /* BLK_MQ_F_* flags */
 
+   spinlock_t  *dispatch_lock;
+   struct list_head*dispatch_list;
+
void*sched_data;
struct request_queue*queue;
struct blk_flush_queue  *fq;
-- 
2.9.4



[PATCH 11/14] blk-mq: introduce helpers for operating ->dispatch list

2017-07-31 Thread Ming Lei
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 19 +++
 block/blk-mq.c   | 18 +++---
 block/blk-mq.h   | 44 
 3 files changed, 58 insertions(+), 23 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 112270961af0..8ff74efe4172 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -131,19 +131,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * If we have previous entries on our dispatch list, grab them first for
 * more fair dispatch.
 */
-   if (!list_empty_careful(&hctx->dispatch)) {
-   spin_lock(&hctx->lock);
-   if (!list_empty(&hctx->dispatch)) {
-   list_splice_init(&hctx->dispatch, &rq_list);
-
-   /*
-* BUSY won't be cleared until all requests
-* in hctx->dispatch are dispatched successfully
-*/
-   blk_mq_hctx_set_busy(hctx);
-   }
-   spin_unlock(&hctx->lock);
-   }
+   if (blk_mq_has_dispatch_rqs(hctx))
+   blk_mq_take_list_from_dispatch(hctx, &rq_list);
 
/*
 * Only ask the scheduler for requests, if we didn't have residual
@@ -296,9 +285,7 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx 
*hctx,
 * If we already have a real request tag, send directly to
 * the dispatch list.
 */
-   spin_lock(&hctx->lock);
-   list_add(&rq->queuelist, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   blk_mq_add_rq_to_dispatch(hctx, rq);
return true;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index db635ef06a72..785145f60c1d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -63,7 +63,7 @@ static int blk_mq_poll_stats_bkt(const struct request *rq)
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx)
 {
return sbitmap_any_bit_set(&hctx->ctx_map) ||
-   !list_empty_careful(&hctx->dispatch) ||
+   blk_mq_has_dispatch_rqs(hctx) ||
blk_mq_sched_has_work(hctx);
 }
 
@@ -1097,9 +1097,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
rq = list_first_entry(list, struct request, queuelist);
blk_mq_put_driver_tag(rq);
 
-   spin_lock(&hctx->lock);
-   list_splice_init(list, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   blk_mq_add_list_to_dispatch(hctx, list);
 
/*
 * If SCHED_RESTART was set by the caller of this function and
@@ -1874,9 +1872,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, 
struct hlist_node *node)
if (list_empty(&tmp))
return 0;
 
-   spin_lock(&hctx->lock);
-   list_splice_tail_init(&tmp, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
+   blk_mq_add_list_to_dispatch_tail(hctx, &tmp);
 
blk_mq_run_hw_queue(hctx, true);
return 0;
@@ -1926,6 +1922,13 @@ static void blk_mq_exit_hw_queues(struct request_queue 
*q,
}
 }
 
+static void blk_mq_init_dispatch(struct request_queue *q,
+   struct blk_mq_hw_ctx *hctx)
+{
+   spin_lock_init(&hctx->lock);
+   INIT_LIST_HEAD(&hctx->dispatch);
+}
+
 static int blk_mq_init_hctx(struct request_queue *q,
struct blk_mq_tag_set *set,
struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
@@ -1939,6 +1942,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
spin_lock_init(&hctx->lock);
INIT_LIST_HEAD(&hctx->dispatch);
+   blk_mq_init_dispatch(q, hctx);
hctx->queue = q;
hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d9f875093613..2ed355881996 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -150,4 +150,48 @@ static inline void blk_mq_hctx_clear_busy(struct 
blk_mq_hw_ctx *hctx)
clear_bit(BLK_MQ_S_BUSY, &hctx->state);
 }
 
+static inline bool blk_mq_has_dispatch_rqs(struct blk_mq_hw_ctx *hctx)
+{
+   return !list_empty_careful(&hctx->dispatch);
+}
+
+static inline void blk_mq_add_rq_to_dispatch(struct blk_mq_hw_ctx *hctx,
+   struct request *rq)
+{
+   spin_lock(&hctx->lock);
+   list_add(&rq->queuelist, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+}
+
+static inline void blk_mq_add_list_to_dispatch(struct blk_mq_hw_ctx *hctx,
+   struct list_head *list)
+{
+   spin_lock(&hctx->lock);
+   list_splice_init(list, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+}
+
+static inline void blk_mq_add_list_to_dispatch_tail(struct blk_mq_hw_ctx *hctx,
+   struct list_head *list)
+{
+   spin_lock(&hctx->lock);
+   list_

[PATCH 10/14] blk-mq-sched: introduce helpers for query, change busy state

2017-07-31 Thread Ming Lei
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c |  6 +++---
 block/blk-mq.h   | 15 +++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 07ff53187617..112270961af0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -140,7 +140,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * BUSY won't be cleared until all requests
 * in hctx->dispatch are dispatched successfully
 */
-   set_bit(BLK_MQ_S_BUSY, &hctx->state);
+   blk_mq_hctx_set_busy(hctx);
}
spin_unlock(&hctx->lock);
}
@@ -158,11 +158,11 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
blk_mq_sched_mark_restart_hctx(hctx);
can_go = blk_mq_dispatch_rq_list(q, &rq_list);
if (can_go)
-   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
+   blk_mq_hctx_clear_busy(hctx);
}
 
/* can't go until ->dispatch is flushed */
-   if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
+   if (!can_go || blk_mq_hctx_is_busy(hctx))
return;
 
/*
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 44d3aaa03d7c..d9f875093613 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -135,4 +135,19 @@ static inline bool blk_mq_hw_queue_mapped(struct 
blk_mq_hw_ctx *hctx)
return hctx->nr_ctx && hctx->tags;
 }
 
+static inline bool blk_mq_hctx_is_busy(struct blk_mq_hw_ctx *hctx)
+{
+   return test_bit(BLK_MQ_S_BUSY, &hctx->state);
+}
+
+static inline void blk_mq_hctx_set_busy(struct blk_mq_hw_ctx *hctx)
+{
+   set_bit(BLK_MQ_S_BUSY, &hctx->state);
+}
+
+static inline void blk_mq_hctx_clear_busy(struct blk_mq_hw_ctx *hctx)
+{
+   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
+}
+
 #endif
-- 
2.9.4



[PATCH 08/14] blk-mq: introduce BLK_MQ_F_SHARED_DEPTH

2017-07-31 Thread Ming Lei
SCSI devices often provide one per-request_queue depth via
q->queue_depth (.cmd_per_lun), which is a global limit across all
hw queues. After the pending I/O submitted to one request queue
reaches this limit, BLK_STS_RESOURCE will be returned to every
dispatch path. That means when one hw queue is stuck, all
hctxs are actually stuck too.
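For example, with .cmd_per_lun = 3, once three commands are in
flight on the LUN, every hctx of that request queue sees
BLK_STS_RESOURCE, no matter which hctx the requests were
submitted through.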

This flag is introduced to improve blk-mq IO scheduling
on this kind of device.

Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq.c | 25 ++---
 include/linux/blk-mq.h |  1 +
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 9ebc2945f991..c4f70b453c76 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -209,6 +209,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
+   HCTX_FLAG_NAME(SHARED_DEPTH),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3eb524ccb7aa..cc0687a4d0ab 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -144,7 +144,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
return;
 
-   if (!has_sched_dispatch && !q->queue_depth) {
+   if (!has_sched_dispatch && !(hctx->flags & BLK_MQ_F_SHARED_DEPTH)) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
return;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7df68d31bc23..db635ef06a72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2647,12 +2647,31 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
 int blk_mq_update_sched_queue_depth(struct request_queue *q)
 {
unsigned nr;
+   struct blk_mq_hw_ctx *hctx;
+   unsigned int i;
+   int ret = 0;
 
-   if (!q->mq_ops || !q->elevator)
-   return 0;
+   if (!q->mq_ops)
+   return ret;
+
+   blk_mq_freeze_queue(q);
+   /*
+* if there is q->queue_depth, all hw queues share
+* this queue depth limit
+*/
+   if (q->queue_depth) {
+   queue_for_each_hw_ctx(q, hctx, i)
+   hctx->flags |= BLK_MQ_F_SHARED_DEPTH;
+   }
+
+   if (!q->elevator)
+   goto exit;
 
nr = blk_mq_sched_queue_depth(q);
-   return __blk_mq_update_nr_requests(q, true, nr);
+   ret = __blk_mq_update_nr_requests(q, true, nr);
+ exit:
+   blk_mq_unfreeze_queue(q);
+   return ret;
 }
 
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6d44b242b495..14f2ad3af31f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -164,6 +164,7 @@ enum {
BLK_MQ_F_SG_MERGE   = 1 << 2,
BLK_MQ_F_BLOCKING   = 1 << 5,
BLK_MQ_F_NO_SCHED   = 1 << 6,
+   BLK_MQ_F_SHARED_DEPTH   = 1 << 7,
BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
-- 
2.9.4



[PATCH 09/14] blk-mq-sched: cleanup blk_mq_sched_dispatch_requests()

2017-07-31 Thread Ming Lei
This patch splits blk_mq_sched_dispatch_requests()
into two parts:

1) the 1st part checks whether the queue is busy and
handles the busy situation

2) the 2nd part is moved to __blk_mq_sched_dispatch_requests(),
which focuses on dispatching from the sw queue or scheduler queue.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 42 +-
 1 file changed, 25 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index cc0687a4d0ab..07ff53187617 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,16 +89,37 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
return false;
 }
 
-void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+static void __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool can_go = true;
-   LIST_HEAD(rq_list);
struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
has_sched_dispatch ? e->type->ops.mq.dispatch_request :
blk_mq_dispatch_rq_from_ctxs;
+   LIST_HEAD(rq_list);
+
+   if (!has_sched_dispatch && !(hctx->flags & BLK_MQ_F_SHARED_DEPTH)) {
+   blk_mq_flush_busy_ctxs(hctx, &rq_list);
+   blk_mq_dispatch_rq_list(q, &rq_list);
+   return;
+   }
+
+   do {
+   struct request *rq;
+
+   rq = dispatch_fn(hctx);
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+   } while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
+void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+{
+   struct request_queue *q = hctx->queue;
+   bool can_go = true;
+   LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -144,25 +165,12 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
return;
 
-   if (!has_sched_dispatch && !(hctx->flags & BLK_MQ_F_SHARED_DEPTH)) {
-   blk_mq_flush_busy_ctxs(hctx, &rq_list);
-   blk_mq_dispatch_rq_list(q, &rq_list);
-   return;
-   }
-
/*
 * We want to dispatch from the scheduler if we had no work left
 * on the dispatch list, OR if we did have work but weren't able
 * to make progress.
 */
-   do {
-   struct request *rq;
-
-   rq = dispatch_fn(hctx);
-   if (!rq)
-   break;
-   list_add(&rq->queuelist, &rq_list);
-   } while (blk_mq_dispatch_rq_list(q, &rq_list));
+   __blk_mq_sched_dispatch_requests(hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.4



[PATCH 07/14] blk-mq-sched: use q->queue_depth as hint for q->nr_requests

2017-07-31 Thread Ming Lei
SCSI sets q->queue_depth from shost->cmd_per_lun. q->queue_depth
is per request_queue and more closely related to the scheduler
queue than the hw queue depth, which can be shared by queues,
such as with TAG_SHARED.

This patch tries to use q->queue_depth as a hint for computing
q->nr_requests, which should be more effective than the
current way.
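For example, with the updated blk_mq_sched_queue_depth() below, a
LUN-limited device with q->queue_depth = 3 ends up with
q->nr_requests = max(2 * min(3, 128), 32) = 32, while a device
exposing queue_depth = 64 gets 2 * 64 = 128.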

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.h | 18 +++---
 block/blk-mq.c   | 27 +--
 block/blk-mq.h   |  1 +
 block/blk-settings.c |  2 ++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d47f3fda1d0..bb772e680e01 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -99,12 +99,24 @@ static inline bool blk_mq_sched_needs_restart(struct 
blk_mq_hw_ctx *hctx)
 static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
 {
/*
-* Default to double of smaller one between hw queue_depth and 128,
+* q->queue_depth is more close to scheduler queue, so use it
+* as hint for computing scheduler queue depth if it is valid
+*/
+   unsigned q_depth = q->queue_depth ?: q->tag_set->queue_depth;
+
+   /*
+* Default to double of smaller one between queue depth and 128,
 * since we don't split into sync/async like the old code did.
 * Additionally, this is a per-hw queue depth.
 */
-   return 2 * min_t(unsigned int, q->tag_set->queue_depth,
-  BLKDEV_MAX_RQ);
+   q_depth = 2 * min_t(unsigned int, q_depth, BLKDEV_MAX_RQ);
+
+   /*
+* when queue depth of driver is too small, we set queue depth
+* of scheduler queue as 32 so that small queue device still
+* can benefit from IO merging.
+*/
+   return max_t(unsigned, q_depth, 32);
 }
 
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 86b8fdcb8434..7df68d31bc23 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2593,7 +2593,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+static int __blk_mq_update_nr_requests(struct request_queue *q,
+  bool sched_only,
+  unsigned int nr)
 {
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
@@ -2612,7 +2614,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
 * If we're using an MQ scheduler, just update the scheduler
 * queue depth. This is similar to what the old code would do.
 */
-   if (!hctx->sched_tags) {
+   if (!sched_only && !hctx->sched_tags) {
ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
min(nr, 
set->queue_depth),
false);
@@ -2632,6 +2634,27 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
return ret;
 }
 
+int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+   return __blk_mq_update_nr_requests(q, false, nr);
+}
+
+/*
+ * When drivers update q->queue_depth, this API is called so that
+ * we can use this queue depth as hint for adjusting scheduler
+ * queue depth.
+ */
+int blk_mq_update_sched_queue_depth(struct request_queue *q)
+{
+   unsigned nr;
+
+   if (!q->mq_ops || !q->elevator)
+   return 0;
+
+   nr = blk_mq_sched_queue_depth(q);
+   return __blk_mq_update_nr_requests(q, true, nr);
+}
+
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 0c398f29dc4b..44d3aaa03d7c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -36,6 +36,7 @@ bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
bool wait);
 struct request *blk_mq_dispatch_rq_from_ctxs(struct blk_mq_hw_ctx *hctx);
+int blk_mq_update_sched_queue_depth(struct request_queue *q);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/block/blk-settings.c b/block/blk-settings.c
index be1f115b538b..94a349601545 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -877,6 +877,8 @@ void blk_set_queue_depth(struct request_queue *q, unsigned 
int depth)
 {
q->queue_depth = depth;
wbt_set_queue_depth(q->rq_wb, depth);
+
+   WARN_ON(blk_mq_update_sched_queue_depth(q));
 }
 EXPORT_SYMBOL(blk_set_queue_depth);
 
-- 
2.9.4



[PATCH 06/14] blk-mq-sched: introduce blk_mq_sched_queue_depth()

2017-07-31 Thread Ming Lei
The following patch will propose some hints for figuring out the
default queue depth for the scheduler queue, so introduce the
helper blk_mq_sched_queue_depth() for this purpose.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c |  8 +---
 block/blk-mq-sched.h | 11 +++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eb638063673f..3eb524ccb7aa 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -531,13 +531,7 @@ int blk_mq_init_sched(struct request_queue *q, struct 
elevator_type *e)
return 0;
}
 
-   /*
-* Default to double of smaller one between hw queue_depth and 128,
-* since we don't split into sync/async like the old code did.
-* Additionally, this is a per-hw queue depth.
-*/
-   q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-  BLKDEV_MAX_RQ);
+   q->nr_requests = blk_mq_sched_queue_depth(q);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_sched_alloc_tags(q, hctx, i);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..1d47f3fda1d0 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -96,4 +96,15 @@ static inline bool blk_mq_sched_needs_restart(struct 
blk_mq_hw_ctx *hctx)
return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 }
 
+static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
+{
+   /*
+* Default to double of smaller one between hw queue_depth and 128,
+* since we don't split into sync/async like the old code did.
+* Additionally, this is a per-hw queue depth.
+*/
+   return 2 * min_t(unsigned int, q->tag_set->queue_depth,
+  BLKDEV_MAX_RQ);
+}
+
 #endif
-- 
2.9.4



[PATCH 05/14] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-07-31 Thread Ming Lei
During dispatch, we move all requests from hctx->dispatch to
a temporary list, then dispatch them one by one from this list.
Unfortunately, during this period, a queue run from another context
may think the queue is idle and start to dequeue from the sw/scheduler
queue and try to dispatch, because ->dispatch is empty.

This hurts sequential I/O performance because requests are
dequeued while the queue is busy.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c   | 24 ++--
 include/linux/blk-mq.h |  1 +
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3510c01cb17b..eb638063673f 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -112,8 +112,15 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
if (!list_empty_careful(&hctx->dispatch)) {
spin_lock(&hctx->lock);
-   if (!list_empty(&hctx->dispatch))
+   if (!list_empty(&hctx->dispatch)) {
list_splice_init(&hctx->dispatch, &rq_list);
+
+   /*
+* BUSY won't be cleared until all requests
+* in hctx->dispatch are dispatched successfully
+*/
+   set_bit(BLK_MQ_S_BUSY, &hctx->state);
+   }
spin_unlock(&hctx->lock);
}
 
@@ -129,15 +136,20 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
can_go = blk_mq_dispatch_rq_list(q, &rq_list);
-   } else if (!has_sched_dispatch && !q->queue_depth) {
-   blk_mq_flush_busy_ctxs(hctx, &rq_list);
-   blk_mq_dispatch_rq_list(q, &rq_list);
-   can_go = false;
+   if (can_go)
+   clear_bit(BLK_MQ_S_BUSY, &hctx->state);
}
 
-   if (!can_go)
+   /* can't go until ->dispatch is flushed */
+   if (!can_go || test_bit(BLK_MQ_S_BUSY, &hctx->state))
return;
 
+   if (!has_sched_dispatch && !q->queue_depth) {
+   blk_mq_flush_busy_ctxs(hctx, &rq_list);
+   blk_mq_dispatch_rq_list(q, &rq_list);
+   return;
+   }
+
/*
 * We want to dispatch from the scheduler if we had no work left
 * on the dispatch list, OR if we did have work but weren't able
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..6d44b242b495 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -172,6 +172,7 @@ enum {
BLK_MQ_S_SCHED_RESTART  = 2,
BLK_MQ_S_TAG_WAITING= 3,
BLK_MQ_S_START_ON_RUN   = 4,
+   BLK_MQ_S_BUSY   = 5,
 
BLK_MQ_MAX_DEPTH= 10240,
 
-- 
2.9.4



[PATCH 04/14] blk-mq-sched: improve dispatching from sw queue

2017-07-31 Thread Ming Lei
SCSI devices use a host-wide tagset, and the shared
driver tag space is often quite big. Meanwhile
there is also a queue depth for each LUN (.cmd_per_lun),
which is often small.

So lots of requests may stay in the sw queue, and we
always flush all of them belonging to the same hw queue and
dispatch them all to the driver. Unfortunately it is
easy to cause queue busy because of the small
per-LUN queue depth. Once these requests are flushed
out, they have to stay in hctx->dispatch, no bio
merge can happen against these requests, and
sequential IO performance is hurt.

This patch improves dispatching from the sw queue when
there is a per-request-queue queue depth by taking
requests one by one from the sw queue, just like the way
an IO scheduler does.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 47a25333a136..3510c01cb17b 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -96,6 +96,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
bool can_go = true;
LIST_HEAD(rq_list);
+   struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
+   has_sched_dispatch ? e->type->ops.mq.dispatch_request :
+   blk_mq_dispatch_rq_from_ctxs;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
@@ -126,26 +129,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
can_go = blk_mq_dispatch_rq_list(q, &rq_list);
-   } else if (!has_sched_dispatch) {
+   } else if (!has_sched_dispatch && !q->queue_depth) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
+   can_go = false;
}
 
+   if (!can_go)
+   return;
+
/*
 * We want to dispatch from the scheduler if we had no work left
 * on the dispatch list, OR if we did have work but weren't able
 * to make progress.
 */
-   if (can_go && has_sched_dispatch) {
-   do {
-   struct request *rq;
+   do {
+   struct request *rq;
 
-   rq = e->type->ops.mq.dispatch_request(hctx);
-   if (!rq)
-   break;
-   list_add(&rq->queuelist, &rq_list);
-   } while (blk_mq_dispatch_rq_list(q, &rq_list));
-   }
+   rq = dispatch_fn(hctx);
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+   } while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.4



[PATCH 03/14] blk-mq: introduce blk_mq_dispatch_rq_from_ctxs()

2017-07-31 Thread Ming Lei
This function is introduced for picking up a request
from the sw queue so that we can dispatch it in the scheduler's way.

More importantly, for some SCSI devices, driver
tags are host wide, and the number is quite big,
but each LUN has a very limited queue depth. This
function avoids taking too many requests from the
sw queue when the queue is busy, and only tries to
dispatch a request when the queue isn't busy.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 38 +-
 block/blk-mq.h |  1 +
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 94818f78c099..86b8fdcb8434 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,7 +810,11 @@ static void blk_mq_timeout_work(struct work_struct *work)
 
 struct ctx_iter_data {
struct blk_mq_hw_ctx *hctx;
-   struct list_head *list;
+
+   union {
+   struct list_head *list;
+   struct request *rq;
+   };
 };
 
 static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
@@ -826,6 +830,26 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned 
int bitnr, void *data)
return true;
 }
 
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void 
*data)
+{
+   struct ctx_iter_data *dispatch_data = data;
+   struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+   struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+   bool empty = true;
+
+   spin_lock(&ctx->lock);
+   if (unlikely(!list_empty(&ctx->rq_list))) {
+   dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+   list_del_init(&dispatch_data->rq->queuelist);
+   empty = list_empty(&ctx->rq_list);
+   }
+   spin_unlock(&ctx->lock);
+   if (empty)
+   sbitmap_clear_bit(sb, bitnr);
+
+   return !dispatch_data->rq;
+}
+
 /*
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
@@ -841,6 +865,18 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct request *blk_mq_dispatch_rq_from_ctxs(struct blk_mq_hw_ctx *hctx)
+{
+   struct ctx_iter_data data = {
+   .hctx = hctx,
+   .rq   = NULL,
+   };
+
+   sbitmap_for_each_set(&hctx->ctx_map, dispatch_rq_from_ctx, &data);
+
+   return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 60b01c0309bc..0c398f29dc4b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, 
struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
bool wait);
+struct request *blk_mq_dispatch_rq_from_ctxs(struct blk_mq_hw_ctx *hctx);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.9.4



[PATCH 02/14] blk-mq: rename flush_busy_ctx_data as ctx_iter_data

2017-07-31 Thread Ming Lei
The following patch needs to reuse this data structure,
so rename it to a generic name.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b70a4ad78b63..94818f78c099 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -808,14 +808,14 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_queue_exit(q);
 }
 
-struct flush_busy_ctx_data {
+struct ctx_iter_data {
struct blk_mq_hw_ctx *hctx;
struct list_head *list;
 };
 
 static bool flush_busy_ctx(struct sbitmap *sb, unsigned int bitnr, void *data)
 {
-   struct flush_busy_ctx_data *flush_data = data;
+   struct ctx_iter_data *flush_data = data;
struct blk_mq_hw_ctx *hctx = flush_data->hctx;
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
 
@@ -832,7 +832,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int 
bitnr, void *data)
  */
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 {
-   struct flush_busy_ctx_data data = {
+   struct ctx_iter_data data = {
.hctx = hctx,
.list = list,
};
-- 
2.9.4



[PATCH 01/14] blk-mq-sched: fix scheduler bad performance

2017-07-31 Thread Ming Lei
When the hw queue is busy, we shouldn't take requests from
the scheduler queue any more, otherwise IO merging will be
difficult to do.

This patch fixes the awful IO performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests if the hw queue is busy.

Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..47a25333a136 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool did_work = false;
+   bool can_go = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,7 +125,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+   can_go = blk_mq_dispatch_rq_list(q, &rq_list);
} else if (!has_sched_dispatch) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
@@ -136,7 +136,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * on the dispatch list, OR if we did have work but weren't able
 * to make progress.
 */
-   if (!did_work && has_sched_dispatch) {
+   if (can_go && has_sched_dispatch) {
do {
struct request *rq;
 
-- 
2.9.4



[PATCH 00/14] blk-mq-sched: fix SCSI-MQ performance regression

2017-07-31 Thread Ming Lei
In Red Hat internal storage test wrt. blk-mq scheduler, we
found that its performance is quite bad, especially
about sequential I/O on some multi-queue SCSI devcies.

Turns out one big issue causes the performance regression: requests
are still dequeued from sw queue/scheduler queue even when ldd's
queue is busy, so I/O merge becomes quite difficult to do, and
sequential IO degrades a lot.

The 1st five patches improve this situation, and brings back
some performance loss.

But looks they are still not enough. Finally it is caused by
the shared queue depth among all hw queues. For SCSI devices,
.cmd_per_lun defines the max number of pending I/O on one
request queue, which is per-request_queue depth. So during
dispatch, if one hctx is too busy to move on, all hctxs can't
dispatch too because of the per-request_queue depth.

Patch 6 ~ 14 use per-request_queue dispatch list to avoid
to dequeue requests from sw/scheduler queue when lld queue
is busy.

With these changes, SCSI-MQ performance is brought back in line with
the legacy block path; the test results on lpfc follow:

- fio(libaio, bs:4k, dio, queue_depth:64, 20 jobs)


                 | v4.13-rc3       | v4.13-rc3   | patched v4.13-rc3
                 | legacy deadline | mq-none     | mq-none
-----------------+-----------------+-------------+-------------------
read      "iops" | 401749.4001     | 346237.5025 | 387536.4427
randread  "iops" | 25175.07121     | 21688.64067 | 25578.50374
write     "iops" | 376168.7578     | 335262.0475 | 370132.4735
randwrite "iops" | 25235.46163     | 24982.63819 | 23934.95610

                 | v4.13-rc3       | v4.13-rc3   | patched v4.13-rc3
                 | legacy deadline | mq-deadline | mq-deadline
-----------------+-----------------+-------------+-------------------
read      "iops" | 401749.4001     | 35592.48901 | 401681.1137
randread  "iops" | 25175.07121     | 30029.52618 | 21446.68731
write     "iops" | 376168.7578     | 27340.56777 | 377356.7286
randwrite "iops" | 25235.46163     | 24395.02969 | 24885.66152

Ming Lei (14):
  blk-mq-sched: fix scheduler bad performance
  blk-mq: rename flush_busy_ctx_data as ctx_iter_data
  blk-mq: introduce blk_mq_dispatch_rq_from_ctxs()
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed
  blk-mq-sched: introduce blk_mq_sched_queue_depth()
  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  blk-mq-sched: cleanup blk_mq_sched_dispatch_requests()
  blk-mq-sched: introduce helpers for query, change busy state
  blk-mq: introduce helpers for operating ->dispatch list
  blk-mq: introduce pointers to dispatch lock & list
  blk-mq: pass 'request_queue *' to several helpers of operating BUSY
  blk-mq-sched: improve IO scheduling on SCSI device

 block/blk-mq-debugfs.c |  11 ++---
 block/blk-mq-sched.c   |  70 +++--
 block/blk-mq-sched.h   |  23 ++
 block/blk-mq.c | 117 +++--
 block/blk-mq.h |  72 ++
 block/blk-settings.c   |   2 +
 include/linux/blk-mq.h |   5 +++
 include/linux/blkdev.h |   5 +++
 8 files changed, 255 insertions(+), 50 deletions(-)

-- 
2.9.4



[PATCH] scsi: csiostor: fail probe if fw does not support FCoE

2017-07-31 Thread Varun Prakash
Fail probe if FCoE capability is not enabled in the
firmware.

Signed-off-by: Varun Prakash 
---
 drivers/scsi/csiostor/csio_hw.c   |  4 +++-
 drivers/scsi/csiostor/csio_init.c | 12 
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/csiostor/csio_hw.c b/drivers/scsi/csiostor/csio_hw.c
index 2029ad2..5be0086 100644
--- a/drivers/scsi/csiostor/csio_hw.c
+++ b/drivers/scsi/csiostor/csio_hw.c
@@ -3845,8 +3845,10 @@ csio_hw_start(struct csio_hw *hw)
 
if (csio_is_hw_ready(hw))
return 0;
-   else
+   else if (csio_match_state(hw, csio_hws_uninit))
return -EINVAL;
+   else
+   return -ENODEV;
 }
 
 int
diff --git a/drivers/scsi/csiostor/csio_init.c b/drivers/scsi/csiostor/csio_init.c
index ea0c310..dcd0741 100644
--- a/drivers/scsi/csiostor/csio_init.c
+++ b/drivers/scsi/csiostor/csio_init.c
@@ -969,10 +969,14 @@ static int csio_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 
pci_set_drvdata(pdev, hw);
 
-   if (csio_hw_start(hw) != 0) {
-   dev_err(&pdev->dev,
-   "Failed to start FW, continuing in debug mode.\n");
-   return 0;
+   rv = csio_hw_start(hw);
+   if (rv) {
+   if (rv == -EINVAL) {
+   dev_err(&pdev->dev,
+   "Failed to start FW, continuing in debug 
mode.\n");
+   return 0;
+   }
+   goto err_lnode_exit;
}
 
sprintf(hw->fwrev_str, "%u.%u.%u.%u\n",
-- 
2.0.2
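
For readers skimming the diff, here is a tiny stand-alone C sketch of the
decision this patch introduces. The helper names and inputs are invented for
illustration; only the -EINVAL/-ENODEV split and the debug-mode fallback
mirror the patch.

/*
 * Illustration only: models how the probe path reacts to csio_hw_start()
 * after this patch.  Not csiostor code.
 */
#include <errno.h>
#include <stdio.h>

static int hw_start(int hw_ready, int hw_uninit)
{
        if (hw_ready)
                return 0;
        if (hw_uninit)
                return -EINVAL;         /* old behaviour: continue in debug mode */
        return -ENODEV;                 /* any other not-ready state fails probe */
}

static int probe_one(int hw_ready, int hw_uninit)
{
        int rv = hw_start(hw_ready, hw_uninit);

        if (rv == -EINVAL) {
                printf("Failed to start FW, continuing in debug mode.\n");
                return 0;
        }
        return rv;                      /* 0 on success, -ENODEV ends the probe */
}

int main(void)
{
        printf("hw ready:          %d\n", probe_one(1, 1));
        printf("not ready, uninit: %d\n", probe_one(0, 1));
        printf("not ready, other:  %d\n", probe_one(0, 0));
        return 0;
}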



Re: [PATCH 00/29] constify scsi pci_device_id.

2017-07-31 Thread Johannes Thumshirn
On Mon, Jul 31, 2017 at 02:23:11PM +0530, Arvind Yadav wrote:
> Yes, we can add all of them in a single patch, but another maintainer wants
> individual patches; that's why I have sent 29 patches. :(

Ultimately it's up to Martin and James, but I don't see a huge benefit in
having it all in separate patches.

Thanks,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 00/29] constify scsi pci_device_id.

2017-07-31 Thread Arvind Yadav



On Monday 31 July 2017 01:26 PM, Johannes Thumshirn wrote:

> On Sun, Jul 30, 2017 at 02:07:09PM +0530, Arvind Yadav wrote:
> > pci_device_id are not supposed to change at runtime. All functions
> > working with pci_device_id provided by  work with
> > const pci_device_id. So mark the non-const structs as const.
> 
> Can't this go all in one patch instead of replicating the same patch 29
> times?

Yes, we can add all of them in a single patch, but another maintainer wants
individual patches; that's why I have sent 29 patches. :(

> Thanks,
> Johannes


~arvind


Re: [PATCH 04/29] scsi: pm8001: constify pci_device_id.

2017-07-31 Thread Jinpu Wang
On Sun, Jul 30, 2017 at 10:37 AM, Arvind Yadav
 wrote:
> pci_device_id are not supposed to change at runtime. All functions
> working with pci_device_id provided by  work with
> const pci_device_id. So mark the non-const structs as const.
>
> Signed-off-by: Arvind Yadav 
> ---
>  drivers/scsi/pm8001/pm8001_init.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
> index 034b2f7..f2757cc 100644
> --- a/drivers/scsi/pm8001/pm8001_init.c
> +++ b/drivers/scsi/pm8001/pm8001_init.c
> @@ -1270,7 +1270,7 @@ static int pm8001_pci_resume(struct pci_dev *pdev)
>  /* update of pci device, vendor id and driver data with
>   * unique value for each of the controller
>   */
> -static struct pci_device_id pm8001_pci_table[] = {
> +static const struct pci_device_id pm8001_pci_table[] = {
> { PCI_VDEVICE(PMC_Sierra, 0x8001), chip_8001 },
> { PCI_VDEVICE(PMC_Sierra, 0x8006), chip_8006 },
> { PCI_VDEVICE(ADAPTEC2, 0x8006), chip_8006 },
> --
> 2.7.4
>

Thanks,
Acked-by: Jack Wang 

-- 
Jack Wang
Linux Kernel Developer
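
As an aside, the rationale quoted above ("pci_device_id are not supposed to
change at runtime") boils down to letting the ID table live in read-only
data. The stand-alone sketch below shows the same pattern with an invented
struct and made-up IDs; it is not the real pci_device_id type or the pm8001
table.

/*
 * Sketch of a const device-ID table; struct dev_id and the IDs are invented.
 * In the kernel this would be "static const struct pci_device_id ...[]".
 */
#include <stdio.h>

struct dev_id {
        unsigned int vendor;
        unsigned int device;
        unsigned long driver_data;
};

/* const: the table never changes at runtime and can live in read-only data. */
static const struct dev_id example_table[] = {
        { 0x1234, 0x0001, 1 },          /* made-up vendor/device pairs */
        { 0x1234, 0x0006, 2 },
        { 0, }                          /* terminator, as in real ID tables */
};

static const struct dev_id *match(unsigned int vendor, unsigned int device)
{
        for (const struct dev_id *id = example_table; id->vendor; id++)
                if (id->vendor == vendor && id->device == device)
                        return id;
        return NULL;
}

int main(void)
{
        const struct dev_id *id = match(0x1234, 0x0006);

        printf("match: %s (driver_data=%lu)\n",
               id ? "found" : "none", id ? id->driver_data : 0UL);
        return 0;
}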


Re: [PATCH 00/29] constify scsi pci_device_id.

2017-07-31 Thread Johannes Thumshirn
On Sun, Jul 30, 2017 at 02:07:09PM +0530, Arvind Yadav wrote:
> pci_device_id are not supposed to change at runtime. All functions
> working with pci_device_id provided by  work with
> const pci_device_id. So mark the non-const structs as const.

Can't this go all in one patch instead of replicating the same patch 29
times?

Thanks,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850