Hi, Jens,Tejun Current problem we meet:
When Multiple Queue is used for scsi, the period between each LUN probe is increasing as the number of block request queue goes up, eventually it takes hours for Initiator to finish scanning thousands of LUNs. Kernel version we're using: 4.1.6-14 I tested with MainLine, it also has same issue. My analysis: As for each new LUN, scsi_mq_alloc_queue is called when Multiple Queue is enabled for scsi to allocate and initialize the new queue, the process is: ->blk_mq_init_queue ->blk_mq_init_allocated_queue ->blk_mq_add_queue_tag_set ->blk_mq_update_tag_set_depth blk_mq_update_tag_set_depth calls blk_mq_freeze_queue for each queue in the set tag list, if the queue is already in the tag list, we can see that it takes tens of ticks to freeze the queue, if there are thousands of queues in the list, it will take more than ten seconds(HZ=1000) for a new queue to finish its initialization, this is really a high delay, my question here is: why blk_mq_update_tag_set_depth updates each queue in the tag list? if this thing must be done, as the code below shows just changing flags depending on 'shared' variable why shouldn't we store the previous result of 'shared' and compare with the current result, if it's unchanged, nothing will be done and avoid looping all queues in list. static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set) { struct blk_mq_hw_ctx *hctx; struct request_queue *q; bool shared; int i; if (set->tag_list.next == set->tag_list.prev) shared = false; else shared = true; list_for_each_entry(q, &set->tag_list, tag_set_list) { blk_mq_freeze_queue(q); queue_for_each_hw_ctx(q, hctx, i) { if (shared) hctx->flags |= BLK_MQ_F_TAG_SHARED; else hctx->flags &= ~BLK_MQ_F_TAG_SHARED; } blk_mq_unfreeze_queue(q); } } Deep insight: I dig more into the delay and find more details, for each already existing queue, the delay comes from blk_mq_freeze_queue_wait function because the request queue->mq_usage_counter is not zero, blocked on it and woken up by percpu_ref_switch_to_atomic_rcu after a grace period, let me explain how this happens. Although for each new allocating queue, PERCPU_REF_INIT_ATOMIC flag is set when allocating the percpu ref, but when registering queue, blk_mq_finish_init changes it to PERCPU by calling percpu_ref_switch_to_percpu, So every time blk_mq_freeze_queue_start, it runs in this way blk_mq_freeze_queue_start ->percpu_ref_kill->percpu_ref_kill_and_confirm ->__percpu_ref_switch_to_atomic ->call_rcu_sched(&ref->rcu,percpu_ref_switch_to_atomic_rcu) and blk_mq_freeze_queue_wait blocks on queue->mq_usage_counter as it is not zero, and wake up by percpu_ref_switch_to_atomic_rcu after a grace period My question here is why should we change ref to PERCPU at blk_mq_finish_init? because of this changing, delay appears. My suggestion: Don't change ref to PERCPU by blk_mq_finish_init, keeping it in ATOMIC, this can solve the problem. Or optimize blk_mq_update_tag_set_depth to avoid such delay The patch below is one possible fix, as I can't see if it can introduce other regression issue, so could you guy look into it and give some advice? /Thanks 0001-Avoid-long-delay-for-scsi-scanning-when-thousands-of.patch >From 4505e8d499a7c336ebc3b588b089f5408841b421 Mon Sep 17 00:00:00 2001 From: Jason Luo <zhangqing....@oracle.com> Date: Mon, 19 Oct 2015 14:51:51 +0800 Subject: [PATCH] Avoid long delay for scsi scanning when thousands of LUNs Signed-off-by: Jason Luo <zhangqing....@oracle.com> --- block/blk-mq-sysfs.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index b79685e..06533a9 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -404,7 +404,9 @@ static void blk_mq_sysfs_init(struct request_queue *q) /* see blk_register_queue() */ void blk_mq_finish_init(struct request_queue *q) { - percpu_ref_switch_to_percpu(&q->mq_usage_counter); + /* Maybe some other things can be done here in the future + * leave it with empty */ + /*percpu_ref_switch_to_percpu(&q->mq_usage_counter);*/ } int blk_mq_register_disk(struct gendisk *disk) -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/