Re: [block regression] kernel oops triggered by removing scsi device during IO
Hi Bart, On 18/4/9 12:47, Bart Van Assche wrote: > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: >> The following kernel oops is triggered by 'removing scsi device' during >> heavy IO. > > Is the below patch sufficient to fix this? > > Thanks, > > Bart. > > > Subject: blk-mq: Avoid that submitting a bio concurrently with device removal > triggers a crash > > Because blkcg_exit_queue() is now called from inside blk_cleanup_queue() > it is no longer safe to access cgroup information during or after the > blk_cleanup_queue() call. Hence check earlier in generic_make_request() > whether the queue has been marked as "dying". The oops happens during generic_make_request_checks(), in blk_throtl_bio() exactly. So if we want to bypass dying queue, we have to check this before generic_make_request_checks(), I think. Thanks, Joseph > --- > block/blk-core.c | 72 > +--- > 1 file changed, 37 insertions(+), 35 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index aa8c99fae527..3ac9dd25e04e 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2385,10 +2385,21 @@ blk_qc_t generic_make_request(struct bio *bio) >* yet. >*/ > struct bio_list bio_list_on_stack[2]; > + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? > + BLK_MQ_REQ_NOWAIT : 0; > + struct request_queue *q = bio->bi_disk->queue; > blk_qc_t ret = BLK_QC_T_NONE; > > if (!generic_make_request_checks(bio)) > - goto out; > + return ret; > + > + if (blk_queue_enter(q, flags) < 0) { > + if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))) > + bio_wouldblock_error(bio); > + else > + bio_io_error(bio); > + return ret; > + } > > /* >* We only want one ->make_request_fn to be active at a time, else > @@ -2423,46 +2434,37 @@ blk_qc_t generic_make_request(struct bio *bio) > bio_list_init(&bio_list_on_stack[0]); > current->bio_list = bio_list_on_stack; > do { > - struct request_queue *q = bio->bi_disk->queue; > - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? 
> - BLK_MQ_REQ_NOWAIT : 0; > - > - if (likely(blk_queue_enter(q, flags) == 0)) { > - struct bio_list lower, same; > - > - /* Create a fresh bio_list for all subordinate requests > */ > - bio_list_on_stack[1] = bio_list_on_stack[0]; > - bio_list_init(&bio_list_on_stack[0]); > - ret = q->make_request_fn(q, bio); > - > - blk_queue_exit(q); > - > - /* sort new bios into those for a lower level > - * and those for the same level > - */ > - bio_list_init(&lower); > - bio_list_init(&same); > - while ((bio = bio_list_pop(&bio_list_on_stack[0])) != > NULL) > - if (q == bio->bi_disk->queue) > - bio_list_add(&same, bio); > - else > - bio_list_add(&lower, bio); > - /* now assemble so we handle the lowest level first */ > - bio_list_merge(&bio_list_on_stack[0], &lower); > - bio_list_merge(&bio_list_on_stack[0], &same); > - bio_list_merge(&bio_list_on_stack[0], > &bio_list_on_stack[1]); > - } else { > - if (unlikely(!blk_queue_dying(q) && > - (bio->bi_opf & REQ_NOWAIT))) > - bio_wouldblock_error(bio); > + struct bio_list lower, same; > + > + WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT) && > + (bio->bi_opf & REQ_NOWAIT)); > + WARN_ON_ONCE(q != bio->bi_disk->queue); > + q = bio->bi_disk->queue; > + /* Create a fresh bio_list for all subordinate requests */ > + bio_list_on_stack[1] = bio_list_on_stack[0]; > + bio_list_init(&bio_list_on_stack[0]); > + ret = q->make_request_fn(q, bio); > + > + /* sort new bios into those for a lower level > + * and those for the same level > + */ > + bio_list_init(&lower); > + bio_list_init(&same); > + while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) > + if (q == bio->bi_disk->queue) > + bio_list_add(&same, bio); > else > - bio_io_error(bio); > - } > + bio_list_add(&lower, bio); > + /* now assemble so we handle the lowest level first */ > +
Re: Block layer use of __GFP flags
On Mon, 9 Apr 2018 04:46:22 + "Bart Van Assche" wrote:

> On Sun, 2018-04-08 at 12:08 -0700, Matthew Wilcox wrote:
> > On Sun, Apr 08, 2018 at 04:40:59PM +, Bart Van Assche wrote:
> > > Do you perhaps want me to prepare a patch that makes
> > > blk_get_request() again respect the full gfp mask passed as third
> > > argument to blk_get_request()?
> >
> > I think that would be a good idea. If it's onerous to have extra
> > arguments, there are some bits in gfp_flags which could be used for
> > your purposes.
>
> That's indeed something we can consider.
>
> It would be appreciated if you could have a look at the patch below.
>
> Thanks,
>
> Bart.
>
Why don't you fold the 'flags' argument into 'gfp_flags', and drop the
'flags' argument completely? Looks a bit pointless to me, having two
arguments denoting basically the same thing ...

Cheers,

Hannes
Re: Multi-Actuator SAS HDD First Look
On 04/09/2018 04:08 AM, Tim Walker wrote:
> On Fri, Apr 6, 2018 at 11:09 AM, Douglas Gilbert wrote:
>>
>> On 2018-04-06 02:42 AM, Christoph Hellwig wrote:
>>>
>>> On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote:
>>>> Ah. Far better.
>>>> What about delegating FORMAT UNIT to the control LUN, and not
>>>> implementing it for the individual disk LUNs?
>>>> That would make an even stronger case for having a control LUN;
>>>> with that there wouldn't be any problem with having to synchronize
>>>> across LUNs etc.
>>>
>>> It sounds to me like NVMe might be a much better model for this drive
>>> than SCSI, btw :)
>>
>> So you found a document that outlines NVMe's architecture! Could you
>> share the url (no marketing BS, please)?
>>
>> And a serious question ... How would you map NVMe's (in Linux)
>> subsystem number, controller device minor number, CNTLID field
>> (Identify ctl response) and namespace id onto the SCSI subsystem's
>> h:c:t:l ?
>>
>> Doug Gilbert
>
> Hannes- yes, a drive system altering operation like FORMAT UNIT is
> asking for a dedicated management port, as the NVMe folks apparently
> felt. But what is the least painful endpoint type for LUN0?
>
I would probably use 'processor device' (ie type 3) as it's the least
defined, so you can do basically everything you like with it.
Possibly 'Enclosure Services' (type 0x0d) works, too, but then you have
to check with the SES spec if it allows the things you'd need.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		Teamlead Storage & Networking
h...@suse.de			+49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
[PATCH] blk-throttle: discard stale last_low_overflow_time
When there is only one type of traffic, the associated
last_low_overflow_time is never updated, so its value is stale and
should be discarded. Otherwise __tg_last_low_overflow_time always
returns the stale value because it is smaller, we always conclude that
bps/iops has been below the low limit for one throtl_slice, and
limit_index keeps bouncing between LOW and MAX while the actual
bps/iops stays at the low limit.

Add last_submit_time[2] to the throtl_grp to track when a bio of each
type last entered blk_throtl_bio. If no bio of type 'rw' has entered
during the past 5 throtl_slices, and the actual dispatched bps/iops is
indeed lower than the low limit, return 0 as the
last_low_overflow_time to indicate that it is stale.

A stale last_low_overflow_time is discarded, but if both types are
stale we return 0 instead of 'now'; otherwise cgroups that have no
traffic at all would prevent an upgrade.

Signed-off-by: Jianchao Wang
---
 block/blk-throttle.c | 37 +++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index c5a1316..851aa16 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -158,6 +158,7 @@ struct throtl_grp {
 	unsigned int io_disp[2];

 	unsigned long last_low_overflow_time[2];
+	unsigned long last_submit_time[2];

 	uint64_t last_bytes_disp[2];
 	unsigned int last_io_disp[2];
@@ -1752,15 +1753,42 @@ static struct blkcg_policy blkcg_policy_throtl = {
 	.pd_free_fn		= throtl_pd_free,
 };

+/*
+ * If there is no traffic of type 'rw' into blk_throtl_bio during the
+ * past 5 throtl_slices, AND the actual dispatched bps/iops of type 'rw'
+ * is indeed lower than the low limit, return 0 as the
+ * last_low_overflow_time, which indicates it is stale.
+ */
+static inline unsigned long tg_return_lloft(struct throtl_grp *tg,
+					    unsigned int rw)
+{
+	unsigned long time = tg->last_low_overflow_time[rw];
+	unsigned long now = jiffies;
+
+	if (!time_after(now,
+			tg->last_submit_time[rw] + 5 * tg->td->throtl_slice))
+		return time;
+
+	if (!time_after(now, time + 5 * tg->td->throtl_slice))
+		return time;
+
+	return 0;
+}
+
 static unsigned long __tg_last_low_overflow_time(struct throtl_grp *tg)
 {
 	unsigned long rtime = jiffies, wtime = jiffies;

 	if (tg->bps[READ][LIMIT_LOW] || tg->iops[READ][LIMIT_LOW])
-		rtime = tg->last_low_overflow_time[READ];
+		rtime = tg_return_lloft(tg, READ);
 	if (tg->bps[WRITE][LIMIT_LOW] || tg->iops[WRITE][LIMIT_LOW])
-		wtime = tg->last_low_overflow_time[WRITE];
-	return min(rtime, wtime);
+		wtime = tg_return_lloft(tg, WRITE);
+
+	/*
+	 * A cgroup w/o any traffic could have two stale values; return 0
+	 * instead of 'now', otherwise it will prevent upgrade.
+	 */
+	return (rtime && wtime) ? min(rtime, wtime) : (rtime + wtime);
 }

 /* tg should not be an intermediate node */
@@ -2175,8 +2203,11 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 	while (true) {
 		if (tg->last_low_overflow_time[rw] == 0)
 			tg->last_low_overflow_time[rw] = jiffies;
+		if (unlikely(tg->last_submit_time[rw] == 0))
+			tg->last_submit_time[rw] = jiffies;
 		throtl_downgrade_check(tg);
 		throtl_upgrade_check(tg);
+		tg->last_submit_time[rw] = jiffies;
 		/* throtl is FIFO - if bios are already queued, should queue */
 		if (sq->nr_queued[rw])
 			break;
-- 
2.7.4
[PATCH] blk-throttle: only update last_low_overflow_time when LIMIT_MAX
Currently, last_low_overflow_time is updated whenever bios are throttled
and queued. That is correct under LIMIT_MAX, but not under LIMIT_LOW:
what last_low_overflow_time indicates is dispatching, not submitting.
Under LIMIT_LOW the dispatched bps/iops should never be above the low
limit, so we should not update last_low_overflow_time at that point.
Otherwise it becomes hard to upgrade limit_index when there are always
bios waiting to be submitted.

Signed-off-by: Jianchao Wang
---
 block/blk-throttle.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 851aa16..b5ba845 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2214,7 +2214,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,

 		/* if above limits, break to queue */
 		if (!tg_may_dispatch(tg, bio, NULL)) {
-			tg->last_low_overflow_time[rw] = jiffies;
+			if (td->limit_index == LIMIT_MAX)
+				tg->last_low_overflow_time[rw] = jiffies;
 			if (throtl_can_upgrade(td, tg)) {
 				throtl_upgrade_state(td);
 				goto again;
@@ -2258,7 +2259,8 @@ bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg,
 			   tg->io_disp[rw], tg_iops_limit(tg, rw),
 			   sq->nr_queued[READ], sq->nr_queued[WRITE]);

-		tg->last_low_overflow_time[rw] = jiffies;
+		if (td->limit_index == LIMIT_MAX)
+			tg->last_low_overflow_time[rw] = jiffies;

 		td->nr_queued[rw]++;
 		throtl_add_bio_tg(bio, qn, tg);
-- 
2.7.4
[PATCH] blk-mq: Fix recently introduced races in the timeout handling code
If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Introduce a spinlock to protect the request state and deadline
  changes.
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline
  instead of the lowest two bits of 'gstate'.
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and
  aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
This patch fixes the following kernel crash: BUG: unable to handle kernel NULL pointer dereference at (null) Oops: [#1] PREEMPT SMP CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: GW4.15.0-dbg+ #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Workqueue: kblockd blk_mq_timeout_work RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod] Call Trace: blk_mq_terminate_expired+0x42/0x80 bt_iter+0x3d/0x50 blk_mq_queue_tag_busy_iter+0xe9/0x200 blk_mq_timeout_work+0x181/0x2e0 process_one_work+0x21c/0x6d0 worker_thread+0x35/0x380 kthread+0x117/0x130 ret_from_fork+0x24/0x30 Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme") Signed-off-by: Bart Van Assche Cc: Tejun Heo Cc: Christoph Hellwig Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy Cc: # v4.16 --- block/blk-core.c | 3 +- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 178 +++-- block/blk-mq.h | 25 ++- block/blk-timeout.c| 1 - block/blk.h| 4 +- include/linux/blkdev.h | 28 ++-- 7 files changed, 53 insertions(+), 187 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 2623e609db4a..83c7a58e4fb3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -200,8 +200,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq) rq->start_time = jiffies; set_start_time_ns(rq); rq->part = NULL; - seqcount_init(&rq->gstate_seq); - u64_stats_init(&rq->aborted_gstate_sync); + spin_lock_init(&rq->state_lock); } EXPORT_SYMBOL(blk_rq_init); diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index 6f72413b6cab..80c7c585769f 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -345,7 +345,6 @@ static const char *const rqf_name[] = { RQF_NAME(STATS), RQF_NAME(SPECIAL_PAYLOAD), RQF_NAME(ZONE_WRITE_LOCKED), - RQF_NAME(MQ_TIMEOUT_EXPIRED), RQF_NAME(MQ_POLL_SLEPT), }; #undef RQF_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 7816d28b7219..1da16d5e5cf1 100644 --- a/block/blk-mq.c +++ 
b/block/blk-mq.c @@ -305,7 +305,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data, rq->special = NULL; /* tag was already set */ rq->extra_len = 0; - rq->__deadline = 0; INIT_LIST_HEAD(&rq->timeout_list); rq->timeout = 0; @@ -527,8 +526,7 @@ static void __blk_mq_complete_request(struct request *rq) bool shared = false; int cpu; - WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT); - blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE); + WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE); if (rq->internal_tag != -1) blk_mq_sched_completed_request(rq); @@ -577,34 +575,26 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx) *srcu_idx = srcu_read_lock(hctx->srcu); } -static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate) +/** + * blk_mq_change_rq_state - atomically test and set request state + * @rq: Request pointer. + * @old: Old request state. + * @new: New request state. + */ +static bool blk_mq_change_rq_state(struct request *rq, enum mq_rq_state old, + enum mq_rq_stat
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: > The following kernel oops is triggered by 'removing scsi device' during > heavy IO. Is the below patch sufficient to fix this? Thanks, Bart. Subject: blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash Because blkcg_exit_queue() is now called from inside blk_cleanup_queue() it is no longer safe to access cgroup information during or after the blk_cleanup_queue() call. Hence check earlier in generic_make_request() whether the queue has been marked as "dying". --- block/blk-core.c | 72 +--- 1 file changed, 37 insertions(+), 35 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index aa8c99fae527..3ac9dd25e04e 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2385,10 +2385,21 @@ blk_qc_t generic_make_request(struct bio *bio) * yet. */ struct bio_list bio_list_on_stack[2]; + blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? + BLK_MQ_REQ_NOWAIT : 0; + struct request_queue *q = bio->bi_disk->queue; blk_qc_t ret = BLK_QC_T_NONE; if (!generic_make_request_checks(bio)) - goto out; + return ret; + + if (blk_queue_enter(q, flags) < 0) { + if (unlikely(!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))) + bio_wouldblock_error(bio); + else + bio_io_error(bio); + return ret; + } /* * We only want one ->make_request_fn to be active at a time, else @@ -2423,46 +2434,37 @@ blk_qc_t generic_make_request(struct bio *bio) bio_list_init(&bio_list_on_stack[0]); current->bio_list = bio_list_on_stack; do { - struct request_queue *q = bio->bi_disk->queue; - blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ? 
- BLK_MQ_REQ_NOWAIT : 0; - - if (likely(blk_queue_enter(q, flags) == 0)) { - struct bio_list lower, same; - - /* Create a fresh bio_list for all subordinate requests */ - bio_list_on_stack[1] = bio_list_on_stack[0]; - bio_list_init(&bio_list_on_stack[0]); - ret = q->make_request_fn(q, bio); - - blk_queue_exit(q); - - /* sort new bios into those for a lower level -* and those for the same level -*/ - bio_list_init(&lower); - bio_list_init(&same); - while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) - if (q == bio->bi_disk->queue) - bio_list_add(&same, bio); - else - bio_list_add(&lower, bio); - /* now assemble so we handle the lowest level first */ - bio_list_merge(&bio_list_on_stack[0], &lower); - bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); - } else { - if (unlikely(!blk_queue_dying(q) && - (bio->bi_opf & REQ_NOWAIT))) - bio_wouldblock_error(bio); + struct bio_list lower, same; + + WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT) && +(bio->bi_opf & REQ_NOWAIT)); + WARN_ON_ONCE(q != bio->bi_disk->queue); + q = bio->bi_disk->queue; + /* Create a fresh bio_list for all subordinate requests */ + bio_list_on_stack[1] = bio_list_on_stack[0]; + bio_list_init(&bio_list_on_stack[0]); + ret = q->make_request_fn(q, bio); + + /* sort new bios into those for a lower level +* and those for the same level +*/ + bio_list_init(&lower); + bio_list_init(&same); + while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) + if (q == bio->bi_disk->queue) + bio_list_add(&same, bio); else - bio_io_error(bio); - } + bio_list_add(&lower, bio); + /* now assemble so we handle the lowest level first */ + bio_list_merge(&bio_list_on_stack[0], &lower); + bio_list_merge(&bio_list_on_stack[0], &same); + bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); bio = bio_list_pop(&bio_list_on_stack[0]); } while (bio); current->bio_list = NULL; /* deactivate */ out: +
Re: Block layer use of __GFP flags
On Sun, 2018-04-08 at 12:08 -0700, Matthew Wilcox wrote: > On Sun, Apr 08, 2018 at 04:40:59PM +, Bart Van Assche wrote: > > Do you perhaps want me to prepare a patch that makes blk_get_request() again > > respect the full gfp mask passed as third argument to blk_get_request()? > > I think that would be a good idea. If it's onerous to have extra arguments, > there are some bits in gfp_flags which could be used for your purposes. That's indeed something we can consider. It would be appreciated if you could have a look at the patch below. Thanks, Bart. Subject: block: Make blk_get_request() again respect the entire gfp_t argument Commit 6a15674d1e90 ("block: Introduce blk_get_request_flags()") translates the third blk_get_request() arguments GFP_KERNEL, GFP_NOIO and __GFP_RECLAIM into __GFP_DIRECT_RECLAIM. Make blk_get_request() again pass the full gfp_t argument to blk_old_get_request() such that kswapd is again woken up under low memory conditions if the caller requested this. Notes: - This change only affects the behavior of the legacy block layer. See also the mempool_alloc() call in __get_request(). - The __GFP_RECLAIM flag in the blk_get_request_flags() calls in the IDE and SCSI drivers was removed by commit 039c635f4e66 ("ide, scsi: Tell the block layer at request allocation time about preempt requests"). --- block/blk-core.c| 28 +++- drivers/ide/ide-pm.c| 2 +- drivers/scsi/scsi_lib.c | 3 ++- include/linux/blkdev.h | 3 ++- 4 files changed, 20 insertions(+), 16 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 3ac9dd25e04e..83c7a58e4fb3 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1333,6 +1333,7 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr) * @op: operation and flags * @bio: bio to allocate request for (can be %NULL) * @flags: BLQ_MQ_REQ_* flags + * @gfp_mask: allocation mask * * Get a free request from @q. This function may fail under memory * pressure or if @q is dead. 
@@ -1342,7 +1343,8 @@ int blk_update_nr_requests(struct request_queue *q, unsigned int nr) * Returns request pointer on success, with @q->queue_lock *not held*. */ static struct request *__get_request(struct request_list *rl, unsigned int op, -struct bio *bio, blk_mq_req_flags_t flags) +struct bio *bio, blk_mq_req_flags_t flags, +gfp_t gfp_mask) { struct request_queue *q = rl->q; struct request *rq; @@ -1351,8 +1353,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, struct io_cq *icq = NULL; const bool is_sync = op_is_sync(op); int may_queue; - gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC : -__GFP_DIRECT_RECLAIM; req_flags_t rq_flags = RQF_ALLOCED; lockdep_assert_held(q->queue_lock); @@ -1516,6 +1516,7 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, * @op: operation and flags * @bio: bio to allocate request for (can be %NULL) * @flags: BLK_MQ_REQ_* flags. + * @gfp_mask: allocation mask * * Get a free request from @q. If %__GFP_DIRECT_RECLAIM is set in @gfp_mask, * this function keeps retrying under memory pressure and fails iff @q is dead. @@ -1525,7 +1526,8 @@ static struct request *__get_request(struct request_list *rl, unsigned int op, * Returns request pointer on success, with @q->queue_lock *not held*. */ static struct request *get_request(struct request_queue *q, unsigned int op, - struct bio *bio, blk_mq_req_flags_t flags) + struct bio *bio, blk_mq_req_flags_t flags, + gfp_t gfp_mask) { const bool is_sync = op_is_sync(op); DEFINE_WAIT(wait); @@ -1537,7 +1539,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op, rl = blk_get_rl(q, bio);/* transferred to @rq on success */ retry: - rq = __get_request(rl, op, bio, flags); + rq = __get_request(rl, op, bio, flags, gfp_mask); if (!IS_ERR(rq)) return rq; @@ -1575,11 +1577,10 @@ static struct request *get_request(struct request_queue *q, unsigned int op, /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. 
*/ static struct request *blk_old_get_request(struct request_queue *q, - unsigned int op, blk_mq_req_flags_t flags) + unsigned int op, blk_mq_req_flags_t flags, + gfp_t gfp_mask) { struct request *rq; - gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC : -__GFP_DIRECT_RECLAIM; int ret = 0; WARN_ON_ONCE(q->mq_ops); @@ -1591,7 +1592,7 @@ static struct request *blk_old_get_request(struct request_queue
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Mon, Apr 09, 2018 at 09:33:08AM +0800, Joseph Qi wrote:
> Hi Bart,
>
> On 18/4/8 22:50, Bart Van Assche wrote:
> > On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
> >> The following kernel oops is triggered by 'removing scsi device' during
> >> heavy IO.
> >
> > How did you trigger this oops?
> >
> I can reproduce this oops by the following steps:
> 1) start a fio job with buffered write;
> 2) remove the scsi device fio writes to:
>    echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi

Yeah, it can be reproduced easily, and I usually remove the scsi device
via 'echo 1 > /sys/block/sda/device/delete'

Thanks,
Ming
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On Sun, Apr 08, 2018 at 04:35:59PM +0300, Sagi Grimberg wrote: > > > On 04/08/2018 03:57 PM, Ming Lei wrote: > > On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote: > > > > > > > > > > > > Hi Sagi > > > > > > > > > > > > > > > > > > Still can reproduce this issue with the change: > > > > > > > > > > > > > > > > Thanks for validating Yi, > > > > > > > > > > > > > > > > Would it be possible to test the following: > > > > > > > > -- > > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > > > > > > > index 75336848f7a7..81ced3096433 100644 > > > > > > > > --- a/block/blk-mq.c > > > > > > > > +++ b/block/blk-mq.c > > > > > > > > @@ -444,6 +444,10 @@ struct request > > > > > > > > *blk_mq_alloc_request_hctx(struct > > > > > > > > request_queue *q, > > > > > > > >return ERR_PTR(-EXDEV); > > > > > > > >} > > > > > > > >cpu = cpumask_first_and(alloc_data.hctx->cpumask, > > > > > > > > cpu_online_mask); > > > > > > > > + if (cpu >= nr_cpu_ids) { > > > > > > > > + pr_warn("no online cpu for hctx %d\n", > > > > > > > > hctx_idx); > > > > > > > > + cpu = cpumask_first(alloc_data.hctx->cpumask); > > > > > > > > + } > > > > > > > >alloc_data.ctx = __blk_mq_get_ctx(q, cpu); > > > > > > > > > > > > > > > >rq = blk_mq_get_request(q, NULL, op, &alloc_data); > > > > > > > > -- > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > > > > > > [ 153.384977] BUG: unable to handle kernel paging request at > > > > > > > > > 3a9ed053bd48 > > > > > > > > > [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 > > > > > > > > > > > > > > > > Also would it be possible to provide gdb output of: > > > > > > > > > > > > > > > > l *(blk_mq_get_request+0x23e) > > > > > > > > > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to > > > > > > > allocate > > > > > > > request from one specific hw queue, but there may not be all > > > > > > > online CPUs > > > > > > > mapped to this hw queue. > > > > > > > > > > Yes, this is what I suspect.. 
> > > > > > > > > > > And the following patchset may fail this kind of allocation and > > > > > > avoid > > > > > > the kernel oops. > > > > > > > > > > > > https://marc.info/?l=linux-block&m=152318091025252&w=2 > > > > > > > > > > Thanks Ming, > > > > > > > > > > But I don't want to fail the allocation, nvmf_connect_io_queue simply > > > > > needs a tag to issue the connect request, I much rather to take this > > > > > tag from an online cpu than failing it... We use this because we > > > > > reserve > > > > > > > > The failure is only triggered when there isn't any online CPU mapped to > > > > this hctx, so do you want to wait for CPUs for this hctx becoming > > > > online? > > > > > > I was thinking of allocating a tag from that hctx even if it had no > > > online cpu, the execution is done on an online cpu (hence the call > > > to blk_mq_alloc_request_hctx). > > > > That can be done, but not following the current blk-mq's rule, because > > blk-mq requires to dispatch the request on CPUs mapping to this hctx. > > > > Could you explain a bit why you want to do in this way? > > My device exposes nr_hw_queues which is not higher than num_online_cpus > so I want to connect all hctxs with hope that they will be used. The issue is that CPU online & offline can happen any time, and after blk-mq removes CPU hotplug handler, there is no way to remap queue when CPU topo is changed. For example: 1) after nr_hw_queues is set as num_online_cpus() and hw queues are initialized, then some of CPUs become offline, and the issue reported by Zhang Yi is triggered, but in this case, we should fail the allocation since 1:1 mapping doesn't need to use this inactive hw queue. 2) when nr_hw_queues is set as num_online_cpus(), there may be much less online CPUs, so the hw queue number can be initialized as much smaller, then performance is degraded much even if some CPUs become online later. 
So the current policy is to map all possible CPUs for handing CPU hotplug, and if you want to get 1:1 mapping between hw queue and online CPU, the nr_hw_queues can be set as num_possible_cpus. Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors). It will waste some memory resource just like percpu variable, but it simplifies the queue mapping logic a lot, and can support both hard and soft CPU online/offline without CPU hotplug handler, which may cause very complicated queue dependency issue. > > I agree we don't want to connect hctx which doesn't have an online > cpu, that's redundant, but this is not the case here. OK, I will explain below, and it can be fixed by the following patch too: https://marc.info/?l=linux-block&m=152318093725257&w=2 > > > > > Or I may understand you wrong, :-) > > > > > > In the report we connected 40 hctxs (which was exactly the number of > > > online cpus
Re: Multi-Actuator SAS HDD First Look
On Fri, Apr 6, 2018 at 11:09 AM, Douglas Gilbert wrote: > > On 2018-04-06 02:42 AM, Christoph Hellwig wrote: >> >> On Fri, Apr 06, 2018 at 08:24:18AM +0200, Hannes Reinecke wrote: >>> >>> Ah. Far better. >>> What about delegating FORMAT UNIT to the control LUN, and not >>> implementing it for the individual disk LUNs? >>> That would make an even stronger case for having a control LUN; >>> with that there wouldn't be any problem with having to synchronize >>> across LUNs etc. >> >> >> It sounds to me like NVMe might be a much better model for this drive >> than SCSI, btw :) > > > So you found a document that outlines NVMe's architecture! Could you > share the url (no marketing BS, please)? > > > And a serious question ... How would you map NVMe's (in Linux) > subsystem number, controller device minor number, CNTLID field > (Identify ctl response) and namespace id onto the SCSI subsystem's > h:c:t:l ? > > Doug Gilbert > Hannes- yes, a drive system altering operation like FORMAT UNIT is asking for a dedicated management port, as the NVMe folks apparently felt. But what is the least painful endpoint type for LUN0? -- Tim Walker Product Design Systems Engineering, Seagate Technology (303) 775-3770
Re: [block regression] kernel oops triggered by removing scsi device during IO
Hi Bart,

On 18/4/8 22:50, Bart Van Assche wrote:
> On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote:
>> The following kernel oops is triggered by 'removing scsi device' during
>> heavy IO.
>
> How did you trigger this oops?
>
I can reproduce this oops by the following steps:
1) start a fio job with buffered write;
2) remove the scsi device fio writes to:
   echo "scsi remove-single-device ${dev}" > /proc/scsi/scsi
Re: usercopy whitelist woe in scsi_sense_cache
Hi. Cc'ing linux-block people (mainly, Christoph) too because of 17cb960f29c2. Also, duplicating the initial statement for them. With v4.16 (and now with v4.16.1) it is possible to trigger usercopy whitelist warning and/or bug while doing smartctl on a SATA disk having blk-mq and BFQ enabled. The warning looks like this: === [ 574.997022] Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 76, size 22)! [ 575.017332] WARNING: CPU: 0 PID: 32436 at mm/usercopy.c:81 usercopy_warn +0x7d/0xa0 [ 575.025262] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm bochs_drm iTCO_wdt ttm irqbypass iTCO_vendor_support ppdev drm_kms_helper psmouse parport_pc i2c_i801 joydev pcspkr drm parport rtc_cmos mousedev input_leds led_class intel_agp evdev syscopyarea qemu_fw_cfg intel_gtt sysfillrect mac_hid lpc_ich sysimgblt agpgart fb_sys_fops ip_tables x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt algif_skcipher af_alg hid_generic usbhid hid dm_mod raid10 md_mod sr_mod sd_mod cdrom uhci_hcd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc serio_raw xhci_pci ahci atkbd libps2 ehci_pci xhci_hcd aesni_intel libahci aes_x86_64 ehci_hcd crypto_simd glue_helper cryptd libata usbcore usb_common i8042 serio virtio_scsi scsi_mod [ 575.068775] virtio_blk virtio_net virtio_pci virtio_ring virtio [ 575.073935] CPU: 0 PID: 32436 Comm: smartctl Not tainted 4.16.0-pf2 #1 [ 575.078078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 575.082451] RIP: 0010:usercopy_warn+0x7d/0xa0 [ 575.086223] RSP: 0018:9ca84aee7c40 EFLAGS: 00010286 [ 575.097637] RAX: RBX: 95199d68304c RCX: 0001 [ 575.101471] RDX: 0001 RSI: aeeb050a RDI: [ 575.105939] RBP: 0016 R08: R09: 028b [ 575.110370] R10: aee854e9 R11: 0001 R12: 0001 [ 575.113269] R13: 95199d683062 R14: 95199d68304c R15: 0016 [ 575.116132] FS: 7f993d405040() GS:95199f60() knlGS: [ 
575.119285] CS: 0010 DS: ES: CR0: 80050033 [ 575.129619] CR2: 7ffe2390f0a8 CR3: 1d774004 CR4: 00160ef0 [ 575.133976] Call Trace: [ 575.136311] __check_object_size+0x12f/0x1a0 [ 575.139576] sg_io+0x269/0x3f0 [ 575.142000] ? path_lookupat+0xaa/0x1f0 [ 575.144521] ? current_time+0x18/0x70 [ 575.147006] scsi_cmd_ioctl+0x257/0x410 [ 575.149782] ? xfs_bmapi_read+0x1c3/0x340 [xfs] [ 575.161441] sd_ioctl+0xbf/0x1a0 [sd_mod] [ 575.165036] blkdev_ioctl+0x8ca/0x990 [ 575.168291] ? read_null+0x10/0x10 [ 575.171638] block_ioctl+0x39/0x40 [ 575.174998] do_vfs_ioctl+0xa4/0x630 [ 575.178261] ? vfs_write+0x164/0x1a0 [ 575.181410] SyS_ioctl+0x74/0x80 [ 575.190904] do_syscall_64+0x74/0x190 [ 575.195200] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [ 575.199267] RIP: 0033:0x7f993c984d87 [ 575.201350] RSP: 002b:7ffe238aeed8 EFLAGS: 0246 ORIG_RAX: 0010 [ 575.204386] RAX: ffda RBX: 7ffe238af180 RCX: 7f993c984d87 [ 575.208349] RDX: 7ffe238aeef0 RSI: 2285 RDI: 0003 [ 575.211254] RBP: 7ffe238af1d0 R08: 0010 R09: [ 575.220511] R10: R11: 0246 R12: 5637ec8e9ce0 [ 575.225238] R13: R14: 5637ec8e3550 R15: 00da [ 575.230056] Code: 6c e4 ae 41 51 48 c7 c0 19 6e e5 ae 49 89 f1 48 0f 44 c2 48 89 f9 4d 89 d8 4c 89 d2 48 c7 c7 70 6e e5 ae 48 89 c6 e8 c3 5c e5 ff <0f> 0b 48 83 c4 18 c3 48 c7 c6 04 cb e4 ae 49 89 f1 49 89 f3 eb [ 575.239027] ---[ end trace 6e3293933bdd4761 ]--- === Usually, the warning is triggered first, and all the subsequent printouts are bugs because offset gets too big so that it doesn't fit into a real SLAB object size: [ 1687.609889] usercopy: Kernel memory exposure attempt detected from SLUB object 'scsi_sense_cache' (offset 107, size 22)! [ 1687.614197] [ cut here ] [ 1687.615993] kernel BUG at mm/usercopy.c:100! 
To give you an idea regarding the variety of offsets, I've summarised the kernel log from my server:

$ sudo journalctl -kb | grep "Kernel memory exposure attempt detected" | grep -oE 'offset [0-9]+, size [0-9]+' | sort | uniq -c
      9 offset 107, size 22
      6 offset 108, size 22
      8 offset 109, size 22
      7 offset 110, size 22
      5 offset 111, size 22
      5 offset 112, size 22
      2 offset 113, size 22
      2 offset 114, size 22
      1 offset 115, size 22
      1 offset 116, size 22
      1 offset 119, size 22
      1 offset 85, size 22

So far, I wasn't able to trigger this with mq-deadline (or without blk-mq). Maybe this has something to do with blk-mq+BFQ re-queuing, or it's just me not being persistent enough.
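To make the reported WARN-then-BUG progression concrete, here is a rough userspace model of the bounds check that mm/usercopy.c performs on a whitelisted slab cache. It is not kernel code; the concrete numbers in the comments (a whitelist covering bytes 0..96 of the object, an object footprint of 128 bytes after slab rounding) are assumptions chosen to match the offsets in the report, not values taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of a hardened-usercopy style bounds check. */
enum check_result { CHECK_OK, CHECK_WARN, CHECK_BUG };

/*
 * Rough model of the check in mm/usercopy.c: a slab cache may whitelist
 * a sub-region [useroffset, useroffset + usersize) of each object for
 * copying to/from user space.  A copy that stays inside the whitelist
 * passes; one that leaves the whitelist but stays inside the object
 * triggers the "Bad or missing usercopy whitelist?" WARN; one that runs
 * past the object entirely triggers the BUG at mm/usercopy.c:100.
 */
static enum check_result check_usercopy(size_t obj_size,
					size_t useroffset, size_t usersize,
					size_t offset, size_t size)
{
	if (offset + size > obj_size)
		return CHECK_BUG;	/* past the SLUB object itself */
	if (offset < useroffset || offset + size > useroffset + usersize)
		return CHECK_WARN;	/* inside object, outside whitelist */
	return CHECK_OK;
}
```

With an assumed 96-byte whitelist in a 128-byte object, "offset 76, size 22" lands past the whitelist but inside the object (the WARN in the first trace), while "offset 107, size 22" overruns the object itself (the BUG seen later).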
Re: Block layer use of __GFP flags
On Sun, Apr 08, 2018 at 04:40:59PM +, Bart Van Assche wrote: > __GFP_KSWAPD_RECLAIM wasn't stripped off on purpose for non-atomic > allocations. That was an oversight. OK, good. > Do you perhaps want me to prepare a patch that makes blk_get_request() again > respect the full gfp mask passed as third argument to blk_get_request()? I think that would be a good idea. If it's onerous to have extra arguments, there are some bits in gfp_flags which could be used for your purposes.
Re: KASAN: use-after-free Read in disk_unblock_events
On Mon, 30 Oct 2017 12:32:58 -0700, syzbot wrote: > Hello, > > syzkaller hit the following crash on > 36ef71cae353f88fd6e095e2aaa3e5953af1685d > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > C reproducer is attached > syzkaller reproducer is attached. See https://goo.gl/kgGztJ > for information about syzkaller reproducers > > > BUG: KASAN: use-after-free in disk_unblock_events+0x51/0x60 > block/genhd.c:1632 > Read of size 8 at addr 8801d9049a80 by task blkid/14807 > > CPU: 1 PID: 14807 Comm: blkid Not tainted 4.14.0-rc5-next-20171018+ #36 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:16 [inline] > dump_stack+0x194/0x257 lib/dump_stack.c:52 > print_address_description+0x73/0x250 mm/kasan/report.c:252 > kasan_report_error mm/kasan/report.c:351 [inline] > kasan_report+0x25b/0x340 mm/kasan/report.c:409 > __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430 > disk_unblock_events+0x51/0x60 block/genhd.c:1632 > __blkdev_get+0x78d/0xf90 fs/block_dev.c:1535 > blkdev_get+0x3a1/0xad0 fs/block_dev.c:1591 > blkdev_open+0x1c9/0x250 fs/block_dev.c:1749 > do_dentry_open+0x664/0xd40 fs/open.c:752 > vfs_open+0x107/0x220 fs/open.c:866 > do_last fs/namei.c:3387 [inline] > path_openat+0x1151/0x3520 fs/namei.c:3527 > do_filp_open+0x25b/0x3b0 fs/namei.c:3562 > do_sys_open+0x502/0x6d0 fs/open.c:1059 > SYSC_open fs/open.c:1077 [inline] > SyS_open+0x2d/0x40 fs/open.c:1072 > entry_SYSCALL_64_fastpath+0x1f/0xbe > RIP: 0033:0x7ff2d528f120 > RSP: 002b:7ffd153fa6d8 EFLAGS: 0246 ORIG_RAX: 0002 > RAX: ffda RBX: RCX: 7ff2d528f120 > RDX: 7ffd153fbf35 RSI: RDI: 7ffd153fbf35 > RBP: 0082 R08: 0078 R09: > R10: R11: 0246 R12: 01fca030 > R13: R14: R15: 0005 > > Allocated by task 14811: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > 
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 > kmem_cache_alloc_node_trace+0x150/0x750 mm/slab.c:3657 > kmalloc_node include/linux/slab.h:540 [inline] > kzalloc_node include/linux/slab.h:702 [inline] > alloc_disk_node+0xb4/0x4e0 block/genhd.c:1375 > alloc_disk+0x18/0x20 block/genhd.c:1359 > loop_add+0x45c/0xa50 drivers/block/loop.c:1808 > loop_probe+0x16d/0x1a0 drivers/block/loop.c:1915 > kobj_lookup+0x2ac/0x410 drivers/base/map.c:124 > get_gendisk+0x37/0x230 block/genhd.c:773 > bd_start_claiming fs/block_dev.c:1097 [inline] > blkdev_get+0x12d/0xad0 fs/block_dev.c:1584 > blkdev_open+0x1c9/0x250 fs/block_dev.c:1749 > do_dentry_open+0x664/0xd40 fs/open.c:752 > vfs_open+0x107/0x220 fs/open.c:866 > do_last fs/namei.c:3387 [inline] > path_openat+0x1151/0x3520 fs/namei.c:3527 > do_filp_open+0x25b/0x3b0 fs/namei.c:3562 > do_sys_open+0x502/0x6d0 fs/open.c:1059 > SYSC_open fs/open.c:1077 [inline] > SyS_open+0x2d/0x40 fs/open.c:1072 > entry_SYSCALL_64_fastpath+0x1f/0xbe > > Freed by task 14807: > save_stack+0x43/0xd0 mm/kasan/kasan.c:447 > set_track mm/kasan/kasan.c:459 [inline] > kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 > __cache_free mm/slab.c:3492 [inline] > kfree+0xca/0x250 mm/slab.c:3807 > disk_release+0x327/0x410 block/genhd.c:1218 > device_release+0x7c/0x200 drivers/base/core.c:814 > kobject_cleanup lib/kobject.c:648 [inline] > kobject_release lib/kobject.c:677 [inline] > kref_put include/linux/kref.h:70 [inline] > kobject_put+0x14c/0x240 lib/kobject.c:694 > put_disk+0x23/0x30 block/genhd.c:1440 > __blkdev_get+0x6ed/0xf90 fs/block_dev.c:1528 > blkdev_get+0x3a1/0xad0 fs/block_dev.c:1591 > blkdev_open+0x1c9/0x250 fs/block_dev.c:1749 > do_dentry_open+0x664/0xd40 fs/open.c:752 > vfs_open+0x107/0x220 fs/open.c:866 > do_last fs/namei.c:3387 [inline] > path_openat+0x1151/0x3520 fs/namei.c:3527 > do_filp_open+0x25b/0x3b0 fs/namei.c:3562 > do_sys_open+0x502/0x6d0 fs/open.c:1059 > SYSC_open fs/open.c:1077 [inline] > SyS_open+0x2d/0x40 fs/open.c:1072 > 
entry_SYSCALL_64_fastpath+0x1f/0xbe > > The buggy address belongs to the object at 8801d9049500 > which belongs to the cache kmalloc-2048 of size 2048 > The buggy address is located 1408 bytes inside of > 2048-byte region [8801d9049500, 8801d9049d00) > The buggy address belongs to the page: > page:ea0007641200 count:1 mapcount:0 mapping:8801d9048400 index:0x0 > compound_mapcount: 0 > flags: 0x2008100(slab|head) > raw: 02008100 8801d9048400 00010003 > raw: ea000748e5a0 ea0007063920 8801dac00c40 > page dumped because: kasan: bad access detected > > Memory state around the buggy
Re: Block layer use of __GFP flags
On Sat, 2018-04-07 at 23:54 -0700, Matthew Wilcox wrote: > Please explain: > > commit 6a15674d1e90917f1723a814e2e8c949000440f7 > Author: Bart Van Assche > Date: Thu Nov 9 10:49:54 2017 -0800 > > block: Introduce blk_get_request_flags() > > A side effect of this patch is that the GFP mask that is passed to > several allocation functions in the legacy block layer is changed > from GFP_KERNEL into __GFP_DIRECT_RECLAIM. > > Why was this thought to be a good idea? I think gfp.h is pretty clear: > > * Useful GFP flag combinations that are commonly used. It is recommended > * that subsystems start with one of these combinations and then set/clear > * __GFP_FOO flags as necessary. > > Instead, the block layer now throws away all but one bit of the > information being passed in by the callers, and all it tells the allocator > is whether or not it can start doing direct reclaim. I can see that > you may well be in a situation where you don't want to start more I/O, > but your caller knows that! Why make the allocator work harder than > it has to? In particular, why isn't the page allocator allowed to wake > up kswapd to do reclaim in non-atomic context, but is when the caller > is in atomic context? __GFP_KSWAPD_RECLAIM wasn't stripped off on purpose for non-atomic allocations. That was an oversight. Do you perhaps want me to prepare a patch that makes blk_get_request() again respect the full gfp mask passed as third argument to blk_get_request()? Bart.
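Matthew's complaint can be shown with a toy model: collapsing the caller's gfp mask down to a single bit discards information such as __GFP_KSWAPD_RECLAIM that the page allocator could have used. The flag bit values below are made up for the sketch and do not match the kernel's include/linux/gfp.h.

```c
#include <assert.h>

/* Toy flag bits standing in for the kernel's gfp flags (values assumed). */
#define GFP_DIRECT_RECLAIM  0x1u
#define GFP_KSWAPD_RECLAIM  0x2u
#define GFP_IO              0x4u
#define GFP_FS              0x8u
#define GFP_KERNEL (GFP_DIRECT_RECLAIM | GFP_KSWAPD_RECLAIM | GFP_IO | GFP_FS)

/* What the criticised code path effectively did: keep only the one bit
 * saying whether direct reclaim may start, throwing the rest away. */
static unsigned int strip_gfp(unsigned int gfp)
{
	return gfp & GFP_DIRECT_RECLAIM;
}

/* What the thread argues for: pass the caller's full mask through, so
 * the allocator can still wake kswapd, do IO, touch the FS, etc. */
static unsigned int respect_gfp(unsigned int gfp)
{
	return gfp;
}
```

After stripping, a GFP_KERNEL caller can no longer wake kswapd even though it is in a perfectly good sleeping context, which is the oversight Bart acknowledges above.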
Re: 4.15.14 crash with iscsi target and dvd
Wakko Warner wrote: > Bart Van Assche wrote: > > Have you tried to modify the kernel Makefile as indicated in the following > > e-mail? This should make the kernel build: > > > > https://lists.ubuntu.com/archives/kernel-team/2016-May/077178.html > > Thanks. That helped. > > I finished with git bisect. Here's the output: > 84c8590646d5b35804bac60eb58b145839b5893e is the first bad commit > commit 84c8590646d5b35804bac60eb58b145839b5893e > Author: Ming Lei > Date: Fri Nov 11 20:05:32 2016 +0800 > > target: avoid accessing .bi_vcnt directly > > When the bio is full, bio_add_pc_page() will return zero, > so use this information tell when the bio is full. > > Also replace access to .bi_vcnt for pr_debug() with bio_segments(). > > Reviewed-by: Christoph Hellwig > Signed-off-by: Ming Lei > Reviewed-by: Sagi Grimberg > Signed-off-by: Jens Axboe > > :04 04 a3ebbb71c52ee4eb8c3be4d033b81179211bf704 > de39a328dbd1b18519946b3ad46d9302886e0dd0 M drivers > > I did a diff between HEAD^ and HEAD and manually patched the file from > 4.15.14. It's not an exact revert. I'm running it now and it's working. > I'll do a better test later on. Here's the patch: > > --- a/drivers/target/target_core_pscsi.c 2018-02-04 14:31:31.077316617 > -0500 > +++ b/drivers/target/target_core_pscsi.c 2018-04-08 11:43:49.588641374 > -0400 > @@ -915,7 +915,9 @@ > bio, page, bytes, off); > pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n", > bio_segments(bio), nr_vecs); > - if (rc != bytes) { > + if (rc != bytes) > + goto fail; > + if (bio->bi_vcnt > nr_vecs) { > pr_debug("PSCSI: Reached bio->bi_vcnt max:" > " %d i: %d bio: %p, allocating another" > " bio\n", bio->bi_vcnt, i, bio); > > I really appreciate your time and assistance with this. 
One thing I noticed after doing this is errors in the kernel log on the initiator: [9072625.181744] sr 26:0:0:0: [sr1] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 [9072625.181802] sr 26:0:0:0: [sr1] tag#0 Sense Key : 0x2 [current] [9072625.181835] sr 26:0:0:0: [sr1] tag#0 ASC=0x8 ASCQ=0x0 [9072625.181866] sr 26:0:0:0: [sr1] tag#0 CDB: opcode=0x28 28 00 00 0a 81 22 00 00 80 00 [9072625.181919] blk_update_request: I/O error, dev sr1, sector 2753672 When doing the exact same thing on the target, no mention. My patch may not be right, but it doesn't cause an oops. I'm going to try 4.16.1 and see what happens. -- Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
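The semantics the bisected commit relies on can be illustrated with a small userspace stand-in: bio_add_pc_page() returns the number of bytes actually added, and 0 once the bio is full, at which point the caller is expected to allocate another bio. The struct and sizes below are simplified assumptions, not the kernel's bio layout.

```c
#include <assert.h>

/* Minimal stand-in for a bio: a capacity in vecs, one page per vec. */
struct toy_bio { int bi_vcnt; int max_vecs; };

/* Models the bio_add_pc_page() contract discussed in the commit message:
 * returns the byte count actually added, which is 0 when the bio is
 * already full (the signal to start a new bio). */
static int toy_add_page(struct toy_bio *bio, int bytes)
{
	if (bio->bi_vcnt >= bio->max_vecs)
		return 0;
	bio->bi_vcnt++;
	return bytes;
}

/* How many of 'npages' 4 KiB pages fit before the bio reports full. */
static int fill_bio(int max_vecs, int npages)
{
	struct toy_bio bio = { 0, max_vecs };
	int added = 0;

	while (npages-- && toy_add_page(&bio, 4096) == 4096)
		added++;
	return added;
}
```

The regression under discussion is that a short return can mean either "bio full" or a hard failure; Wakko's patch separates the two cases instead of treating every short return as "allocate another bio".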
Re: 4.15.14 crash with iscsi target and dvd
Bart Van Assche wrote: > On Sat, 2018-04-07 at 12:53 -0400, Wakko Warner wrote: > > Bart Van Assche wrote: > > > On Thu, 2018-04-05 at 22:06 -0400, Wakko Warner wrote: > > > > I know now why scsi_print_command isn't doing anything. cmd->cmnd is > > > > null. > > > > I added a dev_printk in scsi_print_command where the 2 if statements > > > > return. > > > > Logs: > > > > [ 29.866415] sr 3:0:0:0: cmd->cmnd is NULL > > > > > > That's something that should never happen. As one can see in > > > scsi_setup_scsi_cmnd() and scsi_setup_fs_cmnd() both functions initialize > > > that pointer. Since I have not yet been able to reproduce myself what you > > > reported, would it be possible for you to bisect this issue? You will need > > > to follow something like the following procedure (see also > > > https://git-scm.com/docs/git-bisect): > > > > After doing 3 successful compiles with good/bad, I got this error and was > > not able to compile any more kernels: > > CC scripts/mod/devicetable-offsets.s > > scripts/mod/empty.c:1:0: error: code model kernel does not support PIC mode > > /* empty file to figure out endianness / word size */ > > > > scripts/mod/devicetable-offsets.c:1:0: error: code model kernel does not > > support PIC mode > > #include > > > > scripts/Makefile.build:153: recipe for target > > 'scripts/mod/devicetable-offsets.s' failed > > > > I don't think it found the bad commit. > > Have you tried to modify the kernel Makefile as indicated in the following > e-mail? This should make the kernel build: > > https://lists.ubuntu.com/archives/kernel-team/2016-May/077178.html Thanks. That helped. I finished with git bisect. Here's the output: 84c8590646d5b35804bac60eb58b145839b5893e is the first bad commit commit 84c8590646d5b35804bac60eb58b145839b5893e Author: Ming Lei Date: Fri Nov 11 20:05:32 2016 +0800 target: avoid accessing .bi_vcnt directly When the bio is full, bio_add_pc_page() will return zero, so use this information tell when the bio is full. 
Also replace access to .bi_vcnt for pr_debug() with bio_segments(). Reviewed-by: Christoph Hellwig Signed-off-by: Ming Lei Reviewed-by: Sagi Grimberg Signed-off-by: Jens Axboe :04 04 a3ebbb71c52ee4eb8c3be4d033b81179211bf704 de39a328dbd1b18519946b3ad46d9302886e0dd0 M drivers I did a diff between HEAD^ and HEAD and manually patched the file from 4.15.14. It's not an exact revert. I'm running it now and it's working. I'll do a better test later on. Here's the patch: --- a/drivers/target/target_core_pscsi.c2018-02-04 14:31:31.077316617 -0500 +++ b/drivers/target/target_core_pscsi.c2018-04-08 11:43:49.588641374 -0400 @@ -915,7 +915,9 @@ bio, page, bytes, off); pr_debug("PSCSI: bio->bi_vcnt: %d nr_vecs: %d\n", bio_segments(bio), nr_vecs); - if (rc != bytes) { + if (rc != bytes) + goto fail; + if (bio->bi_vcnt > nr_vecs) { pr_debug("PSCSI: Reached bio->bi_vcnt max:" " %d i: %d bio: %p, allocating another" " bio\n", bio->bi_vcnt, i, bio); I really appreciate your time and assistance with this. -- Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Sun, 2018-04-08 at 16:11 +0800, Joseph Qi wrote: > This is because scsi_remove_device() will call blk_cleanup_queue(), and > then all blkgs have been destroyed and root_blkg is NULL. > Thus tg is NULL and trigger NULL pointer dereference when get td from > tg (tg->td). > It seems that we cannot simply move blkcg_exit_queue() up to > blk_cleanup_queue(). Have you considered adding a blk_queue_enter() / blk_queue_exit() pair in generic_make_request()? blk_queue_enter() checks the DYING flag. Thanks, Bart.
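The gating Bart suggests can be sketched in miniature: submissions take a reference on the queue via an enter/exit pair, and enter fails once the queue is marked dying, so no bio reaches the throttling code after teardown begins. This is a userspace model of the idea, not the kernel's percpu-ref based blk_queue_enter().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy queue with a dying flag and a submission usage count. */
struct toy_queue { bool dying; int usage; };

static int toy_queue_enter(struct toy_queue *q)
{
	if (q->dying)
		return -1;	/* reject: queue is being torn down */
	q->usage++;
	return 0;
}

static void toy_queue_exit(struct toy_queue *q) { q->usage--; }

static void toy_mark_dying(struct toy_queue *q) { q->dying = true; }

/* A submission path guarded the way generic_make_request() would be:
 * the enter check must happen before anything that touches cgroup
 * state, which is Joseph's point about blk_throtl_bio(). */
static int toy_submit(struct toy_queue *q)
{
	if (toy_queue_enter(q) < 0)
		return -1;	/* bio_io_error() territory */
	/* ... make_request_fn / throttling run here, safely ... */
	toy_queue_exit(q);
	return 0;
}
```

Joseph's follow-up in this thread is that the check must come before generic_make_request_checks(), since that is where blk_throtl_bio() dereferences the already-destroyed blkg state.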
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Sun, 2018-04-08 at 12:21 +0800, Ming Lei wrote: > The following kernel oops is triggered by 'removing scsi device' during > heavy IO. How did you trigger this oops? Bart.
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On 04/08/2018 03:57 PM, Ming Lei wrote: On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote: Hi Sagi Still can reproduce this issue with the change: Thanks for validating Yi, Would it be possible to test the following: -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 75336848f7a7..81ced3096433 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, return ERR_PTR(-EXDEV); } cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); + if (cpu >= nr_cpu_ids) { + pr_warn("no online cpu for hctx %d\n", hctx_idx); + cpu = cpumask_first(alloc_data.hctx->cpumask); + } alloc_data.ctx = __blk_mq_get_ctx(q, cpu); rq = blk_mq_get_request(q, NULL, op, &alloc_data); -- ... [ 153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48 [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 Also would it be possible to provide gdb output of: l *(blk_mq_get_request+0x23e) nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate request from one specific hw queue, but there may not be all online CPUs mapped to this hw queue. Yes, this is what I suspect.. And the following patchset may fail this kind of allocation and avoid the kernel oops. https://marc.info/?l=linux-block&m=152318091025252&w=2 Thanks Ming, But I don't want to fail the allocation, nvmf_connect_io_queue simply needs a tag to issue the connect request, I much rather to take this tag from an online cpu than failing it... We use this because we reserve The failure is only triggered when there isn't any online CPU mapped to this hctx, so do you want to wait for CPUs for this hctx becoming online? I was thinking of allocating a tag from that hctx even if it had no online cpu, the execution is done on an online cpu (hence the call to blk_mq_alloc_request_hctx). That can be done, but not following the current blk-mq's rule, because blk-mq requires to dispatch the request on CPUs mapping to this hctx. 
Could you explain a bit why you want to do in this way? My device exposes nr_hw_queues which is not higher than num_online_cpus so I want to connect all hctxs with hope that they will be used. I agree we don't want to connect hctx which doesn't have an online cpu, that's redundant, but this is not the case here. Or I may understand you wrong, :-) In the report we connected 40 hctxs (which was exactly the number of online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs. I'm not sure why some hctxs are left without any online cpus. That is possible after the following two commits: 4b855ad37194 ("blk-mq: Create hctx for each present CPU) 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU) And this can be triggered even without putting down any CPUs. The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't remap queue any more when CPU topo is changed, so the static & fixed mapping has to be setup from the beginning. Then if there are less enough online CPUs compared with number of hw queues, some of hctxes can be mapped with all offline CPUs. For example, if one device has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most 2 hw queues are assigned to online CPUs, and the other two are all with offline CPUs. That is fine, but the problem that I gave in the example below which has nr_hw_queues == num_online_cpus but because of the mapping, we still have unmapped hctxs. Lets say I have 4-cpu system and my device always allocates num_online_cpus() hctxs. at first I get: cpu0 -> hctx0 cpu1 -> hctx1 cpu2 -> hctx2 cpu3 -> hctx3 When cpu1 goes offline I think the new mapping will be: cpu0 -> hctx0 cpu1 -> hctx0 (from cpu_to_queue_index) // offline cpu2 -> hctx2 cpu3 -> hctx0 (from cpu_to_queue_index) This means that now hctx1 is unmapped. I guess we can fix nvmf code to not connect it. But we end up with less queues than cpus without any good reason. 
I would have optimally want a different mapping that will use all the queues: cpu0 -> hctx0 cpu2 -> hctx1 cpu3 -> hctx2 * cpu1 -> hctx1 (doesn't matter, offline) Something looks broken... No, it isn't broken. maybe broken is the wrong phrase, but it's suboptimal... Storage is client/server model, the hw queue should be only active if there is request coming from client(CPU), Correct. and the hw queue becomes inactive if no online CPU is mapped to it. But when we reset the controller, we call blk_mq_update_nr_hw_queues() with the current number of nr_hw_queues which never exceeds num_online_cpus. This, in turn, remaps the mq_map which results in unmapped queues because of the mapping function, not because we have more hctx than online cpus... An easy fix is to allocate num_present_cpus queues, and only connect the online ones, but as you said, we
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote: > > > > > > > > Hi Sagi > > > > > > > > > > > > > > Still can reproduce this issue with the change: > > > > > > > > > > > > Thanks for validating Yi, > > > > > > > > > > > > Would it be possible to test the following: > > > > > > -- > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > > > > > index 75336848f7a7..81ced3096433 100644 > > > > > > --- a/block/blk-mq.c > > > > > > +++ b/block/blk-mq.c > > > > > > @@ -444,6 +444,10 @@ struct request > > > > > > *blk_mq_alloc_request_hctx(struct > > > > > > request_queue *q, > > > > > > return ERR_PTR(-EXDEV); > > > > > > } > > > > > > cpu = cpumask_first_and(alloc_data.hctx->cpumask, > > > > > > cpu_online_mask); > > > > > > + if (cpu >= nr_cpu_ids) { > > > > > > + pr_warn("no online cpu for hctx %d\n", hctx_idx); > > > > > > + cpu = cpumask_first(alloc_data.hctx->cpumask); > > > > > > + } > > > > > > alloc_data.ctx = __blk_mq_get_ctx(q, cpu); > > > > > > > > > > > > rq = blk_mq_get_request(q, NULL, op, &alloc_data); > > > > > > -- > > > > > > ... > > > > > > > > > > > > > > > > > > > [ 153.384977] BUG: unable to handle kernel paging request at > > > > > > > 3a9ed053bd48 > > > > > > > [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 > > > > > > > > > > > > Also would it be possible to provide gdb output of: > > > > > > > > > > > > l *(blk_mq_get_request+0x23e) > > > > > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to > > > > > allocate > > > > > request from one specific hw queue, but there may not be all online > > > > > CPUs > > > > > mapped to this hw queue. > > > > > > Yes, this is what I suspect.. > > > > > > > And the following patchset may fail this kind of allocation and avoid > > > > the kernel oops. 
> > > > > > > > https://marc.info/?l=linux-block&m=152318091025252&w=2 > > > > > > Thanks Ming, > > > > > > But I don't want to fail the allocation, nvmf_connect_io_queue simply > > > needs a tag to issue the connect request, I much rather to take this > > > tag from an online cpu than failing it... We use this because we reserve > > > > The failure is only triggered when there isn't any online CPU mapped to > > this hctx, so do you want to wait for CPUs for this hctx becoming online? > > I was thinking of allocating a tag from that hctx even if it had no > online cpu, the execution is done on an online cpu (hence the call > to blk_mq_alloc_request_hctx). That can be done, but not following the current blk-mq's rule, because blk-mq requires to dispatch the request on CPUs mapping to this hctx. Could you explain a bit why you want to do in this way? > > > Or I may understand you wrong, :-) > > In the report we connected 40 hctxs (which was exactly the number of > online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs. > I'm not sure why some hctxs are left without any online cpus. That is possible after the following two commits: 4b855ad37194 ("blk-mq: Create hctx for each present CPU) 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU) And this can be triggered even without putting down any CPUs. The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't remap queue any more when CPU topo is changed, so the static & fixed mapping has to be setup from the beginning. Then if there are less enough online CPUs compared with number of hw queues, some of hctxes can be mapped with all offline CPUs. For example, if one device has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most 2 hw queues are assigned to online CPUs, and the other two are all with offline CPUs. > > This seems to be related to the queue mapping. Yes. 
> > Lets say I have 4-cpu system and my device always allocates > num_online_cpus() hctxs. > > at first I get: > cpu0 -> hctx0 > cpu1 -> hctx1 > cpu2 -> hctx2 > cpu3 -> hctx3 > > When cpu1 goes offline I think the new mapping will be: > cpu0 -> hctx0 > cpu1 -> hctx0 (from cpu_to_queue_index) // offline > cpu2 -> hctx2 > cpu3 -> hctx0 (from cpu_to_queue_index) > > This means that now hctx1 is unmapped. I guess we can fix nvmf code > to not connect it. But we end up with less queues than cpus without > any good reason. > > I would have optimally want a different mapping that will use all > the queues: > cpu0 -> hctx0 > cpu2 -> hctx1 > cpu3 -> hctx2 > * cpu1 -> hctx1 (doesn't matter, offline) > > Something looks broken... No, it isn't broken. Storage is client/server model, the hw queue should be only active if there is request coming from client(CPU), and the hw queue becomes inactive if no online CPU is mapped to it. That is why the normal rule is that request allocation needs CPU context info, then the hctx is obtained via the queue mapping. Thanks Ming
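Ming's point about the static mapping can be demonstrated with a small userspace sketch. The real mapping function in blk-mq is more involved than the modulo used here, but any fixed mapping built over all *possible* CPUs shows the same property: with enough offline CPUs, some hw queues end up with no online CPU at all. The mapping below is an assumption for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for the static cpu -> hw queue mapping set up once
 * at init time (post 4b855ad37194, queues are no longer remapped on
 * CPU hotplug). */
static int cpu_to_queue(int cpu, int nr_queues)
{
	return cpu % nr_queues;
}

/* Count hw queues that end up with at least one online CPU mapped. */
static int active_queues(int nr_queues, int nr_possible_cpus,
			 const bool *online)
{
	bool has_online[64] = { false };
	int cpu, q, n = 0;

	for (cpu = 0; cpu < nr_possible_cpus; cpu++)
		if (online[cpu])
			has_online[cpu_to_queue(cpu, nr_queues)] = true;
	for (q = 0; q < nr_queues; q++)
		n += has_online[q];
	return n;
}
```

This reproduces Ming's example: a device with 4 hw queues on a system with 2 online and 6 offline CPUs gets at most 2 queues with an online CPU; the other two are mapped only to offline CPUs, which is exactly the case blk_mq_alloc_request_hctx() trips over.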
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
Hi Sagi Still can reproduce this issue with the change: Thanks for validating Yi, Would it be possible to test the following: -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 75336848f7a7..81ced3096433 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, return ERR_PTR(-EXDEV); } cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); + if (cpu >= nr_cpu_ids) { + pr_warn("no online cpu for hctx %d\n", hctx_idx); + cpu = cpumask_first(alloc_data.hctx->cpumask); + } alloc_data.ctx = __blk_mq_get_ctx(q, cpu); rq = blk_mq_get_request(q, NULL, op, &alloc_data); -- ... [ 153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48 [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 Also would it be possible to provide gdb output of: l *(blk_mq_get_request+0x23e) nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate request from one specific hw queue, but there may not be all online CPUs mapped to this hw queue. Yes, this is what I suspect.. And the following patchset may fail this kind of allocation and avoid the kernel oops. https://marc.info/?l=linux-block&m=152318091025252&w=2 Thanks Ming, But I don't want to fail the allocation, nvmf_connect_io_queue simply needs a tag to issue the connect request, I much rather to take this tag from an online cpu than failing it... We use this because we reserve The failure is only triggered when there isn't any online CPU mapped to this hctx, so do you want to wait for CPUs for this hctx becoming online? I was thinking of allocating a tag from that hctx even if it had no online cpu, the execution is done on an online cpu (hence the call to blk_mq_alloc_request_hctx). Or I may understand you wrong, :-) In the report we connected 40 hctxs (which was exactly the number of online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs. I'm not sure why some hctxs are left without any online cpus. 
This seems to be related to the queue mapping. Lets say I have 4-cpu system and my device always allocates num_online_cpus() hctxs. at first I get: cpu0 -> hctx0 cpu1 -> hctx1 cpu2 -> hctx2 cpu3 -> hctx3 When cpu1 goes offline I think the new mapping will be: cpu0 -> hctx0 cpu1 -> hctx0 (from cpu_to_queue_index) // offline cpu2 -> hctx2 cpu3 -> hctx0 (from cpu_to_queue_index) This means that now hctx1 is unmapped. I guess we can fix nvmf code to not connect it. But we end up with less queues than cpus without any good reason. I would have optimally want a different mapping that will use all the queues: cpu0 -> hctx0 cpu2 -> hctx1 cpu3 -> hctx2 * cpu1 -> hctx1 (doesn't matter, offline) Something looks broken...
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On Sun, Apr 08, 2018 at 01:58:49PM +0300, Sagi Grimberg wrote: > > > > > > Hi Sagi > > > > > > > > > > Still can reproduce this issue with the change: > > > > > > > > Thanks for validating Yi, > > > > > > > > Would it be possible to test the following: > > > > -- > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > > > index 75336848f7a7..81ced3096433 100644 > > > > --- a/block/blk-mq.c > > > > +++ b/block/blk-mq.c > > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct > > > > request_queue *q, > > > > return ERR_PTR(-EXDEV); > > > > } > > > > cpu = cpumask_first_and(alloc_data.hctx->cpumask, > > > > cpu_online_mask); > > > > + if (cpu >= nr_cpu_ids) { > > > > + pr_warn("no online cpu for hctx %d\n", hctx_idx); > > > > + cpu = cpumask_first(alloc_data.hctx->cpumask); > > > > + } > > > > alloc_data.ctx = __blk_mq_get_ctx(q, cpu); > > > > > > > > rq = blk_mq_get_request(q, NULL, op, &alloc_data); > > > > -- > > > > ... > > > > > > > > > > > > > [ 153.384977] BUG: unable to handle kernel paging request at > > > > > 3a9ed053bd48 > > > > > [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 > > > > > > > > Also would it be possible to provide gdb output of: > > > > > > > > l *(blk_mq_get_request+0x23e) > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate > > > request from one specific hw queue, but there may not be all online CPUs > > > mapped to this hw queue. > > Yes, this is what I suspect.. > > > And the following patchset may fail this kind of allocation and avoid > > the kernel oops. > > > > https://marc.info/?l=linux-block&m=152318091025252&w=2 > > Thanks Ming, > > But I don't want to fail the allocation, nvmf_connect_io_queue simply > needs a tag to issue the connect request, I much rather to take this > tag from an online cpu than failing it... 
We use this because we reserve The failure is only triggered when there isn't any online CPU mapped to this hctx, so do you want to wait for CPUs for this hctx becoming online? Or I may understand you wrong, :-) > a tag per-queue for this, but in this case, I'd rather block until the > inflight tag complete than failing the connect. No, there can't be any inflight request for this hctx. Thanks, Ming
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
Hi Sagi Still can reproduce this issue with the change: Thanks for validating Yi, Would it be possible to test the following: -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 75336848f7a7..81ced3096433 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, return ERR_PTR(-EXDEV); } cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); + if (cpu >= nr_cpu_ids) { + pr_warn("no online cpu for hctx %d\n", hctx_idx); + cpu = cpumask_first(alloc_data.hctx->cpumask); + } alloc_data.ctx = __blk_mq_get_ctx(q, cpu); rq = blk_mq_get_request(q, NULL, op, &alloc_data); -- ... [ 153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48 [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 Also would it be possible to provide gdb output of: l *(blk_mq_get_request+0x23e) nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate request from one specific hw queue, but there may not be all online CPUs mapped to this hw queue. Yes, this is what I suspect.. And the following patchset may fail this kind of allocation and avoid the kernel oops. https://marc.info/?l=linux-block&m=152318091025252&w=2 Thanks Ming, But I don't want to fail the allocation, nvmf_connect_io_queue simply needs a tag to issue the connect request, I much rather to take this tag from an online cpu than failing it... We use this because we reserve a tag per-queue for this, but in this case, I'd rather block until the inflight tag complete than failing the connect.
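The logic of the diff Sagi is asking Yi to test can be modeled in userspace with plain bitmasks: prefer the first CPU that is both mapped to the hctx and online, and fall back to any mapped CPU instead of indexing with an invalid cpu id when the intersection is empty. NR_CPUS and the bitmask representation are assumptions of the sketch, not the kernel's cpumask implementation.

```c
#include <assert.h>

#define NR_CPUS 32	/* assumption for the sketch */

/* First set bit, or NR_CPUS if the mask is empty -- mirroring how
 * cpumask_first_and() returns >= nr_cpu_ids for an empty result. */
static int first_cpu(unsigned int mask)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* The fallback under test in blk_mq_alloc_request_hctx(): without it,
 * an hctx whose CPUs are all offline yields cpu >= nr_cpu_ids and the
 * subsequent per-cpu ctx lookup dereferences garbage. */
static int pick_ctx_cpu(unsigned int hctx_mask, unsigned int online_mask)
{
	int cpu = first_cpu(hctx_mask & online_mask);

	if (cpu >= NR_CPUS)	/* no online CPU mapped to this hctx */
		cpu = first_cpu(hctx_mask);
	return cpu;
}
```

Since Yi still reproduced the oops with the change applied, the thread moves on to whether such allocations should fail outright (Ming's patchset) or borrow a CPU as above (Sagi's preference for the connect path).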
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On Sun, Apr 08, 2018 at 06:44:33PM +0800, Ming Lei wrote: > On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote: > > > > > Hi Sagi > > > > > > Still can reproduce this issue with the change: > > > > Thanks for validating Yi, > > > > Would it be possible to test the following: > > -- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 75336848f7a7..81ced3096433 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct > > request_queue *q, > > return ERR_PTR(-EXDEV); > > } > > cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); > > + if (cpu >= nr_cpu_ids) { > > + pr_warn("no online cpu for hctx %d\n", hctx_idx); > > + cpu = cpumask_first(alloc_data.hctx->cpumask); > > + } > > alloc_data.ctx = __blk_mq_get_ctx(q, cpu); > > > > rq = blk_mq_get_request(q, NULL, op, &alloc_data); > > -- > > ... > > > > > > > [ 153.384977] BUG: unable to handle kernel paging request at > > > 3a9ed053bd48 > > > [ 153.393197] IP: blk_mq_get_request+0x23e/0x390 > > > > Also would it be possible to provide gdb output of: > > > > l *(blk_mq_get_request+0x23e) > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate > request from one specific hw queue, but there may not be all online CPUs > mapped to this hw queue. And the following patchset may fail this kind of allocation and avoid the kernel oops. https://marc.info/?l=linux-block&m=152318091025252&w=2 Thanks, Ming
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote:
> > Hi Sagi
> >
> > Still can reproduce this issue with the change:
>
> Thanks for validating Yi,
>
> Would it be possible to test the following:
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 75336848f7a7..81ced3096433 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  		return ERR_PTR(-EXDEV);
>  	}
>  	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +	if (cpu >= nr_cpu_ids) {
> +		pr_warn("no online cpu for hctx %d\n", hctx_idx);
> +		cpu = cpumask_first(alloc_data.hctx->cpumask);
> +	}
>  	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>
>  	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> --
> ...
>
> > [ 153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48
> > [ 153.393197] IP: blk_mq_get_request+0x23e/0x390
>
> Also would it be possible to provide gdb output of:
>
> l *(blk_mq_get_request+0x23e)

nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
a request from one specific hw queue, but there may be no online CPU
mapped to this hw queue.

Thanks,
Ming
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
> Hi Sagi
>
> Still can reproduce this issue with the change:

Thanks for validating Yi,

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		return ERR_PTR(-EXDEV);
 	}
 	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+	if (cpu >= nr_cpu_ids) {
+		pr_warn("no online cpu for hctx %d\n", hctx_idx);
+		cpu = cpumask_first(alloc_data.hctx->cpumask);
+	}
 	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

 	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...

> [ 153.384977] BUG: unable to handle kernel paging request at 3a9ed053bd48
> [ 153.393197] IP: blk_mq_get_request+0x23e/0x390

Also would it be possible to provide gdb output of:

l *(blk_mq_get_request+0x23e)

Thanks,
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Sun, Apr 08, 2018 at 05:25:42PM +0800, Ming Lei wrote:
> On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote:
> > This is because scsi_remove_device() will call blk_cleanup_queue(), and
> > then all blkgs have been destroyed and root_blkg is NULL.
> > Thus tg is NULL and trigger NULL pointer dereference when get td from
> > tg (tg->td).
> > It seems that we cannot simply move blkcg_exit_queue() up to
> > blk_cleanup_queue().
>
> Maybe one per-queue blkcg should be introduced, which seems reasonable
> too.

Sorry, I meant one per-queue blkcg lock.

--
Ming
[PATCH 8/8] blk-mq: remove code for dealing with remapping queue
Firstly, from commit 4b855ad37194 ("blk-mq: Create hctx for each present
CPU"), blk-mq doesn't remap queues any more after the CPU topology is
changed.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we
map all possible CPUs to hw queues, so at least one CPU is mapped to
each hctx.

So the queue mapping has become static and fixed, just like a per-cpu
variable, and we don't need to handle queue remapping any more.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 34 +++---
 1 file changed, 3 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b4ce83a0ba2..c3964a79638e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2330,7 +2330,7 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,

 static void blk_mq_map_swqueue(struct request_queue *q)
 {
-	unsigned int i, hctx_idx;
+	unsigned int i;
 	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_ctx *ctx;
 	struct blk_mq_tag_set *set = q->tag_set;
@@ -2347,23 +2347,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)

 	/*
 	 * Map software to hardware queues.
-	 *
-	 * If the cpu isn't present, the cpu is mapped to first hctx.
 	 */
 	for_each_possible_cpu(i) {
-		hctx_idx = q->mq_map[i];
-		/* unmapped hw queue can be remapped after CPU topo changed */
-		if (!set->tags[hctx_idx] &&
-		    !__blk_mq_alloc_rq_map(set, hctx_idx)) {
-			/*
-			 * If tags initialization fail for some hctx,
-			 * that hctx won't be brought online. In this
-			 * case, remap the current ctx to hctx[0] which
-			 * is guaranteed to always have tags allocated
-			 */
-			q->mq_map[i] = 0;
-		}
-
 		ctx = per_cpu_ptr(q->queue_ctx, i);
 		hctx = blk_mq_map_queue(q, i);

@@ -2375,21 +2360,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	mutex_unlock(&q->sysfs_lock);

 	queue_for_each_hw_ctx(q, hctx, i) {
-		/*
-		 * If no software queues are mapped to this hardware queue,
-		 * disable it and free the request entries.
-		 */
-		if (!hctx->nr_ctx) {
-			/* Never unmap queue 0. We need it as a
-			 * fallback in case of a new remap fails
-			 * allocation
-			 */
-			if (i && set->tags[i])
-				blk_mq_free_map_and_requests(set, i);
-
-			hctx->tags = NULL;
-			continue;
-		}
+		/* every hctx should get mapped by at least one CPU */
+		WARN_ON(!hctx->nr_ctx);

 		hctx->tags = set->tags[i];
 		WARN_ON(!hctx->tags);
--
2.9.5
[PATCH 6/8] blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() always returns true now, since each hctx
is mapped by at least one CPU

2) when there isn't any online CPU mapped to this hctx, there won't be
any IO queued to this CPU, and blk_mq_run_hw_queue() only runs the
queue if there is IO queued to this hctx

3) if __blk_mq_delay_run_hw_queue() is called by
blk_mq_delay_run_hw_queue(), which is run from blk_mq_dispatch_rq_list()
or scsi_mq_get_budget(), the hctx to be handled has to be mapped

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8743e9105612..3b4ce83a0ba2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,9 +1394,6 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 					unsigned long msecs)
 {
-	if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
-		return;
-
 	if (unlikely(blk_mq_hctx_stopped(hctx)))
 		return;

--
2.9.5
[PATCH 7/8] blk-mq: reimplement blk_mq_hw_queue_mapped
Now "queue mapped" actually means that at least one online CPU is
mapped to this hctx, so implement blk_mq_hw_queue_mapped() in this way.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..502af371b83b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -181,7 +181,7 @@ static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)

 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 {
-	return hctx->nr_ctx && hctx->tags;
+	return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids;
 }

 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
--
2.9.5
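[Editor's note] The new check above can be modeled in plain userspace C. This is a minimal sketch, not kernel code: it assumes a toy model where cpumasks are 64-bit words (CPU ids 0..63), and the names mask_first() and hw_queue_mapped() are hypothetical stand-ins for cpumask_first_and() and blk_mq_hw_queue_mapped().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 64 /* toy model: CPU ids 0..63, masks as 64-bit words */

/* First set bit, or NR_CPU_IDS when the intersection is empty; this
 * mirrors cpumask_first_and() returning >= nr_cpu_ids for no match. */
static int mask_first(uint64_t mask)
{
	return mask ? __builtin_ctzll(mask) : NR_CPU_IDS;
}

/* Sketch of the reimplemented check: a hctx counts as "mapped" iff
 * at least one CPU in its mask is currently online. */
static int hw_queue_mapped(uint64_t hctx_mask, uint64_t online_mask)
{
	return mask_first(hctx_mask & online_mask) < NR_CPU_IDS;
}
```

For example, a hctx covering CPUs 2 and 3 is mapped while CPU 2 is online, and becomes unmapped once both are offline.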
[PATCH 5/8] blk-mq: remove blk_mq_delay_queue()
No driver uses this interface any more, so remove it.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq-debugfs.c | 1 -
 block/blk-mq.c | 30 ++
 include/linux/blk-mq.h | 2 --
 3 files changed, 2 insertions(+), 31 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 58b3b79cbe83..3080e18cb859 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -235,7 +235,6 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
-	HCTX_STATE_NAME(START_ON_RUN),
 };
 #undef HCTX_STATE_NAME

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e3d02af79010..8743e9105612 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1563,40 +1563,14 @@ static void blk_mq_run_work_fn(struct work_struct *work)
 	hctx = container_of(work, struct blk_mq_hw_ctx, run_work.work);

 	/*
-	 * If we are stopped, don't run the queue. The exception is if
-	 * BLK_MQ_S_START_ON_RUN is set. For that case, we auto-clear
-	 * the STOPPED bit and run it.
+	 * If we are stopped, don't run the queue.
 	 */
-	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
-		if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
-			return;
-
-		clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
+	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))
 		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
-	}

 	__blk_mq_run_hw_queue(hctx);
 }

-void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs)
-{
-	if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
-		return;
-
-	/*
-	 * Stop the hw queue, then modify currently delayed work.
-	 * This should prevent us from running the queue prematurely.
-	 * Mark the queue as auto-clearing STOPPED when it runs.
-	 */
-	blk_mq_stop_hw_queue(hctx);
-	set_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
-	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
-				    &hctx->run_work,
-				    msecs_to_jiffies(msecs));
-}
-EXPORT_SYMBOL(blk_mq_delay_queue);
-
 static inline void __blk_mq_insert_req_list(struct blk_mq_hw_ctx *hctx,
 					    struct request *rq,
 					    bool at_head)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..e3986f4b3461 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -183,7 +183,6 @@ enum {
 	BLK_MQ_S_STOPPED	= 0,
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
-	BLK_MQ_S_START_ON_RUN	= 3,

 	BLK_MQ_MAX_DEPTH	= 10240,

@@ -270,7 +269,6 @@ void blk_mq_unquiesce_queue(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
--
2.9.5
[PATCH 4/8] blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
This patch introduces the helper blk_mq_first_mapped_cpu() for figuring
out the hctx's first mapped CPU, so code duplication can be avoided.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 23 +++
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a16efa6f2e7f..e3d02af79010 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1336,6 +1336,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }

+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(hctx->cpumask);
+	return cpu;
+}
+
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -1355,14 +1364,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
 				cpu_online_mask);
 		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
-		/*
-		 * No online CPU is found, so have to make sure hctx->next_cpu
-		 * is set correctly for not breaking workqueue.
-		 */
-		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first(hctx->cpumask);
+			next_cpu = blk_mq_first_mapped_cpu(hctx);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}

@@ -2431,10 +2433,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		/*
 		 * Initialize batch roundrobin counts
 		 */
-		hctx->next_cpu = cpumask_first_and(hctx->cpumask,
-				cpu_online_mask);
-		if (hctx->next_cpu >= nr_cpu_ids)
-			hctx->next_cpu = cpumask_first(hctx->cpumask);
+		hctx->next_cpu = blk_mq_first_mapped_cpu(hctx);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
 }
--
2.9.5
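[Editor's note] The fallback behavior of the new helper can be sketched in userspace C. This is a minimal sketch under a toy 64-bit cpumask model (CPU ids 0..63); mask_first() and first_mapped_cpu() are hypothetical names mirroring cpumask_first_and()/cpumask_first() and blk_mq_first_mapped_cpu().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 64 /* toy model: CPU ids 0..63, masks as 64-bit words */

/* First set bit, or NR_CPU_IDS when the mask is empty. */
static int mask_first(uint64_t mask)
{
	return mask ? __builtin_ctzll(mask) : NR_CPU_IDS;
}

/* Mirrors the logic of blk_mq_first_mapped_cpu(): prefer the first
 * online CPU in the hctx mask; if none is online, fall back to the
 * first CPU in the mask so workqueue scheduling still has a target. */
static int first_mapped_cpu(uint64_t hctx_mask, uint64_t online_mask)
{
	int cpu = mask_first(hctx_mask & online_mask);

	if (cpu >= NR_CPU_IDS)
		cpu = mask_first(hctx_mask);
	return cpu;
}
```

With a hctx covering CPUs 2 and 3: if only CPU 3 is online it is chosen; if neither is online, the helper falls back to CPU 2, the first CPU in the mask.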
[PATCH 3/8] blk-mq: avoid to write intermediate result to hctx->next_cpu
This patch figures out the final selected CPU first, and only then
writes it to hctx->next_cpu once, so other dispatch paths can no longer
observe an intermediate next_cpu value.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b220dc415ac..a16efa6f2e7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1345,26 +1345,24 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
 	bool tried = false;
+	int next_cpu = hctx->next_cpu;

 	if (hctx->queue->nr_hw_queues == 1)
 		return WORK_CPU_UNBOUND;

 	if (--hctx->next_cpu_batch <= 0) {
-		int next_cpu;
 select_cpu:
-		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
+		next_cpu = cpumask_next_and(next_cpu, hctx->cpumask,
 				cpu_online_mask);
 		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first_and(hctx->cpumask,cpu_online_mask);
+			next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

 		/*
 		 * No online CPU is found, so have to make sure hctx->next_cpu
 		 * is set correctly for not breaking workqueue.
 		 */
 		if (next_cpu >= nr_cpu_ids)
-			hctx->next_cpu = cpumask_first(hctx->cpumask);
-		else
-			hctx->next_cpu = next_cpu;
+			next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}

@@ -1372,7 +1370,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 	 * Do unbound schedule if we can't find a online CPU for this hctx,
 	 * and it should only happen in the path of handling CPU DEAD.
 	 */
-	if (!cpu_online(hctx->next_cpu)) {
+	if (!cpu_online(next_cpu)) {
 		if (!tried) {
 			tried = true;
 			goto select_cpu;
@@ -1382,10 +1380,13 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		 * Make sure to re-select CPU next time once after CPUs
 		 * in hctx->cpumask become online again.
 		 */
+		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = 1;
 		return WORK_CPU_UNBOUND;
 	}
-	return hctx->next_cpu;
+
+	hctx->next_cpu = next_cpu;
+	return next_cpu;
 }

 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
--
2.9.5
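[Editor's note] The "compute in a local, store once" pattern above can be illustrated in userspace C. This is a minimal sketch under a toy 64-bit cpumask model (CPU ids 0..63); pick_next_cpu() is a hypothetical name, and the round-robin logic only approximates blk_mq_hctx_next_cpu() (no batching, no CPU-DEAD retry).

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 64 /* toy model: CPU ids 0..63, masks as 64-bit words */

/* Round-robin pick of the next online CPU mapped to a hctx. The
 * candidate is computed entirely in the local 'cpu' and written to
 * *next_cpu exactly once at the end, so a concurrent reader of
 * *next_cpu never observes an intermediate value. */
static int pick_next_cpu(uint64_t hctx_mask, uint64_t online_mask,
			 int *next_cpu)
{
	uint64_t mapped_online = hctx_mask & online_mask;
	/* mask of CPU ids strictly greater than the current position */
	uint64_t higher = (*next_cpu >= 63) ? 0 : ~((2ULL << *next_cpu) - 1);
	int cpu;

	if (mapped_online & higher)		/* next online CPU after current */
		cpu = __builtin_ctzll(mapped_online & higher);
	else if (mapped_online)			/* wrap to first online CPU */
		cpu = __builtin_ctzll(mapped_online);
	else					/* none online: first in mask */
		cpu = hctx_mask ? __builtin_ctzll(hctx_mask) : NR_CPU_IDS;

	*next_cpu = cpu;			/* single final store */
	return cpu;
}
```

Starting from CPU 1 with CPUs 1, 3, 5 mapped and all online, successive calls cycle 3, 5, 1; with every mapped CPU offline, the pick falls back to the first CPU in the mask.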
[PATCH 2/8] blk-mq: don't keep offline CPUs mapped to hctx 0
From commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU"),
blk-mq doesn't remap queues after the CPU topology is changed. That
said, when some of these offline CPUs become online, they are still
mapped to hctx 0, and then hctx 0 may become the bottleneck of IO
dispatch and completion.

This patch sets up the mapping from the beginning, and aligns to the
queue mapping for PCI devices (blk_mq_pci_map_queues()).

Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Cc: Keith Busch
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei
---
 block/blk-mq-cpumap.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index 9f8cffc8a701..3eb169f15842 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -16,11 +16,6 @@
 static int cpu_to_queue_index(unsigned int nr_queues, const int cpu)
 {
-	/*
-	 * Non present CPU will be mapped to queue index 0.
-	 */
-	if (!cpu_present(cpu))
-		return 0;
 	return cpu % nr_queues;
 }
--
2.9.5
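[Editor's note] The effect of dropping the cpu_present() special case is easy to see in a userspace sketch. This is a minimal model of the resulting mapping, not the kernel function itself: with the hunk removed, every possible CPU, present or not, gets a round-robin queue index from the start, so a CPU that comes online later already has its own queue instead of piling onto queue 0.

```c
#include <assert.h>

/* Static round-robin mapping from CPU id to hw queue index, matching
 * the body of cpu_to_queue_index() after the removal above. */
static unsigned int cpu_to_queue_index(unsigned int nr_queues,
				       unsigned int cpu)
{
	return cpu % nr_queues;
}
```

For example, with 4 hw queues, CPU 6 maps to queue 2 whether or not it was present at queue setup time; under the old code a non-present CPU 6 would have been pinned to queue 0.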
[PATCH 1/8] blk-mq: make sure that correct hctx->next_cpu is set
From commit 20e4d81393196 ("blk-mq: simplify queue mapping & schedule
with each possisble CPU"), one hctx may be mapped by only offline CPUs,
and then hctx->next_cpu can be set wrongly.

This patch fixes the issue by making hctx->next_cpu point to the first
CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Christian Borntraeger
Cc: Christoph Hellwig
Cc: Stefan Haberland
Fixes: 20e4d81393196 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbcb954f..9b220dc415ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2432,6 +2432,8 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		 */
 		hctx->next_cpu = cpumask_first_and(hctx->cpumask,
 				cpu_online_mask);
+		if (hctx->next_cpu >= nr_cpu_ids)
+			hctx->next_cpu = cpumask_first(hctx->cpumask);
 		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
 	}
 }
--
2.9.5
[PATCH 0/8] blk-mq: fix and improve queue mapping
Hi Jens,

The first two patches fix issues about queue mapping, and the other 6
patches improve queue mapping for blk-mq.

Christian, these patches should fix your issue, so please give them a
test. The patches can be found in the following tree:

https://github.com/ming1/linux/commits/v4.17-rc-blk-fix_mapping_v1

Thanks,
Ming

Ming Lei (8):
  blk-mq: make sure that correct hctx->next_cpu is set
  blk-mq: don't keep offline CPUs mapped to hctx 0
  blk-mq: avoid to write intermediate result to hctx->next_cpu
  blk-mq: introduce blk_mq_hw_queue_first_cpu() to figure out first cpu
  blk-mq: remove blk_mq_delay_queue()
  blk-mq: don't check queue mapped in __blk_mq_delay_run_hw_queue()
  blk-mq: reimplement blk_mq_hw_queue_mapped
  blk-mq: remove code for dealing with remapping queue

 block/blk-mq-cpumap.c  | 5 ---
 block/blk-mq-debugfs.c | 1 -
 block/blk-mq.c         | 101 +++--
 block/blk-mq.h         | 2 +-
 include/linux/blk-mq.h | 2 -
 5 files changed, 24 insertions(+), 87 deletions(-)

--
2.9.5
Re: [block regression] kernel oops triggered by removing scsi device during IO
On Sun, Apr 08, 2018 at 04:11:51PM +0800, Joseph Qi wrote:
> This is because scsi_remove_device() will call blk_cleanup_queue(), and
> then all blkgs have been destroyed and root_blkg is NULL.
> Thus tg is NULL and trigger NULL pointer dereference when get td from
> tg (tg->td).
> It seems that we cannot simply move blkcg_exit_queue() up to
> blk_cleanup_queue().

Maybe one per-queue blkcg should be introduced, which seems reasonable
too.

Thanks,
Ming
[PATCH] blk-throttle: fix typo in blkg_print_stat_ios() comments
Signed-off-by: Joseph Qi
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694..87367d4 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -640,7 +640,7 @@ int blkg_print_stat_bytes(struct seq_file *sf, void *v)
 EXPORT_SYMBOL_GPL(blkg_print_stat_bytes);

 /**
- * blkg_print_stat_bytes - seq_show callback for blkg->stat_ios
+ * blkg_print_stat_ios - seq_show callback for blkg->stat_ios
  * @sf: seq_file to print to
  * @v: unused
  *
--
1.8.3.1
Re: [block regression] kernel oops triggered by removing scsi device during IO
This is because scsi_remove_device() will call blk_cleanup_queue(), and
then all blkgs have been destroyed and root_blkg is NULL.
Thus tg is NULL and it triggers a NULL pointer dereference when getting
td from tg (tg->td).
It seems that we cannot simply move blkcg_exit_queue() up to
blk_cleanup_queue().

Thanks,
Joseph

On 18/4/8 12:21, Ming Lei wrote:
> Hi,
>
> The following kernel oops is triggered by 'removing scsi device' during
> heavy IO.
>
> 'git bisect' shows that commit a063057d7c731cffa7d10740 ("block: Fix a race
> between request queue removal and the block cgroup controller")
> introduced this regression:
>
> [ 42.268257] BUG: unable to handle kernel NULL pointer dereference at 0028
> [ 42.269339] PGD 26bd9f067 P4D 26bd9f067 PUD 26bfec067 PMD 0
> [ 42.270077] Oops: [#1] PREEMPT SMP NOPTI
> [ 42.270681] Dumping ftrace buffer:
> [ 42.271141]    (ftrace buffer empty)
> [ 42.271641] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support crc32c_intel i2c_i801 i2c_core lpc_ich mfd_core usb_storage nvme shpchp nvme_core virtio_scsi qemu_fw_cfg ip_tables
> [ 42.273770] CPU: 5 PID: 1076 Comm: fio Not tainted 4.16.0+ #49
> [ 42.274530] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> [ 42.275634] RIP: 0010:blk_throtl_bio+0x41/0x904
> [ 42.276225] RSP: 0018:c900033cfaa0 EFLAGS: 00010246
> [ 42.276907] RAX: 8000 RBX: 8801bdcc5118 RCX: 0001
> [ 42.277818] RDX: 8801bdcc5118 RSI: RDI: 8802641f8870
> [ 42.278733] RBP: R08: 0001 R09: c900033cfb94
> [ 42.279651] R10: c900033cfc00 R11: 06ea R12: 8802641f8870
> [ 42.280567] R13: 88026f34f000 R14: R15: 8801bdcc5118
> [ 42.281489] FS: 7fc123922d40() GS:880272f4() knlGS:
> [ 42.282525] CS: 0010 DS: ES: CR0: 80050033
> [ 42.283270] CR2: 0028 CR3: 00026d7ac004 CR4: 007606e0
> [ 42.284194] DR0: DR1: DR2:
> [ 42.285116] DR3: DR6: fffe0ff0 DR7: 0400
> [ 42.286036] PKRU: 5554
> [ 42.286393] Call Trace:
> [ 42.286725] ? try_to_wake_up+0x3a3/0x3c9
> [ 42.287255] ? blk_mq_hctx_notify_dead+0x135/0x135
> [ 42.287880] ? gup_pud_range+0xb5/0x7e1
> [ 42.288381] generic_make_request_checks+0x3cf/0x539
> [ 42.289027] ? gup_pgd_range+0x8e/0xaa
> [ 42.289515] generic_make_request+0x38/0x25b
> [ 42.290078] ? submit_bio+0x103/0x11f
> [ 42.290555] submit_bio+0x103/0x11f
> [ 42.291018] ? bio_iov_iter_get_pages+0xe4/0x104
> [ 42.291620] blkdev_direct_IO+0x2a3/0x3af
> [ 42.292151] ? kiocb_free+0x34/0x34
> [ 42.292607] ? ___preempt_schedule+0x16/0x18
> [ 42.293168] ? preempt_schedule_common+0x4c/0x65
> [ 42.293771] ? generic_file_read_iter+0x96/0x110
> [ 42.294377] generic_file_read_iter+0x96/0x110
> [ 42.294962] aio_read+0xca/0x13b
> [ 42.295388] ? preempt_count_add+0x6d/0x8c
> [ 42.295926] ? aio_read_events+0x287/0x2d6
> [ 42.296460] ? do_io_submit+0x4d2/0x62c
> [ 42.296964] do_io_submit+0x4d2/0x62c
> [ 42.297446] ? do_syscall_64+0x9d/0x15e
> [ 42.297950] do_syscall_64+0x9d/0x15e
> [ 42.298431] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [ 42.299090] RIP: 0033:0x7fc12244e687
> [ 42.299556] RSP: 002b:7ffe18388a68 EFLAGS: 0202 ORIG_RAX: 00d1
> [ 42.300528] RAX: ffda RBX: 7fc0fde08670 RCX: 7fc12244e687
> [ 42.301442] RDX: 01d1b388 RSI: 0001 RDI: 7fc123782000
> [ 42.302359] RBP: 22d8 R08: 0001 R09: 01c461e0
> [ 42.303275] R10: R11: 0202 R12: 7fc0fde08670
> [ 42.304195] R13: R14: 01d1d0c0 R15: 01b872f0
> [ 42.305117] Code: 48 85 f6 48 89 7c 24 10 75 0e 48 8b b7 b8 05 00 00 31 ed 48 85 f6 74 0f 48 63 05 75 a4 e4 00 48 8b ac c6 28 02 00 00 f6 43 15 02 <48> 8b 45 28 48 89 04 24 0f 85 28 08 00 00 8b 43 10 45 31 e4 83
> [ 42.307553] RIP: blk_throtl_bio+0x41/0x904 RSP: c900033cfaa0
> [ 42.308328] CR2: 0028
> [ 42.308920] ---[ end trace f53a144979f63b29 ]---
> [ 42.309520] Kernel panic - not syncing: Fatal exception
> [ 42.310635] Dumping ftrace buffer:
> [ 42.311087]    (ftrace buffer empty)
> [ 42.311583] Kernel Offset: disabled
> [ 42.312163] ---[ end Kernel panic - not syncing: Fatal exception ]---