Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio
On Tue, Jan 30, 2018 at 04:24:14PM +0100, Jiri Palecek wrote: > > On 1/30/18 1:53 PM, Ming Lei wrote: > > On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček wrote: > > > Avoids page leak from bounced requests > > > --- > > > block/blk-map.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/block/blk-map.c b/block/blk-map.c > > > index d3a94719f03f..702d68166689 100644 > > > --- a/block/blk-map.c > > > +++ b/block/blk-map.c > > > @@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio > > > **bio) > > > } else { > > > if (!ll_back_merge_fn(rq->q, rq, *bio)) { > > > if (orig_bio != *bio) { > > > - bio_put(*bio); > > > + bio_inc_remaining(orig_bio); > > > + bio_endio(*bio); > > 'bio_inc_remaining(orig_bio);' shouldn't be needed since we don't chain > > bounced > > bio, otherwise this patch is fine. > > I believe it is needed or at least desirable. The situation when the request > bounced is like this > > bio (bounced) . bi_private ---> orig_bio > > and at the end of bounce_end_io, bio_endio(bio->bi_private) [which is > bio_endio(orig_bio) in our case] is called. That doesn't have any effect on > __blk_rq_map_user_iov; its bios have .bi_end_io==0, therefore one call more > or less doesn't matter. However, for other callers, like osd_initiator.c, > this is not the case. They pass bios which have bi_end_io, and might be > surprised if this was called unexpectedly. OK, I think it is good to follow the rule of not calling .bi_end_io() in the failure path, just like __blk_rq_map_user_iov()/blk_rq_map_kern(). But seems pscsi_map_sg() depends on bio_endio(orig_bio) to free the 'orig_bio', could you fix it in this patch too? > > Before caa4b02476e3, blk_rq_append_request wouldn't touch/delete the passed > bio at all under any circumstances. I believe it should stay that way and > incrementing the remaining counter, which effectively nullifies the extra > bio_endio call, does that pretty straightforwardly. Seems too tricky to use bio_inc_remaining() for avoiding bio_endio(orig_bio), if we have to do that, I suggest to add comment on that. Thanks, Ming
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
Hi Jens On 01/30/2018 11:57 PM, Jens Axboe wrote: > On 1/30/18 8:41 AM, Jens Axboe wrote: >> Hi, >> >> I just hit this on 4.15+ on the laptop, it's running Linus' git >> as of yesterday, right after the block tree merge: >> >> commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0 >> Merge: 9697e9da8429 796baeeef85a >> Author: Linus Torvalds >> Date: Mon Jan 29 11:51:49 2018 -0800 >> >> Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block >> >> It's hitting this case: >> >> if (WARN_ON_ONCE(n != segments)) { >> >> >> in nvme, indicating that rq->nr_phys_segments does not equal the >> number of bios in the request. Anyone seen this? I'll take a closer >> look as time permits, full trace below. >> >> >> WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 >> nvme_setup_cmd+0x3d3/0x430 >> Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc >> snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat >> snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 >> snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq >> x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb >> snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer >> videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 >> crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore >> hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd >> intel_gtt >> CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ >> #176 >> Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) >> 12/19/2017 >> RIP: 0010:nvme_setup_cmd+0x3d3/0x430 >> RSP: 0018:880423e9f838 EFLAGS: 00010217 >> RAX: RBX: 880423e9f8c8 RCX: 0001 >> RDX: 88022b200010 RSI: 0002 RDI: 327f >> RBP: 880421251400 R08: 88022b20 R09: 0009 >> R10: R11: R12: >> R13: 88042341e280 R14: R15: 880421251440 >> FS: () GS:88044150() knlGS: >> CS: 0010 DS: ES: CR0: 80050033 >> CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0 >> DR0: DR1: DR2: >> DR3: DR6: fffe0ff0 DR7: 0400 >> Call Trace: >> nvme_queue_rq+0x40/0xa00 >> ? __sbitmap_queue_get+0x24/0x90 >> ? blk_mq_get_tag+0xa3/0x250 >> ? wait_woken+0x80/0x80 >> ? blk_mq_get_driver_tag+0x97/0xf0 >> blk_mq_dispatch_rq_list+0x7b/0x4a0 >> ? deadline_remove_request+0x49/0xb0 >> blk_mq_do_dispatch_sched+0x4f/0xc0 >> blk_mq_sched_dispatch_requests+0x106/0x170 >> __blk_mq_run_hw_queue+0x53/0xa0 >> __blk_mq_delay_run_hw_queue+0x83/0xa0 >> blk_mq_run_hw_queue+0x6c/0xd0 >> blk_mq_sched_insert_request+0x96/0x140 >> __blk_mq_try_issue_directly+0x3d/0x190 >> blk_mq_try_issue_directly+0x30/0x70 >> blk_mq_make_request+0x1a4/0x6a0 >> generic_make_request+0xfd/0x2f0 >> ? submit_bio+0x5c/0x110 >> submit_bio+0x5c/0x110 >> ? __blkdev_issue_discard+0x152/0x200 >> submit_bio_wait+0x43/0x60 >> ext4_process_freed_data+0x1cd/0x440 >> ? account_page_dirtied+0xe2/0x1a0 >> ext4_journal_commit_callback+0x4a/0xc0 >> jbd2_journal_commit_transaction+0x17e2/0x19e0 >> ? kjournald2+0xb0/0x250 >> kjournald2+0xb0/0x250 >> ? wait_woken+0x80/0x80 >> ? commit_timeout+0x10/0x10 >> kthread+0x111/0x130 >> ? kthread_create_worker_on_cpu+0x50/0x50 >> ? do_group_exit+0x3a/0xa0 >> ret_from_fork+0x1f/0x30 >> Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 >> 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 >> 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff >> ---[ end trace 50d361cc444506c8 ]--- >> print_req_error: I/O error, dev nvme0n1, sector 847167488 > > Looking at the disassembly, 'n' is 2 and 'segments' is 0x. > Let's consider the case following: blk_mq_bio_to_request() -> blk_init_request_from_bio() -> blk_rq_bio_prep() if (bio_has_data(bio)) rq->nr_phys_segments = bio_phys_segments(q, bio); static inline bool bio_has_data(struct bio *bio) { if (bio && bio->bi_iter.bi_size && bio_op(bio) != REQ_OP_DISCARD && //-> HERE ! bio_op(bio) != REQ_OP_SECURE_ERASE && bio_op(bio) != REQ_OP_WRITE_ZEROES) return true; return false; } For the DISCARD req, the nr_phys_segments is 0. dd_insert_request() -> blk_mq_sched_try_insert_merge() -> elv_attempt_insert_merge() -> blk_attempt_req_merge() -> attempt_merge() -> ll_merge_requests_fn() total_phys_segments = req->nr_phys_segments + next->nr_phys_segments; // total_phys_segments will be 0 if (blk
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 08:22:27PM -0700, Jens Axboe wrote: > On 1/30/18 8:21 PM, Bart Van Assche wrote: > > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > >> BLK_STS_RESOURCE should always be safe to return, and it should work > >> the same as STS_DEV_RESOURCE, except it may cause an extra queue > >> run. > >> > >> Well written drivers should use STS_DEV_RESOURCE where it makes > >> sense. > > > > Hello Jens, > > > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > > one of the two status codes causes the queue to be rerun and the other not. > > I'm afraid that the currently chosen names will cause confusion. > > DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE > could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction I guess it still can't cover all, for example, .queue_rq() may not submit rq to hardware successfully because of tansport busy, such FC,.. -- Ming
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 1/30/18 8:27 PM, Bart Van Assche wrote: > On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote: >> On 1/30/18 8:21 PM, Bart Van Assche wrote: >>> On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: BLK_STS_RESOURCE should always be safe to return, and it should work the same as STS_DEV_RESOURCE, except it may cause an extra queue run. Well written drivers should use STS_DEV_RESOURCE where it makes sense. >>> >>> Hello Jens, >>> >>> I would appreciate it if other names would be chosen than BLK_STS_RESOURCE >>> and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that >>> one of the two status codes causes the queue to be rerun and the other not. >>> I'm afraid that the currently chosen names will cause confusion. >> >> DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE >> could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction >> a bit clearer. > > I'm concerned about both. A block driver can namely run out of device > resources without receiving a notification if that resource becomes > available again. That is true, and that is why I don't want to driver to make guesses on when to re-run. The saving grace here is that it should happen extremely rarely. I went over this in a previous email. If it's not a rare occurence, then there's no other way around it then wiring up a notification mechanism to kick the queue when the shortage clears. One way to deal with that is making the assumption that other IO will clear the issue. That means we can confine it to the blk-mq layer. Similarly to how we currently sometimes have to track starvation across hardware queues and restart queue X when Y completes IO, we may have to have a broader scope of shortage. That would probably fix all bug pathological cases. > In that case the appropriate return code is BLK_STS_RESOURCE. Hence my > concern ... But this isn't a new thing, the only real change here is making it so that drivers don't have to think about that case. -- Jens Axboe
[PATCH] bcache: fix error return value in memory shrink
From: Tang Junhui In bch_mca_scan(), the return value should not be the number of freed btree nodes, but the number of pages of freed btree nodes. Signed-off-by: Tang Junhui --- drivers/md/bcache/btree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 81e8dc3..a5fbccc 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -718,7 +718,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, } out: mutex_unlock(&c->bucket_lock); - return freed; + return freed * c->btree_pages; } static unsigned long bch_mca_count(struct shrinker *shrink, -- 1.8.3.1
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, 2018-01-30 at 20:22 -0700, Jens Axboe wrote: > On 1/30/18 8:21 PM, Bart Van Assche wrote: > > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > > > BLK_STS_RESOURCE should always be safe to return, and it should work > > > the same as STS_DEV_RESOURCE, except it may cause an extra queue > > > run. > > > > > > Well written drivers should use STS_DEV_RESOURCE where it makes > > > sense. > > > > Hello Jens, > > > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > > one of the two status codes causes the queue to be rerun and the other not. > > I'm afraid that the currently chosen names will cause confusion. > > DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE > could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction > a bit clearer. I'm concerned about both. A block driver can namely run out of device resources without receiving a notification if that resource becomes available again. In that case the appropriate return code is BLK_STS_RESOURCE. Hence my concern ... Bart.
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 1/30/18 8:21 PM, Bart Van Assche wrote: > On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: >> BLK_STS_RESOURCE should always be safe to return, and it should work >> the same as STS_DEV_RESOURCE, except it may cause an extra queue >> run. >> >> Well written drivers should use STS_DEV_RESOURCE where it makes >> sense. > > Hello Jens, > > I would appreciate it if other names would be chosen than BLK_STS_RESOURCE > and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that > one of the two status codes causes the queue to be rerun and the other not. > I'm afraid that the currently chosen names will cause confusion. DEV_RESOURCE is pretty clear I think, but I agree that STS_RESOURCE could perhaps be better. STS_SYSTEM_RESOURCE? It makes the distinction a bit clearer. -- Jens Axboe
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, 2018-01-30 at 20:17 -0700, Jens Axboe wrote: > BLK_STS_RESOURCE should always be safe to return, and it should work > the same as STS_DEV_RESOURCE, except it may cause an extra queue > run. > > Well written drivers should use STS_DEV_RESOURCE where it makes > sense. Hello Jens, I would appreciate it if other names would be chosen than BLK_STS_RESOURCE and BLK_STS_DEV_RESOURCE, e.g. names that directly refer to the fact that one of the two status codes causes the queue to be rerun and the other not. I'm afraid that the currently chosen names will cause confusion. Thanks, Bart.
Re: [PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 1/30/18 8:04 PM, Mike Snitzer wrote: > From: Ming Lei > > This status is returned from driver to block layer if device related > resource is unavailable, but driver can guarantee that IO dispatch > will be triggered in future when the resource is available. > > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after > a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is > 3 ms because both scsi-mq and nvmefc are using that magic value. > > If a driver can make sure there is in-flight IO, it is safe to return > BLK_STS_DEV_RESOURCE because: > > 1) If all in-flight IOs complete before examining SCHED_RESTART in > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue > is run immediately in this case by blk_mq_dispatch_rq_list(); > > 2) if there is any in-flight IO after/when examining SCHED_RESTART > in blk_mq_dispatch_rq_list(): > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) > - otherwise, this request will be dispatched after any in-flight IO is > completed via blk_mq_sched_restart() > > 3) if SCHED_RESTART is set concurently in context because of > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two > cases and make sure IO hang can be avoided. > > One invariant is that queue will be rerun if SCHED_RESTART is set. Applied, thanks. -- Jens Axboe
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 1/30/18 10:52 AM, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: >> + * >> + * If driver returns BLK_STS_RESOURCE and SCHED_RESTART >> + * bit is set, run queue after a delay to avoid IO stalls >> + * that could otherwise occur if the queue is idle. >> */ >> -if (!blk_mq_sched_needs_restart(hctx) || >> +needs_restart = blk_mq_sched_needs_restart(hctx); >> +if (!needs_restart || >> (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) >> blk_mq_run_hw_queue(hctx, true); >> +else if (needs_restart && (ret == BLK_STS_RESOURCE)) >> +blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); >> } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code > tests it then the above code will schedule an asynchronous call of > __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new > queue run returns again BLK_STS_RESOURCE then the above code will be > executed again. In other words, a loop occurs. That loop will repeat as > long as the described race occurs. The current (kernel v4.15) block > layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls that > function. Hence that loop cannot occur with the v4.15 block layer core > and block drivers. A motivation of why that loop is preferred compared > to the current behavior (no loop) is missing. Does this mean that that > loop is a needless complication of the block layer core? The dispatch, and later restart check, is within the hctx lock. The completions should be as well. > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need >some time to stabilize. > - The delay after which to rerun the queue is moved from block layer >drivers into the block layer core. I think that's wrong because only >the block driver authors can make a good choice for this constant. It's exactly the right place to put it, as drivers cannot make a good decision for when to run the queue again. You get NULL on allocating some piece of memory, when do you run it again? That's the last thing I want driver writers to make a decision on, because a novice device driver writer will just think that he should run the queue again asap. In reality we are screwed. Decisions like that SHOULD be in shared and generic code, not in driver private code. > - This patch makes block drivers harder to understand. Anyone who sees >return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first >time will have to look up the meaning of these constants. An explicit >blk_mq_delay_run_hw_queue() call is easier to understand. BLK_STS_RESOURCE should always be safe to return, and it should work the same as STS_DEV_RESOURCE, except it may cause an extra queue run. Well written drivers should use STS_DEV_RESOURCE where it makes sense. > - This patch does not fix any bugs nor makes block drivers easier to >read or to implement. So why is this patch considered useful? It does fix the run bug on global resources, I'm assuming you mean it doesn't fix any EXTRA bugs compared to just use the delayed run? -- Jens Axboe
Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30 2018 at 9:44P -0500, Jens Axboe wrote: > On 1/30/18 7:24 AM, Mike Snitzer wrote: > > From: Ming Lei > > > > This status is returned from driver to block layer if device related > > resource is unavailable, but driver can guarantee that IO dispatch > > will be triggered in future when the resource is available. > > > > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver > > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after > > a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is > > 3 ms because both scsi-mq and nvmefc are using that magic value. > > > > If a driver can make sure there is in-flight IO, it is safe to return > > BLK_STS_DEV_RESOURCE because: > > > > 1) If all in-flight IOs complete before examining SCHED_RESTART in > > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue > > is run immediately in this case by blk_mq_dispatch_rq_list(); > > > > 2) if there is any in-flight IO after/when examining SCHED_RESTART > > in blk_mq_dispatch_rq_list(): > > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) > > - otherwise, this request will be dispatched after any in-flight IO is > > completed via blk_mq_sched_restart() > > > > 3) if SCHED_RESTART is set concurently in context because of > > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two > > cases and make sure IO hang can be avoided. > > > > One invariant is that queue will be rerun if SCHED_RESTART is set. > > This looks pretty good to me. I'm waffling a bit on whether to retain > the current BLK_STS_RESOURCE behavior and name the new one something > else, but I do like using the DEV name in there to signify the > difference between a global and device resource. > > Just a few small nits below - can you roll a v6 with the changes? Folded in all your feedback and just replied with v6. > > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > > index 2d973ac54b09..f41d2057215f 100644 > > --- a/include/linux/blk_types.h > > +++ b/include/linux/blk_types.h > > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t; > > > > #define BLK_STS_AGAIN ((__force blk_status_t)12) > > > > +/* > > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device > > + * related resource is unavailable, but driver can guarantee that queue > > + * will be rerun in future once the resource is available (whereby > > + * dispatching requests). > > + * > > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a > > + * driver just needs to make sure there is in-flight IO. > > + * > > + * Difference with BLK_STS_RESOURCE: > > + * If driver isn't sure if the queue will be rerun once device resource > > + * is made available, please return BLK_STS_RESOURCE. For example: when > > + * memory allocation, DMA Mapping or other system resource allocation > > + * fails and IO can't be submitted to device. > > + */ > > +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) > > I'd rephrase that as: > > BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if > device related resource are unavailable, but the driver can guarantee > that the queue will be rerun in the future once resources become > available again. This is typically the case for device specific > resources that are consumed for IO. If the driver fails allocating these > resources, we know that inflight (or pending) IO will free these > resource upon completion. > > This is different from BLK_STS_RESOURCE in that it explicitly references > device specific resource. For resources of wider scope, allocation > failure can happen without having pending IO. This means that we can't > rely on request completions freeing these resources, as IO may not be in > flight. Examples of that are kernel memory allocations, DMA mappings, or > any other system wide resources. Thanks for that, definitely clearer, nice job. Mike
[PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE
From: Ming Lei This status is returned from driver to block layer if device related resource is unavailable, but driver can guarantee that IO dispatch will be triggered in future when the resource is available. Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value. If a driver can make sure there is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE because: 1) If all in-flight IOs complete before examining SCHED_RESTART in blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue is run immediately in this case by blk_mq_dispatch_rq_list(); 2) if there is any in-flight IO after/when examining SCHED_RESTART in blk_mq_dispatch_rq_list(): - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) - otherwise, this request will be dispatched after any in-flight IO is completed via blk_mq_sched_restart() 3) if SCHED_RESTART is set concurently in context because of BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two cases and make sure IO hang can be avoided. One invariant is that queue will be rerun if SCHED_RESTART is set. Suggested-by: Jens Axboe Tested-by: Laurence Oberman Signed-off-by: Ming Lei Signed-off-by: Mike Snitzer --- V6: - use -EBUSY, instead of -ENOMEM, for BLK_STS_DEV_RESOURCE - rename BLK_MQ_QUEUE_DELAY to BLK_MQ_RESOURCE_DELAY - cleanup blk_types.h comment block above BLK_STS_DEV_RESOURCE - all suggested by Jens V5: - fixed stale reference to 10ms delay in blk-mq.c comment - revised commit header to include Ming's proof of how blk-mq drivers will make progress (serves to document how scsi_queue_rq now works) V4: - cleanup header and code comments - rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms - eliminate nvmefc's queue rerun now that blk-mq does it V3: - fix typo, and improvement document - add tested-by tag V2: - add comments on the new introduced status - patch style fix - both are suggested by Christoph block/blk-core.c | 1 + block/blk-mq.c | 20 drivers/block/null_blk.c | 2 +- drivers/block/virtio_blk.c | 2 +- drivers/block/xen-blkfront.c | 2 +- drivers/md/dm-rq.c | 5 ++--- drivers/nvme/host/fc.c | 12 ++-- drivers/scsi/scsi_lib.c | 6 +++--- include/linux/blk_types.h| 18 ++ 9 files changed, 45 insertions(+), 23 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index cdae69be68e9..79dfef84c66c 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -145,6 +145,7 @@ static const struct { [BLK_STS_MEDIUM]= { -ENODATA, "critical medium" }, [BLK_STS_PROTECTION]= { -EILSEQ,"protection" }, [BLK_STS_RESOURCE] = { -ENOMEM,"kernel resource" }, + [BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" }, [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" }, /* device mapper special case, should not leak out: */ diff --git a/block/blk-mq.c b/block/blk-mq.c index 43e7449723e0..52a206e3777f 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx, return true; } +#define BLK_MQ_RESOURCE_DELAY 3 /* ms units */ + bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { @@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, struct request *rq, *nxt; bool no_tag = false; int errors, queued; + blk_status_t ret = BLK_STS_OK; if (list_empty(list)) return false; @@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, errors = queued = 0; do { struct blk_mq_queue_data bd; - blk_status_t ret; rq = list_first_entry(list, struct request, queuelist); if (!blk_mq_get_driver_tag(rq, &hctx, false)) { @@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, } ret = q->mq_ops->queue_rq(hctx, &bd); - if (ret == BLK_STS_RESOURCE) { + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { /* * If an I/O scheduler has been configured and we got a * driver tag for the next request already, free it @@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * that is wh
Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 1/30/18 7:24 AM, Mike Snitzer wrote: > From: Ming Lei > > This status is returned from driver to block layer if device related > resource is unavailable, but driver can guarantee that IO dispatch > will be triggered in future when the resource is available. > > Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver > returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after > a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is > 3 ms because both scsi-mq and nvmefc are using that magic value. > > If a driver can make sure there is in-flight IO, it is safe to return > BLK_STS_DEV_RESOURCE because: > > 1) If all in-flight IOs complete before examining SCHED_RESTART in > blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue > is run immediately in this case by blk_mq_dispatch_rq_list(); > > 2) if there is any in-flight IO after/when examining SCHED_RESTART > in blk_mq_dispatch_rq_list(): > - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) > - otherwise, this request will be dispatched after any in-flight IO is > completed via blk_mq_sched_restart() > > 3) if SCHED_RESTART is set concurently in context because of > BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two > cases and make sure IO hang can be avoided. > > One invariant is that queue will be rerun if SCHED_RESTART is set. This looks pretty good to me. I'm waffling a bit on whether to retain the current BLK_STS_RESOURCE behavior and name the new one something else, but I do like using the DEV name in there to signify the difference between a global and device resource. Just a few small nits below - can you roll a v6 with the changes? > diff --git a/block/blk-core.c b/block/blk-core.c > index cdae69be68e9..38279d4ae08b 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -145,6 +145,7 @@ static const struct { > [BLK_STS_MEDIUM]= { -ENODATA, "critical medium" }, > [BLK_STS_PROTECTION]= { -EILSEQ,"protection" }, > [BLK_STS_RESOURCE] = { -ENOMEM,"kernel resource" }, > + [BLK_STS_DEV_RESOURCE] = { -ENOMEM,"device resource" }, > [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" }, I think we should make BLK_STS_DEV_RESOURCE be -EBUSY, that more closely matches the result and makes it distinctly different than the global shortage that is STS_RESOURCE/ENOMEM. > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 43e7449723e0..e39b4e2a63db 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx > **hctx, > return true; > } > > +#define BLK_MQ_QUEUE_DELAY 3 /* ms units */ Make that BLK_MQ_RESOURCE_DELAY or something like that. The name is too generic right now. > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index 2d973ac54b09..f41d2057215f 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -39,6 +39,23 @@ typedef u8 __bitwise blk_status_t; > > #define BLK_STS_AGAIN((__force blk_status_t)12) > > +/* > + * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device > + * related resource is unavailable, but driver can guarantee that queue > + * will be rerun in future once the resource is available (whereby > + * dispatching requests). > + * > + * To safely return BLK_STS_DEV_RESOURCE, and allow forward progress, a > + * driver just needs to make sure there is in-flight IO. > + * > + * Difference with BLK_STS_RESOURCE: > + * If driver isn't sure if the queue will be rerun once device resource > + * is made available, please return BLK_STS_RESOURCE. For example: when > + * memory allocation, DMA Mapping or other system resource allocation > + * fails and IO can't be submitted to device. > + */ > +#define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13) I'd rephrase that as: BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if device related resource are unavailable, but the driver can guarantee that the queue will be rerun in the future once resources become available again. This is typically the case for device specific resources that are consumed for IO. If the driver fails allocating these resources, we know that inflight (or pending) IO will free these resource upon completion. This is different from BLK_STS_RESOURCE in that it explicitly references device specific resource. For resources of wider scope, allocation failure can happen without having pending IO. This means that we can't rely on request completions freeing these resources, as IO may not be in flight. Examples of that are kernel memory allocations, DMA mappings, or any other system wide resources. -- Jens Axboe
[PATCH] bcache: fix error count in memory shrink
From: Tang Junhui In bch_mca_scan(), nr btree nodes are expected to shrink, so the for(;;) loop need to satisfy the condition freed < nr. And since c->btree_cache_used always decrease after mca_data_free() calling in for(;;) loop, so we need a auto variable to record c->btree_cache_used before the for(;;) loop. Signed-off-by: Tang Junhui --- drivers/md/bcache/btree.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 81e8dc3..7e9ef3d 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -664,6 +664,7 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, struct btree *b, *t; unsigned long i, nr = sc->nr_to_scan; unsigned long freed = 0; + unsigned int btree_cache_used; if (c->shrinker_disabled) return SHRINK_STOP; @@ -700,7 +701,8 @@ static unsigned long bch_mca_scan(struct shrinker *shrink, } } - for (i = 0; (nr--) && i < c->btree_cache_used; i++) { + btree_cache_used = c->btree_cache_used; + for (i = 0; freed < nr && i < btree_cache_used; i++) { if (list_empty(&c->btree_cache)) goto out; -- 1.8.3.1
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 09:52:31AM -0800, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > > +* > > +* If driver returns BLK_STS_RESOURCE and SCHED_RESTART > > +* bit is set, run queue after a delay to avoid IO stalls > > +* that could otherwise occur if the queue is idle. > > */ > > - if (!blk_mq_sched_needs_restart(hctx) || > > + needs_restart = blk_mq_sched_needs_restart(hctx); > > + if (!needs_restart || > > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > > + else if (needs_restart && (ret == BLK_STS_RESOURCE)) > > + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code then > the request completion will trigger a call of blk_mq_sched_restart_hctx() > and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is > cleared before the above code tests it then the above code will schedule an > asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call Right. > triggered by the new queue run returns again BLK_STS_RESOURCE then the above > code will be executed again. In other words, a loop occurs. That loop will This patch doesn't change anything about that. When BLK_STS_RESOURCE is returned, this request is added to hctx->dispatch, next time, before dispatching this request, SCHED_RESTART is set and the loop is cut. > repeat as long as the described race occurs. The current (kernel v4.15) > block layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls that > function. Hence that loop cannot occur with the v4.15 block layer core and That way isn't safe, I have explained to you in the following link: https://marc.info/?l=dm-devel&m=151727454815018&w=2 > block drivers. A motivation of why that loop is preferred compared to the > current behavior (no loop) is missing. Does this mean that that loop is a > needless complication of the block layer core? No such loop as I explained above. > > Sorry but I still prefer the v4.15 block layer approach because this patch > has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need > some time to stabilize. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because only > the block driver authors can make a good choice for this constant. > - This patch makes block drivers harder to understand. Anyone who sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first > time will have to look up the meaning of these constants. An explicit > blk_mq_delay_run_hw_queue() call is easier to understand. > - This patch makes the blk-mq core harder to understand because of the > loop mentioned above. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? It does avoid the race I mentioned in the following link: https://marc.info/?l=dm-devel&m=151727454815018&w=2 More importantly, every driver need this change, if you have better idea to fix them all, please post a patch, then we can compare both. Thanks, Ming
[PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute
From: Hans Holmberg When pblk receives a sync, all data up to that point in the write buffer must be comitted to persistent storage, and as flash memory comes with a minimal write size there is a significant cost involved both in terms of time for completing the sync and in terms of write amplification padded sectors for filling up to the minimal write size. In order to get a better understanding of the costs involved for syncs, Add a sysfs attribute to pblk: padded_dist, showing a normalized distribution of sectors padded. In order to facilitate measurements of specific workloads during the lifetime of the pblk instance, the distribution can be reset by writing 0 to the attribute. Do this by introducing counters for each possible padding: {0..(minimal write size - 1)} and calculate the normalized distribution when showing the attribute. Signed-off-by: Hans Holmberg Signed-off-by: Javier González --- drivers/lightnvm/pblk-init.c | 16 -- drivers/lightnvm/pblk-rb.c| 15 +- drivers/lightnvm/pblk-sysfs.c | 68 +++ drivers/lightnvm/pblk.h | 6 +++- 4 files changed, 95 insertions(+), 10 deletions(-) diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index 7eedc5daebc8..a5e3510c0f38 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk) { pblk_luns_free(pblk); pblk_lines_free(pblk); + kfree(pblk->pad_dist); pblk_line_meta_free(pblk); pblk_core_free(pblk); pblk_l2p_free(pblk); @@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk, pblk->pad_rst_wa = 0; pblk->gc_rst_wa = 0; + atomic_long_set(&pblk->nr_flush, 0); + pblk->nr_flush_rst = 0; + #ifdef CONFIG_NVM_DEBUG atomic_long_set(&pblk->inflight_writes, 0); atomic_long_set(&pblk->padded_writes, 0); atomic_long_set(&pblk->padded_wb, 0); - atomic_long_set(&pblk->nr_flush, 0); atomic_long_set(&pblk->req_writes, 0); atomic_long_set(&pblk->sub_writes, 0); atomic_long_set(&pblk->sync_writes, 0); @@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk, goto fail_free_luns; } + pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t), +GFP_KERNEL); + if (!pblk->pad_dist) { + ret = -ENOMEM; + goto fail_free_line_meta; + } + ret = pblk_core_init(pblk); if (ret) { pr_err("pblk: could not initialize core\n"); - goto fail_free_line_meta; + goto fail_free_pad_dist; } ret = pblk_l2p_init(pblk); @@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct gendisk *tdisk, pblk_l2p_free(pblk); fail_free_core: pblk_core_free(pblk); +fail_free_pad_dist: + kfree(pblk->pad_dist); fail_free_line_meta: pblk_line_meta_free(pblk); fail_free_luns: diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c index 7044b5599cc4..264372d7b537 100644 --- a/drivers/lightnvm/pblk-rb.c +++ b/drivers/lightnvm/pblk-rb.c @@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, unsigned int nr_entries, if (bio->bi_opf & REQ_PREFLUSH) { struct pblk *pblk = container_of(rb, struct pblk, rwb); -#ifdef CONFIG_NVM_DEBUG atomic_long_inc(&pblk->nr_flush); -#endif if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem)) *io_ret = NVM_IO_OK; } @@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, struct nvm_rq *rqd, pr_err("pblk: could not pad page in write bio\n"); return NVM_IO_ERR; } + + if (pad < pblk->min_write_pgs) + atomic64_inc(&pblk->pad_dist[pad - 1]); + else + pr_warn("pblk: padding more than min. sectors\n"); + + atomic64_add(pad, &pblk->pad_wa); } - atomic64_add(pad, &((struct pblk *) - (container_of(rb, struct pblk, rwb)))->pad_wa); - #ifdef CONFIG_NVM_DEBUG - atomic_long_add(pad, &((struct pblk *) - (container_of(rb, struct pblk, rwb)))->padded_writes); + atomic_long_add(pad, &pblk->padded_writes); #endif return NVM_IO_OK; diff --git a/drivers/lightnvm/pblk-sysfs.c b/drivers/lightnvm/pblk-sysfs.c index 4804bbd32d5f..f902ac00d071 100644 --- a/drivers/lightnvm/pblk-sysfs.c +++ b/drivers/lightnvm/pblk-sysfs.c @@ -341,6 +341,38 @@ static ssize_t pblk_sysfs_get_write_amp_trip(struct pblk *pblk, char *page) atomic64_read(&pblk->pad_wa) - pblk->pad_rst_wa, page); } +static ssize_t pblk_sysfs_get_p
[PATCH 3/5] lightnvm: pblk: export write amplification counters to sysfs
From: Hans Holmberg In a SSD, write amplification, WA, is defined as the average number of page writes per user page write. Write amplification negatively affects write performance and decreases the lifetime of the disk, so it's a useful metric to add to sysfs. In plkb's case, the number of writes per user sector is the sum of: (1) number of user writes (2) number of sectors written by the garbage collector (3) number of sectors padded (i.e. due to syncs) This patch adds persistent counters for 1-3 and two sysfs attributes to export these along with WA calculated with five decimals: write_amp_mileage: the accumulated write amplification stats for the lifetime of the pblk instance write_amp_trip: resetable stats to facilitate delta measurements, values reset at creation and if 0 is written to the attribute. 64-bit counters are used as a 32 bit counter would wrap around already after about 17 TB worth of user data. It will take a long long time before the 64 bit sector counters wrap around. The counters are stored after the bad block bitmap in the first emeta sector of each written line. There is plenty of space in the first emeta sector, so we don't need to bump the major version of the line data format. Signed-off-by: Hans Holmberg Signed-off-by: Javier González --- drivers/lightnvm/pblk-cache.c| 4 ++ drivers/lightnvm/pblk-core.c | 6 +++ drivers/lightnvm/pblk-init.c | 11 - drivers/lightnvm/pblk-map.c | 2 + drivers/lightnvm/pblk-rb.c | 3 ++ drivers/lightnvm/pblk-recovery.c | 25 drivers/lightnvm/pblk-sysfs.c| 86 +++- drivers/lightnvm/pblk.h | 42 8 files changed, 169 insertions(+), 10 deletions(-) diff --git a/drivers/lightnvm/pblk-cache.c b/drivers/lightnvm/pblk-cache.c index 000fcad38136..29a23111b31c 100644 --- a/drivers/lightnvm/pblk-cache.c +++ b/drivers/lightnvm/pblk-cache.c @@ -63,6 +63,8 @@ int pblk_write_to_cache(struct pblk *pblk, struct bio *bio, unsigned long flags) bio_advance(bio, PBLK_EXPOSED_PAGE_SIZE); } + atomic64_add(nr_entries, &pblk->user_wa); + #ifdef CONFIG_NVM_DEBUG atomic_long_add(nr_entries, &pblk->inflight_writes); atomic_long_add(nr_entries, &pblk->req_writes); @@ -117,6 +119,8 @@ int pblk_write_gc_to_cache(struct pblk *pblk, struct pblk_gc_rq *gc_rq) WARN_ONCE(gc_rq->secs_to_gc != valid_entries, "pblk: inconsistent GC write\n"); + atomic64_add(valid_entries, &pblk->gc_wa); + #ifdef CONFIG_NVM_DEBUG atomic_long_add(valid_entries, &pblk->inflight_writes); atomic_long_add(valid_entries, &pblk->recov_gc_writes); diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 155e42a26293..22e61cd4f801 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -1630,11 +1630,16 @@ void pblk_line_close_meta(struct pblk *pblk, struct pblk_line *line) struct pblk_line_meta *lm = &pblk->lm; struct pblk_emeta *emeta = line->emeta; struct line_emeta *emeta_buf = emeta->buf; + struct wa_counters *wa = emeta_to_wa(lm, emeta_buf); /* No need for exact vsc value; avoid a big line lock and take aprox. */ memcpy(emeta_to_vsc(pblk, emeta_buf), l_mg->vsc_list, lm->vsc_list_len); memcpy(emeta_to_bb(emeta_buf), line->blk_bitmap, lm->blk_bitmap_len); + wa->user = cpu_to_le64(atomic64_read(&pblk->user_wa)); + wa->pad = cpu_to_le64(atomic64_read(&pblk->pad_wa)); + wa->gc = cpu_to_le64(atomic64_read(&pblk->gc_wa)); + emeta_buf->nr_valid_lbas = cpu_to_le64(line->nr_valid_lbas); emeta_buf->crc = cpu_to_le32(pblk_calc_emeta_crc(pblk, emeta_buf)); @@ -1837,6 +1842,7 @@ void pblk_update_map_dev(struct pblk *pblk, sector_t lba, #endif /* Invalidate and discard padded entries */ if (lba == ADDR_EMPTY) { + atomic64_inc(&pblk->pad_wa); #ifdef CONFIG_NVM_DEBUG atomic_long_inc(&pblk->padded_wb); #endif diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index 93d671ca518e..7eedc5daebc8 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -559,8 +559,8 @@ static unsigned int calc_emeta_len(struct pblk *pblk) /* Round to sector size so that lba_list starts on its own sector */ lm->emeta_sec[1] = DIV_ROUND_UP( - sizeof(struct line_emeta) + lm->blk_bitmap_len, - geo->sec_size); + sizeof(struct line_emeta) + lm->blk_bitmap_len + + sizeof(struct wa_counters), geo->sec_size); lm->emeta_len[1] = lm->emeta_sec[1] * geo->sec_size; /* Round to sector size so that vsc_list starts on its own sector */ @@ -991,6 +991,13 @@ static void *pblk_init(str
[PATCH 5/5] lightnvm: pblk: refactor bad block identification
In preparation for the OCSSD 2.0 spec. bad block identification, refactor the current code to generalize bad block get/set functions and structures. Signed-off-by: Javier González --- drivers/lightnvm/pblk-init.c | 213 +++ drivers/lightnvm/pblk.h | 6 -- 2 files changed, 112 insertions(+), 107 deletions(-) diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c index a5e3510c0f38..86a94a7faa96 100644 --- a/drivers/lightnvm/pblk-init.c +++ b/drivers/lightnvm/pblk-init.c @@ -365,7 +365,25 @@ static void pblk_luns_free(struct pblk *pblk) kfree(pblk->luns); } -static void pblk_free_line_bitmaps(struct pblk_line *line) +static void pblk_line_mg_free(struct pblk *pblk) +{ + struct pblk_line_mgmt *l_mg = &pblk->l_mg; + int i; + + kfree(l_mg->bb_template); + kfree(l_mg->bb_aux); + kfree(l_mg->vsc_list); + + for (i = 0; i < PBLK_DATA_LINES; i++) { + kfree(l_mg->sline_meta[i]); + pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type); + kfree(l_mg->eline_meta[i]); + } + + kfree(pblk->lines); +} + +static void pblk_line_meta_free(struct pblk_line *line) { kfree(line->blk_bitmap); kfree(line->erase_bitmap); @@ -382,40 +400,16 @@ static void pblk_lines_free(struct pblk *pblk) line = &pblk->lines[i]; pblk_line_free(pblk, line); - pblk_free_line_bitmaps(line); + pblk_line_meta_free(line); } spin_unlock(&l_mg->free_lock); } -static void pblk_line_meta_free(struct pblk *pblk) +static int pblk_bb_get_tbl(struct nvm_tgt_dev *dev, struct pblk_lun *rlun, + u8 *blks, int nr_blks) { - struct pblk_line_mgmt *l_mg = &pblk->l_mg; - int i; - - kfree(l_mg->bb_template); - kfree(l_mg->bb_aux); - kfree(l_mg->vsc_list); - - for (i = 0; i < PBLK_DATA_LINES; i++) { - kfree(l_mg->sline_meta[i]); - pblk_mfree(l_mg->eline_meta[i]->buf, l_mg->emeta_alloc_type); - kfree(l_mg->eline_meta[i]); - } - - kfree(pblk->lines); -} - -static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun) -{ - struct nvm_geo *geo = &dev->geo; struct ppa_addr ppa; - u8 *blks; - int nr_blks, ret; - - nr_blks = geo->nr_chks * geo->plane_mode; - blks = kmalloc(nr_blks, GFP_KERNEL); - if (!blks) - return -ENOMEM; + int ret; ppa.ppa = 0; ppa.g.ch = rlun->bppa.g.ch; @@ -423,34 +417,56 @@ static int pblk_bb_discovery(struct nvm_tgt_dev *dev, struct pblk_lun *rlun) ret = nvm_get_tgt_bb_tbl(dev, ppa, blks); if (ret) - goto out; + return ret; nr_blks = nvm_bb_tbl_fold(dev->parent, blks, nr_blks); - if (nr_blks < 0) { - ret = nr_blks; - goto out; - } - - rlun->bb_list = blks; + if (nr_blks < 0) + return -EIO; return 0; -out: - kfree(blks); - return ret; +} + +static void *pblk_bb_get_log(struct pblk *pblk) +{ + struct nvm_tgt_dev *dev = pblk->dev; + struct nvm_geo *geo = &dev->geo; + u8 *log; + int i, nr_blks, blk_per_lun; + int ret; + + blk_per_lun = geo->nr_chks * geo->plane_mode; + nr_blks = blk_per_lun * geo->all_luns; + + log = kmalloc(nr_blks, GFP_KERNEL); + if (!log) + return ERR_PTR(-ENOMEM); + + for (i = 0; i < geo->all_luns; i++) { + struct pblk_lun *rlun = &pblk->luns[i]; + u8 *log_pos = log + i * blk_per_lun; + + ret = pblk_bb_get_tbl(dev, rlun, log_pos, blk_per_lun); + if (ret) { + kfree(log); + return ERR_PTR(-EIO); + } + } + + return log; } static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line, - int blk_per_line) + u8 *bb_log, int blk_per_line) { struct nvm_tgt_dev *dev = pblk->dev; struct nvm_geo *geo = &dev->geo; - struct pblk_lun *rlun; - int bb_cnt = 0; - int i; + int i, bb_cnt = 0; for (i = 0; i < blk_per_line; i++) { - rlun = &pblk->luns[i]; - if (rlun->bb_list[line->id] == NVM_BLK_T_FREE) + struct pblk_lun *rlun = &pblk->luns[i]; + u8 *lun_bb_log = bb_log + i * blk_per_line; + + if (lun_bb_log[line->id] == NVM_BLK_T_FREE) continue; set_bit(pblk_ppa_to_pos(geo, rlun->bppa), line->blk_bitmap); @@ -460,29 +476,12 @@ static int pblk_bb_line(struct pblk *pblk, struct pblk_line *line, return bb_cnt; } -static int pblk_alloc_line_bitmaps(struct pblk *pblk, struct pblk_line *line) -{ - struct pblk_line_meta
[PATCH 2/5] lightnvm: pblk: check data lines version on recovery
From: Hans Holmberg As a preparation for future bumps of data line persistent storage versions, we need to start checking the emeta line version during recovery. Also slit up the current emeta/smeta version into two bytes (major,minor). Recovering lines with the same major number as the current pblk data line version must succeed. This means that any changes in the persistent format must be: (1) Backward compatible: if we switch back to and older kernel, recovery of lines stored with major == current_major and minor > current_minor must succeed. (2) Forward compatible: switching to a newer kernel, recovery of lines stored with major=current_major and minor < minor must handle the data format differences gracefully(i.e. initialize new data structures to default values). If we detect lines that have a different major number than the current we must abort recovery. The user must manually migrate the data in this case. Previously the version stored in the emeta header was copied from smeta, which has version 1, so we need to set the minor version to 1. Signed-off-by: Hans Holmberg Signed-off-by: Javier González --- drivers/lightnvm/pblk-core.c | 9 - drivers/lightnvm/pblk-recovery.c | 26 -- drivers/lightnvm/pblk.h | 16 ++-- 3 files changed, 46 insertions(+), 5 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 9027cf2ed1d8..155e42a26293 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -975,7 +975,8 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line, memcpy(smeta_buf->header.uuid, pblk->instance_uuid, 16); smeta_buf->header.id = cpu_to_le32(line->id); smeta_buf->header.type = cpu_to_le16(line->type); - smeta_buf->header.version = SMETA_VERSION; + smeta_buf->header.version_major = SMETA_VERSION_MAJOR; + smeta_buf->header.version_minor = SMETA_VERSION_MINOR; /* Start metadata */ smeta_buf->seq_nr = cpu_to_le64(line->seq_nr); @@ -998,6 +999,12 @@ static int pblk_line_init_metadata(struct pblk *pblk, struct pblk_line *line, /* End metadata */ memcpy(&emeta_buf->header, &smeta_buf->header, sizeof(struct line_header)); + + emeta_buf->header.version_major = EMETA_VERSION_MAJOR; + emeta_buf->header.version_minor = EMETA_VERSION_MINOR; + emeta_buf->header.crc = cpu_to_le32( + pblk_calc_meta_header_crc(pblk, &emeta_buf->header)); + emeta_buf->seq_nr = cpu_to_le64(line->seq_nr); emeta_buf->nr_lbas = cpu_to_le64(line->sec_in_line); emeta_buf->nr_valid_lbas = cpu_to_le64(0); diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c index 1d5e961bf5e0..a30fe203d454 100644 --- a/drivers/lightnvm/pblk-recovery.c +++ b/drivers/lightnvm/pblk-recovery.c @@ -826,6 +826,25 @@ static u64 pblk_line_emeta_start(struct pblk *pblk, struct pblk_line *line) return emeta_start; } +static int pblk_recov_check_line_version(struct pblk *pblk, +struct line_emeta *emeta) +{ + struct line_header *header = &emeta->header; + + if (header->version_major != EMETA_VERSION_MAJOR) { + pr_err("pblk: line major version mismatch: %d, expected: %d\n", + header->version_major, EMETA_VERSION_MAJOR); + return 1; + } + +#ifdef NVM_DEBUG + if (header->version_minor > EMETA_VERSION_MINOR) + pr_info("pblk: newer line minor version found: %d\n", line_v); +#endif + + return 0; +} + struct pblk_line *pblk_recov_l2p(struct pblk *pblk) { struct pblk_line_meta *lm = &pblk->lm; @@ -873,9 +892,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk) if (le32_to_cpu(smeta_buf->header.identifier) != PBLK_MAGIC) continue; - if (smeta_buf->header.version != SMETA_VERSION) { + if (smeta_buf->header.version_major != SMETA_VERSION_MAJOR) { pr_err("pblk: found incompatible line version %u\n", - le16_to_cpu(smeta_buf->header.version)); + smeta_buf->header.version_major); return ERR_PTR(-EINVAL); } @@ -943,6 +962,9 @@ struct pblk_line *pblk_recov_l2p(struct pblk *pblk) goto next; } + if (pblk_recov_check_line_version(pblk, line->emeta->buf)) + return ERR_PTR(-EINVAL); + if (pblk_recov_l2p_from_emeta(pblk, line)) pblk_recov_l2p_from_oob(pblk, line); diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h index 8c357fb6538e..fae2526f80b2 100644 --- a/drivers/lightnvm/pblk.h +++ b/drivers/light
[PATCH 1/5] lightnvm: pblk: handle bad sectors in the emeta area correctly
From: Hans Holmberg Unless we check if there are bad sectors in the entire emeta-area we risk ending up with valid bitmap / available sector count inconsistency. This results in lines with a bad chunk at the last LUN marked as bad, so go through the whole emeta area and mark up the invalid sectors. Signed-off-by: Hans Holmberg --- drivers/lightnvm/pblk-core.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c index 0487b9340c1d..9027cf2ed1d8 100644 --- a/drivers/lightnvm/pblk-core.c +++ b/drivers/lightnvm/pblk-core.c @@ -1021,6 +1021,7 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line, int nr_bb = 0; u64 off; int bit = -1; + int emeta_secs; line->sec_in_line = lm->sec_per_line; @@ -1055,18 +1056,18 @@ static int pblk_line_init_bb(struct pblk *pblk, struct pblk_line *line, /* Mark emeta metadata sectors as bad sectors. We need to consider bad * blocks to make sure that there are enough sectors to store emeta */ - off = lm->sec_per_line - lm->emeta_sec[0]; - bitmap_set(line->invalid_bitmap, off, lm->emeta_sec[0]); - while (nr_bb) { + emeta_secs = lm->emeta_sec[0]; + off = lm->sec_per_line; + while (emeta_secs) { off -= geo->sec_per_pl; if (!test_bit(off, line->invalid_bitmap)) { bitmap_set(line->invalid_bitmap, off, geo->sec_per_pl); - nr_bb--; + emeta_secs -= geo->sec_per_pl; } } - line->sec_in_line -= lm->emeta_sec[0]; line->emeta_ssec = off; + line->sec_in_line -= lm->emeta_sec[0]; line->nr_valid_lbas = 0; line->left_msecs = line->sec_in_line; *line->vsc = cpu_to_le32(line->sec_in_line); -- 2.7.4
Re: [PATCH 2/2] lightnvm: remove multiple groups in 1.2 data structure
> On 30 Jan 2018, at 21.26, Matias Bjørling wrote: > > Only one id group from the 1.2 specification is supported. Make > sure that only the first group is accessible. > > Signed-off-by: Matias Bjørling > --- > drivers/nvme/host/lightnvm.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > Same as before. Even though current implementation only uses one group, the 1.2 specification relies on 4. At least, the identify structure should reflect this. Javier signature.asc Description: Message signed with OpenPGP
Re: [PATCH 1/2] lightnvm: remove mlc pairs structure
> On 30 Jan 2018, at 21.26, Matias Bjørling wrote: > > The known implementations of the 1.2 specification, and upcoming 2.0 > implementation all expose a sequential list of pages to write. > Remove the data structure, as it is no longer needed. > > Signed-off-by: Matias Bjørling > --- > drivers/nvme/host/lightnvm.c | 14 +- > 1 file changed, 1 insertion(+), 13 deletions(-) Even though the current implementations does not use the MLC pairing information, this is part of the on the 1.2 identification. Until we eventually remove 1.2 support (if we do), the identify structure should reflect the specification as is. Javier signature.asc Description: Message signed with OpenPGP
Re: [PATCH] lightnvm: remove chnl_offset in nvme_nvm_identity
> On 30 Jan 2018, at 22.30, Matias Bjørling wrote: > > The identity structure is initialized to zero in the beginning of > the nvme_nvm_identity function. The chnl_offset is separately set to > zero. Since both the variable and assignment is never changed, remove > them. > > Signed-off-by: Matias Bjørling > --- > drivers/nvme/host/lightnvm.c | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > Looks good. Reviewed-by: Javier González signature.asc Description: Message signed with OpenPGP
Re: [RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs
Hi Bart, Thanks very much for the quick response. On 18/1/31 05:19, Bart Van Assche wrote: > On Tue, 2018-01-30 at 19:21 +0800, Joseph Qi wrote: >> Hi Jens and Folks, >> >> Recently we've gotten a hard LOCKUP issue. After investigating the issue >> we've found a race between blk_init_queue_node and blkcg_print_blkgs. >> The race is described below. >> >> blk_init_queue_node blkcg_print_blkgs >> blk_alloc_queue_node (1) >> q->queue_lock = &q->__queue_lock (2) >> blkcg_init_queue(q) (3) >> spin_lock_irq(blkg->q->queue_lock) (4) >> q->queue_lock = lock (5) >> spin_unlock_irq(blkg->q->queue_lock) (6) >> >> (1) allocate an uninitialized queue; >> (2) initialize queue_lock to its default internal lock; >> (3) initialize blkcg part of request queue, which will create blkg and >> then insert it to blkg_list; >> (4) traverse blkg_list and find the created blkg, and then take its >> queue lock, here it is the default *internal lock*; >> (5) *race window*, now queue_lock is overridden with *driver specified >> lock*; >> (6) now unlock *driver specified lock*, not the locked *internal lock*, >> unlock balance breaks. >> >> For the issue above, I think blkcg_init_queue is a bit earlier. We >> can't allow a further use before request queue is fully initialized. >> Since blk_init_queue_node is a really common path and it allows driver >> to override the default internal lock, I'm afraid several other places >> may also have the same issue. >> Am I missing something here? > > What code triggered the blkcg_print_blkgs() call? Was it perhaps a function > related to throttling? If so, I think there are two possible ways to fix this > race: Yes, it is from block throttle. > 1. Moving the .queue_lock initialization from blk_init_queue_node() into >blk_alloc_queue_node() such that it happens before the blkcg_init_queue() >call. This however will require a tree-wide change of the blk_alloc_queue() >and all blk_alloc_queue_node() calls in all block drivers. > 2. Splitting the blk_throtl_init() function such that the > blkcg_activate_policy() >call can occur after the .queue_lock pointer has been initialized, e.g. > from >inside blk_init_allocated_queue() or from inside blk_register_queue() > instead >of from inside blkcg_init_queue(). > At the first glance, I'm thinking of moving the blkcg_init_queue after the queue_lock is overridden. But I don't have an idea yet to avoid the risk during cleanup. Since the initialization of request queue is so fundamental, I'm not sure if there is the same risk in several other places. Thanks, Joseph > I'm not sure what the best approach is. Maybe there is even a third approach > that is better than the two approaches mentioned above. > > Bart. >
Re: BUG: unable to handle kernel NULL pointer dereference in blk_throtl_update_limit_valid
On Tue, Dec 19, 2017 at 06:42:00AM -0800, syzbot wrote: > Hello, > > syzkaller hit the following crash on > 6084b576dca2e898f5c101baef151f7bfdbb606d > git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/master > compiler: gcc (GCC) 7.1.1 20170620 > .config is attached > Raw console output is attached. > > Unfortunately, I don't have any reproducer for this bug yet. > > > netlink: 14 bytes leftover after parsing attributes in process > `syz-executor6'. > BUG: unable to handle kernel NULL pointer dereference at 0098 > IP: blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0 > block/blk-throttle.c:580 > PGD 213563067 P4D 213563067 PUD 214f2a067 PMD 0 > Oops: [#1] SMP > Dumping ftrace buffer: >(ftrace buffer empty) > Modules linked in: > CPU: 1 PID: 3445 Comm: kworker/1:3 Not tainted 4.15.0-rc3-next-20171214+ #67 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Workqueue: events __blk_release_queue > RIP: 0010:blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0 > block/blk-throttle.c:580 > RSP: 0018:c90002e13d30 EFLAGS: 00010093 > RAX: 8801e125c600 RBX: 8801df611c00 RCX: 81714e8a > RDX: RSI: d36b0597 RDI: 0086 > RBP: c90002e13d68 R08: 0001 R09: 000a > R10: c90002e13cb8 R11: 000a R12: > R13: 8801dfdcd808 R14: 817178f0 R15: 0098 > FS: () GS:88021fd0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0098 CR3: 000213784000 CR4: 001426e0 > DR0: 2000 DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0600 > Call Trace: > throtl_pd_offline+0x4e/0x80 block/blk-throttle.c:602 > blkg_destroy+0x93/0x270 block/blk-cgroup.c:327 > blkg_destroy_all+0x69/0xe0 block/blk-cgroup.c:372 > blkcg_exit_queue+0x21/0x40 block/blk-cgroup.c:1202 > __blk_release_queue+0x54/0x180 block/blk-sysfs.c:802 > process_one_work+0x288/0x7a0 kernel/workqueue.c:2112 > worker_thread+0x43/0x4d0 kernel/workqueue.c:2246 > kthread+0x149/0x170 kernel/kthread.c:238 > ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:524 > Code: c7 40 07 08 83 e8 c7 15 b1 ff e8 52 85 b3 ff 85 c0 59 74 12 e8 c8 54 > ba ff 80 3d aa f0 a9 01 00 0f 84 fd 01 00 00 e8 b6 54 ba ff <49> 8b 07 31 ff > 48 8b 80 c8 06 00 00 48 8b 70 28 e8 e1 d9 b7 ff > RIP: blk_throtl_update_limit_valid.isra.19+0x6a/0x2e0 > block/blk-throttle.c:580 RSP: c90002e13d30 > CR2: 0098 > ---[ end trace b30dc449e3987ceb ]--- > Kernel panic - not syncing: Fatal exception > Dumping ftrace buffer: >(ftrace buffer empty) > Kernel Offset: disabled > Rebooting in 86400 seconds.. Invalidating this bug since it hasn't been seen again, and it was reported while KASAN was accidentally disabled in the syzbot kconfig due to a change to the kconfig menus in linux-next (so this crash was possibly caused by slab corruption elsewhere). #syz invalid
Re: [RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs
On Tue, 2018-01-30 at 19:21 +0800, Joseph Qi wrote: > Hi Jens and Folks, > > Recently we've gotten a hard LOCKUP issue. After investigating the issue > we've found a race between blk_init_queue_node and blkcg_print_blkgs. > The race is described below. > > blk_init_queue_node blkcg_print_blkgs > blk_alloc_queue_node (1) > q->queue_lock = &q->__queue_lock (2) > blkcg_init_queue(q) (3) > spin_lock_irq(blkg->q->queue_lock) (4) > q->queue_lock = lock (5) > spin_unlock_irq(blkg->q->queue_lock) (6) > > (1) allocate an uninitialized queue; > (2) initialize queue_lock to its default internal lock; > (3) initialize blkcg part of request queue, which will create blkg and > then insert it to blkg_list; > (4) traverse blkg_list and find the created blkg, and then take its > queue lock, here it is the default *internal lock*; > (5) *race window*, now queue_lock is overridden with *driver specified > lock*; > (6) now unlock *driver specified lock*, not the locked *internal lock*, > unlock balance breaks. > > For the issue above, I think blkcg_init_queue is a bit earlier. We > can't allow a further use before request queue is fully initialized. > Since blk_init_queue_node is a really common path and it allows driver > to override the default internal lock, I'm afraid several other places > may also have the same issue. > Am I missing something here? What code triggered the blkcg_print_blkgs() call? Was it perhaps a function related to throttling? If so, I think there are two possible ways to fix this race: 1. Moving the .queue_lock initialization from blk_init_queue_node() into blk_alloc_queue_node() such that it happens before the blkcg_init_queue() call. This however will require a tree-wide change of the blk_alloc_queue() and all blk_alloc_queue_node() calls in all block drivers. 2. Splitting the blk_throtl_init() function such that the blkcg_activate_policy() call can occur after the .queue_lock pointer has been initialized, e.g. from inside blk_init_allocated_queue() or from inside blk_register_queue() instead of from inside blkcg_init_queue(). I'm not sure what the best approach is. Maybe there is even a third approach that is better than the two approaches mentioned above. Bart.
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
On 1/30/18 1:49 PM, Keith Busch wrote: > On Tue, Jan 30, 2018 at 01:32:25PM -0700, Jens Axboe wrote: >> On 1/30/18 1:30 PM, Keith Busch wrote: >>> On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote: Looking at the disassembly, 'n' is 2 and 'segments' is 0x. >>> >>> Is this still a problem if you don't use an IO scheduler? With deadline, >>> I'm not finding any path to bio_attempt_discard_merge which is where the >>> nr_phys_segments is supposed to get it set to 2. Not sure how it could >>> becmoe 0x, though. >> >> blk_mq_make_request() -> blk_mq_sched_bio_merge() -> >> __blk_mq_sched_bio_merge() >> -> blk_mq_attempt_merge() -> bio_attempt_discard_merge() > > That's the calls only if you don't have an elevator_queue, right? With > deadline, it looks like it goes through this path (ftrace confirms): > > __blk_mq_sched_bio_merge() -> dd_bio_merge() -> blk_mq_sched_try_merge() > > Which doesn't have a case for ELEVATOR_DISCARD_MERGE. > > Relavant function_graph: > > 46) |blk_mq_make_request() { > 46) 0.133 us| blk_queue_bounce(); > 46) 0.370 us| blk_queue_split(); > 46) 0.314 us| bio_integrity_prep(); > 46) 0.081 us| blk_attempt_plug_merge(); > 46) | __blk_mq_sched_bio_merge() { > 46) |dd_bio_merge() { > 46) 0.792 us| _raw_spin_lock(); > 46) | blk_mq_sched_try_merge() { Yeah I guess you are right, it can't happen for mq-deadline since the generic sched path doesn't support it. I'll see if I can make it happen again, but I'm not too hopeful. And if I can't, it's hard to compare with "none" to see if it makes a difference or not. -- Jens Axboe
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
On Tue, Jan 30, 2018 at 01:32:25PM -0700, Jens Axboe wrote: > On 1/30/18 1:30 PM, Keith Busch wrote: > > On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote: > >> > >> Looking at the disassembly, 'n' is 2 and 'segments' is 0x. > > > > Is this still a problem if you don't use an IO scheduler? With deadline, > > I'm not finding any path to bio_attempt_discard_merge which is where the > > nr_phys_segments is supposed to get it set to 2. Not sure how it could > > becmoe 0x, though. > > blk_mq_make_request() -> blk_mq_sched_bio_merge() -> > __blk_mq_sched_bio_merge() > -> blk_mq_attempt_merge() -> bio_attempt_discard_merge() That's the calls only if you don't have an elevator_queue, right? With deadline, it looks like it goes through this path (ftrace confirms): __blk_mq_sched_bio_merge() -> dd_bio_merge() -> blk_mq_sched_try_merge() Which doesn't have a case for ELEVATOR_DISCARD_MERGE. Relavant function_graph: 46) |blk_mq_make_request() { 46) 0.133 us| blk_queue_bounce(); 46) 0.370 us| blk_queue_split(); 46) 0.314 us| bio_integrity_prep(); 46) 0.081 us| blk_attempt_plug_merge(); 46) | __blk_mq_sched_bio_merge() { 46) |dd_bio_merge() { 46) 0.792 us| _raw_spin_lock(); 46) | blk_mq_sched_try_merge() {
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
On 1/30/18 1:30 PM, Keith Busch wrote: > On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote: >> >> Looking at the disassembly, 'n' is 2 and 'segments' is 0x. > > Is this still a problem if you don't use an IO scheduler? With deadline, > I'm not finding any path to bio_attempt_discard_merge which is where the > nr_phys_segments is supposed to get it set to 2. Not sure how it could > becmoe 0x, though. blk_mq_make_request() -> blk_mq_sched_bio_merge() -> __blk_mq_sched_bio_merge() -> blk_mq_attempt_merge() -> bio_attempt_discard_merge() Doesn't matter what IO scheduler you use. I don't know if it triggers without a scheduler. I've been running this code continually on the laptop (always do), and haven't seen it before today. -- Jens Axboe
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
On Tue, Jan 30, 2018 at 08:57:49AM -0700, Jens Axboe wrote: > > Looking at the disassembly, 'n' is 2 and 'segments' is 0x. Is this still a problem if you don't use an IO scheduler? With deadline, I'm not finding any path to bio_attempt_discard_merge which is where the nr_phys_segments is supposed to get it set to 2. Not sure how it could becmoe 0x, though.
Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30 2018 at 2:42pm -0500, Bart Van Assche wrote: > On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote: > > On Tue, Jan 30 2018 at 12:52pm -0500, > > Bart Van Assche wrote: > > > > > - This patch does not fix any bugs nor makes block drivers easier to > > > read or to implement. So why is this patch considered useful? > > > > It enables blk-mq core to own the problem that individual drivers should > > have no business needing to worry about. Period. > > Thanks for having confirmed that this patch is an API-change only and does > not fix any bugs. No, it is an API change that enables blk-mq drivers to make forward progress without compromising the potential benefits associated with blk-mq's SCHED_RESTART functionality. (If we're going to beat this horse to death it might as well be with precision)
Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote: > On Tue, Jan 30 2018 at 12:52pm -0500, > Bart Van Assche wrote: > > > - This patch does not fix any bugs nor makes block drivers easier to > > read or to implement. So why is this patch considered useful? > > It enables blk-mq core to own the problem that individual drivers should > have no business needing to worry about. Period. Thanks for having confirmed that this patch is an API-change only and does not fix any bugs. Bart.
Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30 2018 at 12:52pm -0500, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > >+ * > >+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART > >+ * bit is set, run queue after a delay to avoid IO stalls > >+ * that could otherwise occur if the queue is idle. > > */ > >-if (!blk_mq_sched_needs_restart(hctx) || > >+needs_restart = blk_mq_sched_needs_restart(hctx); > >+if (!needs_restart || > > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > >+else if (needs_restart && (ret == BLK_STS_RESOURCE)) > >+blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above > code tests it then the above code will schedule an asynchronous call > of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the > new queue run returns again BLK_STS_RESOURCE then the above code > will be executed again. In other words, a loop occurs. That loop > will repeat as long as the described race occurs. The current > (kernel v4.15) block layer behavior is simpler: only block drivers > call blk_mq_delay_run_hw_queue() and the block layer core never > calls that function. Hence that loop cannot occur with the v4.15 > block layer core and block drivers. A motivation of why that loop is > preferred compared to the current behavior (no loop) is missing. > Does this mean that that loop is a needless complication of the > block layer core? No it means the loop is an internal blk-mq concern. And that drivers needn't worry about kicking the queue. And it affords blk-mq latitude to change how it responds to BLK_STS_RESOURCE in the future (without needing to change every driver). But even v4.15 has a similar loop. It just happens to extend into the .queue_rq() where the driver is completely blind to SCHED_RESTART. And the driver may just repeatedly kick the queue after a delay (via blk_mq_delay_run_hw_queue). This cycle should be a very rare occurrence regardless of which approach is taken (V5 vs 4.15). The fact that you engineered your SRP initiator and target code to pathologically trigger this worst case (via target_can_queue) doesn't mean it is the fast path for a properly configured and functioning storage network. > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and need > some time to stabilize. The number of blk-mq API changes that have occurred since blk-mq was introduced is a very long list. Seems contrived to make this the one that is breaking the camel's back. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because only > the block driver authors can make a good choice for this constant. Unsubstantiated. 3ms (scsi-mq, nvmefc) vs 100ms (dm-mq mpath): which is better? Pretty sure if the underlying storage network is 1) performant 2) properly configured -- then a shorter delay is preferable. > - This patch makes block drivers harder to understand. Anyone who sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first > time will have to look up the meaning of these constants. An explicit > blk_mq_delay_run_hw_queue() call is easier to understand. No it doesn't make blk-mq harder to understand. But even if it did it actually acknowledges that there is existing blk-mq SCHED_RESTART heuristic for how to deal with the need for blk-mq to back-off in the face of BLK_STS_RESOURCE returns. By just having each blk-mq driver blindly kick the queue you're actively ignoring, and defeating, that entire design element of blk-mq (SCHED_RESTART). It is an instance where blk-mq can and does know better. Acknowledging that fact is moving blk-mq in a better direction. > - This patch makes the blk-mq core harder to understand because of the > loop mentioned above. You've said your peace. But you've taken on this campaign to undermine this line of development with such passion that we're now in a place where Jens is shell-shocked by all the repeat campaigning and noise. Bart you keep saying the same things over and over. Yet you cannot show the patch to actively be a problem with testing-based detail. Seems you'd rather refuse to even test it than open yourself up to the possibility that this concern of yours has been making a mountain out of a mole hill. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered usefu
Re: [dm-devel] [LSF/MM TOPIC] block, dm: restack queue_limits
On Tue, 2018-01-30 at 16:07 +0100, Hannes Reinecke wrote: > On 01/29/2018 10:08 PM, Mike Snitzer wrote: > > We currently don't restack the queue_limits if the lowest, or > > intermediate, layer of an IO stack changes. > > > > This is particularly unfortunate in the case of FLUSH/FUA which may > > change if/when a HW controller's BBU fails; whereby requiring the device > > advertise that it has a volatile write cache (WCE=1). > > > Uh-oh. Device rescan. > Would be a valid topic on its own... > > > But in the context of DM, really it'd be best if the entire stack of > > devices had their limits restacked if any underlying layer's limits > > change. > > > > In the past, Martin and I discussed that we should "just do it" but > > never did. Not sure we need a lengthy discussion but figured I'd put it > > out there. > > > > Maybe I'll find time, between now and April, to try implementing it. > > > For SCSI the device capabilities are pretty much set in stone after the > initial scan; there are hooks for rescanning, but they will only work > half of the time. > Plus we can't really change the device type on the fly (eg if the SCSI > device type changes; if it moves away from '0' we would need to unbind > the sd driver, and if it moves to '0' we'll need to rescan the sd > device. None of this is happening right now.) > > So I'd be glad to have a discussion around this. At least array vendor has also desired the ability to change various attributes of the device after the initial scan, such as the model name. Not sure what would break if we did this, since who knows what userspace software might be caching this info, but... -Ewan
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, 2018-01-30 at 09:52 -0800, Bart Van Assche wrote: > On 01/30/18 06:24, Mike Snitzer wrote: > > + * > > + * If driver returns BLK_STS_RESOURCE and > > SCHED_RESTART > > + * bit is set, run queue after a delay to avoid IO > > stalls > > + * that could otherwise occur if the queue is > > idle. > > */ > > - if (!blk_mq_sched_needs_restart(hctx) || > > + needs_restart = blk_mq_sched_needs_restart(hctx); > > + if (!needs_restart || > > (no_tag && list_empty_careful(&hctx- > > >dispatch_wait.entry))) > > blk_mq_run_hw_queue(hctx, true); > > + else if (needs_restart && (ret == > > BLK_STS_RESOURCE)) > > + blk_mq_delay_run_hw_queue(hctx, > > BLK_MQ_QUEUE_DELAY); > > } > > If a request completes concurrently with execution of the above code > then the request completion will trigger a call of > blk_mq_sched_restart_hctx() and that call will clear the > BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above > code > tests it then the above code will schedule an asynchronous call of > __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the > new > queue run returns again BLK_STS_RESOURCE then the above code will be > executed again. In other words, a loop occurs. That loop will repeat > as > long as the described race occurs. The current (kernel v4.15) block > layer behavior is simpler: only block drivers call > blk_mq_delay_run_hw_queue() and the block layer core never calls > that > function. Hence that loop cannot occur with the v4.15 block layer > core > and block drivers. A motivation of why that loop is preferred > compared > to the current behavior (no loop) is missing. Does this mean that > that > loop is a needless complication of the block layer core? > > Sorry but I still prefer the v4.15 block layer approach because this > patch has in my view the following disadvantages: > - It involves a blk-mq API change. API changes are always risky and > need > some time to stabilize. > - The delay after which to rerun the queue is moved from block layer > drivers into the block layer core. I think that's wrong because > only > the block driver authors can make a good choice for this constant. > - This patch makes block drivers harder to understand. Anyone who > sees > return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the > first > time will have to look up the meaning of these constants. An > explicit > blk_mq_delay_run_hw_queue() call is easier to understand. > - This patch makes the blk-mq core harder to understand because of > the > loop mentioned above. > - This patch does not fix any bugs nor makes block drivers easier to > read or to implement. So why is this patch considered useful? > > Thanks, > > Bart. Hello Bart What is the performance implication of your method versus this patch above. Is there more of a delay in your approach or in Ming's approach from your own testing. I guess it seems we will never get consensus on this so is it time to take a vote. I respect and trust your inputs, you know that, but are you perhaps prepared to accept the approach above and agree to it and if it turns out to expose more issues it can be addressed later. Is that not a better way to progress this because to me it looks like you and Ming will continue to disagree on which is the better approach. With much respect Laurence
Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
On 01/30/18 06:24, Mike Snitzer wrote: +* +* If driver returns BLK_STS_RESOURCE and SCHED_RESTART +* bit is set, run queue after a delay to avoid IO stalls +* that could otherwise occur if the queue is idle. */ - if (!blk_mq_sched_needs_restart(hctx) || + needs_restart = blk_mq_sched_needs_restart(hctx); + if (!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) blk_mq_run_hw_queue(hctx, true); + else if (needs_restart && (ret == BLK_STS_RESOURCE)) + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY); } If a request completes concurrently with execution of the above code then the request completion will trigger a call of blk_mq_sched_restart_hctx() and that call will clear the BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above code tests it then the above code will schedule an asynchronous call of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the new queue run returns again BLK_STS_RESOURCE then the above code will be executed again. In other words, a loop occurs. That loop will repeat as long as the described race occurs. The current (kernel v4.15) block layer behavior is simpler: only block drivers call blk_mq_delay_run_hw_queue() and the block layer core never calls that function. Hence that loop cannot occur with the v4.15 block layer core and block drivers. A motivation of why that loop is preferred compared to the current behavior (no loop) is missing. Does this mean that that loop is a needless complication of the block layer core? Sorry but I still prefer the v4.15 block layer approach because this patch has in my view the following disadvantages: - It involves a blk-mq API change. API changes are always risky and need some time to stabilize. - The delay after which to rerun the queue is moved from block layer drivers into the block layer core. I think that's wrong because only the block driver authors can make a good choice for this constant. - This patch makes block drivers harder to understand. Anyone who sees return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first time will have to look up the meaning of these constants. An explicit blk_mq_delay_run_hw_queue() call is easier to understand. - This patch makes the blk-mq core harder to understand because of the loop mentioned above. - This patch does not fix any bugs nor makes block drivers easier to read or to implement. So why is this patch considered useful? Thanks, Bart.
Re: [LSF/MM TOPIC] De-clustered RAID with MD
On 30/01/18 11:24, NeilBrown wrote: On Tue, Jan 30 2018, Wols Lists wrote: On 29/01/18 21:50, NeilBrown wrote: By doing declustered parity you can sanely do raid6 on 100 drives, using a logical stripe size that is much smaller than 100. When recovering a single drive, the 10-groups-of-10 would put heavy load on 9 other drives, while the decluster approach puts light load on 99 other drives. No matter how clever md is at throttling recovery, I would still rather distribute the load so that md has an easier job. Not offering to do it ... :-) But that sounds a bit like linux raid-10. Could a simple approach be to do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two parity, striped across 100 drives? Okay, it's not as good as the decluster approach, but it would spread the stress of a rebuild across 20 drives, not 10. And probably be fairly easy to implement. If you did that, I think you would be about 80% of the way to fully declustered-parity RAID. If you then tweak the math a bit so that one stripe would was A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 and the next A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 and then A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 XX When Ax are a logical stripe and Bx are the next, then you have a slightly better distribution. If device XX fails then the reads needed for the first stripe mostly come from different drives than those for the second stripe, which are mostly different again for the 3rd stripe. Presumably the CRUSH algorithm (which I only skim-read once about a year ago) formalizes how to do this, and does it better. Once you have the data handling in place for your proposal, it should be little more than replacing a couple of calculations to get the full solution. Okay. I think I have it - a definition for raid-16 (or is it raid-61). But I need a bit of help with the maths. And it might need a look-up table :-( Okay. Like raid-10, raid-16 would be spec'd as "--level 16,3,8", ie 3 mirrors, emulating an 8-drive raid-6. What I'm looking for is a PRNG that has the "bug" that it repeats over a short period, and over that period is guaranteed to repeat each number the same number of times. I saw a wonderful video demonstration of this years ago - if you plot the generated number against the number of times it was generated, after a few rows it "filled up" a rectangle on the graph. At which point the maths becomes very simple. I just need at least as many real drives as "mirrors times emulated". If somebody can come up with the algorithm I want, I can spec it, and then maybe someone can implement it? It'll be fun testing - I'll need my new machine when I get it working :-) Cheers, Wol
Re: [PATCH v6 1/2] Return bytes transferred for partial direct I/O
On 01/29/2018 01:04 PM, Randy Dunlap wrote: > On 01/29/2018 06:57 AM, Goldwyn Rodrigues wrote: >> From: Goldwyn Rodrigues >> > >> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt >> index 6c00c1e2743f..72e213d62511 100644 >> --- a/Documentation/sysctl/fs.txt >> +++ b/Documentation/sysctl/fs.txt >> @@ -76,6 +77,19 @@ dcache isn't pruned yet. >> >> == >> >> +dio_short_writes: >> + >> +In case Direct I/O encounters an transient error, it returns > > a transient > >> +the errorcode, even if it has performed part of the write. > >error code, > >> +This flag, if on (default), will return the number of bytes written >> +so far, as the write(2) symantics are. However, some older applications > >semantics > >> +still consider a direct write as an error if all of the I/O >> +submitted is not complete. ie write(file, count, buf) != count. > > I.e. > >> +This option can be disabled on systems in order to support >> +existing applications which do not expect short writes. Thanks for the comments. I will incorporate the language changes. > > and if my system has a mix of older applications and new ones, > will they all work just fine? > Newer applications should treat the error as nothing is written. But yes, I tried doing it through prctl for an individual processes, but did not find a free bit to stick it in. -- Goldwyn
Re: WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
On 1/30/18 8:41 AM, Jens Axboe wrote: > Hi, > > I just hit this on 4.15+ on the laptop, it's running Linus' git > as of yesterday, right after the block tree merge: > > commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0 > Merge: 9697e9da8429 796baeeef85a > Author: Linus Torvalds > Date: Mon Jan 29 11:51:49 2018 -0800 > > Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block > > It's hitting this case: > > if (WARN_ON_ONCE(n != segments)) { > > > in nvme, indicating that rq->nr_phys_segments does not equal the > number of bios in the request. Anyone seen this? I'll take a closer > look as time permits, full trace below. > > > WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 > nvme_setup_cmd+0x3d3/0x430 > Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc > snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat > snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 > snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq > x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb > snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer > videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 > crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore > hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd > intel_gtt > CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176 > Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017 > RIP: 0010:nvme_setup_cmd+0x3d3/0x430 > RSP: 0018:880423e9f838 EFLAGS: 00010217 > RAX: RBX: 880423e9f8c8 RCX: 0001 > RDX: 88022b200010 RSI: 0002 RDI: 327f > RBP: 880421251400 R08: 88022b20 R09: 0009 > R10: R11: R12: > R13: 88042341e280 R14: R15: 880421251440 > FS: () GS:88044150() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > nvme_queue_rq+0x40/0xa00 > ? __sbitmap_queue_get+0x24/0x90 > ? blk_mq_get_tag+0xa3/0x250 > ? wait_woken+0x80/0x80 > ? blk_mq_get_driver_tag+0x97/0xf0 > blk_mq_dispatch_rq_list+0x7b/0x4a0 > ? deadline_remove_request+0x49/0xb0 > blk_mq_do_dispatch_sched+0x4f/0xc0 > blk_mq_sched_dispatch_requests+0x106/0x170 > __blk_mq_run_hw_queue+0x53/0xa0 > __blk_mq_delay_run_hw_queue+0x83/0xa0 > blk_mq_run_hw_queue+0x6c/0xd0 > blk_mq_sched_insert_request+0x96/0x140 > __blk_mq_try_issue_directly+0x3d/0x190 > blk_mq_try_issue_directly+0x30/0x70 > blk_mq_make_request+0x1a4/0x6a0 > generic_make_request+0xfd/0x2f0 > ? submit_bio+0x5c/0x110 > submit_bio+0x5c/0x110 > ? __blkdev_issue_discard+0x152/0x200 > submit_bio_wait+0x43/0x60 > ext4_process_freed_data+0x1cd/0x440 > ? account_page_dirtied+0xe2/0x1a0 > ext4_journal_commit_callback+0x4a/0xc0 > jbd2_journal_commit_transaction+0x17e2/0x19e0 > ? kjournald2+0xb0/0x250 > kjournald2+0xb0/0x250 > ? wait_woken+0x80/0x80 > ? commit_timeout+0x10/0x10 > kthread+0x111/0x130 > ? kthread_create_worker_on_cpu+0x50/0x50 > ? do_group_exit+0x3a/0xa0 > ret_from_fork+0x1f/0x30 > Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 > 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 > 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff > ---[ end trace 50d361cc444506c8 ]--- > print_req_error: I/O error, dev nvme0n1, sector 847167488 Looking at the disassembly, 'n' is 2 and 'segments' is 0x. -- Jens Axboe
WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3
Hi, I just hit this on 4.15+ on the laptop, it's running Linus' git as of yesterday, right after the block tree merge: commit 0a4b6e2f80aad46fb55a5cf7b1664c0aef030ee0 Merge: 9697e9da8429 796baeeef85a Author: Linus Torvalds Date: Mon Jan 29 11:51:49 2018 -0800 Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block It's hitting this case: if (WARN_ON_ONCE(n != segments)) { in nvme, indicating that rq->nr_phys_segments does not equal the number of bios in the request. Anyone seen this? I'll take a closer look as time permits, full trace below. WARNING: CPU: 2 PID: 207 at drivers/nvme/host/core.c:527 nvme_setup_cmd+0x3d3/0x430 Modules linked in: rfcomm fuse ctr ccm bnep arc4 binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nls_cp437 vfat snd_hda_codec_conexant fat snd_hda_codec_generic iwlmvm snd_hda_intel snd_hda_codec snd_hwdep mac80211 snd_hda_core snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq x86_pkg_temp_thermal intel_powerclamp kvm_intel uvcvideo iwlwifi btusb snd_seq_device videobuf2_vmalloc btintel videobuf2_memops kvm snd_timer videobuf2_v4l2 bluetooth irqbypass videobuf2_core aesni_intel aes_x86_64 crypto_simd cryptd snd glue_helper videodev cfg80211 ecdh_generic soundcore hid_generic usbhid hid i915 psmouse e1000e ptp pps_core xhci_pci xhci_hcd intel_gtt CPU: 2 PID: 207 Comm: jbd2/nvme0n1p7- Tainted: G U 4.15.0+ #176 Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET59W (1.33 ) 12/19/2017 RIP: 0010:nvme_setup_cmd+0x3d3/0x430 RSP: 0018:880423e9f838 EFLAGS: 00010217 RAX: RBX: 880423e9f8c8 RCX: 0001 RDX: 88022b200010 RSI: 0002 RDI: 327f RBP: 880421251400 R08: 88022b20 R09: 0009 R10: R11: R12: R13: 88042341e280 R14: R15: 880421251440 FS: () GS:88044150() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 55b684795030 CR3: 02e09006 CR4: 001606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: nvme_queue_rq+0x40/0xa00 ? __sbitmap_queue_get+0x24/0x90 ? blk_mq_get_tag+0xa3/0x250 ? wait_woken+0x80/0x80 ? blk_mq_get_driver_tag+0x97/0xf0 blk_mq_dispatch_rq_list+0x7b/0x4a0 ? deadline_remove_request+0x49/0xb0 blk_mq_do_dispatch_sched+0x4f/0xc0 blk_mq_sched_dispatch_requests+0x106/0x170 __blk_mq_run_hw_queue+0x53/0xa0 __blk_mq_delay_run_hw_queue+0x83/0xa0 blk_mq_run_hw_queue+0x6c/0xd0 blk_mq_sched_insert_request+0x96/0x140 __blk_mq_try_issue_directly+0x3d/0x190 blk_mq_try_issue_directly+0x30/0x70 blk_mq_make_request+0x1a4/0x6a0 generic_make_request+0xfd/0x2f0 ? submit_bio+0x5c/0x110 submit_bio+0x5c/0x110 ? __blkdev_issue_discard+0x152/0x200 submit_bio_wait+0x43/0x60 ext4_process_freed_data+0x1cd/0x440 ? account_page_dirtied+0xe2/0x1a0 ext4_journal_commit_callback+0x4a/0xc0 jbd2_journal_commit_transaction+0x17e2/0x19e0 ? kjournald2+0xb0/0x250 kjournald2+0xb0/0x250 ? wait_woken+0x80/0x80 ? commit_timeout+0x10/0x10 kthread+0x111/0x130 ? kthread_create_worker_on_cpu+0x50/0x50 ? do_group_exit+0x3a/0xa0 ret_from_fork+0x1f/0x30 Code: 73 89 c1 83 ce 10 c1 e1 10 09 ca 83 f8 04 0f 87 0f ff ff ff 8b 4d 20 48 8b 7d 00 c1 e9 09 48 01 8c c7 00 08 00 00 e9 f8 fe ff ff <0f> ff 4c 89 c7 41 bc 0a 00 00 00 e8 0d 78 d6 ff e9 a1 fc ff ff ---[ end trace 50d361cc444506c8 ]--- print_req_error: I/O error, dev nvme0n1, sector 847167488 -- Jens Axboe
Re: v4.15 and I/O hang with BFQ
> Il giorno 30 gen 2018, alle ore 15:40, Ming Lei ha > scritto: > > On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote: >> Hi. >> > ... >> systemd-udevd-271 [000] 4.311033: bfq_insert_requests: insert >> rq->0 >> systemd-udevd-271 [000] ...1 4.311037: blk_mq_do_dispatch_sched: >> not get rq, 1 >> cfdisk-408 [000] 13.484220: bfq_insert_requests: insert >> rq->1 >>kworker/0:1H-174 [000] 13.484253: blk_mq_do_dispatch_sched: >> not get rq, 1 >> === >> >> Looks the same, right? > > Yeah, same with before. > Hi guys, sorry for the delay with this fix. We are proceeding very slowly on this, because I'm super busy. Anyway, now I can at least explain in more detail the cause that leads to this hang. Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests be re-inserted into the I/O scheduler. With this change, I/O schedulers may get the same request inserted again, even several times, without a finish_request invoked on the request before each re-insertion. For the I/O scheduler, every such re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber this fact causes no problems. In contrast, it confuses a stateful scheduler like BFQ, which preserves states for an I/O request until finish_request is invoked on it. In particular, BFQ has no way to know that the above re-insertions concerns the same, already dispatched request. So it may get stuck waiting for the completion of these re-inserted requests forever, thus preventing any other queue of requests to be served. We are trying to address this issue by adding the hook requeue_request to bfq interface. Unfortunately, with our current implementation of requeue_request in place, bfq eventually gets to an incoherent state. This is apparently caused by a requeue of an I/O request, immediately followed by a completion of the same request. This seems rather absurd, and drives bfq crazy. But this is something for which we don't have definite results yet. We're working on it, sorry again for the delay. Thanks, Paolo > -- > Ming
Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio
On 1/30/18 1:53 PM, Ming Lei wrote: On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček wrote: Avoids page leak from bounced requests --- block/blk-map.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/blk-map.c b/block/blk-map.c index d3a94719f03f..702d68166689 100644 --- a/block/blk-map.c +++ b/block/blk-map.c @@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio) } else { if (!ll_back_merge_fn(rq->q, rq, *bio)) { if (orig_bio != *bio) { - bio_put(*bio); + bio_inc_remaining(orig_bio); + bio_endio(*bio); 'bio_inc_remaining(orig_bio);' shouldn't be needed since we don't chain bounced bio, otherwise this patch is fine. I believe it is needed or at least desirable. The situation when the request bounced is like this bio (bounced) . bi_private ---> orig_bio and at the end of bounce_end_io, bio_endio(bio->bi_private) [which is bio_endio(orig_bio) in our case] is called. That doesn't have any effect on __blk_rq_map_user_iov; its bios have .bi_end_io==0, therefore one call more or less doesn't matter. However, for other callers, like osd_initiator.c, this is not the case. They pass bios which have bi_end_io, and might be surprised if this was called unexpectedly. Before caa4b02476e3, blk_rq_append_request wouldn't touch/delete the passed bio at all under any circumstances. I believe it should stay that way and incrementing the remaining counter, which effectively nullifies the extra bio_endio call, does that pretty straightforwardly. Regards Jiri Palecek
Re: [dm-devel] [LSF/MM TOPIC] block, dm: restack queue_limits
On 01/29/2018 10:08 PM, Mike Snitzer wrote: > We currently don't restack the queue_limits if the lowest, or > intermediate, layer of an IO stack changes. > > This is particularly unfortunate in the case of FLUSH/FUA which may > change if/when a HW controller's BBU fails; whereby requiring the device > advertise that it has a volatile write cache (WCE=1). > Uh-oh. Device rescan. Would be a valid topic on its own... > But in the context of DM, really it'd be best if the entire stack of > devices had their limits restacked if any underlying layer's limits > change. > > In the past, Martin and I discussed that we should "just do it" but > never did. Not sure we need a lengthy discussion but figured I'd put it > out there. > > Maybe I'll find time, between now and April, to try implementing it. > For SCSI the device capabilities are pretty much set in stone after the initial scan; there are hooks for rescanning, but they will only work half of the time. Plus we can't really change the device type on the fly (eg if the SCSI device type changes; if it moves away from '0' we would need to unbind the sd driver, and if it moves to '0' we'll need to rescan the sd device. None of this is happening right now.) So I'd be glad to have a discussion around this. Cheers, Hannes -- Dr. Hannes ReineckeTeamlead Storage & Networking h...@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
Re: v4.15 and I/O hang with BFQ
On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote: > Hi. > ... >systemd-udevd-271 [000] 4.311033: bfq_insert_requests: insert > rq->0 >systemd-udevd-271 [000] ...1 4.311037: blk_mq_do_dispatch_sched: > not get rq, 1 > cfdisk-408 [000] 13.484220: bfq_insert_requests: insert > rq->1 > kworker/0:1H-174 [000] 13.484253: blk_mq_do_dispatch_sched: > not get rq, 1 > === > > Looks the same, right? Yeah, same with before. -- Ming
[PATCH] lightnvm: remove chnl_offset in nvme_nvm_identity
The identity structure is initialized to zero in the beginning of the nvme_nvm_identity function. The chnl_offset is separately set to zero. Since both the variable and assignment is never changed, remove them. Signed-off-by: Matias Bjørling --- drivers/nvme/host/lightnvm.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c index e5544806fb0e..dc0b1335c7c6 100644 --- a/drivers/nvme/host/lightnvm.c +++ b/drivers/nvme/host/lightnvm.c @@ -59,8 +59,7 @@ struct nvme_nvm_identity { __u64 rsvd[2]; __le64 prp1; __le64 prp2; - __le32 chnl_off; - __u32 rsvd11[5]; + __u32 rsvd11[6]; }; struct nvme_nvm_getbbtbl { @@ -268,7 +267,6 @@ static int nvme_nvm_identity(struct nvm_dev *nvmdev, struct nvm_id *nvm_id) c.identity.opcode = nvme_nvm_admin_identity; c.identity.nsid = cpu_to_le32(ns->head->ns_id); - c.identity.chnl_off = 0; nvme_nvm_id = kmalloc(sizeof(struct nvme_nvm_id), GFP_KERNEL); if (!nvme_nvm_id) -- 2.11.0
Re: v4.15 and I/O hang with BFQ
Hi. 30.01.2018 09:19, Ming Lei wrote: Hi, We knew there is IO hang issue on BFQ over USB-storage wrt. blk-mq, and last time I found it is inside BFQ. You can try the debug patch in the following link[1] to see if it is same with the previous report[1][2]: [1] https://marc.info/?l=linux-block&m=151214241518562&w=2 [2] https://bugzilla.kernel.org/show_bug.cgi?id=198023 If you aren't sure if they are same, please post the trace somewhere, then I can check if it is a new bug. OK, first, I got 2 more stacktraces without tracing, then I've rebooted with your patch and checked tracing. Before reboot, cfdisk: === [ 266.630386] INFO: task cfdisk:437 blocked for more than 20 seconds. [ 266.640950] Not tainted 4.15.0-pf1 #1 [ 266.645131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 266.651789] cfdisk D0 437410 0x [ 266.661331] Call Trace: [ 266.664517] ? __schedule+0x35f/0x1000 [ 266.668660] ? bio_alloc_bioset+0xc7/0x1e0 [ 266.672330] ? submit_bh_wbc+0x162/0x190 [ 266.678034] schedule+0x32/0xc0 [ 266.681293] io_schedule+0x12/0x40 [ 266.685230] wait_on_page_bit+0xea/0x130 [ 266.689015] ? add_to_page_cache_lru+0xe0/0xe0 [ 266.693456] ? blkdev_writepages+0x30/0x30 [ 266.695521] do_read_cache_page+0x167/0x350 [ 266.697160] read_dev_sector+0x28/0xc0 [ 266.698685] scsi_bios_ptable+0x4e/0x120 [scsi_mod] [ 266.700156] scsicam_bios_param+0x16/0x1a0 [scsi_mod] [ 266.701868] sd_getgeo+0xbf/0xd0 [sd_mod] [ 266.703388] blkdev_ioctl+0x178/0x990 [ 266.704818] block_ioctl+0x39/0x40 [ 266.706381] do_vfs_ioctl+0xa4/0x630 [ 266.708584] ? __fput+0x131/0x1e0 [ 266.710184] ? preempt_count_add+0x68/0xa0 [ 266.711762] ? _raw_spin_unlock_irq+0x1d/0x30 [ 266.713304] SyS_ioctl+0x74/0x80 [ 266.714502] ? exit_to_usermode_loop+0x39/0xa0 [ 266.717352] entry_SYSCALL_64_fastpath+0x20/0x83 [ 266.718857] RIP: 0033:0x7fc482064d87 === Blocked kworker: === [ 389.511083] INFO: task kworker/u8:5:178 blocked for more than 20 seconds. [ 389.516454] Not tainted 4.15.0-pf1 #1 [ 389.520091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 389.524726] kworker/u8:5D0 178 2 0x8000 [ 389.528321] Workqueue: events_freezable_power_ disk_events_workfn [ 389.532228] Call Trace: [ 389.534909] ? __schedule+0x35f/0x1000 [ 389.541906] ? blk_mq_sched_dispatch_requests+0x116/0x190 [ 389.546561] ? __sbitmap_get_word+0x2a/0x80 [ 389.551167] ? sbitmap_get_shallow+0x5c/0xa0 [ 389.555633] schedule+0x32/0xc0 [ 389.559803] io_schedule+0x12/0x40 [ 389.564504] blk_mq_get_tag+0x181/0x2a0 [ 389.572541] ? wait_woken+0x80/0x80 [ 389.576008] blk_mq_get_request+0xf9/0x410 [ 389.579998] blk_mq_alloc_request+0x7e/0xe0 [ 389.584824] blk_get_request_flags+0x40/0x160 [ 389.588917] scsi_execute+0x38/0x1e0 [scsi_mod] [ 389.593079] scsi_test_unit_ready+0x5d/0xd0 [scsi_mod] [ 389.596966] sd_check_events+0xf5/0x1a0 [sd_mod] [ 389.602558] disk_check_events+0x69/0x150 [ 389.604822] process_one_work+0x1df/0x420 [ 389.606584] worker_thread+0x2b/0x3d0 [ 389.608175] ? process_one_work+0x420/0x420 [ 389.609833] kthread+0x113/0x130 [ 389.611327] ? kthread_create_on_node+0x70/0x70 [ 389.612852] ret_from_fork+0x35/0x40 === After reboot, tracing info: === systemd-udevd-269 [000] ...1 4.309198: blk_mq_do_dispatch_sched: get rq->0, 1 kworker/0:1H-174 [000] 4.309380: blk_mq_do_dispatch_sched: not get rq, 1 kworker/u8:2-167 [000] 4.309562: bfq_insert_requests: insert rq->0 kworker/u8:2-167 [000] ...1 4.309563: blk_mq_do_dispatch_sched: get rq->0, 1 kworker/0:1H-174 [000] 4.309694: blk_mq_do_dispatch_sched: not get rq, 1 kworker/u8:2-167 [000] 4.309879: bfq_insert_requests: insert rq->0 kworker/u8:2-167 [000] ...1 4.309880: blk_mq_do_dispatch_sched: get rq->0, 1 kworker/0:1H-174 [000] 4.310001: blk_mq_do_dispatch_sched: not get rq, 1 systemd-udevd-271 [000] 4.311033: bfq_insert_requests: insert rq->0 systemd-udevd-271 [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1 cfdisk-408 [000] 13.484220: bfq_insert_requests: insert rq->1 kworker/0:1H-174 [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1 === Looks the same, right? Or Paolo should know if the issue is fixed or not in V4.15. So, Paolo :)? Regards, Oleksandr
[PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE
From: Ming Lei This status is returned from driver to block layer if device related resource is unavailable, but driver can guarantee that IO dispatch will be triggered in future when the resource is available. Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value. If a driver can make sure there is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE because: 1) If all in-flight IOs complete before examining SCHED_RESTART in blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue is run immediately in this case by blk_mq_dispatch_rq_list(); 2) if there is any in-flight IO after/when examining SCHED_RESTART in blk_mq_dispatch_rq_list(): - if SCHED_RESTART isn't set, queue is run immediately as handled in 1) - otherwise, this request will be dispatched after any in-flight IO is completed via blk_mq_sched_restart() 3) if SCHED_RESTART is set concurently in context because of BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two cases and make sure IO hang can be avoided. One invariant is that queue will be rerun if SCHED_RESTART is set. Suggested-by: Jens Axboe Tested-by: Laurence Oberman Signed-off-by: Ming Lei Signed-off-by: Mike Snitzer --- V5: - fixed stale reference to 10ms delay in blk-mq.c comment - revised commit header to include Ming's proof of how blk-mq drivers will make progress (serves to document how scsi_queue_rq now works) V4: - cleanup header and code comments - rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms - eliminate nvmefc's queue rerun now that blk-mq does it V3: - fix typo, and improvement document - add tested-by tag V2: - add comments on the new introduced status - patch style fix - both are suggested by Christoph block/blk-core.c | 1 + block/blk-mq.c | 20 drivers/block/null_blk.c | 2 +- drivers/block/virtio_blk.c | 2 +- drivers/block/xen-blkfront.c | 2 +- drivers/md/dm-rq.c | 5 ++--- drivers/nvme/host/fc.c | 12 ++-- drivers/scsi/scsi_lib.c | 6 +++--- include/linux/blk_types.h| 17 + 9 files changed, 44 insertions(+), 23 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index cdae69be68e9..38279d4ae08b 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -145,6 +145,7 @@ static const struct { [BLK_STS_MEDIUM]= { -ENODATA, "critical medium" }, [BLK_STS_PROTECTION]= { -EILSEQ,"protection" }, [BLK_STS_RESOURCE] = { -ENOMEM,"kernel resource" }, + [BLK_STS_DEV_RESOURCE] = { -ENOMEM,"device resource" }, [BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" }, /* device mapper special case, should not leak out: */ diff --git a/block/blk-mq.c b/block/blk-mq.c index 43e7449723e0..e39b4e2a63db 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx, return true; } +#define BLK_MQ_QUEUE_DELAY 3 /* ms units */ + bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { @@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, struct request *rq, *nxt; bool no_tag = false; int errors, queued; + blk_status_t ret = BLK_STS_OK; if (list_empty(list)) return false; @@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, errors = queued = 0; do { struct blk_mq_queue_data bd; - blk_status_t ret; rq = list_first_entry(list, struct request, queuelist); if (!blk_mq_get_driver_tag(rq, &hctx, false)) { @@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, } ret = q->mq_ops->queue_rq(hctx, &bd); - if (ret == BLK_STS_RESOURCE) { + if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) { /* * If an I/O scheduler has been configured and we got a * driver tag for the next request already, free it @@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * that is where we will continue on next queue run. */ if (!list_empty(list)) { + bool needs_restart; + spin_lock(&hctx->lock); list_splice_init(list, &hctx->dispatch);
[PATCH 2/2] lightnvm: remove multiple groups in 1.2 data structure
Only one id group from the 1.2 specification is supported. Make sure that only the first group is accessible. Signed-off-by: Matias Bjørling --- drivers/nvme/host/lightnvm.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c index 16906327645e..e5544806fb0e 100644 --- a/drivers/nvme/host/lightnvm.c +++ b/drivers/nvme/host/lightnvm.c @@ -167,7 +167,8 @@ struct nvme_nvm_id { __le32 dom; struct nvme_nvm_addr_format ppaf; __u8resv[228]; - struct nvme_nvm_id_group groups[4]; + struct nvme_nvm_id_group group; + __u8resv2[2880]; } __packed; struct nvme_nvm_bb_tbl { @@ -209,7 +210,7 @@ static int init_grps(struct nvm_id *nvm_id, struct nvme_nvm_id *nvme_nvm_id) if (nvme_nvm_id->cgrps != 1) return -EINVAL; - src = &nvme_nvm_id->groups[0]; + src = &nvme_nvm_id->group; grp = &nvm_id->grp; grp->mtype = src->mtype; -- 2.11.0
[PATCH 1/2] lightnvm: remove mlc pairs structure
The known implementations of the 1.2 specification, and upcoming 2.0 implementation all expose a sequential list of pages to write. Remove the data structure, as it is no longer needed. Signed-off-by: Matias Bjørling --- drivers/nvme/host/lightnvm.c | 14 +- 1 file changed, 1 insertion(+), 13 deletions(-) diff --git a/drivers/nvme/host/lightnvm.c b/drivers/nvme/host/lightnvm.c index 50ef71ee3d86..16906327645e 100644 --- a/drivers/nvme/host/lightnvm.c +++ b/drivers/nvme/host/lightnvm.c @@ -116,17 +116,6 @@ struct nvme_nvm_command { }; }; -#define NVME_NVM_LP_MLC_PAIRS 886 -struct nvme_nvm_lp_mlc { - __le16 num_pairs; - __u8pairs[NVME_NVM_LP_MLC_PAIRS]; -}; - -struct nvme_nvm_lp_tbl { - __u8id[8]; - struct nvme_nvm_lp_mlc mlc; -}; - struct nvme_nvm_id_group { __u8mtype; __u8fmtype; @@ -150,8 +139,7 @@ struct nvme_nvm_id_group { __le32 mpos; __le32 mccap; __le16 cpar; - __u8reserved[10]; - struct nvme_nvm_lp_tbl lptbl; + __u8reserved[906]; } __packed; struct nvme_nvm_addr_format { -- 2.11.0
Re: [PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio
On Thu, Jan 25, 2018 at 9:58 PM, Jiří Paleček wrote: > Avoids page leak from bounced requests > --- > block/blk-map.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/block/blk-map.c b/block/blk-map.c > index d3a94719f03f..702d68166689 100644 > --- a/block/blk-map.c > +++ b/block/blk-map.c > @@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio) > } else { > if (!ll_back_merge_fn(rq->q, rq, *bio)) { > if (orig_bio != *bio) { > - bio_put(*bio); > + bio_inc_remaining(orig_bio); > + bio_endio(*bio); 'bio_inc_remaining(orig_bio);' shouldn't be needed since we don't chain bounced bio, otherwise this patch is fine. Thanks, Ming Lei
Re: [LSF/MM TOPIC] De-clustered RAID with MD
On Tue, Jan 30 2018, Wols Lists wrote: > On 29/01/18 21:50, NeilBrown wrote: >> By doing declustered parity you can sanely do raid6 on 100 drives, using >> a logical stripe size that is much smaller than 100. >> When recovering a single drive, the 10-groups-of-10 would put heavy load >> on 9 other drives, while the decluster approach puts light load on 99 >> other drives. No matter how clever md is at throttling recovery, I >> would still rather distribute the load so that md has an easier job. > > Not offering to do it ... :-) > > But that sounds a bit like linux raid-10. Could a simple approach be to > do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two > parity, striped across 100 drives? Okay, it's not as good as the > decluster approach, but it would spread the stress of a rebuild across > 20 drives, not 10. And probably be fairly easy to implement. If you did that, I think you would be about 80% of the way to fully declustered-parity RAID. If you then tweak the math a bit so that one stripe would was A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 and the next A1 C1 A2 C2 A3 C3 A4 C4 B1 D1 B2 D2 and then A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 XX When Ax are a logical stripe and Bx are the next, then you have a slightly better distribution. If device XX fails then the reads needed for the first stripe mostly come from different drives than those for the second stripe, which are mostly different again for the 3rd stripe. Presumably the CRUSH algorithm (which I only skim-read once about a year ago) formalizes how to do this, and does it better. Once you have the data handling in place for your proposal, it should be little more than replacing a couple of calculations to get the full solution. NeilBrown signature.asc Description: PGP signature
[RFC] hard LOCKUP caused by race between blk_init_queue_node and blkcg_print_blkgs
Hi Jens and Folks, Recently we've gotten a hard LOCKUP issue. After investigating the issue we've found a race between blk_init_queue_node and blkcg_print_blkgs. The race is described below. blk_init_queue_node blkcg_print_blkgs blk_alloc_queue_node (1) q->queue_lock = &q->__queue_lock (2) blkcg_init_queue(q) (3) spin_lock_irq(blkg->q->queue_lock) (4) q->queue_lock = lock (5) spin_unlock_irq(blkg->q->queue_lock) (6) (1) allocate an uninitialized queue; (2) initialize queue_lock to its default internal lock; (3) initialize blkcg part of request queue, which will create blkg and then insert it to blkg_list; (4) traverse blkg_list and find the created blkg, and then take its queue lock, here it is the default *internal lock*; (5) *race window*, now queue_lock is overridden with *driver specified lock*; (6) now unlock *driver specified lock*, not the locked *internal lock*, unlock balance breaks. For the issue above, I think blkcg_init_queue is a bit earlier. We can't allow a further use before request queue is fully initialized. Since blk_init_queue_node is a really common path and it allows driver to override the default internal lock, I'm afraid several other places may also have the same issue. Am I missing something here? Thanks, Joseph
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Tue, Jan 30, 2018 at 11:08:28AM +0100, Johannes Thumshirn wrote: > [+Cc Mel] > Jens Axboe writes: > > On 1/29/18 1:56 PM, James Bottomley wrote: > >> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: > >> [...] > >>> 2. When to enable SCSI_MQ at default again? > >> > >> I'm not sure there's much to discuss ... I think the basic answer is as > >> soon as Christoph wants to try it again. > > > > FWIW, internally I've been running various IO intensive workloads on > > what is essentially 4.12 upstream with scsi-mq the default (with > > mq-deadline as the scheduler) and comparing IO workloads with a > > previous 4.6 kernel (without scsi-mq), and things are looking > > great. > > > > We're never going to iron out the last kinks with it being off > > by default, I think we should attempt to flip the switch again > > for 4.16. > > The 4.12 sounds interesting. I remember Mel ran some test with 4.12 as > we where considering to flip the config option for SLES and it showed > several road blocks. > Mostly due to slow storage and BFQ where mq-deadline was not a universal win as an alternative default. I don't have current data and I archived what I had, but it was based on 4.13-rc7 at the time and BFQ has changed a lot since so it would need to be redone. > I'm not sure whether he re-evaluated 4.13/4.14 on his grid though. > No, it hasn't. Grid time for performance testing has been tight during the last few months to say the least. > But I'm definitively interested in this discussion and can even possibly > share some benchmark results we did in our FC Lab. > If you remind me, I may be able to re-execute the tests in a 4.16-rcX before LSF/MM so you have other data to work with. Unfortunately, I'll not be able to make LSF/MM this time due to personal commitments that conflict and are unmovable. -- Mel Gorman SUSE Labs
Re: [LSF/MM TOPIC] De-clustered RAID with MD
On 29/01/18 21:50, NeilBrown wrote: > By doing declustered parity you can sanely do raid6 on 100 drives, using > a logical stripe size that is much smaller than 100. > When recovering a single drive, the 10-groups-of-10 would put heavy load > on 9 other drives, while the decluster approach puts light load on 99 > other drives. No matter how clever md is at throttling recovery, I > would still rather distribute the load so that md has an easier job. Not offering to do it ... :-) But that sounds a bit like linux raid-10. Could a simple approach be to do something like "raid-6,11,100", ie raid-6 with 9 data chunks, two parity, striped across 100 drives? Okay, it's not as good as the decluster approach, but it would spread the stress of a rebuild across 20 drives, not 10. And probably be fairly easy to implement. Cheers, Wol
Re: [LSF/MM TOPIC] Two blk-mq related topics
On 30/01/2018 01:24, Ming Lei wrote: On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote: On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: [...] 2. When to enable SCSI_MQ at default again? I'm not sure there's much to discuss ... I think the basic answer is as soon as Christoph wants to try it again. I guess Christoph still need to evaluate if there are existed issues or blockers before trying it again. And more input may be got from F2F discussion, IMHO. SCSI_MQ is enabled on V3.17 firstly, but disabled at default. In V4.13-rc1, it is enabled at default, but later the patch is reverted in V4.13-rc7, and becomes disabled at default too. Now both the original reported PM issue(actually SCSI quiesce) and the sequential IO performance issue have been addressed. Is the blocker bug just not closed because no-one thought to do it: https://bugzilla.kernel.org/show_bug.cgi?id=178381 (we have confirmed that this issue is now fixed with the original reporter?) From a developer view, this issue is fixed by the following commit: 3a0a52997(block, scsi: Make SCSI quiesce and resume work reliably), and it is verified by kernel list reporter. And did the Huawei guy (Jonathan Cameron) confirm his performance issue was fixed (I don't think I saw email that he did)? Last time I talked with John Garry about the issue, and the merged .get_budget based patch improves much on the IO performance, but there is still a bit gap compared with legacy path. Seems a driver specific issue, remembered that removing a driver's lock can improve performance much. Garry, could you provide further update on this issue? Hi Ming, From our testing with experimental changes to our driver to support SCSI mq we were almost getting on par performance with legacy path. But without these MQ was hitting performance (and I would not necessarily say it was a driver issue). We can retest from today's mainline and see where we are. BTW, Have you got performance figures for many other single queue HBAs with and without CONFIG_SCSI_MQ_DEFAULT=Y? Thanks, John Thanks, Ming .
Re: [LSF/MM TOPIC] Two blk-mq related topics
[+Cc Mel] Jens Axboe writes: > On 1/29/18 1:56 PM, James Bottomley wrote: >> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: >> [...] >>> 2. When to enable SCSI_MQ at default again? >> >> I'm not sure there's much to discuss ... I think the basic answer is as >> soon as Christoph wants to try it again. > > FWIW, internally I've been running various IO intensive workloads on > what is essentially 4.12 upstream with scsi-mq the default (with > mq-deadline as the scheduler) and comparing IO workloads with a > previous 4.6 kernel (without scsi-mq), and things are looking > great. > > We're never going to iron out the last kinks with it being off > by default, I think we should attempt to flip the switch again > for 4.16. The 4.12 sounds interesting. I remember Mel ran some test with 4.12 as we where considering to flip the config option for SLES and it showed several road blocks. I'm not sure whether he re-evaluated 4.13/4.14 on his grid though. But I'm definitively interested in this discussion and can even possibly share some benchmark results we did in our FC Lab. Byte, Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [LSF/MM TOPIC] De-clustered RAID with MD
Wols Lists writes: > On 29/01/18 15:23, Johannes Thumshirn wrote: >> Hi linux-raid, lsf-pc >> >> (If you've received this mail multiple times, I'm sorry, I'm having >> trouble with the mail setup). > > My immediate reactions as a lay person (I edit the raid wiki) ... >> >> With the rise of bigger and bigger disks, array rebuilding times start >> skyrocketing. > > And? Yes, your data is at risk during a rebuild, but md-raid throttles > the i/o, so it doesn't hammer the system. >> >> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm >> similar to RAID5 but instead of utilizing all disks in an array for >> every I/O operation, but implement a per-I/O mapping function to only >> use a subset of the available disks. >> >> This has at least two advantages: >> 1) If one disk has to be replaced, it's not needed to read the data from >>all disks to recover the one failed disk so non-affected disks can be >>used for real user I/O and not just recovery and > > Again, that's throttling, so that's not a problem ... And throttling in a production environment is not exactly desired. Imagine a 500 disk array (and yes this is something we've seen with MD) and you have to replace disks. While the array is rebuilt you have to throttle all I/O because with raid-{1,5,6,10} all data is striped across all disks. With a parity declustered RAID (or DDP like Dell, NetApp or Huawei call it) you don't have to as the I/O is replicated in parity groups across a subset of disks. All I/O targeting disks which aren't needed to recover the data from the failed disks aren't affected by the throttling at all. >> 2) an efficient mapping function can improve parallel I/O submission, as >>two different I/Os are not necessarily going to the same disks in the >>array. >> >> For the mapping function used a hashing algorithm like Ceph's CRUSH [2] >> would be ideal, as it provides a pseudo random but deterministic mapping >> for the I/O onto the drives. >> >> This whole declustering of cause only makes sense for more than (at >> least) 4 drives but we do have customers with several orders of >> magnitude more drivers in an MD array. > > If you have four drives or more - especially if they are multi-terabyte > drives - you should NOT be using raid-5 ... raid-6 won't help you much in above scenario. >> >> At LSF I'd like to discuss if: >> 1) The wider MD audience is interested in de-clusterd RAID with MD > > I haven't read the papers, so no comment, sorry. > >> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or >>as a new personality > > Neither! If you're going to do it, it should be raid-6. > >> 3) CRUSH is a suitible algorith for this (there's evidence in [3] that >>the NetApp E-Series Arrays do use CRUSH for parity declustering) >> >> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf >> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf >> [3] >> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf >> > Okay - I've now skimmed the crush paper [2]. Looks well interesting. > BUT. It feels more like btrfs than it does like raid. > > Btrfs manages disks, and does raid, it tries to be the "everything > between the hard drive and the file". This crush thingy reads to me like > it wants to be the same. There's nothing wrong with that, but md is a > unix-y "do one thing (raid) and do it well". Well CRUSH is (one of) the algorithms behind Ceph. It takes the decisions where to place a block. It is just a hash (well technically a weighted decision-tree) function that takes a block of I/O and a some configuration parameters and "calculates" the placement. > My knee-jerk reaction is if you want to go for it, it sounds like a good > idea. It just doesn't really feel a good fit for md. Thanks for the input. Johannes -- Johannes Thumshirn Storage jthumsh...@suse.de+49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [LSF/MM TOPIC] Two blk-mq related topics
Ming Lei - 30.01.18, 02:24: > > > SCSI_MQ is enabled on V3.17 firstly, but disabled at default. In > > > V4.13-rc1, it is enabled at default, but later the patch is reverted > > > in V4.13-rc7, and becomes disabled at default too. > > > > > > Now both the original reported PM issue(actually SCSI quiesce) and > > > the sequential IO performance issue have been addressed. > > > > Is the blocker bug just not closed because no-one thought to do it: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=178381 > > > > (we have confirmed that this issue is now fixed with the original > > reporter?) > > From a developer view, this issue is fixed by the following commit: > 3a0a52997(block, scsi: Make SCSI quiesce and resume work reliably), > and it is verified by kernel list reporter. I never seen any suspend / hibernate related issues with blk-mq + bfq since then. Using heavily utilized BTRFS dual SSD RAID 1. % egrep "MQ|BFQ" /boot/config-4.15.0-tp520-btrfstrim+ CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y CONFIG_BLK_WBT_MQ=y CONFIG_BLK_MQ_PCI=y CONFIG_BLK_MQ_VIRTIO=y CONFIG_MQ_IOSCHED_DEADLINE=m CONFIG_MQ_IOSCHED_KYBER=m CONFIG_IOSCHED_BFQ=m CONFIG_BFQ_GROUP_IOSCHED=y CONFIG_NET_SCH_MQPRIO=m # CONFIG_SCSI_MQ_DEFAULT is not set # CONFIG_DM_MQ_DEFAULT is not set CONFIG_DM_CACHE_SMQ=m % cat /proc/cmdline BOOT_IMAGE=/vmlinuz-4.15.0-tp520-btrfstrim+ root=UUID=[…] ro rootflags=subvol=debian resume=/dev/mapper/sata-swap init=/bin/systemd thinkpad_acpi.fan_control=1 systemd.restore_state=0 scsi_mod.use_blk_mq=1 % cat /sys/block/sda/queue/scheduler [bfq] none Thanks, -- Martin
Re: v4.15 and I/O hang with BFQ
Hi, On Tue, Jan 30, 2018 at 09:05:26AM +0100, Oleksandr Natalenko wrote: > Hi, Paolo, Ivan, Ming et al. > > It looks like I've just encountered the issue Ivan has already described in > [1]. Since I'm able to reproduce it reliably in a VM, I'd like to draw more > attention to it. > > First, I'm using v4.15 kernel with all pending BFQ fixes: > > === > 2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for > !CONFIG_BFQ_GROUP_IOSCHED > 83c97a310f83 block, bfq: release oom-queue ref to root group on exit > 5b9eb4716af1 block, bfq: put async queues for root bfq groups too > 3c5529454a27 block, bfq: limit sectors served with interactive weight > raising > e6c72be3486b block, bfq: limit tags for writes and async I/O > e579b91d96ce block, bfq: increase threshold to deem I/O as random > f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup > 8045d8575183 block, bfq: let a queue be merged only shortly after starting > I/O > 242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state() > 8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal > 558200440cb9 block, bfq: fix occurrences of request finish method's old name > 6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection > e5f295dd18f2 block, bfq: remove batches of confusing ifdefs > === > > Next, I boot an Arch VM with this kernel and emulated USB stick attached: > > === > qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu host,+vmx > -enable-kvm -drive if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img > -drive if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img -cdrom > /mnt/vms/ovmf/shell.iso -netdev user,id=user.0 -device > virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device > usb-tablet,bus=xhci.0 -serial stdio -m 512 -hda sda.img -hdb sdb.img -smp 4 > -drive if=none,id=stick,file=usb.img -device > usb-storage,bus=xhci.0,drive=stick > === > > Within the VM itself I use udev rule to set the I/O scheduler: > > === > ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq" > === We knew there is IO hang issue on BFQ over USB-storage wrt. blk-mq, and last time I found it is inside BFQ. You can try the debug patch in the following link[1] to see if it is same with the previous report[1][2]: [1] https://marc.info/?l=linux-block&m=151214241518562&w=2 [2] https://bugzilla.kernel.org/show_bug.cgi?id=198023 If you aren't sure if they are same, please post the trace somewhere, then I can check if it is a new bug. Or Paolo should know if the issue is fixed or not in V4.15. Thanks, Ming
v4.15 and I/O hang with BFQ
Hi, Paolo, Ivan, Ming et al. It looks like I've just encountered the issue Ivan has already described in [1]. Since I'm able to reproduce it reliably in a VM, I'd like to draw more attention to it. First, I'm using v4.15 kernel with all pending BFQ fixes: === 2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED 83c97a310f83 block, bfq: release oom-queue ref to root group on exit 5b9eb4716af1 block, bfq: put async queues for root bfq groups too 3c5529454a27 block, bfq: limit sectors served with interactive weight raising e6c72be3486b block, bfq: limit tags for writes and async I/O e579b91d96ce block, bfq: increase threshold to deem I/O as random f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup 8045d8575183 block, bfq: let a queue be merged only shortly after starting I/O 242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state() 8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal 558200440cb9 block, bfq: fix occurrences of request finish method's old name 6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection e5f295dd18f2 block, bfq: remove batches of confusing ifdefs === Next, I boot an Arch VM with this kernel and emulated USB stick attached: === qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu host,+vmx -enable-kvm -drive if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img -drive if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img -cdrom /mnt/vms/ovmf/shell.iso -netdev user,id=user.0 -device virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device usb-tablet,bus=xhci.0 -serial stdio -m 512 -hda sda.img -hdb sdb.img -smp 4 -drive if=none,id=stick,file=usb.img -device usb-storage,bus=xhci.0,drive=stick === Within the VM itself I use udev rule to set the I/O scheduler: === ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq" === Things boot and work fine until I try to create a partition on the emulated USB stick with cfdisk. On writing a new partition table it blocks, and I get the following stacktrace: === [ 225.670702] INFO: task cfdisk:416 blocked for more than 20 seconds. [ 225.681427] Not tainted 4.15.0-pf1 #1 [ 225.685341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 225.691910] cfdisk D0 416407 0x [ 225.700777] Call Trace: [ 225.703654] ? __schedule+0x35f/0x1000 [ 225.706990] ? __switch_to_asm+0x40/0x70 [ 225.709943] ? __switch_to_asm+0x34/0x70 [ 225.712224] ? __switch_to_asm+0x40/0x70 [ 225.714225] ? __switch_to_asm+0x40/0x70 [ 225.716790] schedule+0x32/0xc0 [ 225.718355] io_schedule+0x12/0x40 [ 225.719747] wait_on_page_bit+0xea/0x130 [ 225.721266] ? add_to_page_cache_lru+0xe0/0xe0 [ 225.722622] __filemap_fdatawait_range+0xbf/0x110 [ 225.724625] ? preempt_count_sub+0x50/0x90 [ 225.726591] ? sync_inodes_one_sb+0x20/0x20 [ 225.727507] filemap_fdatawait_keep_errors+0x1a/0x40 [ 225.728491] iterate_bdevs+0xa7/0x140 [ 225.729304] sys_sync+0x7c/0xb0 [ 225.730095] entry_SYSCALL_64_fastpath+0x20/0x83 [ 225.732420] RIP: 0033:0x7f2631cf4a17 === I've tried it several times with the same result. Next, I reboot the system, change the scheduler to Kyber, and then I can create a new partition successfully. I also went further. While having Kyber activated to work with an emulated USB stick properly, I create RAID10 on it, then I create LVM PV, then VG, then LV, then XFS on top of that, then I set I/O scheduler back to BFQ, then reboot, and on reboot mdadm gets blocked while discovering things, producing the following stacktrace: === [ 41.350763] INFO: task mdadm:354 blocked for more than 20 seconds. [ 41.356154] Not tainted 4.15.0-pf1 #1 [ 41.358674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 41.363046] mdadm D0 354352 0x0100 [ 41.368700] Call Trace: [ 41.370417] ? __schedule+0x35f/0x1000 [ 41.372668] ? blk_queue_exit+0x3e/0x60 [ 41.374816] ? generic_make_request+0x12f/0x2d0 [ 41.377363] schedule+0x32/0xc0 [ 41.380416] io_schedule+0x12/0x40 [ 41.382698] __blkdev_direct_IO_simple+0x206/0x360 [ 41.382707] ? bdget+0x120/0x120 [ 41.382714] ? blkdev_direct_IO+0x3a5/0x3f0 [ 41.382718] blkdev_direct_IO+0x3a5/0x3f0 [ 41.382724] ? current_time+0x18/0x70 [ 41.382731] ? __atime_needs_update+0x7f/0x190 [ 41.382743] ? generic_file_read_iter+0x8c/0x9d0 [ 41.382747] generic_file_read_iter+0x8c/0x9d0 [ 41.382759] ? __seccomp_filter+0x3b/0x260 [ 41.382765] __vfs_read+0xf5/0x170 [ 41.382772] vfs_read+0x91/0x130 [ 41.382778] SyS_read+0x52/0xc0 [ 41.382819] do_syscall_64+0x67/0x1a0 [ 41.382828] entry_SYSCALL64_slow_path+0x25/0x25 [ 41.382833] RIP: 0033:0x7f4b8088a3a1 [ 41.382836] RSP: 002b:7ffdd681def8 EFLAGS: 0246 ORIG_RAX: [ 41.382841] RAX: ffda RBX: