Re: [dm-devel] [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 6:14 AM, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at 4:51pm -0500, Bart Van Assche wrote:
>
>> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> > But regardless of which wins the race, the queue will have been run.
>> > Which is all we care about right?
>>
>> Running the queue is not sufficient. With this patch applied it can happen
>> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
>> concurrent queue runs finish before sufficient device resources are available
>> to execute the request and that blk_mq_delay_run_hw_queue() does not get
>> called at all. If no other activity triggers a queue run, e.g. request
>> completion, this will result in a queue stall.
>
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run. IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE. If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
>
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here. This was the approach
> he was advocating be pursued.

Thinking further, maybe you can add the following description in V5; it
should then be much easier for driver developers to follow:

When any resource allocation fails, if the driver can make sure that there
is in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE to blk-mq; that
is exactly what scsi_queue_rq() is doing.
The theory follows:

1) the driver returns BLK_STS_DEV_RESOURCE if it figures out there is
   in-flight IO at the time of the resource allocation failure

2) if all these in-flight IOs complete before SCHED_RESTART is examined in
   blk_mq_dispatch_rq_list(), SCHED_RESTART must have been cleared, so the
   queue is run immediately in this case by blk_mq_dispatch_rq_list()

3) if there is any in-flight IO after/when SCHED_RESTART is examined in
   blk_mq_dispatch_rq_list():
   - if SCHED_RESTART isn't set, the queue is run immediately, as handled
     in 2)
   - otherwise, this request will be dispatched after some in-flight IO
     completes, via blk_mq_sched_restart(), since the request has already
     been added to hctx->dispatch

And there are two invariants when the driver returns BLK_STS_DEV_RESOURCE
iff there is in-flight IO:

1) SCHED_RESTART must be zero if there is no in-flight IO
2) there has to be IO in-flight if SCHED_RESTART is read as 1

Thanks,
Ming
Re: [2/2] block: fix blk_rq_append_bio
On Tue, Jan 30, 2018 at 02:42:41AM +0100, Jiri Palecek wrote:
> Hello,
>
> I see a problem with this patch:
>
> Ming Lei writes:
>
> > From: Jens Axboe
> >
> > Commit caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
> > moves blk_queue_bounce() into blk_rq_append_bio(), but doesn't consider
> > that the bounced bio becomes invisible to the caller since the parameter
> > type is 'struct bio *', which should have been 'struct bio **'.
> >
> > This patch fixes this issue by passing 'struct bio **' to
> > blk_rq_append_bio(), so the bounced bio can be returned to the caller.
> >
> > Also failure handling is considered.
> >
> > Fixes: caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
> > Cc: Christoph Hellwig
> > Reported-by: Michele Ballabio
> > (handling failure of blk_rq_append_bio(), only call bio_get() after
> > blk_rq_append_bio() returns OK)
> > Signed-off-by: Ming Lei
> > ---
> >  block/blk-map.c                    | 38 ++
> >  drivers/scsi/osd/osd_initiator.c   |  4 +++-
> >  drivers/target/target_core_pscsi.c |  4 ++--
> >  include/linux/blkdev.h             |  2 +-
> >  4 files changed, 28 insertions(+), 20 deletions(-)
> >
> > diff --git a/block/blk-map.c b/block/blk-map.c
> > index b21f8e86f120..d3a94719f03f 100644
> > --- a/block/blk-map.c
> > +++ b/block/blk-map.c
> > @@ -12,22 +12,29 @@
> >  #include "blk.h"
> >
> >  /*
> > - * Append a bio to a passthrough request. Only works can be merged into
> > - * the request based on the driver constraints.
> > + * Append a bio to a passthrough request. Only works if the bio can be
> > + * merged into the request based on the driver constraints.
> >   */
> > -int blk_rq_append_bio(struct request *rq, struct bio *bio)
> > +int blk_rq_append_bio(struct request *rq, struct bio **bio)
> >  {
> > -	blk_queue_bounce(rq->q, &bio);
> > +	struct bio *orig_bio = *bio;
> > +
> > +	blk_queue_bounce(rq->q, bio);
> >
> >  	if (!rq->bio) {
> > -		blk_rq_bio_prep(rq->q, rq, bio);
> > +		blk_rq_bio_prep(rq->q, rq, *bio);
> >  	} else {
> > -		if (!ll_back_merge_fn(rq->q, rq, bio))
> > +		if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> > +			if (orig_bio != *bio) {
> > +				bio_put(*bio);
> > +				*bio = orig_bio;
>
> 1) Suppose the bio *does* bounce, which allocates some pages from the
> DMA pool, and subsequently this test fails. bio_put won't deallocate
> those pages. Prior to caa4b02476e3, it would be done by bio_endio, which
> is IMHO the correct way to do it.

You are right, that is my fault; I will review your patch later.

> 2) What if the bio bounces and goes into the splitting path of
> __blk_queue_bounce. That would submit the request, which would be a
> problem should this test or some other later fail, cancelling the whole
> request.

When a bio comes here, bio_is_passthrough() is true, so it won't go into
that path; there is no need to worry.

Thanks,
Ming
Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 03:37:38AM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> > Please take a look at drivers: when BLK_STS_RESOURCE is returned, who
> > will call blk_mq_delay_run_hw_queue() for drivers?
>
> As you know the SCSI and dm drivers in kernel v4.15 already call that
> function whenever necessary.

We have concluded that this is a generic issue which needs a generic
solution [1].

[1] https://marc.info/?l=linux-kernel=151638176727612=2

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE ./drivers/ | wc -l
43

Almost every driver needs this kind of change if BLK_STS_RESOURCE is
returned from the IO path, and the failure can be caused by
kmalloc(GFP_ATOMIC), DMA mapping, or whatever.

Thanks,
Ming
Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 01:11:22AM +0000, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2018 at 04:48:31PM +0000, Bart Van Assche wrote:
> > > - It is easy to fix this race inside the block layer, namely by using
> > >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> > >   postpone the queue rerunning until after the request has been added
> > >   back to the dispatch list.
> >
> > It is just easy to say; can you cook a patch and fix all drivers first?
>
> Please reread what I wrote. I proposed to change the
> blk_mq_delay_run_hw_queue() IMPLEMENTATION such that the callers do not
> have to be modified.

Please take a look at drivers: when BLK_STS_RESOURCE is returned, who will
call blk_mq_delay_run_hw_queue() for drivers?

> > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
>
> That's nonsense. I never wrote that.

Believe it or not, here is the link and your reply:

https://marc.info/?l=dm-devel=151672694508389=2

> > So what is wrong with this way?
>
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in
> my reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call
> followed by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and
> introduces a race condition in code where there was no race condition.
>
> Bart.

--
Ming
Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 5:51 AM, Bart Van Assche wrote:
> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?
>
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

Please see the documentation of BLK_STS_DEV_RESOURCE:

+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in future once the resource is available (whereby
+ * dispatching requests).

I have explained SCSI's BLK_STS_DEV_RESOURCE conversion in another thread;
let me know if you have further concerns:

https://marc.info/?l=linux-block=151727454815018=2

--
Ming Lei
Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, Jan 30, 2018 at 5:22 AM, Bart Van Assche wrote:
> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
>> +		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> +		 * bit is set, run queue after 10ms to avoid IO stalls
>> +		 * that could otherwise occur if the queue is idle.
>> 		 */
>> -		if (!blk_mq_sched_needs_restart(hctx) ||
>> +		needs_restart = blk_mq_sched_needs_restart(hctx);
>> +		if (!needs_restart ||
>> 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>> 			blk_mq_run_hw_queue(hctx, true);
>> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> +			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>> 	}
>
> The above code only calls blk_mq_delay_run_hw_queue() if the following
> condition is met:
> !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression
> can be simplified into the following: needs_restart &&
> ret == BLK_STS_RESOURCE && !(no_tag &&
> list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

No, the expression (needs_restart && (ret == BLK_STS_RESOURCE)) clearly
expresses what we want to do: when RESTART is working, and meanwhile
BLK_STS_RESOURCE is returned from the driver, we need to avoid an IO hang
via blk_mq_delay_run_hw_queue().

> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
> 	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
>
> In other words, the above code will not call blk_mq_delay_run_hw_queue()
> if BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above
> code is reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request
> completes concurrently with the above code. From blk-mq-sched.c:

That won't be an issue: any time SCHED_RESTART is cleared, the queue will
be run again, so we needn't worry about that.
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
> 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> 		return false;
>
> 	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
> 		struct request_queue *q = hctx->queue;
>
> 		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> 			atomic_dec(&q->shared_hctx_restart);
> 	} else
> 		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>
> 	return blk_mq_run_hw_queue(hctx, true);
> }
>
> The above blk_mq_run_hw_queue() call may finish either before or after
> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

If the above blk_mq_run_hw_queue() finishes before blk_mq_dispatch_rq_list()
examines the flag, blk_mq_dispatch_rq_list() will see the flag cleared and
run the queue again via the following branch:

	if (!blk_mq_sched_needs_restart(hctx) ||
	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
		blk_mq_run_hw_queue(hctx, true);

If blk_mq_run_hw_queue() finishes after blk_mq_dispatch_rq_list() examines
the flag, this blk_mq_run_hw_queue() will see the newly added request, and
still everything is fine. Even blk_mq_delay_run_hw_queue() can be started
too.

> That's why I wrote in previous e-mails that this patch and also the
> previous versions of this patch change a systematic call of
> blk_mq_delay_run_hw_queue() into a call that may or may not happen,
> depending on whether or not a request completes concurrently with request
> queueing. Sorry, but I think that means that the above change combined
> with changing BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong, and
> why I expect that this will result in sporadic queue stalls. As you know,
> sporadic queue stalls are annoying and hard to debug.

Now there is debugfs; it isn't difficult to deal with that if the reporter
is willing to cooperate.

> Plenty of e-mails have already been exchanged about versions one to four
> of this patch, but so far nobody has even tried to contradict the above.
No, I don't see the issue; let's revisit the main change again:

+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx, 10);

If SCHED_RESTART isn't set, the queue is run immediately; otherwise, when
BLK_STS_RESOURCE is returned, we avoid an IO hang via
blk_mq_delay_run_hw_queue(hctx, XXX).

And we don't cover BLK_STS_DEV_RESOURCE above because it is documented
clearly that BLK_STS_DEV_RESOURCE is only returned by the driver iff the
queue will be rerun in the future once the resource is available.

Is there any hole in the above code?

Thanks,
Ming Lei
Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly
On 30/01/2018 9:57 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> Hello Coly:
>
> OK, I got your point now.
> Thanks for your patience.
>
> And there is a small issue I hope to be modified:
> +#define BCACHE_DEV_WB_RUNNING		4
> +#define BCACHE_DEV_RATE_DW_RUNNING	8
> Would be OK just as:
> +#define BCACHE_DEV_WB_RUNNING		3
> +#define BCACHE_DEV_RATE_DW_RUNNING	4
>
> Reviewed-by: Tang Junhui

Hi Junhui,

I will fix that in the v5 patch. Thanks for your review :-)

Coly Li

[snip]
Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly
From: Tang Junhui

Hello Coly:

OK, I got your point now.
Thanks for your patience.

And there is a small issue I hope to be modified:
+#define BCACHE_DEV_WB_RUNNING		4
+#define BCACHE_DEV_RATE_DW_RUNNING	8
Would be OK just as:
+#define BCACHE_DEV_WB_RUNNING		3
+#define BCACHE_DEV_RATE_DW_RUNNING	4

Reviewed-by: Tang Junhui

>On 29/01/2018 8:22 PM, tang.jun...@zte.com.cn wrote:
>> From: Tang Junhui
>>
>> Hello Coly:
>>
>> There are some differences. Using a variable of atomic_t type cannot
>> guarantee the atomicity of the transaction. For example:
>>
>> A thread runs in update_writeback_rate():
>>
>> update_writeback_rate() {
>> 	...
>> +	if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
>> +		schedule_delayed_work(&dc->writeback_rate_update,
>> 			      dc->writeback_rate_update_seconds * HZ);
>> +	}
>>
>> Then another thread executes in cached_dev_detach_finish():
>>
>> 	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
>> 		cancel_writeback_rate_update_dwork(dc);
>>
>> +
>> +	/*
>> +	 * should check BCACHE_DEV_RATE_DW_RUNNING before calling
>> +	 * cancel_delayed_work_sync().
>> +	 */
>> +	clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
>> +	/* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
>> +	smp_mb();
>>
>> The race still exists.
>
>Hi Junhui,
>
>Check super.c:cancel_writeback_rate_update_dwork();
>BCACHE_DEV_RATE_DW_RUNNING is checked there.
>
>You may see that in cached_dev_detach_finish() and update_writeback_rate()
>the orders of checking BCACHE_DEV_RATE_DW_RUNNING and BCACHE_DEV_WB_RUNNING
>are different:
>
>cached_dev_detach_finish()	update_writeback_rate()
>
>test_and_clear_bit		set_bit
>BCACHE_DEV_WB_RUNNING		BCACHE_DEV_RATE_DW_RUNNING
>
>(implicit smp_mb())		smp_mb()
>
>test_bit			test_bit
>BCACHE_DEV_RATE_DW_RUNNING	BCACHE_DEV_WB_RUNNING
>
>				clear_bit()
>				BCACHE_DEV_RATE_DW_RUNNING
>
>				smp_mb()
>
>These two flags are accessed in reversed order in the two locations, and
>there is an smp_mb() between the accesses to the two flags to serialize
>the access order.
>
>By the above reversed-order accesses, it is sure that:
>- in cached_dev_detach_finish(), before
>  test_bit(BCACHE_DEV_RATE_DW_RUNNING), the bit BCACHE_DEV_WB_RUNNING
>  must have been cleared already.
>- in update_writeback_rate(), before test_bit(BCACHE_DEV_WB_RUNNING),
>  BCACHE_DEV_RATE_DW_RUNNING must have been set already.
>
>Therefore, in your example, if a thread is testing BCACHE_DEV_WB_RUNNING
>in update_writeback_rate(), it means BCACHE_DEV_RATE_DW_RUNNING must be
>set already. So in cancel_writeback_rate_update_dwork() the other thread
>must wait until BCACHE_DEV_RATE_DW_RUNNING is cleared before
>cancel_delayed_work_sync() can be called. And in update_writeback_rate()
>the bit BCACHE_DEV_RATE_DW_RUNNING is cleared after
>schedule_delayed_work() returns, so the race is killed.
>
>A mutex lock implies an implicit memory barrier, and in your suggestion
>up_read(&dc->writeback_lock) is after schedule_delayed_work() too. This
>is why I said they are almost the same.
>
>Thanks.
>
>Coly Li
>
>>> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
>>>> From: Tang Junhui
>>>>
>>>> Hello Coly:
>>>>
>>>> This patch is somewhat difficult for me; I think we can resolve it in
>>>> a simple way. We can take the schedule_delayed_work() under the
>>>> protection of dc->writeback_lock, and judge whether we need to re-arm
>>>> this work to the queue:
>>>>
>>>> static void update_writeback_rate(struct work_struct *work)
>>>> {
>>>> 	struct cached_dev *dc = container_of(to_delayed_work(work),
>>>> 					     struct cached_dev,
>>>> 					     writeback_rate_update);
>>>>
>>>> 	down_read(&dc->writeback_lock);
>>>>
>>>> 	if (atomic_read(&dc->has_dirty) && dc->writeback_percent)
>>>> 		__update_writeback_rate(dc);
>>>>
>>>> -	up_read(&dc->writeback_lock);
>>>> +	if (NEED_RE-ARMING)
>>>> 		schedule_delayed_work(&dc->writeback_rate_update,
>>>> 			      dc->writeback_rate_update_seconds * HZ);
>>>> +	up_read(&dc->writeback_lock);
>>>> }
>>>>
>>>> In cached_dev_detach_finish() and cached_dev_free() we can set the
>>>> "no need" flag under the protection of dc->writeback_lock, for
>>>> example:
>>>>
>>>> static void cached_dev_detach_finish(struct work_struct *w)
>>>> {
>>>> 	...
>>>> +	down_write(&dc->writeback_lock);
>>>> +	SET NO-NEED-RE-ARM FLAG
>>>> +	up_write(&dc->writeback_lock);
>>>> 	cancel_delayed_work_sync(&dc->writeback_rate_update);
>>>> }
>>>>
>>>> I think this way is more simple and readable.
>>>
>>> Hi Junhui,
>>>
>>> Your suggestion is essentially almost the same as my patch:
>>> - clearing the BCACHE_DEV_DETACHING bit acts as SET NO-NEED-RE-ARM FLAG.
>>> - cancel_writeback_rate_update_dwork acts as some kind
Re: [LSF/MM TOPIC] Two blk-mq related topics
On 1/29/18 4:46 PM, James Bottomley wrote:
> On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
>> On 1/29/18 1:56 PM, James Bottomley wrote:
>>> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
>>> [...]
>>>> 2. When to enable SCSI_MQ at default again?
>>>
>>> I'm not sure there's much to discuss ... I think the basic answer
>>> is as soon as Christoph wants to try it again.
>>
>> FWIW, internally I've been running various IO intensive workloads on
>> what is essentially 4.12 upstream with scsi-mq the default (with
>> mq-deadline as the scheduler) and comparing IO workloads with a
>> previous 4.6 kernel (without scsi-mq), and things are looking great.
>>
>> We're never going to iron out the last kinks with it being off by
>> default; I think we should attempt to flip the switch again for 4.16.
>
> Absolutely, I agree we turn it on ASAP. I just don't want to be on the
> receiving end of Linus' flamethrower because a bug we already had
> reported against scsi-mq caused problems. Get confirmation from the
> original reporters (or as close to it as you can) that their problems
> are fixed and we're good to go; he won't kick us nearly as hard for new
> bugs that turn up.

I agree, the functional issues definitely have to be verified to be
resolved. Various performance hitches we can dive into if they crop up,
but reintroducing some random suspend regression is not acceptable.

--
Jens Axboe
[PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio
Avoids a page leak from bounced requests.
---
 block/blk-map.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index d3a94719f03f..702d68166689 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio)
 	} else {
 		if (!ll_back_merge_fn(rq->q, rq, *bio)) {
 			if (orig_bio != *bio) {
-				bio_put(*bio);
+				bio_inc_remaining(orig_bio);
+				bio_endio(*bio);
 				*bio = orig_bio;
 			}
 			return -EINVAL;
--
2.15.1
Re: [2/2] block: fix blk_rq_append_bio
Hello,

I see a problem with this patch:

Ming Lei writes:

> From: Jens Axboe
>
> Commit caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
> moves blk_queue_bounce() into blk_rq_append_bio(), but doesn't consider
> that the bounced bio becomes invisible to the caller since the parameter
> type is 'struct bio *', which should have been 'struct bio **'.
>
> This patch fixes this issue by passing 'struct bio **' to
> blk_rq_append_bio(), so the bounced bio can be returned to the caller.
>
> Also failure handling is considered.
>
> Fixes: caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
> Cc: Christoph Hellwig
> Reported-by: Michele Ballabio
> (handling failure of blk_rq_append_bio(), only call bio_get() after
> blk_rq_append_bio() returns OK)
> Signed-off-by: Ming Lei
> ---
>  block/blk-map.c                    | 38 ++
>  drivers/scsi/osd/osd_initiator.c   |  4 +++-
>  drivers/target/target_core_pscsi.c |  4 ++--
>  include/linux/blkdev.h             |  2 +-
>  4 files changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index b21f8e86f120..d3a94719f03f 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -12,22 +12,29 @@
>  #include "blk.h"
>
>  /*
> - * Append a bio to a passthrough request. Only works can be merged into
> - * the request based on the driver constraints.
> + * Append a bio to a passthrough request. Only works if the bio can be
> + * merged into the request based on the driver constraints.
>   */
> -int blk_rq_append_bio(struct request *rq, struct bio *bio)
> +int blk_rq_append_bio(struct request *rq, struct bio **bio)
>  {
> -	blk_queue_bounce(rq->q, &bio);
> +	struct bio *orig_bio = *bio;
> +
> +	blk_queue_bounce(rq->q, bio);
>
>  	if (!rq->bio) {
> -		blk_rq_bio_prep(rq->q, rq, bio);
> +		blk_rq_bio_prep(rq->q, rq, *bio);
>  	} else {
> -		if (!ll_back_merge_fn(rq->q, rq, bio))
> +		if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> +			if (orig_bio != *bio) {
> +				bio_put(*bio);
> +				*bio = orig_bio;

1) Suppose the bio *does* bounce, which allocates some pages from the
DMA pool, and subsequently this test fails. bio_put won't deallocate
those pages. Prior to caa4b02476e3, it would be done by bio_endio, which
is IMHO the correct way to do it.

2) What if the bio bounces and goes into the splitting path of
__blk_queue_bounce. That would submit the request, which would be a
problem should this test or some other later fail, cancelling the whole
request.

For 1, I will send a patch. As for 2, it would have been a problem for a
long time, so I have a hunch it doesn't actually happen; however, I'd
like to know whether it is somehow guaranteed.

Regards
Jiri Palecek
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Mon, Jan 29, 2018 at 03:40:31PM -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at 10:46am -0500, Ming Lei wrote:
>
> > 2. When to enable SCSI_MQ at default again?
> >
> > SCSI_MQ was first enabled in v3.17, but disabled by default. In
> > v4.13-rc1 it was enabled by default, but later the patch was reverted
> > in v4.13-rc7, and it became disabled by default again.
> >
> > Now both the originally reported PM issue (actually SCSI quiesce) and
> > the sequential IO performance issue have been addressed, and the MQ IO
> > schedulers are ready for traditional disks. Are there other issues to
> > be addressed before enabling SCSI_MQ by default? When can we do that
> > again?
> >
> > Last time, the two issues were reported during the v4.13 dev cycle just
> > when it was enabled by default; it seems that if SCSI_MQ isn't enabled
> > by default, it won't be exposed to complete and full testing.
> >
> > So if we continue to disable it by default, it may never be exposed to
> > full test/production environments.
>
> I was going to propose revisiting this as well.
>
> I'd really like to see all the old .request_fn block core code removed.

Yeah, that should be the final goal, but it may take a while.

> But maybe we take a first step of enabling:
> 	CONFIG_SCSI_MQ_DEFAULT=Y
> 	CONFIG_DM_MQ_DEFAULT=Y

Maybe you can remove the legacy path from DM_RQ first, and take your
original approach to allow DM/MQ over legacy underlying drivers; it seems
we discussed this topic before :-)

Thanks,
Ming
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote:
> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> [...]
> > 2. When to enable SCSI_MQ at default again?
>
> I'm not sure there's much to discuss ... I think the basic answer is as
> soon as Christoph wants to try it again.

I guess Christoph still needs to evaluate whether there are existing
issues or blockers before trying it again, and more input may come from
a F2F discussion, IMHO.

> > SCSI_MQ was first enabled in v3.17, but disabled by default. In
> > v4.13-rc1 it was enabled by default, but later the patch was reverted
> > in v4.13-rc7, and it became disabled by default again.
> >
> > Now both the originally reported PM issue (actually SCSI quiesce) and
> > the sequential IO performance issue have been addressed.
>
> Is the blocker bug just not closed because no-one thought to do it:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=178381
>
> (we have confirmed that this issue is now fixed with the original
> reporter?)

From a developer's view, this issue is fixed by the following commit:
3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
and it has been verified by the original reporter on the kernel list.

> And did the Huawei guy (Jonathan Cameron) confirm his performance issue
> was fixed (I don't think I saw email that he did)?

Last time I talked with John Garry about the issue; the merged
.get_budget based patch improves IO performance a lot, but there is still
a small gap compared with the legacy path. It seems to be a
driver-specific issue; I remember that removing one of the driver's locks
improved performance much.

Garry, could you provide a further update on this issue?

Thanks,
Ming
Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> On Mon, Jan 29, 2018 at 04:48:31PM +0000, Bart Van Assche wrote:
> > - It is easy to fix this race inside the block layer, namely by using
> >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> >   postpone the queue rerunning until after the request has been added
> >   back to the dispatch list.
>
> It is just easy to say; can you cook a patch and fix all drivers first?

Please reread what I wrote. I proposed to change the
blk_mq_delay_run_hw_queue() IMPLEMENTATION such that the callers do not
have to be modified.

> [ ... ] Later, you admitted you understood the patch wrong. [ ... ]

That's nonsense. I never wrote that.

Bart.
Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote: > On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote: > > Not mention, the request isn't added to dispatch list yet in .queue_rq(), > > strictly speaking, it is not correct to call blk_mq_delay_run_hw_queue() in > > .queue_rq(), so the current block layer API can't handle it well enough. > > I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from > inside .queue_rq(). Additionally, I have already explained to you in > previous e-mails why it's fine to call that function from inside .queue_rq(): > - Nobody has ever observed the race you described because the time after 'Nobody observed it' does not mean there isn't the race, since I can explain it clearly. > which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than > the time during which that race exists. Again it is fragile to depend on the delay(10ms, 3ms, or what ever) to make everything correct, why can't we make the way robust like the approach I took? You can not guarantee that the request handled in .queue_rq() is added to hctx->dispatch when you call blk_mq_delay_run_hw_queue(), because the operation of adding request to hctx->dispatch happens after returning from .queue_rq() to blk_mq_dispatch_rq_list(), which is done in the future against calling blk_mq_delay_run_hw_queue(). Right? Especially now kblockd_mod_delayed_work_on() is changed to use mod_delayed_work_on(), which may run the queue before the timer is expired from another context, then IO hang still can be triggered since the run queue may miss the request to be added to hctx->dispatch. > - It is easy to fix this race inside the block layer, namely by using > call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to > postpone the queue rerunning until after the request has been added back to > the dispatch list. It is just easy to say, can you cook a patch and fix all drivers first? Then let's compare which patch is simpler and better. 
> > > - The patch at the start of this thread introduces a regression in
> > >   the SCSI core, namely a queue stall if a request completion occurs
> > >   concurrently with the newly added BLK_MQ_S_SCHED_RESTART test in
> > >   the blk-mq core.
>
> This patch only moves the blk_mq_delay_run_hw_queue() from
> scsi_queue_rq() to blk-mq; again, please explain in detail how this
> patch V3 introduces this regression on SCSI.

Please reread the following message:
https://marc.info/?l=dm-devel=151672480107560.

OK, the following message is copied from the above link:

> My opinion about this patch is as follows:
> * Changing a blk_mq_delay_run_hw_queue() call followed by return
>   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it
>   changes a guaranteed queue rerun into a queue rerun that may or may
>   not happen depending on whether or not multiple queue runs happen
>   simultaneously.
> * This change makes block drivers less readable because anyone who
>   encounters BLK_STS_DEV_RESOURCE will have to look up its definition to
>   figure out what its meaning is.
> * We don't need the new status code BLK_STS_DEV_RESOURCE because a
>   delayed queue run can be implemented easily with the existing block
>   layer API.

Later, you admitted you understood the patch wrong, so here is your reply
again, from https://marc.info/?l=dm-devel=151672694508389=2:

> On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:24:20PM +0000, Bart Van Assche wrote:
> > > My opinion about this patch is as follows:
> > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because
> > >   it changes a guaranteed queue rerun into a queue rerun that may or
> > >   may not happen depending on whether or not multiple queue runs
> > >   happen simultaneously.
> >
> > You may not understand the two:
> >
> > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure
> > to avoid an IO hang via blk_mq_delay_run_hw_queue() or
> > blk_mq_run_hw_queue(); which one is used depends on SCHED_RESTART.
> >
> > 2) if the driver can make sure the queue will be rerun after some
> > resource becomes available, either by itself or by blk-mq, it returns
> > BLK_STS_DEV_RESOURCE.
> >
> > So what is wrong with this way?
>
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally
> in my reply. What I meant is that changing a blk_mq_delay_run_hw_queue()
> call followed by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is
> wrong and introduces a race condition in code where there was no race
> condition.

You still haven't explained the race condition here, right? I can explain
it again, but I am losing patience with you if you continue to refuse to
answer my question, just like you refused to provide the debugfs log when
you reported the issue.

When turning BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, it is
[LSF/MM TOPIC] Killing reliance on struct page->mapping
I started a patchset about $TOPIC a while ago. Right now I am working on
other things, but I hope to have an RFC for $TOPIC before LSF/MM and thus
would like a slot during the common track to talk about it, as it impacts
FS, BLOCK and MM (I am assuming there will be a common track).

The idea is that the mapping (struct address_space) is available in
virtually all the places where it is needed, and that there should be no
reason to depend only on the struct page->mapping field. My patchset
basically adds the mapping to a bunch of VFS callbacks (struct
address_space_operations) where it is missing, changing the call sites.
Then I do an individual patch per filesystem to leverage the new argument
instead of struct page.

I am doing this for a generic page write-protection mechanism which
generalizes KSM to file-backed pages. There are a couple of other aspects,
like struct page->index and struct page->private, which are addressed in a
similar way. The block layer is mostly affected because on a block device
error it needs page->mapping to report the I/O error.

Maybe we can kill page->mapping altogether as a result of this; however,
that is not my motivation at this time.

Sorry for the absence of a patchset at this time, but I wanted to submit
the subject before the LSF/MM deadline.

Cheers,
Jérôme
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote: > On 1/29/18 1:56 PM, James Bottomley wrote: > > > > On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: > > [...] > > > > > > 2. When to enable SCSI_MQ at default again? > > > > I'm not sure there's much to discuss ... I think the basic answer > > is as soon as Christoph wants to try it again. > > FWIW, internally I've been running various IO intensive workloads on > what is essentially 4.12 upstream with scsi-mq the default (with > mq-deadline as the scheduler) and comparing IO workloads with a > previous 4.6 kernel (without scsi-mq), and things are looking > great. > > We're never going to iron out the last kinks with it being off > by default; I think we should attempt to flip the switch again > for 4.16. Absolutely, I agree we should turn it on ASAP. I just don't want to be on the receiving end of Linus' flamethrower because a bug we already had reported against scsi-mq caused problems. Get confirmation from the original reporters (or as close to it as you can) that their problems are fixed and we're good to go; he won't kick us nearly as hard for new bugs that turn up. James
Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle
On 01/19/18 07:24, Jens Axboe wrote:
> That's what I thought. So for a low queue depth underlying queue, it's
> quite possible that this situation can happen. Two potential solutions
> I see:
>
> 1) As described earlier in this thread, having a mechanism for being
>    notified when the scarce resource becomes available. It would not
>    be hard to tap into the existing sbitmap wait queue for that.
>
> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
>    allocation. I haven't read the dm code to know if this is a
>    possibility or not.
>
> I'd probably prefer #1. It's a classic case of trying to get the
> request, and if it fails, add ourselves to the sbitmap tag wait queue
> head, retry, and bail if that also fails. Connecting the scarce
> resource and the consumer is the only way to really fix this, without
> bogus arbitrary delays.

(replying to an e-mail from ten days ago)

Implementing a notification mechanism for all cases in which blk_insert_cloned_request() returns BLK_STS_RESOURCE today would require a lot of work. If e.g. a SCSI LLD returns one of the SCSI_MLQUEUE_*_BUSY return codes from its .queuecommand() implementation then the SCSI core will translate that return code into BLK_STS_RESOURCE. From scsi_queue_rq():

	reason = scsi_dispatch_cmd(cmd);
	if (reason) {
		scsi_set_blocked(cmd, reason);
		ret = BLK_STS_RESOURCE;
		goto out_dec_host_busy;
	}

In other words, implementing a notification mechanism for all cases in which blk_insert_cloned_request() can return BLK_STS_RESOURCE would require modifying all SCSI LLDs.

Bart.
Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Mon, 2018-01-29 at 17:14 -0500, Mike Snitzer wrote: > On Mon, Jan 29 2018 at 4:51pm -0500, > Bart Van Assche wrote: > > > On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote: > > > But regardless of which wins the race, the queue will have been run. > > > Which is all we care about right? > > > > Running the queue is not sufficient. With this patch applied it can happen > > that the block driver returns BLK_STS_DEV_RESOURCE, that two or more > > concurrent queue runs finish before sufficient device resources are > > available > > to execute the request and that blk_mq_delay_run_hw_queue() does not get > > called at all. If no other activity triggers a queue run, e.g. request > > completion, this will result in a queue stall. > > If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely > on a future queue run. IIUC, that is the entire premise of > BLK_STS_DEV_RESOURCE. If the driver had doubt about whether the > resource were going to be available in the future it should return > BLK_STS_RESOURCE. > > That may seem like putting a lot on a driver developer (to decide > between the 2) but I'll again defer to Jens here. This was the approach > he was advocating be pursued. Hello Mike, There was a typo in my reply: I should have referred to BLK_STS_RESOURCE instead of BLK_STS_DEV_RESOURCE. The convention for BLK_STS_DEV_RESOURCE is namely that if .queue_rq() returns that status code, the block driver guarantees that the queue will be rerun. Bart.
Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote: > But regardless of which wins the race, the queue will have been run. > Which is all we care about right? Running the queue is not sufficient. With this patch applied it can happen that the block driver returns BLK_STS_DEV_RESOURCE, that two or more concurrent queue runs finish before sufficient device resources are available to execute the request and that blk_mq_delay_run_hw_queue() does not get called at all. If no other activity triggers a queue run, e.g. request completion, this will result in a queue stall. Bart.
Re: [LSF/MM TOPIC] De-clustered RAID with MD
On Mon, Jan 29 2018, Wols Lists wrote: > On 29/01/18 15:23, Johannes Thumshirn wrote: >> Hi linux-raid, lsf-pc >> >> (If you've received this mail multiple times, I'm sorry, I'm having >> trouble with the mail setup). > > My immediate reactions as a lay person (I edit the raid wiki) ... >> >> With the rise of bigger and bigger disks, array rebuilding times start >> skyrocketing. > > And? Yes, your data is at risk during a rebuild, but md-raid throttles > the i/o, so it doesn't hammer the system. >> >> In a paper from '92 Holland and Gibson [1] suggest a mapping algorithm >> similar to RAID5, but instead of utilizing all disks in an array for >> every I/O operation, implement a per-I/O mapping function to only >> use a subset of the available disks. >> >> This has at least two advantages: >> 1) If one disk has to be replaced, it's not needed to read the data from >>all disks to recover the one failed disk, so non-affected disks can be >>used for real user I/O and not just recovery, and > > Again, that's throttling, so that's not a problem ... Imagine an array with 100 drives on which we store data in sets of (say) 6 data chunks and 2 parity chunks. Each group of 8 chunks is distributed over the 100 drives in a different way so that (e.g.) 600 data chunks and 200 parity chunks are distributed over 8 physical stripes using some clever distribution function. If (when) one drive fails, the 8 chunks in this set of 8 physical stripes can be recovered by reading 6*8 == 48 chunks, each of which will be on a different drive. Half the drives deliver only one chunk (in an ideal distribution) and the other half deliver none. Maybe they will deliver some for the next set of 100 logical stripes. You would probably say that even doing raid6 on 100 drives is crazy. Better to make, e.g., 10 groups of 10 and do raid6 on each of the 10, then LVM them together. By doing declustered parity you can sanely do raid6 on 100 drives, using a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load on 9 other drives, while the declustered approach puts light load on 99 other drives. No matter how clever md is at throttling recovery, I would still rather distribute the load so that md has an easier job. NeilBrown > >> 2) an efficient mapping function can improve parallel I/O submission, as >>two different I/Os are not necessarily going to the same disks in the >>array. >> >> For the mapping function, a hashing algorithm like Ceph's CRUSH [2] >> would be ideal, as it provides a pseudo random but deterministic mapping >> for the I/O onto the drives. >> >> This whole declustering of course only makes sense for more than (at >> least) 4 drives, but we do have customers with several orders of >> magnitude more drives in an MD array. > > If you have four drives or more - especially if they are multi-terabyte > drives - you should NOT be using raid-5 ... >> >> At LSF I'd like to discuss if: >> 1) The wider MD audience is interested in de-clustered RAID with MD > > I haven't read the papers, so no comment, sorry. > >> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or >>as a new personality > > Neither! If you're going to do it, it should be raid-6. > >> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that >>the NetApp E-Series Arrays do use CRUSH for parity declustering) >> >> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf >> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf >> [3] >> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf >> > Okay - I've now skimmed the crush paper [2]. Looks well interesting. > BUT. It feels more like btrfs than it does like raid. > > Btrfs manages disks, and does raid, it tries to be the "everything > between the hard drive and the file". 
This crush thingy reads to me like > it wants to be the same. There's nothing wrong with that, but md is a > unix-y "do one thing (raid) and do it well". > > My knee-jerk reaction is if you want to go for it, it sounds like a good > idea. It just doesn't really feel a good fit for md. > > Cheers, > Wol
[LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad
I'd like to enable bio-based DM to _not_ need to clone bios. But to do so each bio-based DM target's required per-bio-data would need to be provided by upper layer biosets (as opposed to the bioset DM currently creates). So my thinking is that all system-level biosets (e.g. fs_bio_set, blkdev_dio_pool) would redirect to a device specific variant bioset IFF the underlying device advertises the need for a specific per-bio-data payload to be provided. I know this _could_ become a rathole but I'd like to avoid reverting DM back to the days of having to worry about managing mempools for the purpose of per-io allocations. I've grown spoiled by the performance and elegance that comes with having the bio and per-bio-data allocated from the same bioset. Thoughts? Mike
Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> +	 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> +	 * bit is set, run queue after 10ms to avoid IO stalls
> +	 * that could otherwise occur if the queue is idle.
>	 */
> -	if (!blk_mq_sched_needs_restart(hctx) ||
> +	needs_restart = blk_mq_sched_needs_restart(hctx);
> +	if (!needs_restart ||
>	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>		blk_mq_run_hw_queue(hctx, true);
> +	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> }

The above code only calls blk_mq_delay_run_hw_queue() if the following condition is met:

	!(!needs_restart ||
	  (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) &&
	(needs_restart && (ret == BLK_STS_RESOURCE))

That boolean expression can be simplified into the following:

	needs_restart && ret == BLK_STS_RESOURCE &&
	!(no_tag && list_empty_careful(&hctx->dispatch_wait.entry))

From blk-mq-sched.h:

static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}

In other words, the above code will not call blk_mq_delay_run_hw_queue() if BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes concurrently with the above code. From blk-mq-sched.c:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
		struct request_queue *q = hctx->queue;

		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
			atomic_dec(&q->shared_hctx_restart);
	} else
		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

	return blk_mq_run_hw_queue(hctx, true);
}

The above blk_mq_run_hw_queue() call may finish either before or after blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag. 
That's why I wrote in previous e-mails that this patch, and also the previous versions of this patch, change a systematic call of blk_mq_delay_run_hw_queue() into a call that may or may not happen, depending on whether or not a request completes concurrently with request queueing. Sorry, but I think that means that the above change, combined with changing BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, is wrong, and it is why I expect that this will result in sporadic queue stalls. As you know, sporadic queue stalls are annoying and hard to debug. Plenty of e-mails have already been exchanged about versions one to four of this patch but so far nobody has even tried to contradict the above. Bart.
[LSF/MM TOPIC] block, dm: restack queue_limits
We currently don't restack the queue_limits if the lowest, or an intermediate, layer of an IO stack changes. This is particularly unfortunate in the case of FLUSH/FUA, which may change if/when a HW controller's BBU fails, whereby the device is required to advertise that it has a volatile write cache (WCE=1). But in the context of DM, really it'd be best if the entire stack of devices had their limits restacked if any underlying layer's limits change. In the past, Martin and I discussed that we should "just do it" but never did. Not sure we need a lengthy discussion, but I figured I'd put it out there. Maybe I'll find time, between now and April, to try implementing it. Thanks, Mike
Re: [LSF/MM TOPIC] Two blk-mq related topics
On 1/29/18 1:56 PM, James Bottomley wrote: > On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: > [...] >> 2. When to enable SCSI_MQ at default again? > > I'm not sure there's much to discuss ... I think the basic answer is as > soon as Christoph wants to try it again. FWIW, internally I've been running various IO intensive workloads on what is essentially 4.12 upstream with scsi-mq the default (with mq-deadline as the scheduler) and comparing IO workloads with a previous 4.6 kernel (without scsi-mq), and things are looking great. We're never going to iron out the last kinks with it being off by default; I think we should attempt to flip the switch again for 4.16. -- Jens Axboe
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote: [...] > 2. When to enable SCSI_MQ at default again? I'm not sure there's much to discuss ... I think the basic answer is as soon as Christoph wants to try it again. > SCSI_MQ was first enabled in V3.17, but disabled at default. In > V4.13-rc1, it was enabled at default, but the patch was later reverted > in V4.13-rc7, and it became disabled at default again. > > Now both the originally reported PM issue (actually SCSI quiesce) and > the sequential IO performance issue have been addressed. Is the blocker bug just not closed because no-one thought to do it: https://bugzilla.kernel.org/show_bug.cgi?id=178381 (have we confirmed with the original reporter that this issue is now fixed?) And did the Huawei guy (Jonathan Cameron) confirm his performance issue was fixed (I don't think I saw an email saying that he did)? James
Re: [LSF/MM TOPIC] Two blk-mq related topics
On Mon, Jan 29 2018 at 10:46am -0500, Ming Lei wrote: > 2. When to enable SCSI_MQ at default again? > > SCSI_MQ was first enabled in V3.17, but disabled at default. In V4.13-rc1, > it was enabled at default, but the patch was later reverted in V4.13-rc7, and > it became disabled at default again. > > Now both the originally reported PM issue (actually SCSI quiesce) and the > sequential IO performance issue have been addressed. And MQ IO schedulers > are ready too for traditional disks. Are there other issues to be addressed > for enabling SCSI_MQ at default? When can we do that again? > > Last time, the two issues were reported during the V4.13 dev cycle just when it > was enabled at default; it seems that if SCSI_MQ isn't enabled at default, it > won't be exposed to complete and full testing. > > So if we continue to disable it at default, maybe it can never be exposed to > a full test/production environment. I was going to propose revisiting this as well. I'd really like to see all the old .request_fn block core code removed. But maybe we take a first step of enabling: CONFIG_SCSI_MQ_DEFAULT=Y CONFIG_DM_MQ_DEFAULT=Y Thanks, Mike
Re: [GIT PULL] Block changes for 4.16-rc
On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe wrote: > > Yes of course, I can switch to using signed tags for pull requests > for you. Thanks. I don't actually know the maintainership status of kernel.dk. You show up as the 'whois' contact, but I'm also assuming it doesn't have the same kind of fairly draconian access control as the kernel.org suite of sites, and I'm actively trying to make sure we either use signed tags or the kernel.org repos (and honestly, even with the kernel.org ones I tend to prefer signed tags). Right now I only _require_ it for public hosting, but I'm trying to move to just signed tags being the default everywhere. Linus
[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE
From: Ming Lei

This status is returned from driver to block layer if device related resource is unavailable, but driver can guarantee that IO dispatch will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE. Also, if driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls. BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc are using that magic value.

Suggested-by: Jens Axboe
Tested-by: Laurence Oberman
Signed-off-by: Ming Lei
Signed-off-by: Mike Snitzer
---
V4:
	- cleanup header and code comments
	- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
	- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
	- fix typo, and improvement document
	- add tested-by tag
V2:
	- add comments on the new introduced status
	- patch style fix
	- both are suggested by Christoph

 block/blk-core.c             |  1 +
 block/blk-mq.c               | 20 ++++++++++++++++----
 drivers/block/null_blk.c     |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c           |  5 ++---
 drivers/nvme/host/fc.c       | 12 ++----------
 drivers/scsi/scsi_lib.c      |  6 +++---
 include/linux/blk_types.h    | 14 ++++++++++++++
 9 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
 	[BLK_STS_MEDIUM]	= { -ENODATA,	"critical medium" },
 	[BLK_STS_PROTECTION]	= { -EILSEQ,	"protection" },
 	[BLK_STS_RESOURCE]	= { -ENOMEM,	"kernel resource" },
+	[BLK_STS_DEV_RESOURCE]	= { -ENOMEM,	"device resource" },
 	[BLK_STS_AGAIN]		= { -EAGAIN,	"nonblocking retry" },

 	/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..dd097ca5f1e9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 	return true;
 }

+#define BLK_MQ_QUEUE_DELAY	3		/* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			     bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	struct request *rq, *nxt;
 	bool no_tag = false;
 	int errors, queued;
+	blk_status_t ret = BLK_STS_OK;

 	if (list_empty(list))
 		return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	errors = queued = 0;
 	do {
 		struct blk_mq_queue_data bd;
-		blk_status_t ret;

 		rq = list_first_entry(list, struct request, queuelist);
 		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		}

 		ret = q->mq_ops->queue_rq(hctx, &bd);
-		if (ret == BLK_STS_RESOURCE) {
+		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			/*
 			 * If an I/O scheduler has been configured and we got a
 			 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	 * that is where we will continue on next queue run.
 	 */
 	if (!list_empty(list)) {
+		bool needs_restart;
+
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		 *   - Some but not all block drivers stop a queue before
 		 *     returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 		 *     and dm-rq.
+		 *
+		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+		 * bit is set, run queue after 10ms to avoid IO stalls
+		 * that could otherwise occur if the queue is idle.
 		 */
-		if (!blk_mq_sched_needs_restart(hctx) ||
+		needs_restart = blk_mq_sched_needs_restart(hctx);
+		if (!needs_restart ||
 		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
 			blk_mq_run_hw_queue(hctx, true);
+		else if (needs_restart && (ret == BLK_STS_RESOURCE))
+			blk_mq_delay_run_hw_queue(hctx,
Re: [GIT PULL] Block changes for 4.16-rc
On 1/29/18 1:14 PM, Linus Torvalds wrote: > On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe wrote: >> >> Yes of course, I can switch to using signed tags for pull requests >> for you. > > Thanks. I don't actually know the maintainership status of kernel.dk. > > You show up as the 'whois' contact, but I'm also assuming it doesn't > have the same kind of fairly draconian access control as the > kernel.org suite of sites, and I'm actively trying to make sure we > either use signed tags or the kernel.org repos (and honestly, even > with the kernel.org ones I tend to prefer signed tags). > > Right now I only _require_ it for public hosting, but I'm trying to > move to just signed tags being the default everywhere. The box is sitting right next to me in my office, which is locked. But it's not kernel.org, and I don't have a dedicated sysadmin. My tree does mirror to git.kernel.org, so they should be in sync (albeit with some delay). If you're fine with signed tags on git.kernel.dk, I'd prefer to continue using that, but just signing the pull requests. I sign all fio releases with the same key, so it'd be trivial for me to just sign pulls as well. -- Jens Axboe
Re: [GIT PULL] Block changes for 4.16-rc
On 1/29/18 1:02 PM, Linus Torvalds wrote: > On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe wrote: >> >> This is the main pull request for block IO related changes for the 4.16 >> kernel. Nothing major in this pull request, but a good amount of >> improvements and fixes all over the map. This pull request contains: > > Pulled. > > However, can I request that you please start using signed tags? > > I know you have a gpg key and know how to do signed tags, because in > the almost four hundred pulls that I've done from you over the many > years, I've actually got three such pull requests from you. > > Can we make that percentage go up a bit? Pretty please? Yes of course, I can switch to using signed tags for pull requests for you. -- Jens Axboe
Re: [GIT PULL] Block changes for 4.16-rc
On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe wrote: > > This is the main pull request for block IO related changes for the 4.16 > kernel. Nothing major in this pull request, but a good amount of > improvements and fixes all over the map. This pull request contains: Pulled. However, can I request that you please start using signed tags? I know you have a gpg key and know how to do signed tags, because in the almost four hundred pulls that I've done from you over the many years, I've actually got three such pull requests from you. Can we make that percentage go up a bit? Pretty please? Linus
Re: [PATCH v6 1/2] Return bytes transferred for partial direct I/O
On 01/29/2018 06:57 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues
>
> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
> index 6c00c1e2743f..72e213d62511 100644
> --- a/Documentation/sysctl/fs.txt
> +++ b/Documentation/sysctl/fs.txt
> @@ -76,6 +77,19 @@ dcache isn't pruned yet.
>
> ==
>
> +dio_short_writes:
> +
> +In case Direct I/O encounters an transient error, it returns a transient
> +the errorcode, even if it has performed part of the write.

                   error code,

> +This flag, if on (default), will return the number of bytes written
> +so far, as the write(2) symantics are. However, some older applications

                           semantics

> +still consider a direct write as an error if all of the I/O
> +submitted is not complete. ie write(file, count, buf) != count.

                              I.e.

> +This option can be disabled on systems in order to support
> +existing applications which do not expect short writes.

and if my system has a mix of older applications and new ones, will they all work just fine?

thanks,
-- 
~Randy
[LSF/MM TOPIC] Bringing Virtual Data Optimizer (VDO) to upstream
Hello, I'd like to talk about Virtual Data Optimizer (VDO), which is a device-mapper target that provides deduplication, compression, thin provisioning, and zero-block elimination for primary storage. Now that VDO is open-source, LSF/MM would be a good time to open a dialogue with other upstream developers as we work to bring VDO upstream. VDO was originally developed by Permabit Technology Corporation as a proprietary set of kernel modules and userspace tools. After being acquired by Red Hat, VDO has been re-licensed under the GNU General Public License, version 2. The VDO team has also been working much more closely with Mike Snitzer and other developers of device-mapper and LVM. The source for VDO is available at the following repositories: https://github.com/dm-vdo/kvdo (kernel modules) https://github.com/dm-vdo/vdo (userspace tools) As the README files in the repositories above indicate, development on VDO is currently focusing on upstream integration, which includes various coding standards compliance changes, refactoring efforts, and various other changes that become possible now that VDO is re-licensed under the GPL. That being said, VDO has been in production use since 2014 in its previous proprietary form, and continues to undergo thorough testing, including real-world test scenarios. Thanks, Bryan
Re: [PATCH 2/2] xfs: remove assert to check bytes returned
On 1/29/18 7:57 AM, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues > > Since we can return less than count in case of partial direct > writes, remove the ASSERT. > > Signed-off-by: Goldwyn Rodrigues > Acked-by: Dave Chinner > Reviewed-by: Darrick J. Wong Minor nit - you should probably reorder this patch before the other one, otherwise you'll have a point in the tree where the assert can knowingly trigger. -- Jens Axboe
Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE
On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote: > Not to mention, the request isn't added to the dispatch list yet in .queue_rq(); > strictly speaking, it is not correct to call blk_mq_delay_run_hw_queue() in > .queue_rq(), so the current block layer API can't handle it well enough. I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from inside .queue_rq(). Additionally, I have already explained to you in previous e-mails why it's fine to call that function from inside .queue_rq(): - Nobody has ever observed the race you described because the time after which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than the time during which that race exists. - It is easy to fix this race inside the block layer, namely by using call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to postpone the queue rerunning until after the request has been added back to the dispatch list. > > - The patch at the start of this thread introduces a regression in the > > SCSI core, namely a queue stall if a request completion occurs > > concurrently > > with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core. > > This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq() > to blk-mq; again, please explain in detail how this patch V3 introduces this > regression on SCSI. Please reread the following message: https://marc.info/?l=dm-devel=151672480107560. Thanks, Bart.
Re: [LSF/MM TOPIC] De-clustered RAID with MD
On 29/01/18 15:23, Johannes Thumshirn wrote: > Hi linux-raid, lsf-pc > > (If you've received this mail multiple times, I'm sorry, I'm having > trouble with the mail setup). My immediate reactions as a lay person (I edit the raid wiki) ... > > With the rise of bigger and bigger disks, array rebuilding times start > skyrocketing. And? Yes, your data is at risk during a rebuild, but md-raid throttles the i/o, so it doesn't hammer the system. > > In a paper from '92 Holland and Gibson [1] suggest a mapping algorithm > similar to RAID5, but instead of utilizing all disks in an array for > every I/O operation, implement a per-I/O mapping function to only > use a subset of the available disks. > > This has at least two advantages: > 1) If one disk has to be replaced, it's not needed to read the data from >all disks to recover the one failed disk, so non-affected disks can be >used for real user I/O and not just recovery, and Again, that's throttling, so that's not a problem ... > 2) an efficient mapping function can improve parallel I/O submission, as >two different I/Os are not necessarily going to the same disks in the >array. > > For the mapping function, a hashing algorithm like Ceph's CRUSH [2] > would be ideal, as it provides a pseudo random but deterministic mapping > for the I/O onto the drives. > > This whole declustering of course only makes sense for more than (at > least) 4 drives, but we do have customers with several orders of > magnitude more drives in an MD array. If you have four drives or more - especially if they are multi-terabyte drives - you should NOT be using raid-5 ... > > At LSF I'd like to discuss if: > 1) The wider MD audience is interested in de-clustered RAID with MD I haven't read the papers, so no comment, sorry. > 2) de-clustered RAID should be implemented as a sublevel of RAID5 or >as a new personality Neither! If you're going to do it, it should be raid-6. 
> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that >the NetApp E-Series Arrays do use CRUSH for parity declustering) > > [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf > [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf > [3] > https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf > Okay - I've now skimmed the crush paper [2]. Looks well interesting. BUT. It feels more like btrfs than it does like raid. Btrfs manages disks, and does raid, it tries to be the "everything between the hard drive and the file". This crush thingy reads to me like it wants to be the same. There's nothing wrong with that, but md is a unix-y "do one thing (raid) and do it well". My knee-jerk reaction is if you want to go for it, it sounds like a good idea. It just doesn't really feel a good fit for md. Cheers, Wol
[LSF/MM TOPIC] Two blk-mq related topics
Hi guys, Two blk-mq related topics: 1. blk-mq vs. CPU hotplug & IRQ vectors spread on CPUs We have made three big changes in this field before; each time some issues were fixed while new ones were introduced: 1) freeze all queues during the CPU hotplug handler - issues: queue dependencies such as loop-mq/dm vs. underlying queues, NVMe admin queue vs. namespace queues, and IO hangs may be caused while freezing all these queues in the CPU hotplug handler. 2) IRQ vectors spread on all present CPUs - fixes the issue in 1) - new issues introduced: doesn't support physical CPU hotplug, and causes a blk-mq warning during dispatch 3) IRQ vectors spread on all possible CPUs - can support physical CPU hotplug - the warning in __blk_mq_run_hw_queue() may still be triggered if CPU offline/online happens between blk_mq_hctx_next_cpu() and running __blk_mq_run_hw_queue() - new issues introduced: queue mapping may be distorted completely; a patch was sent out (https://marc.info/?t=15160323092=1=2), but may need further discussion about this approach; drivers (such as NVMe) may need to pass 'num_possible_cpus()' as the max vectors for allocating irq vectors; some drivers (NVMe) use a hard-coded hw queue index directly, and this way becomes very fragile, since the hw queue may be inactive from the beginning. Also, starting from 2), another issue is that IO completions may not be delivered to CPUs; for example, IO may be dispatched to a hw queue just before (or after) all CPUs mapped to the hctx become offline, and then the IRQ vector of the hw queue can be shut down. Now it seems we depend on the timeout handler to deal with the situation; is there a better way to solve this issue? 2. When to enable SCSI_MQ at default again? SCSI_MQ was first enabled in V3.17, but disabled at default. In V4.13-rc1, it was enabled at default, but the patch was later reverted in V4.13-rc7, and it became disabled at default again. Now both the originally reported PM issue (actually SCSI quiesce) and the sequential IO performance issue have been addressed. 
And MQ IO schedulers are now ready for traditional disks too. Are there other issues to be addressed before enabling SCSI_MQ by default? When can we do that again? Last time, the two issues were reported during the v4.13 dev cycle just after it was enabled by default; it seems that as long as SCSI_MQ isn't enabled by default, it won't be exposed to complete and full testing. So if we continue to disable it by default, it may never be exposed to full test/production environments.

Thanks,
Ming
[GIT PULL] Block changes for 4.16-rc
Hi Linus,

This is the main pull request for block IO related changes for the 4.16 kernel. Nothing major in this pull request, but a good amount of improvements and fixes all over the map. This pull request contains:

- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo.

- Support for SMR zones for deadline and mq-deadline from Damien and Christoph.

- Set of fixes for bcache by way of Michael Lyle, including fixes from himself, Kent, Rui, Tang, and Coly.

- Series from Matias for lightnvm with fixes from Hans Holmberg, Javier, and Matias. Mostly centered around pblk, and removing rrpc 1.2 in preparation for supporting 2.0.

- A couple of NVMe pull requests from Christoph. Nothing major in here, just fixes and cleanups, and support for command tracing from Johannes.

- Support for blk-throttle for tracking reads and writes separately. From Joseph Qi. A few cleanups/fixes also for blk-throttle from Weiping.

- Series from Mike Snitzer that enables dm to register its queue more logically, something that's always been problematic on dm since it's a stacked device.

- Series from Ming cleaning up some of the bio accessor use, in preparation for supporting multipage bvecs.

- Various fixes from Ming closing up holes around queue mapping and quiescing.

- BSD partition fix from Richard Narron, fixing a problem where we can't mount newer (10/11) FreeBSD partitions.

- Series from Tejun reworking blk-mq timeout handling. The previous scheme relied on atomic bits, but it had races where we would think a request had timed out if it got reused at the wrong time.

- null_blk now supports faking timeouts, to enable us to better exercise and test that functionality separately. From me.

- Kill the separate atomic poll bit in the request struct. After this, we don't use the atomic bits on blk-mq anymore at all. From me.

- sgl_alloc/free helpers from Bart.

- Heavily contended tag case scalability improvement from me.
- Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas, Eryu, Goldwyn, and myself.

Note that you'll get a single merge conflict when pulling this in, due to the change:

    commit 4ccafe032005e9b96acbef2e389a4de5b1254add
    Author: Jens Axboe
    Date:   Wed Dec 20 13:13:58 2017 -0700

        block: unalign call_single_data in struct request

that went into your tree after I had forked off for-4.16/block. The resolution is trivial, just ensure that the moved 'csd' struct member retains the type of struct __call_single_data, not call_single_data_t.

Please pull!

  git://git.kernel.dk/linux-block.git for-4.16/block

Angelo Ruocco (2):
      block, bfq: check low_latency flag in bfq_bfqq_save_state()
      block, bfq: remove superfluous check in queue-merging setup

Arnd Bergmann (3):
      DAC960: split up ioctl function to reduce stack size
      null_blk: remove explicit 'select FAULT_INJECTION'
      blkcg: simplify statistic accumulation code

Bart Van Assche (20):
      pktcdvd: Fix pkt_setup_dev() error path
      pktcdvd: Fix a recently introduced NULL pointer dereference
      lib/scatterlist: Introduce sgl_alloc() and sgl_free()
      crypto: scompress - use sgl_alloc() and sgl_free()
      nvmet/fc: Use sgl_alloc() and sgl_free()
      nvmet/rdma: Use sgl_alloc() and sgl_free()
      target: Use sgl_alloc_order() and sgl_free()
      blk-mq: Fix spelling in a source code comment
      block: Fix kernel-doc warnings reported when building with W=1
      blk-mq: Explain when 'active_queues' is decremented
      blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()
      blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
      block: Fix __bio_integrity_endio() documentation
      block: Unexport elv_register_queue() and elv_unregister_queue()
      block: Document scheduler modification locking requirements
      block: Protect less code with sysfs_lock in blk_{un,}register_queue()
      lib/scatterlist: Fix chaining support in sgl_alloc_order()
      blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
      blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
      block: Remove kblockd_schedule_delayed_work{,_on}()

Chiara Bruschi (1):
      block, bfq: fix occurrences of request finish method's old name

Christoph Hellwig (5):
      block: introduce zoned block devices zone write locking
      genirq/affinity: assign vectors to all possible CPUs
      blk-mq: simplify queue mapping & schedule with each possisble CPU
      nvme-pci: clean up CMB initialization
      nvme-pci: clean up SMBSZ bit definitions

Coly Li (2):
      bcache: reduce cache_set devices iteration by devices_max_used
      bcache: fix misleading error message in bch_count_io_errors()

Corentin Labbe (1):
      block: remove smart1,2.h

Damien Le Moal (4):
      mq-deadline: Introduce dispatch
[LSF/MM TOPIC] De-clustered RAID with MD
Hi linux-raid, lsf-pc

(If you've received this mail multiple times, I'm sorry, I'm having trouble with the mail setup).

With the rise of bigger and bigger disks, array rebuild times start skyrocketing.

In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm similar to RAID5, but instead of utilizing all disks in an array for every I/O operation, they implement a per-I/O mapping function to only use a subset of the available disks.

This has at least two advantages:
1) If one disk has to be replaced, it's not necessary to read the data from all disks to recover the one failed disk, so non-affected disks can be used for real user I/O and not just recovery, and
2) an efficient mapping function can improve parallel I/O submission, as two different I/Os are not necessarily going to the same disks in the array.

For the mapping function, a hashing algorithm like Ceph's CRUSH [2] would be ideal, as it provides a pseudo-random but deterministic mapping of the I/O onto the drives.

This whole declustering of course only makes sense for more than (at least) 4 drives, but we do have customers with several orders of magnitude more drives in an MD array.

At LSF I'd like to discuss if:
1) The wider MD audience is interested in de-clustered RAID with MD
2) de-clustered RAID should be implemented as a sublevel of RAID5 or as a new personality
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that the NetApp E-Series arrays do use CRUSH for parity declustering)

[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3] https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Thanks,
Johannes
--
Johannes Thumshirn
Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr.
5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[PATCH 2/2] xfs: remove assert to check bytes returned
From: Goldwyn Rodrigues

Since we can return less than count in case of partial direct writes, remove the ASSERT.

Signed-off-by: Goldwyn Rodrigues
Acked-by: Dave Chinner
Reviewed-by: Darrick J. Wong
---
 fs/xfs/xfs_file.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..8fc4dbf66910 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
 	xfs_iunlock(ip, iolock);
-
-	/*
-	 * No fallback to buffered IO on errors for XFS, direct IO will either
-	 * complete fully or fail.
-	 */
-	ASSERT(ret < 0 || ret == count);
 	return ret;
 }
--
2.16.1
[PATCH v6 1/2] Return bytes transferred for partial direct I/O
From: Goldwyn Rodrigues

In case direct I/O encounters an error midway, it returns the error. Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, the file content has the data written in step 3.

Added a sysctl entry: dio_short_writes, which is on by default. This is to support applications which expect either an error or the number of bytes submitted as a return value for the write calls.

This fixes fstest generic/472.

Signed-off-by: Goldwyn Rodrigues
---
Changes since v1:
 - incorporated iomap and block devices
Changes since v2:
 - realized that file size was not increasing when performing a (partial) direct I/O because the end_io function was receiving the error instead of the size. Fixed.
Changes since v3:
 - [hch] initialize transferred with dio->size and use transferred instead of dio->size.
Changes since v4:
 - Refreshed to v4.14
Changes since v5:
 - Added /proc/sys/fs/dio_short_writes (default 1) to guard older applications which expect write(fd, buf, count) to return either count or an error.

 Documentation/sysctl/fs.txt | 14 ++
 fs/block_dev.c              |  2 +-
 fs/direct-io.c              |  7 +--
 fs/iomap.c                  | 23 ---
 include/linux/fs.h          |  1 +
 kernel/sysctl.c             |  9 +
 6 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 6c00c1e2743f..72e213d62511 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -22,6 +22,7 @@ Currently, these files are in /proc/sys/fs:
 - aio-max-nr
 - aio-nr
 - dentry-state
+- dio_short_writes
 - dquot-max
 - dquot-nr
 - file-max
@@ -76,6 +77,19 @@ dcache isn't pruned yet.

 ==

+dio_short_writes:
+
+In case Direct I/O encounters a transient error, it returns
+the error code, even if it has performed part of the write.
+This flag, if on (default), will return the number of bytes written
+so far, as per the write(2) semantics. However, some older applications
+still consider a direct write an error if all of the I/O
+submitted is not complete, i.e. write(fd, buf, count) != count.
+This option can be disabled on systems in order to support
+existing applications which do not expect short writes.
+
+==
+
 dquot-max & dquot-nr:

 The file dquot-max shows the maximum number of cached disk

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..49d94360bb51 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 	if (!ret)
 		ret = blk_status_to_errno(dio->bio.bi_status);
-	if (likely(!ret))
+	if (likely(dio->size))
 		ret = dio->size;

 	bio_put(&dio->bio);

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3aafb3343a65..9bd15be64c25 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -151,6 +151,7 @@ struct dio {
 } cacheline_aligned_in_smp;

 static struct kmem_cache *dio_cache __read_mostly;
+unsigned int sysctl_dio_short_writes = 1;

 /*
  * How many pages are in the queue?
@@ -262,7 +263,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 		ret = dio->page_errors;
 	if (ret == 0)
 		ret = dio->io_error;
-	if (ret == 0)
+	if (!sysctl_dio_short_writes && (ret == 0))
 		ret = transferred;

 	if (dio->end_io) {
@@ -310,7 +311,9 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
 	}

 	kmem_cache_free(dio_cache, dio);
-	return ret;
+	if (!sysctl_dio_short_writes)
+		return ret;
+	return transferred ? transferred : ret;
 }

 static void dio_aio_complete_work(struct work_struct *work)

diff --git a/fs/iomap.c b/fs/iomap.c
index 47d29ccffaef..a8d6908dc0de 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -716,23 +716,24 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	struct kiocb *iocb = dio->iocb;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
-	ssize_t ret;
+	ssize_t err;
+	ssize_t transferred = dio->size;

 	if (dio->end_io) {
-		ret = dio->end_io(iocb,
-				dio->error ? dio->error : dio->size,
-				dio->flags);
+		err = dio->end_io(iocb,
+				(transferred && sysctl_dio_short_writes) ?
+				transferred : dio->error,
+
Re: [PATCH v4 13/13] bcache: add stop_when_cache_set_failed to struct cached_dev
On 29/01/2018 8:57 PM, Nix wrote:
> On 27 Jan 2018, Coly Li said:
>
>> Current bcache failure handling code will stop all attached bcache devices
>> when the cache set is broken or disconnected. This is desired behavior for
>> most of enterprise or cloud use cases, but maybe not for low end
>> configuration. Nix points out, users may still want to
>> access the bcache device after cache device failed, for example on laptops.
>
> Actually I'm much interested in the server use case. On laptops, it's
> relatively easy to recover if you know what you're doing, because they
> usually have a user in front of them with console access -- but if a
> remote headless server with a hundred users a thousand miles away has
> its cache device wear out I would really rather the hundred users get
> served, if more slowly, rather than the whole machine going down for
> hours or days until I can get someone there to bash on the hardware!
>

Hi Nix,

Thanks for the input, I didn't think of such a use case before. It makes a lot of sense!

> (Sure, ideally you'd detect the wearing out in advance, but SSDs are not
> always nice and helpful like that and sometimes just go instantly
> readonly or simply vanish off the bus entirely without warning.)
>

Yes. Then in the v5 patch set, I will add an option for "always"/"auto", which will leave the bcache device alive if the broken cache set is clean.

Thank you all again, for the insight and brilliant suggestions!

Coly Li
Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly
On 29/01/2018 8:22 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> Hello Coly:
>
> There are some differences,
> Using a variable of atomic_t type can not guarantee the atomicity of
> the transaction.
> for example:
> A thread runs in update_writeback_rate()
> update_writeback_rate(){
>
> +	if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
> +		schedule_delayed_work(&dc->writeback_rate_update,
> 				      dc->writeback_rate_update_seconds * HZ);
> +	}
>
> Then another thread executes in cached_dev_detach_finish():
> 	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
> 		cancel_writeback_rate_update_dwork(dc);
>
> +
> +	/*
> +	 * should check BCACHE_DEV_RATE_DW_RUNNING before calling
> +	 * cancel_delayed_work_sync().
> +	 */
> +	clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
> +	/* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
> +	smp_mb();
>
> Race still exists.
>

Hi Junhui,

Check super.c:cancel_writeback_rate_update_dwork(), BCACHE_DEV_RATE_DW_RUNNING is checked there. You may see that in cached_dev_detach_finish() and update_writeback_rate(), the orders to check BCACHE_DEV_RATE_DW_RUNNING and BCACHE_DEV_WB_RUNNING are different:

  cached_dev_detach_finish()        update_writeback_rate()
  test_and_clear_bit                set_bit
    BCACHE_DEV_WB_RUNNING             BCACHE_DEV_RATE_DW_RUNNING
  (implicit smp_mb())               smp_mb()
  test_bit                          test_bit
    BCACHE_DEV_RATE_DW_RUNNING        BCACHE_DEV_WB_RUNNING
                                    clear_bit
                                      BCACHE_DEV_RATE_DW_RUNNING
                                    smp_mb()

These two flags are accessed in reversed order in the two locations, and there is an smp_mb() between the accesses to the two flags to serialize the access order. From this reverse-ordered accessing, it is sure that
- in cached_dev_detach_finish(), before test_bit(BCACHE_DEV_RATE_DW_RUNNING), bit BCACHE_DEV_WB_RUNNING must be cleared already.
- in update_writeback_rate(), before test_bit(BCACHE_DEV_WB_RUNNING), BCACHE_DEV_RATE_DW_RUNNING must be set already.
Therefore in your example, if a thread is testing BCACHE_DEV_WB_RUNNING in update_writeback_rate(), it means BCACHE_DEV_RATE_DW_RUNNING must be set already. So in cancel_writeback_rate_update_dwork() the other thread must wait until BCACHE_DEV_RATE_DW_RUNNING is cleared, and only then can cancel_delayed_work_sync() be called. And in update_writeback_rate() the bit BCACHE_DEV_RATE_DW_RUNNING is cleared after schedule_delayed_work() returns, so the race is killed.

A mutex lock implies an implicit memory barrier, and in your suggestion up_read(&dc->writeback_lock) is after schedule_delayed_work() too. This is why I said they are almost the same.

Thanks.

Coly Li

>>
>> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
>>> From: Tang Junhui
>>>
>>> Hello Coly:
>>>
>>> This patch is somewhat difficult for me,
>>> I think we can resolve it in a simple way.
>>>
>>> We can take the schedule_delayed_work() under the protection of
>>> dc->writeback_lock, and judge if we need to re-arm this work to the queue.
>>>
>>> static void update_writeback_rate(struct work_struct *work)
>>> {
>>> 	struct cached_dev *dc = container_of(to_delayed_work(work),
>>> 					     struct cached_dev,
>>> 					     writeback_rate_update);
>>>
>>> 	down_read(&dc->writeback_lock);
>>>
>>> 	if (atomic_read(&dc->has_dirty) &&
>>> 	    dc->writeback_percent)
>>> 		__update_writeback_rate(dc);
>>>
>>> -	up_read(&dc->writeback_lock);
>>> +	if (NEED_RE-ARMING)
>>> 		schedule_delayed_work(&dc->writeback_rate_update,
>>> 				      dc->writeback_rate_update_seconds * HZ);
>>> +	up_read(&dc->writeback_lock);
>>> }
>>>
>>> In cached_dev_detach_finish() and cached_dev_free() we can set the no need
>>> flag under the protection of dc->writeback_lock, for example:
>>>
>>> static void cached_dev_detach_finish(struct work_struct *w)
>>> {
>>> 	...
>>>
>>
>> Hi Junhui,
>>
>> Your suggestion is essentially almost the same as my patch,
>> - clearing the BCACHE_DEV_DETACHING bit acts as SET NO NEED RE-ARM FLAG.
>> - cancel_writeback_rate_update_dwork acts as some kind of locking with a
>> timeout.
>>
>> The difference is I don't use dc->writeback_lock, and replace it by
>> BCACHE_DEV_RATE_DW_RUNNING.
>>
>> The reason is my following development. I plan to implement a real-time
>> update of stripe_sectors_dirty of the bcache device and cache set, then
>> bcache_flash_devs_sectors_dirty() can be very fast and bch_register_lock
>> can be removed here. And then I also plan
Re: [PATCH v4 13/13] bcache: add stop_when_cache_set_failed to struct cached_dev
On 27 Jan 2018, Coly Li said:

> Current bcache failure handling code will stop all attached bcache devices
> when the cache set is broken or disconnected. This is desired behavior for
> most of enterprise or cloud use cases, but maybe not for low end
> configuration. Nix points out, users may still want to
> access the bcache device after cache device failed, for example on laptops.

Actually I'm much interested in the server use case. On laptops, it's relatively easy to recover if you know what you're doing, because they usually have a user in front of them with console access -- but if a remote headless server with a hundred users a thousand miles away has its cache device wear out I would really rather the hundred users get served, if more slowly, rather than the whole machine going down for hours or days until I can get someone there to bash on the hardware!

(Sure, ideally you'd detect the wearing out in advance, but SSDs are not always nice and helpful like that and sometimes just go instantly readonly or simply vanish off the bus entirely without warning.)

--
NULL && (void)
Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly
From: Tang Junhui

Hello Coly:

There are some differences,
Using a variable of atomic_t type can not guarantee the atomicity of
the transaction.
for example:
A thread runs in update_writeback_rate()
update_writeback_rate(){

+	if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
+		schedule_delayed_work(&dc->writeback_rate_update,
				      dc->writeback_rate_update_seconds * HZ);
+	}

Then another thread executes in cached_dev_detach_finish():
	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
		cancel_writeback_rate_update_dwork(dc);

+
+	/*
+	 * should check BCACHE_DEV_RATE_DW_RUNNING before calling
+	 * cancel_delayed_work_sync().
+	 */
+	clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
+	/* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
+	smp_mb();

Race still exists.

>
> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
> > From: Tang Junhui
> >
> > Hello Coly:
> >
> > This patch is somewhat difficult for me,
> > I think we can resolve it in a simple way.
> >
> > We can take the schedule_delayed_work() under the protection of
> > dc->writeback_lock, and judge if we need to re-arm this work to the queue.
> >
> > static void update_writeback_rate(struct work_struct *work)
> > {
> > 	struct cached_dev *dc = container_of(to_delayed_work(work),
> > 					     struct cached_dev,
> > 					     writeback_rate_update);
> >
> > 	down_read(&dc->writeback_lock);
> >
> > 	if (atomic_read(&dc->has_dirty) &&
> > 	    dc->writeback_percent)
> > 		__update_writeback_rate(dc);
> >
> > -	up_read(&dc->writeback_lock);
> > +	if (NEED_RE-ARMING)
> > 		schedule_delayed_work(&dc->writeback_rate_update,
> > 				      dc->writeback_rate_update_seconds * HZ);
> > +	up_read(&dc->writeback_lock);
> > }
> >
> > In cached_dev_detach_finish() and cached_dev_free() we can set the no need
> > flag under the protection of dc->writeback_lock, for example:
> >
> > static void cached_dev_detach_finish(struct work_struct *w)
> > {
> > 	...
> > +	down_write(&dc->writeback_lock);
> > +	SET NO NEED RE-ARM FLAG
> > +	up_write(&dc->writeback_lock);
> > 	cancel_delayed_work_sync(&dc->writeback_rate_update);
> > }
> >
> > I think this way is more simple and readable.
> >
>
> Hi Junhui,
>
> Your suggestion is essentially almost the same as my patch,
> - clearing the BCACHE_DEV_DETACHING bit acts as SET NO NEED RE-ARM FLAG.
> - cancel_writeback_rate_update_dwork acts as some kind of locking with a
> timeout.
>
> The difference is I don't use dc->writeback_lock, and replace it by
> BCACHE_DEV_RATE_DW_RUNNING.
>
> The reason is my following development. I plan to implement a real-time
> update of stripe_sectors_dirty of the bcache device and cache set, then
> bcache_flash_devs_sectors_dirty() can be very fast and bch_register_lock
> can be removed here. And then I also plan to remove the reference of
> dc->writeback_lock in update_writeback_rate() because indeed it is
> unnecessary here (the patch is held by Mike's locking resort work).
>
> Since I plan to remove dc->writeback_lock from update_writeback_rate(),
> I don't want to reference dc->writeback in the delayed work.
>
> The basic idea behind your suggestion and this patch is almost
> identical. The only difference might be the timeout in
> cancel_writeback_rate_update_dwork().
>
> Thanks.
>
> Coly Li

Thanks.
Tang Junhui
Re: [PATCH] bcache: finish incremental GC
On 29/01/2018 7:58 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> In the GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but in the current code we didn't implement
> it. When GC runs, front side IO would be blocked until the GC is over, which
> can be a long time if there are a lot of btree nodes.
>
> This patch implements incremental GC. The main idea is that, when there
> are front side I/Os, after GC of some nodes (100), we stop GC, release the
> lock of the btree node, and go to process the front side I/Os for some time
> (100 ms), then go back to GC again.
>
> With this patch, when we are doing GC, I/Os are not blocked all the time,
> and there is no obvious problem of I/O dropping to zero any more.
>
> Signed-off-by: Tang Junhui

Hi Junhui,

It sounds great :-) Currently I am not very familiar with the gc code, and I need some time to understand how bcache GC works before providing my feedback. Please give me some time, I hope I am able to provide my comments after the Chinese Spring Festival.

Hope other people may have a look before I am able to.

Thanks.

Coly Li

[code snip]
[PATCH] bcache: finish incremental GC
From: Tang Junhui

In the GC thread, we record the latest GC key in gc_done, which is expected to be used for incremental GC, but in the current code we didn't implement it. When GC runs, front side IO would be blocked until the GC is over, which can be a long time if there are a lot of btree nodes.

This patch implements incremental GC. The main idea is that, when there are front side I/Os, after GC of some nodes (100), we stop GC, release the lock of the btree node, and go to process the front side I/Os for some time (100 ms), then go back to GC again.

With this patch, when we are doing GC, I/Os are not blocked all the time, and there is no obvious problem of I/O dropping to zero any more.

Signed-off-by: Tang Junhui
---
 drivers/md/bcache/bcache.h  |  5 +
 drivers/md/bcache/btree.c   | 15 ++-
 drivers/md/bcache/request.c |  3 +++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index abd31e8..c047aff 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -445,6 +445,7 @@ struct cache {

 struct gc_stat {
 	size_t			nodes;
+	size_t			nodes_pre;
 	size_t			key_bytes;

 	size_t			nkeys;
@@ -568,6 +569,10 @@ struct cache_set {
 	 */
 	atomic_t		rescale;
 	/*
+	 * used for GC, identify if any front side I/Os is inflight
+	 */
+	atomic_t		io_inflight;
+	/*
 	 * When we invalidate buckets, we use both the priority and the amount
 	 * of good data to determine which buckets to reuse first - to weight
 	 * those together consistently we keep track of the smallest nonzero

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 658c54b..48a4d18 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,6 +90,9 @@
 #define MAX_NEED_GC		64
 #define MAX_SAVE_PRIO		72
+#define MIN_GC_NODES		100
+#define GC_SLEEP_TIME		100

 #define PTR_DIRTY_BIT		(((uint64_t) 1 << 36))

@@ -1573,6 +1576,13 @@ static int btree_gc_recurse(struct btree *b, struct btree_op *op,
 		memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
 		r->b = NULL;

+		if (atomic_read(&b->c->io_inflight) &&
+		    gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+			gc->nodes_pre = gc->nodes;
+			ret = -EAGAIN;
+			break;
+		}
+
 		if (need_resched()) {
 			ret = -EAGAIN;
 			break;
@@ -1742,7 +1752,10 @@ static void bch_btree_gc(struct cache_set *c)
 		closure_sync();
 		cond_resched();

-		if (ret && ret != -EAGAIN)
+		if (ret == -EAGAIN)
+			schedule_timeout_interruptible(msecs_to_jiffies
+						       (GC_SLEEP_TIME));
+		else if (ret)
 			pr_warn("gc failed!");
 	} while (ret);

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3475d66..a8b1226 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -628,7 +628,9 @@ static void do_bio_hook(struct search *s, struct bio *orig_bio)
 static void search_free(struct closure *cl)
 {
 	struct search *s = container_of(cl, struct search, cl);
+
 	bio_complete(s);
+	atomic_dec(&s->d->c->io_inflight);

 	if (s->iop.bio)
 		bio_put(s->iop.bio);
@@ -646,6 +648,7 @@ static inline struct search *search_alloc(struct bio *bio,
 	closure_init(&s->cl, NULL);
 	do_bio_hook(s, bio);
+	atomic_inc(&d->c->io_inflight);

 	s->orig_bio		= bio;
 	s->cache_miss		= NULL;
--
1.8.3.1
Re: [LSF/MM TOPIC] Springfield: smart and automated storage
On Sat, Jan 27, 2018 at 5:32 PM, James Bottomley wrote:
> On Sat, 2018-01-27 at 09:37 +0100, Jan Tulak wrote:
>> Springfield is a collection of projects unifying multiple levels of
>> the storage stack and providing a general API for automation, health
>> and status monitoring, as well as sane and easy configuration across
>> multiple levels of the storage stack. It is a scalable solution,
>> working from a single node to large deployments with a mix of bare
>> metal, containers and VMs. It was presented for the first time at
>> Vault 2017 in a birds-of-a-feather session.
>>
>> Springfield builds upon and enhances the existing work of these
>> projects:
>> * udisks
>> * libblockdev
>> * blivet
>> * libstoragemgmt
>>
>> It is a coordinated effort to overcome many of the shortcomings of
>> the current situation and provide a unified approach to storage
>> management, so there is nothing like a binary or a library with the
>> name "Springfield."
>>
>> An example of things we are trying to tackle is useful storage
>> reporting, like notifications about a filesystem being full, especially
>> with thin provisioning and multiple levels of the storage stack, where
>> the user-visible status is not a true representation of what is
>> happening underneath.
>>
>> We want to see how the community sees this effort. What shortcomings
>> of the current situation in storage layering can Springfield address
>> (is there something we don't see, but a project like this could
>> help)? How would you use it, or what would make it more useful to
>> you?
>
> Do you have a link to the project page and the source tree(s)?
>

Sure, for Springfield the website is:
https://springfield-project.github.io/

Each of the listed projects has its own repository:
https://github.com/storaged-project/udisks
https://github.com/libstorage/libstoragemgmt
https://github.com/storaged-project/libblockdev
https://github.com/storaged-project/blivet

Cheers,
Jan