Re: [dm-devel] [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 6:14 AM, Mike Snitzer  wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche  wrote:
>
>> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> > But regardless of which wins the race, the queue will have been run.
>> > Which is all we care about right?
>>
>> Running the queue is not sufficient. With this patch applied it can happen
>> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
>> concurrent queue runs finish before sufficient device resources are available
>> to execute the request and that blk_mq_delay_run_hw_queue() does not get
>> called at all. If no other activity triggers a queue run, e.g. request
>> completion, this will result in a queue stall.
>
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
>
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Thinking about it further, maybe you can add the following description in V5;
it should make it much easier for driver developers to follow:

When a resource allocation fails, if the driver can make sure there is at
least one in-flight IO, it is safe to return BLK_STS_DEV_RESOURCE to blk-mq;
that is exactly what scsi_queue_rq() does.

The theory is as follows:

1) on a resource allocation failure, the driver returns BLK_STS_DEV_RESOURCE
if it can see there is at least one in-flight IO;

2) if all of these in-flight IOs complete before blk_mq_dispatch_rq_list()
examines SCHED_RESTART, SCHED_RESTART must have been cleared, so the queue
is run immediately in this case by blk_mq_dispatch_rq_list();

3) if there is still in-flight IO when blk_mq_dispatch_rq_list() examines
SCHED_RESTART:
- if SCHED_RESTART isn't set, the queue is run immediately, as in 2);
- otherwise, this request will be dispatched via blk_mq_sched_restart()
  once an in-flight IO completes, since the request has already been added
  to hctx->dispatch.

And there are two invariants when the driver returns BLK_STS_DEV_RESOURCE
iff there is in-flight IO:

1) SCHED_RESTART must be zero if there is no in-flight IO;
2) there has to be IO in flight if SCHED_RESTART reads as 1.
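For illustration, a minimal sketch of what this means for a driver's
.queue_rq() (all my_* names are hypothetical, not taken from any in-tree
driver):

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                const struct blk_mq_queue_data *bd)
{
        struct my_dev *dev = hctx->queue->queuedata;

        if (!my_alloc_resource(dev, bd->rq)) {
                /* in-flight IO guarantees a completion, and the completion
                 * path reruns the queue via SCHED_RESTART */
                if (atomic_read(&dev->inflight) > 0)
                        return BLK_STS_DEV_RESOURCE;
                /* nothing in flight: nobody is guaranteed to rerun the
                 * queue, so let blk-mq arm its delayed run */
                return BLK_STS_RESOURCE;
        }

        atomic_inc(&dev->inflight);
        my_issue(dev, bd->rq);
        return BLK_STS_OK;
}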


Thanks,
Ming


Re: [2/2] block: fix blk_rq_append_bio

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 02:42:41AM +0100, Jiri Palecek wrote:
> Hello,
> 
> I see a problem with this patch:
> 
> Ming Lei  writes:
> 
> > From: Jens Axboe 
> >
> > Commit caa4b02476e3(blk-map: call blk_queue_bounce from blk_rq_append_bio)
> > moves blk_queue_bounce() into blk_rq_append_bio(), but don't consider
> > that the bounced bio becomes invisiable to caller since the parameter type 
> > is
> > 'struct bio *', which should have been 'struct bio **'.
> >
> > This patch fixes this issue by passing 'struct bio **' to
> > blk_rq_append_bio(), then the bounced bio can be returned to caller.
> >
> > Also failure handling is considered too.
> >
> > Fixes: caa4b02476e3 ("blk-map: call blk_queue_bounce from 
> > blk_rq_append_bio")
> > Cc: Christoph Hellwig 
> > Reported-by: Michele Ballabio 
> > (handling failure of blk_rq_append_bio(), only call bio_get() after
> > blk_rq_append_bio() returns OK)
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-map.c| 38 
> > ++
> >  drivers/scsi/osd/osd_initiator.c   |  4 +++-
> >  drivers/target/target_core_pscsi.c |  4 ++--
> >  include/linux/blkdev.h |  2 +-
> >  4 files changed, 28 insertions(+), 20 deletions(-)
> >
> > diff --git a/block/blk-map.c b/block/blk-map.c
> > index b21f8e86f120..d3a94719f03f 100644
> > --- a/block/blk-map.c
> > +++ b/block/blk-map.c
> > @@ -12,22 +12,29 @@
> >  #include "blk.h"
> >  
> >  /*
> > - * Append a bio to a passthrough request.  Only works can be merged into
> > - * the request based on the driver constraints.
> > + * Append a bio to a passthrough request.  Only works if the bio can be 
> > merged
> > + * into the request based on the driver constraints.
> >   */
> > -int blk_rq_append_bio(struct request *rq, struct bio *bio)
> > +int blk_rq_append_bio(struct request *rq, struct bio **bio)
> >  {
> > -   blk_queue_bounce(rq->q, &bio);
> > +   struct bio *orig_bio = *bio;
> > +
> > +   blk_queue_bounce(rq->q, bio);
> >  
> > if (!rq->bio) {
> > -   blk_rq_bio_prep(rq->q, rq, bio);
> > +   blk_rq_bio_prep(rq->q, rq, *bio);
> > } else {
> > -   if (!ll_back_merge_fn(rq->q, rq, bio))
> > +   if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> > +   if (orig_bio != *bio) {
> > +   bio_put(*bio);
> > +   *bio = orig_bio;
> 
> 1) Suppose the bio *does* bounce, which allocates some pages from the
> DMA pool, and subsequently this test fails. bio_put won't deallocate
> those pages. Prior to caa4b02476e3, it would be done by bio_endio, which
> is IMHO the correct way to do it.

You are right, that is my fault; I will review your patch later.

> 
> 2) What if the bio bounces and goes into the splitting path of
> __blk_queue_bounce. That would submit the request, which would be a
> problem should this test or some other later fail, cancelling the whole
> request.

When the bio comes here, bio_is_passthrough() is true, so it won't go into
that path; no need to worry about it.
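For reference, the caller-visible effect of the new prototype is roughly the
following (a hedged sketch, not a specific in-tree call site):

        struct bio *bio = ...;                  /* caller-built passthrough bio */
        int ret;

        ret = blk_rq_append_bio(rq, &bio);      /* may replace bio with a bounce bio */
        if (ret) {
                /* per the patch above, *bio points at the original again on failure */
                return ret;
        }
        /* from here on, use the (possibly bounced) bio; e.g. only now bio_get(bio) */
        bio_get(bio);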

Thanks,
Ming


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 03:37:38AM +, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 11:31 +0800, Ming Lei wrote:
> > Please take a look at the drivers: when BLK_STS_RESOURCE is returned, who
> > will call blk_mq_delay_run_hw_queue() for them?
> 
> As you know the SCSI and dm drivers in kernel v4.15 already call that function
> whenever necessary.

We have concluded that this is a generic issue which needs a generic solution [1].

[1] https://marc.info/?l=linux-kernel&m=151638176727612&w=2

[ming@ming linux]$ git grep -n BLK_STS_RESOURCE ./drivers/ | wc -l
43

Almost every driver needs this kind of change if BLK_STS_RESOURCE is returned
from the IO path. And the failure can be caused by kmalloc(GFP_ATOMIC), DMA
mapping, or whatever else.

Thanks,
Ming


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 01:11:22AM +, Bart Van Assche wrote:
> On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> > On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> > > - It is easy to fix this race inside the block layer, namely by using
> > >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> > >   postpone the queue rerunning until after the request has been added 
> > > back to
> > >   the dispatch list.
> > 
> That is easy to say, but can you cook a patch and fix all the drivers first?
> 
> Please reread what I wrote. I proposed to change the 
> blk_mq_delay_run_hw_queue()
> IMPLEMENTATION such that the callers do not have to be modified.

Please take a look at the drivers: when BLK_STS_RESOURCE is returned, who
will call blk_mq_delay_run_hw_queue() for them?

> 
> > [ ... ] Later, you admitted you understood the patch wrong. [ ... ]
> 
> That's nonsense. I never wrote that.

Believe it or not, here are the link and your reply:

https://marc.info/?l=dm-devel&m=151672694508389&w=2

> So what is wrong with this way?

>Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
>reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
>by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
>race condition in code where there was no race condition.
>
>Bart.


-- 
Ming


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 5:51 AM, Bart Van Assche  wrote:
> On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
>> But regardless of which wins the race, the queue will have been run.
>> Which is all we care about right?
>
> Running the queue is not sufficient. With this patch applied it can happen
> that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> concurrent queue runs finish before sufficient device resources are available
> to execute the request and that blk_mq_delay_run_hw_queue() does not get
> called at all. If no other activity triggers a queue run, e.g. request
> completion, this will result in a queue stall.

Please see the documentation of BLK_STS_DEV_RESOURCE:

+ * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
+ * related resource is unavailable, but driver can guarantee that queue
+ * will be rerun in future once the resource is available (whereby
+ * dispatching requests).

I have explained SCSI's BLK_STS_DEV_RESOURCE conversion in
another thread; let me know if you have further concerns:

https://marc.info/?l=linux-block&m=151727454815018&w=2

-- 
Ming Lei


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Tue, Jan 30, 2018 at 5:22 AM, Bart Van Assche  wrote:
> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
>> +  * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> +  * bit is set, run queue after 10ms to avoid IO stalls
>> +  * that could otherwise occur if the queue is idle.
>>*/
>> - if (!blk_mq_sched_needs_restart(hctx) ||
>> + needs_restart = blk_mq_sched_needs_restart(hctx);
>> + if (!needs_restart ||
>>   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>   blk_mq_run_hw_queue(hctx, true);
>> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>>   }
>
> The above code only calls blk_mq_delay_run_hw_queue() if the following
> condition is met: !(!needs_restart || (no_tag &&
> list_empty_careful(&hctx->dispatch_wait.entry))) && (needs_restart &&
> (ret == BLK_STS_RESOURCE)). That boolean expression can be simplified into
> the following: needs_restart && ret == BLK_STS_RESOURCE &&
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

No, the expression (needs_restart && (ret == BLK_STS_RESOURCE)) clearly
expresses what we want to do.

When RESTART is in effect and meanwhile BLK_STS_RESOURCE is returned
from the driver, we need to avoid an IO hang via blk_mq_delay_run_hw_queue().

>
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
> return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
>
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code
> is reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:

That won't be an issue: any time SCHED_RESTART is cleared, the queue
will be run again, so we don't need to worry about that.

>
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
>         if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                 return false;
>
>         if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
>                 struct request_queue *q = hctx->queue;
>
>                 if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                         atomic_dec(&q->shared_hctx_restart);
>         } else
>                 clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>
>         return blk_mq_run_hw_queue(hctx, true);
> }
>
> The above blk_mq_run_hw_queue() call may finish either before or after

If the above blk_mq_run_hw_queue() finishes before blk_mq_dispatch_rq_list()
examines the flag, blk_mq_dispatch_rq_list() will see the flag cleared and run
the queue again via the following code branch:

        if (!blk_mq_sched_needs_restart(hctx) ||
            (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                blk_mq_run_hw_queue(hctx, true);

If blk_mq_run_hw_queue() finishes after blk_mq_dispatch_rq_list() examines
the flag, that blk_mq_run_hw_queue() will see the newly added request, and
everything is still fine. Even blk_mq_delay_run_hw_queue() may be started
as well.

> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.
>
> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.

Now that there is debugfs, it isn't difficult to deal with that, if
the reporter is willing to cooperate.

>
> Plenty of e-mails have already been exchanged about versions one to four of 
> this
> patch but so far nobody has even tried to contradict the above.

No, I don't see the issue; let's revisit the main change again:

+       needs_restart = blk_mq_sched_needs_restart(hctx);
+       if (!needs_restart ||
            (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                blk_mq_run_hw_queue(hctx, true);
+       else if (needs_restart && (ret == BLK_STS_RESOURCE))
+               blk_mq_delay_run_hw_queue(hctx, 10);

If SCHED_RESTART isn't set, the queue is run immediately; otherwise, when
BLK_STS_RESOURCE is returned, we avoid an IO hang via
blk_mq_delay_run_hw_queue(hctx, XXX).

And we don't cover BLK_STS_DEV_RESOURCE above because it is clearly
documented that BLK_STS_DEV_RESOURCE is only returned by a driver iff the
queue will be rerun in the future once the resource is available.

Is there any hole in the above code?

Thanks,
Ming Lei


Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly

2018-01-29 Thread Coly Li
On 30/01/2018 9:57 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Hello Coly:
> 
> OK, I got your point now.
> Thanks for your patience.
> 
> And there is a small issue I hope can be addressed:
> +#define BCACHE_DEV_WB_RUNNING          4
> +#define BCACHE_DEV_RATE_DW_RUNNING     8
> It would be OK just as:
> +#define BCACHE_DEV_WB_RUNNING          3
> +#define BCACHE_DEV_RATE_DW_RUNNING     4
> 
> Reviewed-by: Tang Junhui 
> 

Hi Junhui,

I will fix that in v5 patch. Thanks for your review :-)

Coly Li


[snip]


Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly

2018-01-29 Thread tang . junhui
From: Tang Junhui 

Hello Coly:

OK, I got your point now.
Thanks for your patience.

And there is a small issue I hope can be addressed:
+#define BCACHE_DEV_WB_RUNNING          4
+#define BCACHE_DEV_RATE_DW_RUNNING     8
It would be OK just as:
+#define BCACHE_DEV_WB_RUNNING          3
+#define BCACHE_DEV_RATE_DW_RUNNING     4
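(These values are bit numbers passed to set_bit()/test_bit() on the
dc->disk.flags word, not masks, so consecutive small integers are enough;
a hedged sketch of the resulting layout, assuming the existing flags occupy
bits 0-2:)

#define BCACHE_DEV_CLOSING             0
#define BCACHE_DEV_DETACHING           1
#define BCACHE_DEV_UNLINK_DONE         2
#define BCACHE_DEV_WB_RUNNING          3
#define BCACHE_DEV_RATE_DW_RUNNING     4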

Reviewed-by: Tang Junhui 

   
>On 29/01/2018 8:22 PM, tang.jun...@zte.com.cn wrote:
>> From: Tang Junhui 
>> 
>> Hello Coly:
>> 
>> There are some differences,
>> Using variable of atomic_t type can not guarantee the atomicity of 
>> transaction.
>> for example:
>> A thread runs in update_writeback_rate()
>> update_writeback_rate(){
>> 
>> +    if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
>> +            schedule_delayed_work(&dc->writeback_rate_update,
>> +                                  dc->writeback_rate_update_seconds * HZ);
>> +    }
>> 
>> Then another thread executes in cached_dev_detach_finish():
>>      if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
>>              cancel_writeback_rate_update_dwork(dc);
>> 
>> +
>> +/*
>> + * should check BCACHE_DEV_RATE_DW_RUNNING before calling
>> + * cancel_delayed_work_sync().
>> + */
>> +clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
>> +/* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
>> +smp_mb();
>> 
>> Race still exists.
>>  
>
>Hi Junhui,
>
>Check super.c:cancel_writeback_rate_update_dwork(),
>BCACHE_DEV_RATE_DW_RUNNING is checked there.
>
>You may see in cached_dev_detach_finish() and update_writeback_rate(),
>the orders to check BCACHE_DEV_RATE_DW_RUNNING and BCACHE_DEV_WB_RUNNING
>are different.
>
>cached_dev_detach_finish()            update_writeback_rate()
>
>test_and_clear_bit                    set_bit
>BCACHE_DEV_WB_RUNNING                 BCACHE_DEV_RATE_DW_RUNNING
>
>(implicit smp_mb())                   smp_mb()
>
>test_bit                              test_bit
>BCACHE_DEV_RATE_DW_RUNNING            BCACHE_DEV_WB_RUNNING
>
>                                      clear_bit()
>                                      BCACHE_DEV_RATE_DW_RUNNING
>
>                                      smp_mb()
>
>
>These two flags are accessed in reverse order in the two locations, and
>there is a smp_mb() between the two flag accesses to serialize the access
>order.
>
>With this reversed-order accessing, it is guaranteed that
>- in cached_dev_detach_finish(), before
>test_bit(BCACHE_DEV_RATE_DW_RUNNING), bit BCACHE_DEV_WB_RUNNING must
>already be cleared.
>- in update_writeback_rate(), before test_bit(BCACHE_DEV_WB_RUNNING),
>BCACHE_DEV_RATE_DW_RUNNING must already be set.
>
>Therefore in your example, if a thread is testing BCACHE_DEV_WB_RUNNING
>in update_writeback_rate(), it means BCACHE_DEV_RATE_DW_RUNNING must be
>set already. So in cancel_writeback_rate_update_dwork() another thread
>must wait until BCACHE_DEV_RATE_DW_RUNNING is cleared then
>cancel_delayed_work_sync() can be called. And in update_writeback_rate()
>the bit BCACHE_DEV_RATE_DW_RUNNING is cleared after
>schedule_delayed_work() returns, so the race is killed.
>
>A mutex lock implies an implicit memory barrier, and in your
>suggestion up_read(&dc->writeback_lock) is after schedule_delayed_work()
>too. This is why I said they are almost the same.
>
>Thanks.
>
>Coly Li
>
>>>
>>> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
 From: Tang Junhui 

 Hello Coly:

 This patch is somewhat difficult for me,
 I think we can resolve it in a simple way.

 We can take the schedule_delayed_work() under the protection of 
 dc->writeback_lock, and judge if we need re-arm this work to queue.

 static void update_writeback_rate(struct work_struct *work)
 {
         struct cached_dev *dc = container_of(to_delayed_work(work),
                                              struct cached_dev,
                                              writeback_rate_update);
 
         down_read(&dc->writeback_lock);
 
         if (atomic_read(&dc->has_dirty) &&
             dc->writeback_percent)
                 __update_writeback_rate(dc);
 
 -       up_read(&dc->writeback_lock);
 +       if (NEED_RE-ARMING)
                 schedule_delayed_work(&dc->writeback_rate_update,
                                       dc->writeback_rate_update_seconds * HZ);
 +       up_read(&dc->writeback_lock);
 }

 In cached_dev_detach_finish() and cached_dev_free() we can set the no need
 flag under the protection of dc->writeback_lock, for example:

 static void cached_dev_detach_finish(struct work_struct *w)
 {
 ...
 +      down_write(&dc->writeback_lock);
 +      SET NO NEED RE-ARM FLAG
 +      up_write(&dc->writeback_lock);
         cancel_delayed_work_sync(&dc->writeback_rate_update);
 }

 I think this way is more simple and readable.

>>>
>>> Hi Junhui,
>>>
>>> Your suggestion is essentially almost the same as my patch,
>>> - clear BCACHE_DEV_DETACHING bit acts as SET NO NEED RE-ARM FLAG.
>>> - cancel_writeback_rate_update_dwork acts as some kind 

Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Jens Axboe
On 1/29/18 4:46 PM, James Bottomley wrote:
> On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
>> On 1/29/18 1:56 PM, James Bottomley wrote:
>>>
>>> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
>>> [...]

 2. When to enable SCSI_MQ at default again?
>>>
>>> I'm not sure there's much to discuss ... I think the basic answer
>>> is as soon as Christoph wants to try it again.
>>
>> FWIW, internally I've been running various IO intensive workloads on
>> what is essentially 4.12 upstream with scsi-mq the default (with
>> mq-deadline as the scheduler) and comparing IO workloads with a
>> previous 4.6 kernel (without scsi-mq), and things are looking
>> great.
>>
>> We're never going to iron out the last kinks with it being off
>> by default, I think we should attempt to flip the switch again
>> for 4.16.
> 
> Absolutely, I agree we turn it on ASAP.  I just don't want to be on the
> receiving end of Linus' flamethrower because a bug we already had
> reported against scsi-mq caused problems.  Get confirmation from the
> original reporters (or as close to it as you can) that their problems
> are fixed and we're good to go; he won't kick us nearly as hard for new
> bugs that turn up.

I agree, the functional issues definitely have to be verified to be
resolved. Various performance hitches we can dive into if they
crop up, but reintroducing some random suspend regression is not
acceptable.

-- 
Jens Axboe



[PATCH] Use bio_endio instead of bio_put in error path of blk_rq_append_bio

2018-01-29 Thread Jiří Paleček
 Avoids page leak from bounced requests
---
 block/blk-map.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index d3a94719f03f..702d68166689 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -26,7 +26,8 @@ int blk_rq_append_bio(struct request *rq, struct bio **bio)
} else {
if (!ll_back_merge_fn(rq->q, rq, *bio)) {
if (orig_bio != *bio) {
-   bio_put(*bio);
+   bio_inc_remaining(orig_bio);
+   bio_endio(*bio);
*bio = orig_bio;
}
return -EINVAL;
-- 
2.15.1



Re: [2/2] block: fix blk_rq_append_bio

2018-01-29 Thread Jiri Palecek
Hello,

I see a problem with this patch:

Ming Lei  writes:

> From: Jens Axboe 
>
> Commit caa4b02476e3(blk-map: call blk_queue_bounce from blk_rq_append_bio)
> moves blk_queue_bounce() into blk_rq_append_bio(), but don't consider
> that the bounced bio becomes invisiable to caller since the parameter type is
> 'struct bio *', which should have been 'struct bio **'.
>
> This patch fixes this issue by passing 'struct bio **' to
> blk_rq_append_bio(), then the bounced bio can be returned to caller.
>
> Also failure handling is considered too.
>
> Fixes: caa4b02476e3 ("blk-map: call blk_queue_bounce from blk_rq_append_bio")
> Cc: Christoph Hellwig 
> Reported-by: Michele Ballabio 
> (handling failure of blk_rq_append_bio(), only call bio_get() after
> blk_rq_append_bio() returns OK)
> Signed-off-by: Ming Lei 
> ---
>  block/blk-map.c| 38 
> ++
>  drivers/scsi/osd/osd_initiator.c   |  4 +++-
>  drivers/target/target_core_pscsi.c |  4 ++--
>  include/linux/blkdev.h |  2 +-
>  4 files changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index b21f8e86f120..d3a94719f03f 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -12,22 +12,29 @@
>  #include "blk.h"
>  
>  /*
> - * Append a bio to a passthrough request.  Only works can be merged into
> - * the request based on the driver constraints.
> + * Append a bio to a passthrough request.  Only works if the bio can be 
> merged
> + * into the request based on the driver constraints.
>   */
> -int blk_rq_append_bio(struct request *rq, struct bio *bio)
> +int blk_rq_append_bio(struct request *rq, struct bio **bio)
>  {
> - blk_queue_bounce(rq->q, &bio);
> + struct bio *orig_bio = *bio;
> +
> + blk_queue_bounce(rq->q, bio);
>  
>   if (!rq->bio) {
> - blk_rq_bio_prep(rq->q, rq, bio);
> + blk_rq_bio_prep(rq->q, rq, *bio);
>   } else {
> - if (!ll_back_merge_fn(rq->q, rq, bio))
> + if (!ll_back_merge_fn(rq->q, rq, *bio)) {
> + if (orig_bio != *bio) {
> + bio_put(*bio);
> + *bio = orig_bio;

1) Suppose the bio *does* bounce, which allocates some pages from the
DMA pool, and subsequently this test fails. bio_put won't deallocate
those pages. Prior to caa4b02476e3, it would be done by bio_endio, which
is IMHO the correct way to do it.

2) What if the bio bounces and goes into the splitting path of
__blk_queue_bounce. That would submit the request, which would be a
problem should this test or some other later fail, cancelling the whole
request.

For 1, I will send a patch. As for 2, it would have been a problem for a long
time, so I have a hunch it doesn't actually happen; however, I'd like to
know whether it is somehow guaranteed.

Regards
Jiri Palecek


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 03:40:31PM -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at 10:46am -0500,
> Ming Lei  wrote:
>  
> > 2. When to enable SCSI_MQ at default again?
> > 
> > SCSI_MQ was first merged in v3.17, but disabled by default. In v4.13-rc1
> > it was enabled by default, but the patch was later reverted in v4.13-rc7,
> > so it became disabled by default again.
> > 
> > Now both the originally reported PM issue (actually SCSI quiesce) and the
> > sequential IO performance issue have been addressed, and MQ IO schedulers
> > are ready for traditional disks too. Are there other issues to be addressed
> > before enabling SCSI_MQ by default? When can we do that again?
> > 
> > Last time, the two issues were reported during the v4.13 dev cycle just when
> > it was enabled by default; that suggests that if SCSI_MQ isn't enabled by
> > default, it won't be exposed to complete and full testing.
> > 
> > So if we continue to disable it by default, maybe it can never be exposed to
> > full test/production environments.
> 
> I was going to propose revisiting this as well.
> 
> I'd really like to see all the old .request_fn block core code removed.

Yeah, that should be the final goal, but it may take a while.

> 
> But maybe we take a first step of enabling:
> CONFIG_SCSI_MQ_DEFAULT=Y
> CONFIG_DM_MQ_DEFAULT=Y

Maybe you can remove the legacy path from DM_RQ first, and take your
original approach to allow DM/MQ over legacy underlying drivers;
it seems we discussed this topic before, :-)

Thanks,
Ming


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 12:56:30PM -0800, James Bottomley wrote:
> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> [...]
> > 2. When to enable SCSI_MQ at default again?
> 
> I'm not sure there's much to discuss ... I think the basic answer is as
> soon as Christoph wants to try it again.

I guess Christoph still needs to evaluate whether there are existing issues or
blockers before trying it again. And more input may be gathered from F2F
discussion, IMHO.

> 
> > SCSI_MQ was first merged in v3.17, but disabled by default. In
> > v4.13-rc1 it was enabled by default, but the patch was later reverted
> > in v4.13-rc7, so it became disabled by default again.
> > 
> > Now both the originally reported PM issue (actually SCSI quiesce) and
> > the sequential IO performance issue have been addressed.
> 
> Is the blocker bug just not closed because no-one thought to do it:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=178381
> 
> (we have confirmed that this issue is now fixed with the original
> reporter?)

From a developer's view, this issue is fixed by the following commit:
3a0a52997 ("block, scsi: Make SCSI quiesce and resume work reliably"),
and it has been verified by the kernel list reporter.

> 
> And did the Huawei guy (Jonathan Cameron) confirm his performance issue
> was fixed (I don't think I saw email that he did)?

Last time I talked with John Garry about the issue, the merged .get_budget
based patch improves the IO performance a lot, but there is still a small gap
compared with the legacy path. It seems to be a driver-specific issue; I
remember that removing a driver lock can improve performance considerably.

Garry, could you provide a further update on this issue?

Thanks,
Ming


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Tue, 2018-01-30 at 09:07 +0800, Ming Lei wrote:
> On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> > - It is easy to fix this race inside the block layer, namely by using
> >   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
> >   postpone the queue rerunning until after the request has been added back 
> > to
> >   the dispatch list.
> 
> That is easy to say, but can you cook a patch and fix all the drivers first?

Please reread what I wrote. I proposed to change the blk_mq_delay_run_hw_queue()
IMPLEMENTATION such that the callers do not have to be modified.

> [ ... ] Later, you admitted you understood the patch wrong. [ ... ]

That's nonsense. I never wrote that.

Bart.

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Ming Lei
On Mon, Jan 29, 2018 at 04:48:31PM +, Bart Van Assche wrote:
> On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> > Not to mention, the request isn't added to the dispatch list yet in .queue_rq();
> > strictly speaking, it is not correct to call blk_mq_delay_run_hw_queue() in
> > .queue_rq(), so the current block layer API can't handle it well enough.
> 
> I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
> inside .queue_rq(). Additionally, I have already explained to you in
> previous e-mails why it's fine to call that function from inside .queue_rq():
> - Nobody has ever observed the race you described because the time after

'Nobody observed it' does not mean the race isn't there, since I can explain
it clearly.

>   which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
>   the time during which that race exists.

Again, it is fragile to depend on the delay (10ms, 3ms, or whatever) to make
everything correct; why can't we make it robust, like the approach I took?

You cannot guarantee that the request handled in .queue_rq() has been added to
hctx->dispatch when you call blk_mq_delay_run_hw_queue(), because adding the
request to hctx->dispatch happens after .queue_rq() returns to
blk_mq_dispatch_rq_list(), i.e. after the blk_mq_delay_run_hw_queue() call has
already been made. Right?

Especially since kblockd_mod_delayed_work_on() has been changed to use
mod_delayed_work_on(), the queue may be run from another context before the
timer expires, and an IO hang can still be triggered because that queue run
may miss the request, which has not yet been added to hctx->dispatch.
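To make the ordering concrete, a rough interleaving of the case being described
(simplified; it assumes a driver that calls blk_mq_delay_run_hw_queue() from
inside .queue_rq()):

    .queue_rq() context                     another context
    -------------------                     ---------------
    resource allocation fails
    blk_mq_delay_run_hw_queue(hctx, 3ms)
                                            mod_delayed_work_on() allows a run
                                            to happen right away; hctx->dispatch
                                            is still empty, nothing dispatched
    return BLK_STS_RESOURCE
    blk_mq_dispatch_rq_list() only now
    adds the request to hctx->dispatch
                                            (no further run is guaranteed -> hang)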

> - It is easy to fix this race inside the block layer, namely by using
>   call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
>   postpone the queue rerunning until after the request has been added back to
>   the dispatch list.

That is easy to say, but can you cook a patch and fix all the drivers first?

Then let's compare which patch is simpler and better.

> 
> > > - The patch at the start of this thread introduces a regression in the
> > >   SCSI core, namely a queue stall if a request completion occurs 
> > > concurrently
> > >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

> > 
> > This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> > to blk-mq, again, please explain it in detail how this patch V3 introduces 
> > this
> > regression on SCSI.
> 
> Please reread the following message:
> https://marc.info/?l=dm-devel&m=151672480107560

OK, the following message is copied from the above link:

> My opinion about this patch is as follows:
> * Changing a blk_mq_delay_run_hw_queue() call followed by return
>   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it 
> changes
>   a guaranteed queue rerun into a queue rerun that may or may not happen
>   depending on whether or not multiple queue runs happen simultaneously.
> * This change makes block drivers less readable because anyone who encounters
>   BLK_STS_DEV_RESOURCE will have to look up its definition to figure out what
>   it's meaning is.
> * We don't need the new status code BLK_STS_DEV_RESOURCE because a delayed
>   queue run can be implemented easily with the existing block layer API.

Later, you admitted you had understood the patch wrongly, so here is your
reply again, from https://marc.info/?l=dm-devel&m=151672694508389&w=2

> On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > On Tue, Jan 23, 2018 at 04:24:20PM +, Bart Van Assche wrote:
> > > My opinion about this patch is as follows:
> > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it 
> > > changes
> > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > >   depending on whether or not multiple queue runs happen simultaneously.
> > 
> > You may not understand the two:
> > 
> > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > and using which one depends on SCHED_RESTART.
> > 
> > 2) if driver can make sure the queue will be rerun after some resource
> > is available, either by itself or by blk-mq, it will return 
> > BLK_STS_DEV_RESOURCE
> > 
> > So what is wrong with this way?
> 
> Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in my
> reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call followed
> by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and introduces a
> race condition in code where there was no race condition.

You still haven't explained the race condition here, right?

I can explain it again, but I am losing patience with you if you continue
to refuse to answer my question, just like you refused to provide the debugfs
log before when you reported the issue.

When turning BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, it is 

[LSF/MM TOPIC] Killing reliance on struct page->mapping

2018-01-29 Thread Jerome Glisse
I started a patchset about $TOPIC a while ago; right now I am working on other
things, but I hope to have an RFC for $TOPIC before LSF/MM and thus would like a
slot during the common track to talk about it, as it impacts FS, BLOCK and MM (I
am assuming there will be a common track).

The idea is that the mapping (struct address_space) is available in virtually all
the places where it is needed, and that there should be no reason to depend only
on the struct page->mapping field. My patchset basically adds the mapping to a
bunch of vfs callbacks (struct address_space_operations) where it is missing,
changing the call sites. Then I do an individual patch per filesystem to leverage
the new argument instead of struct page.
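As a rough illustration of the kind of signature change this means (hypothetical
form, not the final API), a callback like writepage would gain an explicit
mapping argument so implementations no longer have to read page->mapping:

        /* today */
        int (*writepage)(struct page *page, struct writeback_control *wbc);

        /* with an explicit mapping argument (illustrative only) */
        int (*writepage)(struct page *page, struct address_space *mapping,
                         struct writeback_control *wbc);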

I am doing this for a generic page write-protection mechanism which generalizes
KSM to file-backed pages. There are a couple of other aspects, like struct
page->index and struct page->private, which are addressed in a similar way. The
block layer is mostly affected because, on a block device error, it needs
page->mapping to report the I/O error.

Maybe we can kill page->mapping altogether as a result of this. However this is
not my motivation at this time.


Sorry for the absence of a patchset at this time, but I wanted to submit the
subject before the LSF/MM deadline.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 14:00 -0700, Jens Axboe wrote:
> On 1/29/18 1:56 PM, James Bottomley wrote:
> > 
> > On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> > [...]
> > > 
> > > 2. When to enable SCSI_MQ at default again?
> > 
> > I'm not sure there's much to discuss ... I think the basic answer
> > is as soon as Christoph wants to try it again.
> 
> FWIW, internally I've been running various IO intensive workloads on
> what is essentially 4.12 upstream with scsi-mq the default (with
> mq-deadline as the scheduler) and comparing IO workloads with a
> previous 4.6 kernel (without scsi-mq), and things are looking
> great.
> 
> We're never going to iron out the last kinks with it being off
> by default, I think we should attempt to flip the switch again
> for 4.16.

Absolutely, I agree we turn it on ASAP.  I just don't want to be on the
receiving end of Linus' flamethrower because a bug we already had
reported against scsi-mq caused problems.  Get confirmation from the
original reporters (or as close to it as you can) that their problems
are fixed and we're good to go; he won't kick us nearly as hard for new
bugs that turn up.

James



Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-29 Thread Bart Van Assche

On 01/19/18 07:24, Jens Axboe wrote:

That's what I thought. So for a low queue depth underlying queue, it's
quite possible that this situation can happen. Two potential solutions
I see:

1) As described earlier in this thread, having a mechanism for being
notified when the scarce resource becomes available. It would not
be hard to tap into the existing sbitmap wait queue for that.

2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
allocation. I haven't read the dm code to know if this is a
possibility or not.

I'd probably prefer #1. It's a classic case of trying to get the
request, and if it fails, add ourselves to the sbitmap tag wait
queue head, retry, and bail if that also fails. Connecting the
scarce resource and the consumer is the only way to really fix
this, without bogus arbitrary delays.
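(A rough sketch of the retry pattern option 1 describes -- all names below are
hypothetical placeholders, not the actual sbitmap/blk-mq API:)

        rq = try_get_resource(dev);
        if (!rq) {
                /* register for the wakeup first, then retry: this closes the
                 * window where the resource frees up between check and sleep */
                add_wait_queue(&dev->resource_wait, &wait);
                rq = try_get_resource(dev);
                if (!rq)
                        return BLK_STS_RESOURCE;   /* bail; wakeup reruns the queue */
                remove_wait_queue(&dev->resource_wait, &wait);
        }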


(replying to an e-mail from ten days ago)

Implementing a notification mechanism for all cases in which 
blk_insert_cloned_request() returns BLK_STS_RESOURCE today would require 
a lot of work. If e.g. a SCSI LLD returns one of the SCSI_MLQUEUE_*_BUSY 
return codes from its .queuecommand() implementation then the SCSI core 
will translate that return code into BLK_STS_RESOURCE. From scsi_queue_rq():


reason = scsi_dispatch_cmd(cmd);
if (reason) {
scsi_set_blocked(cmd, reason);
ret = BLK_STS_RESOURCE;
goto out_dec_host_busy;
}

In other words, implementing a notification mechanism for all cases in 
which blk_insert_cloned_request() can return BLK_STS_RESOURCE would 
require to modify all SCSI LLDs.


Bart.


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 17:14 -0500, Mike Snitzer wrote:
> On Mon, Jan 29 2018 at  4:51pm -0500,
> Bart Van Assche  wrote:
> 
> > On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> > > But regardless of which wins the race, the queue will have been run.
> > > Which is all we care about right?
> > 
> > Running the queue is not sufficient. With this patch applied it can happen
> > that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
> > concurrent queue runs finish before sufficient device resources are 
> > available
> > to execute the request and that blk_mq_delay_run_hw_queue() does not get
> > called at all. If no other activity triggers a queue run, e.g. request
> > completion, this will result in a queue stall.
> 
> If BLK_STS_DEV_RESOURCE is returned then the driver doesn't need to rely
> on a future queue run.  IIUC, that is the entire premise of
> BLK_STS_DEV_RESOURCE.  If the driver had doubt about whether the
> resource were going to be available in the future it should return
> BLK_STS_RESOURCE.
> 
> That may seem like putting a lot on a driver developer (to decide
> between the 2) but I'll again defer to Jens here.  This was the approach
> he was advocating be pursued.

Hello Mike,

There was a typo in my reply: I should have referred to BLK_STS_RESOURCE
instead of BLK_STS_DEV_RESOURCE. The convention for BLK_STS_DEV_RESOURCE is
namely that if .queue_rq() returns that status code, the block driver
guarantees that the queue will be rerun.

Bart.

Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 16:44 -0500, Mike Snitzer wrote:
> But regardless of which wins the race, the queue will have been run.
> Which is all we care about right?

Running the queue is not sufficient. With this patch applied it can happen
that the block driver returns BLK_STS_DEV_RESOURCE, that the two or more
concurrent queue runs finish before sufficient device resources are available
to execute the request and that blk_mq_delay_run_hw_queue() does not get
called at all. If no other activity triggers a queue run, e.g. request
completion, this will result in a queue stall.

Bart.

Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-29 Thread NeilBrown
On Mon, Jan 29 2018, Wols Lists wrote:

> On 29/01/18 15:23, Johannes Thumshirn wrote:
>> Hi linux-raid, lsf-pc
>> 
>> (If you've received this mail multiple times, I'm sorry, I'm having
>> trouble with the mail setup).
>
> My immediate reactions as a lay person (I edit the raid wiki) ...
>> 
>> With the rise of bigger and bigger disks, array rebuilding times start
>> skyrocketing.
>
> And? Yes, your data is at risk during a rebuild, but md-raid throttles
> the i/o, so it doesn't hammer the system.
>> 
>> In a paper form '92 Holland and Gibson [1] suggest a mapping algorithm
>> similar to RAID5 but instead of utilizing all disks in an array for
>> every I/O operation, but implement a per-I/O mapping function to only
>> use a subset of the available disks.
>> 
>> This has at least two advantages:
>> 1) If one disk has to be replaced, it's not needed to read the data from
>>all disks to recover the one failed disk so non-affected disks can be
>>used for real user I/O and not just recovery and
>
> Again, that's throttling, so that's not a problem ...

Imagine an array with 100 drives on which we store data in sets of
(say) 6 data chunks and 2 parity chunks.
Each group of 8 chunks is distributed over the 100 drives in a
different way so that (e.g) 600 data chunks and 200 parity chunks are
distributed over 8 physical stripes using some clever distribution
function.
If (when) one drive fails, the 8 chunks in this set of 8 physical
stripes can be recovered by reading 6*8 == 48 chunks which will each be
on a different drive.  Half the drives deliver only one chunk (in an ideal
distribution) and the other half deliver none.  Maybe they will deliver
some for the next set of 100 logical stripes.

You would probably say that even doing raid6 on 100 drives is crazy.
Better to make, e.g. 10 groups of 10 and do raid6 on each of the 10,
then LVM them together.

By doing declustered parity you can sanely do raid6 on 100 drives, using
a logical stripe size that is much smaller than 100.
When recovering a single drive, the 10-groups-of-10 would put heavy load
on 9 other drives, while the decluster approach puts light load on 99
other drives.  No matter how clever md is at throttling recovery, I
would still rather distribute the load so that md has an easier job.

NeilBrown

>
>> 2) an efficient mapping function can improve parallel I/O submission, as
>>two different I/Os are not necessarily going to the same disks in the
>>array. 
>> 
>> For the mapping function used a hashing algorithm like Ceph's CRUSH [2]
>> would be ideal, as it provides a pseudo random but deterministic mapping
>> for the I/O onto the drives.
>> 
>> This whole declustering of cause only makes sense for more than (at
>> least) 4 drives but we do have customers with several orders of
>> magnitude more drivers in an MD array.
>
> If you have four drives or more - especially if they are multi-terabyte
> drives - you should NOT be using raid-5 ...
>> 
>> At LSF I'd like to discuss if:
>> 1) The wider MD audience is interested in de-clusterd RAID with MD
>
> I haven't read the papers, so no comment, sorry.
>
>> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>>as a new personality
>
> Neither! If you're going to do it, it should be raid-6.
>
>> 3) CRUSH is a suitible algorith for this (there's evidence in [3] that
>>the NetApp E-Series Arrays do use CRUSH for parity declustering)
>> 
>> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf 
>> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
>> [3]
>> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
>> 
> Okay - I've now skimmed the crush paper [2]. Looks well interesting.
> BUT. It feels more like btrfs than it does like raid.
>
> Btrfs manages disks, and does raid, it tries to be the "everything
> between the hard drive and the file". This crush thingy reads to me like
> it wants to be the same. There's nothing wrong with that, but md is a
> unix-y "do one thing (raid) and do it well".
>
> My knee-jerk reaction is if you want to go for it, it sounds like a good
> idea. It just doesn't really feel a good fit for md.
>
> Cheers,
> Wol




[LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-01-29 Thread Mike Snitzer
I'd like to enable bio-based DM to _not_ need to clone bios.  But to do
so each bio-based DM target's required per-bio-data would need to be
provided by upper layer biosets (as opposed to the bioset DM currently
creates).

So my thinking is that all system-level biosets (e.g. fs_bio_set,
blkdev_dio_pool) would redirect to a device specific variant bioset IFF
the underlying device advertises the need for a specific per-bio-data
payload to be provided.

I know this _could_ become a rathole but I'd like to avoid reverting DM
back to the days of having to worry about managing mempools for the
purpose of per-io allocations.  I've grown spoiled by the performance
and elegance that comes with having the bio and per-bio-data allocated
from the same bioset.
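For background, the pattern being referred to is the front_pad trick that
biosets already support; roughly (a sketch against the 4.15-era API, the
per-bio-data struct name is made up):

        struct my_per_bio_data {
                struct my_target *ti;           /* whatever the target needs per bio */
                unsigned long start_time;
                struct bio bio;                 /* the bio itself is allocated last */
        };

        /* one allocation per bio: bio + per-bio-data, no separate mempool */
        bs  = bioset_create(POOL_SIZE, offsetof(struct my_per_bio_data, bio),
                            BIOSET_NEED_BVECS);
        bio = bio_alloc_bioset(GFP_NOIO, nr_vecs, bs);
        pb  = container_of(bio, struct my_per_bio_data, bio);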

Thoughts?
Mike


Re: [PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> +  * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> +  * bit is set, run queue after 10ms to avoid IO stalls
> +  * that could otherwise occur if the queue is idle.
>*/
> - if (!blk_mq_sched_needs_restart(hctx) ||
> + needs_restart = blk_mq_sched_needs_restart(hctx);
> + if (!needs_restart ||
>   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>   blk_mq_run_hw_queue(hctx, true);
> + else if (needs_restart && (ret == BLK_STS_RESOURCE))
> + blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>   }

The above code only calls blk_mq_delay_run_hw_queue() if the following condition
is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
&& (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
simplified into the following: needs_restart && ret == BLK_STS_RESOURCE &&
!(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}

In other words, the above code will not call blk_mq_delay_run_hw_queue() if
BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
concurrently with the above code. From blk-mq-sched.c:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
        if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                return false;

        if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
                struct request_queue *q = hctx->queue;

                if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                        atomic_dec(&q->shared_hctx_restart);
        } else
                clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

        return blk_mq_run_hw_queue(hctx, true);
}

The above blk_mq_run_hw_queue() call may finish either before or after
blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.

That's why I wrote in previous e-mails that this patch and also the previous
versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
into a call that may or may not happen, depending on whether or not a request
completes concurrently with request queueing. Sorry but I think that means
that the above change combined with changing BLK_STS_RESOURCE into
BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
to debug.

Plenty of e-mails have already been exchanged about versions one to four of this
patch but so far nobody has even tried to contradict the above.

Bart.

[LSF/MM TOPIC] block, dm: restack queue_limits

2018-01-29 Thread Mike Snitzer
We currently don't restack the queue_limits if the lowest, or
intermediate, layer of an IO stack changes.

This is particularly unfortunate in the case of FLUSH/FUA, which may
change if/when a HW controller's BBU fails, thereby requiring the device
to advertise that it has a volatile write cache (WCE=1).

But in the context of DM, really it'd be best if the entire stack of
devices had their limits restacked if any underlying layer's limits
change.
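Roughly, "restacking" would mean re-running the walk that today happens only at
table load time, something like this hand-wavy sketch (the trigger and the final
commit into q->limits are the missing pieces):

        struct queue_limits lim;

        blk_set_stacking_limits(&lim);
        /* for each underlying device in the table ... */
        blk_stack_limits(&lim, &bdev_get_queue(member_bdev)->limits, 0);
        /* ... then publish lim as the top-level queue's limits */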

In the past, Martin and I discussed that we should "just do it" but
never did.  Not sure we need a lengthy discussion but figured I'd put it
out there.

Maybe I'll find time, between now and April, to try implementing it.

Thanks,
Mike


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Jens Axboe
On 1/29/18 1:56 PM, James Bottomley wrote:
> On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
> [...]
>> 2. When to enable SCSI_MQ at default again?
> 
> I'm not sure there's much to discuss ... I think the basic answer is as
> soon as Christoph wants to try it again.

FWIW, internally I've been running various IO intensive workloads on
what is essentially 4.12 upstream with scsi-mq the default (with
mq-deadline as the scheduler) and comparing IO workloads with a
previous 4.6 kernel (without scsi-mq), and things are looking
great.

We're never going to iron out the last kinks with it being off
by default, I think we should attempt to flip the switch again
for 4.16.

-- 
Jens Axboe



Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread James Bottomley
On Mon, 2018-01-29 at 23:46 +0800, Ming Lei wrote:
[...]
> 2. When to enable SCSI_MQ at default again?

I'm not sure there's much to discuss ... I think the basic answer is as
soon as Christoph wants to try it again.

> SCSI_MQ was first merged in v3.17, but disabled by default. In
> v4.13-rc1 it was enabled by default, but the patch was later reverted
> in v4.13-rc7, so it became disabled by default again.
> 
> Now both the originally reported PM issue (actually SCSI quiesce) and
> the sequential IO performance issue have been addressed.

Is the blocker bug just not closed because no-one thought to do it:

https://bugzilla.kernel.org/show_bug.cgi?id=178381

(we have confirmed that this issue is now fixed with the original
reporter?)

And did the Huawei guy (Jonathan Cameron) confirm his performance issue
was fixed (I don't think I saw email that he did)?

James




Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Mike Snitzer
On Mon, Jan 29 2018 at 10:46am -0500,
Ming Lei  wrote:
 
> 2. When to enable SCSI_MQ at default again?
> 
> SCSI_MQ was first merged in v3.17, but disabled by default. In v4.13-rc1
> it was enabled by default, but the patch was later reverted in v4.13-rc7,
> so it became disabled by default again.
> 
> Now both the originally reported PM issue (actually SCSI quiesce) and the
> sequential IO performance issue have been addressed, and MQ IO schedulers
> are ready for traditional disks too. Are there other issues to be addressed
> before enabling SCSI_MQ by default? When can we do that again?
> 
> Last time, the two issues were reported during the v4.13 dev cycle just when
> it was enabled by default; that suggests that if SCSI_MQ isn't enabled by
> default, it won't be exposed to complete and full testing.
> 
> So if we continue to disable it by default, maybe it can never be exposed to
> full test/production environments.

I was going to propose revisiting this as well.

I'd really like to see all the old .request_fn block core code removed.

But maybe we take a first step of enabling:
CONFIG_SCSI_MQ_DEFAULT=Y
CONFIG_DM_MQ_DEFAULT=Y
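(For testing without a config rebuild, the same defaults can also be flipped at
boot via the module parameters -- an aside, not part of the proposal itself:)

    scsi_mod.use_blk_mq=Y dm_mod.use_blk_mq=Y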

Thanks,
Mike


Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe  wrote:
>
> Yes of course, I can switch to using signed tags for pull requests
> for you.

Thanks. I don't actually know the maintainership status of kernel.dk.

You show up as the 'whois' contact, but I'm also assuming it doesn't
have the same kind of fairly draconian access control as the
kernel.org suite of sites, and I'm actively trying to make sure we
either use signed tags or the kernel.org repos (and honestly, even
with the kernel.org ones I tend to prefer signed tags).

Right now I only _require_ it for public hosting, but I'm trying to
move to just signed tags being the default everywhere.

  Linus


[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Mike Snitzer
From: Ming Lei 

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the queue after
a delay (BLK_MQ_QUEUE_DELAY) to avoid IO stalls.  BLK_MQ_QUEUE_DELAY is
3 ms because both scsi-mq and nvmefc are using that magic value.

Suggested-by: Jens Axboe 
Tested-by: Laurence Oberman 
Signed-off-by: Ming Lei 
Signed-off-by: Mike Snitzer 
---

V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 14 ++
 9 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..dd097ca5f1e9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY 3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 * that is where we will continue on next queue run.
 */
if (!list_empty(list)) {
+   bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 * - Some but not all block drivers stop a queue before
 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 *   and dm-rq.
+*
+* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+* bit is set, run queue after 10ms to avoid IO stalls
+* that could otherwise occur if the queue is idle.
 */
-   if (!blk_mq_sched_needs_restart(hctx) ||
+   needs_restart = blk_mq_sched_needs_restart(hctx);
+   if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+   else if (needs_restart && (ret == BLK_STS_RESOURCE))
+   blk_mq_delay_run_hw_queue(hctx, 

Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Jens Axboe
On 1/29/18 1:14 PM, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 12:08 PM, Jens Axboe  wrote:
>>
>> Yes of course, I can switch to using signed tags for pull requests
>> for you.
> 
> Thanks. I don't actually know the maintainership status of kernel.dk.
> 
> You show up as the 'whois' contact, but I'm also assuming it doesn't
> have the same kind of fairly draconian access control as the
> kernel.org suite of sites, and I'm actively trying to make sure we
> either use signed tags or the kernel.org repos (and honestly, even
> with the kernel.org ones I tend to prefer signed tags).
> 
> Right now I only _require_ it for public hosting, but I'm trying to
> move to just signed tags being the default everywhere.

The box is sitting right next to me in my office, which is locked.  But
it's not kernel.org, and I don't have a dedicated sysadmin. My tree does
mirror to git.kernel.org, so they should be in sync (albeit with some
delay). If you're fine with signed tags on git.kernel.dk, I'd prefer to
continue using that, but just sign the pull requests.

I sign all fio releases with the same key, so it'd be trivial for me to
just sign pulls as well.

-- 
Jens Axboe



Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Jens Axboe
On 1/29/18 1:02 PM, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe  wrote:
>>
>> This is the main pull request for block IO related changes for the 4.16
>> kernel. Nothing major in this pull request, but a good amount of
>> improvements and fixes all over the map. This pull request contains:
> 
> Pulled.
> 
> However, can I request that you please start using signed tags?
> 
> I know you have a gpg key and know how to do signed tags, because in
> the almost four hundred pulls that I've done from you over the many
> years, I've actually got three such pull requests from you.
> 
> Can we make that percentage go up a bit? Pretty please?

Yes of course, I can switch to using signed tags for pull requests
for you.


-- 
Jens Axboe



Re: [GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Linus Torvalds
On Mon, Jan 29, 2018 at 7:40 AM, Jens Axboe  wrote:
>
> This is the main pull request for block IO related changes for the 4.16
> kernel. Nothing major in this pull request, but a good amount of
> improvements and fixes all over the map. This pull request contains:

Pulled.

However, can I request that you please start using signed tags?

I know you have a gpg key and know how to do signed tags, because in
the almost four hundred pulls that I've done from you over the many
years, I've actually got three such pull requests from you.

Can we make that percentage go up a bit? Pretty please?

  Linus


Re: [PATCH v6 1/2] Return bytes transferred for partial direct I/O

2018-01-29 Thread Randy Dunlap
On 01/29/2018 06:57 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 

> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
> index 6c00c1e2743f..72e213d62511 100644
> --- a/Documentation/sysctl/fs.txt
> +++ b/Documentation/sysctl/fs.txt
> @@ -76,6 +77,19 @@ dcache isn't pruned yet.
>  
>  ==
>  
> +dio_short_writes:
> +
> +In case Direct I/O encounters an transient error, it returns

 a transient

> +the errorcode, even if it has performed part of the write.

   error code,

> +This flag, if on (default), will return the number of bytes written
> +so far, as the write(2) symantics are. However, some older applications

   semantics

> +still consider a direct write as an error if all of the I/O
> +submitted is not complete. ie write(file, count, buf) != count.

  I.e.

> +This option can be disabled on systems in order to support
> +existing applications which do not expect short writes.

and if my system has a mix of older applications and new ones,
will they all work just fine?


thanks,
-- 
~Randy


[LSF/MM TOPIC] Bringing Virtual Data Optimizer (VDO) to upstream

2018-01-29 Thread Bryan Gurney
Hello,

I'd like to talk about Virtual Data Optimizer (VDO), which is a
device-mapper target that provides deduplication, compression, thin
provisioning, and zero-block elimination for primary storage.  Now
that VDO is open-source, LSF/MM would be a good time to open a
dialogue between other upstream developers as we work to bring VDO
upstream.

VDO was originally developed by Permabit Technology Corporation as a
proprietary set of kernel modules and userspace tools.  After being
acquired by Red Hat, VDO has been re-licensed under the GNU General
Public License, version 2.  The VDO team has also been working a lot
closer with Mike Snitzer and other developers of device-mapper and
LVM.

The source for VDO is available at the following repositories:

https://github.com/dm-vdo/kvdo (kernel modules)
https://github.com/dm-vdo/vdo (userspace tools)

As the README files in the repositories above indicate, development on
VDO is currently focusing on upstream integration, which includes
various coding standards compliance changes, refactoring efforts, and
various other changes that become possible now that VDO is re-licensed
under the GPL.

That being said, VDO has been in production use since 2014 in its
previous proprietary form, and continues to undergo thorough testing,
including real-world test scenarios.


Thanks,

Bryan


Re: [PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-29 Thread Jens Axboe
On 1/29/18 7:57 AM, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Since we can return less than count in case of partial direct
> writes, remove the ASSERT.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Acked-by: Dave Chinner 
> Reviewed-by: Darrick J. Wong 

Minor nit - you should probably reorder this patch before the other one,
otherwise you'll have a point in the tree where it can knowingly
trigger.

-- 
Jens Axboe



Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Bart Van Assche
On Sun, 2018-01-28 at 07:41 +0800, Ming Lei wrote:
> Not mention, the request isn't added to dispatch list yet in .queue_rq(),
> strictly speaking, it is not correct to call blk_mq_delay_run_hw_queue() in
> .queue_rq(), so the current block layer API can't handle it well enough.

I disagree that it is not correct to call blk_mq_delay_run_hw_queue() from
inside .queue_rq(). Additionally, I have already explained to you in
previous e-mails why it's fine to call that function from inside .queue_rq():
- Nobody has ever observed the race you described because the time after
  which a queue is rerun by blk_mq_delay_run_hw_queue() is much larger than
  the time during which that race exists.
- It is easy to fix this race inside the block layer, namely by using
  call_rcu() inside the blk_mq_delay_run_hw_queue() implementation to
  postpone the queue rerunning until after the request has been added back to
  the dispatch list.

> > - The patch at the start of this thread introduces a regression in the
> >   SCSI core, namely a queue stall if a request completion occurs 
> > concurrently
> >   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.
> 
> This patch only moves the blk_mq_delay_run_hw_queue() from scsi_queue_rq()
> to blk-mq, again, please explain it in detail how this patch V3 introduces 
> this
> regression on SCSI.

Please reread the following message: 
https://marc.info/?l=dm-devel=151672480107560.

Thanks,

Bart.

Re: [LSF/MM TOPIC] De-clustered RAID with MD

2018-01-29 Thread Wols Lists
On 29/01/18 15:23, Johannes Thumshirn wrote:
> Hi linux-raid, lsf-pc
> 
> (If you've received this mail multiple times, I'm sorry, I'm having
> trouble with the mail setup).

My immediate reactions as a lay person (I edit the raid wiki) ...
> 
> With the rise of bigger and bigger disks, array rebuilding times start
> skyrocketing.

And? Yes, your data is at risk during a rebuild, but md-raid throttles
the i/o, so it doesn't hammer the system.
> 
> In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
> similar to RAID5, but instead of utilizing all disks in an array for
> every I/O operation, they implement a per-I/O mapping function to only
> use a subset of the available disks.
> 
> This has at least two advantages:
> 1) If one disk has to be replaced, it's not needed to read the data from
>all disks to recover the one failed disk so non-affected disks can be
>used for real user I/O and not just recovery and

Again, that's throttling, so that's not a problem ...

> 2) an efficient mapping function can improve parallel I/O submission, as
>two different I/Os are not necessarily going to the same disks in the
>array. 
> 
> For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
> would be ideal, as it provides a pseudo-random but deterministic mapping
> of the I/O onto the drives.
> 
> This whole declustering of course only makes sense for more than (at
> least) 4 drives, but we do have customers with several orders of
> magnitude more drives in an MD array.

If you have four drives or more - especially if they are multi-terabyte
drives - you should NOT be using raid-5 ...
> 
> At LSF I'd like to discuss if:
> 1) The wider MD audience is interested in de-clustered RAID with MD

I haven't read the papers, so no comment, sorry.

> 2) de-clustered RAID should be implemented as a sublevel of RAID5 or
>as a new personality

Neither! If you're going to do it, it should be raid-6.

> 3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
>    the NetApp E-Series Arrays do use CRUSH for parity declustering)
> 
> [1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf 
> [2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
> [3]
> https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf
> 
Okay - I've now skimmed the crush paper [2]. Looks well interesting.
BUT. It feels more like btrfs than it does like raid.

Btrfs manages disks, and does raid, it tries to be the "everything
between the hard drive and the file". This crush thingy reads to me like
it wants to be the same. There's nothing wrong with that, but md is a
unix-y "do one thing (raid) and do it well".

My knee-jerk reaction is if you want to go for it, it sounds like a good
idea. It just doesn't really feel a good fit for md.

Cheers,
Wol



[LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Ming Lei
Hi guys,

Two blk-mq related topics

1. blk-mq vs. CPU hotplug & IRQ vectors spread on CPUs

We have made three big changes in this area before; each time some issues
were fixed, but meanwhile new ones were introduced:

1) freeze all queues during the CPU hotplug handler
- issues: queue dependencies such as loop-mq/dm vs. underlying queues and
NVMe admin queue vs. namespace queues; IO hangs may be caused while freezing
all these queues in the CPU hotplug handler.

2) IRQ vectors spread on all present CPUs
- fixes the issue in 1)
- new issues introduced: physical CPU hotplug isn't supported, and a blk-mq
warning can be triggered during dispatch

3) IRQ vectors spread on all possible CPUs
- can support physical CPU hotplug
- the warning in __blk_mq_run_hw_queue() may still be triggered if CPU
  offline/online happens between blk_mq_hctx_next_cpu() and running
  __blk_mq_run_hw_queue()
- new issues introduced: the queue mapping may be distorted completely; a
patch has been sent out (https://marc.info/?t=15160323092=1=2), but the
approach may need further discussion; drivers (such as NVMe) may need to
pass 'num_possible_cpus()' as the max vector count when allocating irq
vectors (see the sketch below); some drivers (NVMe) use hard-coded hw queue
indexes directly, which becomes very fragile since the hw queue may be
inactive from the beginning.
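
A hedged sketch of that vector-allocation point, assuming an nvme-style PCI
driver with a pdev at hand; foo_setup_irqs() is made up, the PCI/IRQ APIs are
the existing ones:

static int foo_setup_irqs(struct pci_dev *pdev)
{
	struct irq_affinity affd = { .pre_vectors = 1 };  /* e.g. keep the admin queue apart */
	int nr_io_queues = num_possible_cpus();           /* size by possible, not online, CPUs */
	int nr_vecs;

	nr_vecs = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
						 PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY,
						 &affd);
	if (nr_vecs < 0)
		return nr_vecs;

	/* spread hw queues over the (nr_vecs - 1) I/O vectors actually granted */
	return 0;
}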

Also, starting from 2), another issue is that IO completion may not be
delivered to any CPU: for example, IO may be dispatched to a hw queue just
before (or after) all CPUs mapped to the hctx go offline, and then the IRQ
vector of the hw queue can be shut down. Right now we seem to depend on the
timeout handler to deal with this situation; is there a better way to solve
this issue?

2. When to enable SCSI_MQ at default again?

SCSI_MQ was first introduced in V3.17, but disabled by default. In V4.13-rc1
it was enabled by default, but the patch was later reverted in V4.13-rc7, so
it is disabled by default again.

Now both the originally reported PM issue (actually SCSI quiesce) and the
sequential IO performance issue have been addressed, and the MQ IO schedulers
are ready for traditional disks too. Are there other issues to be addressed
before enabling SCSI_MQ by default? When can we do that again?

Last time, the two issues were reported during the V4.13 dev cycle right after
it was enabled by default, which suggests that unless SCSI_MQ is enabled by
default it won't be run and tested completely and fully.

So if we continue to disable it by default, it may never be exposed to full
test/production environments.
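
For reference, until the default flips, scsi-mq can still be turned on per
system with the existing knobs (nothing new proposed here):

	CONFIG_SCSI_MQ_DEFAULT=y	(build time)
	scsi_mod.use_blk_mq=1		(kernel command line or module parameter)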


Thanks,
Ming


[GIT PULL] Block changes for 4.16-rc

2018-01-29 Thread Jens Axboe
Hi Linus,

This is the main pull request for block IO related changes for the 4.16
kernel. Nothing major in this pull request, but a good amount of
improvements and fixes all over the map. This pull request contains:

- BFQ improvements, fixes, and cleanups from Angelo, Chiara, and Paolo.

- Support for SMR zones for deadline and mq-deadline from Damien and
  Christoph.

- Set of fixes for bcache by way of Michael Lyle, including fixes from
  himself, Kent, Rui, Tang, and Coly.

- Series from Matias for lightnvm with fixes from Hans Holmberg, Javier,
  and Matias. Mostly centered around pblk, and removing rrpc 1.2 in
  preparation for supporting 2.0.

- A couple of NVMe pull requests from Christoph. Nothing major in here,
  just fixes and cleanups, and support for command tracing from
  Johannes.

- Support for blk-throttle for tracking reads and writes separately.
  From Joseph Qi. A few cleanups/fixes also for blk-throttle from
  Weiping.

- Series from Mike Snitzer that enables dm to register its queue more
  logically, something that's always been problematic on dm since it's
  a stacked device.

- Series from Ming cleaning up some of the bio accessor use, in
  preparation for supporting multipage bvecs.

- Various fixes from Ming closing up holes around queue mapping and
  quiescing.

- BSD partition fix from Richard Narron, fixing a problem where we can't
  mount newer (10/11) FreeBSD partitions.

- Series from Tejun reworking blk-mq timeout handling. The previous
  scheme relied on atomic bits, but it had races where we would think a
  request had timed out if it got reused at the wrong time.

- null_blk now supports faking timeouts, to enable us to better exercise
  and test that functionality separately. From me.

- Kill the separate atomic poll bit in the request struct. After this,
  we don't use the atomic bits on blk-mq anymore at all. From me.

- sgl_alloc/free helpers from Bart.

- Heavily contended tag case scalability improvement from me.

- Various little fixes and cleanups from Arnd, Bart, Corentin, Douglas,
  Eryu, Goldwyn, and myself.


Note that you'll get a single merge conflict when pulling this in, due
to the change:

commit 4ccafe032005e9b96acbef2e389a4de5b1254add
Author: Jens Axboe 
Date:   Wed Dec 20 13:13:58 2017 -0700

block: unalign call_single_data in struct request

that went into your tree after I had forked off for-4.16/block. The
resolution is trivial, just ensure that the moved 'csd' struct member
retains the type of struct __call_single_data, not call_single_data_t.
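
For reference, a hedged sketch of what the resolved member in struct request
should look like after the merge (only the 'csd' line shown; everything else
unchanged):

	struct request {
		/* ... other members unchanged ... */
		struct __call_single_data csd;	/* keep this type, not call_single_data_t */
		/* ... */
	};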

Please pull!


  git://git.kernel.dk/linux-block.git for-4.16/block



Angelo Ruocco (2):
  block, bfq: check low_latency flag in bfq_bfqq_save_state()
  block, bfq: remove superfluous check in queue-merging setup

Arnd Bergmann (3):
  DAC960: split up ioctl function to reduce stack size
  null_blk: remove explicit 'select FAULT_INJECTION'
  blkcg: simplify statistic accumulation code

Bart Van Assche (20):
  pktcdvd: Fix pkt_setup_dev() error path
  pktcdvd: Fix a recently introduced NULL pointer dereference
  lib/scatterlist: Introduce sgl_alloc() and sgl_free()
  crypto: scompress - use sgl_alloc() and sgl_free()
  nvmet/fc: Use sgl_alloc() and sgl_free()
  nvmet/rdma: Use sgl_alloc() and sgl_free()
  target: Use sgl_alloc_order() and sgl_free()
  blk-mq: Fix spelling in a source code comment
  block: Fix kernel-doc warnings reported when building with W=1
  blk-mq: Explain when 'active_queues' is decremented
  blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()
  blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
  block: Fix __bio_integrity_endio() documentation
  block: Unexport elv_register_queue() and elv_unregister_queue()
  block: Document scheduler modification locking requirements
  block: Protect less code with sysfs_lock in blk_{un,}register_queue()
  lib/scatterlist: Fix chaining support in sgl_alloc_order()
  blk-mq: Rename blk_mq_request_direct_issue() into 
blk_mq_request_issue_directly()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended 
delays
  block: Remove kblockd_schedule_delayed_work{,_on}()

Chiara Bruschi (1):
  block, bfq: fix occurrences of request finish method's old name

Christoph Hellwig (5):
  block: introduce zoned block devices zone write locking
  genirq/affinity: assign vectors to all possible CPUs
  blk-mq: simplify queue mapping & schedule with each possisble CPU
  nvme-pci: clean up CMB initialization
  nvme-pci: clean up SMBSZ bit definitions

Coly Li (2):
  bcache: reduce cache_set devices iteration by devices_max_used
  bcache: fix misleading error message in bch_count_io_errors()

Corentin Labbe (1):
  block: remove smart1,2.h

Damien Le Moal (4):
  mq-deadline: Introduce dispatch 

[LSF/MM TOPIC] De-clustered RAID with MD

2018-01-29 Thread Johannes Thumshirn
Hi linux-raid, lsf-pc

(If you've received this mail multiple times, I'm sorry, I'm having
trouble with the mail setup).

With the rise of bigger and bigger disks, array rebuilding times start
skyrocketing.

In a paper from '92, Holland and Gibson [1] suggest a mapping algorithm
similar to RAID5, but instead of utilizing all disks in an array for
every I/O operation, they implement a per-I/O mapping function to only
use a subset of the available disks.

This has at least two advantages:
1) If one disk has to be replaced, it's not needed to read the data from
   all disks to recover the one failed disk so non-affected disks can be
   used for real user I/O and not just recovery and
2) an efficient mapping function can improve parallel I/O submission, as
   two different I/Os are not necessarily going to the same disks in the
   array. 

For the mapping function, a hashing algorithm like Ceph's CRUSH [2]
would be ideal, as it provides a pseudo-random but deterministic mapping
of the I/O onto the drives.
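
To make that concrete, here is a toy stand-in for such a mapping (plain C,
not CRUSH itself; the function name is made up, and a real mapping must also
balance load and avoid duplicate picks):

	#include <stdint.h>

	/*
	 * Pick k of n_disks member disks for a stripe, deterministically from
	 * the stripe number: pseudo random, but every node computes the same
	 * answer for the same stripe.
	 */
	static void declustered_map(uint64_t stripe, int n_disks, int k, int *out)
	{
		uint64_t h = stripe;
		int i;

		for (i = 0; i < k; i++) {
			/* one step of a 64-bit LCG as a cheap deterministic hash */
			h = h * 6364136223846793005ULL + 1442695040888963407ULL;
			out[i] = (int)(h % (uint64_t)n_disks);
		}
	}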

This whole declustering of course only makes sense for more than (at
least) 4 drives, but we do have customers with several orders of
magnitude more drives in an MD array.

At LSF I'd like to discuss if:
1) The wider MD audience is interested in de-clustered RAID with MD
2) de-clustered RAID should be implemented as a sublevel of RAID5 or
   as a new personality
3) CRUSH is a suitable algorithm for this (there's evidence in [3] that
   the NetApp E-Series Arrays do use CRUSH for parity declustering)

[1] http://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf 
[2] https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
[3]
https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/DistributedStorage/Jibbe-Gwaltney_Method-to_Establish_High_Availability.pdf

Thanks,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 2/2] xfs: remove assert to check bytes returned

2018-01-29 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Since we can return less than count in case of partial direct
writes, remove the ASSERT.

Signed-off-by: Goldwyn Rodrigues 
Acked-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/xfs_file.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8601275cc5e6..8fc4dbf66910 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -590,12 +590,6 @@ xfs_file_dio_aio_write(
	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
xfs_iunlock(ip, iolock);
-
-   /*
-* No fallback to buffered IO on errors for XFS, direct IO will either
-* complete fully or fail.
-*/
-   ASSERT(ret < 0 || ret == count);
return ret;
 }
 
-- 
2.16.1



[PATCH v6 1/2] Return bytes transferred for partial direct I/O

2018-01-29 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

In case direct I/O encounters an error midway, it returns the error.
Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, file content has data written
in step 3.

Added a sysctl entry, dio_short_writes, which is on by default. This is
to support applications which expect either an error or the bytes submitted
as the return value of write calls.

This fixes fstest generic/472.

Signed-off-by: Goldwyn Rodrigues 
---

Changes since v1:
 - incorporated iomap and block devices

Changes since v2:
 - realized that file size was not increasing when performing a (partial)
   direct I/O because end_io function was receiving the error instead of
   size. Fixed.

Changes since v3:
 - [hch] initialize transferred with dio->size and use transferred instead
   of dio->size.

Changes since v4:
 - Refreshed to v4.14

Changes since v5:
 - Added /proc/sys/fs/dio_short_writes (default 1) to guard older applications
   which expect write(fd, buf, count) to return either count or an error.
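
For quick testing, the knob described above can be toggled at runtime, e.g. to
restore the old all-or-nothing semantics for legacy applications:

	echo 0 > /proc/sys/fs/dio_short_writes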


 Documentation/sysctl/fs.txt | 14 ++
 fs/block_dev.c  |  2 +-
 fs/direct-io.c  |  7 +--
 fs/iomap.c  | 23 ---
 include/linux/fs.h  |  1 +
 kernel/sysctl.c |  9 +
 6 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 6c00c1e2743f..72e213d62511 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -22,6 +22,7 @@ Currently, these files are in /proc/sys/fs:
 - aio-max-nr
 - aio-nr
 - dentry-state
+- dio_short_writes
 - dquot-max
 - dquot-nr
 - file-max
@@ -76,6 +77,19 @@ dcache isn't pruned yet.
 
 ==
 
+dio_short_writes:
+
+In case Direct I/O encounters an transient error, it returns
+the errorcode, even if it has performed part of the write.
+This flag, if on (default), will return the number of bytes written
+so far, as the write(2) symantics are. However, some older applications
+still consider a direct write as an error if all of the I/O
+submitted is not complete. ie write(file, count, buf) != count.
+This option can be disabled on systems in order to support
+existing applications which do not expect short writes.
+
+==
+
 dquot-max & dquot-nr:
 
 The file dquot-max shows the maximum number of cached disk
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..49d94360bb51 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
*iter, int nr_pages)
 
if (!ret)
ret = blk_status_to_errno(dio->bio.bi_status);
-   if (likely(!ret))
+   if (likely(dio->size))
ret = dio->size;
 
	bio_put(&dio->bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 3aafb3343a65..9bd15be64c25 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -151,6 +151,7 @@ struct dio {
 } cacheline_aligned_in_smp;
 
 static struct kmem_cache *dio_cache __read_mostly;
+unsigned int sysctl_dio_short_writes = 1;
 
 /*
  * How many pages are in the queue?
@@ -262,7 +263,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, 
unsigned int flags)
ret = dio->page_errors;
if (ret == 0)
ret = dio->io_error;
-   if (ret == 0)
+   if (!sysctl_dio_short_writes && (ret == 0))
ret = transferred;
 
if (dio->end_io) {
@@ -310,7 +311,9 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, 
unsigned int flags)
}
 
kmem_cache_free(dio_cache, dio);
-   return ret;
+   if (!sysctl_dio_short_writes)
+   return ret;
+   return transferred ? transferred : ret;
 }
 
 static void dio_aio_complete_work(struct work_struct *work)
diff --git a/fs/iomap.c b/fs/iomap.c
index 47d29ccffaef..a8d6908dc0de 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -716,23 +716,24 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
struct kiocb *iocb = dio->iocb;
struct inode *inode = file_inode(iocb->ki_filp);
loff_t offset = iocb->ki_pos;
-   ssize_t ret;
+   ssize_t err;
+   ssize_t transferred = dio->size;
 
if (dio->end_io) {
-   ret = dio->end_io(iocb,
-   dio->error ? dio->error : dio->size,
-   dio->flags);
+   err = dio->end_io(iocb,
+ (transferred && sysctl_dio_short_writes) ?
+   transferred : dio->error,
+

Re: [PATCH v4 13/13] bcache: add stop_when_cache_set_failed to struct cached_dev

2018-01-29 Thread Coly Li
On 29/01/2018 8:57 PM, Nix wrote:
> On 27 Jan 2018, Coly Li said:
> 
>> Current bcache failure handling code will stop all attached bcache devices
>> when the cache set is broken or disconnected. This is desired behavior for
>> most of enterprise or cloud use cases, but maybe not for low end
>> configuration. Nix  points out, users may still want to
>> access the bcache device after cache device failed, for example on laptops.
> 
> Actually I'm much interested in the server use case. On laptops, it's
> relatively easy to recover if you know what you're doing, because they
> usually have a user in front of them with console access -- but if a
> remote headless server with a hundred users a thousand miles away has
> its cache device wear out I would really rather the hundred users get
> served, if more slowly, rather than the whole machine going down for
> hours or days until I can get someone there to bash on the hardware!
> 

Hi Nix,

Thanks for the input, I didn't think of such a use case before. It makes a
lot of sense!

> (Sure, ideally you'd detect the wearing out in advance, but SSDs are not
> always nice and helpful like that and sometimes just go instantly
> readonly or simply vanish off the bus entirely without warning.)
> 

Yes. Then in the v5 patch set, I will add an option for "always"/"auto",
which will leave the bcache device alive if the broken cache set is clean.

Thank you all again, for the insight and brilliant suggestion !

Coly Li


Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly

2018-01-29 Thread Coly Li
On 29/01/2018 8:22 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Hello Coly:
> 
> There are some differences,
> Using variable of atomic_t type can not guarantee the atomicity of 
> transaction.
> for example:
> A thread runs in update_writeback_rate()
> update_writeback_rate(){
>   
> + if (test_bit(BCACHE_DEV_WB_RUNNING, >disk.flags)) {
> + schedule_delayed_work(>writeback_rate_update,
> dc->writeback_rate_update_seconds * HZ);
> + }
> 
> Then another thread executes in cached_dev_detach_finish():
>   if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, >disk.flags))
>   cancel_writeback_rate_update_dwork(dc);
> 
> +
> + /*
> +  * should check BCACHE_DEV_RATE_DW_RUNNING before calling
> +  * cancel_delayed_work_sync().
> +  */
> + clear_bit(BCACHE_DEV_RATE_DW_RUNNING, >disk.flags);
> + /* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
> + smp_mb();
> 
> Race still exists.
>  

Hi Junhui,

Check super.c:cancel_writeback_rate_update_dwork(),
BCACHE_DEV_RATE_DW_RUNNING is checked there.

You may see in cached_dev_detach_finish() and update_writeback_rate(),
the orders to check BCACHE_DEV_RATE_DW_RUNNING and BCACHE_DEV_WB_RUNNING
are different.

cached_dev_detach_finish()  update_writeback_rate()

test_and_clear_bit  set_bit
BCACHE_DEV_WB_RUNNING   BCACHE_DEV_RATE_DW_RUNNING

(implicit smp_mb()) smp_mb()

test_bittest_bit
BCACHE_DEV_RATE_DW_RUNNING  BCACHE_DEV_WB_RUNNING

clear_bit()
BCACHE_DEV_RATE_DW_RUNNING

smp_mb()


These two flags are accessed in reverse order in the two locations, and
there is an smp_mb() between the accesses to the two flags to serialize
the access order.

With the above reverse-order accesses, it is guaranteed that
- in cached_dev_detach_finish(), before
test_bit(BCACHE_DEV_RATE_DW_RUNNING), bit BCACHE_DEV_WB_RUNNING must
already be cleared.
- in update_writeback_rate(), before test_bit(BCACHE_DEV_WB_RUNNING),
BCACHE_DEV_RATE_DW_RUNNING must already be set.

Therefore in your example, if a thread is testing BCACHE_DEV_WB_RUNNING
in update_writeback_rate(), it means BCACHE_DEV_RATE_DW_RUNNING must
already be set. So in cancel_writeback_rate_update_dwork() the other thread
must wait until BCACHE_DEV_RATE_DW_RUNNING is cleared before
cancel_delayed_work_sync() can be called. And in update_writeback_rate()
the bit BCACHE_DEV_RATE_DW_RUNNING is cleared after
schedule_delayed_work() returns, so the race is killed.
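
A condensed sketch of the two paths described above, using the flag, field
and function names from this thread (everything else omitted):

	/* update_writeback_rate() side */
	set_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
	smp_mb();					/* pairs with the cancel side */
	if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
		schedule_delayed_work(&dc->writeback_rate_update,
				      dc->writeback_rate_update_seconds * HZ);
	clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
	smp_mb();

	/* cached_dev_detach_finish() side */
	if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
		/* cancel_writeback_rate_update_dwork(): wait for
		 * BCACHE_DEV_RATE_DW_RUNNING to clear, then ... */
		cancel_delayed_work_sync(&dc->writeback_rate_update);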

A mutex lock implies a memory barrier, and in your suggestion
up_read(&dc->writeback_lock) comes after schedule_delayed_work() too.
This is why I said they are almost the same.

Thanks.

Coly Li

>>
>> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
>>> From: Tang Junhui 
>>>
>>> Hello Coly:
>>>
>>> This patch is somewhat difficult for me,
>>> I think we can resolve it in a simple way.
>>>
>>> We can take the schedule_delayed_work() under the protection of 
>>> dc->writeback_lock, and judge if we need re-arm this work to queue.
>>>
>>> static void update_writeback_rate(struct work_struct *work)
>>> {
>>> struct cached_dev *dc = container_of(to_delayed_work(work),
>>>  struct cached_dev,
>>>  writeback_rate_update);
>>>
>>> down_read(>writeback_lock);
>>>
>>> if (atomic_read(>has_dirty) &&
>>> dc->writeback_percent)
>>> __update_writeback_rate(dc);
>>>
>>> -up_read(>writeback_lock);
>>> +if (NEED_RE-AEMING)
>>> schedule_delayed_work(>writeback_rate_update,
>>>   dc->writeback_rate_update_seconds * HZ);
>>> +up_read(>writeback_lock);
>>> }
>>>
>>> In cached_dev_detach_finish() and cached_dev_free() we can set the no need
>>> flag under the protection of dc->writeback_lock, for example:
>>>
>>> static void cached_dev_detach_finish(struct work_struct *w)
>>> {
>>> ...
>>> +down_write(>writeback_lock);
>>> +SET NO NEED RE-ARM FLAG
>>> +up_write(>writeback_lock);
>>> cancel_delayed_work_sync(>writeback_rate_update);
>>> }
>>>
>>> I think this way is more simple and readable.
>>>
>>
>> Hi Junhui,
>>
>> Your suggest is essentially almost same to my patch,
>> - clear BCACHE_DEV_DETACHING bit acts as SET NO NEED RE-ARM FLAG.
>> - cancel_writeback_rate_update_dwork acts as some kind of locking with a
>> timeout.
>>
>> The difference is I don't use dc->writeback_lock, and replace it by
>> BCACHE_DEV_RATE_DW_RUNNING.
>>
>> The reason is my following development. I plan to implement a real-time
>> update stripe_sectors_dirty of bcache device and cache set, then
>> bcache_flash_devs_sectors_dirty() can be very fast and bch_register_lock
>> can be removed here. And then I also plan 

Re: [PATCH v4 13/13] bcache: add stop_when_cache_set_failed to struct cached_dev

2018-01-29 Thread Nix
On 27 Jan 2018, Coly Li said:

> Current bcache failure handling code will stop all attached bcache devices
> when the cache set is broken or disconnected. This is desired behavior for
> most of enterprise or cloud use cases, but maybe not for low end
> configuration. Nix  points out, users may still want to
> access the bcache device after cache device failed, for example on laptops.

Actually I'm much interested in the server use case. On laptops, it's
relatively easy to recover if you know what you're doing, because they
usually have a user in front of them with console access -- but if a
remote headless server with a hundred users a thousand miles away has
its cache device wear out I would really rather the hundred users get
served, if more slowly, rather than the whole machine going down for
hours or days until I can get someone there to bash on the hardware!

(Sure, ideally you'd detect the wearing out in advance, but SSDs are not
always nice and helpful like that and sometimes just go instantly
readonly or simply vanish off the bus entirely without warning.)

-- 
NULL && (void)


Re: [PATCH v4 05/13] bcache: stop dc->writeback_rate_update properly

2018-01-29 Thread tang . junhui
From: Tang Junhui 

Hello Coly:

There are some differences,
Using variable of atomic_t type can not guarantee the atomicity of transaction.
for example:
A thread runs in update_writeback_rate()
update_writeback_rate(){

+   if (test_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags)) {
+   schedule_delayed_work(&dc->writeback_rate_update,
  dc->writeback_rate_update_seconds * HZ);
+   }

Then another thread executes in cached_dev_detach_finish():
if (test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags))
cancel_writeback_rate_update_dwork(dc);

+
+   /*
+* should check BCACHE_DEV_RATE_DW_RUNNING before calling
+* cancel_delayed_work_sync().
+*/
+   clear_bit(BCACHE_DEV_RATE_DW_RUNNING, &dc->disk.flags);
+   /* paired with where BCACHE_DEV_RATE_DW_RUNNING is tested */
+   smp_mb();

Race still exists.
 
> 
> On 29/01/2018 3:35 PM, tang.jun...@zte.com.cn wrote:
> > From: Tang Junhui 
> > 
> > Hello Coly:
> > 
> > This patch is somewhat difficult for me,
> > I think we can resolve it in a simple way.
> > 
> > We can take the schedule_delayed_work() under the protection of 
> > dc->writeback_lock, and judge if we need re-arm this work to queue.
> > 
> > static void update_writeback_rate(struct work_struct *work)
> > {
> > struct cached_dev *dc = container_of(to_delayed_work(work),
> >  struct cached_dev,
> >  writeback_rate_update);
> > 
> > down_read(>writeback_lock);
> > 
> > if (atomic_read(>has_dirty) &&
> > dc->writeback_percent)
> > __update_writeback_rate(dc);
> > 
> > -up_read(>writeback_lock);
> > +if (NEED_RE-AEMING)
> > schedule_delayed_work(>writeback_rate_update,
> >   dc->writeback_rate_update_seconds * HZ);
> > +up_read(>writeback_lock);
> > }
> > 
> > In cached_dev_detach_finish() and cached_dev_free() we can set the no need
> > flag under the protection of dc->writeback_lock, for example:
> > 
> > static void cached_dev_detach_finish(struct work_struct *w)
> > {
> > ...
> > +down_write(>writeback_lock);
> > +SET NO NEED RE-ARM FLAG
> > +up_write(>writeback_lock);
> > cancel_delayed_work_sync(>writeback_rate_update);
> > }
> > 
> > I think this way is more simple and readable.
> > 
> 
> Hi Junhui,
> 
> Your suggest is essentially almost same to my patch,
> - clear BCACHE_DEV_DETACHING bit acts as SET NO NEED RE-ARM FLAG.
> - cancel_writeback_rate_update_dwork acts as some kind of locking with a
> timeout.
> 
> The difference is I don't use dc->writeback_lock, and replace it by
> BCACHE_DEV_RATE_DW_RUNNING.
> 
> The reason is my following development. I plan to implement a real-time
> update stripe_sectors_dirty of bcache device and cache set, then
> bcache_flash_devs_sectors_dirty() can be very fast and bch_register_lock
> can be removed here. And then I also plan to remove reference of
> dc->writeback_lock in update_writeback_rate() because indeed it is
> unnecessary here (the patch is held by Mike's locking resort work).
> 
> Since I plan to remove dc->writeback_lock from update_writeback_rate(),
> I don't want to reference dc->writeback in the delayed work.
> 
> The basic idea behind your suggestion and this patch, is almost
> identical. The only difference might be the timeout in
> cancel_writeback_rate_update_dwork().
> 
> Thanks.
> 
> Coly Li

Thanks.
Tang Junhui



Re: [PATCH] bcache: finish incremental GC

2018-01-29 Thread Coly Li
On 29/01/2018 7:58 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> In the GC thread, we record the latest GC key in gc_done, which is expected
> to be used for incremental GC, but the current code doesn't make use of
> it. When GC runs, front-side IO is blocked until the GC is over, which can
> be a long time if there are a lot of btree nodes.
> 
> This patch implements incremental GC; the main idea is that, when there
> are front-side I/Os, after GC'ing some nodes (100) we stop GC, release the
> lock of the btree node, and go process the front-side I/Os for some time
> (100 ms), then go back to GC again.
> 
> With this patch, when we are doing GC, I/Os are not blocked all the time,
> and there is no obvious problem of I/O dropping to zero any more.
> 
> Signed-off-by: Tang Junhui 

Hi Junhui,

It sounds great :-) Currently I am not very familiar with the gc code; I
need some time to understand how bcache GC works before providing my
feedback. Please give me some time; I hope to be able to provide my
comments after the Chinese Spring Festival.

Hope other people may have a look before I am able to.

Thanks.

Coly Li



[code snip]


[PATCH] bcache: finish incremental GC

2018-01-29 Thread tang . junhui
From: Tang Junhui 

In the GC thread, we record the latest GC key in gc_done, which is expected
to be used for incremental GC, but the current code doesn't make use of
it. When GC runs, front-side IO is blocked until the GC is over, which can
be a long time if there are a lot of btree nodes.

This patch implements incremental GC; the main idea is that, when there
are front-side I/Os, after GC'ing some nodes (100) we stop GC, release the
lock of the btree node, and go process the front-side I/Os for some time
(100 ms), then go back to GC again.

With this patch, when we are doing GC, I/Os are not blocked all the time,
and there is no obvious problem of I/O dropping to zero any more.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/bcache.h  |  5 +
 drivers/md/bcache/btree.c   | 15 ++-
 drivers/md/bcache/request.c |  3 +++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index abd31e8..c047aff 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -445,6 +445,7 @@ struct cache {
 
 struct gc_stat {
size_t  nodes;
+   size_t  nodes_pre;
size_t  key_bytes;
 
size_t  nkeys;
@@ -568,6 +569,10 @@ struct cache_set {
 */
	atomic_t	rescale;
/*
+* used for GC, identify if any front side I/Os is inflight
+*/
+	atomic_t	io_inflight;
+   /*
 * When we invalidate buckets, we use both the priority and the amount
 * of good data to determine which buckets to reuse first - to weight
 * those together consistently we keep track of the smallest nonzero
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 658c54b..48a4d18 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -90,6 +90,9 @@
 
 #define MAX_NEED_GC64
 #define MAX_SAVE_PRIO  72
+#define MIN_GC_NODES   100
+#define GC_SLEEP_TIME  100
+
 
 #define PTR_DIRTY_BIT  (((uint64_t) 1 << 36))
 
@@ -1573,6 +1576,13 @@ static int btree_gc_recurse(struct btree *b, struct 
btree_op *op,
memmove(r + 1, r, sizeof(r[0]) * (GC_MERGE_NODES - 1));
r->b = NULL;
 
+		if (atomic_read(&b->c->io_inflight) &&
+   gc->nodes >= gc->nodes_pre + MIN_GC_NODES) {
+   gc->nodes_pre =  gc->nodes;
+   ret = -EAGAIN;
+   break;
+   }
+
if (need_resched()) {
ret = -EAGAIN;
break;
@@ -1742,7 +1752,10 @@ static void bch_btree_gc(struct cache_set *c)
closure_sync();
cond_resched();
 
-   if (ret && ret != -EAGAIN)
+   if (ret == -EAGAIN)
+   schedule_timeout_interruptible(msecs_to_jiffies
+  (GC_SLEEP_TIME));
+   else if (ret)
pr_warn("gc failed!");
} while (ret);
 
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3475d66..a8b1226 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -628,7 +628,9 @@ static void do_bio_hook(struct search *s, struct bio 
*orig_bio)
 static void search_free(struct closure *cl)
 {
struct search *s = container_of(cl, struct search, cl);
+
bio_complete(s);
+	atomic_dec(&s->d->c->io_inflight);
 
if (s->iop.bio)
bio_put(s->iop.bio);
@@ -646,6 +648,7 @@ static inline struct search *search_alloc(struct bio *bio,
 
	closure_init(&s->cl, NULL);
	do_bio_hook(s, bio);
+	atomic_inc(&d->c->io_inflight);
 
s->orig_bio = bio;
s->cache_miss   = NULL;
-- 
1.8.3.1



Re: [LSF/MM TOPIC] Springfield: smart and automated storage

2018-01-29 Thread Jan Tulak
On Sat, Jan 27, 2018 at 5:32 PM, James Bottomley
 wrote:
> On Sat, 2018-01-27 at 09:37 +0100, Jan Tulak wrote:
>> Springfield is a collection of projects unifying multiple levels of
>> the storage stack and providing a general API for automation, health
>> and status monitoring, as well as sane and easy configuration across
>> multiple levels of the storage stack. It is a scalable solution,
>> working from a single node to large deployments with a mix of bare
>> metal, containers and VMs. It was first presented at Vault 2017 in a
>> birds-of-a-feather session.
>>
>> Springfield builds upon and enhances the existing work of those
>> projects:
>> * udisks
>> * libblockdev
>> * blivet
>> * libstoragemgmt
>>
>> It is a coordinated effort to overcome many of the shortcomings of
>> the current situation and provide a unified approach to storage
>> management, so there is nothing like a binary or a library with the
>> name “Springfield.”
>>
>> An example of things we are trying to tackle is a useful storage
>> reporting, like notifications about filesystem being full, especially
>> with thin provisioning and multiple levels of the storage stack,
>> where
>> the user-visible status is not a true representation of what is
>> happening underneath.
>>
>> We want to see how the community sees this effort. What shortcomings
>> of the current situation in storage layering can Springfield address
>> (is there something we don’t see, but a project like this could
>> help)? How would you use it, or what would make it more useful to
>> you?
>
> Do you have a link to the project page and the source tree(s)?
>

Sure, for Springfield the website is: https://springfield-project.github.io/

Each of the listed projects has its own repositories:
https://github.com/storaged-project/udisks
https://github.com/libstorage/libstoragemgmt
https://github.com/storaged-project/libblockdev
https://github.com/storaged-project/blivet

Cheers,
Jan