Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread NeilBrown
On Wed, Nov 22 2017, Mikulas Patocka wrote:

> On Wed, 22 Nov 2017, NeilBrown wrote:
>
>> On Tue, Nov 21 2017, Mikulas Patocka wrote:
>> 
>> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
>> >
>> >> On Tue, Nov 21 2017 at  4:23pm -0500,
>> >> Mikulas Patocka  wrote:
>> >> 
>> >> > This is not correct:
>> >> > 
>> >> >2206 static void dm_wq_work(struct work_struct *work)
>> >> >2207 {
>> >> >2208 struct mapped_device *md = container_of(work, struct mapped_device, work);
>> >> >2209 struct bio *bio;
>> >> >2210 int srcu_idx;
>> >> >2211 struct dm_table *map;
>> >> >2212
>> >> >2213 if (!bio_list_empty(&md->rescued)) {
>> >> >2214 struct bio_list list;
>> >> >2215 spin_lock_irq(&md->deferred_lock);
>> >> >2216 list = md->rescued;
>> >> >2217 bio_list_init(&md->rescued);
>> >> >2218 spin_unlock_irq(&md->deferred_lock);
>> >> >2219 while ((bio = bio_list_pop(&list)))
>> >> >2220 generic_make_request(bio);
>> >> >2221 }
>> >> >
>> >> >2223 map = dm_get_live_table(md, &srcu_idx);
>> >> >2224
>> >> >2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
>> >> >2226 spin_lock_irq(&md->deferred_lock);
>> >> >2227 bio = bio_list_pop(&md->deferred);
>> >> >2228 spin_unlock_irq(&md->deferred_lock);
>> >> >2229
>> >> >2230 if (!bio)
>> >> >2231 break;
>> >> >2232
>> >> >2233 if (dm_request_based(md))
>> >> >2234 generic_make_request(bio);
>> >> >2235 else
>> >> >2236 __split_and_process_bio(md, map, bio);
>> >> >2237 }
>> >> >2238
>> >> >2239 dm_put_live_table(md, srcu_idx);
>> >> >2240 }
>> >> > 
>> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
>> >> > will not process md->rescued list.
>> >> 
>> >> Can you elaborate further?  We cannot be "in dm_wq_work in
>> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
>> >> of scheduling away from __split_and_process_bio?
>> >> 
>> >> The more detail you can share the better.
>> >
>> > Suppose this scenario:
>> >
>> > * dm_wq_work calls __split_and_process_bio
>> > * __split_and_process_bio eventually reaches the function snapshot_map
>> > * snapshot_map attempts to take the snapshot lock
>> >
>> > * the snapshot lock could be released only if some bios submitted by the 
>> > snapshot driver to the underlying device complete
>> > * the bios submitted to the underlying device were already offloaded by 
>> > some other task and they are waiting on the list md->rescued
>> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
>> > blocked in snapshot_map (called from __split_and_process_bio)
>> 
>> Yes, I think you are right. 
>> 
>> I think the solution is to get rid of the dm_offload() infrastructure
>> and make it not necessary.
>> i.e. discard my patches
>> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
>> and
>> dm: revise 'rescue' strategy for bio-based bioset allocations
>> 
>> And build on "dm: ensure bio submission follows a depth-first tree walk"
>> which was written after those and already makes dm_offload() less
>> important.
>> 
>> Since that "depth-first" patch, every request to the dm device, after
>> the initial splitting, allocates just one dm_target_io structure, and
>> makes just one __map_bio() call, and so will behave exactly the way
>> generic_make_request() expects and copes with - thus avoiding awkward
>> dependencies and deadlocks.  Except
>> 
>> a/ If any target defines ->num_write_bios() to return >1,
>>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>>and __map_bio(), which might need rescuing.
>>But no target defines num_write_bios, and none have since it was
>>removed from dm-cache 4.5 years ago.
>>Can we discard num_write_bios??
>> 
>> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>>to a value > 1, then __send_duplicate_bios() will also make multiple
>>calls to alloc_tio() and __map_bio().
>>Some do.
>>  dm-cache-target:  flush=2
>>  dm-snap: flush=2
>>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
>> 
>> These will only be a problem if the second (or subsequent) alloc_tio()
>> blocks waiting for an earlier allocation to complete.  This will only
>> be a problem if multiple threads are each trying to allocate multiple
>> dm_target_io from the same bioset at the same time.
>> This is rare and should be easier to address than the current
>> dm_offload() approach. 
>> One possibility would be to copy the approach taken

[PATCH V2 2/2] block: drain blkcg part of request_queue in blk_cleanup_queue()

2017-11-22 Thread Ming Lei
Now once blk_freeze_queue() returns, all requests (in-queue and pending)
are drained, but we still need to drain the blkcg part of the request_queue
for both blk-mq and legacy, so this patch calls blkcg_drain_queue()
explicitly in blk_cleanup_queue() to do that.

Then the __blk_drain_queue() call in blk_cleanup_queue() is covered by both
blk_freeze_queue() and blkcg_drain_queue(), and tasks blocked in get_request()
are woken up in blk_set_queue_dying() too, so remove it from
blk_cleanup_queue().

Cc: Wen Xiong 
Cc: Mauricio Faria de Oliveira 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1038706edd87..f3f6f11a5b31 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -689,8 +689,7 @@ void blk_cleanup_queue(struct request_queue *q)
 */
blk_freeze_queue(q);
spin_lock_irq(lock);
-   if (!q->mq_ops)
-   __blk_drain_queue(q, true);
+   blkcg_drain_queue(q);
queue_flag_set(QUEUE_FLAG_DEAD, q);
spin_unlock_irq(lock);
 
-- 
2.9.5



[PATCH V2 0/2] block: fix queue freeze and cleanup

2017-11-22 Thread Ming Lei
Hi Jens,

The 1st patch runs the queue in blk_freeze_queue_start() to fix a
regression introduced by 055f6e18e08f ("block: Make q_usage_counter also
track legacy requests").

The 2nd patch drains the blkcg part of the request_queue for both blk-mq
and legacy, which also serves as a fix for blk-mq's queue cleanup.

V2:
- follow Bart's suggestion to use run queue instead of drain queue
- drains the blkcg part of the request_queue for blk-mq

thanks,
Ming

Ming Lei (2):
  block: run queue before waiting for q_usage_counter becoming zero
  block: drain blkcg part of request_queue after queue is frozen

 block/blk-core.c | 3 +--
 block/blk-mq.c   | 2 ++
 2 files changed, 3 insertions(+), 2 deletions(-)

-- 
2.9.5



[PATCH V2 1/2] block: run queue before waiting for q_usage_counter becoming zero

2017-11-22 Thread Ming Lei
Now we track legacy requests with .q_usage_counter, since commit 055f6e18e08f
("block: Make q_usage_counter also track legacy requests"), but that commit
never runs the legacy queue before waiting for this counter to become zero,
so an IO hang is caused in the test of pulling a disk during IO.

This patch fixes the issue by running the legacy queue in
blk_freeze_queue_start(), as blk-mq does, before waiting for
q_usage_counter to become zero.

Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
Cc: Wen Xiong 
Cc: Mauricio Faria de Oliveira 
Suggested-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11097477eeab..e2b6a57b004d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -128,6 +128,8 @@ void blk_freeze_queue_start(struct request_queue *q)
percpu_ref_kill(&q->q_usage_counter);
if (q->mq_ops)
blk_mq_run_hw_queues(q, false);
+   else
+   blk_run_queue(q);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



Re: Re: [bug report] bcache stuck when writing journal

2017-11-22 Thread tang . junhui
From: Tang Junhui 

Hi, RuiHua:

> I have met the similar problem once.
> It looks like a deadlock between the cache device register thread and
> bcache_allocator thread.
> 
> The trace info tells us the journal is full; probably the allocator
> thread waits on bch_prio_write()->prio_io()->bch_journal_meta(), but
> there are no RESERVE_BTREE buckets to use for journal replay at this
> time, so the register thread waits on
> bch_journal_replay()->bch_btree_insert()
> 
> The path which your register command possibly blocked:
> run_cache_set()
>   -> bch_journal_replay()
>   -> bch_btree_insert()
>   -> btree_insert_fn()
>   -> bch_btree_insert_node()
>   -> btree_split()
>   -> btree_check_reserve() here we find
> RESERVE_BTREE buckets is empty, and then schedule out...
> 
> bch_allocator_thread()
>   ->bch_prio_write()
>  ->bch_journal_meta()
> 
> 
> You can apply this patch to your code and try to register again. This
> is for your reference only, because this patch was not verified in my
> environment (my env was damaged last time, before I could dig into the
> code and write this patch). I hope it can resolve your problem :-)
Thanks very much, your advice and patch gave me a lot of inspiration.
I will do a further analysis, then apply this patch and try again.

Thanks,
Tang




Re: [PATCH] block: drain queue before waiting for q_usage_counter becoming zero

2017-11-22 Thread Ming Lei
On Wed, Nov 22, 2017 at 04:47:48PM +, Bart Van Assche wrote:
> On Wed, 2017-11-22 at 13:11 +0800, Ming Lei wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 11097477eeab..3d3797327491 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -161,6 +161,8 @@ void blk_freeze_queue(struct request_queue *q)
> >  * exported to drivers as the only user for unfreeze is blk_mq.
> >  */
> > blk_freeze_queue_start(q);
> > +   if (!q->mq_ops)
> > +   blk_drain_queue(q);
> > blk_mq_freeze_queue_wait(q);
> >  }
> 
> Since q_usage_counter now tracks legacy requests, is there any reason why we
> still need __blk_drain_queue()? Have you considered eliminating
> __blk_drain_queue() and calling blk_run_queue() from inside blk_freeze_queue()
> instead of calling blk_drain_queue()? I'm asking this because

Yeah, that looks better. I was thinking of that too; I will do it this
way in V2.

-- 
Ming


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Tue, Nov 21 2017 at 11:28pm -0500,
Mike Snitzer  wrote:

> 
> I'll work through your patches tomorrow.

Please see the top 3 patches on this branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

This rebased dm-4.16 branch seems to be working well so far.

Mike


Re: new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mike Snitzer
On Wed, Nov 22 2017 at  1:24pm -0500,
Mikulas Patocka  wrote:
 
> Another problem is this:
> 
> struct bio *b = bio_clone_bioset(bio, GFP_NOIO, md->queue->bio_split);
> bio_advance(b, (bio_sectors(b) - ci.sector_count) << 9);
> bio_chain(b, bio);
> 
> What if it blocks because the bioset is exhausted?
> 
> The code basically builds a chain of bios of unlimited length (suppose, for
> example, a case where we are splitting on every sector boundary, so there
> will be one bio for every sector in the original bio); it could exhaust
> the bioset easily.
> 
> It would be better to use the mechanism from md-raid that chains all the
> sub-bios to the same master bio and doesn't create long chains of bios:
> 
> if (max_sectors < bio_sectors(bio)) {
> struct bio *split = bio_split(bio, max_sectors,
>   gfp, conf->bio_split);
> bio_chain(split, bio);
> generic_make_request(bio);
> bio = split;
> r1_bio->master_bio = bio;
> r1_bio->sectors = max_sectors;
> }

I'd be happy to take an incremental patch that improves on this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=b46d6a08f1ae7bf53e4cde28e0ccdf91567d432e

But short of that I'll have to come back to this.

Thanks,
Mike


Re: [PATCH] block: remove useless assignment in bio_split

2017-11-22 Thread Jens Axboe
On 11/22/2017 11:18 AM, Mikulas Patocka wrote:
> Remove useless assignment to the variable "split" because the variable is
> unconditionally assigned later.

Thanks, applied.

-- 
Jens Axboe



Re: [dm-devel] new patchset to eliminate DM's use of BIOSET_NEED_RESCUER

2017-11-22 Thread Mikulas Patocka


On Wed, 22 Nov 2017, NeilBrown wrote:

> On Tue, Nov 21 2017, Mikulas Patocka wrote:
> 
> > On Tue, 21 Nov 2017, Mike Snitzer wrote:
> >
> >> On Tue, Nov 21 2017 at  4:23pm -0500,
> >> Mikulas Patocka  wrote:
> >> 
> >> > This is not correct:
> >> > 
> >> >2206 static void dm_wq_work(struct work_struct *work)
> >> >2207 {
> >> >2208 struct mapped_device *md = container_of(work, struct mapped_device, work);
> >> >2209 struct bio *bio;
> >> >2210 int srcu_idx;
> >> >2211 struct dm_table *map;
> >> >2212
> >> >2213 if (!bio_list_empty(&md->rescued)) {
> >> >2214 struct bio_list list;
> >> >2215 spin_lock_irq(&md->deferred_lock);
> >> >2216 list = md->rescued;
> >> >2217 bio_list_init(&md->rescued);
> >> >2218 spin_unlock_irq(&md->deferred_lock);
> >> >2219 while ((bio = bio_list_pop(&list)))
> >> >2220 generic_make_request(bio);
> >> >2221 }
> >> >
> >> >2223 map = dm_get_live_table(md, &srcu_idx);
> >> >2224
> >> >2225 while (!test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) {
> >> >2226 spin_lock_irq(&md->deferred_lock);
> >> >2227 bio = bio_list_pop(&md->deferred);
> >> >2228 spin_unlock_irq(&md->deferred_lock);
> >> >2229
> >> >2230 if (!bio)
> >> >2231 break;
> >> >2232
> >> >2233 if (dm_request_based(md))
> >> >2234 generic_make_request(bio);
> >> >2235 else
> >> >2236 __split_and_process_bio(md, map, bio);
> >> >2237 }
> >> >2238
> >> >2239 dm_put_live_table(md, srcu_idx);
> >> >2240 }
> >> > 
> >> > You can see that if we are in dm_wq_work in __split_and_process_bio, we 
> >> > will not process md->rescued list.
> >> 
> >> Can you elaborate further?  We cannot be "in dm_wq_work in
> >> __split_and_process_bio" simultaneously.  Do you mean as a side-effect
> >> of scheduling away from __split_and_process_bio?
> >> 
> >> The more detail you can share the better.
> >
> > Suppose this scenario:
> >
> > * dm_wq_work calls __split_and_process_bio
> > * __split_and_process_bio eventually reaches the function snapshot_map
> > * snapshot_map attempts to take the snapshot lock
> >
> > * the snapshot lock could be released only if some bios submitted by the 
> > snapshot driver to the underlying device complete
> > * the bios submitted to the underlying device were already offloaded by 
> > some other task and they are waiting on the list md->rescued
> > * the bios waiting on md->rescued are not processed, because dm_wq_work is 
> > blocked in snapshot_map (called from __split_and_process_bio)
> 
> Yes, I think you are right. 
> 
> I think the solution is to get rid of the dm_offload() infrastructure
> and make it not necessary.
> i.e. discard my patches
> dm: prepare to discontinue use of BIOSET_NEED_RESCUER
> and
> dm: revise 'rescue' strategy for bio-based bioset allocations
> 
> And build on "dm: ensure bio submission follows a depth-first tree walk"
> which was written after those and already makes dm_offload() less
> important.
> 
> Since that "depth-first" patch, every request to the dm device, after
> the initial splitting, allocates just one dm_target_io structure, and
> makes just one __map_bio() call, and so will behave exactly the way
> generic_make_request() expects and copes with - thus avoiding awkward
> dependencies and deadlocks.  Except
> 
> a/ If any target defines ->num_write_bios() to return >1,
>__clone_and_map_data_bio() will make multiple calls to alloc_tio()
>and __map_bio(), which might need rescuing.
>But no target defines num_write_bios, and none have since it was
>removed from dm-cache 4.5 years ago.
>Can we discard num_write_bios??
> 
> b/ If any target sets any of num_{flush,discard,write_same,write_zeroes}_bios
>to a value > 1, then __send_duplicate_bios() will also make multiple
>calls to alloc_tio() and __map_bio().
>Some do.
>  dm-cache-target:  flush=2
>  dm-snap: flush=2
>  dm-stripe: discard, write_same, write_zeroes all set to 'stripes'.
> 
> These will only be a problem if the second (or subsequent) alloc_tio()
> blocks waiting for an earlier allocation to complete.  This will only
> be a problem if multiple threads are each trying to allocate multiple
> dm_target_io from the same bioset at the same time.
> This is rare and should be easier to address than the current
> dm_offload() approach. 
> One possibility would be to copy the approach taken by
> crypt_alloc_buffer() which needs to allocate multiple entries from a
> mempool.
> It first tries with GFP_NOWAIT.  If that fails it takes a mutex and
> tri
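
For reference, a minimal sketch of the allocation pattern referred to above:
try each allocation with GFP_NOWAIT first and, only if that fails, take a
mutex so that at most one task at a time performs blocking allocations from
the shared pool. The helper below is illustrative only (invented name,
generic mempool) and is not the dm-crypt or DM core code; the real
crypt_alloc_buffer() holds its mutex across all remaining allocations rather
than per element.

/* illustrative sketch only -- not actual kernel code */
static int alloc_n_from_pool(mempool_t *pool, struct mutex *serialize,
                             void **out, unsigned int n)
{
        unsigned int i;

        for (i = 0; i < n; i++) {
                /* fast path: never sleeps, fails instead of waiting */
                out[i] = mempool_alloc(pool, GFP_NOWAIT);
                if (out[i])
                        continue;

                /*
                 * Slow path: serialize blocking allocations, so that at
                 * most one task at a time can sleep on the pool while
                 * holding a partial allocation.  That prevents several
                 * tasks from exhausting the pool with partial allocations
                 * and then all waiting on each other.
                 */
                mutex_lock(serialize);
                out[i] = mempool_alloc(pool, GFP_NOIO);
                mutex_unlock(serialize);
                if (!out[i])            /* NOIO mempool_alloc() rarely fails */
                        goto err;
        }
        return 0;

err:
        while (i--)
                mempool_free(out[i], pool);
        return -ENOMEM;
}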

[PATCH] block: remove useless assignment in bio_split

2017-11-22 Thread Mikulas Patocka
Remove useless assignment to the variable "split" because the variable is
unconditionally assigned later.

Signed-off-by: Mikulas Patocka 

---
 block/bio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/bio.c
===================================================================
--- linux-2.6.orig/block/bio.c
+++ linux-2.6/block/bio.c
@@ -1873,7 +1873,7 @@ EXPORT_SYMBOL(bio_endio);
 struct bio *bio_split(struct bio *bio, int sectors,
  gfp_t gfp, struct bio_set *bs)
 {
-   struct bio *split = NULL;
+   struct bio *split;
 
BUG_ON(sectors <= 0);
BUG_ON(sectors >= bio_sectors(bio));


Re: [PATCH] bcache: set task state correctly in allocator_wait()

2017-11-22 Thread Michael Lyle
Hey everyone--

On 11/22/2017 06:10 AM, Hannes Reinecke wrote:
> _Actually_ there is a push to remove all kthreads in the kernel, as they
> don't play nice together with RT.
> So while you're at it, do you think it'd be possible to convert it to a
> workqueue? Sebastian will be happy to help you here, right, Sebastian?

I don't see a reason why moving this away from a thread would be
advantageous.  There's neither devastating complexity in the current
form nor a need for additional concurrency (at least right now).

As it turns out, we're broken for RT for other reasons, but that has to
do with our use of closures and releasing locks in different threads
from where they're allocated.  The natural answer is completions, but
that's not easily done with the current structure of the code.

This current code is neither a problem for RT nor for
performance/legibility on mainline.

Mike


Re: [PATCH] block: drain queue before waiting for q_usage_counter becoming zero

2017-11-22 Thread Bart Van Assche
On Wed, 2017-11-22 at 13:11 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 11097477eeab..3d3797327491 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -161,6 +161,8 @@ void blk_freeze_queue(struct request_queue *q)
>* exported to drivers as the only user for unfreeze is blk_mq.
>*/
>   blk_freeze_queue_start(q);
> + if (!q->mq_ops)
> + blk_drain_queue(q);
>   blk_mq_freeze_queue_wait(q);
>  }

Since q_usage_counter now tracks legacy requests, is there any reason why we
still need __blk_drain_queue()? Have you considered eliminating
__blk_drain_queue() and calling blk_run_queue() from inside blk_freeze_queue()
instead of calling blk_drain_queue()? I'm asking this because
blk_mq_freeze_queue_wait() uses a more efficient mechanism (wait_event())
than __blk_drain_queue() (while (...) msleep(10)).

Bart.
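
For illustration, the two waiting styles contrasted above, in minimal generic
form (not the actual block-layer code; 'pending' and the wait queue are
stand-ins for q_usage_counter and q->mq_freeze_wq):

#include <linux/wait.h>
#include <linux/delay.h>
#include <linux/atomic.h>

static atomic_t pending;                        /* outstanding requests */
static DECLARE_WAIT_QUEUE_HEAD(drain_wq);

/* polling style, as in __blk_drain_queue(): re-check every 10 ms */
static void drain_by_polling(void)
{
        while (atomic_read(&pending) != 0)
                msleep(10);
}

/* event style, as in blk_mq_freeze_queue_wait(): sleep until woken */
static void drain_by_event(void)
{
        wait_event(drain_wq, atomic_read(&pending) == 0);
}

/* the completion side pairs with the event style */
static void request_done(void)
{
        if (atomic_dec_and_test(&pending))
                wake_up_all(&drain_wq);
}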

Re: [PATCH] bcache: set task state correctly in allocator_wait()

2017-11-22 Thread Coly Li
On 22/11/2017 10:55 PM, Sebastian Andrzej Siewior wrote:
> On 2017-11-22 15:10:51 [+0100], Hannes Reinecke wrote:
>> On 11/22/2017 01:33 PM, Coly Li wrote:
>>> Kthread function bch_allocator_thread() references allocator_wait(ca, cond)
>>> and when kthread_should_stop() is true, this kthread exits.
>>>
>>> The problem is, if kthread_should_stop() is true, macro allocator_wait()
>>> calls "return 0" with the current task state set to TASK_INTERRUPTIBLE. After
>>> bch_allocator_thread() returns to do_exit(), some blocking operations are
>>> called, and then a kernel warning is popped up by __might_sleep from
>>> kernel/sched/core.c,
>>>   "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at []"
>>>
>>> If the task is interrupted and preempted out, since its status is
>>> TASK_INTERRUPTIBLE, the scheduler may never pick it to run again, and the
>>> allocator thread may hang in do_exit().
>>>
>>> This patch sets allocator kthread state back to TASK_RUNNING before it
>>> returns to do_exit(), which avoids a potential deadlock.
>>>
>>> Signed-off-by: Coly Li 
>>> Cc: sta...@vger.kernel.org
>>> ---
>>>  drivers/md/bcache/alloc.c | 5 -
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>>> index a27d85232ce1..996ebbabd819 100644
>>> --- a/drivers/md/bcache/alloc.c
>>> +++ b/drivers/md/bcache/alloc.c
>>> @@ -286,9 +286,12 @@ do {  \
>>> if (cond)   \
>>> break;  \
>>> \
>>> +   \
>>> mutex_unlock(&(ca)->set->bucket_lock);  \
>>> -   if (kthread_should_stop())  \
>>> +   if (kthread_should_stop()) {\
>>> +   __set_current_state(TASK_RUNNING);  \
>>> return 0;   \
>>> +   }   \
>>> \
>>> schedule(); \
>>> mutex_lock(&(ca)->set->bucket_lock);\
>>>
>> _Actually_ there is a push to remove all kthreads in the kernel, as they
>> don't play nice together with RT.
> 
> with RT? If RT as in PREEMPT-RT then this is news to me. The reason why
> I removed the per-CPU kthreads in the scsi driver(s) was because it was
> nonsense in regards to CPU-hotplug and workqueue infrastructure is way
> nicer for that. Not to mention that it made the code simpler.
> 
>> So while you're at it, do you think it'd be possible to convert it to a
>> workqueue? Sebastian will be happy to help you here, right, Sebastian?
> If commit 4b9bc86d5a99 ("fcoe: convert to kworker") does not explain it, I
> can try to assist.

Hi Hannes and Sebastian,

Thanks for the informative input. I see the point of converting from
kthread to per-CPU kworker. Bucket allocation is not a hot enough code
path to deserve a per-CPU workqueue; a per-cache-device workqueue is
enough, as in the other places where kworkers are used in the bcache code.
Bcache used to have circular dependency issues on kworker queues; unless I
attach the work to a new and separate workqueue, there is a potential to
introduce new circular locking on the global workqueues.

It is better to make fewer modifications for now; later, after Michael
finishes all the locking cleanup, we can come back to the kthread-to-kworker
conversion.

Thanks.

Coly Li
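
As a rough illustration of the point about using a dedicated workqueue rather
than a shared/global one, here is a hypothetical self-requeueing work item on
its own WQ_MEM_RECLAIM queue. All names are invented; this is not a proposed
bcache change:

#include <linux/workqueue.h>

struct alloc_ctx {
        struct workqueue_struct *wq;    /* dedicated queue: no lock
                                         * dependencies shared with the
                                         * global system workqueues */
        struct work_struct work;
        bool stop;
};

static void alloc_work_fn(struct work_struct *work)
{
        struct alloc_ctx *ctx = container_of(work, struct alloc_ctx, work);

        /* ... do one batch of (hypothetical) bucket allocation here ... */

        if (!READ_ONCE(ctx->stop))
                queue_work(ctx->wq, &ctx->work);  /* re-arm instead of looping */
}

static int alloc_ctx_start(struct alloc_ctx *ctx)
{
        /* WQ_MEM_RECLAIM because the allocator sits in the IO path */
        ctx->wq = alloc_workqueue("example_alloc", WQ_MEM_RECLAIM, 1);
        if (!ctx->wq)
                return -ENOMEM;
        ctx->stop = false;
        INIT_WORK(&ctx->work, alloc_work_fn);
        queue_work(ctx->wq, &ctx->work);
        return 0;
}

static void alloc_ctx_stop(struct alloc_ctx *ctx)
{
        WRITE_ONCE(ctx->stop, true);
        cancel_work_sync(&ctx->work);   /* handles self-requeueing works */
        destroy_workqueue(ctx->wq);
}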


Re: [PATCH] bcache: set task state correctly in allocator_wait()

2017-11-22 Thread Sebastian Andrzej Siewior
On 2017-11-22 15:10:51 [+0100], Hannes Reinecke wrote:
> On 11/22/2017 01:33 PM, Coly Li wrote:
> > Kthread function bch_allocator_thread() references allocator_wait(ca, cond)
> > and when kthread_should_stop() is true, this kthread exits.
> > 
> > The problem is, if kthread_should_stop() is true, macro allocator_wait()
> > calls "return 0" with the current task state set to TASK_INTERRUPTIBLE. After
> > bch_allocator_thread() returns to do_exit(), some blocking operations are
> > called, and then a kernel warning is popped up by __might_sleep from
> > kernel/sched/core.c,
> >   "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at []"
> > 
> > If the task is interrupted and preempted out, since its status is
> > TASK_INTERRUPTIBLE, the scheduler may never pick it to run again, and the
> > allocator thread may hang in do_exit().
> > 
> > This patch sets allocator kthread state back to TASK_RUNNING before it
> > returns to do_exit(), which avoids a potential deadlock.
> > 
> > Signed-off-by: Coly Li 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/md/bcache/alloc.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> > index a27d85232ce1..996ebbabd819 100644
> > --- a/drivers/md/bcache/alloc.c
> > +++ b/drivers/md/bcache/alloc.c
> > @@ -286,9 +286,12 @@ do {  \
> > if (cond)   \
> > break;  \
> > \
> > +   \
> > mutex_unlock(&(ca)->set->bucket_lock);  \
> > -   if (kthread_should_stop())  \
> > +   if (kthread_should_stop()) {\
> > +   __set_current_state(TASK_RUNNING);  \
> > return 0;   \
> > +   }   \
> > \
> > schedule(); \
> > mutex_lock(&(ca)->set->bucket_lock);\
> > 
> _Actually_ there is a push to remove all kthreads in the kernel, as they
> don't play nice together with RT.

with RT? If RT as in PREEMPT-RT then this is news to me. The reason why
I removed the per-CPU kthreads in the scsi driver(s) was because it was
nonsense in regards to CPU-hotplug and workqueue infrastructure is way
nicer for that. Not to mention that it made the code simpler.

> So while you're at it, do you think it'd be possible to convert it to a
> workqueue? Sebastian will be happy to help you here, right, Sebastian?
If commit 4b9bc86d5a99 ("fcoe: convert to kworker") does not explain it, I
can try to assist.

> Cheers,
> 
> Hannes

Sebastian


Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)

2017-11-22 Thread Jens Axboe
On 11/22/2017 12:28 AM, Christoph Hellwig wrote:
> Jens, please don't just revert the commit in your for-linus tree.
> 
> On its own this will totally mess up the interrupt assignments.  Give
> me a bit of time to sort this out properly.

I wasn't going to push it until I heard otherwise. I'll just pop it
off, for-linus isn't a stable branch.

-- 
Jens Axboe



Re: [PATCH V14 07/24] mmc: block: Use data timeout in card_busy_detect()

2017-11-22 Thread Ulf Hansson
On 22 November 2017 at 08:40, Adrian Hunter  wrote:
> On 21/11/17 17:39, Ulf Hansson wrote:
>> On 21 November 2017 at 14:42, Adrian Hunter  wrote:
>>> card_busy_detect() has a 10 minute timeout. However the correct timeout is
>>> the data timeout. Change card_busy_detect() to use the data timeout.
>>
>> Unfortunately, I don't think there is a "correct" timeout for this case.
>>
>> The data->timeout_ns indicates to the host the maximum time it's
>> allowed to take between blocks that are written to the data lines.
>>
>> I haven't found a definition of the busy timeout, after the data write
>> has completed. The spec only mentions that the device moves to
>> programming state and pulls DAT0 to indicate busy.
>
> To me it reads more like the timeout is for each block, including the last,
> i.e. the same timeout for "busy".  Note the card is also busy between blocks.

I don't think that is the same timeout. Or maybe it is.

In the eMMC 5.1 spec, there are mentions about "buffer busy signal"
and "programming busy signal", see section 6.15.3 (Timings - Data
Write).

Anyway, whether any of them is specified, is to me unclear.

>
> Equally it is the timeout we give the host controller.  So either the host
> controller does not have a timeout for "busy" - which begs the question why
> it has a timeout at all - or it invents its own "busy" timeout - which begs
> the question why it isn't in the spec.

Well, there are some vague hints in section 6.8.2 (Time-out
conditions), but I don't find these timeout values being referred to
in 6.15 (Timings). And that puzzles me.

Moreover, the below is quoted from section 6.6.8.1 (Block write):
--
Some Devices may require long and unpredictable times to write a block
of data. After receiving a block of data and completing the CRC check,
the Device will begin writing and hold the DAT0 line low. The host may
poll the status of the Device with a SEND_STATUS command (CMD13) at
any time, and the Device will respond with its status (except in Sleep
state). The status bit READY_FOR_DATA indicates whether the Device can
accept new data or not. The host may deselect the Device by issuing
CMD7 that will then displace the Device into the Disconnect State and
release the DAT0 line without interrupting the write operation. When
reselecting the Device, it will reactivate busy indication by pulling
DAT0 to low. See 6.15 for details of busy indication.
--

>
>>
>> Sure, 10 min seems crazy, perhaps something along the lines of 10-20 s
>> is more reasonable. What do you think?
>
> We give SD cards a generous 3 seconds for writes.  SDHCI has long had a 10
> second software timer for the whole request, which strongly suggests that
> requests have always completed within 10 seconds.  So that puts the range of
> an arbitrary timeout 3-10 s.

From the reasoning above, I guess we could try out 10 s. That is at
least a lot better than 10 minutes.

I also see that we have three different places (switch, erase,
block data transfers) implementing busy-signal detection. It would be
nice to try to align those pieces of code, because they are quite
similar. Of course, this deserves its own separate task to fix up.

BTW, perhaps we should move this to a separate change on top of the
series? Or is there a specific need for this in regards to blkmq and
CQE?

Kind regards
Uffe
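
To make the timeout discussion concrete, a hypothetical busy-poll loop
bounded by an explicit deadline. The CMD13 issuer is passed in as a callback
because this sketch is not the real card_busy_detect(), and the 10 s default
is the arbitrary value discussed above, not a spec-defined limit:

#include <linux/jiffies.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/mmc/card.h>
#include <linux/mmc/mmc.h>

/* illustrative sketch only */
static int busy_wait_example(struct mmc_card *card, unsigned int timeout_ms,
                             int (*send_status)(struct mmc_card *, u32 *))
{
        unsigned long deadline = jiffies +
                        msecs_to_jiffies(timeout_ms ? timeout_ms : 10000);
        u32 status;
        int err;

        do {
                err = send_status(card, &status);       /* issues CMD13 */
                if (err)
                        return err;

                if ((status & R1_READY_FOR_DATA) &&
                    R1_CURRENT_STATE(status) != R1_STATE_PRG)
                        return 0;               /* programming finished */

                if (time_after(jiffies, deadline))
                        return -ETIMEDOUT;

                usleep_range(500, 1000);        /* don't hammer the bus */
        } while (1);
}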


Re: [PATCH] bcache: set task state correctly in allocator_wait()

2017-11-22 Thread Hannes Reinecke
On 11/22/2017 01:33 PM, Coly Li wrote:
> Kthread function bch_allocator_thread() references allocator_wait(ca, cond)
> and when kthread_should_stop() is true, this kthread exits.
> 
> The problem is, if kthread_should_stop() is true, macro allocator_wait()
> calls "return 0" with the current task state set to TASK_INTERRUPTIBLE. After
> bch_allocator_thread() returns to do_exit(), some blocking operations are
> called, and then a kernel warning is popped up by __might_sleep from
> kernel/sched/core.c,
>   "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at []"
> 
> If the task is interrupted and preempted out, since its status is
> TASK_INTERRUPTIBLE, the scheduler may never pick it to run again, and the
> allocator thread may hang in do_exit().
> 
> This patch sets allocator kthread state back to TASK_RUNNING before it
> returns to do_exit(), which avoids a potential deadlock.
> 
> Signed-off-by: Coly Li 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/md/bcache/alloc.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> index a27d85232ce1..996ebbabd819 100644
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -286,9 +286,12 @@ do {  \
>   if (cond)   \
>   break;  \
>   \
> + \
>   mutex_unlock(&(ca)->set->bucket_lock);  \
> - if (kthread_should_stop())  \
> + if (kthread_should_stop()) {\
> + __set_current_state(TASK_RUNNING);  \
>   return 0;   \
> + }   \
>   \
>   schedule(); \
>   mutex_lock(&(ca)->set->bucket_lock);\
> 
_Actually_ there is a push to remove all kthreads in the kernel, as they
don't play nice together with RT.
So while you're at it, do you think it'd be possible to convert it to a
workqueue? Sebastian will be happy to help you here, right, Sebastian?

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


[PATCH] bcache: set task state correctly in allocator_wait()

2017-11-22 Thread Coly Li
Kthread function bch_allocator_thread() references allocator_wait(ca, cond)
and when kthread_should_stop() is true, this kthread exits.

The problem is, if kthread_should_stop() is true, macro allocator_wait()
calls "return 0" with the current task state set to TASK_INTERRUPTIBLE. After
bch_allocator_thread() returns to do_exit(), some blocking operations are
called, and then a kernel warning is popped up by __might_sleep from
kernel/sched/core.c,
  "WARNING: do not call blocking ops when !TASK_RUNNING; state=1 set at []"

If the task is interrupted and preempted out, since its status is
TASK_INTERRUPTIBLE, the scheduler may never pick it to run again, and the
allocator thread may hang in do_exit().

This patch sets allocator kthread state back to TASK_RUNNING before it
returns to do_exit(), which avoids a potential deadlock.

Signed-off-by: Coly Li 
Cc: sta...@vger.kernel.org
---
 drivers/md/bcache/alloc.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index a27d85232ce1..996ebbabd819 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -286,9 +286,12 @@ do {  \
if (cond)   \
break;  \
\
+   \
mutex_unlock(&(ca)->set->bucket_lock);  \
-   if (kthread_should_stop())  \
+   if (kthread_should_stop()) {\
+   __set_current_state(TASK_RUNNING);  \
return 0;   \
+   }   \
\
schedule(); \
mutex_lock(&(ca)->set->bucket_lock);\
-- 
2.13.6
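
For context, the canonical kthread wait loop that the fix above brings
allocator_wait() in line with: whenever the loop is left (to do work or to
stop), the task state must first be put back to TASK_RUNNING. This is a
generic sketch with hypothetical work_available()/do_work() helpers, not
bcache code:

#include <linux/kthread.h>
#include <linux/sched.h>

static int example_thread(void *data)
{
        while (1) {
                set_current_state(TASK_INTERRUPTIBLE);

                if (kthread_should_stop()) {
                        /* never return (or call anything that might
                         * sleep) while still TASK_INTERRUPTIBLE */
                        __set_current_state(TASK_RUNNING);
                        break;
                }

                if (work_available()) {         /* hypothetical condition */
                        __set_current_state(TASK_RUNNING);
                        do_work();              /* hypothetical work */
                        continue;
                }

                schedule();                     /* sleep until woken */
        }
        return 0;
}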



[PATCH v5] Return bytes transferred for partial direct I/O

2017-11-22 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

In case direct I/O encounters an error midway, it returns the error.
Instead it should be returning the number of bytes transferred so far.

Test case for filesystems (with ENOSPC):
1. Create an almost full filesystem
2. Create a file, say /mnt/lastfile, until the filesystem is full.
3. Direct write() with count > sizeof /mnt/lastfile.

Result: write() returns -ENOSPC. However, file content has data written
in step 3.

Changes since v1:
 - incorporated iomap and block devices

Changes since v2:
 - realized that file size was not increasing when performing a (partial)
   direct I/O because end_io function was receiving the error instead of
   size. Fixed.

Changes since v3:
 - [hch] initialize transferred with dio->size and use transferred instead
   of dio->size.

Changes since v4:
 - Refreshed to v4.14

Signed-off-by: Goldwyn Rodrigues 
Reviewed-by: Christoph Hellwig 
---
 fs/block_dev.c |  2 +-
 fs/direct-io.c |  4 +---
 fs/iomap.c | 20 ++--
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 789f55e851ae..4d3e4603f687 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -421,7 +421,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
 
if (!ret)
ret = blk_status_to_errno(dio->bio.bi_status);
-   if (likely(!ret))
+   if (likely(dio->size))
ret = dio->size;
 
bio_put(&dio->bio);
diff --git a/fs/direct-io.c b/fs/direct-io.c
index b53e66d9abd7..a8d2710f4ee9 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -262,8 +262,6 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
ret = dio->page_errors;
if (ret == 0)
ret = dio->io_error;
-   if (ret == 0)
-   ret = transferred;
 
if (dio->end_io) {
// XXX: ki_pos??
@@ -310,7 +308,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
}
 
kmem_cache_free(dio_cache, dio);
-   return ret;
+   return transferred ? transferred : ret;
 }
 
 static void dio_aio_complete_work(struct work_struct *work)
diff --git a/fs/iomap.c b/fs/iomap.c
index d4801f8dd4fd..6e37acba578d 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -715,23 +715,23 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
struct kiocb *iocb = dio->iocb;
struct inode *inode = file_inode(iocb->ki_filp);
loff_t offset = iocb->ki_pos;
-   ssize_t ret;
+   ssize_t err;
+   ssize_t transferred = dio->size;
 
if (dio->end_io) {
-   ret = dio->end_io(iocb,
-   dio->error ? dio->error : dio->size,
+   err = dio->end_io(iocb,
+   transferred ? transferred : dio->error,
dio->flags);
} else {
-   ret = dio->error;
+   err = dio->error;
}
 
-   if (likely(!ret)) {
-   ret = dio->size;
+   if (likely(transferred)) {
/* check for short read */
-   if (offset + ret > dio->i_size &&
+   if (offset + transferred > dio->i_size &&
!(dio->flags & IOMAP_DIO_WRITE))
-   ret = dio->i_size - offset;
-   iocb->ki_pos += ret;
+   transferred = dio->i_size - offset;
+   iocb->ki_pos += transferred;
}
 
/*
@@ -758,7 +758,7 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
inode_dio_end(file_inode(iocb->ki_filp));
kfree(dio);
 
-   return ret;
+   return transferred ? transferred : err;
 }
 
 static void iomap_dio_complete_work(struct work_struct *work)
-- 
2.14.2
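
A userspace sketch of the test case from the changelog: after filling the
filesystem, an O_DIRECT write larger than the remaining space should report
the bytes actually written rather than -ENOSPC. Paths and sizes are
hypothetical; O_DIRECT requires an aligned buffer:

/* build: cc -D_GNU_SOURCE -o dio_partial dio_partial.c */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/lastfile";     /* on an almost-full fs */
        size_t len = 64 << 20;                  /* more than the space left */
        void *buf;
        ssize_t ret;
        int fd;

        if (posix_memalign(&buf, 4096, len))    /* O_DIRECT needs alignment */
                return 1;
        memset(buf, 0xab, len);

        fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        ret = write(fd, buf, len);
        if (ret < 0)
                printf("write failed: %s\n", strerror(errno));
        else
                printf("short write: %zd of %zu bytes\n", ret, len);
        /* with the patch, a mid-way ENOSPC reports the bytes transferred */

        close(fd);
        free(buf);
        return 0;
}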



Re: [bug report] bcache stuck when writing journal

2017-11-22 Thread Rui Hua
Hi, Junhui,

I have met a similar problem once.
It looks like a deadlock between the cache device register thread and
bcache_allocator thread.

The trace info tells us the journal is full; probably the allocator
thread waits on bch_prio_write()->prio_io()->bch_journal_meta(), but
there are no RESERVE_BTREE buckets to use for journal replay at this
time, so the register thread waits on
bch_journal_replay()->bch_btree_insert()

The path which your register command possibly blocked:
run_cache_set()
  -> bch_journal_replay()
  -> bch_btree_insert()
  -> btree_insert_fn()
  -> bch_btree_insert_node()
  -> btree_split()
  -> btree_check_reserve() here we find
RESERVE_BTREE buckets is empty, and then schedule out...

bch_allocator_thread()
  ->bch_prio_write()
 ->bch_journal_meta()


You can apply this patch to your code and try to register again. This
is for your reference only, because this patch was not verified in my
environment (my env was damaged last time, before I could dig into the
code and write this patch). I hope it can resolve your problem :-)


Signed-off-by: Hua Rui 
---
 drivers/md/bcache/btree.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 11c5503..211be35 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1868,14 +1868,16 @@ void bch_initial_gc_finish(struct cache_set *c)
 */
for_each_cache(ca, c, i) {
for_each_bucket(b, ca) {
-   if (fifo_full(&ca->free[RESERVE_PRIO]))
+   if (fifo_full(&ca->free[RESERVE_PRIO]) &&
+   fifo_full(&ca->free[RESERVE_BTREE]))
break;

if (bch_can_invalidate_bucket(ca, b) &&
!GC_MARK(b)) {
__bch_invalidate_one_bucket(ca, b);
-   fifo_push(&ca->free[RESERVE_PRIO],
- b - ca->buckets);
+   if (!fifo_push(&ca->free[RESERVE_PRIO], b - ca->buckets))
+   fifo_push(&ca->free[RESERVE_BTREE],
+   b - ca->buckets);
}
}
}
-- 
1.8.3.1

2017-11-22 16:49 GMT+08:00  :
> From: Tang Junhui 
>
> Hi, everyone:
>
> bcache got stuck when rebooting the system after high load.
>
> root  1704  3.7  0.0   4164   360 ?D14:07   0:09 
> /usr/lib/udev/bcache-register /dev/sdc
> [] closure_sync+0x25/0x90 [bcache]
> [] bch_btree_set_root+0x1f1/0x250 [bcache]
> [] btree_split+0x632/0x760 [bcache]
> [] bch_btree_insert_recurse+0x3db/0x500 [bcache]
> [] bch_btree_insert+0x167/0x360 [bcache]
> [] bch_journal_replay+0x1aa/0x2e0 [bcache]
> [] run_cache_set+0x813/0x83e [bcache]
> [] register_bcache+0xea3/0x1410 [bcache]
> [] kobj_attr_store+0xf/0x20
> [] sysfs_write_file+0xc6/0x140
> [] vfs_write+0xbd/0x1e0
> [] SyS_write+0x58/0xb0
> [] system_call_fastpath+0x16/0x1b
>
> root  2097  0.0  0.0  0 0 ?D14:08   0:00 
> [bcache_allocato]
> [] closure_sync+0x25/0x90 [bcache]
> [] bch_prio_write+0x23e/0x340 [bcache]
> [] bch_allocator_thread+0x340/0x350 [bcache]
> [] kthread+0xcf/0xe0
> [] ret_from_fork+0x58/0x90
>
> I tried to add some debug info to the code; it seems that it always runs in
> journal_write_unlocked()
> else if (journal_full(&c->journal)) {
> journal_reclaim(c);
> spin_unlock(&c->journal.lock);
>
> btree_flush_write(c);
> continue_at(cl, journal_write, system_wq);
> return;
> }
> the condition journal_full() always returns true, so the journal
> can never finish.
>
> My code differs a little from the upstream branch.
> Could anyone give me some suggestions?
>
> Thanks,
> Tang
>
>


[bug report] bcache stuck when writing journal

2017-11-22 Thread tang . junhui
From: Tang Junhui 

Hi, everyone:

bcache got stuck when rebooting the system after high load.

root  1704  3.7  0.0   4164   360 ?D14:07   0:09 
/usr/lib/udev/bcache-register /dev/sdc
[] closure_sync+0x25/0x90 [bcache]
[] bch_btree_set_root+0x1f1/0x250 [bcache]
[] btree_split+0x632/0x760 [bcache]
[] bch_btree_insert_recurse+0x3db/0x500 [bcache]
[] bch_btree_insert+0x167/0x360 [bcache]
[] bch_journal_replay+0x1aa/0x2e0 [bcache]
[] run_cache_set+0x813/0x83e [bcache]
[] register_bcache+0xea3/0x1410 [bcache]
[] kobj_attr_store+0xf/0x20
[] sysfs_write_file+0xc6/0x140
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x58/0xb0
[] system_call_fastpath+0x16/0x1b

root  2097  0.0  0.0  0 0 ?D14:08   0:00 
[bcache_allocato]
[] closure_sync+0x25/0x90 [bcache]
[] bch_prio_write+0x23e/0x340 [bcache]
[] bch_allocator_thread+0x340/0x350 [bcache]
[] kthread+0xcf/0xe0
[] ret_from_fork+0x58/0x90

I tried to add some debug info to the code; it seems that it always runs in
journal_write_unlocked()
else if (journal_full(&c->journal)) {
journal_reclaim(c);
spin_unlock(&c->journal.lock);

btree_flush_write(c);
continue_at(cl, journal_write, system_wq);
return;
}
the condition journal_full() always returns true, so the journal
can never finish.

My code differs a little from the upstream branch.
Could anyone give me some suggestions?

Thanks,
Tang




Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-22 Thread Mariusz Dabrowski

On 11/17/2017 07:40 PM, Shaohua Li wrote:

On Wed, Nov 15, 2017 at 11:25:12AM +0100, Mariusz Dabrowski wrote:

Hi all,

In order to be compliant with pass-through drive behavior, RAID queue
limits should be calculated in a way that minimal IO, optimal IO and
discard granularity sizes are met from a single-drive perspective.
Currently the MD driver ignores the queue limits reported by members and all
calculations are based on the chunk (stripe) size.


We did use blk_stack_limits, which will care about these.


I am working on a patch which changes this. Since kernel 4.13, drives which
support streams (write hints) might report a discard_alignment or
discard_granularity bigger than the array stripe (for more details see the
NVMe 1.3 specification, chapter 9.3 Streams), so the result of the current
calculation is not optimal for those drives. For example, for a 4-disk RAID 0
with a chunk size smaller than the discard_granularity of the member drives,
the discard_granularity of the RAID array should be equal to
4 * member_discard_granularity.

The problem is that for some drives there is a risk that the result of this
calculation exceeds the unsigned int range. The current implementation of
nvme_config_discard() in the NVMe driver can also suffer from the same problem:

if (ctrl->nr_streams && ns->sws && ns->sgs) {
unsigned int sz = logical_block_size * ns->sws * ns->sgs;

ns->queue->limits.discard_alignment = sz;
ns->queue->limits.discard_granularity = sz;
}

One potential solution is to change the type of some queue limits
(at least discard_granularity and discard_alignment, maybe more, like
max_discard_sectors?) from 32 bits to 64 bits.
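
As a minimal illustration of the overflow concern (a hedged sketch, not
proposed kernel code), the multiplication can be done in 64 bits and clamped
before it is stored into today's 32-bit limit fields:

/* illustrative only; lbs/sws/sgs mirror the snippet above */
static unsigned int stream_granularity(unsigned int lbs, unsigned int sws,
                                       unsigned int sgs)
{
        u64 sz = (u64)lbs * sws * sgs;

        return sz > UINT_MAX ? UINT_MAX : sz;
}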

What are your thoughts about this? Do you expect any trouble with
changing this in the block layer? Would you be willing to make such a change?

Signed-off-by: Mariusz Dabrowski 
---

 drivers/md/md.c | 24 
 drivers/md/md.h |  1 +
 drivers/md/raid0.c  | 23 ---
 drivers/md/raid10.c | 30 +-
 drivers/md/raid5.c  | 25 +++--
 5 files changed, 89 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..e0dc114cab19 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include "md.h"
@@ -679,6 +680,29 @@ struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, 
int nr)
 }
 EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);

+void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits *limits)
+{
+   struct md_rdev *rdev;
+
+   memset(limits, 0, sizeof(struct queue_limits));
+
+   rdev_for_each(rdev, mddev) {
+   struct queue_limits *rdev_limits;
+
+   rdev_limits = &rdev->bdev->bd_queue->limits;
+   limits->io_min = max(limits->io_min, rdev_limits->io_min);
+   limits->io_opt = lcm_not_zero(limits->io_opt,
+ rdev_limits->io_opt);
+   limits->discard_granularity =
+   max(limits->discard_granularity,
+   rdev_limits->discard_granularity);
+   limits->discard_alignment =
+   max(limits->discard_alignment,
+   rdev_limits->discard_alignment);
+   }
+}


isn't this exactly what blk_stack_limits does?


Yes, but the limits of the member devices have to be known before calling
blk_stack_limits() to calculate the values for the array.



+EXPORT_SYMBOL_GPL(md_rdev_get_queue_limits);
+
 static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
 {
struct md_rdev *rdev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8287d3cd1bf..5b514b9bb535 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -702,6 +702,7 @@ extern void md_reload_sb(struct mddev *mddev, int 
raid_disk);
 extern void md_update_sb(struct mddev *mddev, int force);
 extern void md_kick_rdev_from_array(struct md_rdev * rdev);
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
+void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits 
*limits);

 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5a00fc118470..f1dcf7fd3eb1 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -382,15 +382,31 @@ static int raid0_run(struct mddev *mddev)
if (mddev->queue) {
struct md_rdev *rdev;
bool discard_supported = false;
+   struct queue_limits limits;
+   unsigned int io_min, io_opt;
+   unsigned int granularity, alignment;
+   unsigned int chunk_size = mddev->chunk_sectors << 9;

+   md_rdev_get_queue_limits(mddev, &limits);
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
blk_queue_max_write_same_sectors(mddev->queue, 
mddev->chunk_secto

Re: [PATCH] block: drain queue before waiting for q_usage_counter becoming zero

2017-11-22 Thread Ming Lei
On Wed, Nov 22, 2017 at 08:04:13AM +0100, Hannes Reinecke wrote:
> On 11/22/2017 06:11 AM, Ming Lei wrote:
> > Now we track legacy requests with .q_usage_counter, since commit 055f6e18e08f
> > ("block: Make q_usage_counter also track legacy requests"), but that
> > commit never runs and drains the legacy queue before waiting for this counter
> > to become zero, so an IO hang is caused in the test of pulling a disk during IO.
> > 
> > This patch fixes the issue by draining requests before waiting for
> > q_usage_counter becoming zero.
> > 
> > Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
> > Cc: Wen Xiong 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-core.c | 9 +++--
> >  block/blk-mq.c   | 2 ++
> >  block/blk.h  | 2 ++
> >  3 files changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 1038706edd87..5ae5ea0dca4c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -562,6 +562,13 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
> > }
> >  }
> >  
> > +void blk_drain_queue(struct request_queue *q)
> > +{
> > +   spin_lock_irq(q->queue_lock);
> > +   __blk_drain_queue(q, true);
> > +   spin_unlock_irq(q->queue_lock);
> > +}
> > +
> >  /**
> >   * blk_queue_bypass_start - enter queue bypass mode
> >   * @q: queue of interest
> > @@ -689,8 +696,6 @@ void blk_cleanup_queue(struct request_queue *q)
> >  */
> > blk_freeze_queue(q);
> > spin_lock_irq(lock);
> > -   if (!q->mq_ops)
> > -   __blk_drain_queue(q, true);
> > queue_flag_set(QUEUE_FLAG_DEAD, q);
> > spin_unlock_irq(lock);
> >  
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 11097477eeab..3d3797327491 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -161,6 +161,8 @@ void blk_freeze_queue(struct request_queue *q)
> >  * exported to drivers as the only user for unfreeze is blk_mq.
> >  */
> > blk_freeze_queue_start(q);
> > +   if (!q->mq_ops)
> > +   blk_drain_queue(q);
> > blk_mq_freeze_queue_wait(q);
> >  }
> >  
> > diff --git a/block/blk.h b/block/blk.h
> > index 3f1446937aec..442098aa9463 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -330,4 +330,6 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
> >  }
> >  #endif /* CONFIG_BOUNCE */
> >  
> > +extern void blk_drain_queue(struct request_queue *q);
> > +
> >  #endif /* BLK_INTERNAL_H */
> > 
> I seem to have missed something.
> Is blk_drain_queue() ever used?

It is always called in blk_cleanup_queue().

Before 055f6e18e08f ("block: Make q_usage_counter also track legacy requests"),
q_usage_counter only worked for sync IO (such as DAX), because we only held
the refcount while calling q->make_request_fn(), so blk_drain_queue()
always did the real draining.

Thanks,
Ming
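
For reference, the refcount scoping described above, as a generic percpu_ref
sketch (illustrative only, not the block-layer code): before the commit the
legacy path effectively held q_usage_counter only across the
->make_request_fn() call, while blk-mq holds it for the whole life of a
request.

#include <linux/percpu-refcount.h>
#include <linux/errno.h>

/* "short" scoping: ref held only across the submission call */
static int submit_with_short_ref(struct percpu_ref *ref,
                                 void (*make_request)(void *), void *arg)
{
        if (!percpu_ref_tryget_live(ref))
                return -ENODEV;         /* queue dying or frozen */

        make_request(arg);              /* ref covers only this call */

        percpu_ref_put(ref);            /* dropped even though the request
                                         * may still be in flight */
        return 0;
}

/* "long" scoping: the put happens at request completion instead */
static void complete_with_long_ref(struct percpu_ref *ref)
{
        percpu_ref_put(ref);            /* blk-mq style: ref pinned until
                                         * the request completes */
}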