Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 14:20 -0500, Mike Snitzer wrote:
> Yes, I noticed the use of sysfs_lock also.  I've fixed it up earlier in
> my v4 patchset and built on it.  I'll add your Reported-by too.

Hello Mike,

There are many more inconsistencies with regard to queue flag manipulation
in the block layer core and in block drivers. I'm working on a series to
address all inconsistencies in the block layer and in the sd driver.

Bart.

Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 12:29 -0500, Mike Snitzer wrote:
> On Thu, Jan 11 2018 at 12:18pm -0500,
> Bart Van Assche <bart.vanass...@wdc.com> wrote:
> 
> > On Thu, 2018-01-11 at 12:04 -0500, Mike Snitzer wrote:
> > > So my v4, that I'll send out shortly, won't be using test_and_clear_bit()
> > 
> > Please use queue_flag_set(), queue_flag_clear(), 
> > queue_flag_test_and_clear() and/or
> > queue_flag_test_and_set() to manipulate queue flags.
> 
> Can you please expand on this?  My patch is only using test_bit().

Hello Mike,

I was referring to the following code, which apparently is existing code:

mutex_lock(&q->sysfs_lock);
queue_flag_clear_unlocked(QUEUE_FLAG_REGISTERED, q);
mutex_unlock(&q->sysfs_lock);

The above code is wrong. Other code that changes the queue flags protects
these changes with the queue lock. The above code should be changed into
the following:

spin_lock_irq(q->queue_lock);
queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
spin_unlock_irq(q->queue_lock);

The only functions from which it is safe to call
queue_flag_(set|clear)_unlocked() without holding the queue lock are
blk_alloc_queue_node() and __blk_release_queue(), because for these functions
it is guaranteed that no queue flag changes can happen from another context,
e.g. through a blk_set_queue_dying() call.

Bart.

Re: [PATCH v2 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 08:39 +0100, Hannes Reinecke wrote:
> I'm actually not that convinced with this change; originally we had been
> checking if it's on the wait list, and only _then_ call bt_wait_ptr().
> Now we call bt_wait_ptr() always, meaning we run a chance of increasing
> the bitmap_tags wait pointer without actually using it.
> Looking at the code I'm not sure this is the correct way of using it ...

Hello Hannes,

Are you perhaps referring to sbq_index_atomic_inc()? I think it's fine to
call bt_wait_ptr() even if the corresponding wait queue won't be used. My
understanding is that the purpose of having multiple wait queues, used in a
round-robin fashion, is to ensure that when up to 8 (SBQ_WAIT_QUEUES) threads
are waiting for a tag they do not all end up on the same wait queue. So in
the unlikely event of an early return, the worst that can happen is some
extra contention on one of the eight wait queues. Since that is an unlikely
scenario I don't think we have to worry about it.

Bart.

Re: [for-4.16 PATCH v3 2/3] block: allow gendisk's request_queue registration to be deferred

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 12:04 -0500, Mike Snitzer wrote:
> So my v4, that I'll send out shortly, won't be using test_and_clear_bit()

Please use queue_flag_set(), queue_flag_clear(), queue_flag_test_and_clear()
and/or queue_flag_test_and_set() to manipulate queue flags.

Thanks,

Bart.

Re: scsi: memory leak in sg_start_req

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 01:04 -0500, Douglas Gilbert wrote:
> Monitoring that program with 'free' from another terminal I see
> about 2.5 GBytes of ram "swallowed" almost immediately when the test
> program runs. When the program exits (about 50 seconds later) as far
> as I can see all that ram is given back.

Hello Doug,

There is probably something that's leaking memory in one of the functions in
the reported call stack. kmemleak reports the following in my tests (I haven't
had the time yet to analyze this further):

unreferenced object 0x880363c09088 (size 192): 
 comm "multipath", pid 20088, jiffies 4295034706 (age 6544.980s) 
 hex dump (first 32 bytes): 
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00   
   00 00 00 00 20 00 00 00 12 01 00 00 00 00 00 00   ... 
 backtrace: 
   [] kmemleak_alloc+0x4a/0xa0 
   [] __kmalloc+0x16e/0x320 
   [] bio_alloc_bioset+0x198/0x1f0 
   [] bio_map_user_iov+0x129/0x3f0 
   [] blk_rq_map_user_iov+0x123/0x210 
   [] blk_rq_map_user+0x52/0x60 
   [] sg_io+0x385/0x3c0 
   [] scsi_cmd_ioctl+0x2e3/0x450 
   [] scsi_cmd_blk_ioctl+0x42/0x50 
   [] sd_ioctl+0x85/0x110 
   [] blkdev_ioctl+0x4db/0x970 
   [] block_ioctl+0x3d/0x50 
   [] do_vfs_ioctl+0x94/0x670 
   [] SyS_ioctl+0x41/0x70 
   [] do_syscall_64+0x5c/0x110 
   [] return_from_SYSCALL_64+0x0/0x7a

Bart.

[PATCH v3 0/2] Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
Hello Jens,

This patch series reworks the blk_mq_mark_tag_wait() implementation and also
fixes a race condition in that function. Please consider these two patches for
kernel v4.16.

Thanks,

Bart.

Changes compared to v2:
- Reworked patch 1/2 such that it uses if (...) ...; ... instead of
  if (...) ... else ... as proposed by Jens.

Changes compared to v1:
- Split a single patch into two patches to make reviewing easier.
- The race fix no longer uses prepare_to_wait() / finish_wait().

Bart Van Assche (2):
  blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
  blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

 block/blk-mq.c | 72 +-
 1 file changed, 36 insertions(+), 36 deletions(-)

-- 
2.15.1



[PATCH v3 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 69 +-
 1 file changed, 35 insertions(+), 34 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1d705e25852e..f27bcb6f6751 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1181,58 +1181,59 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
                                  struct request *rq)
 {
         struct blk_mq_hw_ctx *this_hctx = *hctx;
-        bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
         struct sbq_wait_state *ws;
         wait_queue_entry_t *wait;
         bool ret;
 
-        if (!shared_tags) {
+        if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
                 if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
                         set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
-        } else {
-                wait = &this_hctx->dispatch_wait;
-                if (!list_empty_careful(&wait->entry))
-                        return false;
 
-                spin_lock(&this_hctx->lock);
-                if (!list_empty(&wait->entry)) {
-                        spin_unlock(&this_hctx->lock);
-                        return false;
-                }
+                /*
+                 * It's possible that a tag was freed in the window between the
+                 * allocation failure and adding the hardware queue to the wait
+                 * queue.
+                 *
+                 * Don't clear RESTART here, someone else could have set it.
+                 * At most this will cost an extra queue run.
+                 */
+                return blk_mq_get_driver_tag(rq, hctx, false);
+        }
+
+        wait = &this_hctx->dispatch_wait;
+        if (!list_empty_careful(&wait->entry))
+                return false;
 
-                ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
-                add_wait_queue(&ws->wait, wait);
+        spin_lock(&this_hctx->lock);
+        if (!list_empty(&wait->entry)) {
+                spin_unlock(&this_hctx->lock);
+                return false;
         }
 
+        ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+        add_wait_queue(&ws->wait, wait);
+
         /*
          * It's possible that a tag was freed in the window between the
          * allocation failure and adding the hardware queue to the wait
          * queue.
          */
         ret = blk_mq_get_driver_tag(rq, hctx, false);
-
-        if (!shared_tags) {
-                /*
-                 * Don't clear RESTART here, someone else could have set it.
-                 * At most this will cost an extra queue run.
-                 */
-                return ret;
-        } else {
-                if (!ret) {
-                        spin_unlock(&this_hctx->lock);
-                        return false;
-                }
-
-                /*
-                 * We got a tag, remove ourselves from the wait queue to ensure
-                 * someone else gets the wakeup.
-                 */
-                spin_lock_irq(&ws->wait.lock);
-                list_del_init(&wait->entry);
-                spin_unlock_irq(&ws->wait.lock);
+        if (!ret) {
                 spin_unlock(&this_hctx->lock);
-                return true;
+                return false;
         }
+
+        /*
+         * We got a tag, remove ourselves from the wait queue to ensure
+         * someone else gets the wakeup.
+         */
+        spin_lock_irq(&ws->wait.lock);
+        list_del_init(&wait->entry);
+        spin_unlock_irq(&ws->wait.lock);
+        spin_unlock(&this_hctx->lock);
+
+        return true;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1



[PATCH v3 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
Both add_wait_queue() and blk_mq_dispatch_wake() protect wait queue
manipulations with the wait queue lock. Hence also protect the
!list_empty(&wait->entry) test with the wait queue lock.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f27bcb6f6751..bbadbd5c8003 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1183,7 +1183,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
         struct blk_mq_hw_ctx *this_hctx = *hctx;
         struct sbq_wait_state *ws;
         wait_queue_entry_t *wait;
-        bool ret;
+        bool on_wait_list, ret;
 
         if (!(this_hctx->flags & BLK_MQ_F_TAG_SHARED)) {
                 if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
@@ -1204,13 +1204,15 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
         if (!list_empty_careful(&wait->entry))
                 return false;
 
-        spin_lock(&this_hctx->lock);
-        if (!list_empty(&wait->entry)) {
-                spin_unlock(&this_hctx->lock);
+        ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+
+        spin_lock_irq(&ws->wait.lock);
+        on_wait_list = !list_empty(&wait->entry);
+        spin_unlock_irq(&ws->wait.lock);
+
+        if (on_wait_list)
                 return false;
-        }
 
-        ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
         add_wait_queue(&ws->wait, wait);
 
         /*
@@ -1219,10 +1221,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
          * queue.
          */
         ret = blk_mq_get_driver_tag(rq, hctx, false);
-        if (!ret) {
-                spin_unlock(&this_hctx->lock);
+        if (!ret)
                 return false;
-        }
 
         /*
          * We got a tag, remove ourselves from the wait queue to ensure
@@ -1231,7 +1231,6 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
         spin_lock_irq(&ws->wait.lock);
         list_del_init(&wait->entry);
         spin_unlock_irq(&ws->wait.lock);
-        spin_unlock(&this_hctx->lock);
 
         return true;
 }
-- 
2.15.1



Re: [PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
On Wed, 2018-01-10 at 13:30 -0700, Jens Axboe wrote:
> On 1/10/18 12:39 PM, Bart Van Assche wrote:
> > This patch does not change any functionality but makes the
> > blk_mq_mark_tag_wait() code slightly easier to read.
> 
> I agree it could do with a cleanup, but how about something like the
> below? I think that's easier to read.
> 
> [ ... ]

Hello Jens,

That sounds like a good idea to me. I will update the patch series, retest
it and repost.

Bart.

[PATCH v2 0/2] Rework blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
Hello Jens,

This patch series reworks the blk_mq_mark_tag_wait() implementation and also
fixes a race condition in that function. Please consider these two patches for
kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Split a single patch into two patches to make reviewing easier.
- The race fix no longer uses prepare_to_wait() / finish_wait().

Bart Van Assche (2):
  blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()
  blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

 block/blk-mq.c | 49 -
 1 file changed, 24 insertions(+), 25 deletions(-)

-- 
2.15.1



[PATCH v2 1/2] blk-mq: Reduce the number of if-statements in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
This patch does not change any functionality but makes the
blk_mq_mark_tag_wait() code slightly easier to read.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1d705e25852e..e770e8814f60 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1189,6 +1189,16 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
         if (!shared_tags) {
                 if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
                         set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
+                /*
+                 * It's possible that a tag was freed in the window between the
+                 * allocation failure and adding the hardware queue to the wait
+                 * queue.
+                 */
+                ret = blk_mq_get_driver_tag(rq, hctx, false);
+                /*
+                 * Don't clear RESTART here, someone else could have set it.
+                 * At most this will cost an extra queue run.
+                 */
         } else {
                 wait = &this_hctx->dispatch_wait;
                 if (!list_empty_careful(&wait->entry))
@@ -1202,22 +1212,12 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
 
                 ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
                 add_wait_queue(&ws->wait, wait);
-        }
-
-        /*
-         * It's possible that a tag was freed in the window between the
-         * allocation failure and adding the hardware queue to the wait
-         * queue.
-         */
-        ret = blk_mq_get_driver_tag(rq, hctx, false);
-
-        if (!shared_tags) {
                 /*
-                 * Don't clear RESTART here, someone else could have set it.
-                 * At most this will cost an extra queue run.
+                 * It's possible that a tag was freed in the window between the
+                 * allocation failure and adding the hardware queue to the wait
+                 * queue.
                  */
-                return ret;
-        } else {
+                ret = blk_mq_get_driver_tag(rq, hctx, false);
                 if (!ret) {
                         spin_unlock(&this_hctx->lock);
                         return false;
@@ -1231,8 +1231,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
                 list_del_init(&wait->entry);
                 spin_unlock_irq(&ws->wait.lock);
                 spin_unlock(&this_hctx->lock);
-                return true;
         }
+        return ret;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
-- 
2.15.1



[PATCH v2 2/2] blk-mq: Fix a race condition in blk_mq_mark_tag_wait()

2018-01-10 Thread Bart Van Assche
Both add_wait_queue() and blk_mq_dispatch_wake() protect wait queue
manipulations with the wait queue lock. Hence also protect the
!list_empty(&wait->entry) test with the wait queue lock instead of
the hctx lock.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e770e8814f60..d5313ce60836 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1184,7 +1184,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
         bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
         struct sbq_wait_state *ws;
         wait_queue_entry_t *wait;
-        bool ret;
+        bool on_wait_list, ret;
 
         if (!shared_tags) {
                 if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
@@ -1204,13 +1204,15 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
                 if (!list_empty_careful(&wait->entry))
                         return false;
 
-                spin_lock(&this_hctx->lock);
-                if (!list_empty(&wait->entry)) {
-                        spin_unlock(&this_hctx->lock);
+                ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+
+                spin_lock_irq(&ws->wait.lock);
+                on_wait_list = !list_empty(&wait->entry);
+                spin_unlock_irq(&ws->wait.lock);
+
+                if (on_wait_list)
                         return false;
-                }
 
-                ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
                 add_wait_queue(&ws->wait, wait);
                 /*
                  * It's possible that a tag was freed in the window between the
@@ -1218,10 +1220,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
                  * queue.
                  */
                 ret = blk_mq_get_driver_tag(rq, hctx, false);
-                if (!ret) {
-                        spin_unlock(&this_hctx->lock);
+                if (!ret)
                         return false;
-                }
 
                 /*
                  * We got a tag, remove ourselves from the wait queue to ensure
@@ -1230,7 +1230,6 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
                 spin_lock_irq(&ws->wait.lock);
                 list_del_init(&wait->entry);
                 spin_unlock_irq(&ws->wait.lock);
-                spin_unlock(&this_hctx->lock);
         }
         return ret;
 }
-- 
2.15.1



[PATCH] blk-mq: Add locking annotations to hctx_lock() and hctx_unlock()

2018-01-10 Thread Bart Van Assche
This patch avoids that sparse reports the following:

block/blk-mq.c:637:33: warning: context imbalance in 'hctx_unlock' - unexpected unlock
block/blk-mq.c:642:9: warning: context imbalance in 'hctx_lock' - wrong count at exit

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Tejun Heo <t...@kernel.org>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8139d88e78f6..1d705e25852e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,6 +630,7 @@ static void __blk_mq_complete_request(struct request *rq)
 }
 
 static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
+   __releases(hctx->srcu)
 {
if (!(hctx->flags & BLK_MQ_F_BLOCKING))
rcu_read_unlock();
@@ -638,6 +639,7 @@ static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
 }
 
 static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
+   __acquires(hctx->srcu)
 {
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
/* shut up gcc false positive */
-- 
2.15.1



Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche
On Wed, 2018-01-10 at 11:35 -0700, Jens Axboe wrote:
> On 1/10/18 11:33 AM, Bart Van Assche wrote:
> > On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
> > > On 1/10/18 11:29 AM, Bart Van Assche wrote:
> > > > On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> > > > > @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, 
> > > > > struct request *rq)
> > > > >   seq_puts(m, ", .rq_flags=");
> > > > >   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> > > > >  ARRAY_SIZE(rqf_name));
> > > > > - seq_puts(m, ", .atomic_flags=");
> > > > > - blk_flags_show(m, rq->atomic_flags, rqaf_name, 
> > > > > ARRAY_SIZE(rqaf_name));
> > > > >   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
> > > > >  rq->internal_tag);
> > > > >   if (mq_ops->show_rq)
> > > > 
> > > > Whether or not a request has been marked complete is very useful to 
> > > > know. Have you
> > > > considered to show the return value of blk_rq_is_complete() in the 
> > > > debugfs output?
> > > 
> > > Yeah, that's a good point. Let me add that in lieu of the atomic flags 
> > > that
> > > are being killed. Are you fine with the patch then?
> > 
> > The rest of the patch looks fine to me. This is the only comment I had 
> > about this patch.
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=blk-kill-atomic-flags&id=3d34009082f5e72137d6bb38cbc2ff6047f03fd1
> 
> Added a complete= entry for it.

For that patch: Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>

Thanks,

Bart.

Re: [PATCH 4/4] block: rearrange a few request fields for better cache layout

2018-01-10 Thread Bart Van Assche
On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> Move completion related items (like the call single data) near the
> end of the struct, instead of mixing them in with the initial
> queueing related fields.
> 
> Move queuelist below the bio structures. Then we have all
> queueing related bits in the first cache line.
> 
> This yields a 1.5-2% increase in IOPS for a null_blk test, both for
> sync and for high thread count access. Sync test goes from 975K to
> 992K, 32-thread case from 20.8M to 21.2M IOPS.

That's a nice result!

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>



Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche
On Wed, 2018-01-10 at 11:32 -0700, Jens Axboe wrote:
> On 1/10/18 11:29 AM, Bart Van Assche wrote:
> > On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> > > @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, 
> > > struct request *rq)
> > >   seq_puts(m, ", .rq_flags=");
> > >   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
> > >  ARRAY_SIZE(rqf_name));
> > > - seq_puts(m, ", .atomic_flags=");
> > > - blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
> > >   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
> > >  rq->internal_tag);
> > >   if (mq_ops->show_rq)
> > 
> > Whether or not a request has been marked complete is very useful to know. 
> > Have you
> > considered to show the return value of blk_rq_is_complete() in the debugfs 
> > output?
> 
> Yeah, that's a good point. Let me add that in lieu of the atomic flags that
> are being killed. Are you fine with the patch then?

The rest of the patch looks fine to me. This is the only comment I had about 
this patch.

Bart.

Re: [PATCH 3/4] block: convert REQ_ATOM_COMPLETE to stealing rq->__deadline bit

2018-01-10 Thread Bart Van Assche
On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> @@ -313,8 +307,6 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct 
> request *rq)
>   seq_puts(m, ", .rq_flags=");
>   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>  ARRAY_SIZE(rqf_name));
> - seq_puts(m, ", .atomic_flags=");
> - blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
>   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
>  rq->internal_tag);
>   if (mq_ops->show_rq)

Whether or not a request has been marked complete is very useful to know.
Have you considered showing the return value of blk_rq_is_complete() in the
debugfs output?

Thanks,

Bart.

Re: [PATCH 2/4] block: add accessors for setting/querying request deadline

2018-01-10 Thread Bart Van Assche
On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> We reduce the resolution of request expiry, but since we're already
> using jiffies for this where resolution depends on the kernel
> configuration and since the timeout resolution is coarse anyway,
> that should be fine.

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>


Re: [PATCH 1/4] block: remove REQ_ATOM_POLL_SLEPT

2018-01-10 Thread Bart Van Assche
On Tue, 2018-01-09 at 17:29 -0700, Jens Axboe wrote:
> We don't need this to be an atomic flag, it can be a regular
> flag. We either end up on the same CPU for the polling, in which
> case the state is sane, or we did the sleep which would imply
> the needed barrier to ensure we see the right state.

Please consider updating rqf_name[] in blk-mq-debugfs.c. Anyway:

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>



[PATCH v2 2/4] block: Introduce blk_start_wait_if_quiesced() and blk_finish_wait_if_quiesced()

2018-01-10 Thread Bart Van Assche
Introduce functions that allow block drivers to wait while a request
queue is in the quiesced state (blk-mq) or in the stopped state (legacy
block layer). The next patch will add calls to these functions in the
SCSI core.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Ming Lei <ming@redhat.com>
---
 block/blk-core.c   |  1 +
 block/blk-mq.c | 64 ++
 include/linux/blk-mq.h |  2 ++
 3 files changed, 67 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index c10b4ce95248..06eaea15bae9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -287,6 +287,7 @@ void blk_start_queue(struct request_queue *q)
WARN_ON_ONCE(q->mq_ops);
 
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
+   wake_up_all(&q->mq_wq);
__blk_run_queue(q);
 }
 EXPORT_SYMBOL(blk_start_queue);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a05ea7e9b415..87455977ad34 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -247,11 +247,75 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
spin_unlock_irqrestore(q->queue_lock, flags);
 
+   wake_up_all(&q->mq_wq);
+
/* dispatch requests which are inserted during quiescing */
blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
 
+/**
+ * blk_start_wait_if_quiesced() - wait if a queue is quiesced (blk-mq) or stopped (legacy block layer)
+ * @q: Request queue pointer.
+ *
+ * Some block drivers, e.g. the SCSI core, can bypass the block layer core
+ * request submission mechanism. Surround such code with
+ * blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced() to avoid that
+ * request submission can happen while a queue is quiesced or stopped.
+ *
+ * Returns with the RCU read lock held (blk-mq) or with q->queue_lock held
+ * (legacy block layer).
+ *
+ * Notes:
+ * - Every call of this function must be followed by a call of
+ *   blk_finish_wait_if_quiesced().
+ * - This function does not support block drivers whose .queue_rq()
+ *   implementation can sleep (BLK_MQ_F_BLOCKING).
+ */
+int blk_start_wait_if_quiesced(struct request_queue *q)
+{
+   struct blk_mq_hw_ctx *hctx;
+   unsigned int i;
+
+   might_sleep();
+
+   if (q->mq_ops) {
+   queue_for_each_hw_ctx(q, hctx, i)
+   WARN_ON(hctx->flags & BLK_MQ_F_BLOCKING);
+
+   rcu_read_lock();
+   while (!blk_queue_dying(q) && blk_queue_quiesced(q)) {
+   rcu_read_unlock();
+   wait_event(q->mq_wq, blk_queue_dying(q) ||
+  !blk_queue_quiesced(q));
+   rcu_read_lock();
+   }
+   } else {
+   spin_lock_irq(q->queue_lock);
+   wait_event_lock_irq(q->mq_wq,
+   blk_queue_dying(q) || !blk_queue_stopped(q),
+   *q->queue_lock);
+   q->request_fn_active++;
+   }
+   return blk_queue_dying(q) ? -ENODEV : 0;
+}
+EXPORT_SYMBOL(blk_start_wait_if_quiesced);
+
+/**
+ * blk_finish_wait_if_quiesced() - counterpart of blk_start_wait_if_quiesced()
+ * @q: Request queue pointer.
+ */
+void blk_finish_wait_if_quiesced(struct request_queue *q)
+{
+   if (q->mq_ops) {
+   rcu_read_unlock();
+   } else {
+   q->request_fn_active--;
+   spin_unlock_irq(q->queue_lock);
+   }
+}
+EXPORT_SYMBOL(blk_finish_wait_if_quiesced);
+
 void blk_mq_wake_waiters(struct request_queue *q)
 {
struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..15912cd348b5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
 void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async);
 void blk_mq_quiesce_queue(struct request_queue *q);
 void blk_mq_unquiesce_queue(struct request_queue *q);
+int blk_start_wait_if_quiesced(struct request_queue *q);
+void blk_finish_wait_if_quiesced(struct request_queue *q);
 void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
 void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-- 
2.15.1



[PATCH v2 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device

2018-01-10 Thread Bart Van Assche
Several SCSI transport and LLD drivers surround code that does not
tolerate concurrent calls of .queuecommand() with scsi_target_block() /
scsi_target_unblock(). These last two functions use
blk_mq_quiesce_queue() / blk_mq_unquiesce_queue() for scsi-mq request
queues to prevent concurrent .queuecommand() calls. However, that is
not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().
Hence surround the .queuecommand() call from the SCSI error handler with
blk_start_wait_if_quiesced() / blk_finish_wait_if_quiesced().

Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
option since scsi_send_eh_cmnd() can be called if all requests are
allocated and if no requests will make progress without aborting any
of these requests.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Hannes Reinecke <h...@suse.de>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Ming Lei <ming@redhat.com>
---
 drivers/scsi/scsi_error.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 62b56de38ae8..f7154ea86715 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1016,6 +1016,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 {
struct scsi_device *sdev = scmd->device;
struct Scsi_Host *shost = sdev->host;
+   struct request_queue *q = sdev->request_queue;
DECLARE_COMPLETION_ONSTACK(done);
unsigned long timeleft = timeout;
struct scsi_eh_save ses;
@@ -1028,7 +1029,9 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 
scsi_log_send(scmd);
scmd->scsi_done = scsi_eh_done;
+   blk_start_wait_if_quiesced(q);
rtn = shost->hostt->queuecommand(shost, scmd);
+   blk_finish_wait_if_quiesced(q);
if (rtn) {
if (timeleft > stall_for) {
scsi_eh_restore_cmnd(scmd, &ses);
-- 
2.15.1



[PATCH v2 0/4] Make SCSI transport recovery more robust

2018-01-10 Thread Bart Van Assche
Hello Jens,

A longstanding issue with the SCSI core is that several SCSI transport drivers
use scsi_target_block() and scsi_target_unblock() to avoid concurrent
.queuecommand() calls during e.g. transport recovery, but this is not
sufficient to protect against such calls. Hence this patch series. An
additional benefit of this patch series is that it makes it possible to remove
an ugly hack from the SRP initiator driver. Please consider this patch series
for kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Renamed blk_wait_if_quiesced() into blk_start_wait_if_quiesced().
- Mentioned in the comment above blk_start_wait_if_quiesced() that every call
  of this function has to be followed by a call of
  blk_finish_wait_if_quiesced().

Bart Van Assche (4):
  blk-mq: Rename request_queue.mq_freeze_wq into mq_wq
  block: Introduce blk_start_wait_if_quiesced() and
blk_finish_wait_if_quiesced()
  scsi: Avoid that .queuecommand() gets called for a quiesced SCSI
device
  IB/srp: Fix a sleep-in-invalid-context bug

 block/blk-core.c| 11 +++---
 block/blk-mq.c  | 74 ++---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 +--
 drivers/scsi/scsi_error.c   |  3 ++
 include/linux/blk-mq.h  |  2 +
 include/linux/blkdev.h  |  2 +-
 6 files changed, 83 insertions(+), 30 deletions(-)

-- 
2.15.1



[PATCH v2 4/4] IB/srp: Fix a sleep-in-invalid-context bug

2018-01-10 Thread Bart Van Assche
The previous two patches guarantee that srp_queuecommand() does not get
invoked while reconnecting occurs. Hence remove the code from
srp_queuecommand() that prevents command queueing while reconnecting.
This patch avoids that the following can appear in the kernel log:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 5600, name: scsi_eh_9
1 lock held by scsi_eh_9/5600:
 #0:  (rcu_read_lock){}, at: [<cbb798c7>] __blk_mq_run_hw_queue+0xf1/0x1e0
Preemption disabled at:
[<139badf2>] __blk_mq_delay_run_hw_queue+0x78/0xf0
CPU: 9 PID: 5600 Comm: scsi_eh_9 Tainted: G W 4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Call Trace:
 dump_stack+0x67/0x99
 ___might_sleep+0x16a/0x250 [ib_srp]
 __mutex_lock+0x46/0x9d0
 srp_queuecommand+0x356/0x420 [ib_srp]
 scsi_dispatch_cmd+0xf6/0x3f0
 scsi_queue_rq+0x4a8/0x5f0
 blk_mq_dispatch_rq_list+0x73/0x440
 blk_mq_sched_dispatch_requests+0x109/0x1a0
 __blk_mq_run_hw_queue+0x131/0x1e0
 __blk_mq_delay_run_hw_queue+0x9a/0xf0
 blk_mq_run_hw_queue+0xc0/0x1e0
 blk_mq_start_hw_queues+0x2c/0x40
 scsi_run_queue+0x18e/0x2d0
 scsi_run_host_queues+0x22/0x40
 scsi_error_handler+0x18d/0x5f0
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Cc: Jason Gunthorpe <j...@mellanox.com>
Cc: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 972d4b3c5223..670f187ecb91 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2149,7 +2149,6 @@ static void srp_handle_qp_err(struct ib_cq *cq, struct 
ib_wc *wc,
 static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 {
struct srp_target_port *target = host_to_target(shost);
-   struct srp_rport *rport = target->rport;
struct srp_rdma_ch *ch;
struct srp_request *req;
struct srp_iu *iu;
@@ -2159,16 +2158,6 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
u32 tag;
u16 idx;
int len, ret;
-   const bool in_scsi_eh = !in_interrupt() && current == shost->ehandler;
-
-   /*
-* The SCSI EH thread is the only context from which srp_queuecommand()
-* can get invoked for blocked devices (SDEV_BLOCK /
-* SDEV_CREATED_BLOCK). Avoid racing with srp_reconnect_rport() by
-* locking the rport mutex if invoked from inside the SCSI EH.
-*/
-   if (in_scsi_eh)
-   mutex_lock(&rport->mutex);
 
scmnd->result = srp_chkready(target->rport);
if (unlikely(scmnd->result))
@@ -2230,13 +2219,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
goto err_unmap;
}
 
-   ret = 0;
-
-unlock_rport:
-   if (in_scsi_eh)
-   mutex_unlock(&rport->mutex);
-
-   return ret;
+   return 0;
 
 err_unmap:
srp_unmap_data(scmnd, ch, req);
@@ -2258,7 +2241,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
ret = SCSI_MLQUEUE_HOST_BUSY;
}
 
-   goto unlock_rport;
+   return ret;
 }
 
 /*
-- 
2.15.1



[PATCH v2 1/4] blk-mq: Rename request_queue.mq_freeze_wq into mq_wq

2018-01-10 Thread Bart Van Assche
Rename a waitqueue in struct request_queue since the next patch will
add code that uses this waitqueue outside the request queue freezing
implementation.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Hannes Reinecke <h...@suse.de>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Ming Lei <ming@redhat.com>
---
 block/blk-core.c       | 10 +-
 block/blk-mq.c         | 10 +-
 include/linux/blkdev.h |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fa1cb95f7f6a..c10b4ce95248 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -377,7 +377,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -648,7 +648,7 @@ void blk_set_queue_dying(struct request_queue *q)
}
 
/* Make blk_queue_enter() reexamine the DYING flag. */
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -851,7 +851,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
+   ret = wait_event_interruptible(q->mq_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
 (preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
@@ -872,7 +872,7 @@ static void blk_queue_usage_counter_release(struct 
percpu_ref *ref)
struct request_queue *q =
container_of(ref, struct request_queue, q_usage_counter);
 
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 
 static void blk_rq_timed_out_timer(struct timer_list *t)
@@ -948,7 +948,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-   init_waitqueue_head(&q->mq_freeze_wq);
+   init_waitqueue_head(&q->mq_wq);
 
/*
 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29f140b4dbf7..a05ea7e9b415 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -137,16 +137,16 @@ EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-   wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+   wait_event(q->mq_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
 {
-   return wait_event_timeout(q->mq_freeze_wq,
-   percpu_ref_is_zero(&q->q_usage_counter),
-   timeout);
+   return wait_event_timeout(q->mq_wq,
+ percpu_ref_is_zero(&q->q_usage_counter),
+ timeout);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
@@ -185,7 +185,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
percpu_ref_reinit(&q->q_usage_counter);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50fb1c18ec54..2c74c03a9d5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -638,7 +638,7 @@ struct request_queue {
struct throtl_data *td;
 #endif
struct rcu_head rcu_head;
-   wait_queue_head_t   mq_freeze_wq;
+   wait_queue_head_t   mq_wq;
struct percpu_ref   q_usage_counter;
struct list_headall_q_node;
 
-- 
2.15.1



[PATCH] blk-mq: Explain when 'active_queues' is decremented

2018-01-10 Thread Bart Van Assche
It is nontrivial to derive from the blk-mq source code when
blk_mq_tags.active_queues is decremented. Hence add a comment that
explains this.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9aa24c9508f9..266fc4f6b046 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -954,6 +954,12 @@ static void blk_mq_timeout_work(struct work_struct *work)
data.next = blk_rq_timeout(round_jiffies_up(data.next));
mod_timer(&q->timeout, data.next);
} else {
+   /*
+* Request timeouts are handled as a forward rolling timer. If
+* we end up here it means that no requests are pending and
+* also that no request has been pending for a while. Mark
+* each hctx as idle.
+*/
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
if (blk_mq_hw_queue_mapped(hctx))
-- 
2.15.1



Re: [PATCH] bdi: show error log when fail to create bdi debugfs entry

2018-01-10 Thread Bart Van Assche
On Wed, 2018-01-10 at 23:18 +0800, weiping zhang wrote:
> bdi debugfs dir/file may create fail, add error log here.
> 
> Signed-off-by: weiping zhang 
> ---
>  mm/backing-dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index b5f940c..9117c21 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -885,7 +885,8 @@ int bdi_register_va(struct backing_dev_info *bdi, const 
> char *fmt, va_list args)
>   cgwb_bdi_register(bdi);
>   bdi->dev = dev;
>  
> - bdi_debug_register(bdi, dev_name(dev));
> +  if (bdi_debug_register(bdi, dev_name(dev)))
> +  pr_warn("blkdev %s create bdi debugfs failed\n", 
> dev_name(dev));
>   set_bit(WB_registered, >wb.state);
>  
>   spin_lock_bh(&bdi_lock);

The indentation of the if-statement is inconsistent. Have you verified this
patch with checkpatch before submitting it?

Additionally, the error message is hard to understand. Please make it more
clear, e.g. as follows: "%s: creation of bdi debugfs entries failed.\n".

Thanks,

Bart.
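The suggested behavior can be sketched in a small userspace model: a debugfs registration failure is reported with the clearer message but is not treated as fatal, so bdi registration still completes. Everything below (the _model suffixes, the failure knob, the warning buffer) is an illustrative stand-in, not the kernel code itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int debugfs_should_fail;     /* test knob standing in for -ENOMEM etc. */
static char last_warning[128];

static int bdi_debug_register_model(const char *dev_name)
{
    (void)dev_name;
    return debugfs_should_fail ? -1 : 0;
}

/* Returns 1 once the bdi is registered, even if debugfs setup failed. */
static int bdi_register_model(const char *dev_name)
{
    if (bdi_debug_register_model(dev_name))
        snprintf(last_warning, sizeof(last_warning),
                 "%s: creation of bdi debugfs entries failed.\n", dev_name);
    /* ... the equivalent of set_bit(WB_registered, ...) would follow here ... */
    return 1;
}
```

The point of the shape is that the warning is emitted on the failure path while registration itself proceeds unconditionally.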

Re: [PATCH 2/4] block: Introduce blk_wait_if_quiesced() and blk_finish_wait_if_quiesced()

2018-01-09 Thread Bart Van Assche
On Tue, 2018-01-09 at 07:41 +0100, Hannes Reinecke wrote:
> I'm always a bit cautious when having rcu_read_lock() and
> rcu_read_unlock() in two separate functions.
> Can we make this dependency more explicit by renaming the first function
> to blk_start_wait_if_quiesced() and updating the comment to the effect
> that both functions must be used in tandem?

Hello Hannes,

That sounds like a good idea to me. I will make these changes and repost this
patch series.

Bart.
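The tandem usage Hannes asks for can be modelled in userspace: the start function takes a read-side lock that only the matching finish function releases, so every call of the first must be paired with a call of the second. The counter and the _model suffixes are assumptions for this sketch; the real patch uses rcu_read_lock()/rcu_read_unlock().

```c
#include <assert.h>

static int rcu_read_depth;  /* models the RCU read-side critical section */
static int queue_quiesced;  /* models blk_queue_quiesced(q) */

static void blk_start_wait_if_quiesced_model(void)
{
    rcu_read_depth++;           /* rcu_read_lock() in the real patch */
    while (queue_quiesced)      /* stand-in for blocking until the */
        queue_quiesced = 0;     /* queue has been unquiesced again */
}

static void blk_finish_wait_if_quiesced_model(void)
{
    rcu_read_depth--;           /* rcu_read_unlock() in the real patch */
}
```

Splitting the lock and unlock across two functions is exactly why the start/finish naming helps: the caller can see from the names alone that the calls must bracket the protected region.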

Re: [for-4.16 PATCH 1/2] block: cope with gendisk's 'queue' being added later

2018-01-09 Thread Bart Van Assche
On Tue, 2018-01-09 at 17:10 -0500, Mike Snitzer wrote:
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 870484eaed1f..0b0dda8e2420 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -919,8 +919,20 @@ int blk_register_queue(struct gendisk *disk)
>   ret = 0;
>  unlock:
>   mutex_unlock(&q->sysfs_lock);
> +
> + /*
> +  * Take an extra ref on queue which will be put on disk_release()
> +  * so that it sticks around as long as @disk is there.
> +  */
> + WARN_ON_ONCE(!blk_get_queue(q));
> +
> + WARN_ON(sysfs_create_link(&q->kobj,
> +   &q->backing_dev_info->dev->kobj,
> +   "bdi"));
> +
>   return ret;
>  }
> +EXPORT_SYMBOL_GPL(blk_register_queue);

Hello Mike,

So the sysfs_create_link() call is moved from register_disk() into
blk_register_queue() but the sysfs_remove_link() call stays in del_gendisk()?
Are you sure that you want this asymmetry?

Thanks,

Bart.

[PATCH] block: Fix kernel-doc warnings reported when building with W=1

2018-01-09 Thread Bart Van Assche
Commit 3a025e1d1c2e ("Add optional check for bad kernel-doc comments")
causes the kernel-doc script to be run when building with W=1 and thereby
causes several new warnings to appear. Fix the
block layer kernel-doc headers such that the block layer again builds
cleanly with W=1.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/bsg-lib.c    | 3 ++-
 block/scsi_ioctl.c | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 15d25ccd51a5..1474153f73e3 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -30,7 +30,7 @@
 
 /**
  * bsg_teardown_job - routine to teardown a bsg job
- * @job: bsg_job that is to be torn down
+ * @kref: kref inside bsg_job that is to be torn down
  */
 static void bsg_teardown_job(struct kref *kref)
 {
@@ -251,6 +251,7 @@ static void bsg_exit_rq(struct request_queue *q, struct 
request *req)
  * @name: device to give bsg device
  * @job_fn: bsg job handler
  * @dd_job_size: size of LLD data needed for each job
+ * @release: @dev release function
  */
 struct request_queue *bsg_setup_queue(struct device *dev, const char *name,
bsg_job_fn *job_fn, int dd_job_size,
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index edcfff974527..5cddff44a2f8 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -384,9 +384,10 @@ static int sg_io(struct request_queue *q, struct gendisk 
*bd_disk,
 
 /**
  * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
- * @file:  file this ioctl operates on (optional)
  * @q: request queue to send scsi commands down
  * @disk:  gendisk to operate on (option)
+ * @mode:  mode used to open the file through which the ioctl has been
+ * submitted
  * @sic:   userspace structure describing the command to perform
  *
  * Send down the scsi command described by @sic to the device below
@@ -415,10 +416,10 @@ static int sg_io(struct request_queue *q, struct gendisk 
*bd_disk,
  *  Positive numbers returned are the compacted SCSI error codes (4
  *  bytes in one int) where the lowest byte is the SCSI status.
  */
-#define OMAX_SB_LEN 16  /* For backward compatibility */
 int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
struct scsi_ioctl_command __user *sic)
 {
+   enum { OMAX_SB_LEN = 16 };  /* For backward compatibility */
struct request *rq;
struct scsi_request *req;
int err;
-- 
2.15.1
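For reference, the kernel-doc format these fixes target requires that the header start with the function name and that every @name: line match an actual parameter of the documented function. A minimal well-formed header (a made-up example, not code from the patch above):

```c
#include <assert.h>

/**
 * clamp_len - clamp a length to a maximum value
 * @len: length requested by the caller
 * @max: upper bound imposed by the device
 *
 * Return: @len if it does not exceed @max, otherwise @max.
 */
static int clamp_len(int len, int max)
{
    return len > max ? max : len;
}
```

A parameter documented under a stale name, or a documented parameter that no longer exists (as in the @file and @job fixes above), is exactly what scripts/kernel-doc warns about under W=1.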



[PATCH, resend] blk-mq: Fix spelling in a source code comment

2018-01-09 Thread Bart Van Assche
Change "nedeing" into "needing" and "caes" into "cases".

Fixes: commit f906a6a0f426 ("blk-mq: improve tag waiting setup for non-shared 
tags")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 34d14ab3e68e..fbb3dcf8342e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1104,8 +1104,8 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, 
unsigned mode,
 
 /*
  * Mark us waiting for a tag. For shared tags, this involves hooking us into
- * the tag wakeups. For non-shared tags, we can simply mark us nedeing a
- * restart. For both caes, take care to check the condition again after
+ * the tag wakeups. For non-shared tags, we can simply mark us needing a
+ * restart. For both cases, take care to check the condition again after
  * marking us as waiting.
  */
 static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
-- 
2.15.1



Re: [PATCH 2/8] blk-mq: protect completion path with RCU

2018-01-09 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> Currently, blk-mq protects only the issue path with RCU.  This patch
> puts the completion path under the same RCU protection.  This will be
> used to synchronize issue/completion against timeout by later patches,
> which will also add the comments.
> 
> Signed-off-by: Tejun Heo 
> ---
>  block/blk-mq.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ddc9261..6741c3e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
> *srcu_idx)
>  void blk_mq_complete_request(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> + int srcu_idx;
>  
>   if (unlikely(blk_should_fake_timeout(q)))
>   return;
> +
> + hctx_lock(hctx, &srcu_idx);
>   if (!blk_mark_rq_complete(rq))
>   __blk_mq_complete_request(rq);
> + hctx_unlock(hctx, srcu_idx);
>  }
>  EXPORT_SYMBOL(blk_mq_complete_request);

Hello Tejun,

I'm concerned about the additional CPU cycles needed for the new
blk_mq_map_queue() call, although I know this call is cheap. Would the
timeout code really get that much more complicated if the hctx_lock() and
hctx_unlock() calls would be changed into rcu_read_lock() and
rcu_read_unlock() calls? Would it be sufficient if "if (has_rcu)
synchronize_rcu();" would be changed into "synchronize_rcu();" in
blk_mq_timeout_work()?

Thanks,

Bart.

Re: [linux-next][qla2xxx][85caa95]kernel BUG at lib/list_debug.c:31!

2018-01-09 Thread Bart Van Assche
On Tue, 2018-01-09 at 14:44 +0530, Abdul Haleem wrote:
> Greeting's, 
> 
> Linux next kernel panics on powerpc when module qla2xxx is load/unload.
> 
> Machine Type: Power 8 PowerVM LPAR
> Kernel : 4.15.0-rc2-next-20171211
> gcc : version 4.8.5
> Test type: module load/unload few times
> 
> Trace messages:
> ---
> qla2xxx [:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 10.00.00.03-k.
> qla2xxx [0106:a0:00.0]-001a: : MSI-X vector count: 32.
> qla2xxx [0106:a0:00.0]-001d: : Found an ISP2532 irq 505 iobase 
> 0xaeb324e6.
> qla2xxx [0106:a0:00.0]-00c6:1: MSI-X: Failed to enable support with 32 
> vectors, using 16 vectors.
> qla2xxx [0106:a0:00.0]-00fb:1: QLogic QLE2562 - PCIe 2-port 8Gb FC Adapter.
> qla2xxx [0106:a0:00.0]-00fc:1: ISP2532: PCIe (5.0GT/s x8) @ 0106:a0:00.0 
> hdma- host#=1 fw=8.06.00 (90d5).
> qla2xxx [0106:a0:00.1]-001a: : MSI-X vector count: 32.
> qla2xxx [0106:a0:00.1]-001d: : Found an ISP2532 irq 506 iobase 
> 0xa46f1774.
> qla2xxx [0106:a0:00.1]-00c6:2: MSI-X: Failed to enable support with 32 
> vectors, using 16 vectors.
> 2xxx
> qla2xxx [0106:a0:00.1]-00fb:2: QLogic QLE2562 - PCIe 2-port 8Gb FC Adapter.
> qla2xxx [0106:a0:00.1]-00fc:2: ISP2532: PCIe (5.0GT/s x8) @ 0106:a0:00.1 
> hdma- host#=2 fw=8.06.00 (90d5).
> 0:00.0]-500a:1: LOOP UP detected (8 Gbps). 
> qla2xxx [0106:a0:00.1]-500a:2: LOOP UP detected (8 Gbps).
> list_add double add: new=8d33e594, prev=8d33e594, 
> next=adef1df4.
> [ cut here ]
> kernel BUG at lib/list_debug.c:31! 
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries 
> Dumping ftrace buffer: 
>(ftrace buffer empty)
> Modules linked in: qla2xxx(E) tg3(E) ibmveth(E) xt_CHECKSUM(E)
> iptable_mangle(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E)
> iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E)
> nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E)
> nf_reject_ipv4(E) tun(E) bridge(E) stp(E) llc(E) kvm_pr(E) kvm(E)
> sctp_diag(E) sctp(E) libcrc32c(E) tcp_diag(E) udp_diag(E)
> ebtable_filter(E) ebtables(E) dccp_diag(E) ip6table_filter(E) dccp(E)
> ip6_tables(E) iptable_filter(E) inet_diag(E) unix_diag(E)
> af_packet_diag(E) netlink_diag(E) xts(E) sg(E) vmx_crypto(E)
> pseries_rng(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E)
> sunrpc(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E)
> fscrypto(E) sd_mod(E) ibmvscsi(E) scsi_transport_srp(E) nvme_fc(E)
> nvme_fabrics(E) nvme_core(E) scsi_transport_fc(E)
>  ptp(E) pps_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
> [last unloaded: qla2xxx]
> CPU: 7 PID: 22230 Comm: qla2xxx_1_dpc Tainted: GE
> 4.15.0-rc2-next-20171211-autotest-autotest #1
> NIP:  c0511040 LR: c051103c CTR: 00655170
> REGS: 9b7356fa TRAP: 0700   Tainted: GE 
> (4.15.0-rc2-next-20171211-autotest-autotest)
> MSR:  80010282b033   CR: 2222  
> XER: 0009  
> CFAR: c0170594 SOFTE: 0 
> GPR00: c051103c c000fc293ac0 c10f1d00 0058 
> GPR04: c0028fcccdd0 c0028fce3798 8000374060b8  
> GPR08:  c0d435ec 00028ef9 2717 
> GPR12:  ce734980 c01215d8 c002886996c0 
> GPR16:  0020 c002813d83f8 0001 
> GPR20: 2000 2000 0002 c002813dc808 
> GPR24: 0003 0001 c0027f5a5c20 c002813dced0 
> GPR28: c0027f5a5d90 c0027f5a5d90 c0027f5a5c00 c002813dc7f8 
> NIP [c0511040] __list_add_valid+0x70/0xb0
> LR [c051103c] __list_add_valid+0x6c/0xb0
> Call Trace:
> [c000fc293ac0] [c051103c] __list_add_valid+0x6c/0xb0 (unreliable)
> [c000fc293b20] [d51f1a08] qla24xx_async_gnl+0x108/0x420 [qla2xxx]
> [c000fc293bc0] [d51e762c] qla2x00_do_work+0x18c/0x8c0 [qla2xxx]
> [c000fc293ce0] [d51e8180] qla2x00_relogin+0x420/0xff0 [qla2xxx]
> [c000fc293dc0] [c012172c] kthread+0x15c/0x1a0
> [c000fc293e30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
> Instruction dump:
> 41de0018 38210060 3861 e8010010 7c0803a6 4e800020 3c62ffae 7d445378 
> 38631748 7d254b78 4bc5f51d 6000 <0fe0> 3c62ffae 7cc43378 386316f8 
> ---[ end trace a41bc8bd434657f1 ]---
> 
> Kernel panic - not syncing: Fatal exception
> Dumping ftrace buffer: 
>(ftrace buffer empty)
> Rebooting in 10 seconds..
> 
> This trace back to the below code path:
> 
> # gdb -batch vmlinux -ex 'list *(0xc0511040)'
> 0xc0511040 is in __list_add_valid (lib/list_debug.c:29).
> 24"list_add corruption. next->prev should be prev 
> (%p), but was %p. (next=%p).\n",
> 25prev, next->prev, next) ||
> 26

Re: [PATCH 8/8] blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> The RCU protection has been expanded to cover both queueing and
> completion paths making ->queue_rq_srcu a misnomer.  Rename it to
> ->srcu as suggested by Bart.

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>



Re: [PATCH 7/8] blk-mq: remove REQ_ATOM_STARTED

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, we can easily replace REQ_ATOM_STARTED usages by
> adding an extra state to distinguish completed but not yet freed
> state.
> 
> Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
> blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
> left and is removed.

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>



Re: [PATCH 6/8] blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> After the recent updates to use generation number and state based
> synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
> to avoid firing the same timeout multiple times.
> 
> Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
> RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
> times.  This removes atomic bitops from hot paths too.
> 
> v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
> 
> v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.

I think it's unfortunate that this patch spreads the request state over two
struct request members, namely the lower two bits of gstate and the
RQF_MQ_TIMEOUT_EXPIRED flag in req->rq_flags. Please consider to drop the
RQF_MQ_TIMEOUT_EXPIRED flag and to represent the "timeout has been processed"
state as a MQ_RQ_* state in gstate. That will avoid that every state update
has to be reviewed whether or not perhaps an update of the
RQF_MQ_TIMEOUT_EXPIRED flag is missing.

Thanks,

Bart.

Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> @@ -230,6 +232,27 @@ struct request {
>  
>   unsigned short write_hint;
>  
> + /*
> +  * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +  * value and the upper bits the generation number which is
> +  * monotonically incremented and used to distinguish the reuse
> +  * instances.
> +  *
> +  * ->gstate_seq allows updates to ->gstate and other fields
> +  * (currently ->deadline) during request start to be read
> +  * atomically from the timeout path, so that it can operate on a
> +  * coherent set of information.
> +  */
> + seqcount_t gstate_seq;
> + u64 gstate;
> +
> + /*
> +  * ->aborted_gstate is used by the timeout to claim a specific
> +  * recycle instance of this request.  See blk_mq_timeout_work().
> +  */
> + struct u64_stats_sync aborted_gstate_sync;
> + u64 aborted_gstate;
> +
>   unsigned long deadline;
>   struct list_head timeout_list;

Does "gstate" perhaps stand for "generation number and state"? If so, please
mention this in one of the above comments.

Thanks,

Bart.
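The layout described in the quoted comment can be illustrated in plain C. The bit width and the state values below are assumptions for the sketch (the quoted hunk does not spell them out); the point is that bumping the generation leaves the state bits intact:

```c
#include <assert.h>
#include <stdint.h>

#define MQ_RQ_STATE_BITS  2u                            /* assumed width */
#define MQ_RQ_STATE_MASK  ((1u << MQ_RQ_STATE_BITS) - 1)
#define MQ_RQ_GEN_INC     (1u << MQ_RQ_STATE_BITS)

enum mq_rq_state { MQ_RQ_IDLE = 0, MQ_RQ_IN_FLIGHT = 1 };

static uint64_t gstate_set_state(uint64_t gstate, enum mq_rq_state state)
{
    /* replace the low state bits, keep the generation bits */
    return (gstate & ~(uint64_t)MQ_RQ_STATE_MASK) | state;
}

static enum mq_rq_state gstate_state(uint64_t gstate)
{
    return (enum mq_rq_state)(gstate & MQ_RQ_STATE_MASK);
}

static uint64_t gstate_new_generation(uint64_t gstate)
{
    /* adding MQ_RQ_GEN_INC increments the generation without
     * touching the state bits */
    return gstate + MQ_RQ_GEN_INC;
}
```

Packing both values into one 64-bit word is what lets the timeout path compare a snapshot against ->aborted_gstate and recognize a recycled request by its different generation.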

Re: [PATCH 5/8] blk-mq: make blk_abort_request() trigger timeout path

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> @@ -156,12 +156,12 @@ void blk_timeout_work(struct work_struct *work)
>   */
>  void blk_abort_request(struct request *req)
>  {
> - if (blk_mark_rq_complete(req))
> - return;
> -
>   if (req->q->mq_ops) {
> - blk_mq_rq_timed_out(req, false);
> + req->deadline = jiffies;
> + mod_timer(&req->q->timeout, 0);
>   } else {
> + if (blk_mark_rq_complete(req))
> + return;
>   blk_delete_timer(req);
>   blk_rq_timed_out(req);
>   }

Other req->deadline writes are protected by preempt_disable(),
write_seqcount_begin(&req->gstate_seq), write_seqcount_end(&req->gstate_seq)
and preempt_enable(). I think it's fine that the above req->deadline store
does not have that protection but I also think that that deserves a comment.

Thanks,

Bart.
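The seqcount protection referred to above can be modelled in userspace: the writer makes the sequence odd for the duration of the update, and the reader retries until it sees a stable even value. This single-threaded sketch mirrors write_seqcount_begin()/end() and read_seqcount_begin()/retry() in shape only; it is not the kernel's lockless implementation.

```c
#include <assert.h>

struct deadline_rec {
    unsigned int seq;       /* models req->gstate_seq */
    unsigned long deadline; /* models req->deadline */
};

static void deadline_write(struct deadline_rec *r, unsigned long value)
{
    r->seq++;               /* write_seqcount_begin(): seq becomes odd */
    r->deadline = value;
    r->seq++;               /* write_seqcount_end(): seq even again */
}

static unsigned long deadline_read(const struct deadline_rec *r)
{
    unsigned int start;
    unsigned long val;

    do {
        while ((start = r->seq) & 1)
            ;               /* writer in progress: wait for an even seq */
        val = r->deadline;
    } while (r->seq != start);  /* retry if a write raced with us */
    return val;
}
```

The unprotected store in blk_abort_request() skips the odd/even bracketing, which is why a comment explaining when that is safe seems warranted.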

Re: [PATCH 4/8] blk-mq: use blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> blk_mq_check_inflight() and blk_mq_poll_hybrid_sleep() test
> REQ_ATOM_COMPLETE to determine the request state.  Both uses are
> speculative and we can test REQ_ATOM_STARTED and blk_mq_rq_state() for
> equivalent results.  Replace the tests.  This will allow removing
> REQ_ATOM_COMPLETE usages from blk-mq.

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>



Re: [PATCH 3/8] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> +static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + u64_stats_update_begin(&rq->aborted_gstate_sync);
> + rq->aborted_gstate = gstate;
> + u64_stats_update_end(&rq->aborted_gstate_sync);
> + local_irq_restore(flags);
> +}

Please add a comment that explains the purpose of local_irq_save() and
local_irq_restore(). Please also explain why you chose to disable interrupts
instead of disabling preemption. I think that disabling preemption should be
sufficient since this is the only code that updates rq->aborted_gstate and
since this function is always called from thread context.

> @@ -801,6 +840,12 @@ void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> + /*
> +  * As nothing prevents from completion happening while
> +  * ->aborted_gstate is set, this may lead to ignored
> +  * completions and further spurious timeouts.
> +  */
> + blk_mq_rq_update_aborted_gstate(req, 0);
>   blk_add_timer(req);
>   blk_clear_rq_complete(req);
>   break;

Is the race that the comment refers to addressed by one of the later patches?

Thanks,

Bart.

Re: [PATCH 1/8] blk-mq: move hctx lock/unlock into a helper

2018-01-08 Thread Bart Van Assche
On Mon, 2018-01-08 at 11:15 -0800, Tejun Heo wrote:
> +static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> +{
> + if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> + rcu_read_unlock();
> + else
> + srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
> +}
> +
> +static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> +{
> + if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> + rcu_read_lock();
> + else
> + *srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> +}

A minor comment: please consider to reorder these two functions such that the
lock function appears first and the unlock function second. Anyway:

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>
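With the suggested reordering applied, the helper pair can be modelled in userspace as follows. The counters stand in for the RCU/SRCU read-side primitives, and the flag value is assumed for the sketch:

```c
#include <assert.h>

#define BLK_MQ_F_BLOCKING 0x20   /* value assumed for the sketch */

struct hctx_model {
    unsigned int flags;
    int rcu_readers;             /* models rcu_read_lock() nesting */
    int srcu_readers;            /* models srcu_read_lock() nesting */
};

static void hctx_lock_model(struct hctx_model *hctx, int *srcu_idx)
{
    if (!(hctx->flags & BLK_MQ_F_BLOCKING))
        hctx->rcu_readers++;                /* rcu_read_lock() */
    else
        *srcu_idx = hctx->srcu_readers++;   /* srcu_read_lock() */
}

static void hctx_unlock_model(struct hctx_model *hctx, int srcu_idx)
{
    (void)srcu_idx;                         /* only the SRCU path needs it */
    if (!(hctx->flags & BLK_MQ_F_BLOCKING))
        hctx->rcu_readers--;                /* rcu_read_unlock() */
    else
        hctx->srcu_readers--;               /* srcu_read_unlock() */
}
```

The srcu_idx out-parameter is the reason the two helpers must be used as a pair: the token returned by the lock side has to be handed back to the unlock side.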



[PATCH 4/4] IB/srp: Fix a sleep-in-invalid-context bug

2018-01-08 Thread Bart Van Assche
The previous two patches guarantee that srp_queuecommand() does not get
invoked while reconnecting occurs. Hence remove the code from
srp_queuecommand() that prevents command queueing while reconnecting.
This patch avoids that the following can appear in the kernel log:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
in_atomic(): 1, irqs_disabled(): 0, pid: 5600, name: scsi_eh_9
1 lock held by scsi_eh_9/5600:
 #0:  (rcu_read_lock){}, at: [<cbb798c7>] 
__blk_mq_run_hw_queue+0xf1/0x1e0
Preemption disabled at:
[<139badf2>] __blk_mq_delay_run_hw_queue+0x78/0xf0
CPU: 9 PID: 5600 Comm: scsi_eh_9 Tainted: GW4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Call Trace:
 dump_stack+0x67/0x99
 ___might_sleep+0x16a/0x250 [ib_srp]
 __mutex_lock+0x46/0x9d0
 srp_queuecommand+0x356/0x420 [ib_srp]
 scsi_dispatch_cmd+0xf6/0x3f0
 scsi_queue_rq+0x4a8/0x5f0
 blk_mq_dispatch_rq_list+0x73/0x440
 blk_mq_sched_dispatch_requests+0x109/0x1a0
 __blk_mq_run_hw_queue+0x131/0x1e0
 __blk_mq_delay_run_hw_queue+0x9a/0xf0
 blk_mq_run_hw_queue+0xc0/0x1e0
 blk_mq_start_hw_queues+0x2c/0x40
 scsi_run_queue+0x18e/0x2d0
 scsi_run_host_queues+0x22/0x40
 scsi_error_handler+0x18d/0x5f0
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Jason Gunthorpe <j...@mellanox.com>
Cc: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 972d4b3c5223..670f187ecb91 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -2149,7 +2149,6 @@ static void srp_handle_qp_err(struct ib_cq *cq, struct 
ib_wc *wc,
 static int srp_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
 {
struct srp_target_port *target = host_to_target(shost);
-   struct srp_rport *rport = target->rport;
struct srp_rdma_ch *ch;
struct srp_request *req;
struct srp_iu *iu;
@@ -2159,16 +2158,6 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
u32 tag;
u16 idx;
int len, ret;
-   const bool in_scsi_eh = !in_interrupt() && current == shost->ehandler;
-
-   /*
-* The SCSI EH thread is the only context from which srp_queuecommand()
-* can get invoked for blocked devices (SDEV_BLOCK /
-* SDEV_CREATED_BLOCK). Avoid racing with srp_reconnect_rport() by
-* locking the rport mutex if invoked from inside the SCSI EH.
-*/
-   if (in_scsi_eh)
-   mutex_lock(&rport->mutex);
 
scmnd->result = srp_chkready(target->rport);
if (unlikely(scmnd->result))
@@ -2230,13 +2219,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
goto err_unmap;
}
 
-   ret = 0;
-
-unlock_rport:
-   if (in_scsi_eh)
-   mutex_unlock(&rport->mutex);
-
-   return ret;
+   return 0;
 
 err_unmap:
srp_unmap_data(scmnd, ch, req);
@@ -2258,7 +2241,7 @@ static int srp_queuecommand(struct Scsi_Host *shost, 
struct scsi_cmnd *scmnd)
ret = SCSI_MLQUEUE_HOST_BUSY;
}
 
-   goto unlock_rport;
+   return ret;
 }
 
 /*
-- 
2.15.1



[PATCH 1/4] blk-mq: Rename request_queue.mq_freeze_wq into mq_wq

2018-01-08 Thread Bart Van Assche
Rename a waitqueue in struct request_queue since the next patch will
add code that uses this waitqueue outside the request queue freezing
implementation.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Ming Lei <ming@redhat.com>
---
 block/blk-core.c       | 10 +-
 block/blk-mq.c         | 10 +-
 include/linux/blkdev.h |  2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 8f27ab14abce..605599a2ab3b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -375,7 +375,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -651,7 +651,7 @@ void blk_set_queue_dying(struct request_queue *q)
}
 
/* Make blk_queue_enter() reexamine the DYING flag. */
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -854,7 +854,7 @@ int blk_queue_enter(struct request_queue *q, 
blk_mq_req_flags_t flags)
 */
smp_rmb();
 
-   ret = wait_event_interruptible(q->mq_freeze_wq,
+   ret = wait_event_interruptible(q->mq_wq,
(atomic_read(&q->mq_freeze_depth) == 0 &&
 (preempt || !blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
@@ -875,7 +875,7 @@ static void blk_queue_usage_counter_release(struct 
percpu_ref *ref)
struct request_queue *q =
container_of(ref, struct request_queue, q_usage_counter);
 
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
 }
 
 static void blk_rq_timed_out_timer(struct timer_list *t)
@@ -951,7 +951,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
q->bypass_depth = 1;
__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-   init_waitqueue_head(&q->mq_freeze_wq);
+   init_waitqueue_head(&q->mq_wq);
 
/*
 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21c21ca13072..8118890fb66f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -138,16 +138,16 @@ EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
 void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-   wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+   wait_event(q->mq_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 unsigned long timeout)
 {
-   return wait_event_timeout(q->mq_freeze_wq,
-   percpu_ref_is_zero(&q->q_usage_counter),
-   timeout);
+   return wait_event_timeout(q->mq_wq,
+ percpu_ref_is_zero(&q->q_usage_counter),
+ timeout);
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
@@ -186,7 +186,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
WARN_ON_ONCE(freeze_depth < 0);
if (!freeze_depth) {
percpu_ref_reinit(&q->q_usage_counter);
-   wake_up_all(&q->mq_freeze_wq);
+   wake_up_all(&q->mq_wq);
}
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 742c0d9aed6b..26c4d4343edb 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -613,7 +613,7 @@ struct request_queue {
struct throtl_data *td;
 #endif
struct rcu_head rcu_head;
-   wait_queue_head_t   mq_freeze_wq;
+   wait_queue_head_t   mq_wq;
struct percpu_ref   q_usage_counter;
struct list_headall_q_node;
 
-- 
2.15.1



[PATCH 3/4] scsi: Avoid that .queuecommand() gets called for a quiesced SCSI device

2018-01-08 Thread Bart Van Assche
Several SCSI transport and LLD drivers surround code that does not
tolerate concurrent calls of .queuecommand() with scsi_target_block() /
scsi_target_unblock(). These last two functions use
blk_mq_quiesce_queue() / blk_mq_unquiesce_queue() for scsi-mq request
queues to prevent concurrent .queuecommand() calls. However, that is
not sufficient to prevent .queuecommand() calls from scsi_send_eh_cmnd().
Hence surround the .queuecommand() call from the SCSI error handler with
blk_wait_if_quiesced() / blk_finish_wait_if_quiesced().

Note: converting the .queuecommand() call in scsi_send_eh_cmnd() into
code that calls blk_get_request(), e.g. scsi_execute_req(), is not an
option since scsi_send_eh_cmnd() can be called if all requests are
allocated and if no requests will make progress without aborting any
of these requests.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Ming Lei <ming@redhat.com>
---
 drivers/scsi/scsi_error.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 62b56de38ae8..5af3de1731a5 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1016,6 +1016,7 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 {
struct scsi_device *sdev = scmd->device;
struct Scsi_Host *shost = sdev->host;
+   struct request_queue *q = sdev->request_queue;
DECLARE_COMPLETION_ONSTACK(done);
unsigned long timeleft = timeout;
struct scsi_eh_save ses;
@@ -1028,7 +1029,9 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
 
scsi_log_send(scmd);
scmd->scsi_done = scsi_eh_done;
+   blk_wait_if_quiesced(q);
rtn = shost->hostt->queuecommand(shost, scmd);
+   blk_finish_wait_if_quiesced(q);
if (rtn) {
if (timeleft > stall_for) {
scsi_eh_restore_cmnd(scmd, &ses);
-- 
2.15.1



[PATCH 0/4] Make SCSI transport recovery more robust

2018-01-08 Thread Bart Van Assche
Hello Jens,

A longstanding issue with the SCSI core is that several SCSI transport drivers
use scsi_target_block() and scsi_target_unblock() to avoid concurrent
.queuecommand() calls during e.g. transport recovery, but that this is not
sufficient to prevent such calls. Hence this patch series. Please
consider this patch series for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (4):
  blk-mq: Rename request_queue.mq_freeze_wq into mq_wq
  block: Introduce blk_wait_if_quiesced() and
blk_finish_wait_if_quiesced()
  scsi: Avoid that .queuecommand() gets called for a quiesced SCSI
device
  IB/srp: Fix a sleep-in-invalid-context bug

 block/blk-core.c| 11 +++---
 block/blk-mq.c  | 71 ++---
 drivers/infiniband/ulp/srp/ib_srp.c | 21 ++-
 drivers/scsi/scsi_error.c   |  3 ++
 include/linux/blk-mq.h  |  2 ++
 include/linux/blkdev.h  |  2 +-
 6 files changed, 80 insertions(+), 30 deletions(-)

-- 
2.15.1



[PATCH] block: Fix kernel-doc warnings reported when building with W=1

2018-01-08 Thread Bart Van Assche
Commit 3a025e1d1c2e ("Add optional check for bad kernel-doc comments")
causes W=1 the kernel-doc script to be run and thereby causes several
new warnings to appear when building the kernel with W=1. Fix the
block layer kernel-doc headers such that the block layer again builds
cleanly with W=1.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/bsg-lib.c| 3 ++-
 block/scsi_ioctl.c | 5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 15d25ccd51a5..1474153f73e3 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -30,7 +30,7 @@
 
 /**
  * bsg_teardown_job - routine to teardown a bsg job
- * @job: bsg_job that is to be torn down
+ * @kref: kref inside bsg_job that is to be torn down
  */
 static void bsg_teardown_job(struct kref *kref)
 {
@@ -251,6 +251,7 @@ static void bsg_exit_rq(struct request_queue *q, struct request *req)
  * @name: device to give bsg device
  * @job_fn: bsg job handler
  * @dd_job_size: size of LLD data needed for each job
+ * @release: @dev release function
  */
 struct request_queue *bsg_setup_queue(struct device *dev, const char *name,
bsg_job_fn *job_fn, int dd_job_size,
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index edcfff974527..5cddff44a2f8 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -384,9 +384,10 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
 
 /**
  * sg_scsi_ioctl  --  handle deprecated SCSI_IOCTL_SEND_COMMAND ioctl
- * @file:  file this ioctl operates on (optional)
  * @q: request queue to send scsi commands down
  * @disk:  gendisk to operate on (option)
+ * @mode:  mode used to open the file through which the ioctl has been
+ * submitted
  * @sic:   userspace structure describing the command to perform
  *
  * Send down the scsi command described by @sic to the device below
@@ -415,10 +416,10 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
  *  Positive numbers returned are the compacted SCSI error codes (4
  *  bytes in one int) where the lowest byte is the SCSI status.
  */
-#define OMAX_SB_LEN 16  /* For backward compatibility */
 int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
struct scsi_ioctl_command __user *sic)
 {
+   enum { OMAX_SB_LEN = 16 };  /* For backward compatibility */
struct request *rq;
struct scsi_request *req;
int err;
-- 
2.15.1



[PATCH v4 2/5] crypto: scompress - use sgl_alloc() and sgl_free()

2018-01-05 Thread Bart Van Assche
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
---
 crypto/Kconfig |  1 +
 crypto/scompress.c | 51 ++-
 2 files changed, 3 insertions(+), 49 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index f7911963bb79..20360e040425 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -106,6 +106,7 @@ config CRYPTO_KPP
 config CRYPTO_ACOMP2
tristate
select CRYPTO_ALGAPI2
+   select SGL_ALLOC
 
 config CRYPTO_ACOMP
tristate
diff --git a/crypto/scompress.c b/crypto/scompress.c
index 2075e2c4e7df..968bbcf65c94 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -140,53 +140,6 @@ static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
return ret;
 }
 
-static void crypto_scomp_sg_free(struct scatterlist *sgl)
-{
-   int i, n;
-   struct page *page;
-
-   if (!sgl)
-   return;
-
-   n = sg_nents(sgl);
-   for_each_sg(sgl, sgl, n, i) {
-   page = sg_page(sgl);
-   if (page)
-   __free_page(page);
-   }
-
-   kfree(sgl);
-}
-
-static struct scatterlist *crypto_scomp_sg_alloc(size_t size, gfp_t gfp)
-{
-   struct scatterlist *sgl;
-   struct page *page;
-   int i, n;
-
-   n = ((size - 1) >> PAGE_SHIFT) + 1;
-
-   sgl = kmalloc_array(n, sizeof(struct scatterlist), gfp);
-   if (!sgl)
-   return NULL;
-
-   sg_init_table(sgl, n);
-
-   for (i = 0; i < n; i++) {
-   page = alloc_page(gfp);
-   if (!page)
-   goto err;
-   sg_set_page(sgl + i, page, PAGE_SIZE, 0);
-   }
-
-   return sgl;
-
-err:
-   sg_mark_end(sgl + i);
-   crypto_scomp_sg_free(sgl);
-   return NULL;
-}
-
 static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 {
struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
@@ -220,7 +173,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
   scratch_dst, &req->dlen, *ctx);
if (!ret) {
if (!req->dst) {
-   req->dst = crypto_scomp_sg_alloc(req->dlen, GFP_ATOMIC);
+   req->dst = sgl_alloc(req->dlen, GFP_ATOMIC, NULL);
if (!req->dst)
goto out;
}
@@ -274,7 +227,7 @@ int crypto_init_scomp_ops_async(struct crypto_tfm *tfm)
 
crt->compress = scomp_acomp_compress;
crt->decompress = scomp_acomp_decompress;
-   crt->dst_free = crypto_scomp_sg_free;
+   crt->dst_free = sgl_free;
crt->reqsize = sizeof(void *);
 
return 0;
-- 
2.15.1



[PATCH v4 3/5] nvmet/fc: Use sgl_alloc() and sgl_free()

2018-01-05 Thread Bart Van Assche
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Cc: Keith Busch <keith.bu...@intel.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: James Smart <james.sm...@broadcom.com>
Cc: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/target/Kconfig |  1 +
 drivers/nvme/target/fc.c| 36 ++--
 2 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 03e4ab65fe77..4d9715630e21 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -39,6 +39,7 @@ config NVME_TARGET_FC
tristate "NVMe over Fabrics FC target driver"
depends on NVME_TARGET
depends on HAS_DMA
+   select SGL_ALLOC
help
  This enables the NVMe FC target support, which allows exporting NVMe
  devices over FC.
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 5fd86039e353..840d1a39de33 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -1697,31 +1697,12 @@ static int
 nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
struct scatterlist *sg;
-   struct page *page;
unsigned int nent;
-   u32 page_len, length;
-   int i = 0;
 
-   length = fod->req.transfer_len;
-   nent = DIV_ROUND_UP(length, PAGE_SIZE);
-   sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
+   sg = sgl_alloc(fod->req.transfer_len, GFP_KERNEL, &nent);
if (!sg)
goto out;
 
-   sg_init_table(sg, nent);
-
-   while (length) {
-   page_len = min_t(u32, length, PAGE_SIZE);
-
-   page = alloc_page(GFP_KERNEL);
-   if (!page)
-   goto out_free_pages;
-
-   sg_set_page(&sg[i], page, page_len, 0);
-   length -= page_len;
-   i++;
-   }
-
fod->data_sg = sg;
fod->data_sg_cnt = nent;
fod->data_sg_cnt = fc_dma_map_sg(fod->tgtport->dev, sg, nent,
@@ -1731,14 +1712,6 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 
return 0;
 
-out_free_pages:
-   while (i > 0) {
-   i--;
-   __free_page(sg_page(&sg[i]));
-   }
-   kfree(sg);
-   fod->data_sg = NULL;
-   fod->data_sg_cnt = 0;
 out:
return NVME_SC_INTERNAL;
 }
@@ -1746,18 +1719,13 @@ nvmet_fc_alloc_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 static void
 nvmet_fc_free_tgt_pgs(struct nvmet_fc_fcp_iod *fod)
 {
-   struct scatterlist *sg;
-   int count;
-
if (!fod->data_sg || !fod->data_sg_cnt)
return;
 
fc_dma_unmap_sg(fod->tgtport->dev, fod->data_sg, fod->data_sg_cnt,
((fod->io_dir == NVMET_FCP_WRITE) ?
DMA_FROM_DEVICE : DMA_TO_DEVICE));
-   for_each_sg(fod->data_sg, sg, fod->data_sg_cnt, count)
-   __free_page(sg_page(sg));
-   kfree(fod->data_sg);
+   sgl_free(fod->data_sg);
fod->data_sg = NULL;
fod->data_sg_cnt = 0;
 }
-- 
2.15.1



[PATCH v4 5/5] target: Use sgl_alloc_order() and sgl_free()

2018-01-05 Thread Bart Van Assche
Use the sgl_alloc_order() and sgl_free() functions instead of open
coding these functions.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Acked-by: Nicholas A. Bellinger <n...@linux-iscsi.org>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Sagi Grimberg <s...@grimberg.me>
---
 drivers/target/Kconfig |  1 +
 drivers/target/target_core_transport.c | 46 +++---
 2 files changed, 5 insertions(+), 42 deletions(-)

diff --git a/drivers/target/Kconfig b/drivers/target/Kconfig
index e2bc99980f75..4c44d7bed01a 100644
--- a/drivers/target/Kconfig
+++ b/drivers/target/Kconfig
@@ -5,6 +5,7 @@ menuconfig TARGET_CORE
select CONFIGFS_FS
select CRC_T10DIF
select BLK_SCSI_REQUEST # only for scsi_command_size_tbl..
+   select SGL_ALLOC
default n
help
Say Y or M here to enable the TCM Storage Engine and ConfigFS enabled
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 58caacd54a3b..a001ba711cca 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2300,13 +2300,7 @@ static void target_complete_ok_work(struct work_struct *work)
 
 void target_free_sgl(struct scatterlist *sgl, int nents)
 {
-   struct scatterlist *sg;
-   int count;
-
-   for_each_sg(sgl, sg, nents, count)
-   __free_page(sg_page(sg));
-
-   kfree(sgl);
+   sgl_free(sgl);
 }
 EXPORT_SYMBOL(target_free_sgl);
 
@@ -2414,42 +2408,10 @@ int
 target_alloc_sgl(struct scatterlist **sgl, unsigned int *nents, u32 length,
 bool zero_page, bool chainable)
 {
-   struct scatterlist *sg;
-   struct page *page;
-   gfp_t zero_flag = (zero_page) ? __GFP_ZERO : 0;
-   unsigned int nalloc, nent;
-   int i = 0;
-
-   nalloc = nent = DIV_ROUND_UP(length, PAGE_SIZE);
-   if (chainable)
-   nalloc++;
-   sg = kmalloc_array(nalloc, sizeof(struct scatterlist), GFP_KERNEL);
-   if (!sg)
-   return -ENOMEM;
+   gfp_t gfp = GFP_KERNEL | (zero_page ? __GFP_ZERO : 0);
 
-   sg_init_table(sg, nalloc);
-
-   while (length) {
-   u32 page_len = min_t(u32, length, PAGE_SIZE);
-   page = alloc_page(GFP_KERNEL | zero_flag);
-   if (!page)
-   goto out;
-
-   sg_set_page(&sg[i], page, page_len, 0);
-   length -= page_len;
-   i++;
-   }
-   *sgl = sg;
-   *nents = nent;
-   return 0;
-
-out:
-   while (i > 0) {
-   i--;
-   __free_page(sg_page(&sg[i]));
-   }
-   kfree(sg);
-   return -ENOMEM;
+   *sgl = sgl_alloc_order(length, 0, chainable, gfp, nents);
+   return *sgl ? 0 : -ENOMEM;
 }
 EXPORT_SYMBOL(target_alloc_sgl);
 
-- 
2.15.1



[PATCH v4 1/5] lib/scatterlist: Introduce sgl_alloc() and sgl_free()

2018-01-05 Thread Bart Van Assche
Many kernel drivers contain code that allocates and frees both a
scatterlist and the pages that populate that scatterlist.
Introduce functions in lib/scatterlist.c that perform these tasks
instead of duplicating this functionality in multiple drivers.
These functions are only included in the build if CONFIG_SGL_ALLOC=y
so that the kernel size does not increase when this functionality
is not used.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
---
 include/linux/scatterlist.h |  10 +
 lib/Kconfig |   4 ++
 lib/scatterlist.c   | 105 
 3 files changed, 119 insertions(+)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index b7c83254c566..b8a7c1d1dbe3 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -276,6 +276,16 @@ int sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
  unsigned int n_pages, unsigned int offset,
  unsigned long size, gfp_t gfp_mask);
 
+#ifdef CONFIG_SGL_ALLOC
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+   unsigned int order, bool chainable,
+   gfp_t gfp, unsigned int *nent_p);
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+ unsigned int *nent_p);
+void sgl_free_order(struct scatterlist *sgl, int order);
+void sgl_free(struct scatterlist *sgl);
+#endif /* CONFIG_SGL_ALLOC */
+
 size_t sg_copy_buffer(struct scatterlist *sgl, unsigned int nents, void *buf,
  size_t buflen, off_t skip, bool to_buffer);
 
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fbcb30b..4dd5c11366f9 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -409,6 +409,10 @@ config HAS_DMA
depends on !NO_DMA
default y
 
+config SGL_ALLOC
+   bool
+   default n
+
 config DMA_NOOP_OPS
bool
depends on HAS_DMA && (!64BIT || ARCH_DMA_ADDR_T_64BIT)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 7c1c55f7daaa..9afc9b432083 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -474,6 +474,111 @@ int sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
 }
 EXPORT_SYMBOL(sg_alloc_table_from_pages);
 
+#ifdef CONFIG_SGL_ALLOC
+
+/**
+ * sgl_alloc_order - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist. Must be at least one
+ * @order: Second argument for alloc_pages()
+ * @chainable: Whether or not to allocate an extra element in the scatterlist
+ * for scatterlist chaining purposes
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist that have pages
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc_order(unsigned long long length,
+   unsigned int order, bool chainable,
+   gfp_t gfp, unsigned int *nent_p)
+{
+   struct scatterlist *sgl, *sg;
+   struct page *page;
+   unsigned int nent, nalloc;
+   u32 elem_len;
+
+   nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
+   /* Check for integer overflow */
+   if (length > (nent << (PAGE_SHIFT + order)))
+   return NULL;
+   nalloc = nent;
+   if (chainable) {
+   /* Check for integer overflow */
+   if (nalloc + 1 < nalloc)
+   return NULL;
+   nalloc++;
+   }
+   sgl = kmalloc_array(nalloc, sizeof(struct scatterlist),
+   (gfp & ~GFP_DMA) | __GFP_ZERO);
+   if (!sgl)
+   return NULL;
+
+   sg_init_table(sgl, nent);
+   sg = sgl;
+   while (length) {
+   elem_len = min_t(u64, length, PAGE_SIZE << order);
+   page = alloc_pages(gfp, order);
+   if (!page) {
+   sgl_free(sgl);
+   return NULL;
+   }
+
+   sg_set_page(sg, page, elem_len, 0);
+   length -= elem_len;
+   sg = sg_next(sg);
+   }
+   WARN_ON_ONCE(sg);
+   if (nent_p)
+   *nent_p = nent;
+   return sgl;
+}
+EXPORT_SYMBOL(sgl_alloc_order);
+
+/**
+ * sgl_alloc - allocate a scatterlist and its pages
+ * @length: Length in bytes of the scatterlist
+ * @gfp: Memory allocation flags
+ * @nent_p: [out] Number of entries in the scatterlist
+ *
+ * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
+ */
+struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+ unsigned int *nent_p)
+{
+   return sgl_alloc_order(length, 0, false, gfp, nent_p);
+}
+EXPORT_SYMBOL(sgl_alloc);

[PATCH v4 4/5] nvmet/rdma: Use sgl_alloc() and sgl_free()

2018-01-05 Thread Bart Van Assche
Use the sgl_alloc() and sgl_free() functions instead of open coding
these functions.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Cc: Keith Busch <keith.bu...@intel.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Sagi Grimberg <s...@grimberg.me>
---
 drivers/nvme/target/Kconfig |  1 +
 drivers/nvme/target/rdma.c  | 63 +++--
 2 files changed, 5 insertions(+), 59 deletions(-)

diff --git a/drivers/nvme/target/Kconfig b/drivers/nvme/target/Kconfig
index 4d9715630e21..5f4f8b16685f 100644
--- a/drivers/nvme/target/Kconfig
+++ b/drivers/nvme/target/Kconfig
@@ -29,6 +29,7 @@ config NVME_TARGET_RDMA
tristate "NVMe over Fabrics RDMA target support"
depends on INFINIBAND
depends on NVME_TARGET
+   select SGL_ALLOC
help
  This enables the NVMe RDMA target support, which allows exporting NVMe
  devices over RDMA.
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 49912909c298..0e4c15754c58 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -185,59 +185,6 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
 }
 
-static void nvmet_rdma_free_sgl(struct scatterlist *sgl, unsigned int nents)
-{
-   struct scatterlist *sg;
-   int count;
-
-   if (!sgl || !nents)
-   return;
-
-   for_each_sg(sgl, sg, nents, count)
-   __free_page(sg_page(sg));
-   kfree(sgl);
-}
-
-static int nvmet_rdma_alloc_sgl(struct scatterlist **sgl, unsigned int *nents,
-   u32 length)
-{
-   struct scatterlist *sg;
-   struct page *page;
-   unsigned int nent;
-   int i = 0;
-
-   nent = DIV_ROUND_UP(length, PAGE_SIZE);
-   sg = kmalloc_array(nent, sizeof(struct scatterlist), GFP_KERNEL);
-   if (!sg)
-   goto out;
-
-   sg_init_table(sg, nent);
-
-   while (length) {
-   u32 page_len = min_t(u32, length, PAGE_SIZE);
-
-   page = alloc_page(GFP_KERNEL);
-   if (!page)
-   goto out_free_pages;
-
-   sg_set_page(&sg[i], page, page_len, 0);
-   length -= page_len;
-   i++;
-   }
-   *sgl = sg;
-   *nents = nent;
-   return 0;
-
-out_free_pages:
-   while (i > 0) {
-   i--;
-   __free_page(sg_page(&sg[i]));
-   }
-   kfree(sg);
-out:
-   return NVME_SC_INTERNAL;
-}
-
 static int nvmet_rdma_alloc_cmd(struct nvmet_rdma_device *ndev,
struct nvmet_rdma_cmd *c, bool admin)
 {
@@ -484,7 +431,7 @@ static void nvmet_rdma_release_rsp(struct nvmet_rdma_rsp *rsp)
}
 
if (rsp->req.sg != &rsp->cmd->inline_sg)
-   nvmet_rdma_free_sgl(rsp->req.sg, rsp->req.sg_cnt);
+   sgl_free(rsp->req.sg);
 
if (unlikely(!list_empty_careful(&rsp->rsp_wr_wait_list)))
nvmet_rdma_process_wr_wait_list(queue);
@@ -621,16 +568,14 @@ static u16 nvmet_rdma_map_sgl_keyed(struct nvmet_rdma_rsp *rsp,
u32 len = get_unaligned_le24(sgl->length);
u32 key = get_unaligned_le32(sgl->key);
int ret;
-   u16 status;
 
/* no data command? */
if (!len)
return 0;
 
-   status = nvmet_rdma_alloc_sgl(&rsp->req.sg, &rsp->req.sg_cnt,
-   len);
-   if (status)
-   return status;
+   rsp->req.sg = sgl_alloc(len, GFP_KERNEL, &rsp->req.sg_cnt);
+   if (!rsp->req.sg)
+   return NVME_SC_INTERNAL;
 
ret = rdma_rw_ctx_init(&rsp->rw, cm_id->qp, cm_id->port_num,
rsp->req.sg, rsp->req.sg_cnt, 0, addr, key,
-- 
2.15.1



[PATCH v4 0/5] Introduce sgl_alloc() and sgl_free()

2018-01-05 Thread Bart Van Assche
Hello Jens,

As you know there are multiple drivers that both allocate a scatter/gather
list and populate that list with pages. This patch series moves the code for
allocating and freeing such scatterlists from several drivers into
lib/scatterlist.c. Please consider this patch series for kernel v4.16.

Changes between v3 and v4:
- Left out patches 6..8 as requested by Martin Petersen. These patches will be
  sent to Martin after this series has been merged.

Changes between v2 and v3:
- Added Reviewed-by tags that had been posted as replies on v2.
- Cc'ed the crypto maintainers for the entire patch series.

Changes between v1 and v2:
- Moved the sgl_alloc*() and sgl_free() functions from a new source file into
  lib/scatterlist.c.
- Changed the argument order for the sgl_alloc*() functions such that the
  (pointer to) the output argument comes last.

Bart Van Assche (5):
  lib/scatterlist: Introduce sgl_alloc() and sgl_free()
  crypto: scompress - use sgl_alloc() and sgl_free()
  nvmet/fc: Use sgl_alloc() and sgl_free()
  nvmet/rdma: Use sgl_alloc() and sgl_free()
  target: Use sgl_alloc_order() and sgl_free()

 crypto/Kconfig |   1 +
 crypto/scompress.c |  51 +---
 drivers/nvme/target/Kconfig|   2 +
 drivers/nvme/target/fc.c   |  36 +--
 drivers/nvme/target/rdma.c |  63 ++--
 drivers/target/Kconfig |   1 +
 drivers/target/target_core_transport.c |  46 ++-
 include/linux/scatterlist.h|  10 
 lib/Kconfig|   4 ++
 lib/scatterlist.c  | 105 +
 10 files changed, 135 insertions(+), 184 deletions(-)

-- 
2.15.1



Re: 4.15-rcX 32-bit SCSI suspend2ram warning

2018-01-03 Thread Bart Van Assche
On Wed, 2018-01-03 at 12:16 -0500, Woody Suwalski wrote:
> Starting with a 32-bit 4.15-rcX I have observed a warning in dmesg,
> happening when the system goes to suspend2ram. If it is just a warning
> - could we have it quiet'et down?

(+linux-scsi)

Thank you for having reported this. The comment above the warning that has
been triggered is as follows:

/*
 * It is allowed to call scsi_device_quiesce() multiple times from
 * the same context but concurrent scsi_device_quiesce() calls are
 * not allowed.
 */
WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current);

That means that suppressing the warning would be wrong - it would hide the
race condition but not fix it. Anyway, would it be possible to test the
(entirely untested) patch below? Sorry but I currently don't have access to
a test setup that triggers the SPI DV code.

Thanks,

Bart.


diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c
index 10ebb213ddb3..abd48cffb282 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/suspend.h>
 #include 
 #include "scsi_priv.h"
 #include 
@@ -1009,11 +1010,20 @@ spi_dv_device(struct scsi_device *sdev)
u8 *buffer;
const int len = SPI_MAX_ECHO_BUFFER_SIZE*2;
 
+   /*
+* Because this function and the power management code both call
+* scsi_device_quiesce(), it is not safe to perform domain validation
+* while suspend or resume is in progress. Hence the
+* lock/unlock_system_sleep() calls.
+*/
+   lock_system_sleep();
+
if (unlikely(spi_dv_in_progress(starget)))
-   return;
+   goto unlock;
 
if (unlikely(scsi_device_get(sdev)))
-   return;
+   goto unlock;
+
spi_dv_in_progress(starget) = 1;
 
buffer = kzalloc(len, GFP_KERNEL);
@@ -1044,11 +1054,13 @@ spi_dv_device(struct scsi_device *sdev)
 
spi_initial_dv(starget) = 1;
 
- out_free:
+out_free:
kfree(buffer);
- out_put:
+out_put:
spi_dv_in_progress(starget) = 0;
scsi_device_put(sdev);
+unlock:
+   unlock_system_sleep();
 }
 EXPORT_SYMBOL(spi_dv_device);
 
-- 
2.15.1


[PATCH 2/2] pktcdvd: Fix a recently introduced NULL pointer dereference

2018-01-02 Thread Bart Van Assche
Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized
instead of before it has been initialized. This patch prevents the
following command

pktsetup 1 /dev/sr0

from triggering the following kernel crash:

BUG: unable to handle kernel NULL pointer dereference at 0548
IP: pkt_setup_dev+0x2db/0x670 [pktcdvd]
CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1
Call Trace:
 pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd]
 do_vfs_ioctl+0x8e/0x670
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x23/0x9a

Reported-by: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
Fixes: commit ca18d6f769d2 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Tested-by: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
Cc: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
Cc: <sta...@vger.kernel.org> # v4.13
---
 drivers/block/pktcdvd.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2659b2534073..531a0915066b 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2579,14 +2579,14 @@ static int pkt_new_dev(struct pktcdvd_device *pd, dev_t dev)
bdev = bdget(dev);
if (!bdev)
return -ENOMEM;
+   ret = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL);
+   if (ret)
+   return ret;
if (!blk_queue_scsi_passthrough(bdev_get_queue(bdev))) {
WARN_ONCE(true, "Attempt to register a non-SCSI queue\n");
-   bdput(bdev);
+   blkdev_put(bdev, FMODE_READ | FMODE_NDELAY);
return -EINVAL;
}
-   ret = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL);
-   if (ret)
-   return ret;
 
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);
-- 
2.15.1



[PATCH 1/2] pktcdvd: Fix pkt_setup_dev() error path

2018-01-02 Thread Bart Van Assche
Commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue")
modified add_disk() and disk_release() but did not update any of the
error paths that trigger a put_disk() call after disk->queue has been
assigned. That introduced the following behavior in the pktcdvd driver
if pkt_new_dev() fails:

Kernel BUG at e98fd882 [verbose debug info unavailable]

Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL,
fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev()
error path.

Fixes: commit 523e1d399ce0 ("block: make gendisk hold a reference to its queue")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
Cc: <sta...@vger.kernel.org> # v3.2
---
 drivers/block/pktcdvd.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 67974796c350..2659b2534073 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2745,7 +2745,7 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
pd->pkt_dev = MKDEV(pktdev_major, idx);
ret = pkt_new_dev(pd, dev);
if (ret)
-   goto out_new_dev;
+   goto out_mem2;
 
/* inherit events of the host device */
disk->events = pd->bdev->bd_disk->events;
@@ -2763,8 +2763,6 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
mutex_unlock(_mutex);
return 0;
 
-out_new_dev:
-   blk_cleanup_queue(disk->queue);
 out_mem2:
put_disk(disk);
 out_mem:
-- 
2.15.1



[PATCH 0/2] Two bug fixes for the pktcdvd driver

2018-01-02 Thread Bart Van Assche
Hello Jens,

This patch series fixes two bugs in the pktcdvd driver. The second patch fixes
a recently introduced regression while the first patch fixes a regression that
was introduced a long time ago. Please consider these patches for the upstream
kernel.

Thanks,

Bart.

Bart Van Assche (2):
  pktcdvd: Fix pkt_setup_dev() error path
  pktcdvd: Fix a recently introduced NULL pointer dereference

 drivers/block/pktcdvd.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

-- 
2.15.1



Re: [v2,4/6] pktcdvd: Check queue type before attaching to a queue

2017-12-30 Thread Bart Van Assche
On Sat, 2017-12-30 at 22:41 +0100, Maciej S. Szmigiero wrote:
> This commit causes a NULL pointer dereference when adding a pktcdvd
> mapping.
> 
> Reproducing it is simple:
> # pktsetup 1 /dev/cdrom 
> 
> Specifically, the NULL dereference happens inside bdev_get_queue(bdev),
> which is supposed to return bdev->bd_disk->queue, but in this case
> bdev->bd_disk is NULL.

Would it be possible to test the two attached patches?

Thanks,

Bart.From 8ef0308718a3f3f60c0c6983d3ff606ac8d3db8d Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanass...@wdc.com>
Date: Sat, 30 Dec 2017 15:28:25 -0800
Subject: [PATCH 1/2] pktcdvd: Fix a recently introduced NULL pointer
 dereference

Reported-by: Maciej S. Szmigiero <m...@maciej.szmigiero.name>
Fixes: commit ca18d6f769d2 ("block: Make most scsi_req_init() calls implicit")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: <sta...@vger.kernel.org> # v4.13
---
 drivers/block/pktcdvd.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 67974796c350..fc8a80ec90e5 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2579,14 +2579,14 @@ static int pkt_new_dev(struct pktcdvd_device *pd, dev_t dev)
 	bdev = bdget(dev);
 	if (!bdev)
 		return -ENOMEM;
+	ret = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL);
+	if (ret)
+		return ret;
 	if (!blk_queue_scsi_passthrough(bdev_get_queue(bdev))) {
 		WARN_ONCE(true, "Attempt to register a non-SCSI queue\n");
-		bdput(bdev);
+		blkdev_put(bdev, FMODE_READ | FMODE_NDELAY);
 		return -EINVAL;
 	}
-	ret = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL);
-	if (ret)
-		return ret;
 
 	/* This is safe, since we have a reference from open(). */
 	__module_get(THIS_MODULE);
-- 
2.15.1

From 3192cc5f62b3ba9f866bcb245d21231a39745d8d Mon Sep 17 00:00:00 2001
From: Bart Van Assche <bart.vanass...@wdc.com>
Date: Sat, 30 Dec 2017 16:44:35 -0800
Subject: [PATCH 2/2] pktcdvd: Fix pkt_setup_dev() error path

Since disk_release(disk) calls blk_put_queue() if disk->queue != NULL,
clear disk->queue before calling put_disk().

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: <sta...@vger.kernel.org>
---
 drivers/block/pktcdvd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index fc8a80ec90e5..c5e930d23a63 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2765,6 +2765,7 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
 
 out_new_dev:
 	blk_cleanup_queue(disk->queue);
+	disk->queue = NULL;
 out_mem2:
 	put_disk(disk);
 out_mem:
-- 
2.15.1



Re: regression 4.15-rc: kernel oops in dm-multipath

2017-12-22 Thread Bart Van Assche
On Fri, 2017-12-22 at 10:53 +0100, Christian Borntraeger wrote:
> any quick ideas?

Hello Christian,

Was commit afc567a4977b ("dm table: fix regression from improper 
dm_dev_internal.count
refcount_t conversion") included in your test kernel?

Bart.


Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread Bart Van Assche
On Wed, 2017-12-20 at 16:08 -0800, t...@kernel.org wrote:
> On Wed, Dec 20, 2017 at 11:41:02PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > Currently, blk-mq timeout path synchronizes against the usual
> > > issue/completion path using a complex scheme involving atomic
> > > bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> > > rules.  Unfortunatley, it contains quite a few holes.
> > 
> > Hello Tejun,
> > 
> > An attempt to run SCSI I/O with this patch series applied resulted in
> > the following:
> 
> Can you please try the v3?  There were a couple bugs that I missed
> while testing earlier versions.
> 
>  http://lkml.kernel.org/r/20171216120726.517153-1...@kernel.org

Will do. But please Cc: linux-block in case you post a v4 of this patch
series. I searched the linux-block folder of my mailbox for the latest version
of this patch series, and that is how I ended up testing v2 instead of v3 ...

Bart.

Re: [PATCHSET v2] blk-mq: reimplement timeout handling

2017-12-20 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.

Hello Tejun,

An attempt to run SCSI I/O with this patch series applied resulted in
the following:

BUG: unable to handle kernel NULL pointer dereference at   (null)
IP: scsi_times_out+0x1c/0x2d0
PGD 0 P4D 0
Oops:  [#1] PREEMPT SMP
CPU: 1 PID: 437 Comm: kworker/1:1H Tainted: GW4.15.0-rc4-dbg+ #1
Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 2.5.4 01/22/2016
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x1c/0x2d0
RSP: 0018:c90007ef3d58 EFLAGS: 00010246
RAX:  RBX: 880878eab000 RCX: 
RDX:  RSI:  RDI: 880878eab000
RBP: 880878eab1a0 R08:  R09: 0001
R10:  R11:  R12: 0004
R13:  R14: 88085e4a5ce8 R15: 880878e9f848
FS:  () GS:88093f60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 01c0f002 CR4: 000606e0
Call Trace:
 blk_mq_terminate_expired+0x36/0x70
 bt_iter+0x43/0x50
 blk_mq_queue_tag_busy_iter+0xee/0x200
 blk_mq_timeout_work+0x186/0x2e0
 process_one_work+0x221/0x6e0
 worker_thread+0x3a/0x390
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30
RIP: scsi_times_out+0x1c/0x2d0 RSP: c90007ef3d58
CR2: 

(gdb) list *(scsi_times_out+0x1c)
0x8147adbc is in scsi_times_out (drivers/scsi/scsi_error.c:285).
280  */
281 enum blk_eh_timer_return scsi_times_out(struct request *req)
282 {
283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285 struct Scsi_Host *host = scmd->device->host;
286
287 trace_scsi_dispatch_cmd_timeout(scmd);
288 scsi_log_completion(scmd, TIMEOUT_ERROR);
289

(gdb) disas /s scsi_times_out
[ ... ]
283 struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
284 enum blk_eh_timer_return rtn = BLK_EH_NOT_HANDLED;
285 struct Scsi_Host *host = scmd->device->host;
   0x8147adb2 <+18>:	mov    0x1d8(%rdi),%rax
   0x8147adb9 <+25>:	mov    %rdi,%rbx
   0x8147adbc <+28>:	mov    (%rax),%r13
   0x8147adbf <+31>:nopl   0x0(%rax,%rax,1)

Bart.

Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-15 Thread Bart Van Assche
On Fri, 2017-12-15 at 15:58 +0800, chenxiang (M) wrote:
> I tested v2 of this series based on Martin's for-4.16 SCSI tree which 
> branch error handler issue "scsi: core: Ensure that the
> SCSI error handler gets woken up" is merged. And system is still hung 
> after repeat my testcase.

Hello chenxiang,

Thanks for having run this test and for having shared the results. I will see
whether I can reproduce this on my test setup.

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > + write_seqcount_begin(&rq->gstate_seq);
> > > + blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > + blk_add_timer(rq);
> > > + write_seqcount_end(&rq->gstate_seq);
> > 
> > My understanding is that both write_seqcount_begin() and 
> > write_seqcount_end()
> > trigger a write memory barrier. Is a seqcount really faster than a spinlock?
> 
> Yes lots, no atomic operations and no waiting.
> 
> The only constraint for write_seqlock is that there must not be any
> concurrency.
> 
> But now that I look at this again, TJ, why can't the below happen?
> 
>   write_seqlock_begin();
>   blk_mq_rq_update_state(rq, IN_FLIGHT);
>   blk_add_timer(rq);
>   
>   read_seqcount_begin()
>   while (seq & 1)
>   cpurelax();
>   // life-lock
>   
>   write_seqlock_end();

Hello Peter,

Some time ago the block layer was changed to handle timeouts in thread context
instead of interrupt context. See also commit 287922eb0b18 ("block: defer
timeouts to a workqueue").

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Thu, 2017-12-14 at 11:19 -0800, t...@kernel.org wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct 
> > > request *rq)
> > >   rq->start_time = jiffies;
> > >   set_start_time_ns(rq);
> > >   rq->part = NULL;
> > > + seqcount_init(&rq->gstate_seq);
> > > + u64_stats_init(&rq->aborted_gstate_sync);
> > >  }
> > >  EXPORT_SYMBOL(blk_rq_init);
> > 
> > Sorry but the above change looks ugly to me. My understanding is that 
> > blk_rq_init() is only used inside the block layer to initialize legacy block
> > layer requests while gstate_seq and aborted_gstate_sync are only relevant
> > for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> > called for blk-mq requests such that the above change can be left out? The
> > only callers outside the block layer core of blk_rq_init() I know of are
> > ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> > code if you want.
> 
> This is also used by flush path.  We probably should clean that up,
> but let's worry about that later cuz flush handling has enough of its
> own complications.

We may have a different opinion about this but I think it is more than a
detail. This patch needs gstate_seq and aborted_gstate_sync to be preserved
across request state transitions from MQ_RQ_IN_FLIGHT to MQ_RQ_IDLE.
blk_mq_init_request() is called at request allocation time so it's the right
context to initialize gstate_seq and aborted_gstate_sync. blk_rq_init()
however is called before every use of a request. Sorry, but I'm not
enthusiastic about the code in blk_rq_init() that reinitializes state
information that should survive request reuse.

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-14 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> rules.  Unfortunatley, it contains quite a few holes.
  ^
  Unfortunately?

> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
^^^
synchronization?

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request 
> *rq)
>   rq->start_time = jiffies;
>   set_start_time_ns(rq);
>   rq->part = NULL;
> + seqcount_init(&rq->gstate_seq);
> + u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);

Sorry but the above change looks ugly to me. My understanding is that 
blk_rq_init() is only used inside the block layer to initialize legacy block
layer requests while gstate_seq and aborted_gstate_sync are only relevant
for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
called for blk-mq requests such that the above change can be left out? The
only callers outside the block layer core of blk_rq_init() I know of are
ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
code if you want.

> + write_seqcount_begin(&rq->gstate_seq);
> + blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> + blk_add_timer(rq);
> + write_seqcount_end(&rq->gstate_seq);

My understanding is that both write_seqcount_begin() and write_seqcount_end()
trigger a write memory barrier. Is a seqcount really faster than a spinlock?

> 
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> + /*
> +  * As nothing prevents from completion happening while
> +  * ->aborted_gstate is set, this may lead to ignored
> +  * completions and further spurious timeouts.
> +  */
> + u64_stats_update_begin(&req->aborted_gstate_sync);
> + req->aborted_gstate = 0;
> + u64_stats_update_end(&req->aborted_gstate_sync);

If a blk-mq request is resubmitted 2**62 times, can that result in the above
code setting aborted_gstate to the same value as gstate? Isn't that a bug?
If so, how about setting aborted_gstate in the above code to e.g. gstate ^ 
(2**63)?

> @@ -228,6 +230,27 @@ struct request {
>  
>   unsigned short write_hint;
>  
> + /*
> +  * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +  * value and the upper bits the generation number which is
> +  * monotonically incremented and used to distinguish the reuse
> +  * instances.
> +  *
> +  * ->gstate_seq allows updates to ->gstate and other fields
> +  * (currently ->deadline) during request start to be read
> +  * atomically from the timeout path, so that it can operate on a
> +  * coherent set of information.
> +  */
> + seqcount_t gstate_seq;
> + u64 gstate;
> +
> + /*
> +  * ->aborted_gstate is used by the timeout to claim a specific
> +  * recycle instance of this request.  See blk_mq_timeout_work().
> +  */
> + struct u64_stats_sync aborted_gstate_sync;
> + u64 aborted_gstate;
> +
>   unsigned long deadline;
>   struct list_head timeout_list;

Why are gstate and aborted_gstate 64-bit variables? What makes you think that
32 bits would not be enough?

Thanks,

Bart.

Re: [PATCH 1/6] blk-mq: protect completion path with RCU

2017-12-14 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> + } else {
> + srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> + if (!blk_mark_rq_complete(rq))
> + __blk_mq_complete_request(rq);
> + srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);

Hello Tejun,

The name queue_rq_srcu was chosen to reflect the original use of that structure,
namely to protect .queue_rq() calls. Your patch series broadens the use of that
srcu structure, so I would appreciate it if it were renamed, e.g. to "srcu".
See also commit 6a83e74d214a ("blk-mq: Introduce blk_mq_quiesce_queue()").

Thanks,

Bart.

Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-13 Thread Bart Van Assche
On Wed, 2017-11-29 at 10:57 +0800, chenxiang (M) wrote:
> I applied this v2 patchset to kernel 4.15-rc1, running fio on a SATA 
> disk, then disable the disk with sysfs interface
> (echo 0 > /sys/class/sas_phy/phy-1:0:1/enable), and find system is hung. 
> But with v1 patch, it doesn't has this issue.

Hello chenxiang,

Would it be possible to repeat your test with v2 of this series and Martin's
for-4.16 SCSI tree merged into your kernel test tree? Martin's tree is available
at 
https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/log/?h=4.16/scsi-queue.
The following patch in that tree fixes a race condition that sometimes caused
the SCSI error handler not to be woken up:
https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.16/scsi-queue&id=3bd6f43f5cb3714f70c591514f344389df593501

Thanks,

Bart.

Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-13 Thread Bart Van Assche
On Fri, 2017-12-01 at 16:49 -0200, Mauricio Faria de Oliveira wrote:
> LR [c057c7fc] __blk_run_queue+0x6c/0xb0
> Call Trace:
> [c001fb083970] [c001fb0839e0] 0xc001fb0839e0 (unreliable)
> [c001fb0839a0] [c057ce0c] blk_run_queue+0x4c/0x80
> [c001fb0839d0] [c0591f54] blk_freeze_queue_start+0xa4/0xb0
> [c001fb083a00] [c057d5cc] blk_set_queue_dying+0x6c/0x190
> [c001fb083a30] [c08a3fbc] __dm_destroy+0xac/0x300
> [c001fb083ad0] [c08af6a4] dev_remove+0x154/0x1d0
> [c001fb083b20] [c08affd0] ctl_ioctl+0x360/0x4f0
> [c001fb083d10] [c08b0198] dm_ctl_ioctl+0x38/0x50
> [c001fb083d50] [c03863b8] do_vfs_ioctl+0xd8/0x8c0
> [c001fb083df0] [c0386c08] SyS_ioctl+0x68/0x100
> [c001fb083e30] [c000b760] system_call+0x58/0x6c
> Instruction dump:
>        
>        
> ---[ end trace e1710ec836e5526f ]---
> 
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x000b

Hello Mauricio,

Would it be possible to repeat your test with the patch below applied on your
kernel tree? This patch has just been posted on the dm-devel mailing list.

Thanks,

Bart.


From: Bart Van Assche <bart.vanass...@wdc.com>
Date: Wed, 13 Dec 2017 13:07:14 -0800
Subject: [PATCH] dm: Fix a recently introduced reference counting bug

This patch avoids that the following message occurs sporadically
in the system log (revealing that pgpath->path.dev->name became
a dangling pointer):

device-mapper: table: 254:2: device 
kkk?x0?a?E??E??F?2?pF??PF?9[F??]F???#???#??'f?
 not in table devices list

This patch also fixes the following kernel crash:

general protection fault:  [#1] PREEMPT SMP
RIP: 0010:multipath_busy+0x77/0xd0 [dm_multipath]
Call Trace:
 dm_mq_queue_rq+0x44/0x110 [dm_mod]
 blk_mq_dispatch_rq_list+0x73/0x440
 blk_mq_do_dispatch_sched+0x60/0xe0
 blk_mq_sched_dispatch_requests+0x11a/0x1a0
 __blk_mq_run_hw_queue+0x11f/0x1c0
 __blk_mq_delay_run_hw_queue+0x95/0xe0
 blk_mq_run_hw_queue+0x25/0x80
 blk_mq_flush_plug_list+0x197/0x420
 blk_flush_plug_list+0xe4/0x270
 blk_finish_plug+0x27/0x40
 __do_page_cache_readahead+0x2b4/0x370
 force_page_cache_readahead+0xb4/0x110
 generic_file_read_iter+0x755/0x970
 __vfs_read+0xd2/0x140
 vfs_read+0x9b/0x140
 SyS_read+0x45/0xa0
 do_syscall_64+0x56/0x1a0
 entry_SYSCALL64_slow_path+0x25/0x25

From the disassembly of multipath_busy (0x77 = 119):

./include/linux/blkdev.h:
992		return bdev->bd_disk->queue;	/* this is never NULL */
   0x06b4 <+116>:	mov    (%rax),%rax
   0x06b7 <+119>:	mov    0xe0(%rax),%rax

Fixes: commit 2a0b4682e09d ("dm: convert dm_dev_internal.count from atomic_t to 
refcount_t")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Elena Reshetova <elena.reshet...@intel.com>
Cc: Kees Cook <keesc...@chromium.org>
Cc: David Windsor <dwind...@gmail.com>
Cc: Hans Liljestrand <ishkam...@gmail.com>
Cc: Hannes Reinecke <h...@suse.com>
Cc: sta...@vger.kernel.org # v4.15
---
 drivers/md/dm-table.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 88130b5d95f9..ee5c389e7256 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -459,6 +459,8 @@ int dm_get_device(struct dm_target *ti, const char *path, 
fmode_t mode,
if (r)
return r;
		refcount_inc(&dd->count);
+   } else {
+   refcount_inc(&dd->count);
}
 
*result = dd->dm_dev;
-- 
2.15.1



Re: About the try to remove cross-release feature entirely by Ingo

2017-12-13 Thread Bart Van Assche
On Wed, 2017-12-13 at 16:13 +0900, Byungchul Park wrote:
> In addition, I want to say that the current level of
> classification is much less than 100% but, since we
> have annotated well to suppress wrong reports by
> rough classifications, finally it does not come into
> view by original lockdep for now.

The Linux kernel is not a vehicle for experiments. The majority of false
positives should have been fixed before the crossrelease patches were sent
to Linus.

Bart.

Re: [PATCH 6/6] blk-mq: remove REQ_ATOM_STARTED

2017-12-12 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> @@ -409,7 +407,7 @@ static void hctx_show_busy_rq(struct request *rq, void 
> *data, bool reserved)
> const struct show_busy_params *params = data;
>  
> if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> -   test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> +   blk_mq_rq_state(rq) != MQ_RQ_IDLE)
> __blk_mq_debugfs_rq_show(params->m,
>  list_entry_rq(>queuelist));

The above code should show all requests owned by the block driver. Patch
"blk-mq-debugfs: Also show requests that have not yet been started" (not yet
in Jens' tree) changes the REQ_ATOM_STARTED test into
list_empty(&rq->queuelist).
Can that change be supported with the existing MQ_RQ_* states or will a new
state have to be introduced to support this? See also
https://marc.info/?l=linux-block&m=151252188411991.

Thanks,

Bart.

Re: [PATCH 2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

2017-12-12 Thread Bart Van Assche
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> +/*
> + * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
> + * and the upper bits the generation number.
> + */
> +enum mq_rq_state {
> + MQ_RQ_IDLE  = 0,
> + MQ_RQ_IN_FLIGHT = 1,
> +
> + MQ_RQ_STATE_BITS= 2,
> + MQ_RQ_STATE_MASK= (1 << MQ_RQ_STATE_BITS) - 1,
> + MQ_RQ_GEN_INC   = 1 << MQ_RQ_STATE_BITS,
> +};
> +
> @@ -85,6 +98,38 @@ extern void blk_mq_rq_timed_out(struct request *req, bool 
> reserved);
> +/**
> + * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
> + * @rq: target request.
> + */
> +static inline int blk_mq_rq_state(struct request *rq)
> +{
> + return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
> +}
> +
> +/**
> + * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
> + * @rq: target request.
> + * @state: new state to set.
> + *
> + * Set @rq's state to @state.  The caller is responsible for ensuring that
> + * there are no other updaters.  A request can transition into IN_FLIGHT
> + * only from IDLE and doing so increments the generation number.
> + */
> +static inline void blk_mq_rq_update_state(struct request *rq,
> +   enum mq_rq_state state)
> +{
> + u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;
> +
> + if (state == MQ_RQ_IN_FLIGHT) {
> + WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> + new_val += MQ_RQ_GEN_INC;
> + }
> +
> + /* avoid exposing interim values */
> + WRITE_ONCE(rq->gstate, new_val);
> +}

Hello Tejun,

Have you considered the following instead of introducing MQ_RQ_IDLE and
MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
operations introduced in the hot path by this patch series.

static inline bool blk_mq_rq_in_flight(struct request *rq)
{
	return list_empty(&rq->queuelist);
}

Thanks,

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-07 Thread Bart Van Assche
On Thu, 2017-12-07 at 09:31 +0800, Ming Lei wrote:
> But if you always call blk_mq_sched_mark_restart_hctx() before a new
> dispatch, that may affect performance on NVMe which may never trigger
> BLK_STS_RESOURCE.

Hmm ... only the SCSI core implements .get_budget() and .put_budget() and
I proposed to insert a blk_mq_sched_mark_restart_hctx() call under "if
(q->mq_ops->get_budget)". In other words, I proposed to insert a
blk_mq_sched_mark_restart_hctx() call in a code path that is never triggered
by the NVMe driver. So I don't see how the change I proposed could affect
the performance of the NVMe driver?

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-07 Thread Bart Van Assche
On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > index db9556662e27..1816dd8259b3 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx 
> > > *hctx)
> > >  out_put_device:
> > >   put_device(&sdev->sdev_gendev);
> > >  out:
> > > + if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> > > + blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> > >   return false;
> > >  }
> > 
> > This cannot work since multiple threads can call scsi_mq_get_budget()
> 
> That is exactly the way we are handling these cases before 0df21c86bdbf ("scsi:
> implement .get_budget and .put_budget for blk-mq"), so if it can't work,
> that is not fault of commit 0df21c86bdbf.
> 
> > concurrently and hence it can happen that none of them sees
> > atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
> 
> If there is concurrent .get_budget(), one of them will see the counter
> becoming zero finally because each sdev->device_busy is inc/dec
> atomically. Or scsi_dev_queue_ready() return true.
> 
> Anyway, we need this patch to avoid possible regression. If you think
> there are bugs in blk-mq RESTART, just root cause and fix it.

Hello Ming,

When I looked at the patch at the start of this thread for the first time I
got frustrated because I didn't see how this patch could fix the queue stall
I ran into myself. Today I started realizing that what Holger reported is
probably a different issue from the one I ran into myself. Since this patch by
itself looks now useful to me:

Reviewed-by: Bart Van Assche <bart.vanass...@wdc.com>

BTW, do you think the patch at the start of this thread also fixes the issue
that resulted in commit 826a70a08b12 ("SCSI: don't get target/host busy_count
in scsi_mq_get_budget()")? Do you think we still need that patch?

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-06 Thread Bart Van Assche
On Wed, 2017-12-06 at 09:52 +0800, Ming Lei wrote:
> On Wed, Dec 06, 2017 at 12:28:25AM +0800, Ming Lei wrote:
> > On Tue, Dec 05, 2017 at 04:08:20PM +0000, Bart Van Assche wrote:
> > > The patch below is not a full solution but resulted in a significant
> > > improvement in my tests:
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 69e3226e66ca..9d86876ec503 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct 
> > > blk_mq_hw_ctx *hctx)
> > >* TODO: get more budgets, and dequeue more requests in
> > >* one time.
> > >*/
> > > + blk_mq_sched_mark_restart_hctx(hctx);
> > >   blk_mq_do_dispatch_ctx(hctx);
> > >   } else {
> > > >   blk_mq_flush_busy_ctxs(hctx, &rq_list);
> 
> BTW, this kind of change can't cover scsi_set_blocked() which is
> triggered by timeout, scsi dispatch failure. You will see that
> easily if you run the SCSI test script I provided in the commit log.

Hello Ming,

I am aware that the above change does not cover all cases. That's why I wrote
in my previous e-mail that that patch is not a full solution. The reason I
posted that change anyway is that I prefer a solution that is not based on
delayed queue runs (blk_mq_delay_run_hw_queue()) over one that is. My concern
is that the performance of a solution based on delayed queue runs will be
suboptimal.

Bart.

[PATCH v2 1/3] scsi: Fix a scsi_show_rq() NULL pointer dereference

2017-12-05 Thread Bart Van Assche
Avoid that scsi_show_rq() triggers a NULL pointer dereference if
called after sd_uninit_command(). Swap the NULL pointer assignment
and the mempool_free() call in sd_uninit_command() to make it less
likely that scsi_show_rq() triggers a use-after-free. Note: even
with these changes scsi_show_rq() can trigger a use-after-free but
that's a lesser evil than e.g. suppressing debug information for
T10-PI commands completely. This patch fixes the following oops:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: scsi_format_opcode_name+0x1a/0x1c0
CPU: 1 PID: 1881 Comm: cat Not tainted 4.14.0-rc2.blk_mq_io_hang+ #516
Call Trace:
 __scsi_format_command+0x27/0xc0
 scsi_show_rq+0x5c/0xc0
 __blk_mq_debugfs_rq_show+0x116/0x130
 blk_mq_debugfs_rq_show+0xe/0x10
 seq_read+0xfe/0x3b0
 full_proxy_read+0x54/0x90
 __vfs_read+0x37/0x160
 vfs_read+0x96/0x130
 SyS_read+0x55/0xc0
 entry_SYSCALL_64_fastpath+0x1a/0xa5

Fixes: 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")
Reported-by: Ming Lei <ming@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: James E.J. Bottomley <j...@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: sta...@vger.kernel.org
---
 drivers/scsi/scsi_debugfs.c | 6 ++++--
 drivers/scsi/sd.c   | 4 +++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index 01f08c03f2c1..c3765d29fd3f 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -8,9 +8,11 @@ void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
-   char buf[80];
+   const u8 *const cdb = READ_ONCE(cmd->cmnd);
+   char buf[80] = "(?)";
 
-   __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
+   if (cdb)
+   __scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
   cmd->retries, msecs / 1000, msecs % 1000);
 }
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index d175c5c5ccf8..d841743b2107 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1284,6 +1284,7 @@ static int sd_init_command(struct scsi_cmnd *cmd)
 static void sd_uninit_command(struct scsi_cmnd *SCpnt)
 {
struct request *rq = SCpnt->request;
+   u8 *cmnd;
 
if (SCpnt->flags & SCMD_ZONE_WRITE_LOCK)
sd_zbc_write_unlock_zone(SCpnt);
@@ -1292,9 +1293,10 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
__free_page(rq->special_vec.bv_page);
 
if (SCpnt->cmnd != scsi_req(rq)->cmd) {
-   mempool_free(SCpnt->cmnd, sd_cdb_pool);
+   cmnd = SCpnt->cmnd;
SCpnt->cmnd = NULL;
SCpnt->cmd_len = 0;
+   mempool_free(cmnd, sd_cdb_pool);
}
 }
 
-- 
2.15.0



[PATCH v2 3/3] scsi-mq-debugfs: Show more information

2017-12-05 Thread Bart Van Assche
Show the request result, request timeout and SCSI command flags.
This information is very helpful when trying to figure out why a
queue got stuck. An example of the information that is exported
through debugfs:

$ (cd /sys/kernel/debug/block && find -type f -print0 | xargs -0 grep ago)
./sda/hctx0/busy:8804a4523300 {.op=READ, 
.cmd_flags=FAILFAST_DEV|FAILFAST_TRANSPORT|FAILFAST_DRIVER|RAHEAD, 
.rq_flags=MQ_INFLIGHT|DONTPREP|IO_STAT|STATS, .atomic_flags=STARTED, .tag=24, 
.internal_tag=-1, .cmd=Read(10) 28 00 06 80 1c c8 00 00 08 00, .retries=0, 
.result = 0x0, .flags=TAGGED|INITIALIZED, .timeout=90.000, allocated 0.010 s 
ago}

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: James E.J. Bottomley <j...@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 drivers/scsi/scsi_debugfs.c | 41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index c3765d29fd3f..37ed6bb8e6ec 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -4,15 +4,50 @@
 #include 
 #include "scsi_debugfs.h"
 
+#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
+static const char *const scsi_cmd_flags[] = {
+   SCSI_CMD_FLAG_NAME(TAGGED),
+   SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
+   SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
+   SCSI_CMD_FLAG_NAME(INITIALIZED),
+};
+#undef SCSI_CMD_FLAG_NAME
+
+static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
+  const char *const *flag_name, int flag_name_count)
+{
+   bool sep = false;
+   int i;
+
+   for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
+   if (!(flags & BIT(i)))
+   continue;
+   if (sep)
+   seq_puts(m, "|");
+   sep = true;
+   if (i < flag_name_count && flag_name[i])
+   seq_puts(m, flag_name[i]);
+   else
+   seq_printf(m, "%d", i);
+   }
+   return 0;
+}
+
 void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
-   int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+   int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+   int timeout_ms = jiffies_to_msecs(rq->timeout);
const u8 *const cdb = READ_ONCE(cmd->cmnd);
char buf[80] = "(?)";
 
if (cdb)
__scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
-   seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
-  cmd->retries, msecs / 1000, msecs % 1000);
+   seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
+  cmd->retries, cmd->result);
+   scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
+   ARRAY_SIZE(scsi_cmd_flags));
+   seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
+  timeout_ms / 1000, timeout_ms % 1000,
+  alloc_ms / 1000, alloc_ms % 1000);
 }
-- 
2.15.0



[PATCH v2 2/3] blk-mq-debugfs: Also show requests that have not yet been started

2017-12-05 Thread Bart Van Assche
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

Fixes: commit 2720bab50258 ("blk-mq-debugfs: Show busy requests")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: sta...@vger.kernel.org
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f7db73f1698e..886b37163f17 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -409,7 +409,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
	const struct show_busy_params *params = data;
 
	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    list_empty(&rq->queuelist))
		__blk_mq_debugfs_rq_show(params->m,
					 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.0



[PATCH v2 0/3] Show all commands in debugfs

2017-12-05 Thread Bart Van Assche
Hello Jens,

While debugging an issue with the SCSI error handler I noticed that commands
that got stuck in that error handler are not shown in debugfs. That is very
annoying for anyone who relies on the information in debugfs for root-causing
such an issue. Hence this patch series that makes sure that commands that got
stuck in a block driver timeout handler are also shown in debugfs. Please
consider these patches for kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Split the SCSI patch into two patches.
- Added the tag "Cc: stable" to two of the three patches.
- Included a change for the SCSI sd driver.

Bart Van Assche (3):
  scsi: Fix a scsi_show_rq() NULL pointer dereference
  blk-mq-debugfs: Also show requests that have not yet been started
  scsi-mq-debugfs: Show more information

 block/blk-mq-debugfs.c  |  2 +-
 drivers/scsi/scsi_debugfs.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 drivers/scsi/sd.c   |  4 +++-
 3 files changed, 46 insertions(+), 7 deletions(-)

-- 
2.15.0



Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-05 Thread Bart Van Assche
On Wed, 2017-12-06 at 00:38 +0800, Ming Lei wrote:
> On Tue, Dec 05, 2017 at 04:22:33PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 13:00 +0800, Ming Lei wrote:
> > > No, do not mix two different things in one patch, especially the fix part
> > > need to be backported to stable.
> > > 
> > > The fix part should aim at V4.15, and the other part can be a V4.16
> > > stuff.
> > 
> > Does this mean that you do not plan to post a v5 of your patch and that you
> > want me to rework this patch series? I can do that.
> 
> I believe V4 has been OK for merge, actually the only concern from James
> is that 'set the cmnd to NULL *before* calling free so we narrow the race
> window.', but that isn't required as I explained, even though you don't do
> that in this patch too.

The next version of this patch series will include the sd driver change
requested by James.

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-05 Thread Bart Van Assche
On Wed, 2017-12-06 at 00:28 +0800, Ming Lei wrote:
> This is still a workaround for RESTART, see my comment before:
> 
>   https://marc.info/?l=linux-block&m=151217500929341&w=2

A quote from that e-mail: "The theory about using BLK_MQ_S_SCHED_RESTART in
current way is that we mark it after requests are added to hctx->dispatch".
Reading that makes me wonder whether you understand the purpose of the
BLK_MQ_S_SCHED_RESTART flag? That flag is not set after requests are added
to the dispatch list but after requests have been *removed*. The purpose of
that flag is to detect whether another thread has run the queue after
requests were removed from the dispatch list and before these were readded.
If so, the queue needs to be rerun.

Bart.

Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-05 Thread Bart Van Assche
On Tue, 2017-12-05 at 13:00 +0800, Ming Lei wrote:
> No, do not mix two different things in one patch, especially the fix part
> need to be backported to stable.
> 
> The fix part should aim at V4.15, and the other part can be a V4.16
> stuff.

Does this mean that you do not plan to post a v5 of your patch and that you
want me to rework this patch series? I can do that.

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-05 Thread Bart Van Assche
On Tue, 2017-12-05 at 15:29 +0100, Johannes Thumshirn wrote:
> 1) Testing without the patch applied hangs the test forever as it
>doesn't get killed after a specific timeout (I think this should be
>solved in a common function).

Hello Johannes,

If a request queue got stuck then the processes that submitted the requests
on that queue are unkillable. The only approach I know of to stop these
processes is to send a kill signal and next to trigger a queue run from user
space. One possible approach is to run the following command:

for d in /sys/kernel/debug/block/*; do echo kick >$d/state; done

Bart.

Re: [PATCH] SCSI: run queue if SCSI device queue isn't ready and queue is idle

2017-12-05 Thread Bart Van Assche
On Tue, 2017-12-05 at 15:52 +0800, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index db9556662e27..1816dd8259b3 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1967,6 +1967,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
>  out_put_device:
>   put_device(&sdev->sdev_gendev);
>  out:
> + if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
> + blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
>   return false;
>  }

This cannot work since multiple threads can call scsi_mq_get_budget()
concurrently and hence it can happen that none of them sees
atomic_read(&sdev->device_busy) == 0. BTW, the above patch is something I
consider as a workaround. Have you considered to fix the root cause and to
add blk_mq_sched_mark_restart_hctx() where these are missing, e.g. in
blk_mq_sched_dispatch_requests()? The patch below is not a full solution
but resulted in a significant improvement in my tests:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 69e3226e66ca..9d86876ec503 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -226,6 +226,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 * TODO: get more budgets, and dequeue more requests in
 * one time.
 */
+   blk_mq_sched_mark_restart_hctx(hctx);
blk_mq_do_dispatch_ctx(hctx);
} else {
		blk_mq_flush_busy_ctxs(hctx, &rq_list);

Bart.

Re: [PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 09:15 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 04:38:08PM -0800, Bart Van Assche wrote:
> > Since the next patch will make it possible that scsi_show_rq() gets
> > called before the CDB pointer is changed into a non-NULL value,
> > only show the CDB if the CDB pointer is not NULL. Additionally,
> > show the request timeout and SCSI command flags. This patch also
> > fixes a bug that was reported by Ming Lei. See also Ming Lei,
> > scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
> > 2017 (https://marc.info/?l=linux-block&m=151006655317188).
> 
> Please cook a patch for fixing the crash issue only, since we need
> to backport the fix to stable kernel.

The code that is touched by this patch is only used for kernel debugging.
I will do this if others agree with your opinion.

Bart.

Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 09:04 +0800, Ming Lei wrote:
> Then no reason to revert commit 0df21c86bdbf ("scsi: implement .get_budget and
> .put_budget for blk-mq") for one issue which may never happen in reality since
> this reproducer needs an out-of-tree patch.

Sorry but I disagree completely. You seem to overlook that there may be other
circumstances that trigger the same lockup, e.g. a SCSI queue full condition.

Bart.

[PATCH 0/2] Show commands stuck in a timeout handler in debugfs

2017-12-04 Thread Bart Van Assche
Hello Jens,

While debugging an issue with the SCSI error handler I noticed that commands
that got stuck in that error handler are not shown in debugfs. That is very
annoying for anyone who relies on the information in debugfs for root-causing
such an issue. Hence this patch series that makes sure that commands that got
stuck in a block driver timeout handler are also shown in debugfs. Please
consider these patches for kernel v4.16.

Thanks,

Bart.

Bart Van Assche (2):
  scsi-mq: Only show the CDB if available
  blk-mq-debugfs: Also show requests that have not yet been started

 block/blk-mq-debugfs.c  |  2 +-
 drivers/scsi/scsi_debugfs.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 43 insertions(+), 6 deletions(-)

-- 
2.15.0



[PATCH 1/2] scsi-mq: Only show the CDB if available

2017-12-04 Thread Bart Van Assche
Since the next patch will make it possible that scsi_show_rq() gets
called before the CDB pointer is changed into a non-NULL value,
only show the CDB if the CDB pointer is not NULL. Additionally,
show the request timeout and SCSI command flags. This patch also
fixes a bug that was reported by Ming Lei. See also Ming Lei,
scsi_debugfs: fix crash in scsi_show_rq(), linux-scsi, 7 November
2017 (https://marc.info/?l=linux-block&m=151006655317188).

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: James E.J. Bottomley <j...@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 drivers/scsi/scsi_debugfs.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_debugfs.c b/drivers/scsi/scsi_debugfs.c
index 01f08c03f2c1..37ed6bb8e6ec 100644
--- a/drivers/scsi/scsi_debugfs.c
+++ b/drivers/scsi/scsi_debugfs.c
@@ -4,13 +4,50 @@
 #include 
 #include "scsi_debugfs.h"
 
+#define SCSI_CMD_FLAG_NAME(name) [ilog2(SCMD_##name)] = #name
+static const char *const scsi_cmd_flags[] = {
+   SCSI_CMD_FLAG_NAME(TAGGED),
+   SCSI_CMD_FLAG_NAME(UNCHECKED_ISA_DMA),
+   SCSI_CMD_FLAG_NAME(ZONE_WRITE_LOCK),
+   SCSI_CMD_FLAG_NAME(INITIALIZED),
+};
+#undef SCSI_CMD_FLAG_NAME
+
+static int scsi_flags_show(struct seq_file *m, const unsigned long flags,
+  const char *const *flag_name, int flag_name_count)
+{
+   bool sep = false;
+   int i;
+
+   for (i = 0; i < sizeof(flags) * BITS_PER_BYTE; i++) {
+   if (!(flags & BIT(i)))
+   continue;
+   if (sep)
+   seq_puts(m, "|");
+   sep = true;
+   if (i < flag_name_count && flag_name[i])
+   seq_puts(m, flag_name[i]);
+   else
+   seq_printf(m, "%d", i);
+   }
+   return 0;
+}
+
 void scsi_show_rq(struct seq_file *m, struct request *rq)
 {
struct scsi_cmnd *cmd = container_of(scsi_req(rq), typeof(*cmd), req);
-   int msecs = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
-   char buf[80];
+   int alloc_ms = jiffies_to_msecs(jiffies - cmd->jiffies_at_alloc);
+   int timeout_ms = jiffies_to_msecs(rq->timeout);
+   const u8 *const cdb = READ_ONCE(cmd->cmnd);
+   char buf[80] = "(?)";
 
-   __scsi_format_command(buf, sizeof(buf), cmd->cmnd, cmd->cmd_len);
-   seq_printf(m, ", .cmd=%s, .retries=%d, allocated %d.%03d s ago", buf,
-  cmd->retries, msecs / 1000, msecs % 1000);
+   if (cdb)
+   __scsi_format_command(buf, sizeof(buf), cdb, cmd->cmd_len);
+   seq_printf(m, ", .cmd=%s, .retries=%d, .result = %#x, .flags=", buf,
+  cmd->retries, cmd->result);
+   scsi_flags_show(m, cmd->flags, scsi_cmd_flags,
+   ARRAY_SIZE(scsi_cmd_flags));
+   seq_printf(m, ", .timeout=%d.%03d, allocated %d.%03d s ago",
+  timeout_ms / 1000, timeout_ms % 1000,
+  alloc_ms / 1000, alloc_ms % 1000);
 }
-- 
2.15.0



[PATCH 2/2] blk-mq-debugfs: Also show requests that have not yet been started

2017-12-04 Thread Bart Van Assche
When debugging e.g. the SCSI timeout handler it is important that
requests that have not yet been started or that already have
completed are also reported through debugfs.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
---
 block/blk-mq-debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f7db73f1698e..886b37163f17 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -409,7 +409,7 @@ static void hctx_show_busy_rq(struct request *rq, void *data, bool reserved)
	const struct show_busy_params *params = data;
 
	if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
-	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+	    list_empty(&rq->queuelist))
		__blk_mq_debugfs_rq_show(params->m,
					 list_entry_rq(&rq->queuelist));
 }
-- 
2.15.0



Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 08:20 +0800, Ming Lei wrote:
> Also it is a bit odd to see request in hctx->dispatch now, and it can only
> happen now when scsi_target_queue_ready() returns false, so I guess you apply
> some change on target->can_queue(such as setting it as 1 in srp/ib code
> manually)?

Yes, but that had already been mentioned. From the e-mail at the start of
this e-mail thread: "Change the SRP initiator such that SCSI target queue
depth is limited to 1." The changes I made in the SRP initiator are the same
as those described in the following message from about one month ago:
https://www.spinics.net/lists/linux-scsi/msg114720.html.

Bart.

Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 07:01 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 10:48:18PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> > > On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > > > * A systematic lockup for SCSI queues with queue depth 1. The
> > > >   following test reproduces that bug systematically:
> > > >   - Change the SRP initiator such that SCSI target queue depth is
> > > > limited to 1.
> > > >   - Run the following command:
> > > >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> > > >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> > > >   stalls when sharing tags"
> > > >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> > > >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> > > >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> > > >   before all blk_mq_dispatch_rq_list() calls only fixes the
> > > >   systematic lockup for queue depth 1.
> > > 
> > > You are the only reproducer [ ... ]
> > 
> > That's not correct. I'm pretty sure if you try to reproduce this that
> > you will see the same hang I ran into. Does this mean that you have not
> > yet tried to reproduce the hang I reported?
> 
> Do you mean every kernel developer has to own one SRP/IB hardware?

When I have the time I will make it possible to run this test on any system
equipped with at least one Ethernet port. But the fact that the test I
mentioned requires IB hardware should not prevent you from running this test
since you have run this test software before.

> I don't have your hardware to reproduce that,

That's not true. Your employer definitely owns IB hardware. E.g. the
following message shows that you have run the srp-test yourself on IB hardware
only four weeks ago:

https://www.spinics.net/lists/linux-block/msg19511.html

> Otherwise, there should have be such similar reports from others, not from
> only you.

That's not correct either. How long was it ago that kernel v4.15-rc1 was
released? One week? How many SRP users do you think have tried to trigger a
queue full condition with that kernel version?

> More importantly I don't understand why you can't share the kernel
> log/debugfs log when IO hang happens?
> 
> Without any kernel log, how can we confirm that it is a valid report?

It's really unfortunate that you are focussing on denying that the bug I
reported exists instead of trying to fix the bugs introduced by commit
b347689ffbca. BTW, I have more than enough experience to decide myself what
a valid report is and what not. I can easily send you several MB of kernel
logs. The reason I have not yet done this is because I'm 99.9% sure that
these won't help to root cause the reported issue. But I can tell you what
I learned from analyzing the information under /sys/kernel/debug/block:
every time a hang occurred I noticed that no requests were "busy", that two
requests occurred in rq_lists and one request occurred in a hctx dispatch
list. This is enough to conclude that a queue run was missing. And I think
that the patch at the start of this e-mail thread not only shows that the
root cause is in the block layer but also that this bug was introduced by
commit b347689ffbca.

> > > You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> > > improve dispatching from sw queue")', but you don't mention any issue
> > > about that commit.
> > 
> > That's not correct either. From the commit message "A systematic lockup
> > for SCSI queues with queue depth 1."
> 
> I mean you mentioned your patch can fix 'commit b347689ffbca
> ("blk-mq-sched: improve dispatching from sw queue")', but you never
> point where the commit b347689ffbca is wrong, how your patch fixes
> the mistake of that commit.

You should know that it is not required to perform a root cause analysis
before posting a revert. Having performed a bisect is sufficient.

BTW, it seems like you forgot that last Friday I explained to you that there
is an obvious bug in commit b347689ffbca, namely that a blk_mq_sched_mark_restart_hctx()
call is missing in blk_mq_sched_dispatch_requests() before the blk_mq_do_dispatch_ctx()
call. See also https://marc.info/?l=linux-block&m=151215794224401.

Bart.

Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 06:45 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 03:09:20PM +0000, Bart Van Assche wrote:
> > On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> > > Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
> > 
> > It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
> > issues introduced by that commit for kernel version v4.15 ...
> 
> What are all issues in v4.15-rc? Up to now, it is the only issue reported,
> and can be fixed by this simple patch, which one can be thought as cleanup
> too.

The three issues I described in the commit message of the patch that is
available at: https://marc.info/?l=linux-block&m=151240866010572.

Bart.

Re: [PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
On Tue, 2017-12-05 at 06:42 +0800, Ming Lei wrote:
> On Mon, Dec 04, 2017 at 09:30:32AM -0800, Bart Van Assche wrote:
> > * A systematic lockup for SCSI queues with queue depth 1. The
> >   following test reproduces that bug systematically:
> >   - Change the SRP initiator such that SCSI target queue depth is
> > limited to 1.
> >   - Run the following command:
> >   srp-test/run_tests -f xfs -d -e none -r 60 -t 01
> >   See also "[PATCH 4/7] blk-mq: Avoid that request processing
> >   stalls when sharing tags"
> >   (https://marc.info/?l=linux-block&m=151208695316857). Note:
> >   reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
> >   queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
> >   before all blk_mq_dispatch_rq_list() calls only fixes the
> >   systematic lockup for queue depth 1.
> 
> You are the only reproducer [ ... ]

That's not correct. I'm pretty sure if you try to reproduce this that
you will see the same hang I ran into. Does this mean that you have not
yet tried to reproduce the hang I reported?

> You said that your patch fixes 'commit b347689ffbca ("blk-mq-sched:
> improve dispatching from sw queue")', but you don't mention any issue
> about that commit.

That's not correct either. From the commit message "A systematic lockup
for SCSI queues with queue depth 1."

> > I think the above means that it is too risky to try to fix all bugs
> > introduced by commit 0df21c86bdbf before kernel v4.15 is released.
> > Hence revert that commit.
> 
> What is the risk?

That more bugs were introduced by commit 0df21c86bdbf than the ones that
have been discovered so far.

Bart.

[PATCH] blk-mq: Fix several SCSI request queue lockups

2017-12-04 Thread Bart Van Assche
Commit 0df21c86bdbf introduced several bugs:
* A SCSI queue stall for queue depths > 1, addressed by commit
  88022d7201e9 ("blk-mq: don't handle failure in .get_budget")
* A systematic lockup for SCSI queues with queue depth 1. The
  following test reproduces that bug systematically:
  - Change the SRP initiator such that SCSI target queue depth is
limited to 1.
  - Run the following command:
  srp-test/run_tests -f xfs -d -e none -r 60 -t 01
  See also "[PATCH 4/7] blk-mq: Avoid that request processing
  stalls when sharing tags"
  (https://marc.info/?l=linux-block&m=151208695316857). Note:
  reverting commit 0df21c86bdbf also fixes a sporadic SCSI request
  queue lockup while inserting a blk_mq_sched_mark_restart_hctx()
  before all blk_mq_dispatch_rq_list() calls only fixes the
  systematic lockup for queue depth 1.
* A scsi_debug lockup - see also "[PATCH] SCSI: delay run queue if
  device is blocked in scsi_dev_queue_ready()"
  (https://marc.info/?l=linux-block&m=151223233407154).

I think the above means that it is too risky to try to fix all bugs
introduced by commit 0df21c86bdbf before kernel v4.15 is released.
Hence revert that commit.

Fixes: commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.com>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: James E.J. Bottomley <j...@linux.vnet.ibm.com>
Cc: Martin K. Petersen <martin.peter...@oracle.com>
Cc: linux-s...@vger.kernel.org
---
 drivers/scsi/scsi_lib.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 84bd2b16d216..a7e7966f1477 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1976,9 +1976,11 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
struct scsi_device *sdev = q->queuedata;
struct Scsi_Host *shost = sdev->host;
struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_RESOURCE;
int reason;
 
+   if (!scsi_mq_get_budget(hctx))
+   goto out;
ret = prep_to_mq(scsi_prep_state_check(sdev, req));
if (ret != BLK_STS_OK)
goto out_put_budget;
@@ -2022,6 +2024,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
	atomic_dec(&scsi_target(sdev)->target_busy);
 out_put_budget:
scsi_mq_put_budget(hctx);
+out:
switch (ret) {
case BLK_STS_OK:
break;
@@ -2225,8 +2228,6 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
 }
 
 static const struct blk_mq_ops scsi_mq_ops = {
-   .get_budget = scsi_mq_get_budget,
-   .put_budget = scsi_mq_put_budget,
.queue_rq   = scsi_queue_rq,
.complete   = scsi_softirq_done,
.timeout= scsi_timeout,
-- 
2.15.0



Re: [PATCH] SCSI: delay run queue if device is blocked in scsi_dev_queue_ready()

2017-12-04 Thread Bart Van Assche
On Sun, 2017-12-03 at 00:31 +0800, Ming Lei wrote:
> Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")

It might be safer to revert commit 0df21c86bdbf instead of trying to fix all
issues introduced by that commit for kernel version v4.15 ...

Bart.

Re: [PATCH 4/7] blk-mq: Avoid that request processing stalls when sharing tags

2017-12-01 Thread Bart Van Assche
On Sat, 2017-12-02 at 09:00 +0800, Ming Lei wrote:
> On Sat, Dec 02, 2017 at 12:48:51AM +0000, Bart Van Assche wrote:
> > Further tests have shown that the lockup I referred to does not occur before
> > commit b347689ffbca but that it occurs with b347689ffbca.
> 
> Then you need to root cause it, or

It's not my job to root cause bugs introduced by your patches.

> Provide debugfs log and reproduction steps, please.

How I reproduce this bug is something I have already mentioned many times in
previous e-mails.

Bart.


Re: [PATCH 4/7] blk-mq: Avoid that request processing stalls when sharing tags

2017-12-01 Thread Bart Van Assche
On Sat, 2017-12-02 at 08:36 +0800, Ming Lei wrote:
> On Fri, Dec 01, 2017 at 07:52:14PM +0000, Bart Van Assche wrote:
> > On Fri, 2017-12-01 at 10:58 +0800, Ming Lei wrote:
> > > On Thu, Nov 30, 2017 at 04:08:45PM -0800, Bart Van Assche wrote:
> > > > blk_mq_dispatch_rq_list() is called. Make sure that
> > > > BLK_MQ_S_SCHED_RESTART is set before any blk_mq_dispatch_rq_list()
> > > > call occurs.
> > > > 
> > > > Fixes: commit b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
> > > 
> > > We always mark RESTART state bit just before dispatching from ->dispatch_list,
> > > this way has been there before b347689ffbca, which doesn't change this
> > > RESTART mechanism, so please explain a bit why it is a fix on commit
> > > b347689ffbca.
> > 
> > I'm not completely sure which patch introduced the lockup fixed by this patch
> > but I will have another look whether this was really introduced by commit
> > b347689ffbca.
> 
> Please make sure 'Fixes' tag correct.

Further tests have shown that the lockup I referred to does not occur before
commit b347689ffbca but that it occurs with b347689ffbca. I think that shows
clearly that commit b347689ffbca introduced this lockup.

Bart.

Re: [PATCH 4/7] blk-mq: Avoid that request processing stalls when sharing tags

2017-12-01 Thread Bart Van Assche
On Fri, 2017-12-01 at 10:58 +0800, Ming Lei wrote:
> On Thu, Nov 30, 2017 at 04:08:45PM -0800, Bart Van Assche wrote:
> > blk_mq_sched_mark_restart_hctx() must be called before
> 
> Could you please describe the theory on commit log? Like, why is it
> a must? and what is the issue to be fixed? 

The BLK_MQ_S_SCHED_RESTART test at the end of blk_mq_dispatch_rq_list() can
only work if BLK_MQ_S_SCHED_RESTART is set before blk_mq_dispatch_rq_list()
is called. BTW, without this patch every iteration of my test triggers a
queue stall. With this patch a queue stall only occurs sporadically so I
think we really need something like this patch.

> > blk_mq_dispatch_rq_list() is called. Make sure that
> > BLK_MQ_S_SCHED_RESTART is set before any blk_mq_dispatch_rq_list()
> > call occurs.
> > 
> > Fixes: commit b347689ffbca ("blk-mq-sched: improve dispatching from sw queue")
> 
> We always mark RESTART state bit just before dispatching from ->dispatch_list,
> this way has been there before b347689ffbca, which doesn't change this
> RESTART mechanism, so please explain a bit why it is a fix on commit
> b347689ffbca.

I'm not completely sure which patch introduced the lockup fixed by this patch
but I will have another look whether this was really introduced by commit
b347689ffbca.

Bart.

Re: [PATCH] blk-mq: improve tag waiting setup for non-shared tags

2017-12-01 Thread Bart Van Assche
On Thu, 2017-11-09 at 16:00 -0700, Jens Axboe wrote:
> + spin_lock(&this_hctx->lock);
> + if (!list_empty(&wait->entry)) {
> + spin_unlock(&this_hctx->lock);
> + return false;
> + }
>  
> - ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> - add_wait_queue(&ws->wait, wait);
> + ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> + add_wait_queue(&ws->wait, wait);
> + }

Hello Jens,

My understanding is that all code that adds a dispatch_wait entry to a wait
queue or removes it from a wait queue is protected by ws->wait.lock. Can you
explain why the above list_empty() check is protected by this_hctx->lock?

Thanks,

Bart.

Re: [PATCH 2/7] block: Document more locking requirements

2017-12-01 Thread Bart Van Assche
On Thu, 2017-11-30 at 20:03 -0700, Jens Axboe wrote:
> On 11/30/2017 05:08 PM, Bart Van Assche wrote:
> > This patch does not change any functionality.
> 
> Unless these actually found real bugs, I think it's pointless.
> Add a comment.
 
Hello Jens,

As you know lockdep_assert_held() statements are verified at runtime with
LOCKDEP enabled but comments not. Hence my preference for lockdep_assert_held()
over source code comments.

> And this:
>
> > @@ -1003,9 +1007,14 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
> >  static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> > int flags, void *key)
> >  {
> > -   struct blk_mq_hw_ctx *hctx;
> > +   struct blk_mq_hw_ctx *hctx =
> > +   container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
> > +
> > +#ifdef CONFIG_LOCKDEP
> > +   struct sbq_wait_state *ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
> >  
> > -   hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
> > +   lockdep_assert_held(&ws->wait.lock);
> > +#endif
> 
> we're definitely not doing.

Can you explain me why you think the above is wrong? My understanding is
that the call chain for the above function is as follows:

blk_mq_tag_wakeup_all()
-> sbitmap_queue_wake_all()
  -> wake_up()
-> __wake_up()
  -> __wake_up_common_lock()
-> spin_lock_irqsave(&wq_head->lock, flags);
-> __wake_up_common()
  -> list_for_each_entry_safe_from(curr, next, &wq_head->head, entry)
  ->   ret = curr->func(curr, mode, wake_flags, key)
-> spin_unlock_irqrestore(&wq_head->lock, flags);

In other words, blk_mq_dispatch_wake() is called with the wait queue head
lock held.

Bart.

Re: [PATCH V2 0/2] block: fix queue freeze and cleanup

2017-12-01 Thread Bart Van Assche
On Fri, 2017-12-01 at 13:36 -0200, Mauricio Faria de Oliveira wrote:
> On 11/29/2017 12:57 AM, chenxiang (M) wrote:
> > I applied this v2 patchset to kernel 4.15-rc1, running fio on a SATA 
> > disk, then disable the disk with sysfs interface
> > (echo 0 > /sys/class/sas_phy/phy-1:0:1/enable), and find system is hung. 
> > But with v1 patch, it doesn't
> > has this issue. Please have a check.
> 
> Indeed, with this particular test-case (thanks, chenxiang) the problem
> can be recreated with PATCH v2 but _not_ with v1.
> 
> For reference, I'm including the tests with v2 in this e-mail.
> The same tests have too been performed with v1, without blocked tasks.
> 
> Interestingly, physical disk pulls did not hit the problem either
> on v1 or v2 (but it does not matter anymore) -- so v1 is the one.

The test chenxiang ran does not prove that there is anything wrong with v2.
Maybe chenxiang hit the issue described in https://lkml.org/lkml/2017/9/5/381?

Bart.

[PATCH 7/7] blk-mq: Fix another queue stall

2017-11-30 Thread Bart Van Assche
The following code at the end of blk_mq_dispatch_rq_list() detects
whether or not wake_up(&hctx->dispatch_wait) has been called
concurrently with pushing back requests onto the dispatch list:

	list_empty_careful(&hctx->dispatch_wait.entry)

Since blk_mq_dispatch_wake() is protected by another lock than the
dispatch list and since blk_mq_run_hw_queue() does not acquire any
lock if it notices that no requests are pending,
blk_mq_dispatch_wake() is not ordered against the code that pushes
back requests onto the dispatch list. Avoid that the dispatch_wait
empty check fails due to load/store reordering by serializing it
against the dispatch_wait queue wakeup. This patch fixes a queue
stall I ran into while testing a SCSI initiator driver with the
maximum target depth set to one.

Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
---
 block/blk-mq.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b4225f606737..a11767a4d95c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1074,6 +1074,20 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
return ret;
 }
 
+static bool blk_mq_dispatch_list_empty(struct blk_mq_hw_ctx *hctx)
+{
+   struct sbq_wait_state *ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
+   struct wait_queue_head *wq_head = &ws->wait;
+   unsigned long flags;
+   bool result;
+
+   spin_lock_irqsave(&wq_head->lock, flags);
+   result = list_empty(&hctx->dispatch_wait.entry);
+   spin_unlock_irqrestore(&wq_head->lock, flags);
+
+   return result;
+}
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1197,7 +1211,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 */
if (restart ||
!blk_mq_sched_needs_restart(hctx) ||
-   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
+   (no_tag && blk_mq_dispatch_list_empty(hctx)))
blk_mq_run_hw_queue(hctx, true);
}
 
-- 
2.15.0



[PATCH 6/7] blk-mq: Rerun hardware queues after having called .put_budget()

2017-11-30 Thread Bart Van Assche
During request dispatch, after a scheduler or per-CPU queue has
been examined, .put_budget() is called if the examined queue is
empty. Since a new request may be queued concurrently with the
.put_budget() call, a request queue needs to be rerun after each
.put_budget() call.

Fixes: commit 1f460b63d4b3 ("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE")
Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
Cc: Ming Lei <ming@redhat.com>
Cc: Omar Sandoval <osan...@fb.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Hannes Reinecke <h...@suse.de>
Cc: Johannes Thumshirn <jthumsh...@suse.de>
Cc: <sta...@vger.kernel.org>
---
 block/blk-mq-sched.c | 39 ---
 block/blk-mq-sched.h |  2 +-
 block/blk-mq.c   | 17 -
 3 files changed, 33 insertions(+), 25 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 398545d94521..3a935081a2d3 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -82,12 +82,8 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
return blk_mq_run_hw_queue(hctx, true);
 }
 
-/*
- * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
- * its queue by itself in its completion handler, so we don't need to
- * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
- */
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
+/* returns true if hctx needs to be run again */
+static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
@@ -106,7 +102,7 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
rq = e->type->ops.mq.dispatch_request(hctx);
if (!rq) {
blk_mq_put_dispatch_budget(hctx);
-   break;
+   return true;
}
 
/*
@@ -116,6 +112,8 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 */
list_add(&rq->queuelist, &rq_list);
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
+
+   return false;
 }
 
 static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
@@ -129,16 +127,13 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
return hctx->ctxs[idx];
 }
 
-/*
- * Only SCSI implements .get_budget and .put_budget, and SCSI restarts
- * its queue by itself in its completion handler, so we don't need to
- * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE.
- */
-static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
+/* returns true if hctx needs to be run again */
+static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
LIST_HEAD(rq_list);
struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+   bool ret = false;
 
do {
struct request *rq;
@@ -152,6 +147,7 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
rq = blk_mq_dequeue_from_ctx(hctx, ctx);
if (!rq) {
blk_mq_put_dispatch_budget(hctx);
+   ret = true;
break;
}
 
@@ -168,19 +164,22 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
} while (blk_mq_dispatch_rq_list(q, &rq_list, true));
 
WRITE_ONCE(hctx->dispatch_from, ctx);
+
+   return ret;
 }
 
 /* return true if hw queue need to be run again */
-void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
+bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
LIST_HEAD(rq_list);
+   bool run_queue = false;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
-   return;
+   return false;
 
hctx->run++;
 
@@ -212,12 +211,12 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
if (!list_empty(&rq_list)) {
if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
if (has_sched_dispatch)
-   blk_mq_do_dispatch_sched(hctx);
+   run_queue = blk_mq_do_dispatch_sched(hctx);
else
-   blk_mq_do_dispatch_ctx(hctx);
+   run_queue = blk_mq_do_dispatch_ctx(hctx);
}
} else if (has_sched_dispatch) {
-   blk_mq_do_dispatch_sched(hctx)
