Re: [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime

2019-04-17 Thread Keith Busch
On Tue, Apr 16, 2019 at 08:44:10PM -0700, Ming Lei wrote:
> Hannes reported the following kernel oops:
> 
> There is a race condition between namespace rescanning and
> controller reset; during controller reset all namespaces are
> quiesced via nams_stop_ctrl(), and after reset all namespaces
> are unquiesced again.
> When namespace scanning was active by the time controller reset
> was triggered, the rescan code will call nvme_ns_remove(), which
> then will cause a kernel crash in nvme_start_ctrl() as it'll trip
> over uninitialized namespaces.
> 
> Patch "blk-mq: free hw queue's resource in hctx's release handler"
> should make this issue quite difficult to trigger. However it can't
> kill the issue completely becasue pre-condition of that patch is to
> hold request queue's refcount before calling block layer API, and
> there is still a small window between blk_cleanup_queue() and removing
> the ns from the controller namspace list in nvme_ns_remove().
> 
> Hold the request queue's refcount until the ns is freed; then the above race
> can be avoided completely. Given that 'namespaces_rwsem' is always held
> to retrieve the ns for starting/stopping its request queue, this lock can
> prevent namespaces from being freed.

This looks good to me.

Reviewed-by: Keith Busch 
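
A minimal sketch of the approach described above, assuming the stock
blk_get_queue()/blk_put_queue() helpers; the pin/unpin function names are
illustrative only, not the actual patch:

static int nvme_ns_pin_queue(struct nvme_ns *ns)
{
        /* blk_get_queue() only fails if the queue is already dying */
        if (!blk_get_queue(ns->queue))
                return -ENXIO;
        return 0;
}

static void nvme_ns_unpin_queue(struct nvme_ns *ns)
{
        /* the final reference drop frees the queue after blk_cleanup_queue() */
        blk_put_queue(ns->queue);
}

Taking the extra reference when the namespace is created and dropping it only
when the ns itself is freed closes the window between blk_cleanup_queue() and
removal from the controller's namespace list.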


Re: Question on handling managed IRQs when hotplugging CPUs

2019-02-05 Thread Keith Busch
On Tue, Feb 05, 2019 at 04:10:47PM +0100, Hannes Reinecke wrote:
> On 2/5/19 3:52 PM, Keith Busch wrote:
> > Whichever layer dispatched the IO to a CPU specific context should
> > be the one to wait for its completion. That should be blk-mq for most
> > block drivers.
> > 
> Indeed.
> But we don't provide any mechanisms for that ATM, right?
> 
> Maybe this would be a topic fit for LSF/MM?

Right, there's nothing handling this now, and sounds like it'd be a good
discussion to bring to the storage track.


Re: Question on handling managed IRQs when hotplugging CPUs

2019-02-05 Thread Keith Busch
On Tue, Feb 05, 2019 at 03:09:28PM +, John Garry wrote:
> On 05/02/2019 14:52, Keith Busch wrote:
> > On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
> > > On 04/02/2019 07:12, Hannes Reinecke wrote:
> > > 
> > > Hi Hannes,
> > > 
> > > > 
> > > > So, as the user then has to wait for the system to declare 'ready for
> > > > CPU remove', why can't we just disable the SQ and wait for all I/O to
> > > > complete?
> > > > We can make it more fine-grained by just waiting on all outstanding I/O
> > > > on that SQ to complete, but waiting for all I/O should be good as an
> > > > initial try.
> > > > With that we wouldn't need to fiddle with driver internals, and could
> > > > make it pretty generic.
> > > 
> > > I don't fully understand this idea - specifically, at which layer would
> > > we be waiting for all the IO to complete?
> > 
> > Whichever layer dispatched the IO to a CPU specific context should
> > be the one to wait for its completion. That should be blk-mq for most
> > block drivers.
> 
> For SCSI devices, unfortunately not all IO sent to the HW originates from
> blk-mq or any other single entity.

Then they'll need to register their own CPU notifiers and handle the
ones they dispatched.
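
A rough sketch of such a notifier, using cpuhp_setup_state();
my_drain_dispatched_io() is a hypothetical driver-private helper standing in
for whatever the driver uses to wait on I/O it issued on that CPU outside of
blk-mq:

#include <linux/cpuhotplug.h>

static int my_cpu_offline(unsigned int cpu)
{
        /* wait for the I/O this driver dispatched on @cpu */
        my_drain_dispatched_io(cpu);
        return 0;
}

static int my_register_cpu_notifier(void)
{
        int ret;

        /* dynamic hotplug state; the teardown callback runs as the CPU goes offline */
        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mydrv:online",
                                NULL, my_cpu_offline);
        return ret < 0 ? ret : 0;
}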


Re: Question on handling managed IRQs when hotplugging CPUs

2019-02-05 Thread Keith Busch
On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
> On 04/02/2019 07:12, Hannes Reinecke wrote:
> 
> Hi Hannes,
> 
> >
> > So, as the user then has to wait for the system to declare 'ready for
> > CPU remove', why can't we just disable the SQ and wait for all I/O to
> > complete?
> > We can make it more fine-grained by just waiting on all outstanding I/O
> > on that SQ to complete, but waiting for all I/O should be good as an
> > initial try.
> > With that we wouldn't need to fiddle with driver internals, and could
> > make it pretty generic.
> 
> I don't fully understand this idea - specifically, at which layer would 
> we be waiting for all the IO to complete?

Whichever layer dispatched the IO to a CPU specific context should
be the one to wait for its completion. That should be blk-mq for most
block drivers.


Re: [PATCH v2 7/7] Remove an atomic instruction from the hot path

2019-01-15 Thread Keith Busch
On Tue, Jan 15, 2019 at 04:50:03PM -0800, Bart Van Assche wrote:
> From scsi_init_command(), a function called by scsi_mq_prep_fn():
> 
>   /* zero out the cmd, except for the embedded scsi_request */
>   memset((char *)cmd + sizeof(cmd->req), 0,
>   sizeof(*cmd) - sizeof(cmd->req) + dev->host->hostt->cmd_size);
> 
> In other words, scsi_mq_prep_fn() clears scsi_cmnd.flags. Hence move
> the clear_bit() call into the else branch, the only branch in which
> this code is necessary.

Right, good call. You can even safely use "cmd->state = 0" in this
path instead of clear_bit(), but last time there were objections to
it looking inconsistent with the completion+timeout paths that use it
atomically. Anyway,

Reviewed-by: Keith Busch 
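
For context, a rough sketch of the hot-path shape being proposed (based on the
description above, not the exact patch): scsi_mq_prep_fn()'s memset() already
zeroes cmd->state, so the clear_bit() is only needed on the reprep path:

        if (!(req->rq_flags & RQF_DONTPREP)) {
                /* scsi_mq_prep_fn() zeroes cmd->state as part of its memset() */
                ret = scsi_mq_prep_fn(req);
                if (ret != BLK_STS_OK)
                        goto out_dec_host_busy;
                req->rq_flags |= RQF_DONTPREP;
        } else {
                clear_bit(SCMD_STATE_COMPLETE, &cmd->state);
                blk_mq_start_request(req);
        }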


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 06:11:59PM +0100, Christoph Hellwig wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index a82830f39933..d0ef540711c7 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
> >  
> >  int blk_mq_request_started(struct request *rq)
> >  {
> > -   return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
> > +   return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
> >  }
> >  EXPORT_SYMBOL_GPL(blk_mq_request_started);
> 
> Independ of this series this change looks like the right thing to do.
> But this whole area is a mine field, so separate testing would be
> very helpful.
>
> I also wonder why we even bother with the above helper, a direct
> state comparison seems a lot more obvious to the reader.

I think it's just because blk_mq_rq_state() is a private interface. The
enum mq_rq_state is defined under include/linux/, so it looks okay to
make getting the state public too.
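
For reference, the request states involved (enum mq_rq_state in
include/linux/blk-mq.h at the time) are roughly:

enum mq_rq_state {
        MQ_RQ_IDLE              = 0,    /* allocated, not yet started */
        MQ_RQ_IN_FLIGHT         = 1,    /* started and owned by the driver */
        MQ_RQ_COMPLETE          = 2,    /* completion has been signalled */
};

so "!= MQ_RQ_IDLE" is also true for completed requests, while
"== MQ_RQ_IN_FLIGHT" only covers requests the driver still owns.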
 
> Last but not least the blk_mq_request_started check in nbd
> should probably be lifted into blk_mq_tag_to_rq while we're at it..
> 
> As for the nvme issue - it seems to me like we need to decouple the
> nvme loop frontend request from the target backend request.  In case
> of a timeout/reset we'll complete struct request like all other nvme
> transport drivers, but we leave the backend target state around, which
> will be freed when it completes (or leaks when the completion is lost).

I don't think nvme's loop target should do anything to help a command
complete. It shouldn't even implement a timeout for the same reason
no stacking block driver implements these. If a request is stuck, the
lowest level is the only driver that should have the responsibility to
make it unstuck.


Re: [PATCH] scsi: Fix a harmless double shift bug

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 01:37:10PM +0300, Dan Carpenter wrote:
> Smatch generates a warning:
> 
> drivers/scsi/scsi_lib.c:1656 scsi_mq_done() warn: test_bit() takes a bit number
> 
> The problem is that SCMD_STATE_COMPLETE is supposed to be bit number 0
> and not a mask like "(1 << 0)".  It is used like this:
> 
>   if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
> 
> The test_and_set_bit() has a shift built in so it's a double left shift
> and uses bit number 1 instead of number 0.  This bug is harmless because
> it's done consistently and it doesn't clash with any other flags.
> 
> Fixes: f1342709d18a ("scsi: Do not rely on blk-mq for double completions")
> Signed-off-by: Dan Carpenter 

Nice catch, thanks for the fix.

Reviewed-by: Keith Busch 
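
For reference, the fix presumably boils down to defining the flag as a bit
number rather than a mask, along these lines:

/* before: a mask, so test_and_set_bit() shifts it again and ends up using bit 1 */
#define SCMD_STATE_COMPLETE     (1 << 0)

/* after: a bit number, which is what the atomic bitops expect */
#define SCMD_STATE_COMPLETE     0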


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-28 Thread Keith Busch
On Wed, Nov 28, 2018 at 03:31:46PM -0700, Keith Busch wrote:
> Waiting for a freeze isn't really the criterion we need anyway: we don't
> care if there are entered requests in MQ_RQ_IDLE. We just want to wait
> for dispatched ones to return, and we currently don't have a good way
> to sync with that condition.

One thing making this weird is that blk_mq_request_started() returns true
for COMPLETED requests, and that makes no sense to me. Completed is
the opposite of started, so I'm not sure why we would return true for
such states. Is anyone actually depending on that?

If we can return true only for started commands, the following
implements the desired wait criterion:

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a82830f39933..d0ef540711c7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -647,7 +647,7 @@ EXPORT_SYMBOL(blk_mq_complete_request);
 
 int blk_mq_request_started(struct request *rq)
 {
-   return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
+   return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..ae50b6ed95fb 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -14,6 +14,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -425,12 +426,37 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
return error;
 }
 
+bool nvme_count_active(struct request *req, void *data, bool reserved)
+{
+   unsigned int *active = data;
+
+   (*active)++;
+   return true;
+}
+
+/*
+ * It is the backing device driver's responsibility to ensure all dispatched
+ * requests are eventually completed.
+ */
+static void nvme_wait_for_stopped(struct nvme_loop_ctrl *ctrl)
+{
+   unsigned int active;
+
+   do {
+   active = 0;
+   blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_count_active,
+   &active);
+   if (!active)
+   return;
+   msleep(100);
+   } while (true);
+}
+
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
-   blk_mq_tagset_busy_iter(&ctrl->tag_set,
-   nvme_cancel_request, &ctrl->ctrl);
+   nvme_wait_for_stopped(ctrl);
nvme_loop_destroy_io_queues(ctrl);
}
 
--


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-28 Thread Keith Busch
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>   if (ctrl->ctrl.queue_count > 1) {
> - nvme_stop_queues(&ctrl->ctrl);
> - blk_mq_tagset_busy_iter(&ctrl->tag_set,
> - nvme_cancel_request, &ctrl->ctrl);
> + /*
> +  * The back end device driver is responsible for completing all
> +  * entered requests
> +  */
> + nvme_start_freeze(&ctrl->ctrl);
> + nvme_wait_freeze(&ctrl->ctrl);
>   nvme_loop_destroy_io_queues(ctrl);
> + nvme_unfreeze(&ctrl->ctrl);
>   }
>  
>   if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

Grr, the above doesn't work so well when not using NVMe mpath because
nvmf_check_ready() will requeue it, leaving entered requests that won't
be started.

Waiting for a freeze isn't really the criterion we need anyway: we don't
care if there are entered requests in MQ_RQ_IDLE. We just want to wait
for dispatched ones to return, and we currently don't have a good way
to sync with that condition.


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-28 Thread Keith Busch
On Wed, Nov 28, 2018 at 09:26:55AM -0700, Keith Busch wrote:
> ---
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 9908082b32c4..116398b240e5 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
>  static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
>  {
>   if (ctrl->ctrl.queue_count > 1) {
> - nvme_stop_queues(&ctrl->ctrl);
> - blk_mq_tagset_busy_iter(&ctrl->tag_set,
> - nvme_cancel_request, &ctrl->ctrl);
> + /*
> +  * The back end device driver is responsible for completing all
> +  * entered requests
> +  */
> + nvme_start_freeze(&ctrl->ctrl);
> + nvme_wait_freeze(&ctrl->ctrl);
>   nvme_loop_destroy_io_queues(ctrl);
> + nvme_unfreeze(&ctrl->ctrl);
>   }
>  
>   if (ctrl->ctrl.state == NVME_CTRL_LIVE)
> ---

The above tests fine with io and nvme resets on a target nvme loop
backed by null_blk, but I also couldn't recreate the original report
without the patch.

Ming,

Could you tell me a little more how you set it up? I'm just configuring
null_blk with queue_mode=2 irqmode=2. Anything else to recreate?

Thanks,
Keith 


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-28 Thread Keith Busch
On Wed, Nov 28, 2018 at 08:58:00AM -0700, Jens Axboe wrote:
> On 11/28/18 8:49 AM, Keith Busch wrote:
> > On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> >> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> >>>> Is this the nvme target on top of null_blk?
> >>>
> >>> Yes.
> >>
> >> And it goes away if you revert just the last patch?
> > 
> > It looks like a problem existed before that last patch. Reverting it
> > helps only if the request happened to have not been reallocated. If it
> > had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
> > request out-of-order. If this were a real device, that'd probably result
> > in data corruption.
> 
> null_blk just needs updating for this.

Isn't this the nvme target's problem? It shouldn't complete requests
dispatched to its backing device, so I'm thinking something like the
following is what should happen. Untested at the moment, but will try
it out shortly.

---
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..116398b240e5 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -428,10 +428,14 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
 {
if (ctrl->ctrl.queue_count > 1) {
-   nvme_stop_queues(&ctrl->ctrl);
-   blk_mq_tagset_busy_iter(&ctrl->tag_set,
-   nvme_cancel_request, &ctrl->ctrl);
+   /*
+* The back end device driver is responsible for completing all
+* entered requests
+*/
+   nvme_start_freeze(&ctrl->ctrl);
+   nvme_wait_freeze(&ctrl->ctrl);
nvme_loop_destroy_io_queues(ctrl);
+   nvme_unfreeze(&ctrl->ctrl);
}
 
if (ctrl->ctrl.state == NVME_CTRL_LIVE)
---


Re: [PATCHv4 0/3] scsi timeout handling updates

2018-11-28 Thread Keith Busch
On Wed, Nov 28, 2018 at 11:08:48AM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:07:01PM +0800, Ming Lei wrote:
> > > Is this the nvme target on top of null_blk?
> > 
> > Yes.
> 
> And it goes away if you revert just the last patch?

It looks like a problem existed before that last patch. Reverting it
helps only if the request happened to have not been reallocated. If it
had been reallocated, the NULL_IRQ_TIMER would have completed the wrong
request out-of-order. If this were a real device, that'd probably result
in data corruption.


[PATCHv4 3/3] blk-mq: Simplify request completion state

2018-11-26 Thread Keith Busch
There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c |  4 +---
 include/linux/blk-mq.h | 14 --
 2 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7c8cfa0cd420..cda698804422 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,9 +568,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (!blk_mq_mark_complete(rq))
-   return;
-
+   WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
/*
 * Most of single queue controllers, there is only one irq vector
 * for handling IO completion, and the only irq's affinity is set
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6e3da356a8eb..b8de11e0603b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -329,20 +329,6 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-   return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-   MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.14.4



[PATCHv4 1/3] blk-mq: Return true if request was completed

2018-11-26 Thread Keith Busch
A driver may have internal state to clean up if we're pretending a request
didn't complete. Return 'false' if the command wasn't actually completed
due to the timeout error injection, and true otherwise.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 5 +++--
 include/linux/blk-mq.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 37674c1766a7..7c8cfa0cd420 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -638,11 +638,12 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq)
+bool blk_mq_complete_request(struct request *rq)
 {
if (unlikely(blk_should_fake_timeout(rq->q)))
-   return;
+   return false;
__blk_mq_complete_request(rq);
+   return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ca0520ca6437..6e3da356a8eb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -298,7 +298,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
bool kick_requeue_list);
 void blk_mq_kick_requeue_list(struct request_queue *q);
void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
-void blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.14.4



[PATCHv4 0/3] scsi timeout handling updates

2018-11-26 Thread Keith Busch
An iterative update to the previous version, taking review comments into
account.

Background:

The main objective is to remove the generic block layer's lock prefix
currently required to transition a request to its completed state by
shifting that expense to lower level drivers that actually need it,
and removing the software layering violation that was required to use
that mechanism.

Changes since v3:

  The complete state is moved to its own field separate from the
  non-atomic scsi_cmnd "flags" field.

  Code comments added to describe the more obscure handling for fake
  timeout injection.

Keith Busch (3):
  blk-mq: Return true if request was completed
  scsi: Do not rely on blk-mq for double completions
  blk-mq: Simplify request completion state

 block/blk-mq.c|  9 -
 drivers/scsi/scsi_error.c | 22 +++---
 drivers/scsi/scsi_lib.c   | 13 -
 include/linux/blk-mq.h| 16 +---
 include/scsi/scsi_cmnd.h  |  4 
 5 files changed, 32 insertions(+), 32 deletions(-)

-- 
2.14.4



[PATCHv4 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-26 Thread Keith Busch
The scsi timeout error handling had been directly updating the block
layer's request state to prevent an error handling path and a natural completion
from completing the same request twice. Fix this layering violation
by having scsi control the fate of its commands with scsi owned flags
rather than use blk-mq's.

Signed-off-by: Keith Busch 
---
 drivers/scsi/scsi_error.c | 22 +++---
 drivers/scsi/scsi_lib.c   | 13 -
 include/scsi/scsi_cmnd.h  |  4 
 3 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index dd338a8cd275..16eef068e9e9 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -297,19 +297,19 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 
if (rtn == BLK_EH_DONE) {
/*
-* For blk-mq, we must set the request state to complete now
-* before sending the request to the scsi error handler. This
-* will prevent a use-after-free in the event the LLD manages
-* to complete the request before the error handler finishes
-* processing this timed out request.
+* Set the command to complete first in order to prevent a real
+* completion from releasing the command while error handling
+* is using it. If the command was already completed, then the
+* lower level driver beat the timeout handler, and it is safe
+* to return without escalating error recovery.
 *
-* If the request was already completed, then the LLD beat the
-* time out handler from transferring the request to the scsi
-* error handler. In that case we can return immediately as no
-* further action is required.
+* If timeout handling lost the race to a real completion, the
+* block layer may ignore that due to a fake timeout injection,
+* so return RESET_TIMER to allow error handling another shot
+* at this command.
 */
-   if (!blk_mq_mark_complete(req))
-   return rtn;
+   if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
+   return BLK_EH_RESET_TIMER;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0df15cb738d2..0dbf25512778 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1642,8 +1642,18 @@ static blk_status_t scsi_mq_prep_fn(struct request *req)
 
 static void scsi_mq_done(struct scsi_cmnd *cmd)
 {
+   if (unlikely(test_and_set_bit(SCMD_STATE_COMPLETE, &cmd->state)))
+   return;
trace_scsi_dispatch_cmd_done(cmd);
-   blk_mq_complete_request(cmd->request);
+
+   /*
+* If the block layer didn't complete the request due to a timeout
+* injection, scsi must clear its internal completed state so that the
+* timeout handler will see it needs to escalate its own error
+* recovery.
+*/
+   if (unlikely(!blk_mq_complete_request(cmd->request)))
+   clear_bit(SCMD_STATE_COMPLETE, &cmd->state);
 }
 
 static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
@@ -1702,6 +1712,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
if (!scsi_host_queue_ready(q, shost, sdev))
goto out_dec_target_busy;
 
+   clear_bit(SCMD_STATE_COMPLETE, &cmd->state);
if (!(req->rq_flags & RQF_DONTPREP)) {
ret = scsi_mq_prep_fn(req);
if (ret != BLK_STS_OK)
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index d6fd2aba0380..3de905e205ce 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -61,6 +61,9 @@ struct scsi_pointer {
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS   (SCMD_UNCHECKED_ISA_DMA | SCMD_INITIALIZED)
 
+/* for scmd->state */
+#define SCMD_STATE_COMPLETE(1 << 0)
+
 struct scsi_cmnd {
struct scsi_request req;
struct scsi_device *device;
@@ -145,6 +148,7 @@ struct scsi_cmnd {
 
int result; /* Status code from lower level driver */
int flags;  /* Command flags */
+   unsigned long state;/* Command completion state */
 
unsigned char tag;  /* SCSI-II queued command tag */
 };
-- 
2.14.4



Re: [PATCHv3 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-26 Thread Keith Busch
On Mon, Nov 26, 2018 at 08:32:45AM -0700, Jens Axboe wrote:
> On 11/21/18 6:12 AM, Christoph Hellwig wrote:
> > On Mon, Nov 19, 2018 at 08:19:00AM -0700, Keith Busch wrote:
> >> On Mon, Nov 19, 2018 at 12:58:15AM -0800, Christoph Hellwig wrote:
> >>>> index 5d83a162d03b..c1d5e4e36125 100644
> >>>> --- a/drivers/scsi/scsi_lib.c
> >>>> +++ b/drivers/scsi/scsi_lib.c
> >>>> @@ -1635,8 +1635,11 @@ static blk_status_t scsi_mq_prep_fn(struct request *req)
> >>>>  
> >>>>  static void scsi_mq_done(struct scsi_cmnd *cmd)
> >>>>  {
> >>>> +if (unlikely(test_and_set_bit(__SCMD_COMPLETE, &cmd->flags)))
> >>>> +return;
> >>>>  trace_scsi_dispatch_cmd_done(cmd);
> >>>> -blk_mq_complete_request(cmd->request);
> >>>> +if (unlikely(!blk_mq_complete_request(cmd->request)))
> >>>> +clear_bit(__SCMD_COMPLETE, &cmd->flags);
> >>>>  }
> >>>
> >>> This looks a little odd to me.  If we didn't complete the command
> >>> someone else did.  Why would we clear the bit in this case?
> >>
> >> It's only to go along with the fake timeout. If we don't clear the bit,
> >> then the scsi timeout handler will believe it has nothing to do because
> >> scsi did its required part. The block layer just pretends the LLD didn't
> >> do its part, so scsi has to play along too.
> > 
> > This just looks way to magic to me.  In other word - it needs a big fat
> > comment explaining the situation.
> > 
> >>>> +#define __SCMD_COMPLETE 3
> >>>> +#define SCMD_COMPLETE   (1 << __SCMD_COMPLETE)
> >>>
> >>> This mixing of atomic and non-atomic bitops looks rather dangerous
> >>> to me.  Can you add a new atomic_flags just for the completed flag,
> >>> and always use the bitops on it for now? I think we can eventually
> >>> kill most of the existing flags except for SCMD_TAGGED over the
> >>> next merge window or two and then move that over as well.
> >>
> >> The only concurrent access is completion + timeout, otherwise access is
> >> single-threaded. I'm using the atomic operations only where it is
> >> needed.
> >>
> >> We implicitly clear the SCMD_COMPLETE flag along with SCMD_TAGGED in
> >> scsi_init_command() too, and I didn't want to add new overhead with
> >> new atomics.
> > 
> > In general mixing access types on a single field (nevermind bit)
> > is going to cause us problems further down the road sooner or later.
> > 
> > I'd be much happier with a separate field.
> 
> Keith, will you please respin with the separate field? Would be nice
> to get this merged for 4.21.

I'll send out a new version today.


Re: [PATCHv3 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 12:58:15AM -0800, Christoph Hellwig wrote:
> > index 5d83a162d03b..c1d5e4e36125 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1635,8 +1635,11 @@ static blk_status_t scsi_mq_prep_fn(struct request *req)
> >  
> >  static void scsi_mq_done(struct scsi_cmnd *cmd)
> >  {
> > +   if (unlikely(test_and_set_bit(__SCMD_COMPLETE, &cmd->flags)))
> > +   return;
> > trace_scsi_dispatch_cmd_done(cmd);
> > -   blk_mq_complete_request(cmd->request);
> > +   if (unlikely(!blk_mq_complete_request(cmd->request)))
> > +   clear_bit(__SCMD_COMPLETE, &cmd->flags);
> >  }
> 
> This looks a little odd to me.  If we didn't complete the command
> someone else did.  Why would we clear the bit in this case?

It's only to go along with the fake timeout. If we don't clear the bit,
then the scsi timeout handler will believe it has nothing to do because
scsi did its required part. The block layer just pretends the LLD didn't
do its part, so scsi has to play along too.

> >  static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> > @@ -1701,6 +1704,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> > goto out_dec_host_busy;
> > req->rq_flags |= RQF_DONTPREP;
> > } else {
> > +   cmd->flags &= ~SCMD_COMPLETE;
> > blk_mq_start_request(req);
> > }
> >  
> > diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
> > index d6fd2aba0380..ded7c7194a28 100644
> > --- a/include/scsi/scsi_cmnd.h
> > +++ b/include/scsi/scsi_cmnd.h
> > @@ -58,6 +58,9 @@ struct scsi_pointer {
> >  #define SCMD_TAGGED(1 << 0)
> >  #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
> >  #define SCMD_INITIALIZED   (1 << 2)
> > +
> > +#define __SCMD_COMPLETE3
> > +#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
> 
> This mixing of atomic and non-atomic bitops looks rather dangerous
> to me.  Can you add a new atomic_flags just for the completed flag,
> and always use the bitops on it for now? I think we can eventually
> kill most of the existing flags except for SCMD_TAGGED over the
> next merge window or two and then move that over as well.

The only concurrent access is completion + timeout, otherwise access is
single-threaded. I'm using the atomic operations only where it is
needed.

We implicitly clear the SCMD_COMPLETE flag along with SCMD_TAGGED in
scsi_init_command() too, and I didn't want to add new overhead with
new atomics.

> Otherwise the concept looks fine to me.


[PATCHv3 1/3] blk-mq: Return true if request was completed

2018-11-15 Thread Keith Busch
A driver may have internal state to clean up if we're pretending a request
didn't complete. Return 'false' if the command wasn't actually completed
due to the timeout error injection, and true otherwise.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 5 +++--
 include/linux/blk-mq.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 352051ea30f7..1fb0f2050d31 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -633,11 +633,12 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq)
+bool blk_mq_complete_request(struct request *rq)
 {
if (unlikely(blk_should_fake_timeout(rq->q)))
-   return;
+   return false;
__blk_mq_complete_request(rq);
+   return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 929e8abc5535..f0dc26fc86a9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -298,7 +298,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
bool kick_requeue_list);
 void blk_mq_kick_requeue_list(struct request_queue *q);
void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
-void blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.14.4



[PATCHv3 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-15 Thread Keith Busch
The scsi timeout error handling had been directly updating the block
layer's request state to prevent an error handling path and a natural completion
from completing the same request twice. Fix this layering violation
by having scsi control the fate of its commands with scsi owned flags
rather than use blk-mq's.

Signed-off-by: Keith Busch 
---
 drivers/scsi/scsi_error.c | 22 +++---
 drivers/scsi/scsi_lib.c   |  6 +-
 include/scsi/scsi_cmnd.h  |  5 -
 3 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index dd338a8cd275..e92e088f636f 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -297,19 +297,19 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 
if (rtn == BLK_EH_DONE) {
/*
-* For blk-mq, we must set the request state to complete now
-* before sending the request to the scsi error handler. This
-* will prevent a use-after-free in the event the LLD manages
-* to complete the request before the error handler finishes
-* processing this timed out request.
+* Set the command to complete first in order to prevent a real
+* completion from releasing the command while error handling
+* is using it. If the command was already completed, then the
+* lower level driver beat the timeout handler, and it is safe
+* to return without escalating error recovery.
 *
-* If the request was already completed, then the LLD beat the
-* time out handler from transferring the request to the scsi
-* error handler. In that case we can return immediately as no
-* further action is required.
+* If timeout handling lost the race to a real completion, the
+* block layer may ignore that due to a fake timeout injection,
+* so return RESET_TIMER to allow error handling another shot
+* at this command.
 */
-   if (!blk_mq_mark_complete(req))
-   return rtn;
+   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
+   return BLK_EH_RESET_TIMER;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5d83a162d03b..c1d5e4e36125 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1635,8 +1635,11 @@ static blk_status_t scsi_mq_prep_fn(struct request *req)
 
 static void scsi_mq_done(struct scsi_cmnd *cmd)
 {
+   if (unlikely(test_and_set_bit(__SCMD_COMPLETE, &cmd->flags)))
+   return;
trace_scsi_dispatch_cmd_done(cmd);
-   blk_mq_complete_request(cmd->request);
+   if (unlikely(!blk_mq_complete_request(cmd->request)))
+   clear_bit(__SCMD_COMPLETE, &cmd->flags);
 }
 
 static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
@@ -1701,6 +1704,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_dec_host_busy;
req->rq_flags |= RQF_DONTPREP;
} else {
+   cmd->flags &= ~SCMD_COMPLETE;
blk_mq_start_request(req);
}
 
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index d6fd2aba0380..ded7c7194a28 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -58,6 +58,9 @@ struct scsi_pointer {
 #define SCMD_TAGGED(1 << 0)
 #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
 #define SCMD_INITIALIZED   (1 << 2)
+
+#define __SCMD_COMPLETE3
+#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS   (SCMD_UNCHECKED_ISA_DMA | SCMD_INITIALIZED)
 
@@ -144,7 +147,7 @@ struct scsi_cmnd {
 * to be at an address < 16Mb). */
 
int result; /* Status code from lower level driver */
-   int flags;  /* Command flags */
+   unsigned long flags;/* Command flags */
 
unsigned char tag;  /* SCSI-II queued command tag */
 };
-- 
2.14.4



[PATCHv3 0/3] scsi timeout handling updates

2018-11-15 Thread Keith Busch
The main objective is to remove the generic block layer's lock prefix
currently required to transition a request to its completed state by
shifting that expense to lower level drivers that need it, and removing
the software layering violation that was required to use that mechanism.

Changes since v2:

  This one really plugs any gaps with fake timeout injection, and
  additional code comments are included.

Keith Busch (3):
  blk-mq: Return true if request was completed
  scsi: Do not rely on blk-mq for double completions
  blk-mq: Simplify request completion state

 block/blk-mq.c|  9 -
 drivers/scsi/scsi_error.c | 22 +++---
 drivers/scsi/scsi_lib.c   |  6 +-
 include/linux/blk-mq.h| 16 +---
 include/scsi/scsi_cmnd.h  |  5 -
 5 files changed, 25 insertions(+), 33 deletions(-)

-- 
2.14.4



[PATCHv3 3/3] blk-mq: Simplify request completion state

2018-11-15 Thread Keith Busch
There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c |  4 +---
 include/linux/blk-mq.h | 14 --
 2 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fb0f2050d31..ce769132cfbe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,9 +568,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (!blk_mq_mark_complete(rq))
-   return;
-
+   WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
/*
 * Most of single queue controllers, there is only one irq vector
 * for handling IO completion, and the only irq's affinity is set
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f0dc26fc86a9..9b5528fcdd45 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -329,20 +329,6 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-   return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-   MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.14.4



Re: [PATCH 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-15 Thread Keith Busch
Sorry everyone, this was the previous version. Please ignore; I'm
sending out the updated one now.


[PATCH 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-15 Thread Keith Busch
The scsi timeout error handling had been directly updating the request
state to prevent a natural completion and error handling from completing
the same request twice. Fix this layering violation by having scsi
control the fate of its commands with scsi owned flags rather than
use blk-mq's.

Signed-off-by: Keith Busch 
---
 drivers/scsi/scsi_error.c | 17 +++--
 drivers/scsi/scsi_lib.c   |  6 +-
 include/scsi/scsi_cmnd.h  |  5 -
 3 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c736d61b1648..f89e829a1c51 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -199,6 +199,9 @@ scsi_abort_command(struct scsi_cmnd *scmd)
return FAILED;
}
 
+   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
+   return SUCCESS;
+
spin_lock_irqsave(shost->host_lock, flags);
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
@@ -296,20 +299,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_DONE) {
-   /*
-* For blk-mq, we must set the request state to complete now
-* before sending the request to the scsi error handler. This
-* will prevent a use-after-free in the event the LLD manages
-* to complete the request before the error handler finishes
-* processing this timed out request.
-*
-* If the request was already completed, then the LLD beat the
-* time out handler from transferring the request to the scsi
-* error handler. In that case we can return immediately as no
-* further action is required.
-*/
-   if (req->q->mq_ops && !blk_mq_mark_complete(req))
-   return rtn;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c7fccbb8f554..1e74137f1073 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2044,8 +2044,11 @@ static int scsi_mq_prep_fn(struct request *req)
 
 static void scsi_mq_done(struct scsi_cmnd *cmd)
 {
+   if (test_and_set_bit(__SCMD_COMPLETE, &cmd->flags))
+   return;
trace_scsi_dispatch_cmd_done(cmd);
-   blk_mq_complete_request(cmd->request);
+   if (unlikely(!blk_mq_complete_request(cmd->request)))
+   clear_bit(__SCMD_COMPLETE, &cmd->flags);
 }
 
 static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
@@ -2104,6 +2107,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_dec_host_busy;
req->rq_flags |= RQF_DONTPREP;
} else {
+   cmd->flags &= ~SCMD_COMPLETE;
blk_mq_start_request(req);
}
 
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index c891ada3c5c2..acef13c628d3 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -58,6 +58,9 @@ struct scsi_pointer {
 #define SCMD_TAGGED(1 << 0)
 #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
 #define SCMD_INITIALIZED   (1 << 2)
+
+#define __SCMD_COMPLETE3
+#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS   (SCMD_UNCHECKED_ISA_DMA | SCMD_INITIALIZED)
 
@@ -144,7 +147,7 @@ struct scsi_cmnd {
 * to be at an address < 16Mb). */
 
int result; /* Status code from lower level driver */
-   int flags;  /* Command flags */
+   unsigned long flags;/* Command flags */
 
unsigned char tag;  /* SCSI-II queued command tag */
 };
-- 
2.14.4



[PATCH 1/3] blk-mq: Return true if request was completed

2018-11-15 Thread Keith Busch
A driver may have internal state to clean up if we're pretending a request
timeout occurred. Return 'false' if the command wasn't actually completed
due to the error injection, and true otherwise.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 5 +++--
 include/linux/blk-mq.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..f91951800a64 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,11 +630,12 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq)
+bool blk_mq_complete_request(struct request *rq)
 {
if (unlikely(blk_should_fake_timeout(rq->q)))
-   return;
+   return false;
__blk_mq_complete_request(rq);
+   return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2286dc12c6bc..dec6ef385492 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -264,7 +264,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
bool kick_requeue_list);
 void blk_mq_kick_requeue_list(struct request_queue *q);
void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
-void blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.14.4



[PATCH 3/3] blk-mq: Simplify request completion state

2018-11-15 Thread Keith Busch
There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c |  4 +---
 include/linux/blk-mq.h | 14 --
 2 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f91951800a64..9d569e74cbe3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -565,9 +565,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (!blk_mq_mark_complete(rq))
-   return;
-
+   WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
/*
 * Most of single queue controllers, there is only one irq vector
 * for handling IO completion, and the only irq's affinity is set
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dec6ef385492..ec0a2688a373 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -293,20 +293,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-   return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-   MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.14.4



Re: [PATCH 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-14 Thread Keith Busch
On Wed, Nov 14, 2018 at 11:00:18AM -0700, Keith Busch wrote:
> On Wed, Nov 14, 2018 at 09:51:36AM -0800, Bart Van Assche wrote:
> > Regarding this patch: I think this patch introduces a subtle but severe bug
> > in the SCSI core, namely that if an abort is processed concurrently with
> > request completion with "fake timeout" enabled, the abort is ignored.
> 
> That requires the following occur concurrently:
> 
>   1. A real completion
>   2. A real timeout
>   3. A fake timeout
> 
> That can't happen on a production kernel, and it is highly improbable
> on the fake one. We can still fix it by having scsi timeout return
> BLK_EH_RESET_TIMER in this case. I didn't like adding code just to work
> around error injection, but there isn't a good alternative at the moment.

So do this instead:

--8<---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index ff372b335ced..d343024e732a 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -199,9 +199,6 @@ scsi_abort_command(struct scsi_cmnd *scmd)
return FAILED;
}
 
-   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
-   return SUCCESS;
-
spin_lock_irqsave(shost->host_lock, flags);
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
@@ -299,6 +296,8 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_DONE) {
+   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
+   return BLK_EH_RESET_TIMER;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
-->8---


Re: [PATCH 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-14 Thread Keith Busch
On Wed, Nov 14, 2018 at 09:51:36AM -0800, Bart Van Assche wrote:
> Regarding this patch: I think this patch introduces a subtle but severe bug
> in the SCSI core, namely that if an abort is processed concurrently with
> request completion with "fake timeout" enabled, the abort is ignored.

That requires the following occur concurrently:

  1. A real completion
  2. A real timeout
  3. A fake timeout

That can't happen on a production kernel, and it is highly improbable
on the fake one. We can still fix it by having scsi timeout return
BLK_EH_RESET_TIMER in this case. I didn't like adding code just to work
around error injection, but there isn't a good alternative at the moment.


[PATCH 3/3] blk-mq: Simplify request completion state

2018-11-14 Thread Keith Busch
There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c |  4 +---
 include/linux/blk-mq.h | 14 --
 2 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f91951800a64..9d569e74cbe3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -565,9 +565,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (!blk_mq_mark_complete(rq))
-   return;
-
+   WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
/*
 * Most of single queue controllers, there is only one irq vector
 * for handling IO completion, and the only irq's affinity is set
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dec6ef385492..ec0a2688a373 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -293,20 +293,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-   return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-   MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.14.4



[PATCH 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-14 Thread Keith Busch
The scsi timeout error handling had been directly updating the request
state to prevent a natural completion and error handling from completing
the same request twice. Fix this layering violation by having scsi
control the fate of its commands with scsi owned flags rather than
use blk-mq's.

Signed-off-by: Keith Busch 
---
 drivers/scsi/scsi_error.c | 17 +++--
 drivers/scsi/scsi_lib.c   |  6 +-
 include/scsi/scsi_cmnd.h  |  5 -
 3 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index c736d61b1648..f89e829a1c51 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -199,6 +199,9 @@ scsi_abort_command(struct scsi_cmnd *scmd)
return FAILED;
}
 
+   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
+   return SUCCESS;
+
spin_lock_irqsave(shost->host_lock, flags);
if (shost->eh_deadline != -1 && !shost->last_reset)
shost->last_reset = jiffies;
@@ -296,20 +299,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_DONE) {
-   /*
-* For blk-mq, we must set the request state to complete now
-* before sending the request to the scsi error handler. This
-* will prevent a use-after-free in the event the LLD manages
-* to complete the request before the error handler finishes
-* processing this timed out request.
-*
-* If the request was already completed, then the LLD beat the
-* time out handler from transferring the request to the scsi
-* error handler. In that case we can return immediately as no
-* further action is required.
-*/
-   if (req->q->mq_ops && !blk_mq_mark_complete(req))
-   return rtn;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c7fccbb8f554..1e74137f1073 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2044,8 +2044,11 @@ static int scsi_mq_prep_fn(struct request *req)
 
 static void scsi_mq_done(struct scsi_cmnd *cmd)
 {
+   if (test_and_set_bit(__SCMD_COMPLETE, &cmd->flags))
+   return;
trace_scsi_dispatch_cmd_done(cmd);
-   blk_mq_complete_request(cmd->request);
+   if (unlikely(!blk_mq_complete_request(cmd->request)))
+   clear_bit(__SCMD_COMPLETE, &cmd->flags);
 }
 
 static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
@@ -2104,6 +2107,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
goto out_dec_host_busy;
req->rq_flags |= RQF_DONTPREP;
} else {
+   cmd->flags &= ~SCMD_COMPLETE;
blk_mq_start_request(req);
}
 
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index c891ada3c5c2..acef13c628d3 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -58,6 +58,9 @@ struct scsi_pointer {
 #define SCMD_TAGGED(1 << 0)
 #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
 #define SCMD_INITIALIZED   (1 << 2)
+
+#define __SCMD_COMPLETE3
+#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS   (SCMD_UNCHECKED_ISA_DMA | SCMD_INITIALIZED)
 
@@ -144,7 +147,7 @@ struct scsi_cmnd {
 * to be at an address < 16Mb). */
 
int result; /* Status code from lower level driver */
-   int flags;  /* Command flags */
+   unsigned long flags;/* Command flags */
 
unsigned char tag;  /* SCSI-II queued command tag */
 };
-- 
2.14.4



[PATCH 1/3] blk-mq: Return true if request was completed

2018-11-14 Thread Keith Busch
A driver may have internal state to clean up if we're pretending a request
timeout occurred. Return 'false' if the command wasn't actually completed
due to the error injection, and true otherwise.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c | 5 +++--
 include/linux/blk-mq.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..f91951800a64 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,11 +630,12 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
  * Ends all I/O on a request. It does not handle partial completions.
  * The actual completion happens out-of-order, through a IPI handler.
  **/
-void blk_mq_complete_request(struct request *rq)
+bool blk_mq_complete_request(struct request *rq)
 {
if (unlikely(blk_should_fake_timeout(rq->q)))
-   return;
+   return false;
__blk_mq_complete_request(rq);
+   return true;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2286dc12c6bc..dec6ef385492 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -264,7 +264,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
bool kick_requeue_list);
 void blk_mq_kick_requeue_list(struct request_queue *q);
void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
-void blk_mq_complete_request(struct request *rq);
+bool blk_mq_complete_request(struct request *rq);
 bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
   struct bio *bio);
 bool blk_mq_queue_stopped(struct request_queue *q);
-- 
2.14.4



Re: [PATCH 1/2] scsi: Do not rely on blk-mq for double completions

2018-11-13 Thread Keith Busch
On Tue, Nov 13, 2018 at 12:20:46PM -0700, Jens Axboe wrote:
> On 11/13/18 11:57 AM, Keith Busch wrote:
> >  static void scsi_mq_done(struct scsi_cmnd *cmd)
> >  {
> > +   if (test_and_set_bit(__SCMD_COMPLETE, &cmd->flags))
> > +   return;
> > trace_scsi_dispatch_cmd_done(cmd);
> > blk_mq_complete_request(cmd->request);
> > +
> > +#ifdef CONFIG_FAIL_IO_TIMEOUT
> > +   /*
> > +* Clearing complete here serves only to allow the desired recovery to
> > +* escalate on blk_rq_should_fake_timeout()'s error injection.
> > +*/
> > +   clear_bit(__SCMD_COMPLETE, &cmd->flags);
> > +#endif
> >  }
> 
> We could have this be:
> 
> static void scsi_mq_done(struct scsi_cmnd *cmd)
> {
>   if (test_and_set_bit(__SCMD_COMPLETE, &cmd->flags))
>   return;
>   trace_scsi_dispatch_cmd_done(cmd);
> 
>   if (blk_mq_complete_request(cmd->request)) {
>   /*
>* Clearing complete here serves only to allow the
>* desired recovery to escalate on
>* blk_rq_should_fake_timeout()'s error injection.
>*/
>   clear_bit(__SCMD_COMPLETE, &cmd->flags);
>   }
> }
> 
> with
> 
> bool blk_mq_complete_request(struct request *rq)
> {
>   if (unlikely(blk_should_fake_timeout(rq->q)))
>   return true;
>   __blk_mq_complete_request(rq);
>   return false;
> }
> 
> and not have this CONFIG_FAIL_IO_TIMEOUT dependency, but that'd be a bit
> more expensive.

I was trying to avoid every cost no matter how negligible (those are the
only types of costs left as far as I can see), but I think your proposal
might actually be necessary: if a timeout wasn't faked, clearing the
completion flag unconditionally might have a problem with a real timeout
racing with the real completion. :(


[PATCH 1/2] scsi: Do not rely on blk-mq for double completions

2018-11-13 Thread Keith Busch
The scsi timeout error handling had been directly updating the request
state to prevent a natural completion and error handling from completing
the same request twice. Fix this layering violation by having scsi
control the fate of its commands with scsi owned flags rather than
use blk-mq's.

Signed-off-by: Keith Busch 
---
 drivers/scsi/scsi_error.c | 17 +++--
 drivers/scsi/scsi_lib.c   | 11 +++
 include/scsi/scsi_cmnd.h  |  5 -
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index dd338a8cd275..6156f45c2c80 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -204,6 +204,9 @@ scsi_abort_command(struct scsi_cmnd *scmd)
shost->last_reset = jiffies;
spin_unlock_irqrestore(shost->host_lock, flags);
 
+   if (test_and_set_bit(__SCMD_COMPLETE, &scmd->flags))
+   return SUCCESS;
+
scmd->eh_eflags |= SCSI_EH_ABORT_SCHEDULED;
SCSI_LOG_ERROR_RECOVERY(3,
scmd_printk(KERN_INFO, scmd, "abort scheduled\n"));
@@ -296,20 +299,6 @@ enum blk_eh_timer_return scsi_times_out(struct request 
*req)
rtn = host->hostt->eh_timed_out(scmd);
 
if (rtn == BLK_EH_DONE) {
-   /*
-* For blk-mq, we must set the request state to complete now
-* before sending the request to the scsi error handler. This
-* will prevent a use-after-free in the event the LLD manages
-* to complete the request before the error handler finishes
-* processing this timed out request.
-*
-* If the request was already completed, then the LLD beat the
-* time out handler from transferring the request to the scsi
-* error handler. In that case we can return immediately as no
-* further action is required.
-*/
-   if (!blk_mq_mark_complete(req))
-   return rtn;
if (scsi_abort_command(scmd) != SUCCESS) {
set_host_byte(scmd, DID_TIME_OUT);
scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 61babcb269ab..c680171ca201 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1635,8 +1635,18 @@ static blk_status_t scsi_mq_prep_fn(struct request *req)
 
 static void scsi_mq_done(struct scsi_cmnd *cmd)
 {
+   if (test_and_set_bit(__SCMD_COMPLETE, &cmd->flags))
+   return;
trace_scsi_dispatch_cmd_done(cmd);
blk_mq_complete_request(cmd->request);
+
+#ifdef CONFIG_FAIL_IO_TIMEOUT
+   /*
+* Clearing complete here serves only to allow the desired recovery to
+* escalate on blk_rq_should_fake_timeout()'s error injection.
+*/
+   clear_bit(__SCMD_COMPLETE, &cmd->flags);
+#endif
 }
 
 static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
@@ -1701,6 +1711,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx 
*hctx,
goto out_dec_host_busy;
req->rq_flags |= RQF_DONTPREP;
} else {
+   cmd->flags &= ~SCMD_COMPLETE;
blk_mq_start_request(req);
}
 
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index d6fd2aba0380..ded7c7194a28 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -58,6 +58,9 @@ struct scsi_pointer {
 #define SCMD_TAGGED(1 << 0)
 #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
 #define SCMD_INITIALIZED   (1 << 2)
+
+#define __SCMD_COMPLETE3
+#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS   (SCMD_UNCHECKED_ISA_DMA | SCMD_INITIALIZED)
 
@@ -144,7 +147,7 @@ struct scsi_cmnd {
 * to be at an address < 16Mb). */
 
int result; /* Status code from lower level driver */
-   int flags;  /* Command flags */
+   unsigned long flags;/* Command flags */
 
unsigned char tag;  /* SCSI-II queued command tag */
 };
-- 
2.14.4



[PATCH 2/2] blk-mq: Simplify request completion state

2018-11-13 Thread Keith Busch
There are no more users relying on blk-mq request states to prevent
double completions, so replace the relatively expensive cmpxchg operation
with WRITE_ONCE.

Signed-off-by: Keith Busch 
---
 block/blk-mq.c |  4 +---
 include/linux/blk-mq.h | 14 --
 2 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 03b1af0151ca..6c546a021803 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -568,9 +568,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (!blk_mq_mark_complete(rq))
-   return;
-
+   WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
/*
 * Most of single queue controllers, there is only one irq vector
 * for handling IO completion, and the only irq's affinity is set
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e32e9293e5a0..1a857c6e17fa 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -237,20 +237,6 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
 unsigned int blk_mq_rq_cpu(struct request *rq);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-   return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-   MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
-- 
2.14.4



Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Tue, Oct 30, 2018 at 11:33:51AM -0600, Jens Axboe wrote:
> On 10/30/18 11:22 AM, Keith Busch wrote:
> > On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
> >> Pretty trivial, below. This also keeps the queue mapping calculations
> >> more clean, as we don't have to do one after we're done allocating
> >> IRQs.
> > 
> > Yep, this addresses my concern. It's less efficient than PCI since PCI
> > can usually jump straight to a valid vector count in a single iteration
> > whereas this only subtracts by 1. I really can't be bothered to care about
> > optimizing that, so this works for me! :) 
> 
> It definitely is less efficient than just getting the count that we
> can support, but it's at probe time so I could not really be bothered
> either.
> 
> Can I add your reviewed-by?

Yes, please.

Reviewed-by: Keith Busch 

> -- 
> Jens Axboe


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Tue, Oct 30, 2018 at 11:09:04AM -0600, Jens Axboe wrote:
> Pretty trivial, below. This also keeps the queue mapping calculations
> more clean, as we don't have to do one after we're done allocating
> IRQs.

Yep, this addresses my concern. It's less efficient than PCI since PCI
can usually jump straight to a valid vector count in a single iteration
whereas this only subtracts by 1. I really can't be bothered to care about
optimizing that, so this works for me! :) 


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Tue, Oct 30, 2018 at 09:18:05AM -0600, Jens Axboe wrote:
> On 10/30/18 9:08 AM, Keith Busch wrote:
> > On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> >> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> >> should be the less than or equal to nvecs. Granted this isn't enforced,
> >> and perhaps that should be the case.
> > 
> > That should at least initially be true for a proper functioning
> > driver. It's not enforced as you mentioned, but that's only related to
> > the issue I'm referring to.
> > 
> > The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
> > and max_vecs, but a range of allowable vector allocations doesn't make
> > sense when using sets.
> 
> I feel like we're going in circles here, not sure what you feel the
> issue is now? The range is fine, whoever uses sets will need to adjust
> their sets based on what pci_alloc_irq_vectors_affinity() returns,
> if it didn't return the passed in desired max.

Sorry, let me to try again.

pci_alloc_irq_vectors_affinity() starts at the provided max_vecs. If
that doesn't work, it will iterate down to min_vecs without returning to
the caller. The caller doesn't have a chance to adjust its sets between
iterations when you provide a range.

The 'masks' overrun problem happens if the caller provides min_vecs
as a smaller value than the sum of the set (plus any reserved).

If it's up to the caller to ensure that doesn't happen, then min and
max must both be the same value, and that value must also be the same as
the set sum + reserved vectors. The range just becomes redundant since
it is already bounded by the set.

Using the nvme example, it would need something like this to prevent the
'masks' overrun:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a8747b956e43..625eff570eaa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2120,7 +2120,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 * setting up the full range we need.
 */
pci_free_irq_vectors(pdev);
-   result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues,
+   result = pci_alloc_irq_vectors_affinity(pdev, nr_io_queues, 
nr_io_queues,
PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
if (result <= 0)
return -EIO;
--
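In other words, a caller using sets has to compute the same value for both
bounds itself. A minimal sketch (assuming the nr_sets/sets fields added by
the series under discussion; the numbers are illustrative only):

	int sets[2] = { 4, 2 };			/* illustrative queue counts */
	struct irq_affinity affd = {
		.pre_vectors = 1,		/* e.g. one reserved admin vector */
		.nr_sets = 2,
		.sets = sets,
	};
	int i, nvecs = affd.pre_vectors + affd.post_vectors;

	for (i = 0; i < affd.nr_sets; i++)
		nvecs += affd.sets[i];		/* min == max == set sum + reserved */

	result = pci_alloc_irq_vectors_affinity(pdev, nvecs, nvecs,
				PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);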


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Tue, Oct 30, 2018 at 08:53:37AM -0600, Jens Axboe wrote:
> The sum of the set can't exceed the nvecs passed in, the nvecs passed in
> should be the less than or equal to nvecs. Granted this isn't enforced,
> and perhaps that should be the case.

That should at least initially be true for a proper functioning
driver. It's not enforced as you mentioned, but that's only related to
the issue I'm referring to.

The problem is pci_alloc_irq_vectors_affinity() takes a range, min_vecs
and max_vecs, but a range of allowable vector allocations doesn't make
sense when using sets.


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Tue, Oct 30, 2018 at 08:36:35AM -0600, Jens Axboe wrote:
> On 10/30/18 8:26 AM, Keith Busch wrote:
> > On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> >> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> >> index f4f29b9d90ee..2046a0f0f0f1 100644
> >> --- a/kernel/irq/affinity.c
> >> +++ b/kernel/irq/affinity.c
> >> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> >> irq_affinity *affd)
> >>int curvec, usedvecs;
> >>cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
> >>struct cpumask *masks = NULL;
> >> +  int i, nr_sets;
> >>  
> >>/*
> >> * If there aren't any vectors left after applying the pre/post
> >> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct 
> >> irq_affinity *affd)
> >>get_online_cpus();
> >>build_node_to_cpumask(node_to_cpumask);
> >>  
> >> -  /* Spread on present CPUs starting from affd->pre_vectors */
> >> -  usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> >> -  node_to_cpumask, cpu_present_mask,
> >> -  nmsk, masks);
> >> +  /*
> >> +   * Spread on present CPUs starting from affd->pre_vectors. If we
> >> +   * have multiple sets, build each sets affinity mask separately.
> >> +   */
> >> +  nr_sets = affd->nr_sets;
> >> +  if (!nr_sets)
> >> +  nr_sets = 1;
> >> +
> >> +  for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> >> +  int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> >> +  int nr;
> >> +
> >> +  nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> >> +node_to_cpumask, cpu_present_mask,
> >> +nmsk, masks + usedvecs);
> >> +  usedvecs += nr;
> >> +  }
> > 
> > 
> > While the code below returns the appropriate number of possible vectors
> > when a set requested too many, the above code is still using the value
> > from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
> > 'masks + usedvecs' may go out of bounds.
> 
> How so? nvecs must the max number of vecs, the sum of the sets can't
> exceed that value.

'nvecs' is what irq_calc_affinity_vectors() returns, which is the min
of either the requested max or the sum of the set, and the sum of the set
isn't guaranteed to be the smaller value.
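A worked example of the overrun (numbers are illustrative only): with
maxvec = 8, no reserved vectors, and affd->sets = { 6, 6 },
irq_calc_affinity_vectors() returns min(12, 8) = 8, so

	masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);	/* 8 entries */

but the per-set loop still spreads 6 + 6 = 12 masks, so 'masks + usedvecs'
walks past the end of the allocation while handling the second set.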


Re: [PATCH 11/14] irq: add support for allocating (and affinitizing) sets of IRQs

2018-10-30 Thread Keith Busch
On Mon, Oct 29, 2018 at 10:37:35AM -0600, Jens Axboe wrote:
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index f4f29b9d90ee..2046a0f0f0f1 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -180,6 +180,7 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   int curvec, usedvecs;
>   cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
>   struct cpumask *masks = NULL;
> + int i, nr_sets;
>  
>   /*
>* If there aren't any vectors left after applying the pre/post
> @@ -210,10 +211,23 @@ irq_create_affinity_masks(int nvecs, const struct 
> irq_affinity *affd)
>   get_online_cpus();
>   build_node_to_cpumask(node_to_cpumask);
>  
> - /* Spread on present CPUs starting from affd->pre_vectors */
> - usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
> - node_to_cpumask, cpu_present_mask,
> - nmsk, masks);
> + /*
> +  * Spread on present CPUs starting from affd->pre_vectors. If we
> +  * have multiple sets, build each sets affinity mask separately.
> +  */
> + nr_sets = affd->nr_sets;
> + if (!nr_sets)
> + nr_sets = 1;
> +
> + for (i = 0, usedvecs = 0; i < nr_sets; i++) {
> + int this_vecs = affd->sets ? affd->sets[i] : affvecs;
> + int nr;
> +
> + nr = irq_build_affinity_masks(affd, curvec, this_vecs,
> +   node_to_cpumask, cpu_present_mask,
> +   nmsk, masks + usedvecs);
> + usedvecs += nr;
> + }


While the code below returns the appropriate number of possible vectors
when a set requested too many, the above code is still using the value
from the set, which may exceed 'nvecs' used to kcalloc 'masks', so
'masks + usedvecs' may go out of bounds.

  
>   /*
>* Spread on non present CPUs starting from the next vector to be
> @@ -258,13 +272,21 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, 
> const struct irq_affinity
>  {
>   int resv = affd->pre_vectors + affd->post_vectors;
>   int vecs = maxvec - resv;
> - int ret;
> + int set_vecs;
>  
>   if (resv > minvec)
>   return 0;
>  
> - get_online_cpus();
> - ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> - put_online_cpus();
> - return ret;
> + if (affd->nr_sets) {
> + int i;
> +
> + for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
> + set_vecs += affd->sets[i];
> + } else {
> + get_online_cpus();
> + set_vecs = cpumask_weight(cpu_possible_mask);
> + put_online_cpus();
> + }
> +
> + return resv + min(set_vecs, vecs);
>  }




Re: [Patch v1 0/7] mpt3sas: Hot-Plug Surprise removal support on IOC.

2018-09-05 Thread Keith Busch
On Wed, Sep 05, 2018 at 09:38:16AM +0200, Lukas Wunner wrote:
> On Wed, Sep 05, 2018 at 11:45:45AM +0530, Sreekanth Reddy wrote:
> > On Tue, Sep 4, 2018 at 3:12 PM, Lukas Wunner  wrote:
> > > Many scsi drivers call pci_channel_offline() to detect inaccessibility
> > > of the device due to a PCI error:
> > > https://elixir.bootlin.com/linux/v4.19-rc2/ident/pci_channel_offline
> > >
> > > A patch is pending such that surprise removal can also be queried
> > > with that same function:
> > > https://www.spinics.net/lists/linux-pci/msg75722.html
> > 
> > Lukas, thanks for pointing to this pci_dev_is_disconnected() API. So
> > we can use this API directly instead of reading the vendor Id and
> > checking for all one's once this patch get accepted?
> 
> Yes, except pci_dev_is_disconnected() is private to the PCI core,
> but dev->error_state and pci_channel_offline() is public.

The exported function to call is pci_device_is_present().
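For example, a removal or reset path can bail out before touching hardware
registers with something like the following (a sketch only; the 'ioc->pdev'
field name is a placeholder for wherever the driver keeps its pci_dev):

	if (!pci_device_is_present(ioc->pdev))
		return -ENODEV;		/* device is gone, skip MMIO/config access */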


Re: [PATCHv2 2/2] scsi: set timed out out mq requests to complete

2018-07-25 Thread Keith Busch
On Wed, Jul 25, 2018 at 03:52:17PM +, Bart Van Assche wrote:
> On Mon, 2018-07-23 at 08:37 -0600, Keith Busch wrote:
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 8932ae81a15a..2715cdaa669c 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -296,6 +296,20 @@ enum blk_eh_timer_return scsi_times_out(struct request 
> > *req)
> > rtn = host->hostt->eh_timed_out(scmd);
> >  
> > if (rtn == BLK_EH_DONE) {
> > +   /*
> > +* For blk-mq, we must set the request state to complete now
> > +* before sending the request to the scsi error handler. This
> > +* will prevent a use-after-free in the event the LLD manages
> > +* to complete the request before the error handler finishes
> > +* processing this timed out request.
> > +*
> > +* If the request was already completed, then the LLD beat the
> > +* time out handler from transferring the request to the scsi
> > +* error handler. In that case we can return immediately as no
> > +* further action is required.
> > +*/
> > +   if (req->q->mq_ops && !blk_mq_mark_complete(req))
> > +   return rtn;
> > if (scsi_abort_command(scmd) != SUCCESS) {
> > set_host_byte(scmd, DID_TIME_OUT);
> > scsi_eh_scmd_add(scmd);
> 
> Hello Keith,
> 
> What will happen if a completion occurs after scsi_times_out() has started and
> before or during the host->hostt->eh_timed_out()? Can that cause a 
> use-after-free
> in .eh_timed_out()? Can that cause .eh_timed_out() to return 
> BLK_EH_RESET_TIMER
> when it should return BLK_EH_DONE? Can that cause blk_mq_rq_timed_out() to 
> call
> blk_add_timer() when that function shouldn't be called?

That's what the request's refcount protects. The whole point was that a
driver returning RESET_TIMER doesn't lose the completion. In the worst
case, the blk-mq timeout work spends some CPU cycles re-arming a timer
that it didn't need to.
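Roughly, the timeout path holds a reference on the request around the
handler, so a racing completion can't free it underneath .eh_timed_out().
A simplified sketch of the idea, not the exact blk-mq code:

	if (!refcount_inc_not_zero(&rq->ref))
		return;				/* already freed, nothing to time out */
	blk_mq_rq_timed_out(rq, reserved);	/* may race with a real completion */
	if (refcount_dec_and_test(&rq->ref))
		__blk_mq_free_request(rq);	/* freed only once both sides are done */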


Re: [PATCH 13/14] blk-mq: Remove generation seqeunce

2018-05-23 Thread Keith Busch
On Wed, May 23, 2018 at 02:19:40PM +0200, Christoph Hellwig wrote:
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> - __releases(hctx->srcu)
> -{
> - if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> - rcu_read_unlock();
> - else
> - srcu_read_unlock(hctx->srcu, srcu_idx);
> -}
> -
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> - __acquires(hctx->srcu)
> -{
> - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> - /* shut up gcc false positive */
> - *srcu_idx = 0;
> - rcu_read_lock();
> - } else
> - *srcu_idx = srcu_read_lock(hctx->srcu);
> -}

I may have been too ambitious in removing hctx_lock. That actually has
nothing to do with the completion issues, but it is necessary for quiesce
to work reliably. Otherwise, I still think there is some promise to the
rest of the approach.
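The dependency is that blk_mq_quiesce_queue() waits out exactly the RCU/SRCU
read-side sections that hctx_lock()/hctx_unlock() open around dispatch.
Simplified from the upstream code of that era:

	void blk_mq_quiesce_queue(struct request_queue *q)
	{
		struct blk_mq_hw_ctx *hctx;
		unsigned int i;
		bool rcu = false;

		blk_mq_quiesce_queue_nowait(q);
		queue_for_each_hw_ctx(q, hctx, i) {
			if (hctx->flags & BLK_MQ_F_BLOCKING)
				synchronize_srcu(hctx->srcu);	/* waits for hctx_lock() holders */
			else
				rcu = true;
		}
		if (rcu)
			synchronize_rcu();
	}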


[PATCH 1/3] blk-mq: Allow PCI vector offset for mapping queues

2018-03-27 Thread Keith Busch
The PCI interrupt vectors intended to be associated with a queue may
not start at 0; a driver may allocate pre_vectors for special use. This
patch adds an offset parameter so blk-mq may find the intended affinity
mask and updates all drivers using this API accordingly.

Cc: Don Brace 
Cc: 
Cc: 
Signed-off-by: Keith Busch 
---
v1 -> v2:

  Update blk-mq API directly instead of chaining a default parameter to
  a new API, and update all drivers accordingly.

 block/blk-mq-pci.c| 6 --
 drivers/nvme/host/pci.c   | 2 +-
 drivers/scsi/qla2xxx/qla_os.c | 2 +-
 drivers/scsi/smartpqi/smartpqi_init.c | 2 +-
 include/linux/blk-mq-pci.h| 3 ++-
 5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index 76944e3271bf..e233996bb76f 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -21,6 +21,7 @@
  * blk_mq_pci_map_queues - provide a default queue mapping for PCI device
  * @set:   tagset to provide the mapping for
  * @pdev:  PCI device associated with @set.
+ * @offset:Offset to use for the pci irq vector
  *
  * This function assumes the PCI device @pdev has at least as many available
  * interrupt vectors as @set has queues.  It will then query the vector
@@ -28,13 +29,14 @@
  * that maps a queue to the CPUs that have irq affinity for the corresponding
  * vector.
  */
-int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev)
+int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
+   int offset)
 {
const struct cpumask *mask;
unsigned int queue, cpu;
 
for (queue = 0; queue < set->nr_hw_queues; queue++) {
-   mask = pci_irq_get_affinity(pdev, queue);
+   mask = pci_irq_get_affinity(pdev, queue + offset);
if (!mask)
goto fallback;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e3b9efca0571 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -414,7 +414,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 {
struct nvme_dev *dev = set->driver_data;
 
-   return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev));
+   return blk_mq_pci_map_queues(set, to_pci_dev(dev->dev), 0);
 }
 
 /**
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 12ee6e02d146..2c705f3dd265 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -6805,7 +6805,7 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
if (USER_CTRL_IRQ(vha->hw))
rc = blk_mq_map_queues(&shost->tag_set);
else
-   rc = blk_mq_pci_map_queues(&shost->tag_set, vha->hw->pdev);
+   rc = blk_mq_pci_map_queues(&shost->tag_set, vha->hw->pdev, 0);
return rc;
 }
 
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c 
b/drivers/scsi/smartpqi/smartpqi_init.c
index b2880c7709e6..10c94011c8a8 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -5348,7 +5348,7 @@ static int pqi_map_queues(struct Scsi_Host *shost)
 {
struct pqi_ctrl_info *ctrl_info = shost_to_hba(shost);
 
-   return blk_mq_pci_map_queues(&shost->tag_set, ctrl_info->pci_dev);
+   return blk_mq_pci_map_queues(&shost->tag_set, ctrl_info->pci_dev, 0);
 }
 
 static int pqi_getpciinfo_ioctl(struct pqi_ctrl_info *ctrl_info,
diff --git a/include/linux/blk-mq-pci.h b/include/linux/blk-mq-pci.h
index 6338551e0fb9..9f4c17f0d2d8 100644
--- a/include/linux/blk-mq-pci.h
+++ b/include/linux/blk-mq-pci.h
@@ -5,6 +5,7 @@
 struct blk_mq_tag_set;
 struct pci_dev;
 
-int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev);
+int blk_mq_pci_map_queues(struct blk_mq_tag_set *set, struct pci_dev *pdev,
+ int offset);
 
 #endif /* _LINUX_BLK_MQ_PCI_H */
-- 
2.14.3
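As a usage sketch of the new offset parameter (the 'foo' driver below is
hypothetical): a driver that reserved one pre_vector for its own use, e.g.
an admin or reserved interrupt, would map its blk-mq queues starting at
PCI vector 1:

	static int foo_map_queues(struct blk_mq_tag_set *set)
	{
		struct foo_dev *fdev = set->driver_data;

		/* skip the single pre_vector the driver allocated for itself */
		return blk_mq_pci_map_queues(set, fdev->pdev, 1);
	}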



[LSF/MM TOPIC] blk-mq priority based hctx selection

2018-01-15 Thread Keith Busch
For the storage track, I would like to propose a topic for differentiated
blk-mq hardware contexts. Today, blk-mq considers all hardware contexts
equal, and they are selected purely based on the submitting CPU's context.
There are use cases that benefit from having hardware context selection
criteria beyond which CPU a process happens to be running on.

One example is exclusive polling for latency-sensitive use cases.
Mixing polled and non-polled requests in the same context loses part of
the benefit when interrupts unnecessarily occur, and coalescing tricks
to mitigate this have undesirable side effects during times when
HIPRI commands are not issued.

Another example is hardware priority queues, where not all command
queues the hardware provides are equal to one another. Many newer storage
controllers provide such queues with different QoS guarantees, and Linux
currently does not make use of this feature.

The talk would like to discuss potential new blk-mq APIs needed for
a block driver to register its different priority queues, changes to
blk-mq hwctx selection, and implications for low level drivers that
utilize IRQ affinity to set up current mappings.

Thanks,
Keith
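Purely as a strawman to anchor the discussion, the kind of interface in mind
might look like the following. None of this exists today; every name below
is hypothetical:

	/* Hypothetical: per-class queue maps a driver could register. */
	enum blk_mq_hctx_class {
		BLK_MQ_HCTX_DEFAULT,	/* interrupt driven, normal priority */
		BLK_MQ_HCTX_POLL,	/* reserved for HIPRI/polled I/O */
		BLK_MQ_HCTX_HIGH_PRIO,	/* hardware queue with elevated QoS */
	};

	struct blk_mq_class_map {	/* hypothetical extension */
		unsigned int nr_queues;
		enum blk_mq_hctx_class class;
	};

	/* blk-mq would then pick an hctx from both the submitting CPU and
	 * the request's class, rather than from the CPU alone. */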


Re: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-11 Thread Keith Busch
On Wed, Jan 10, 2018 at 03:14:40PM -0700, Sathya Prakash Veerichetty wrote:
> In the case of RAID controllers, all of those drives and RAID volumes
> are exposed to the OS as generic SCSI devices

So even when used as a RAID member, there will be a device handle at
/dev/sdX for each NVMe device the megaraid controller manages?


Re: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-09 Thread Keith Busch
On Tue, Jan 09, 2018 at 03:50:44PM -0500, Douglas Gilbert wrote:
> Have you tried to do any serious work with  and
> say compared it with FreeBSD and Microsoft's approach? No prize for
> guessing which one is worst (and least extensible). Looks like the
> Linux pass-through was at the end of a ToDo list and was "designed"
> at 5 a.m in the morning.

What the heck are you talking about? FreeBSD's NVMe passthrough is near
identical to Linux, and Linux's existed years prior.
 
You're not even touching the nvme subsystem, so why are you copying the
linux-nvme mailing list to help you with a non-NVMe device? Please take
your ignorant and dubious claims elsewhere.


Re: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-08 Thread Keith Busch
On Tue, Aug 08, 2017 at 12:33:40PM +0530, Sreekanth Reddy wrote:
> On Tue, Aug 8, 2017 at 9:34 AM, Keith Busch  wrote:
> >
> > It looks like they can make existing nvme tooling work with little
> > effort if they have the driver implement NVME_IOCTL_ADMIN_COMMAND, and
> 
> Precisely, I was thinking on the same line and you clarified. I will
> spend sometime on this and get back to you.

Sounds good. Just note that while the majority of NVMe management should
be reachable with just that IOCTL, some tooling features may not work
correctly since they look for /dev/nvmeX, and I assume this driver will
create /dev/sdX instead.
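For reference, the kind of userspace call that would start working once the
driver wires that up (a sketch; it assumes the HBA driver accepts
NVME_IOCTL_ADMIN_COMMAND on its /dev/sdX nodes, which is exactly what is
being proposed above):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/nvme_ioctl.h>

	/* Issue an Identify Controller (opcode 0x06, CNS 1) through the HBA. */
	int identify_ctrl(const char *path, void *buf)	/* buf: 4096 bytes */
	{
		struct nvme_admin_cmd cmd = {
			.opcode   = 0x06,
			.addr     = (uint64_t)(uintptr_t)buf,
			.data_len = 4096,
			.cdw10    = 1,
		};
		int fd = open(path, O_RDONLY), ret;

		if (fd < 0)
			return -1;
		ret = ioctl(fd, NVME_IOCTL_ADMIN_COMMAND, &cmd);
		close(fd);
		return ret;
	}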


Re: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Keith Busch
On Mon, Aug 07, 2017 at 08:45:25AM -0700, James Bottomley wrote:
> On Mon, 2017-08-07 at 20:01 +0530, Kashyap Desai wrote:
> > 
> > We have to attempt this use case and see how it behaves. I have not
> > tried this, so not sure if things are really bad or just some tuning
> > may be helpful. I will revert back to you on this.
> > 
> > I understood request as -  We need some udev rules to be working well
> > for *same* NVME drives if it is behind  or native .
> > Example - If user has OS installed on NVME drive which is behind
> >  driver as SCSI disk should be able to boot if he/she hooked
> > same NVME drive which is detected by native  driver (and vice
> > versa.)
> 
> It's not just the udev rules, it's the tools as well; possibly things
> like that nvme-cli toolkit Intel is doing.

It looks like they can make existing nvme tooling work with little
effort if they have the driver implement NVME_IOCTL_ADMIN_COMMAND, and
then have their driver build the MPI NVMe Encapsulated Request from that.


Re: [PATCH 02/13] mpt3sas: SGL to PRP Translation for I/Os to NVMe devices

2017-07-11 Thread Keith Busch
On Tue, Jul 11, 2017 at 01:55:02AM -0700, Suganath Prabu S wrote:
> +/**
> + * _base_check_pcie_native_sgl - This function is called for PCIe end 
> devices to
> + * determine if the driver needs to build a native SGL.  If so, that native
> + * SGL is built in the special contiguous buffers allocated especially for
> + * PCIe SGL creation.  If the driver will not build a native SGL, return
> + * TRUE and a normal IEEE SGL will be built.  Currently this routine
> + * supports NVMe.
> + * @ioc: per adapter object
> + * @mpi_request: mf request pointer
> + * @smid: system request message index
> + * @scmd: scsi command
> + * @pcie_device: points to the PCIe device's info
> + *
> + * Returns 0 if native SGL was built, 1 if no SGL was built
> + */
> +static int
> +_base_check_pcie_native_sgl(struct MPT3SAS_ADAPTER *ioc,
> + Mpi25SCSIIORequest_t *mpi_request, u16 smid, struct scsi_cmnd *scmd,
> + struct _pcie_device *pcie_device)
> +{



> + /* Return 0, indicating we built a native SGL. */
> + return 1;
> +}

This function doesn't return 0 ever. Not sure why it's here.

Curious about your device, though: if an NVMe native SGL can *not* be
built, does the HBA firmware then buffer it in its local memory before
sending/receiving to/from the host?

And if a native SGL can be built, does the NVMe target DMA directly
to/from host memory, giving a performance boost?


Re: [PATCH 02/13] mpt3sas: SGL to PRP Translation for I/Os to NVMe devices

2017-07-11 Thread Keith Busch
On Tue, Jul 11, 2017 at 01:55:02AM -0700, Suganath Prabu S wrote:
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h 
> b/drivers/scsi/mpt3sas/mpt3sas_base.h
> index 60fa7b6..cebdd8e 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.h
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
> @@ -54,6 +54,7 @@
>  #include "mpi/mpi2_raid.h"
>  #include "mpi/mpi2_tool.h"
>  #include "mpi/mpi2_sas.h"
> +#include "mpi/mpi2_pci.h"

Could you adjust your patch order for this series so each patch can compile? Here
in patch 2 you're including a header that's not defined until patch 12.


Re: [PATCH 4/5] nvme: move the retries count to struct nvme_request

2017-04-05 Thread Keith Busch
On Wed, Apr 05, 2017 at 04:18:55PM +0200, Christoph Hellwig wrote:
> The way NVMe uses this field is entirely different from the older
> SCSI/BLOCK_PC usage, so move it into struct nvme_request.
> 
> Also reduce the size of the file to a unsigned char so that we leave space
> for additional smaller fields that will appear soon.

What new fields can we expect? Why temporarily pad the middle of the
struct instead of appending to the end? The "result" packed nicely
on 64-bit, at least.

>  struct nvme_request {
>   struct nvme_command *cmd;
> + u8  retries;
>   union nvme_result   result;
>  };


Re: hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 01:21:29PM -0500, Mike Snitzer wrote:
> Then undeprecate them.  Decisions like marking a path checker deprecated
> were _not_ made with NVMe in mind.  They must predate NVMe.
> 
> multipath-tools has tables that specify all the defaults for a given
> target backend.  NVMe will just be yet another.  Yes some user _could_
> shoot themselves in the foot by overriding the proper configuration but
> since when are we motivated by _not_ giving users the power to hang
> themselves?
> 
> As for configurability (chosing between N valid configs/settings): At
> some point the user will want one behaviour vs another.  Thinking
> otherwise is just naive.  Think error timeouts, etc.  Any multipath
> kernel implementation (which dm-multipath is BTW) will eventually find
> itself at a crossroads where the underlying fabric could be tweaked in
> different ways.  Thinking you can just hardcode these attributes and
> settings is foolish.

Roger that, and I absolutely want to see this work with the existing
framework.

I just think it'd be easier for everyone if multipath were more like
the generic block layer, in that devices are surfaced with configurable
policies without userspace telling it which to use. The kernel knowing
safe defaults for a particular device is probably the more common case,
and userspace can still tune them as needed. Of course, I accept you're
in a better position to know if this is folly.


Re: hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 05:37:41PM +, Bart Van Assche wrote:
> On Thu, 2017-02-16 at 12:38 -0500, Keith Busch wrote:
> > Maybe I'm not seeing the bigger picture. Is there some part to multipath
> > that the kernel is not in a better position to handle?
> 
> Does this mean that the code to parse /etc/multipath.conf will be moved into
> the kernel? Or will we lose the ability to configure the policies that
> /etc/multipath.conf allows to configure?

No, I'm just considering the settings for a device that won't work
at all if multipath.conf is wrong. For example, the uuid attributes,
path priority, or path checker. These can't be considered configurable
policies if all but one of them are invalid for a specific device type.

It shouldn't even be an option to let a user select TUR path checker
for NVMe, and the only checkers multipath-tools provide that even make
sense for NVMe are deprecated.


Re: hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 10:13:37AM -0500, Mike Snitzer wrote:
> On Thu, Feb 16 2017 at  9:26am -0500,
> Christoph Hellwig  wrote:
>
> > just a little new code in the block layer, and a move of the path
> > selectors from dm to the block layer.  I would not call this
> > fragmentation.
> 
> I'm fine with the path selectors getting moved out; maybe it'll
> encourage new path selectors to be developed.
> 
> But there will need to be some userspace interface stood up to support
> your native NVMe multipathing (you may not think it needed but think in
> time there will be a need to configure _something_).  That is the
> fragmentation I'm referring to.

I'm not sure what Christoph's proposal looks like, but I have to agree
that multipath support directly in the kernel without requiring user
space to setup the mpath block device is easier for everyone. The only
NVMe specific part, though, just needs to be how it reports unique
identifiers to the multipath layer.

Maybe I'm not seeing the bigger picture. Is there some part to multipath
that the kernel is not in a better position to handle?


Re: [PATCHv2] scsi: use 'scsi_device_from_queue()' for scsi_dh

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 12:05:19PM -0500, Keith Busch wrote:
> On Thu, Feb 16, 2017 at 04:12:23PM +0100, Hannes Reinecke wrote:
> > The device handler needs to check if a given queue belongs to
> > a scsi device; only then does it make sense to attach a device
> > handler.
> > 
> > Signed-off-by: Hannes Reinecke 
> 
> The thing I don't like is that this still has dm-mpath directly calling
> into scsi. I don't think dm-mpath has any business calling protocol
> specific APIs, and doesn't help other protocols with similar needs.
> 
> Can we solve this with an indirection layer instead? 
> 
> (untested; just checking if this direction is preferable)

Eh, sorry, I accidentally included unrelated stuff in the previous mail. Here's
what I meant to send:

---
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3570bcb..28310a2 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -847,7 +847,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
 
if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
-   attached_handler_name = scsi_dh_attached_handler_name(q, 
GFP_KERNEL);
+   attached_handler_name = blk_dh_attached_handler_name(q, 
GFP_KERNEL);
if (attached_handler_name) {
/*
 * Clear any hw_handler_params associated with a
@@ -870,7 +870,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
}
 
if (m->hw_handler_name) {
-   r = scsi_dh_attach(q, m->hw_handler_name);
+   r = blk_dh_attach(q, m->hw_handler_name);
if (r == -EBUSY) {
char b[BDEVNAME_SIZE];
 
@@ -885,7 +885,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
}
 
if (m->hw_handler_params) {
-   r = scsi_dh_set_params(q, m->hw_handler_params);
+   r = blk_dh_set_params(q, m->hw_handler_params);
if (r < 0) {
ti->error = "unable to set hardware "
"handler parameters";
@@ -1526,7 +1526,7 @@ static void activate_path(struct work_struct *work)
struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
 
if (pgpath->is_active && !blk_queue_dying(q))
-   scsi_dh_activate(q, pg_init_done, pgpath);
+   blk_dh_activate(q, pg_init_done, pgpath);
else
pg_init_done(pgpath, SCSI_DH_DEV_OFFLINED);
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9e1e14..e23c7d5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2028,6 +2028,13 @@ static u64 scsi_calculate_bounce_limit(struct Scsi_Host 
*shost)
return bounce_limit;
 }
 
+static struct dh_ops scsi_dh_ops = {
+   .dh_activate = scsi_dh_activate,
+   .dh_attach = scsi_dh_attach,
+   .dh_attached_handler_name = scsi_dh_attached_handler_name,
+   .dh_set_params = scsi_dh_set_params,
+};
+
 static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
struct device *dev = shost->dma_dev;
@@ -2062,6 +2069,8 @@ static void __scsi_init_queue(struct Scsi_Host *shost, 
struct request_queue *q)
 * blk_queue_update_dma_alignment() later.
 */
blk_queue_dma_alignment(q, 0x03);
+
+   q->dh_ops = &scsi_dh_ops;
 }
 
 struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ca8e8f..5899768 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -378,6 +378,14 @@ static inline int blkdev_reset_zones_ioctl(struct 
block_device *bdev,
 
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+typedef void (*activate_complete)(void *, int);
+struct dh_ops {
+   int (*dh_activate)(struct request_queue *, activate_complete, void *);
+   int (*dh_attach)(struct request_queue *, const char *);
+   const char *(*dh_attached_handler_name)(struct request_queue *, gfp_t);
+   int (*dh_set_params)(struct request_queue *, const char *);
+};
+
 struct request_queue {
/*
 * Together with queue_head for cacheline sharing
@@ -408,6 +416,7 @@ struct request_queue {
lld_busy_fn *lld_busy_fn;
 
struct blk_mq_ops   *mq_ops;
+   struct dh_ops   *dh_ops;
 
unsigned int*mq_map;
 
@@ -572,6 +581,35 @@ struct request_queue {
boolmq_sysfs_init_done;
 };
 
+static inline int blk_dh_activate(struct request_queue *q, activate_complete 
fn, void *data)
+{
+   if (q->dh_ops && q->dh_ops->dh_activate)
+   return q
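The mail is cut off above; the wrappers being sketched would presumably
continue along these lines (an illustrative reconstruction, not the author's
actual text; the fallback behaviours are guesses):

	static inline int blk_dh_activate(struct request_queue *q,
					  activate_complete fn, void *data)
	{
		if (q->dh_ops && q->dh_ops->dh_activate)
			return q->dh_ops->dh_activate(q, fn, data);
		fn(data, 0);		/* hypothetical: no handler, report success */
		return 0;
	}

	static inline int blk_dh_attach(struct request_queue *q, const char *name)
	{
		if (q->dh_ops && q->dh_ops->dh_attach)
			return q->dh_ops->dh_attach(q, name);
		return -EOPNOTSUPP;	/* hypothetical fallback */
	}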

Re: [PATCHv2] scsi: use 'scsi_device_from_queue()' for scsi_dh

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 04:12:23PM +0100, Hannes Reinecke wrote:
> The device handler needs to check if a given queue belongs to
> a scsi device; only then does it make sense to attach a device
> handler.
> 
> Signed-off-by: Hannes Reinecke 

The thing I don't like is that this still has dm-mpath directly calling
into scsi. I don't think dm-mpath has any business calling protocol
specific APIs, and doesn't help other protocols with similar needs.

Can we solve this with an indirection layer instead? 

(untested; just checking if this direction is preferable)
---
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3570bcb..28310a2 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -847,7 +847,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
 
if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
-   attached_handler_name = scsi_dh_attached_handler_name(q, 
GFP_KERNEL);
+   attached_handler_name = blk_dh_attached_handler_name(q, 
GFP_KERNEL);
if (attached_handler_name) {
/*
 * Clear any hw_handler_params associated with a
@@ -870,7 +870,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
}
 
if (m->hw_handler_name) {
-   r = scsi_dh_attach(q, m->hw_handler_name);
+   r = blk_dh_attach(q, m->hw_handler_name);
if (r == -EBUSY) {
char b[BDEVNAME_SIZE];
 
@@ -885,7 +885,7 @@ static struct pgpath *parse_path(struct dm_arg_set *as, 
struct path_selector *ps
}
 
if (m->hw_handler_params) {
-   r = scsi_dh_set_params(q, m->hw_handler_params);
+   r = blk_dh_set_params(q, m->hw_handler_params);
if (r < 0) {
ti->error = "unable to set hardware "
"handler parameters";
@@ -1526,7 +1526,7 @@ static void activate_path(struct work_struct *work)
struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
 
if (pgpath->is_active && !blk_queue_dying(q))
-   scsi_dh_activate(q, pg_init_done, pgpath);
+   blk_dh_activate(q, pg_init_done, pgpath);
else
pg_init_done(pgpath, SCSI_DH_DEV_OFFLINED);
 }
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 06d2ed4..49a061a 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -107,7 +107,7 @@ static int dpc_probe(struct pcie_device *dev)
int status;
u16 ctl, cap;
 
-   dpc = kzalloc(&dev->device, sizeof(*dpc), GFP_KERNEL);
+   dpc = kzalloc(sizeof(*dpc), GFP_KERNEL);
if (!dpc)
return -ENOMEM;
 
@@ -116,7 +116,7 @@ static int dpc_probe(struct pcie_device *dev)
INIT_WORK(&dpc->work, interrupt_event_handler);
set_service_data(dev, dpc);
 
-   status = request_irq(&dev->device, dev->irq, dpc_irq, IRQF_SHARED,
+   status = request_irq(dev->irq, dpc_irq, IRQF_SHARED,
  "pcie-dpc", dpc);
if (status) {
dev_warn(&dev->device, "request IRQ%d failed: %d\n", dev->irq,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e9e1e14..e23c7d5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2028,6 +2028,13 @@ static u64 scsi_calculate_bounce_limit(struct Scsi_Host 
*shost)
return bounce_limit;
 }
 
+static struct dh_ops scsi_dh_ops = {
+   .dh_activate = scsi_dh_activate,
+   .dh_attach = scsi_dh_attach,
+   .dh_attached_handler_name = scsi_dh_attached_handler_name,
+   .dh_set_params = scsi_dh_set_params,
+};
+
 static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
struct device *dev = shost->dma_dev;
@@ -2062,6 +2069,8 @@ static void __scsi_init_queue(struct Scsi_Host *shost, 
struct request_queue *q)
 * blk_queue_update_dma_alignment() later.
 */
blk_queue_dma_alignment(q, 0x03);
+
+   q->dh_ops = &scsi_dh_ops;
 }
 
 struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ca8e8f..5899768 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -378,6 +378,14 @@ static inline int blkdev_reset_zones_ioctl(struct 
block_device *bdev,
 
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+typedef void (*activate_complete)(void *, int);
+struct dh_ops {
+   int (*dh_activate)(struct request_queue *, activate_complete, void *);
+   int (*dh_attach)(struct request_queue *, const char *);
+   const char *(*dh_attached_handler_name)(struct request_queue *, gfp_t);
+   int (*dh_set_params)(struct request_queue *, const char *);
+};
+
 struct request_queue {
/*
 * Togethe

Re: [PATCH 1/2] nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK

2016-11-15 Thread Keith Busch
On Tue, Nov 15, 2016 at 11:11:58AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
> specific values. No functional change.
>
> Signed-off-by: Omar Sandoval 

Yeah, we've been depending on the values of BLK_MQ_RQ_QUEUE_[ERROR|BUSY]
not being zero without this. Looks good.

Reviewed-by: Keith Busch 

> ---
>  drivers/nvme/host/core.c   | 4 ++--
>  drivers/nvme/host/pci.c| 8 
>  drivers/nvme/host/rdma.c   | 2 +-
>  drivers/nvme/target/loop.c | 6 +++---
>  4 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 53584d2..e54bb10 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -269,7 +269,7 @@ static inline int nvme_setup_discard(struct nvme_ns *ns, 
> struct request *req,
>*/
>   req->__data_len = nr_bytes;
>  
> - return 0;
> + return BLK_MQ_RQ_QUEUE_OK;
>  }
>  
>  static inline void nvme_setup_rw(struct nvme_ns *ns, struct request *req,
> @@ -317,7 +317,7 @@ static inline void nvme_setup_rw(struct nvme_ns *ns, 
> struct request *req,
>  int nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
>   struct nvme_command *cmd)
>  {
> - int ret = 0;
> + int ret = BLK_MQ_RQ_QUEUE_OK;
>  
>   if (req->cmd_type == REQ_TYPE_DRV_PRIV)
>   memcpy(cmd, nvme_req(req)->cmd, sizeof(*cmd));
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 51d13d5..d58f8e4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -328,7 +328,7 @@ static int nvme_init_iod(struct request *rq, unsigned 
> size,
>   rq->retries = 0;
>   rq->rq_flags |= RQF_DONTPREP;
>   }
> - return 0;
> + return BLK_MQ_RQ_QUEUE_OK;
>  }
>  
>  static void nvme_free_iod(struct nvme_dev *dev, struct request *req)
> @@ -598,17 +598,17 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  
>   map_len = nvme_map_len(req);
>   ret = nvme_init_iod(req, map_len, dev);
> - if (ret)
> + if (ret != BLK_MQ_RQ_QUEUE_OK)
>   return ret;
>  
>   ret = nvme_setup_cmd(ns, req, &cmnd);
> - if (ret)
> + if (ret != BLK_MQ_RQ_QUEUE_OK)
>   goto out;
>  
>   if (req->nr_phys_segments)
>   ret = nvme_map_data(dev, req, map_len, &cmnd);
>  
> - if (ret)
> + if (ret != BLK_MQ_RQ_QUEUE_OK)
>   goto out;
>  
>   cmnd.common.command_id = req->tag;
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index c4700ef..ff1b34060 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1395,7 +1395,7 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   sizeof(struct nvme_command), DMA_TO_DEVICE);
>  
>   ret = nvme_setup_cmd(ns, rq, c);
> - if (ret)
> + if (ret != BLK_MQ_RQ_QUEUE_OK)
>   return ret;
>  
>   c->common.command_id = rq->tag;
> diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
> index 26aa3a5..be56d05 100644
> --- a/drivers/nvme/target/loop.c
> +++ b/drivers/nvme/target/loop.c
> @@ -169,7 +169,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>   int ret;
>  
>   ret = nvme_setup_cmd(ns, req, &iod->cmd);
> - if (ret)
> + if (ret != BLK_MQ_RQ_QUEUE_OK)
>   return ret;
>  
>   iod->cmd.common.flags |= NVME_CMD_SGL_METABUF;
> @@ -179,7 +179,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>   nvme_cleanup_cmd(req);
>   blk_mq_start_request(req);
>   nvme_loop_queue_response(&iod->req);
> - return 0;
> + return BLK_MQ_RQ_QUEUE_OK;
>   }
>  
>   if (blk_rq_bytes(req)) {
> @@ -198,7 +198,7 @@ static int nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>   blk_mq_start_request(req);
>  
>   schedule_work(&iod->work);
> - return 0;
> + return BLK_MQ_RQ_QUEUE_OK;
>  }
>  
>  static void nvme_loop_submit_async_event(struct nvme_ctrl *arg, int aer_idx)
> -- 
> 2.10.2
> 
> 


Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-28 Thread Keith Busch
On Fri, Oct 28, 2016 at 11:51:35AM -0700, Bart Van Assche wrote:
> I think it is wrong that kicking the requeue list starts stopped queues
> because this makes it impossible to stop request processing without setting
> an additional flag next to BLK_MQ_S_STOPPED. Can you have a look at the
> attached two patches? These patches survive my dm-multipath and SCSI tests.

Hi Bart,

These look good to me, and are successful in my NVMe tests.

Thanks,
Keith


> From e93799f726485a398837c992c5c0068d7180 Mon Sep 17 00:00:00 2001
> From: Bart Van Assche 
> Date: Fri, 28 Oct 2016 10:48:58 -0700
> Subject: [PATCH 1/2] block: Avoid that requeueing starts stopped queues
> 
> Since blk_mq_requeue_work() starts stopped queues and since
> execution of this function can be scheduled after a queue has
> been stopped it is not possible to stop queues without using
> an additional state variable to track whether or not the queue
> has been stopped. Hence modify blk_mq_requeue_work() such that it
> does not start stopped queues. My conclusion after a review of
> the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
> callers is as follows:
> * In the dm driver starting and stopping queues should only happen
>   if __dm_suspend() or __dm_resume() is called and not if the
>   requeue list is processed.
> * In the SCSI core queue stopping and starting should only be
>   performed by the scsi_internal_device_block() and
>   scsi_internal_device_unblock() functions but not by any other
>   function.
> * In the NVMe core only the functions that call
>   blk_mq_start_stopped_hw_queues() explicitly should start stopped
>   queues.
> * A blk_mq_start_stopped_hwqueues() call must be added in the
>   xen-blkfront driver in its blkif_recover() function.
> ---
>  block/blk-mq.c   | 6 +-
>  drivers/block/xen-blkfront.c | 1 +
>  drivers/md/dm-rq.c   | 7 +--
>  drivers/scsi/scsi_lib.c  | 1 -
>  4 files changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a49b8af..24dfd0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -528,11 +528,7 @@ static void blk_mq_requeue_work(struct work_struct *work)
>   blk_mq_insert_request(rq, false, false, false);
>   }
>  
> - /*
> -  * Use the start variant of queue running here, so that running
> -  * the requeue work will kick stopped queues.
> -  */
> - blk_mq_start_hw_queues(q);
> + blk_mq_run_hw_queues(q, false);
>  }
>  
>  void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 1ca702d..a3e1727 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -2045,6 +2045,7 @@ static int blkif_recover(struct blkfront_info *info)
>   BUG_ON(req->nr_phys_segments > segs);
>   blk_mq_requeue_request(req, false);
>   }
> + blk_mq_start_stopped_hwqueues(info->rq);
>   blk_mq_kick_requeue_list(info->rq);
>  
>   while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 107ed19..b951ae83 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -326,12 +326,7 @@ static void dm_old_requeue_request(struct request *rq)
>  
>  static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long 
> msecs)
>  {
> - unsigned long flags;
> -
> - spin_lock_irqsave(q->queue_lock, flags);
> - if (!blk_mq_queue_stopped(q))
> - blk_mq_delay_kick_requeue_list(q, msecs);
> - spin_unlock_irqrestore(q->queue_lock, flags);
> + blk_mq_delay_kick_requeue_list(q, msecs);
>  }
>  
>  void dm_mq_kick_requeue_list(struct mapped_device *md)
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 4cddaff..94f54ac 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1939,7 +1939,6 @@ static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  out:
>   switch (ret) {
>   case BLK_MQ_RQ_QUEUE_BUSY:
> - blk_mq_stop_hw_queue(hctx);
>   if (atomic_read(&sdev->device_busy) == 0 &&
>   !scsi_device_blocked(sdev))
>   blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
> -- 
> 2.10.1
> 

> From 47eec3bdcf4b673e3ab606543cb3acdf7f4de593 Mon Sep 17 00:00:00 2001
> From: Bart Van Assche 
> Date: Fri, 28 Oct 2016 10:50:04 -0700
> Subject: [PATCH 2/2] blk-mq: Remove blk_mq_cancel_requeue_work()
> 
> Since blk_mq_requeue_work() no longer restarts stopped queues
> canceling requeue work is no longer needed to prevent that a
> stopped queue would be restarted. Hence remove this function.
> ---
>  block/blk-mq.c   | 6 --
>  drivers/md/dm-rq.c   | 2 --
>  drivers/nvme/host/core.c | 1 -
>  include/linux/blk-mq.h   | 1 -
>  4 files changed, 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 24dfd0d..1aa79e5 100644
> --- a/block/

Re: [PATCH 11/12] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-28 Thread Keith Busch
On Wed, Oct 26, 2016 at 03:56:04PM -0700, Bart Van Assche wrote:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7bb73ba..b662416 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -205,7 +205,7 @@ void nvme_requeue_req(struct request *req)
>  
>   blk_mq_requeue_request(req, false);
>   spin_lock_irqsave(req->q->queue_lock, flags);
> - if (!blk_queue_stopped(req->q))
> + if (!blk_mq_queue_stopped(req->q))
>   blk_mq_kick_requeue_list(req->q);
>   spin_unlock_irqrestore(req->q->queue_lock, flags);
>  }
> @@ -2079,10 +2079,6 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
>  
>   mutex_lock(&ctrl->namespaces_mutex);
>   list_for_each_entry(ns, &ctrl->namespaces, list) {
> - spin_lock_irq(ns->queue->queue_lock);
> - queue_flag_set(QUEUE_FLAG_STOPPED, ns->queue);
> - spin_unlock_irq(ns->queue->queue_lock);
> -
>   blk_mq_cancel_requeue_work(ns->queue);
>   blk_mq_stop_hw_queues(ns->queue);

There's actually a reason the queue stoppage is using a different flag
than blk_mq_queue_stopped: kicking the queue starts stopped queues.
The driver has to ensure the requeue work can't be kicked prior to
cancelling the current requeue work. Once we know requeue work isn't
running and can't restart again, then we're safe to stop the hw queues.

It's a pretty obscure condition, requiring the controller to post an
error completion at the same time the driver decides to reset the
controller. Here's the sequence with the wrong outcome:

 CPU A                       CPU B
 -----                       -----
 nvme_stop_queues            nvme_requeue_req
  blk_mq_stop_hw_queues       if (blk_mq_queue_stopped) <- returns false
   blk_mq_stop_hw_queue       blk_mq_kick_requeue_list
                               blk_mq_requeue_work
                                blk_mq_start_hw_queues
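The ordering nvme_stop_queues() relies on to close that window is the code
being removed in the quoted hunk above; the same steps, annotated for
clarity:

	spin_lock_irq(ns->queue->queue_lock);
	queue_flag_set(QUEUE_FLAG_STOPPED, ns->queue);	/* 1) nvme_requeue_req() stops kicking */
	spin_unlock_irq(ns->queue->queue_lock);

	blk_mq_cancel_requeue_work(ns->queue);		/* 2) flush any requeue work in flight */
	blk_mq_stop_hw_queues(ns->queue);		/* 3) nothing can restart the queues now */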


Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-20 Thread Keith Busch
On Wed, Oct 19, 2016 at 04:51:18PM -0700, Bart Van Assche wrote:
> 
> I assume that line 498 in blk-mq.c corresponds to BUG_ON(blk_queued_rq(rq))?
> Anyway, it seems to me like this is a bug in the NVMe code and also that
> this bug is completely unrelated to my patch series. In nvme_complete_rq() I
> see that blk_mq_requeue_request() is called. I don't think this is allowed
> from the context of nvme_cancel_request() because blk_mq_requeue_request()
> assumes that a request has already been removed from the request list.
> However, neither blk_mq_tagset_busy_iter() nor nvme_cancel_request() remove
> a request from the request list before nvme_complete_rq() is called. I think
> this is what triggers the BUG_ON() statement in blk_mq_requeue_request().
> Have you noticed that e.g. the scsi-mq code only calls
> blk_mq_requeue_request() after __blk_mq_end_request() has finished? Have you
> considered to follow the same approach in nvme_cancel_request()?

Both nvme and scsi requeue through their mq_ops 'complete' callback, so
nvme is similarly waiting for __blk_mq_end_request before requesting a
requeue. The problem, I think, is that nvme's IO cancelling path observes
active requests that it's requeuing from the queue_rq path.

Patch [11/11] kicks the requeue list unconditionally. This restarts queues
the driver had just quiesced a moment before, restarting those requests,
but the driver isn't ready to handle them. When the driver ultimately
unbinds from the device, it requeues those requests a second time.

Either the requeuing can't kick the requeue work when quiesced, or the
shutdown needs to quiesce even when it hasn't restarted the queues.
Either patch below appears to fix the issue.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ccd9cc5..078530c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -201,7 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk 
*disk)
 
 void nvme_requeue_req(struct request *req)
 {
-   blk_mq_requeue_request(req, true);
+   blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
 }
 EXPORT_SYMBOL_GPL(nvme_requeue_req);
--

--- 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4b30fa2..a05da98 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1681,10 +1681,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
del_timer_sync(&dev->watchdog_timer);
 
mutex_lock(&dev->shutdown_lock);
-   if (pci_is_enabled(to_pci_dev(dev->dev))) {
-   nvme_stop_queues(&dev->ctrl);
+   nvme_stop_queues(&dev->ctrl);
+   if (pci_is_enabled(to_pci_dev(dev->dev)))
csts = readl(dev->bar + NVME_REG_CSTS);
-   }
 
queues = dev->online_queues - 1;
for (i = dev->queue_count - 1; i > 0; i--)
--


Re: [PATCH v3 0/11] Fix race conditions related to stopping block layer queues

2016-10-19 Thread Keith Busch
Hi Bart,

I'm running linux 4.9-rc1 + linux-block/for-linus, and alternating tests
with and without this series.

Without this, I'm not seeing any problems in a link-down test while
running fio after ~30 runs.

With this series, I only see the test pass infrequently. Most of the
time I observe one of several failures. In all cases, it looks like the
rq->queuelist is in an unexpected state.

I think I've almost got this tracked down, but I have to leave for the
day soon. Rather than having a more useful suggestion, I've put the two
failures below.


First failure:

[  214.782075] [ cut here ]
[  214.782098] kernel BUG at block/blk-mq.c:498!
[  214.782117] invalid opcode:  [#1] SMP
[  214.782133] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack 
ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_security iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security ebtable_filter 
ebtables ip6table_filter ip6_tables vfat fat
[  214.782356] CPU: 6 PID: 160 Comm: kworker/u16:6 Not tainted 4.9.0-rc1+ #28
[  214.782383] Hardware name: Gigabyte Technology Co., Ltd. 
Z97X-UD5H/Z97X-UD5H, BIOS F8 06/17/2014
[  214.782419] Workqueue: nvme nvme_reset_work [nvme]
[  214.782440] task: 8c0815403b00 task.stack: b6ad01384000
[  214.782463] RIP: 0010:[]  [] 
blk_mq_requeue_request+0x35/0x40
[  214.782502] RSP: 0018:b6ad01387b88  EFLAGS: 00010287
[  214.782524] RAX: 8c0814b98400 RBX: 8c0814b98200 RCX: 7530
[  214.782551] RDX: 0007 RSI: 0001 RDI: 8c0814b98200
[  214.782578] RBP: b6ad01387b98 R08:  R09: 9f408680
[  214.783394] R10: 0394 R11: 0388 R12: 0001
[  214.784212] R13: 8c081593a000 R14: 0001 R15: 8c080cdea740
[  214.785033] FS:  () GS:8c081fb8() 
knlGS:
[  214.785869] CS:  0010 DS:  ES:  CR0: 80050033
[  214.786710] CR2: 7ffae4497f34 CR3: 0001dfe06000 CR4: 001406e0
[  214.787559] Stack:
[  214.788406]  8c0814b98200  b6ad01387ba8 
c03451b3
[  214.789287]  b6ad01387bd0 c0357a4a 8c0814b98200 
d6acffc81a00
[  214.790174]  0006 b6ad01387bf8 9f3b8e22 
8c0814b98200
[  214.791066] Call Trace:
[  214.791935]  [] nvme_requeue_req+0x13/0x20 [nvme_core]
[  214.792810]  [] nvme_complete_rq+0x16a/0x1d0 [nvme]
[  214.793680]  [] __blk_mq_complete_request+0x72/0xe0
[  214.794551]  [] blk_mq_complete_request+0x1c/0x20
[  214.795422]  [] nvme_cancel_request+0x50/0x90 [nvme_core]
[  214.796299]  [] bt_tags_iter+0x2e/0x40
[  214.797157]  [] blk_mq_tagset_busy_iter+0x173/0x1e0
[  214.798005]  [] ? nvme_shutdown_ctrl+0x100/0x100 
[nvme_core]
[  214.798852]  [] ? nvme_shutdown_ctrl+0x100/0x100 
[nvme_core]
[  214.799682]  [] nvme_dev_disable+0x11d/0x380 [nvme]
[  214.800511]  [] ? acpi_unregister_gsi_ioapic+0x3a/0x40
[  214.801344]  [] ? dev_warn+0x6c/0x90
[  214.802157]  [] nvme_reset_work+0xa4/0xdc0 [nvme]
[  214.802961]  [] ? __switch_to+0x2b6/0x5f0
[  214.803773]  [] process_one_work+0x15f/0x430
[  214.804593]  [] worker_thread+0x4e/0x490
[  214.805419]  [] ? process_one_work+0x430/0x430
[  214.806255]  [] kthread+0xd9/0xf0
[  214.807096]  [] ? kthread_park+0x60/0x60
[  214.807946]  [] ret_from_fork+0x25/0x30
[  214.808801] Code: 54 53 48 89 fb 41 89 f4 e8 a9 fa ff ff 48 8b 03 48 39 c3 
75 16 41 0f b6 d4 48 89 df be 01 00 00 00 e8 10 ff ff ff 5b 41 5c 5d c3 <0f> 0b 
66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 be 40 00 00
[  214.810714] RIP  [] blk_mq_requeue_request+0x35/0x40
[  214.811628]  RSP 
[  214.812545] ---[ end trace 6ef3a3b6f8cea418 ]---
[  214.813437] [ cut here ]


Second failure, warning followed by NMI watchdog:

[  410.736619] [ cut here ]
[  410.736624] WARNING: CPU: 2 PID: 577 at lib/list_debug.c:29 
__list_add+0x62/0xb0
[  410.736883] list_add corruption. next->prev should be prev 
(acf481847d78), but was 931f8fb78000. (next=931f8fb78000).
[  410.736884] Modules linked in: nvme nvme_core nf_conntrack_netbios_ns 
nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_conntrack 
ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_security 
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw 
ip6table_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle ebtable_filter 
ebtables ip6table_filter ip6_tables vfat fat
[  410.736902] CPU: 2 PID: 577 Comm: kworker/2:1H Not tainted 4.9.0-rc1+ #28
[  410.736903] Hardware name: Gigabyte Technology Co., Ltd. 
Z97X-UD5H/Z97X-UD5H,

Re: Oops when completing request on the wrong queue

2016-08-23 Thread Keith Busch
On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
> >My workload looks similar to yours, in that it's high depth and with a
> >lot of jobs to keep most CPUs loaded. My bash script is different than
> >yours, I'll try that and see if it helps here.
> 
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

I haven't recreated this either, but I think I can logically see why
this failure is happening.

I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple reasons.

First, we can't release the nvmeq->tags when a hctx exits because
that nvmeq may be used by other namespaces that need to point to
the device's tag set.

The other reason is that blk-mq doesn't exit or init hardware contexts
when remapping for a CPU event, leaving the nvme driver unaware that a
hardware context now points to a different tag set.

So I think I see why this test would fail; I don't know about a fix yet.
Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
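
One possible shape for that indirection, sketched below with simplified
stand-in types (none of these names are the real nvme/blk-mq structures):
have the per-hw-queue context remember where in the device-wide tag set its
tags live, rather than caching the tags pointer itself, so a remap that
swaps the tags underneath is still observed at completion time.

/* Hypothetical sketch only -- stand-in types, not the real driver code. */
struct tags;                            /* opaque stand-in for blk_mq_tags */

struct tagset {
	struct tags *tags[64];          /* one entry per hardware context */
};

struct queue_ctx {
	struct tags **tags;             /* points into tagset->tags[], not a cached copy */
};

static void queue_init_hctx(struct queue_ctx *q, struct tagset *set,
			    unsigned int hctx_idx)
{
	/* remember the slot, so later remaps are seen through it */
	q->tags = &set->tags[hctx_idx];
}

static struct tags *queue_current_tags(struct queue_ctx *q)
{
	/* dereference at completion time, not at init time */
	return *q->tags;
}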


Re: [Lsf] [LSF/MM TOPIC] block-mq issues with FC

2016-04-08 Thread Keith Busch
On Fri, Apr 08, 2016 at 01:40:06PM -0400, Matthew Wilcox wrote:
>  - Inability to use all queues supported by a device.  Intel's P3700
>supports 31 queues, but block-mq insists on assigning an even multiple
>of CPUs to each queue.  So if you have 48 CPUs, it will use 24 queues.
>If you have 128 CPUs, it will only use 16 of the queues.

While it'd be better to use all the available h/w resources, that's
actually not the worst part.

The real problems occur when there are more physical/unique CPUs than
h/w queues since blk-mq does not consider CPU topology beyond thread
siblings. With 128 CPUs, blk-mq may use all 31 queues P3700 supports,
but many CPU groups won't share a last-level cache.

Smarter assignment would reclaim some untapped performance, and we can
share such code prior to the session.
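
As a toy illustration of the kind of assignment meant here (not blk-mq
code; the llc_id table is made up), grouping CPUs by last-level cache
before spreading them over queues keeps cache-sharing CPUs on the same
queue, while a plain modulo spread does not:

#include <stdio.h>

/* Toy sketch: map 8 CPUs onto 4 queues.  llc_id[] is an invented topology
 * where CPUs 0-3 share one last-level cache and CPUs 4-7 share another. */
int main(void)
{
	const int ncpus = 8, nqueues = 4;
	const int llc_id[8] = { 0, 0, 0, 0, 1, 1, 1, 1 };
	int cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		int naive = cpu % nqueues;
		/* give each LLC group its own slice of the queues */
		int aware = llc_id[cpu] * (nqueues / 2) + cpu % (nqueues / 2);

		printf("cpu %d: naive queue %d, llc-aware queue %d\n",
		       cpu, naive, aware);
	}
	return 0;
}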


Re: complete boot failure in 4.5-rc1 caused by nvme: make SG_IO support optional

2016-02-08 Thread Keith Busch
On Mon, Feb 08, 2016 at 04:19:13PM +0100, Hannes Reinecke wrote:
> Ok, so what about having a 'wwid' attribute which provides combined
> information (like scsi has)?

That looks like the sensible thing to do. Thanks for the pointer.

Going forward, I will solicit more feedback from scsi developers
so NVMe's external attributes better align with storage that already
solved our issues. I agree with Christoph that we never should have
relied on SCSI translations for NVMe, but we don't want to reinvent
generic solutions either.


Re: complete boot failure in 4.5-rc1 caused by nvme: make SG_IO support optional

2016-02-08 Thread Keith Busch
On Mon, Feb 08, 2016 at 11:13:50AM +0100, Christoph Hellwig wrote:
> On Mon, Feb 08, 2016 at 12:01:16PM +0200, Sagi Grimberg wrote:
> >
> >> Do we have defined sysfs attributes for NVMe devices nowadays?
> >
> > /sys/block/nvme0n1/uuid
> 
> That's only supported for NVMe 1.1 and higher devices, and optional.
> For older or stupid devices we need to support the algorithm based
> on the serial attribute from nvme_fill_device_id_scsi_string() in
> drivers/nvme/host/scsi.c.

It's even worse. NGUID was defined for 1.2 devices and higher. 1.1
devices should have EUI-64 at:
 
  /sys/block/nvmeXnY/eui

1.2 devices will have either uuid or eui (or both).

The majority of devices in circulation today are 1.0, and need to concatenate
these three entries to make a unique identifier (a sketch follows the list):

  /sys/block/nvmeXnY/device/serial
  /sys/block/nvmeXnY/device/model
  /sys/block/nvmeXnY/nsid
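
A minimal sketch of that concatenation, assuming the paths above and an
example nvme0n1 device (error handling and trimming of the space padding
some controllers put in model/serial are omitted):

#include <stdio.h>
#include <string.h>

static void read_attr(const char *path, char *buf, int len)
{
	FILE *f = fopen(path, "r");

	buf[0] = '\0';
	if (f) {
		if (fgets(buf, len, f))
			buf[strcspn(buf, "\n")] = '\0';
		fclose(f);
	}
}

int main(void)
{
	char model[128], serial[128], nsid[32];

	read_attr("/sys/block/nvme0n1/device/model", model, sizeof(model));
	read_attr("/sys/block/nvme0n1/device/serial", serial, sizeof(serial));
	read_attr("/sys/block/nvme0n1/nsid", nsid, sizeof(nsid));

	/* 1.0 devices: model + serial + nsid is the best unique id available */
	printf("%s_%s_%s\n", model, serial, nsid);
	return 0;
}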


Re: [PATCH 3/5] block: nvme-core: simplify ida usage

2015-10-08 Thread Keith Busch

On Fri, 2 Oct 2015, Johannes Thumshirn wrote:

Lee Duncan  writes:

Simplify ida index allocation and removal by
using the ida_simple_* helper functions.


Looks good to me. Just one comment:


 static void nvme_release_instance(struct nvme_dev *dev)
 {
spin_lock(&dev_list_lock);
-   ida_remove(&nvme_instance_ida, dev->instance);
+   ida_simple_remove(&nvme_instance_ida, dev->instance);
spin_unlock(&dev_list_lock);


No harm from taking the nvme spin lock here, but it's not necessary with
the simple interface.


Reviewed-by: Johannes Thumshirn 
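
For reference, a minimal sketch of the pattern under discussion (the
example_* names are made up): the ida_simple_* helpers do their own
locking, so no driver spinlock is needed around them.

#include <linux/idr.h>
#include <linux/gfp.h>

static DEFINE_IDA(example_ida);

static int example_alloc_instance(void)
{
	/* returns a free id or a negative errno; no external lock required */
	return ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
}

static void example_free_instance(int instance)
{
	ida_simple_remove(&example_ida, instance);
}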



Re: Persistent Reservation API V2

2015-08-20 Thread Keith Busch

On Tue, 11 Aug 2015, Christoph Hellwig wrote:

This series adds support for a simplified Persistent Reservation API
to the block layer.  The intent is that both in-kernel and userspace
consumers can use the API instead of having to hand craft SCSI or NVMe
command through the various pass through interfaces.  It also adds
DM support as getting reservations through dm-multipath is a major
pain with the current scheme.

NVMe support currently isn't included as I don't have a multihost
NVMe setup to test on, but Keith offered to test it and I'll have
a patch for it shortly.


Hi Christoph,

I wrote an nvme implementation and it seems to work as expected with your
pr-tests (minor modification to open an nvme target). While the API appears
to work as designed, there are a few features of NVMe that are unreachable
with it. For example, NVMe can ignore existing keys when acquiring a
reservation in addition to registering a new key. NVMe can also specify
whether or not a reservation should persist through power-loss.

Anyway, I don't think SCSI has these same options. Should this new IOCTL
API accommodate the common subset to maintain its simplicity, or make
it more expressive for all features? The more important question might
be whether there are any users requiring these features, and I honestly
don't know that right now.
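
Purely as a strawman for that question (every name and value below is
hypothetical, not the proposed API): the NVMe-only options could ride along
as optional flags that transports without the capability simply reject,
keeping the common subset unchanged.

#include <linux/types.h>

/* Hypothetical flags, for illustration only. */
#define PR_EXAMPLE_IGNORE_EXISTING_KEY	(1u << 0)  /* NVMe: ignore registered key */
#define PR_EXAMPLE_PERSIST_POWER_LOSS	(1u << 1)  /* NVMe: persist through power loss */

struct pr_example_reservation {
	__u64	key;
	__u32	type;
	__u32	flags;	/* transports return -EOPNOTSUPP for flags they can't honor */
};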


Re: Persistent Reservation API

2015-08-04 Thread Keith Busch

On Tue, 4 Aug 2015, Christoph Hellwig wrote:

NVMe support currently isn't included as I don't have a multihost
NVMe setup to test on, but if I can find a volunteer to test it I'm
happy to write the code for it.


Looks pretty good so far. I'd be happy to try it out with NVMe
subsystems.


Re: [RFC PATCH] Let device drivers disable msi on shutdown

2014-07-10 Thread Keith Busch

On Thu, 10 Jul 2014, Bjorn Helgaas wrote:

[+cc LKML, Greg KH for driver core async shutdown question]
On Tue, Jun 24, 2014 at 10:48:57AM -0600, Keith Busch wrote:

To provide context why I want to do this asynchronously, NVM-Express has
one PCI device per controller, of which there could be dozens in a system,
and each one may take many seconds (I've heard over ten in some cases)
to safely shutdown.


I don't see anything in device_shutdown() that would wait for this
sort of asynchronous shutdown to complete.  So how do we know it's
finished before we turn off the power, reset, kexec, etc.?

If we need to do asynchronous shutdown, it seems like we need some
sort of driver core infrastructure to manage that.


Yes, good point! To address that, I did submit this patch:

http://lists.infradead.org/pipermail/linux-nvme/2014-May/000827.html

I need to fix the EXPORT_SYMBOL_GPL usage for a v2, but before that, I
want to know the reason the driver can't use MSI-x in an async shutdown,
which is how I came to the patch mentioned above.

I'd originally had the async shutdown use legacy interrupts, but I
know some NVMe devices do not support legacy interrupts, so I can't use my original
proposal. If I can't rely on MSI/MSI-x being enabled in an async shutdown,
then I have to add polling, which I suppose we can live with.
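
A rough sketch of what that polling could look like (helper and macro names
are made up; only the register offsets and bit positions come from the NVMe
spec, CC at 0x14 with shutdown notification in bits 15:14 and CSTS at 0x1c
with shutdown status in bits 3:2): request a normal shutdown, then poll the
status field instead of relying on an interrupt.

#include <linux/io.h>
#include <linux/delay.h>
#include <linux/errno.h>

#define EX_REG_CC		0x14
#define EX_REG_CSTS		0x1c
#define EX_CC_SHN_MASK		(3u << 14)
#define EX_CC_SHN_NORMAL	(1u << 14)
#define EX_CSTS_SHST_MASK	(3u << 2)
#define EX_CSTS_SHST_CMPLT	(2u << 2)

static int example_polled_shutdown(void __iomem *bar, unsigned int timeout_ms)
{
	u32 cc = readl(bar + EX_REG_CC);

	/* request a normal shutdown */
	writel((cc & ~EX_CC_SHN_MASK) | EX_CC_SHN_NORMAL, bar + EX_REG_CC);

	/* poll the shutdown status instead of relying on an interrupt */
	while (timeout_ms--) {
		if ((readl(bar + EX_REG_CSTS) & EX_CSTS_SHST_MASK) ==
		    EX_CSTS_SHST_CMPLT)
			return 0;
		msleep(1);
	}
	return -ETIMEDOUT;
}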


RE: [RFC PATCH] Let device drivers disable msi on shutdown

2014-06-24 Thread Keith Busch

On Tue, 24 Jun 2014, Elliott, Robert (Server Storage) wrote:

1. That will cover the .shutdown function used by mptfc.c, mptspi.c,
and mptscsih.c, but mptsas.c uses mptsas_shutdown rather than
mptscsih_shutdown.  It doesn't call pci_disable_msi either.


Missed that; thanks.


2. mptscsih_suspend is another caller of mptscsih_shutdown
(although the latter does nothing right now).  This is done
before calling mpt_suspend, which writes to some chip registers
to disable interrupts before calling pci_disable_msi.  Adding a
pci_disable_msi call before those writes might not be safe.


Clearly there are more paths into this function than I investigated.


3. That mptscsih_suspend call chain will result in calling
pci_disable_msi twice, which might trigger this in
pci_disable_msi -> pci_msi_shutdown:
   BUG_ON(list_empty(&dev->msi_list));


Ah, but pci_disable_msi will not even invoke pci_msi_shutdown:

if (!pci_msi_enable || !dev || !dev->msi_enabled)
return;

The same check is done in pci_msi_shutdown. If that check weren't there,
every driver would hit that BUG_ON, since almost every other pci driver
disables its interrupts in its shutdown path before returning, and
then pci-driver calls it again! MPT appears to be the exception.


[RFC PATCH] Let device drivers disable msi on shutdown

2014-06-24 Thread Keith Busch
I'd like to do shutdowns asynchronously so I can shut down multiple
devices in parallel, but the pci-driver disables interrupts after my
driver returns from its '.shutdown', though it needs to rely on these
interrupts in its asynchronously scheduled shutdown.

I tracked the reason for pci disabling msi to ...

| commit d52877c7b1afb8c37ebe17e2005040b79cb618b0
| Author: Yinghai Lu 
| Date:   Wed Apr 23 14:58:09 2008 -0700
|
| pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2

... because mptfusion doesn't disable msi in its shutdown path.

Any reason we can't let the drivers do this instead?

To provide context why I want to do this asynchronously, NVM-Express has
one PCI device per controller, of which there could be dozens in a system,
and each one may take many seconds (I've heard over ten in some cases)
to safely shutdown.

In this patch, mptfusion was compile tested only; I didn't observe any
adverse effects from running the pci portion.

Signed-off-by: Keith Busch 
Cc: Nagalakshmi Nandigama 
Cc: Sreekanth Reddy 
Cc: Bjorn Helgaas 
---
 drivers/message/fusion/mptscsih.c |3 +++
 drivers/pci/pci-driver.c  |2 --
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
index 2a1c6f2..3186e17 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1215,6 +1215,9 @@ mptscsih_remove(struct pci_dev *pdev)
 void
 mptscsih_shutdown(struct pci_dev *pdev)
 {
+   MPT_ADAPTER *ioc = pci_get_drvdata(pdev);
+   if (ioc->msi_enable)
+   pci_disable_msi(pdev);
 }
 
 #ifdef CONFIG_PM
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 3f8e3db..8079d98 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -453,8 +453,6 @@ static void pci_device_shutdown(struct device *dev)
 
if (drv && drv->shutdown)
drv->shutdown(pci_dev);
-   pci_msi_shutdown(pci_dev);
-   pci_msix_shutdown(pci_dev);
 
 #ifdef CONFIG_KEXEC
/*
-- 
1.7.10.4



Re: [PATCH] sd: dif sector calculation

2013-09-20 Thread Keith Busch

On Fri, 20 Sep 2013, Martin K. Petersen wrote:


"Keith" == Keith Busch  writes:


Keith> The ref tag should be the device's physical LBA rather than the
Keith> 512 byte bio sector.

The bip sector is just a seed value set by the application. It is not
correct to scale it based on sector size or PI interval.


If I follow the current implementation correctly and have a 4k physical
block size, a two block write at LBA 0 will generate ref tags 0 and 1
for that operation. If I read back one block at LBA 1, the current code
will expect the ref tag to be 8, but come back as 1, failing the compare.
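
Spelling that scenario out with a tiny userspace calculation (illustrative
only; the numbers match the 4k example above, where the bio layer counts in
512-byte sectors so device LBA 1 is bip_sector 8):

#include <stdio.h>

int main(void)
{
	unsigned long long bip_sector = 8;  /* 512B-sector view of device LBA 1 */
	unsigned int shift = 3;             /* __ffs(4096) - 9 */

	/* unscaled, the read expects ref tag 8; the write stamped 0 and 1 */
	printf("expected ref tag without scaling: %llu\n", bip_sector);
	/* scaled by the PI interval, the expectation becomes 1 */
	printf("expected ref tag with scaling:    %llu\n", bip_sector >> shift);
	return 0;
}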


[PATCH] sd: dif sector calculation

2013-09-17 Thread Keith Busch
The ref tag should be the device's physical LBA rather than the 512 byte
bio sector.

Signed-off-by: Keith Busch 
Cc: Martin K. Petersen 
Cc: James E.J. Bottomley 
Cc: Ric Wheeler 
---
I CC'ed James and Ric as I think you guys expressed some interest in
seeing if this was a legit concern. :)

The patch goes out of its way to avoid division. The shifting requires that
the physical sector size be a power of two of at least 512, which I believe
is true.
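
To make the shift explicit (illustration only, using userspace ffs(), which
is the kernel's __ffs() plus one): for a power-of-two sector size,
x >> (__ffs(sector_sz) - 9) is the same as x / (sector_sz / 512).

#include <stdio.h>
#include <strings.h>

int main(void)
{
	const unsigned int sizes[] = { 512, 4096 };
	unsigned long long sectors = 24;
	int i;

	for (i = 0; i < 2; i++) {
		unsigned int sz = sizes[i];
		int shift = ffs(sz) - 1 - 9;	/* __ffs(sz) - 9 */

		printf("%4u: %llu >> %d = %llu, %llu / %u = %llu\n",
		       sz, sectors, shift, sectors >> shift,
		       sectors, sz / 512, sectors / (sz / 512));
	}
	return 0;
}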

 drivers/scsi/sd_dif.c | 12 ++--
 fs/bio-integrity.c| 11 ---
 2 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 6174ca4..dc9f095 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -382,8 +382,8 @@ void sd_dif_prepare(struct request *rq, sector_t hw_sector,
if (bio_flagged(bio, BIO_MAPPED_INTEGRITY))
break;
 
-   virt = bio->bi_integrity->bip_sector & 0x;
-
+   virt = (bio->bi_integrity->bip_sector >>
+   (__ffs(sector_sz) - 9)) & 0x;
bip_for_each_vec(iv, bio->bi_integrity, i) {
sdt = kmap_atomic(iv->bv_page)
+ iv->bv_offset;
@@ -425,14 +425,14 @@ void sd_dif_complete(struct scsi_cmnd *scmd, unsigned int good_bytes)
sector_sz = scmd->device->sector_size;
sectors = good_bytes / sector_sz;
 
-   phys = blk_rq_pos(scmd->request) & 0x;
-   if (sector_sz == 4096)
-   phys >>= 3;
+   phys = (blk_rq_pos(scmd->request) >> (__ffs(sector_sz) - 9)) &
+   0x;
 
__rq_for_each_bio(bio, scmd->request) {
struct bio_vec *iv;
 
-   virt = bio->bi_integrity->bip_sector & 0x;
+   virt = (bio->bi_integrity->bip_sector >>
+   (__ffs(sector_sz) - 9)) & 0x;
 
bip_for_each_vec(iv, bio->bi_integrity, i) {
sdt = kmap_atomic(iv->bv_page)
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 6025084..6ee4f12 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -196,11 +196,7 @@ EXPORT_SYMBOL(bio_integrity_enabled);
 static inline unsigned int bio_integrity_hw_sectors(struct blk_integrity *bi,
unsigned int sectors)
 {
-   /* At this point there are only 512b or 4096b DIF/EPP devices */
-   if (bi->sector_size == 4096)
-   return sectors >>= 3;
-
-   return sectors;
+   return sectors >> (__ffs(bi->sector_size) - 9);
 }
 
 /**
@@ -300,7 +296,7 @@ static void bio_integrity_generate(struct bio *bio)
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec *bv;
-   sector_t sector = bio->bi_sector;
+   sector_t sector = bio->bi_sector >> (__ffs(bi->sector_size) - 9);
unsigned int i, sectors, total;
void *prot_buf = bio->bi_integrity->bip_buf;
 
@@ -442,7 +438,8 @@ static int bio_integrity_verify(struct bio *bio)
struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
struct blk_integrity_exchg bix;
struct bio_vec *bv;
-   sector_t sector = bio->bi_integrity->bip_sector;
+   sector_t sector = bio->bi_integrity->bip_sector >>
+   (__ffs(bi->sector_size) - 9);
unsigned int i, sectors, total, ret;
void *prot_buf = bio->bi_integrity->bip_buf;
 
-- 
1.8.2.1
