Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 02:21:17PM -0700, Keith Busch wrote:
> On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> > On 12/4/2018 9:48 AM, Keith Busch wrote:
> > > Once quiesced, the proposed iterator can handle the final termination
> > > of the request, perform failover, or some other lld specific action
> > > depending on your situation.
> > 
> > I don't believe they can remain frozen, definitely not for the admin queue.
> > -- james
> 
> Quiesced and frozen carry different semantics.
> 
> My understanding of the nvme-fc implementation is that it returns
> BLK_STS_RESOURCE in the scenario you've described where the admin
> command can't be executed at the moment. That just has the block layer
> requeue it for later resubmission 3 milliseconds later, which will
> continue to return the same status code until you're really ready for
> it.
> 
> What I'm proposing is that instead of using that return code, you may
> have nvme-fc control when to dispatch those queued requests by utilizing
> the blk-mq quiesce on/off states. Is there a reason that wouldn't work?

BTW, this is digressing from this patch's use case. The proposed iteration
doesn't come into play for the above scenario, which can be handled with
existing interfaces.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 11:33:33AM -0800, James Smart wrote:
> 
> 
> On 12/4/2018 9:48 AM, Keith Busch wrote:
> > On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> > > > > > Yes, I'm very much in favour of this, too.
> > > > > > We always have this IMO slightly weird notion of stopping the 
> > > > > > queue, set
> > > > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > > > that the driver then sees the error flag and terminates the 
> > > > > > requests.
> > > > > > Which I always found quite counter-intuitive.
> > > > > What about requests that come in after the iteration runs? how are 
> > > > > those
> > > > > terminated?
> > > > If we've reached a dead state, I think you'd want to start a queue 
> > > > freeze
> > > > before running the terminating iterator.
> > > Its not necessarily dead, in fabrics we need to handle disconnections
> > > that last for a while before we are able to reconnect (for a variety of
> > > reasons) and we need a way to fail I/O for failover (or requeue, or
> > > block its up to the upper layer). Its less of a "last resort" action
> > > like in the pci case.
> > > 
> > > Does this guarantee that after freeze+iter we won't get queued with any
> > > other request? If not then we still need to unfreeze and fail at
> > > queue_rq.
> > It sounds like there are different scenarios to consider.
> > 
> > For the dead controller, we call blk_cleanup_queue() at the end which
> > ends callers who blocked on entering.
> > 
> > If you're doing a failover, you'd replace the freeze with a current path
> > update in order to prevent new requests from entering.
> and if you're not multipath ?    I assume you want the io queues to be
> frozen so they queue there - which can block threads such as ns
> verification. It's good to have them live, as todays checks bounce the io,
> letting the thread terminate as its in a reset/reconnect state, which allows
> those threads to exit out or finish before a new reconnect kicks them off
> again. We've already been fighting deadlocks with the reset/delete/rescan
> paths and these io paths. suspending the queues completely over the
> reconnect will likely create more issues in this area.
> 
> 
> > In either case, you don't need checks in queue_rq. The queue_rq check
> > is redundant with the quiesce state that blk-mq already provides.
> 
> I disagree.  The cases I've run into are on the admin queue - where we are
> sending io to initialize the controller when another error/reset occurs, and
> the checks are required to identify/reject the "old" initialization
> commands, with another state check allowing them to proceed on the "new"
> initialization commands.  And there are also cases for ioctls and other
> things that occur during the middle of those initialization steps that need
> to be weeded out.   The Admin queue has to be kept live to allow the
> initialization commands on the new controller.
> 
> state checks are also needed for those namespace validation cases
> 
> > 
> > Once quiesced, the proposed iterator can handle the final termination
> > of the request, perform failover, or some other lld specific action
> > depending on your situation.
> 
> I don't believe they can remain frozen, definitely not for the admin queue.
> -- james

Quiesced and frozen carry different semantics.

My understanding of the nvme-fc implementation is that it returns
BLK_STS_RESOURCE in the scenario you've described where the admin
command can't be executed at the moment. That just has the block layer
requeue it and resubmit it 3 milliseconds later, which will
continue to return the same status code until you're really ready for
it.

What I'm proposing is that instead of using that return code, you may
have nvme-fc control when to dispatch those queued requests by utilizing
the blk-mq quiesce on/off states. Is there a reason that wouldn't work?
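
To put that in rough code terms (a sketch only, not nvme-fc's actual
implementation; the helper names are made up):

---
/* Pause dispatch while the association is down... */
static void nvme_fc_pause_dispatch(struct nvme_ctrl *ctrl)
{
        if (ctrl->admin_q)
                blk_mq_quiesce_queue(ctrl->admin_q);
        nvme_stop_queues(ctrl);         /* quiesce every namespace queue */
}

/* ...and resume once the controller can take commands again. */
static void nvme_fc_resume_dispatch(struct nvme_ctrl *ctrl)
{
        if (ctrl->admin_q)
                blk_mq_unquiesce_queue(ctrl->admin_q);
        nvme_start_queues(ctrl);        /* queued requests dispatch now */
}
---

queue_rq() then never sees a request while the association is down, so it
doesn't need to return BLK_STS_RESOURCE at all.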


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Tue, Dec 04, 2018 at 09:38:29AM -0800, Sagi Grimberg wrote:
> 
> > > > Yes, I'm very much in favour of this, too.
> > > > We always have this IMO slightly weird notion of stopping the queue, set
> > > > some error flags in the driver, then _restarting_ the queue, just so
> > > > that the driver then sees the error flag and terminates the requests.
> > > > Which I always found quite counter-intuitive.
> > > 
> > > What about requests that come in after the iteration runs? how are those
> > > terminated?
> > 
> > If we've reached a dead state, I think you'd want to start a queue freeze
> > before running the terminating iterator.
> 
> Its not necessarily dead, in fabrics we need to handle disconnections
> that last for a while before we are able to reconnect (for a variety of
> reasons) and we need a way to fail I/O for failover (or requeue, or
> block its up to the upper layer). Its less of a "last resort" action
> like in the pci case.
> 
> Does this guarantee that after freeze+iter we won't get queued with any
> other request? If not then we still need to unfreeze and fail at
> queue_rq.

It sounds like there are different scenarios to consider.

For the dead controller case, we call blk_cleanup_queue() at the end,
which ends the callers that were blocked on entering the queue.

If you're doing a failover, you'd replace the freeze with a current path
update in order to prevent new requests from entering.

In either case, you don't need checks in queue_rq. The queue_rq check
is redundant with the quiesce state that blk-mq already provides.

Once quiesced, the proposed iterator can handle the final termination
of the request, perform failover, or some other lld specific action
depending on your situation.


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-04 Thread Keith Busch
On Mon, Dec 03, 2018 at 05:33:06PM -0800, Sagi Grimberg wrote:
> >>
> > Yes, I'm very much in favour of this, too.
> > We always have this IMO slightly weird notion of stopping the queue, set 
> > some error flags in the driver, then _restarting_ the queue, just so 
> > that the driver then sees the error flag and terminates the requests.
> > Which I always found quite counter-intuitive.
> 
> What about requests that come in after the iteration runs? how are those
> terminated?

If we've reached a dead state, I think you'd want to start a queue freeze
before running the terminating iterator.
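
In rough order, something like this (a sketch, assuming the
blk_mq_tagset_all_iter() interface from this series and the
bool-returning callback it uses):

---
static bool nvme_fail_request(struct request *req, void *data, bool reserved)
{
        blk_mq_end_request(req, BLK_STS_IOERR);  /* terminate entered request */
        return true;                             /* keep iterating */
}

        ...
        nvme_start_freeze(&dev->ctrl);  /* stop new submitters */
        nvme_stop_queues(&dev->ctrl);   /* quiesce dispatch */
        blk_mq_tagset_all_iter(&dev->tagset, nvme_fail_request, NULL);
---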


Re: [PATCH 04/13] nvme-pci: only allow polling with separate poll queues

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:19AM -0800, Christoph Hellwig wrote:
> This will allow us to simplify both the regular NVMe interrupt handler
> and the upcoming aio poll code.  In addition to that the separate
> queues are generally a good idea for performance reasons.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 10/13] nvme-mpath: remove I/O polling support

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:25AM -0800, Christoph Hellwig wrote:
> The ->poll_fn has been stale for a while, as a lot of places check for mq
> ops.  But there is no real point in it anyway, as we don't even use
> the multipath code for subsystems without multiple ports, which is usually
> what we do high performance I/O to.  If it really becomes an issue we
> should rework the nvme code to also skip the multipath code for any
> private namespace, even if that could mean some trouble when rescanning.
> 
> Signed-off-by: Christoph Hellwig 

This was a bit flawed anyway since the head's current path could change,
and you end up polling the wrong request_queue. Not really harmful other
than some wasted CPU cycles, but might be worth thinking about if we
want to bring mpath polling back.

Reviewed-by: Keith Busch 


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-12-03 Thread Keith Busch
On Sun, Dec 02, 2018 at 08:46:22AM -0800, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-11-30 Thread Keith Busch
On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> On 11/30/18 1:26 PM, Keith Busch wrote:
> > A driver may wish to iterate every tagged request, not just ones that
> > satisfy blk_mq_request_started(). The intended use is so a driver may
> > terminate entered requests on quiesced queues.
> 
> How about we just move the started check into the handler passed in for
> those that care about it? Much saner to make the interface iterate
> everything, and leave whatever state check to the callback.

I agree that's better. I initially didn't want to examine all the users,
but it looks like only 4 drivers use the current tagset iterator, so
that's not too daunting. I'll give it a look.
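
For example, a caller that only cares about started requests would end up
with something like this (sketch, using the bool-returning callback form
from this series; the function name is made up):

---
static bool nvme_fail_started_request(struct request *req, void *data,
                                      bool reserved)
{
        if (!blk_mq_request_started(req))
                return true;    /* not started: skip it, keep iterating */
        blk_mq_end_request(req, BLK_STS_IOERR);
        return true;
}
---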


[PATCH 2/2] nvme: Remove queue flushing hack

2018-11-30 Thread Keith Busch
The nvme driver checked the queue state on every IO so the path could
drain requests on a dying queue. The code, however, declares "We should
not need to do this", so let's not do it. Instead, use blk-mq's tag
iterator to terminate entered requests on dying queues so the IO path
doesn't have to deal with these conditions.

Signed-off-by: Keith Busch 
---
 drivers/nvme/host/core.c | 10 --
 drivers/nvme/host/pci.c  | 43 +++
 2 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 91474b3c566c..af84c4d3c20e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -103,6 +103,13 @@ static void nvme_put_subsystem(struct nvme_subsystem 
*subsys);
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
   unsigned nsid);
 
+static bool nvme_fail_request(struct blk_mq_hw_ctx *hctx, struct request *req,
+ void *data, bool reserved)
+{
+   blk_mq_end_request(req, BLK_STS_IOERR);
+   return true;
+}
+
 static void nvme_set_queue_dying(struct nvme_ns *ns)
 {
/*
@@ -113,8 +120,7 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
return;
revalidate_disk(ns->disk);
blk_set_queue_dying(ns->queue);
-   /* Forcibly unquiesce queues to avoid blocking dispatch */
-   blk_mq_unquiesce_queue(ns->queue);
+   blk_mq_queue_tag_busy_iter(ns->queue, nvme_fail_request, NULL);
 }
 
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3ecc0bf75a62..ec830aa52842 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -926,13 +926,6 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx 
*hctx,
struct nvme_command cmnd;
blk_status_t ret;
 
-   /*
-* We should not need to do this, but we're still using this to
-* ensure we can drain requests on a dying queue.
-*/
-   if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
-   return BLK_STS_IOERR;
-
	ret = nvme_setup_cmd(ns, req, &cmnd);
if (ret)
return ret;
@@ -1408,10 +1401,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 {
	if (!test_and_clear_bit(NVMEQ_ENABLED, &nvmeq->flags))
return 1;
-
-   /* ensure that nvme_queue_rq() sees NVMEQ_ENABLED cleared */
-   mb();
-
nvmeq->dev->online_queues--;
if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
@@ -1611,15 +1600,30 @@ static const struct blk_mq_ops nvme_mq_ops = {
.poll   = nvme_poll,
 };
 
+static bool nvme_fail_queue_request(struct request *req, void *data, bool 
reserved)
+{
+   struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+   struct nvme_queue *nvmeq = iod->nvmeq;
+
+   if (test_bit(NVMEQ_ENABLED, &nvmeq->flags))
+   return true;
+   blk_mq_end_request(req, BLK_STS_IOERR);
+   return true;
+}
+
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
 {
if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) {
/*
 * If the controller was reset during removal, it's possible
-* user requests may be waiting on a stopped queue. Start the
-* queue to flush these to completion.
+* user requests may be waiting on a stopped queue. End all
+* entered requests after preventing new requests from
+* entering.
 */
-   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+   blk_set_queue_dying(dev->ctrl.admin_q);
+   blk_mq_tagset_all_iter(&dev->admin_tagset,
+  nvme_fail_queue_request,
+  NULL);
blk_cleanup_queue(dev->ctrl.admin_q);
	blk_mq_free_tag_set(&dev->admin_tagset);
}
@@ -2411,6 +2415,11 @@ static void nvme_pci_disable(struct nvme_dev *dev)
}
 }
 
+static void nvme_fail_requests(struct nvme_dev *dev)
+{
+   blk_mq_tagset_all_iter(&dev->tagset, nvme_fail_queue_request, NULL);
+}
+
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
int i;
@@ -2454,11 +2463,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown)
 
/*
 * The driver will not be starting up queues again if shutting down so
-* must flush all entered requests to their failed completion to avoid
+* must end all entered requests to their failed completion to avoid
 * deadlocking blk-mq hot-cpu notifier.
 */
if (shutdown)
-   nvme_start_queues(&dev->ctrl);
+   nvme_fail_requests(dev);
mutex_unloc

[PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-11-30 Thread Keith Busch
A driver may wish to iterate every tagged request, not just ones that
satisfy blk_mq_request_started(). The intended use is so a driver may
terminate entered requests on quiesced queues.

Signed-off-by: Keith Busch 
---
 block/blk-mq-tag.c | 41 +++--
 block/blk-mq-tag.h |  2 --
 include/linux/blk-mq.h |  4 
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 87bc5df72d48..c5fd2e0a4074 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -272,6 +272,7 @@ struct bt_tags_iter_data {
busy_tag_iter_fn *fn;
void *data;
bool reserved;
+   bool all;
 };
 
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void 
*data)
@@ -289,7 +290,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
 * test and set the bit before assining ->rqs[].
 */
rq = tags->rqs[bitnr];
-   if (rq && blk_mq_request_started(rq))
+   if (rq && (iter_data->all || blk_mq_request_started(rq)))
return iter_data->fn(rq, iter_data->data, reserved);
 
return true;
@@ -309,13 +310,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
  * bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue 
*bt,
-busy_tag_iter_fn *fn, void *data, bool reserved)
+busy_tag_iter_fn *fn, void *data, bool reserved,
+bool all)
 {
struct bt_tags_iter_data iter_data = {
.tags = tags,
.fn = fn,
.data = data,
.reserved = reserved,
+   .all = all,
};
 
if (tags->rqs)
@@ -333,11 +336,12 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, 
struct sbitmap_queue *bt,
  * @priv:  Will be passed as second argument to @fn.
  */
 static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-   busy_tag_iter_fn *fn, void *priv)
+   busy_tag_iter_fn *fn, void *priv, bool all)
 {
if (tags->nr_reserved_tags)
-   bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-   bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+   bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+all);
+   bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, all);
 }
 
 /**
@@ -357,11 +361,35 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
*tagset,
 
for (i = 0; i < tagset->nr_hw_queues; i++) {
if (tagset->tags && tagset->tags[i])
-   blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+   blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv,
+false);
}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
+/**
+ * blk_mq_tagset_all_iter - iterate over all requests in a tag set
+ * @tagset:Tag set to iterate over.
+ * @fn:Pointer to the function that will be called for each 
started
+ * request. @fn will be called as follows: @fn(rq, @priv,
+ * reserved) where rq is a pointer to a request. 'reserved'
+ * indicates whether or not @rq is a reserved request. Return
+ * true to continue iterating tags, false to stop.
+ * @priv:  Will be passed as second argument to @fn.
+ */
+void blk_mq_tagset_all_iter(struct blk_mq_tag_set *tagset,
+   busy_tag_iter_fn *fn, void *priv)
+{
+   int i;
+
+   for (i = 0; i < tagset->nr_hw_queues; i++) {
+   if (tagset->tags && tagset->tags[i])
+   blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv,
+true);
+   }
+}
+EXPORT_SYMBOL(blk_mq_tagset_all_iter);
+
 /**
  * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
  * @q: Request queue to examine.
@@ -408,6 +436,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
}
blk_queue_exit(q);
 }
+EXPORT_SYMBOL(blk_mq_queue_tag_busy_iter);
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
bool round_robin, int node)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..5af7ff94b400 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -33,8 +33,6 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tags **tags,
unsigned int depth, bool can_grow);
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
-void blk_mq_queue_tag_busy_iter(struct reque

Re: [GIT PULL] nvme fixes for 4.20

2018-11-30 Thread Keith Busch
On Fri, Nov 30, 2018 at 08:26:24AM -0700, Jens Axboe wrote:
> On 11/30/18 8:24 AM, Christoph Hellwig wrote:
> > Various fixlets all over, including throwing in a 'default y' for the
> > multipath code, given that we want people to actually enable it for full
> > functionality.
> 
> Why enable it by default? 99.9% of users aren't going to care. That
> seems like an odd choice.

I'm okay with making it the default, but can we move this to 4.21 please?


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-11-30 Thread Keith Busch
On Fri, Nov 30, 2018 at 12:08:09AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 01:36:32PM -0700, Keith Busch wrote:
> > On Thu, Nov 29, 2018 at 08:13:04PM +0100, Christoph Hellwig wrote:
> > > +
> > > + /* handle any remaining CQEs */
> > > + if (opcode == nvme_admin_delete_cq &&
> > > + !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> > > + nvme_poll_irqdisable(nvmeq, -1);
> > 
> > We're dispatchig lots of queue deletions in parallel, and they may
> > complete in any order. I don't see how you can guarantee that the
> > wait_for_completion() will return for the nvmeq that you're polling.
> 
> True.  I thought about moving the completion to the queue so that
> we have one completion per queue, and I should have done that after
> all.  Note sure how I got the idea that not doing it is fine.

You may also move the completion polling into its own loop outside the
deletion loop.
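
Something like this (rough sketch only; the loop bounds are illustrative,
and it keeps your NVMEQ_DELETE_ERROR check as a second pass):

---
	while (sent--) {
		timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
							 timeout);
		if (timeout == 0)
			return false;
	}

	/* second pass: reap CQEs left behind by the deleted queues */
	if (opcode == nvme_admin_delete_cq) {
		for (i = dev->ctrl.queue_count - 1; i > 0; i--) {
			struct nvme_queue *nvmeq = &dev->queues[i];

			if (!test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
				nvme_poll_irqdisable(nvmeq, -1);
		}
	}
---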


Re: [PATCH 01/13] block: move queues types to the block layer

2018-11-30 Thread Keith Busch
On Fri, Nov 30, 2018 at 12:00:13AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 01:19:14PM -0700, Keith Busch wrote:
> > On Thu, Nov 29, 2018 at 08:12:58PM +0100, Christoph Hellwig wrote:
> > > +enum hctx_type {
> > > + HCTX_TYPE_DEFAULT,  /* all I/O not otherwise accounted for */
> > > + HCTX_TYPE_READ, /* just for READ I/O */
> > > + HCTX_TYPE_POLL, /* polled I/O of any kind */
> > > +
> > > + HCTX_MAX_TYPES,
> > >  };
> > 
> > Well, there goes my plan to use this with Weighted-Round-Robin NVMe IO
> > queues!
> 
> Wo between what do you even want to round robin?  If it is between
> reads and writes that's easy.  If we want priority reads or writes
> (separate from polling) that's also still fairly easily.

I was considering the IOPRIO_PRIO_CLASS values. There are four of them,
which may roughly correspond to the four NVMe IO queue weights. Maybe
even throw HIPRI flagged IOs in with the RT class.

 
> Btw, one thing I wanted to try once I get hold of the right hardware
> is to mark the poll queues as priority queues and see if that makes
> any differents in poll IOPS/latency.

I doubt it will make much difference in IOPS, but should improve latency
on hipri IOs at the expense of normal IO since hipri will be fetched
ahead during command arbitration.


Re: [PATCH 08/13] nvme-pci: remove the CQ lock for interrupt driven queues

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:13:05PM +0100, Christoph Hellwig wrote:
> @@ -1050,12 +1051,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
>   irqreturn_t ret = IRQ_NONE;
>   u16 start, end;
>  
> - spin_lock(&nvmeq->cq_lock);
> + /*
> +  * The rmb/wmb pair ensures we see all updates from a previous run of
> +  * the irq handler, even if that was on another CPU.
> +  */
> + rmb();
>   if (nvmeq->cq_head != nvmeq->last_cq_head)
>   ret = IRQ_HANDLED;
>   nvme_process_cq(nvmeq, &start, &end, -1);
>   nvmeq->last_cq_head = nvmeq->cq_head;
> - spin_unlock(&nvmeq->cq_lock);
> + wmb();
>  
>   if (start != end) {
>   nvme_complete_cqes(nvmeq, start, end);

We saved the "start, end" only so we could do the real completion
without holding a queue lock. Since you're not using a lock anymore,
a further optimization can complete the CQE inline with moving the cq
head so that we don't go through the queue twice.
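
Roughly what I mean (just a sketch; the barriers and last_cq_head check
from your patch are omitted for brevity):

---
static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	irqreturn_t ret = IRQ_NONE;

	/* complete each CQE as the head advances; no second walk */
	while (nvme_cqe_pending(nvmeq)) {
		nvme_handle_cqe(nvmeq, nvmeq->cq_head);
		nvme_update_cq_head(nvmeq);
		ret = IRQ_HANDLED;
	}
	if (ret == IRQ_HANDLED)
		nvme_ring_cq_doorbell(nvmeq);
	return ret;
}
---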

That can be a follow-on, though; this patch looks fine.

Reviewed-by: Keith Busch 


Re: [PATCH 06/13] nvme-pci: refactor nvme_disable_io_queues

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:13:03PM +0100, Christoph Hellwig wrote:
> Pass the opcode for the delete SQ/CQ command as an argument instead of
> the somewhat confusing pass loop.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 07/13] nvme-pci: don't poll from irq context when deleting queues

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:13:04PM +0100, Christoph Hellwig wrote:
> This is the last place outside of nvme_irq that handles CQEs from
> interrupt context, and thus is in the way of removing the cq_lock for
> normal queues, and avoiding lockdep warnings on the poll queues, for
> which we already take it without IRQ disabling.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/pci.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 9ceba9900ca3..fb8db7d8170a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -203,6 +203,7 @@ struct nvme_queue {
>   unsigned long flags;
>  #define NVMEQ_ENABLED0
>  #define NVMEQ_SQ_CMB 1
> +#define NVMEQ_DELETE_ERROR   2
>   u32 *dbbuf_sq_db;
>   u32 *dbbuf_cq_db;
>   u32 *dbbuf_sq_ei;
> @@ -2216,7 +2217,7 @@ static void nvme_del_cq_end(struct request *req, 
> blk_status_t error)
>   struct nvme_queue *nvmeq = req->end_io_data;
>  
>   if (!error)
> - nvme_poll_irqdisable(nvmeq, -1);
> + set_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
>  
>   nvme_del_queue_end(req, error);
>  }
> @@ -2258,11 +2259,20 @@ static bool nvme_disable_io_queues(struct nvme_dev 
> *dev, u8 opcode)
>   nr_queues--;
>   sent++;
>   }
> - while (sent--) {
> + while (sent) {
> + struct nvme_queue *nvmeq = &dev->queues[nr_queues + sent];
> +
>   timeout = wait_for_completion_io_timeout(&dev->ioq_wait,
>   timeout);
>   if (timeout == 0)
>   return false;
> +
> + /* handle any remaining CQEs */
> + if (opcode == nvme_admin_delete_cq &&
> + !test_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags))
> + nvme_poll_irqdisable(nvmeq, -1);

We're dispatching lots of queue deletions in parallel, and they may
complete in any order. I don't see how you can guarantee that the
wait_for_completion() will return for the nvmeq that you're polling.

You also need to clear NVMEQ_DELETE_ERROR somewhere later, maybe in
nvme_init_queue().
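
Something like this in nvme_init_queue(), for example (sketch):

---
	/* a failed delete from a prior reset may have flagged this queue */
	clear_bit(NVMEQ_DELETE_ERROR, &nvmeq->flags);
---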


Re: [PATCH 03/13] nvme-pci: cleanup SQ allocation a bit

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:13:00PM +0100, Christoph Hellwig wrote:
> Use a bit flag to mark if the SQ was allocated from the CMB, and clean
> up the surrounding code a bit.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 02/13] nvme-pci: use atomic bitops to mark a queue enabled

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:12:59PM +0100, Christoph Hellwig wrote:
> This gets rid of all the messing with cq_vector and the ->polled field
> by using an atomic bitop to mark the queue enabled or not.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH 01/13] block: move queues types to the block layer

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 08:12:58PM +0100, Christoph Hellwig wrote:
> +enum hctx_type {
> + HCTX_TYPE_DEFAULT,  /* all I/O not otherwise accounted for */
> + HCTX_TYPE_READ, /* just for READ I/O */
> + HCTX_TYPE_POLL, /* polled I/O of any kind */
> +
> + HCTX_MAX_TYPES,
>  };

Well, there goes my plan to use this with Weighted-Round-Robin NVMe IO
queues!

I'm not that sad about it, though.

Reviewed-by: Keith Busch 


Re: [PATCH 3/7] nvme: implement mq_ops->commit_rqs() hook

2018-11-29 Thread Keith Busch
On Thu, Nov 29, 2018 at 10:06:20AM -0700, Jens Axboe wrote:
> On 11/29/18 10:04 AM, Christoph Hellwig wrote:
> > gcc never warns about unused static inline functions.  Which makes a lot
> > of sense at least for headers..
> 
> Not so much for non-headers :-)

You can #include .c files too! :)


Re: [PATCH] nvme: Fix PCIe surprise removal scenario

2018-11-26 Thread Keith Busch
On Fri, Nov 23, 2018 at 07:58:10AM -0800, Igor Konopko wrote:
> This patch fixes kernel OOPS for surprise removal
> scenario for PCIe connected NVMe drives.
> 
> After latest changes, when PCIe device is not present,
> nvme_dev_remove_admin() calls blk_cleanup_queue() on
> admin queue, which frees hctx for that queue.
> Moment later, on the same path nvme_kill_queues()
> calls blk_mq_unquiesce_queue() on admin queue and
> tries to access hctx of it, which leads to
> following OOPS scenario:
> 
> Oops:  [#1] SMP PTI
> RIP: 0010:sbitmap_any_bit_set+0xb/0x40
> Call Trace:
>  blk_mq_run_hw_queue+0xd5/0x150
>  blk_mq_run_hw_queues+0x3a/0x50
>  nvme_kill_queues+0x26/0x50
>  nvme_remove_namespaces+0xb2/0xc0
>  nvme_remove+0x60/0x140
>  pci_device_remove+0x3b/0xb0
> 
> Fixes: cb4bfda62afa2 ("nvme-pci: fix hot removal during error handling")
> Signed-off-by: Igor Konopko 

Thanks Igor, my previous proposal was a bit flawed, and this patch looks
good. For the future, you should adjust your commit log formatting to
use the more standard 72 character width.

The fix should go in for the next rc for sure. Maybe longer term, if
a request_queue has an active reference, it might make sense to have
blk-mq's APIs consistently handle these sorts of things.

Reviewed-by: Keith Busch 


> ---
>  drivers/nvme/host/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 65c42448e904..5aff95389694 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3601,7 +3601,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
>   down_read(&ctrl->namespaces_rwsem);
>  
>   /* Forcibly unquiesce queues to avoid blocking dispatch */
> - if (ctrl->admin_q)
> + if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
>   blk_mq_unquiesce_queue(ctrl->admin_q);
>  
>   list_for_each_entry(ns, &ctrl->namespaces, list)
> -- 
> 2.14.5


Re: [PATCHv3 2/3] scsi: Do not rely on blk-mq for double completions

2018-11-19 Thread Keith Busch
On Mon, Nov 19, 2018 at 12:58:15AM -0800, Christoph Hellwig wrote:
> > index 5d83a162d03b..c1d5e4e36125 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1635,8 +1635,11 @@ static blk_status_t scsi_mq_prep_fn(struct request 
> > *req)
> >  
> >  static void scsi_mq_done(struct scsi_cmnd *cmd)
> >  {
> > +   if (unlikely(test_and_set_bit(__SCMD_COMPLETE, &cmd->flags)))
> > +   return;
> > trace_scsi_dispatch_cmd_done(cmd);
> > -   blk_mq_complete_request(cmd->request);
> > +   if (unlikely(!blk_mq_complete_request(cmd->request)))
> > +   clear_bit(__SCMD_COMPLETE, &cmd->flags);
> >  }
> 
> This looks a little odd to me.  If we didn't complete the command
> someone else did.  Why would we clear the bit in this case?

It's only to go along with the fake timeout. If we don't clear the bit,
then the scsi timeout handler will believe it has nothing to do because
scsi did its required part. The block layer just pretends the LLD didn't
do its part, so scsi has to play along too.

> >  static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> > @@ -1701,6 +1704,7 @@ static blk_status_t scsi_queue_rq(struct 
> > blk_mq_hw_ctx *hctx,
> > goto out_dec_host_busy;
> > req->rq_flags |= RQF_DONTPREP;
> > } else {
> > +   cmd->flags &= ~SCMD_COMPLETE;
> > blk_mq_start_request(req);
> > }
> >  
> > diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
> > index d6fd2aba0380..ded7c7194a28 100644
> > --- a/include/scsi/scsi_cmnd.h
> > +++ b/include/scsi/scsi_cmnd.h
> > @@ -58,6 +58,9 @@ struct scsi_pointer {
> >  #define SCMD_TAGGED(1 << 0)
> >  #define SCMD_UNCHECKED_ISA_DMA (1 << 1)
> >  #define SCMD_INITIALIZED   (1 << 2)
> > +
> > +#define __SCMD_COMPLETE3
> > +#define SCMD_COMPLETE  (1 << __SCMD_COMPLETE)
> 
> This mixing of atomic and non-atomic bitops looks rather dangerous
> to me.  Can you add a new atomic_flags just for the completed flag,
> and always use the bitops on it for now? I think we can eventually
> kill most of the existing flags except for SCMD_TAGGED over the
> next merge window or two and then move that over as well.

The only concurrent access is completion + timeout; otherwise access is
single-threaded. I'm using the atomic operations only where they are
needed.

We implicitly clear the SCMD_COMPLETED flag along with SCMD_TAGGED in
scsi_init_command() too, and I didn't want to add new overhead with
new atomics.

> Otherwise the concept looks fine to me.


Re: [PATCH 1/6] nvme: don't disable local ints for polled queue

2018-11-14 Thread Keith Busch
On Wed, Nov 14, 2018 at 12:43:22AM -0800, Christoph Hellwig wrote:
> static int nvme_poll_noirq(struct blk_mq_hw_ctx *hctx, unsigned int tag)
> {
> struct nvme_queue *nvmeq = hctx->driver_data;
> > u16 start, end;
>   bool found;
> >  
> > if (!nvme_cqe_pending(nvmeq))
> > return 0;
> >  
> > +   spin_lock(&nvmeq->cq_lock);
> > found = nvme_process_cq(nvmeq, &start, &end, tag);
> > +   spin_unlock(&nvmeq->cq_lock);
> > +
> > nvme_complete_cqes(nvmeq, start, end);
> > return found;
> 
> And while we are at it:  I think for the irq-driven queues in a
> separate queue for poll setup we might not even need to take the
> CQ lock.  Which might be an argument for only allowing polling
> if we have the separate queues just to keep everything simple.

That's a pretty cool observation. We still poll interrupt driven queues
in the timeout path as a sanity check (it really has helped in debugging
timeout issues), but we can temporarily disable the cq's irq and be
lockless.
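
Something along these lines for the interrupt-driven case (sketch only; a
dedicated poll queue would still need the cq_lock):

---
static bool nvme_poll_irqdisable(struct nvme_queue *nvmeq, int tag)
{
	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
	u16 start, end;
	bool found;

	/* mask the queue's vector so the irq handler can't race with us */
	disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
	found = nvme_process_cq(nvmeq, &start, &end, tag);
	enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));

	nvme_complete_cqes(nvmeq, start, end);
	return found;
}
---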


Re: [PATCH 06/11] block: add polled wakeup task helper

2018-11-13 Thread Keith Busch
On Tue, Nov 13, 2018 at 08:42:28AM -0700, Jens Axboe wrote:
> If we're polling for IO on a device that doesn't use interrupts, then
> IO completion loop (and wake of task) is done by submitting task itself.
> If that is the case, then we don't need to enter the wake_up_process()
> function, we can simply mark ourselves as TASK_RUNNING.
> 
> Signed-off-by: Jens Axboe 
> ---
>  fs/block_dev.c |  6 ++
>  fs/iomap.c |  3 +--
>  include/linux/blkdev.h | 19 +++
>  3 files changed, 22 insertions(+), 6 deletions(-)

One more for swap read:

---
diff --git a/mm/page_io.c b/mm/page_io.c
index d4d1c89bcddd..57572ff46016 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -140,7 +140,7 @@ static void end_swap_bio_read(struct bio *bio)
unlock_page(page);
WRITE_ONCE(bio->bi_private, NULL);
bio_put(bio);
-   wake_up_process(waiter);
+   blk_wake_io_task(waiter);
put_task_struct(waiter);
 }
 
--
 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 2f920c03996e..0ed9be8906a8 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -181,8 +181,7 @@ static void blkdev_bio_end_io_simple(struct bio *bio)
>   struct task_struct *waiter = bio->bi_private;
>  
>   WRITE_ONCE(bio->bi_private, NULL);
> - smp_wmb();
> - wake_up_process(waiter);
> + blk_wake_io_task(waiter);
>  }
>  
>  static ssize_t
> @@ -309,8 +308,7 @@ static void blkdev_bio_end_io(struct bio *bio)
>   struct task_struct *waiter = dio->waiter;
>  
>   WRITE_ONCE(dio->waiter, NULL);
> - smp_wmb();
> - wake_up_process(waiter);
> + blk_wake_io_task(waiter);
>   }
>   }
>  
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 7898927e758e..a182699e28db 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1526,8 +1526,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>   struct task_struct *waiter = dio->submit.waiter;
>  
>   WRITE_ONCE(dio->submit.waiter, NULL);
> - smp_wmb();
> - wake_up_process(waiter);
> + blk_wake_io_task(waiter);
>   } else if (dio->flags & IOMAP_DIO_WRITE) {
>   struct inode *inode = file_inode(dio->iocb->ki_filp);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index ad8474ec8c58..d1ef8cbbea04 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1798,4 +1798,23 @@ static inline int blkdev_issue_flush(struct 
> block_device *bdev, gfp_t gfp_mask,
>  
>  #endif /* CONFIG_BLOCK */
>  
> +static inline void blk_wake_io_task(struct task_struct *waiter)
> +{
> + /*
> +  * If we're polling, the task itself is doing the completions. For
> +  * that case, we don't need to signal a wakeup, it's enough to just
> +  * mark us as RUNNING.
> +  */
> + if (waiter == current)
> + __set_current_state(TASK_RUNNING);
> + else {
> + /*
> +  * Ensure the callers waiter store is ordered and seen
> +  * by the ->bi_end_io() function.
> +  */
> + smp_wmb();
> + wake_up_process(waiter);
> + }
> +}
> +
>  #endif
> -- 
> 2.17.1
> 


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-08 Thread Keith Busch
On Thu, Nov 08, 2018 at 07:10:58PM +0800, Ming Lei wrote:
> I guess the issue may depend on specific QEMU version, just tried the test 
> over
> virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27, not observed
> this problem.

I actually didn't use virtio-scsi, but it really doesn't matter.

FWIW, this is what I ran:

  # qemu-system-x86_64 --version
  QEMU emulator version 2.10.2(qemu-2.10.2-1.fc27)

  # qemu-system-x86_64 -m 192 -smp 2 -enable-kvm -display none -snapshot \
-hda /mnt/images/fedora-27.img  -nographic \
-append "console=tty0 console=ttyS0 root=/dav/sda rw" \
-kernel /boot/vmlinuz-4.18.10-100.fc27.x86_64 \
-initrd /boot/initramfs-4.18.10-100.fc27.x86_64.img

The file "fedora-27.img" is just a filesystem image of a minimal mock
setup from /var/lib/mock/fedora-27-x86_64/root/.

A small memory size makes it easier to observe; otherwise the probability
of allocating unclean pages drops. That's really the only reason I used
qemu: all my real machines have more memory than I ever come close to
using, which makes the observations more random.


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Keith Busch
On Thu, Nov 08, 2018 at 09:12:59AM +0800, Ming Lei wrote:
> Is it NVMe specific issue or common problem in other storage hardware?  SCSI
> does call blk_update_request() and handles partial completion.

Not specific to NVMe.

An example using SG_IO dumping 2MB of unsanitized kernel memory:

sg-test.c:
---
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <scsi/scsi.h>
#include <scsi/sg.h>

#define SIZE (2 * 1024 * 1024 + 8)
int main(int argc, char **argv)
{
struct sg_io_hdr io_hdr;
unsigned char *buffer, cmd[6] = { TEST_UNIT_READY };
int sg, i;

if (argc < 2)
fprintf(stderr, "usage: %s \n", argv[0]), exit(0);

sg = open(argv[1], O_RDONLY);
if (sg < 0)
perror("open"), exit(0);

buffer = malloc(SIZE);
if (!buffer)
fprintf(stderr, "no memory\n"), exit(0);

memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
io_hdr.interface_id = 'S';
io_hdr.cmd_len = 6;
io_hdr.cmdp = cmd;
io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
io_hdr.dxfer_len = SIZE;
io_hdr.dxferp = buffer;

memset(buffer, 0, SIZE);
ioctl(sg, SG_IO, &io_hdr);
for (i = 0; i < SIZE; i++) {
printf("%02x", buffer[i]);
if ((i + 1) % 32 == 0)
printf("\n");
}
}
--

Test on qemu:
---
$ ./sg-test /dev/sda | grep -v 0
40733f4019db8001244019db4065244019db0094244019db
c025244019dbc0e43a4019db40973a4019dbc0623a4019db
800c244019dbc0d61d4019dbc05f244019db80091e4019db
40913a4019db806f3f4019db40a83f4019dbc083244019db
80eb1e4019db00a93f4019dbc09a3a4019db40503f4019db
007f1b4019dbc0d91e4019db40551e4019db804a1b4019db

--


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Keith Busch
On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 11:47 PM Keith Busch  wrote:
> >
> > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > > blk_update_request() may tell us how much progress made, :-)
> >
> > Except when it doesn't, which is 100% of the time for many block
> > drivers, including nvme.
> 
> Please look at blk_mq_end_request()(<- nvme_complete_rq()), which
> calls blk_update_request().

Huh? That just says 100% of the request size was transferred no matter
what was actually transferred.

The protocol doesn't have a way to express what transfer size occurred
with a command's completion, and even if it did, there's no way I'd trust
every firmware not to screw it up.


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Keith Busch
On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> blk_update_request() may tell us how much progress made, :-)

Except when it doesn't, which is 100% of the time for many block
drivers, including nvme.


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Keith Busch
On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 10:42 PM Keith Busch  wrote:
> >
> > If the kernel allocates a bounce buffer for user read data, this memory
> > needs to be cleared before copying it to the user, otherwise it may leak
> > kernel memory to user space.
> >
> > Signed-off-by: Keith Busch 
> > ---
> >  block/bio.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index d5368a445561..a50d59236b19 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> > if (ret)
> > goto cleanup;
> > } else {
> > +   zero_fill_bio(bio);
> > iov_iter_advance(iter, bio->bi_iter.bi_size);
> > }
> 
> This way looks inefficient because zero fill should only be required
> for short READ.

Sure, but how do you know that happened before copying the bounce buffer
to user space?

We could zero the pages on allocation if that's better (and doesn't zero
twice if __GFP_ZERO was already provided):

---
diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a1b6383294f4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
nr_pages = 1 << map_data->page_order;
i = map_data->offset / PAGE_SIZE;
}
+
+   if (iov_iter_rw(iter) == READ)
+   gfp_mask |= __GFP_ZERO;
while (len) {
unsigned int bytes = PAGE_SIZE;
 
--


[PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Keith Busch
If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.

Signed-off-by: Keith Busch 
---
 block/bio.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a50d59236b19 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
if (ret)
goto cleanup;
} else {
+   zero_fill_bio(bio);
iov_iter_advance(iter, bio->bi_iter.bi_size);
}
 
-- 
2.14.4



Re: nvme-pci: number of queues off by one

2018-10-08 Thread Keith Busch
On Mon, Oct 08, 2018 at 02:58:21PM +0800, Dongli Zhang wrote:
> I got the same result when emulating nvme with qemu: the VM has 12 cpu, while
> the num_queues of nvme is 8.
> 
> # uname -r
> 4.14.1
> # ll /sys/block/nvme*n1/mq/*/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/0/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/1/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/2/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/3/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/4/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/5/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:30 /sys/block/nvme0n1/mq/6/cpu_list
> 
> 
> # uname -r
> 4.18.10
> # ll /sys/block/nvme*n1/mq/*/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/0/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/1/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/2/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/3/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/4/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/5/cpu_list
> -r--r--r-- 1 root root 4096 Oct  8 14:34 /sys/block/nvme0n1/mq/6/cpu_list
> 
> From below qemu source code, when n->num_queues is 8, the handler of
> NVME_FEAT_NUM_QUEUES returns 0x60006.
> 
>  719 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest 
> *req)
>  720 {
>  721 uint32_t dw10 = le32_to_cpu(cmd->cdw10);
>  722 uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  723
>  724 switch (dw10) {
>  725 case NVME_VOLATILE_WRITE_CACHE:
>  726 blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>  727 break;
>  728 case NVME_NUMBER_OF_QUEUES:
>  729 trace_nvme_setfeat_numq((dw11 & 0xffff) + 1,
>  730 ((dw11 >> 16) & 0xffff) + 1,
>  731 n->num_queues - 1, n->num_queues - 1);
>  732 req->cqe.result =
>  733 cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 
> 16));
> > returns 0x60006 when num_queues is 8.
> 
> 
> Finally, nr_io_queues is set to 6+1=7 in nvme_set_queue_count() in VM kernel.
> 
> I do not know how to paraphrase this in the world of nvme.
> 
> Dongli Zhang
> 
> On 10/08/2018 01:59 PM, Dongli Zhang wrote:
> > I can reproduce with qemu:
> > 
> > # ls /sys/block/nvme*n1/mq/*/cpu_list
> > /sys/block/nvme0n1/mq/0/cpu_list
> > /sys/block/nvme0n1/mq/1/cpu_list
> > /sys/block/nvme0n1/mq/2/cpu_list
> > /sys/block/nvme0n1/mq/3/cpu_list
> > /sys/block/nvme0n1/mq/4/cpu_list
> > /sys/block/nvme0n1/mq/5/cpu_list
> > /sys/block/nvme0n1/mq/6/cpu_list
> > 
> > Here is the qemu cmdline emulating 8-queue nvme while the VM has 12 cpu:
> > 
> > # qemu-system-x86_64 -m 4096 -smp 12 \
> > -kernel /path-to-kernel/linux-4.18.10/arch/x86_64/boot/bzImage \
> > -hda /path-to-img/ubuntu1804.qcow2  \
> > -append "root=/dev/sda1 init=/sbin/init text" -enable-kvm \
> > -net nic -net user,hostfwd=tcp::5022-:22 \
> > -device nvme,drive=nvme1,serial=deadbeaf1,num_queues=8 \
> > -drive file=/path-to-img/nvme.disk,if=none,id=nvme1
> > 
> > Dongli Zhang

Qemu counts one of those queues as the admin queue: with num_queues=8, the
0x60006 result is the zero-based count 6 for both submission and completion
queues, which nvme_set_queue_count() turns back into 7 IO queues.

> > On 10/08/2018 01:05 PM, Prasun Ratn wrote:
> >> Hi
> >>
> >> I have an NVMe SSD that has 8 hw queues and on older kernels I see all
> >> 8 show up. However on a recent kernel (I tried 4.18), I only see 7. Is
> >> this a known issue?

That probably means you only have 8 MSI-x vectors, one of which is
reserved for the admin queue. We used to share an IO vector with the
admin queue, however some people figured out you can break your controller
that way with the linux irq spread.


Re: [PATCH 2/5] nvme: register ns_id attributes as default sysfs groups

2018-09-28 Thread Keith Busch
On Fri, Sep 28, 2018 at 08:17:20AM +0200, Hannes Reinecke wrote:
> We should be registering the ns_id attribute as default sysfs
> attribute groups, otherwise we have a race condition between
> the uevent and the attributes appearing in sysfs.
> 
> Suggested-by: Bart van Assche 
> Signed-off-by: Hannes Reinecke 

Thanks, looks great!

Reviewed-by: Keith Busch 


Re: [PATCH v8 13/13] nvmet: Optionally use PCI P2P memory

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:20AM -0600, Logan Gunthorpe wrote:
> We create a configfs attribute in each nvme-fabrics target port to
> enable p2p memory use. When enabled, the port will only then use the
> p2p memory if a p2p memory device can be found which is behind the
> same switch hierarchy as the RDMA port and all the block devices in
> use. If the user enabled it and no devices are found, then the system
> will silently fall back on using regular memory.
> 
> If appropriate, that port will allocate memory for the RDMA buffers
> for queues from the p2pmem device falling back to system memory should
> anything fail.
> 
> Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
> save an extra PCI transfer as the NVME card could just take the data
> out of it's own memory. However, at this time, only a limited number
> of cards with CMB buffers seem to be available.
> 
> Signed-off-by: Stephen Bates 
> Signed-off-by: Steve Wise 
> [hch: partial rewrite of the initial code]
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Logan Gunthorpe 

I don't have the necessary hardware to try this out, but I'm looking
forward to it in the future. Looks good.

Reviewed-by: Keith Busch 


Re: [PATCH v8 09/13] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:16AM -0600, Logan Gunthorpe wrote:
> Register the CMB buffer as p2pmem and use the appropriate allocation
> functions to create and destroy the IO submission queues.
> 
> If the CMB supports WDS and RDS, publish it for use as P2P memory
> by other devices.
> 
> Kernels without CONFIG_PCI_P2PDMA will also no longer support NVMe CMB.
> However, seeing the main use-cases for the CMB is P2P operations,
> this seems like a reasonable dependency.
> 
> We drop the __iomem safety on the buffer seeing that, by convention, it's
> safe to directly access memory mapped by memremap()/devm_memremap_pages().
> Architectures where this is not safe will not be supported by memremap()
> and therefore will not be support PCI P2P and have no support for CMB.
> 
> Signed-off-by: Logan Gunthorpe 

Looks good to me.

Reviewed-by: Keith Busch 


Re: [PATCH v8 10/13] nvme-pci: Add support for P2P memory in requests

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:17AM -0600, Logan Gunthorpe wrote:
> For P2P requests, we must use the pci_p2pmem_map_sg() function
> instead of the dma_map_sg functions.
> 
> With that, we can then indicate PCI_P2P support in the request queue.
> For this, we create an NVME_F_PCI_P2P flag which tells the core to
> set QUEUE_FLAG_PCI_P2P in the request queue.
> 
> Signed-off-by: Logan Gunthorpe 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Christoph Hellwig 

Looks good to me as well.

Reviewed-by: Keith Busch 


Re: [PATCH v8 11/13] nvme-pci: Add a quirk for a pseudo CMB

2018-09-27 Thread Keith Busch
On Thu, Sep 27, 2018 at 10:54:18AM -0600, Logan Gunthorpe wrote:
> Introduce a quirk to use CMB-like memory on older devices that have
> an exposed BAR but do not advertise support for using CMBLOC and
> CMBSIZE.
> 
> We'd like to use some of these older cards to test P2P memory.
> 
> Signed-off-by: Logan Gunthorpe 
> Reviewed-by: Sagi Grimberg 

The implementation looks fine, and this is why we have quirks ... but
CMB has existed since 2014! :)

Reviewed-by: Keith Busch 


[PATCHv2] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
A recent commit runs tag iterator callbacks under the rcu read lock,
but existing callbacks do not satisfy the non-blocking requirement.
The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference instead of reading it, which allows callbacks to make blocking
calls.

Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with 
blk_mq_queue_tag_busy_iter")
Cc: Jianchao Wang 
Signed-off-by: Keith Busch 
---
v1 -> v2:

  Leave the iterator interface as-is, making this change transparent
  to the callers. This will increment the queue usage counter in the
  timeout path twice, which is harmless.

  Changelog.

 block/blk-mq-tag.c | 13 -
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..41317c50a446 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,16 +322,11 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
 
/*
 * __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
-* queue_hw_ctx after freeze the queue. So we could use q_usage_counter
-* to avoid race with it. __blk_mq_update_nr_hw_queues will users
-* synchronize_rcu to ensure all of the users go out of the critical
-* section below and see zeroed q_usage_counter.
+* queue_hw_ctx after freeze the queue, so we use q_usage_counter
+* to avoid race with it.
 */
-   rcu_read_lock();
-   if (percpu_ref_is_zero(&q->q_usage_counter)) {
-   rcu_read_unlock();
+   if (!percpu_ref_tryget(&q->q_usage_counter))
return;
-   }
 
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
@@ -347,7 +342,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, 
busy_iter_fn *fn,
bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
}
-   rcu_read_unlock();
+   blk_queue_exit(q);
 }
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
-- 
2.14.4



Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
On Tue, Sep 25, 2018 at 08:14:45AM -0700, Bart Van Assche wrote:
> On Tue, 2018-09-25 at 08:39 -0600, Keith Busch wrote:
> > On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> > > But the issue is the left part of blk_mq_timeout_work is moved out of 
> > > protection of q refcount.
> > 
> > I'm not sure what you mean by "left part". The only part that isn't
> > outside the reference with this patch is the part Bart pointed out.
> > 
> > This looks like it may be fixed by either moving the refcount back up a
> > level to all the callers of blk_mq_queue_tag_busy_iter, or add
> > cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after
> > the freeze.
> 
> Hi Keith,
> 
> How about applying the following (untested) patch on top of your patch?

Yep, this works. I'll fold it into a v2.
 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 019f9b169887..099e203b5213 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -851,6 +851,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
>   return;
>  
> + if (!percpu_ref_tryget(&q->q_usage_counter))
> + return;
> +
>   if (next != 0) {
>   mod_timer(&q->timeout, next);
>   } else {
> @@ -866,6 +869,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
>   blk_mq_tag_idle(hctx);
>   }
>   }
> + blk_queue_exit(q);
>  }
> 
> Thanks,
> 
> Bart.


Re: [PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-25 Thread Keith Busch
On Tue, Sep 25, 2018 at 10:39:46AM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 09/25/2018 10:20 AM, Bart Van Assche wrote:
> > On 9/24/18 7:11 PM, jianchao.wang wrote:
> >> Hi Keith
> >>
> >> On 09/25/2018 05:09 AM, Keith Busch wrote:
> >>> -    /* A deadlock might occur if a request is stuck requiring a
> >>> - * timeout at the same time a queue freeze is waiting
> >>> - * completion, since the timeout code would not be able to
> >>> - * acquire the queue reference here.
> >>> - *
> >>> - * That's why we don't use blk_queue_enter here; instead, we use
> >>> - * percpu_ref_tryget directly, because we need to be able to
> >>> - * obtain a reference even in the short window between the queue
> >>> - * starting to freeze, by dropping the first reference in
> >>> - * blk_freeze_queue_start, and the moment the last request is
> >>> - * consumed, marked by the instant q_usage_counter reaches
> >>> - * zero.
> >>> - */
> >>> -    if (!percpu_ref_tryget(&q->q_usage_counter))
> >>> +    if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
> >>>   return;
> >>
> >> We cannot discard the percpu_ref_tryget here.
> >>
> >> There following code in blk_mq_timeout_work still need it:
> >>
> >> if (next != 0) {
> >>     mod_timer(&q->timeout, next);
> >> } else {
> >>     queue_for_each_hw_ctx(q, hctx, i) {
> >>     /* the hctx may be unmapped, so check it here */
> >>     if (blk_mq_hw_queue_mapped(hctx))
> >>     blk_mq_tag_idle(hctx);
> >>     }
> >> }
> > 
> > Hi Jianchao,
> > 
> > Had you noticed that the percpu_ref_tryget() call has been moved into
> > blk_mq_queue_tag_busy_iter()?
> > 
> 
> Yes.
> 
> But the issue is the left part of blk_mq_timeout_work is moved out of 
> protection of q refcount.

I'm not sure what you mean by "left part". The only part that isn't
outside the reference with this patch is the part Bart pointed out.

This looks like it may be fixed by either moving the refcount back up a
level to all the callers of blk_mq_queue_tag_busy_iter, or adding
cancel_work_sync(&q->timeout_work) to __blk_mq_update_nr_hw_queues after
the freeze.
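
The second option would look roughly like this (untested sketch):

---
static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
					 int nr_hw_queues)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);
	...
	list_for_each_entry(q, &set->tag_list, tag_set_list) {
		blk_mq_freeze_queue(q);
		/* make sure no timeout work is still touching queue_hw_ctx */
		cancel_work_sync(&q->timeout_work);
	}
	...
}
---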


[PATCH] blk-mq: Allow blocking queue tag iter callbacks

2018-09-24 Thread Keith Busch
A recent commit had tag iterator callbacks run under the rcu read
lock. Existing callbacks exist that do not satisy the rcu non-blocking
requirement.

The commit intended to prevent an iterator from accessing a queue that's
being modified. This patch fixes the original issue by taking a queue
reference if the queue is not frozen instead of a rcu read lock.

Fixes: f5e4d6357 ("blk-mq: sync the update nr_hw_queues with 
blk_mq_queue_tag_busy_iter")
Cc: Jianchao Wang 
Signed-off-by: Keith Busch 
---
 block/blk-mq-tag.c | 30 +-
 block/blk-mq-tag.h |  2 +-
 block/blk-mq.c | 22 +-
 3 files changed, 19 insertions(+), 35 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..63542642a017 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -314,24 +314,27 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set 
*tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+bool blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv)
 {
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /*
-* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
-* queue_hw_ctx after freeze the queue. So we could use q_usage_counter
-* to avoid race with it. __blk_mq_update_nr_hw_queues will users
-* synchronize_rcu to ensure all of the users go out of the critical
-* section below and see zeroed q_usage_counter.
+   /* A deadlock might occur if a request is stuck requiring a
+* timeout at the same time a queue freeze is waiting
+* completion, since the timeout code would not be able to
+* acquire the queue reference here.
+*
+* That's why we don't use blk_queue_enter here; instead, we use
+* percpu_ref_tryget directly, because we need to be able to
+* obtain a reference even in the short window between the queue
+* starting to freeze, by dropping the first reference in
+* blk_freeze_queue_start, and the moment the last request is
+* consumed, marked by the instant q_usage_counter reaches
+* zero.
 */
-   rcu_read_lock();
-   if (percpu_ref_is_zero(&q->q_usage_counter)) {
-   rcu_read_unlock();
-   return;
-   }
+   if (!percpu_ref_tryget(&q->q_usage_counter))
+   return false;
 
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
@@ -347,7 +350,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
}
-   rcu_read_unlock();
+   blk_queue_exit(q);
+   return true;
 }
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..36b3bc90e867 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -33,7 +33,7 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tags **tags,
unsigned int depth, bool can_grow);
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+bool blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..019f9b169887 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -848,24 +848,9 @@ static void blk_mq_timeout_work(struct work_struct *work)
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /* A deadlock might occur if a request is stuck requiring a
-* timeout at the same time a queue freeze is waiting
-* completion, since the timeout code would not be able to
-* acquire the queue reference here.
-*
-* That's why we don't use blk_queue_enter here; instead, we use
-* percpu_ref_tryget directly, because we need to be able to
-* obtain a reference even in the short window between the queue
-* starting to freeze, by dropping the first reference in
-* blk_freeze_queue_start, and the moment the last request is
-* consumed, marked by the instant q_usage_counter reaches
-* zero.
-*/
-   if (!percpu_ref_tryget(&q->q_usage_counter))
+   if (!blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next))
return;
 
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
-
if (next != 0) {
mod_timer(&q->timeout, next);
} else {
@@ -881,7 +866,6 @@ 

Re: Regression caused by f5bbbbe4d635

2018-09-24 Thread Keith Busch
On Mon, Sep 24, 2018 at 12:51:07PM -0700, Bart Van Assche wrote:
> On Mon, 2018-09-24 at 13:13 -0600, Keith Busch wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 85a1c1a59c72..28d128450621 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -848,22 +848,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > struct blk_mq_hw_ctx *hctx;
> > int i;
> >  
> > -   /* A deadlock might occur if a request is stuck requiring a
> > -* timeout at the same time a queue freeze is waiting
> > -* completion, since the timeout code would not be able to
> > -* acquire the queue reference here.
> > -*
> > -* That's why we don't use blk_queue_enter here; instead, we use
> > -* percpu_ref_tryget directly, because we need to be able to
> > -* obtain a reference even in the short window between the queue
> > -* starting to freeze, by dropping the first reference in
> > -* blk_freeze_queue_start, and the moment the last request is
> > -* consumed, marked by the instant q_usage_counter reaches
> > -* zero.
> > -*/
> > -   if (!percpu_ref_tryget(&q->q_usage_counter))
> > -   return;
> > -
> > blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
> >  
> > if (next != 0) {
> > @@ -881,7 +865,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > blk_mq_tag_idle(hctx);
> > }
> > }
> > -   blk_queue_exit(q);
> >  }
> 
> Hi Keith,
> 
> The above introduces a behavior change: if the percpu_ref_tryget() call inside
> blk_mq_queue_tag_busy_iter() fails then blk_mq_timeout_work() will now call
> blk_mq_tag_idle(). I think that's wrong if the percpu_ref_tryget() call fails
> due to the queue having been frozen. Please make blk_mq_queue_tag_busy_iter()
> return a bool that indicates whether or not it has iterated over the request
> queue.

Good point, thanks for the feedback.


Re: Regression caused by f5bbbbe4d635

2018-09-24 Thread Keith Busch
On Mon, Sep 24, 2018 at 12:44:13PM -0600, Jens Axboe wrote:
> Hi,
> 
> This commit introduced a rcu_read_lock() inside
> blk_mq_queue_tag_busy_iter() - this is problematic for the timeout code,
> since we now end up holding the RCU read lock over the timeout code. As
> just one example, nvme ends up doing:
> 
> nvme_timeout()
>   nvme_dev_disable()
>   mutex_lock(&dev->shutdown_lock);
> 
> and things are then obviously unhappy...

Yah, there's never been a requirement that tag iterator callbacks be
non-blocking as far as I remember.

The queue's reference in blk_mq_timeout_work looks applicable to any
blk_mq_queue_tag_busy_iter user, so just moving it there looks like it
should do what f5bbbbe4d635 was trying to fix.

---
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..850577a3de6d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -320,18 +320,21 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /*
-* __blk_mq_update_nr_hw_queues will update the nr_hw_queues and
-* queue_hw_ctx after freeze the queue. So we could use q_usage_counter
-* to avoid race with it. __blk_mq_update_nr_hw_queues will users
-* synchronize_rcu to ensure all of the users go out of the critical
-* section below and see zeroed q_usage_counter.
+   /* A deadlock might occur if a request is stuck requiring a
+* timeout at the same time a queue freeze is waiting
+* completion, since the timeout code would not be able to
+* acquire the queue reference here.
+*
+* That's why we don't use blk_queue_enter here; instead, we use
+* percpu_ref_tryget directly, because we need to be able to
+* obtain a reference even in the short window between the queue
+* starting to freeze, by dropping the first reference in
+* blk_freeze_queue_start, and the moment the last request is
+* consumed, marked by the instant q_usage_counter reaches
+* zero.
 */
-   rcu_read_lock();
-   if (percpu_ref_is_zero(&q->q_usage_counter)) {
-   rcu_read_unlock();
+   if (!percpu_ref_tryget(&q->q_usage_counter))
return;
-   }
 
queue_for_each_hw_ctx(q, hctx, i) {
struct blk_mq_tags *tags = hctx->tags;
@@ -347,7 +350,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
}
-   rcu_read_unlock();
+   blk_queue_exit(q);
 }
 
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..28d128450621 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -848,22 +848,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /* A deadlock might occur if a request is stuck requiring a
-* timeout at the same time a queue freeze is waiting
-* completion, since the timeout code would not be able to
-* acquire the queue reference here.
-*
-* That's why we don't use blk_queue_enter here; instead, we use
-* percpu_ref_tryget directly, because we need to be able to
-* obtain a reference even in the short window between the queue
-* starting to freeze, by dropping the first reference in
-* blk_freeze_queue_start, and the moment the last request is
-* consumed, marked by the instant q_usage_counter reaches
-* zero.
-*/
-   if (!percpu_ref_tryget(&q->q_usage_counter))
-   return;
-
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
 
if (next != 0) {
@@ -881,7 +865,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_mq_tag_idle(hctx);
}
}
-   blk_queue_exit(q);
 }
 
 struct flush_busy_ctx_data {
@@ -2974,10 +2957,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue(q);
-   /*
-* Sync with blk_mq_queue_tag_busy_iter.
-*/
-   synchronize_rcu();
+
/*
 * Switch IO scheduler to 'none', cleaning up the data associated
 * with the previous scheduler. We will switch back once we are done
--


Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again

2018-07-27 Thread Keith Busch
On Fri, Jul 27, 2018 at 05:14:18PM +, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 11:04 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 04:59:34PM +, Bart Van Assche wrote:
> > > On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > > > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> > > 
> > > How about applying the following patch on top of this series?
> > 
> > That works for me, but it breaks scsi again when
> > scmd_eh_abort_handler completes the command a second time.
> 
> How about introducing a new request queue flag that chooses between the
> behavior with or without the patch in my previous e-mail? I don't think that
> it is possible to come up with a single implementation that covers the needs
> of NVMe and SCSI without introducing such a flag. If a SCSI request times out
> then request ownership is transferred from the LLD to the error handler. For
> the NVMe driver however there is no such transfer of ownership.

Instead of PATCH 1/5, how about creating a new timeout return code like
"BLK_EH_DONT_COMPLETE"?


Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again

2018-07-27 Thread Keith Busch
On Fri, Jul 27, 2018 at 04:59:34PM +, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> 
> How about applying the following patch on top of this series?

That works for me, but it breaks scsi again when
scmd_eh_abort_handler completes the command a second time.

 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a97ab5ba9d18..aa66535604fd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -854,10 +854,10 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>* latter case, if blk_mq_complete_request() was called while
>* the timeout handler was in progress, ignore that call.
>*/
> - if (ret == BLK_EH_DONT_RESET_TIMER)
> - return;
> - WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
> - blk_mq_add_timer(req);
> + if (ret == BLK_EH_RESET_TIMER)
> + blk_mq_add_timer(req);
> + else
> + WARN_ON_ONCE(ret != BLK_EH_DONT_RESET_TIMER);
>  again:
>   if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
>   return;
> 
> Bart.


Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again

2018-07-27 Thread Keith Busch
On Fri, Jul 27, 2018 at 04:51:05PM +, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:46 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> > > + ret = req->q->mq_ops->timeout(req, reserved);
> > > + /*
> > > +  * BLK_EH_DONT_RESET_TIMER means that the block driver either
> > > +  * completed the request or still owns the request and will
> > > +  * continue processing the timeout asynchronously. In the
> > > +  * latter case, if blk_mq_complete_request() was called while
> > > +  * the timeout handler was in progress, ignore that call.
> > > +  */
> > > + if (ret == BLK_EH_DONT_RESET_TIMER)
> > > + return;
> > 
> > This is how completions get lost.
> 
> The new approach for handling completions that occur while the .timeout()
> callback in progress is as follows:
> * blk_mq_complete_request() executes the following code:
>if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
>return;
> * blk_mq_rq_timed_out() executes the following code:
>if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
>__blk_mq_complete_request(req);
>return;
>}
> 
> As one can see __blk_mq_complete_request() gets called if this race occurs.

You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.


Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again

2018-07-27 Thread Keith Busch
On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> + ret = req->q->mq_ops->timeout(req, reserved);
> + /*
> +  * BLK_EH_DONT_RESET_TIMER means that the block driver either
> +  * completed the request or still owns the request and will
> +  * continue processing the timeout asynchronously. In the
> +  * latter case, if blk_mq_complete_request() was called while
> +  * the timeout handler was in progress, ignore that call.
> +  */
> + if (ret == BLK_EH_DONT_RESET_TIMER)
> + return;

This is how completions get lost.


Re: [PATCH 3/5] block: Split blk_add_timer()

2018-07-27 Thread Keith Busch
On Fri, Jul 27, 2018 at 09:20:40AM -0700, Bart Van Assche wrote:
> +void blk_mq_add_timer(struct request *req)
> +{
> + struct request_queue *q = req->q;
> +
> + if (!q->mq_ops)
> + lockdep_assert_held(q->queue_lock);
> +
> + /* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
> + if (!q->mq_ops && !q->rq_timed_out_fn)
> + return;



> + if (!q->mq_ops)
> + list_add_tail(&req->timeout_list, &req->q->timeout_list);

Do you really need these checks for !q->mq_ops?


Re: [PATCH v4 3/3] nvme: use blk API to remap ref tags for IOs with metadata

2018-07-25 Thread Keith Busch
On Wed, Jul 25, 2018 at 06:46:17PM +0300, Max Gurtovoy wrote:
> Also moved the logic of the remapping to the nvme core driver instead
> of implementing it in the nvme pci driver. This way all the other nvme
> transport drivers will benefit from it (in case they'll implement metadata
> support).

I'm afraid I won't be able to test any of this in the near future, but
the code looks good to me!

Acked-by: Keith Busch 


Re: [PATCHv2 2/2] scsi: set timed out out mq requests to complete

2018-07-25 Thread Keith Busch
On Wed, Jul 25, 2018 at 03:52:17PM +, Bart Van Assche wrote:
> On Mon, 2018-07-23 at 08:37 -0600, Keith Busch wrote:
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 8932ae81a15a..2715cdaa669c 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -296,6 +296,20 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
> > rtn = host->hostt->eh_timed_out(scmd);
> >  
> > if (rtn == BLK_EH_DONE) {
> > +   /*
> > +* For blk-mq, we must set the request state to complete now
> > +* before sending the request to the scsi error handler. This
> > +* will prevent a use-after-free in the event the LLD manages
> > +* to complete the request before the error handler finishes
> > +* processing this timed out request.
> > +*
> > +* If the request was already completed, then the LLD beat the
> > +* time out handler from transferring the request to the scsi
> > +* error handler. In that case we can return immediately as no
> > +* further action is required.
> > +*/
> > +   if (req->q->mq_ops && !blk_mq_mark_complete(req))
> > +   return rtn;
> > if (scsi_abort_command(scmd) != SUCCESS) {
> > set_host_byte(scmd, DID_TIME_OUT);
> > scsi_eh_scmd_add(scmd);
> 
> Hello Keith,
> 
> What will happen if a completion occurs after scsi_times_out() has started and
> before or during the host->hostt->eh_timed_out()? Can that cause a 
> use-after-free
> in .eh_timed_out()? Can that cause .eh_timed_out() to return 
> BLK_EH_RESET_TIMER
> when it should return BLK_EH_DONE? Can that cause blk_mq_rq_timed_out() to 
> call
> blk_add_timer() when that function shouldn't be called?

That's what the request's refcount protects. The whole point was that
driver returning RESET_TIMER doesn't lose the completion. In the worst
case scenario, the blk-mq timeout work spends some CPU cycles re-arming
a timer that it didn't need to.


Re: [PATCH v3 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-25 Thread Keith Busch
On Wed, Jul 25, 2018 at 01:22:47PM +0200, Christoph Hellwig wrote:
> > +   pmap = kmap_atomic(iv.bv_page) + iv.bv_offset;
> > +   p = pmap;
> 
> Maybe:
> 
>   pmap = p = kmap_atomic(iv.bv_page) + iv.bv_offset;

Max pointed out that even with this, we're still calling kunmap_atomic()
with an address potentially at an offset from the page that was kmap'ed.
While currently harmless, perhaps for correctness:

pmap = kmap_atomic(iv.bv_page);
p = pmap + iv.bv_offset;
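
That way the final kunmap_atomic() gets back exactly what kmap_atomic()
returned. Spelled out (a sketch only, with the names above):

	pmap = kmap_atomic(iv.bv_page);
	p = pmap + iv.bv_offset;

	/* ... walk the PI tuples starting at p ... */

	kunmap_atomic(pmap);	/* not kunmap_atomic(p) */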


Re: [PATCH 2/3] block: move dif_prepare/dif_complete functions to block layer

2018-07-24 Thread Keith Busch
On Tue, Jul 24, 2018 at 04:33:41PM +0300, Max Gurtovoy wrote:
> +void t10_pi_prepare(struct request *rq, u8 protection_type)
> +{
> + const int tuple_sz = rq->q->integrity.tuple_size;
> + u32 ref_tag = t10_pi_ref_tag(rq);
> + struct bio *bio;
> +
> + if (protection_type == T10_PI_TYPE3_PROTECTION)
> + return;
> +
> + __rq_for_each_bio(bio, rq) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + u32 virt = bip_get_seed(bip) & 0xffffffff;
> + struct bio_vec iv;
> + struct bvec_iter iter;
> +
> + /* Already remapped? */
> + if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> + break;
> +
> + bip_for_each_vec(iv, bip, iter) {
> + struct t10_pi_tuple *pi = kmap_atomic(iv.bv_page) +
> + iv.bv_offset;
> + unsigned int j;
> +
> + for (j = 0; j < iv.bv_len; j += tuple_sz) {
> + if (be32_to_cpu(pi->ref_tag) == virt)
> + pi->ref_tag = cpu_to_be32(ref_tag);
> + virt++;
> + ref_tag++;
> + pi += tuple_sz;
> + }
> +
> + kunmap_atomic(pi);
> + }

Since you're incrementing 'pi', you end up unmapping an address that
you didn't map. It does appear harmless in the current kunmap_atomic()
implementation, though.

You are also incrementing 'pi' by too many bytes since it is of type
struct t10_pi_tuple. The nvme driver used void* to make the pointer
arithmetic easier.
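
Roughly, something like this would address both points (a sketch only,
reusing the tuple_sz the patch already computes from the integrity
profile):

	bip_for_each_vec(iv, bip, iter) {
		void *pmap = kmap_atomic(iv.bv_page);
		void *p = pmap + iv.bv_offset;
		unsigned int j;

		for (j = 0; j < iv.bv_len; j += tuple_sz) {
			struct t10_pi_tuple *pi = p;

			if (be32_to_cpu(pi->ref_tag) == virt)
				pi->ref_tag = cpu_to_be32(ref_tag);
			virt++;
			ref_tag++;
			p += tuple_sz;	/* void * arithmetic: advance by bytes */
		}

		kunmap_atomic(pmap);	/* unmap what kmap_atomic() returned */
	}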


Re: [PATCH 1/2] block: move dif_prepare/dif_complete functions to block layer

2018-07-23 Thread Keith Busch
On Sun, Jul 22, 2018 at 12:49:57PM +0300, Max Gurtovoy wrote:
> +void blk_integrity_dif_prepare(struct request *rq, u8 protection_type,
> +u32 ref_tag)
> +{
> + const int tuple_sz = sizeof(struct t10_pi_tuple);
> + struct bio *bio;
> + struct t10_pi_tuple *pi;
> + u32 phys, virt;
> +
> + if (protection_type == T10_PI_TYPE3_PROTECTION)
> + return;
> +
> + phys = ref_tag;
> +
> + __rq_for_each_bio(bio, rq) {
> + struct bio_integrity_payload *bip = bio_integrity(bio);
> + struct bio_vec iv;
> + struct bvec_iter iter;
> + unsigned int j;
> +
> + /* Already remapped? */
> + if (bip->bip_flags & BIP_MAPPED_INTEGRITY)
> + break;
> +
> + virt = bip_get_seed(bip) & 0xffffffff;
> +
> + bip_for_each_vec(iv, bip, iter) {
> + pi = kmap_atomic(iv.bv_page) + iv.bv_offset;
> +
> + for (j = 0; j < iv.bv_len; j += tuple_sz, pi++) {

nvme's data integrity buffer can actually have more space between each
PI field, so we just need to account for that when iterating instead of
assuming each element is the size of a T10 PI tuple.

Otherwise, great idea.
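
That is, instead of

	const int tuple_sz = sizeof(struct t10_pi_tuple);

the iteration stride should come from the integrity profile registered
for the queue (sketch; this assumes nvme keeps reporting its metadata
size there):

	const int tuple_sz = rq->q->integrity.tuple_size;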


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-19 Thread Keith Busch
On Thu, Jul 19, 2018 at 06:29:24PM +0200, h...@lst.de wrote:
> How do we get a fix into 4.18 at this part of the cycle?  I think that
> is the most important priority right now.

Even if you were okay at this point to incorporate the concepts from
Bart's patch, it still looks like trouble for scsi (will elaborate
separately).

But reverting breaks other things we finally got working, so I'd
still vote for isolating the old behavior to scsi if that isn't too
unpalatable. I'll send a small patch shortly and see what happens.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-19 Thread Keith Busch
On Thu, Jul 19, 2018 at 04:04:46PM +, Bart Van Assche wrote:
> On Thu, 2018-07-19 at 09:56 -0600, Keith Busch wrote:
> > Even some scsi drivers are susceptible to losing their requests with the
> > reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
> > from its timeout handler. With the behavior everyone seems to want,
> > a natural completion at or around the same time is lost forever because
> > it was blocked from completion with no way to recover.
> 
> The patch I had posted handles a completion that occurs while a timeout is
> being handled properly. From 
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg22196.html:

Sounds like a win-win to me.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-19 Thread Keith Busch
On Thu, Jul 19, 2018 at 08:59:31AM -0600, Keith Busch wrote:
> > And we should never even hit the timeout handler for those as it
> > is rather pointless (although it looks like we currently do..).
> 
> I don't see why we'd expect to never hit timeout for at least some of
> these. It's not a stretch to see, for example, that virtio-blk or loop
> could have their requests lost with no way to recover if we revert. I've
> wasted too much time debugging hardware for such lost commands when it
> was in fact functioning perfectly fine. So reintroducing that behavior
> is a bit distressing.

Even some scsi drivers are susceptible to losing their requests with the
reverted behavior: take virtio-scsi for example, which returns RESET_TIMER
from its timeout handler. With the behavior everyone seems to want,
a natural completion at or around the same time is lost forever because
it was blocked from completion with no way to recover.

While the timing for when requests may be lost is quite narrow, I've
seen it enough with very difficult to reproduce scenarios that hardware
devs no longer trust IO timeouts are their problem because Linux loses
their completions.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-19 Thread Keith Busch
On Thu, Jul 19, 2018 at 03:19:04PM +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 03:17:11PM -0600, Keith Busch wrote:
> > And there may be other drivers that don't want their completions
> > ignored, so breaking them again is also a step in the wrong direction.
> > 
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I think the point is that SCSI is the biggest user by both the number
> of low-level drivers sitting under the midlayer, and also by usage.
> 
> We need to be very careful not to break it.  Note that this doesn't
> mean that I don't want to eventually move away from just ignoring
> completions in timeout state for SCSI.  I'd just rather rever 4.18
> to a clean known state instead of doctoring around late in the rc
> phase.

I definitely do not want to break scsi. I just don't want to break every
one else either, and I think scsi can get the behavior it wants without
forcing others to subscribe to it.
 
> > Most don't even implement .timeout, so they never know that condition
> > ever happened. Others always return BLK_EH_RESET_TIMER without doing
> > anythign else, so the 'new' behavior would have to be better for those,
> > too.
> 
> And we should never even hit the timeout handler for those as it
> is rather pointless (although it looks like we currently do..).

I don't see why we'd expect to never hit timeout for at least some of
these. It's not a stretch to see, for example, that virtio-blk or loop
could have their requests lost with no way to recover if we revert. I've
wasted too much time debugging hardware for such lost commands when it
was in fact functioning perfectly fine. So reintroducing that behavior
is a bit distressing.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 10:39:36PM +0200, h...@lst.de wrote:
> On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> > On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > > that so colourfully calls everyone else "dangerous" while advocating
> > > for silently losing requests on purpose.
> > > 
> > > But where's the option that fixes scsi to handle hardware completions
> > > concurrently with arbitrary timeout software? Propping up that house of
> > > cards can't be the only recourse.
> > 
> > The important bit is that we need to fix this issue quickly.  We are
> > past -rc5 so I'm rather concerned about anything too complicated.
> > 
> > I'm not even sure SCSI has a problem with multiple completions happening
> > at the same time, but it certainly has a problem with bypassing
> > blk_mq_complete_request from the EH path.
> > 
> > I think we can solve this properly, but I also think we are way too late
> > in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> > to revert the changes and try again for 4.19 or even 4.20 if we don't
> > act quickly enough.
> 
> So here is a quick attempt at the revert while also trying to keep
> nvme working.  Keith, Bart, Jianchao - does this look reasonable
> as a 4.18 band aid?
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-revert

Hm, I not really a fan. The far majority of blk-mq drivers don't even
implement a timeout, and reverting it will lose their requests forever
if they complete at the same time as a timeout.

Of the remaining drivers, most of those don't want the reverted behavior
either. It actually looks like scsi is the only mq driver that wants
to block completions. In the short term, scsi can make that happen with
just three lines of code.

---
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..03986af3076c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,10 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
enum blk_eh_timer_return rtn = BLK_EH_DONE;
struct Scsi_Host *host = scmd->device->host;

+   if (req->q->mq_ops && cmpxchg(&req->state, MQ_RQ_IN_FLIGHT,
+   MQ_RQ_COMPLETE) != MQ_RQ_IN_FLIGHT)
+   return rtn;
+
trace_scsi_dispatch_cmd_timeout(scmd);
scsi_log_completion(scmd, TIMEOUT_ERROR);


-- 


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 09:30:11PM +, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 15:17 -0600, Keith Busch wrote:
> > There are not that many blk-mq drivers, so we can go through them all.
> 
> I'm not sure that's the right approach. I think it is the responsibility of
> the block layer to handle races between completion handling and timeout
> handling and that this is not the responsibility of e.g. a block driver. If
> you look at e.g. the legacy block layer then you will see that it takes care
> of this race and that legacy block drivers do not have to worry about this
> race.

Reverting doesn't handle the race at all. It just ignores completions
and puts the responsibility on the drivers to handle the race because
that's what scsi wants to happen.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 08:58:48PM +, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 14:53 -0600, Keith Busch wrote:
> > If scsi needs this behavior, why not just put that behavior in scsi? It
> > can set the state to complete and then everything can play out as
> > before.
> > [ ... ]
> 
> There may be other drivers that need the same protection the SCSI core needs
> so I think the patch at the end of your previous e-mail is a step in the wrong
> direction.
> 
> Bart.

And there may be other drivers that don't want their completions
ignored, so breaking them again is also a step in the wrong direction.

There are not that many blk-mq drivers, so we can go through them all.

Most don't even implement .timeout, so they never know that condition
ever happened. Others always return BLK_EH_RESET_TIMER without doing
anything else, so the 'new' behavior would have to be better for those,
too.

The following don't implement .timeout:

  loop, rbd, virtio, xen, dm, ubi, scm

The following always return RESET_TIMER:

  null, skd

The following is safe to the new way:

  mtip

And now ones I am not sure about:

  nbd, mmc, dasd

I don't know, reverting looks worse than just fixing the drivers.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 09:56:50PM +0200, h...@lst.de wrote:
> On Fri, Jul 13, 2018 at 05:58:08PM -0600, Keith Busch wrote:
> > Of the two you mentioned, yours is preferable IMO. While I appreciate
> > Jianchao's detailed analysis, it's hard to take a proposal seriously
> > that so colourfully calls everyone else "dangerous" while advocating
> > for silently losing requests on purpose.
> > 
> > But where's the option that fixes scsi to handle hardware completions
> > concurrently with arbitrary timeout software? Propping up that house of
> > cards can't be the only recourse.
> 
> The important bit is that we need to fix this issue quickly.  We are
> past -rc5 so I'm rather concerned about anything too complicated.
> 
> I'm not even sure SCSI has a problem with multiple completions happening
> at the same time, but it certainly has a problem with bypassing
> blk_mq_complete_request from the EH path.
> 
> I think we can solve this properly, but I also think we are way too late
> in the 4.18 cycle to fix it properly.  For now I fear we'll just have
> to revert the changes and try again for 4.19 or even 4.20 if we don't
> act quickly enough.

If scsi needs this behavior, why not just put that behavior in scsi? It
can set the state to complete and then everything can play out as
before.

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 22326612a5d3..f50559718b71 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -558,10 +558,8 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   if (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
-   MQ_RQ_IN_FLIGHT)
+   if (blk_mq_mark_complete(rq))
return;
-
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
 
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 8932ae81a15a..a5d05fab24a7 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -286,6 +286,14 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
enum blk_eh_timer_return rtn = BLK_EH_DONE;
struct Scsi_Host *host = scmd->device->host;
 
+   /*
+* Mark complete now so lld can't post a completion during error
+* handling. Return immediately if it was already marked complete, as
+* that means the lld posted a completion already.
+*/
+   if (req->q->mq_ops && blk_mq_mark_complete(req))
+   return rtn;
+
trace_scsi_dispatch_cmd_timeout(scmd);
scsi_log_completion(scmd, TIMEOUT_ERROR);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 9b0fd11ce89a..0ce587c9c27b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,6 +289,15 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
+/*
+ * Returns true if request is not in flight.
+ */
+static inline bool blk_mq_mark_complete(struct request *rq)
+{
+   return (cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) !=
+   MQ_RQ_IN_FLIGHT);
+}
+
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
--


Re: [RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Keith Busch
On Wed, Jul 18, 2018 at 05:18:45PM +, Bart Van Assche wrote:
> On Wed, 2018-07-18 at 11:00 -0600, Keith Busch wrote:
> > -   cancel_work_sync(&q->timeout_work);
> > -
> > if (q->mq_ops) {
> > struct blk_mq_hw_ctx *hctx;
> > int i;
> > @@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
> > queue_for_each_hw_ctx(q, hctx, i)
> > cancel_delayed_work_sync(&hctx->run_work);
> > } else {
> > +   del_timer_sync(&q->timeout);
> > +   cancel_work_sync(&q->timeout_work);
> > cancel_delayed_work_sync(&q->delay_work);
> > }
> >  }
> 
> What is the impact of this change on the md driver, which is the only driver
> that calls blk_sync_queue() directly? What will happen if timeout processing
> happens concurrently with or after blk_sync_queue() has returned?

That's a make_request_fn stacking driver, right? There should be
no impact in that case, since the change above affects only mq.

I'm actually a little puzzled why md calls blk_sync_queue. Are the
queue timers ever used for bio-based drivers?
 
> > +   list_for_each_entry(q, &set->tag_list, tag_set_list) {
> > /*
> >  * Request timeouts are handled as a forward rolling timer. If
> >  * we end up here it means that no requests are pending and
> > @@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
> > blk_mq_tag_idle(hctx);
> > }
> > }
> > -   blk_queue_exit(q);
> >  }
> 
> What prevents that a request queue is removed from set->tag_list while the 
> above
> loop examines tag_list? Can blk_cleanup_queue() queue be called from the 
> context
> of another thread while this loop is examining hardware queues?

Good point. I missed that this needs to hold the tag_list_lock.
  
> > +   timer_setup(&set->timer, blk_mq_timed_out_timer, 0);
> > +   INIT_WORK(&set->timeout_work, blk_mq_timeout_work);
> > [ ... ]
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -86,6 +86,8 @@ struct blk_mq_tag_set {
> >  
> > struct blk_mq_tags  **tags;
> >  
> > +   struct timer_list   timer;
> > +   struct work_struct  timeout_work;
> 
> Can the timer and timeout_work data structures be replaced by a single
> delayed_work instance?

I think so. I wanted to keep blk_add_timer relatively unchanged for this
proposal, so I followed the existing pattern with the timer kicking the
work. I don't see why that extra indirection is necessary, so I think
it's a great idea. Unless anyone knows a reason not to, we can collapse
this into a single delayed work for both mq and legacy as a prep patch
before this one.
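
For illustration, the collapse would look roughly like this (sketch
only, not tested):

	/* in struct blk_mq_tag_set, replacing the timer + work pair: */
	struct delayed_work	timeout_work;

	/* setup: */
	INIT_DELAYED_WORK(&set->timeout_work, blk_mq_timeout_work);

	/* arming; 'next' is an absolute jiffies value today: */
	mod_delayed_work(system_wq, &set->timeout_work, next - jiffies);

	/* and the handler recovers the set via to_delayed_work(): */
	struct blk_mq_tag_set *set = container_of(to_delayed_work(work),
						  struct blk_mq_tag_set,
						  timeout_work);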

Thanks for the feedback!


[RFC PATCH] blk-mq: move timeout handling from queue to tagset

2018-07-18 Thread Keith Busch
This patch removes the per-request_queue timeout handling and uses the
tagset instead. This simplifies the timeout handling since a shared tag
set can handle all timed out requests in a single work queue rather than
iterate the same set in different work for all the users of that set.

The long term objective of this is to aid blk-mq drivers with shared
tagsets. These drivers typically want their timeout error handling to be
single threaded, and this patch provides that context so they wouldn't
need to sync with other request queues requiring the same error handling.

Timeout handling per-queue was a bit racy with completions anyway:
a tag active for one request_queue while iterating could be freed
and reallocated to a different request_queue, so a queue's timeout
handler may be operating on a request allocated to a different queue.
Though no real harm was done with how the callbacks used those requests,
this patch fixes that race.

And since the timeout code takes a reference on requests, the request
can never exit the queue while the timeout code is considering it. We no
longer need to enter each request_queue in the timeout work so this
patch removes that unnecessary code.

Signed-off-by: Keith Busch 
---
 block/blk-core.c   |  5 ++---
 block/blk-mq.c | 45 -
 block/blk-timeout.c| 16 ++--
 include/linux/blk-mq.h |  2 ++
 4 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f84a9b7b6f5a..9cf7ff6baba0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -404,9 +404,6 @@ EXPORT_SYMBOL(blk_stop_queue);
  */
 void blk_sync_queue(struct request_queue *q)
 {
-   del_timer_sync(&q->timeout);
-   cancel_work_sync(&q->timeout_work);
-
if (q->mq_ops) {
struct blk_mq_hw_ctx *hctx;
int i;
@@ -415,6 +412,8 @@ void blk_sync_queue(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
cancel_delayed_work_sync(&hctx->run_work);
} else {
+   del_timer_sync(&q->timeout);
+   cancel_work_sync(&q->timeout_work);
cancel_delayed_work_sync(&q->delay_work);
}
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 95919268564b..22326612a5d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -804,8 +804,7 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, void *priv, bool reserved)
+static void blk_mq_check_expired(struct request *rq, void *priv, bool reserved)
 {
unsigned long *next = priv;
 
@@ -842,33 +841,21 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
-   struct request_queue *q =
-   container_of(work, struct request_queue, timeout_work);
+   struct blk_mq_tag_set *set =
+   container_of(work, struct blk_mq_tag_set, timeout_work);
+   struct request_queue *q;
unsigned long next = 0;
struct blk_mq_hw_ctx *hctx;
int i;
 
-   /* A deadlock might occur if a request is stuck requiring a
-* timeout at the same time a queue freeze is waiting
-* completion, since the timeout code would not be able to
-* acquire the queue reference here.
-*
-* That's why we don't use blk_queue_enter here; instead, we use
-* percpu_ref_tryget directly, because we need to be able to
-* obtain a reference even in the short window between the queue
-* starting to freeze, by dropping the first reference in
-* blk_freeze_queue_start, and the moment the last request is
-* consumed, marked by the instant q_usage_counter reaches
-* zero.
-*/
-   if (!percpu_ref_tryget(&q->q_usage_counter))
-   return;
-
-   blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+   blk_mq_tagset_busy_iter(set, blk_mq_check_expired, &next);
 
if (next != 0) {
-   mod_timer(&q->timeout, next);
-   } else {
+   mod_timer(&set->timer, next);
+   return;
+   }
+
+   list_for_each_entry(q, &set->tag_list, tag_set_list) {
/*
 * Request timeouts are handled as a forward rolling timer. If
 * we end up here it means that no requests are pending and
@@ -881,7 +868,6 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_mq_tag_idle(hctx);
}
}
-   blk_queue_exit(q);
 }
 
 struct flush_busy_ctx_data {
@@ -2548,7 +2534,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
if (!q->nr_hw_queues)
goto err_hctxs;
 
-   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
blk_queue_rq

Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-13 Thread Keith Busch
On Fri, Jul 13, 2018 at 11:03:18PM +, Bart Van Assche wrote:
> How do you want to go forward from here? Do you prefer the approach of the
> patch I had posted (https://www.spinics.net/lists/linux-block/msg26489.html), 
> Jianchao's approach (https://marc.info/?l=linux-block&m=152950093831738) or
> perhaps yet another approach? Note: I think Jianchao's patch is a good start
> but also that it needs further improvement.

Of the two you mentioned, yours is preferable IMO. While I appreciate
Jianchao's detailed analysis, it's hard to take a proposal seriously
that so colourfully calls everyone else "dangerous" while advocating
for silently losing requests on purpose.

But where's the option that fixes scsi to handle hardware completions
concurrently with arbitrary timeout software? Propping up that house of
cards can't be the only recourse.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-13 Thread Keith Busch
On Fri, Jul 13, 2018 at 03:52:38PM +, Bart Van Assche wrote:
> No. What I'm saying is that a behavior change has been introduced in the block
> layer that was not documented in the patch description. 

Did you read the changelog?

> I think that behavior change even can trigger a kernel crash.

I think we are past acknowledging issues exist with timeouts.


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-13 Thread Keith Busch
On Thu, Jul 12, 2018 at 10:24:42PM +, Bart Van Assche wrote:
> Before commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"), if a
> request completion was reported after request timeout processing had
> started, completion handling was skipped. The following code in
> blk_mq_complete_request() realized that:
> 
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
> 
> Since commit 12f5b9314545, if a completion occurs after request timeout
> processing has started, that completion is processed if the request has the
> state MQ_RQ_IN_FLIGHT. blk_mq_rq_timed_out() does not modify the request
> state unless the block driver timeout handler modifies it, e.g. by calling
> blk_mq_end_request() or by calling blk_mq_requeue_request(). The typical
> behavior of scsi_times_out() is to queue sending of a SCSI abort and hence
> not to change the request state immediately. In other words, if a request
> completion occurs during or shortly after a timeout occurred then
> blk_mq_complete_request() will call __blk_mq_complete_request() and will
> complete the request, although that is not allowed because timeout handling
> has already started. Do you agree with this analysis?

Yes, it's different, and that was the whole point. No one made that a
secret either. Are you saying you want timeout software to take priority
over handling hardware events?


Re: [RFC PATCH 3/3] blk-mq: Remove generation seqeunce

2018-07-12 Thread Keith Busch
On Thu, Jul 12, 2018 at 06:16:12PM +, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> > /*
> > -* We marked @rq->aborted_gstate and waited for RCU.  If there were
> > -* completions that we lost to, they would have finished and
> > -* updated @rq->gstate by now; otherwise, the completion path is
> > -* now guaranteed to see @rq->aborted_gstate and yield.  If
> > -* @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +* Just do a quick check if it is expired before locking the request in
> > +* so we're not unnecessarily synchronizing across CPUs.
> >  */
> > -   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -   READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +   if (!blk_mq_req_expired(rq, next))
> > +   return;
> > +
> > +   /*
> > +* We have reason to believe the request may be expired. Take a
> > +* reference on the request to lock this request lifetime into its
> > +* currently allocated context to prevent it from being reallocated in
> > +* the event the completion by-passes this timeout handler.
> > +* 
> > +* If the reference was already released, then the driver beat the
> > +* timeout handler to posting a natural completion.
> > +*/
> > +   if (!kref_get_unless_zero(&rq->ref))
> > +   return;
> > +
> > +   /*
> > +* The request is now locked and cannot be reallocated underneath the
> > +* timeout handler's processing. Re-verify this exact request is truly
> > +* expired; if it is not expired, then the request was completed and
> > +* reallocated as a new request.
> > +*/
> > +   if (blk_mq_req_expired(rq, next))
> > blk_mq_rq_timed_out(rq, reserved);
> > +   blk_mq_put_request(rq);
> >  }
> 
> Hello Keith and Christoph,
> 
> What prevents that a request finishes and gets reused after the
> blk_mq_req_expired() call has finished and before kref_get_unless_zero() is
> called? Is this perhaps a race condition that has not yet been triggered by
> any existing block layer test? Please note that there is no such race
> condition in the patch I had posted ("blk-mq: Rework blk-mq timeout handling
> again" - https://www.spinics.net/lists/linux-block/msg26489.html).

I don't think there's any such race in the merged implementation
either. If the request is reallocated, then the kref check may succeed,
but the blk_mq_req_expired() check would surely fail, allowing the
request to proceed as normal. The code comments at least say as much.
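
Condensed from the hunk quoted above, the ordering that matters is:

	/* quick, unlocked check first */
	if (!blk_mq_req_expired(rq, next))
		return;

	/* pin the request; if the ref is already gone, a completion won */
	if (!kref_get_unless_zero(&rq->ref))
		return;

	/*
	 * Re-check while holding the reference: a request that was
	 * completed and reallocated in the window above has a fresh
	 * deadline, fails this test, and is left alone.
	 */
	if (blk_mq_req_expired(rq, next))
		blk_mq_rq_timed_out(rq, reserved);
	blk_mq_put_request(rq);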


[PATCH] block: Fix transfer when chunk sectors exceeds max

2018-06-26 Thread Keith Busch
A device may have boundary restrictions where the number of sectors
between boundaries exceeds its max transfer size. In this case, we need
to cap the max size to the smaller of the two limits.
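
For example (numbers made up): with chunk_sectors = 8192 and
max_sectors = 2560, a request starting on a chunk boundary could
previously be sized at 8192 sectors; with the min() below it is capped
at 2560.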

Reported-by: Jitendra Bhivare 
Tested-by: Jitendra Bhivare 
Cc: 
Signed-off-by: Keith Busch 
---
 include/linux/blkdev.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9154570edf29..79226ca8f80f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1119,8 +1119,8 @@ static inline unsigned int blk_max_size_offset(struct request_queue *q,
if (!q->limits.chunk_sectors)
return q->limits.max_sectors;
 
-   return q->limits.chunk_sectors -
-   (offset & (q->limits.chunk_sectors - 1));
+   return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
+   (offset & (q->limits.chunk_sectors - 1))));
 }
 
 static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
-- 
2.14.3



Re: [RFC PATCH] blk-mq: User defined HCTX CPU mapping

2018-06-20 Thread Keith Busch
On Wed, Jun 20, 2018 at 11:08:05AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 18, 2018 at 11:32:06AM -0600, Keith Busch wrote:
> > The default mapping of a cpu to a hardware context is often generally
> > applicable, however a user may know of a more appropriate mapping for
> > their specific access usage.
> > 
> > This patch allows a user to define their own policy by making the mq hctx
> > cpu_list writable. The usage allows a user to append a comma separated
> > and/or range list of CPUs to a given hctx's tag set mapping to reassign
> > which hctx a cpu maps to.
> > 
> > While the writable attribute exists under a specific request_queue, the
> > settings will affect all request queues sharing the same tagset.
> > 
> > The user defined setting is lost if the block device is removed and
> > re-added, or if the driver re-runs the queue mapping.
> 
> We can't do this without driver opt-in.  Managed interrupts rely on
> the fact that we can't generate more interrupts once all cpus mapped
> to the interrupt line have been offlined.
>
> So what exactly is the use case?  What drivers do you care about?

This patch came at a customer request for NVMe. The controllers have 1:1
queues to CPUs, so currently a submission on CPU A will interrupt CPU A.

The user really wants their application to run in CPU A and have the
interrupt run in CPU B. We can't change the IRQ affinity, so I thought
changing the submission affinity would be less intrusive.

I think you're saying this will break if CPU B is offlined. I hadn't
considered that, so it doesn't sound like this will work.


[RFC PATCH] blk-mq: User defined HCTX CPU mapping

2018-06-18 Thread Keith Busch
The default mapping of a cpu to a hardware context is often generally
applicable, however a user may know of a more appropriate mapping for
their specific access usage.

This patch allows a user to define their own policy by making the mq hctx
cpu_list writable. The usage allows a user to append a comma separated
and/or range list of CPUs to a given hctx's tag set mapping to reassign
which hctx a cpu maps to.

While the writable attribute exists under a specific request_queue, the
settings will affect all request queues sharing the same tagset.

The user defined setting is lost if the block device is removed and
re-added, or if the driver re-runs the queue mapping.

Signed-off-by: Keith Busch 
---
 block/blk-mq-debugfs.c | 16 ++
 block/blk-mq-debugfs.h | 11 +++
 block/blk-mq-sysfs.c   | 80 +-
 block/blk-mq.c |  9 --
 block/blk-mq.h | 12 
 5 files changed, 112 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ffa622366922..df163a79511c 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -870,18 +870,22 @@ void blk_mq_debugfs_unregister(struct request_queue *q)
q->debugfs_dir = NULL;
 }
 
-static int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
-  struct blk_mq_ctx *ctx)
+void blk_mq_debugfs_unregister_ctx(struct blk_mq_ctx *ctx)
+{
+   debugfs_remove_recursive(ctx->debugfs_dir);
+}
+
+int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *ctx)
 {
-   struct dentry *ctx_dir;
char name[20];
 
snprintf(name, sizeof(name), "cpu%u", ctx->cpu);
-   ctx_dir = debugfs_create_dir(name, hctx->debugfs_dir);
-   if (!ctx_dir)
+   ctx->debugfs_dir = debugfs_create_dir(name, hctx->debugfs_dir);
+   if (!ctx->debugfs_dir)
return -ENOMEM;
 
-   if (!debugfs_create_files(ctx_dir, ctx, blk_mq_debugfs_ctx_attrs))
+   if (!debugfs_create_files(ctx->debugfs_dir, ctx, blk_mq_debugfs_ctx_attrs))
return -ENOMEM;
 
return 0;
diff --git a/block/blk-mq-debugfs.h b/block/blk-mq-debugfs.h
index b9d366e57097..93df02eabf2b 100644
--- a/block/blk-mq-debugfs.h
+++ b/block/blk-mq-debugfs.h
@@ -18,6 +18,9 @@ struct blk_mq_debugfs_attr {
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq);
 int blk_mq_debugfs_rq_show(struct seq_file *m, void *v);
 
+void blk_mq_debugfs_unregister_ctx(struct blk_mq_ctx *ctx);
+int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *ctx);
 int blk_mq_debugfs_register(struct request_queue *q);
 void blk_mq_debugfs_unregister(struct request_queue *q);
 int blk_mq_debugfs_register_hctx(struct request_queue *q,
@@ -41,6 +44,14 @@ static inline void blk_mq_debugfs_unregister(struct request_queue *q)
 {
 }
 
+void blk_mq_debugfs_unregister_ctx(struct blk_mq_ctx *ctx) {}
+
+int blk_mq_debugfs_register_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *ctx)
+{
+   return 0;
+}
+
 static inline int blk_mq_debugfs_register_hctx(struct request_queue *q,
   struct blk_mq_hw_ctx *hctx)
 {
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index aafb44224c89..ec2a07dd86e9 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -11,6 +11,7 @@
 
 #include 
 #include "blk-mq.h"
+#include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
 
 static void blk_mq_sysfs_release(struct kobject *kobj)
@@ -161,6 +162,82 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
return ret;
 }
 
+static void blk_mq_reassign_swqueue(unsigned int cpu,  unsigned int new_index,
+   struct blk_mq_tag_set *set)
+{
+   struct blk_mq_hw_ctx *hctx;
+   struct request_queue *q;
+   struct blk_mq_ctx *ctx;
+
+   if (set->mq_map[cpu] == new_index)
+   return;
+
+   list_for_each_entry(q, &set->tag_list, tag_set_list) {
+   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   blk_mq_debugfs_unregister_ctx(ctx);
+   kobject_del(&ctx->kobj);
+
+   hctx = blk_mq_map_queue(q, cpu);
+   cpumask_clear_cpu(cpu, hctx->cpumask);
+   hctx->nr_ctx--;
+   if (hctx->dispatch_from == ctx)
+   hctx->dispatch_from = NULL;
+   }
+
+   set->mq_map[cpu] = new_index;
+
+   list_for_each_entry(q, &set->tag_list, tag_set_list) {
+   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   hctx = blk_mq_map_queue(q, cpu);
+   cpumask_set_cpu(cpu, hctx->cpumask);
+   ctx->index_hw = hctx->nr_ctx;
+   hctx->ctxs[

Re: blktests block/019 lead system hang

2018-06-13 Thread Keith Busch
On Tue, Jun 12, 2018 at 04:41:54PM -0700, austin.bo...@dell.com wrote:
> It looks like the test is setting the Link Disable bit.  But this is not
> a good simulation for hot-plug surprise removal testing or surprise link
> down (SLD) testing, if that is the intent.  One reason is that Link
> Disable does not invoke SLD semantics per PCIe spec.  This is somewhat
> of a moot point in this case since the switch has Hot-Plug Surprise bit
> set which also masks the SLD semantics in PCIe.
> 
> Also, the Hot-Plug Capable + Surprise Hot-Plug bits set means the
> platform can tolerate the case where "an adapter present in this slot
> might be removed from the system without any prior notification".  It
> does not mean that a system can survive link down under any other
> circumstances such as setting Link Disable or generating a Secondary Bus
> Reset or a true surprise link down event.  To the earlier point, I also
> do not know of any way the OS can know a priori if the platform can
> handle surprise link down outside of surprise remove case.  We can look
> at standardizing a way to do that if OSes find it useful to know.
> 
> Relative to this particular error, Link Disable doesn't clear Presence
> Detect State which would happen on a real Surprise Hot-Plug removal
> event and this is probably why the system crashes.  What will happen is
> that after the link goes to disabled state, the ongoing I/O will cause
> MMIO accesses on the drive and that will cause a UR which is an
> uncorrectable PCIe error (ERR_FATAL on R730).  The BIOS on the R730 is
> surprise remove aware (Surprise Hot-Plug = 1) and so it will check if
> the device is still present by checking Presence Detect State.  If the
> device is not present it will mask the error and let the OS handle the
> device removal due to hot-plug interrupt(s).  If the device is present,
> as in this case, then the BIOS will escalate to OS as a fatal NMI
> (current R730 platform policy is to only mask errors due to removal).
> 
> For future, these servers may report these sort of errors as recoverable
> via the GHES structures in APEI which will allow the OS to recover from
> this non-surprise remove class of error as well.  In the (hopefully
> near) future, the industry will move to DPC as the framework for this
> sort of generic PCIe error handling/recovery but there are architectural
> changes needed that are currently being defined in the relevant
> standards bodies.  Once the architecture is defined it can be
> implemented and tested to verify these sort of test cases pass.

Thanks for the feedback!

This test does indeed toggle the Link Control Link Disable bit to simulate
the link failure. The PCIe specification specifically covers this case
in Section 3.2.1, Data Link Control and Management State Machine Rules:

  If the Link Disable bit has been Set by software, then the subsequent
  transition to DL_Inactive must not be considered an error.

So this test should suppress any Surprise Down Error events, but handling
that particular event wasn't the intent of the test (and as you mentioned,
it ought not occur anyway since the slot is HP Surprise capable).

The test should not suppress reporting the Data Link Layer State Changed
slot status. And while this doesn't trigger a Slot PDC status, triggering
a DLLSC should occur since the Link Status DLLLA should go to 0 when
the state machine goes from DL_Active to DL_Down, regardless of whether a
Surprise Down Error was detected.

The Linux PCIEHP driver handles a DLLSC link-down event the same as
a presence detect remove event, and that's part of what this test was
trying to cover.


Re: blktests block/019 lead system hang

2018-06-06 Thread Keith Busch
On Wed, Jun 06, 2018 at 01:42:15PM +0800, Yi Zhang wrote:
> Here is the output, and I can see "HotPlug+ Surprise+" on SltCap

Thanks. That looks like a perfectly capable port. I even have the same
switch in one of my machines, but the test doesn't trigger fatal
firmware-first errors.

Might need to query something about the platform to know how it treats
link-downs before proceeding with the test (don't know off the top of
my head; will do some digging).


Re: blktests block/019 lead system hang

2018-06-05 Thread Keith Busch
On Tue, Jun 05, 2018 at 10:18:53AM -0600, Keith Busch wrote:
> On Wed, May 30, 2018 at 03:26:54AM -0400, Yi Zhang wrote:
> > Hi Keith
> > I found blktest block/019 also can lead my NVMe server hang with 
> > 4.17.0-rc7, let me know if you need more info, thanks. 
> > 
> > Server: Dell R730xd
> > NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co 
> > Ltd NVMe SSD Controller 172X (rev 01)
> > 
> > Console log:
> > Kernel 4.17.0-rc7 on an x86_64
> > 
> > storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 
> > 03:16:34
> > [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic 
> > Hardware Error Source: 3
> > [ 6049.108478] {1}[Hardware Error]: event severity: fatal
> > [ 6049.108479] {1}[Hardware Error]:  Error 0, type: fatal
> > [ 6049.108481] {1}[Hardware Error]:   section_type: PCIe error
> > [ 6049.108482] {1}[Hardware Error]:   port_type: 6, downstream switch port
> > [ 6049.108483] {1}[Hardware Error]:   version: 1.16
> > [ 6049.108484] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
> > [ 6049.108485] {1}[Hardware Error]:   device_id: :83:05.0
> > [ 6049.108486] {1}[Hardware Error]:   slot: 0
> > [ 6049.108487] {1}[Hardware Error]:   secondary_bus: 0x85
> > [ 6049.108488] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8734
> > [ 6049.108489] {1}[Hardware Error]:   class_code: 000406
> > [ 6049.108489] {1}[Hardware Error]:   bridge: secondary_status: 0x, 
> > control: 0x0003
> > [ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
> > [ 6049.108514] Kernel Offset: 0x2580 from 0x8100 
> > (relocation range: 0x8000-0xbfff)

Could you attach 'lspci -vvv -s :83:05.0'? Just want to see
your switch's capabilities to confirm the pre-test checks are really
sufficient.


Re: blktests block/019 lead system hang

2018-06-05 Thread Keith Busch
On Wed, May 30, 2018 at 03:26:54AM -0400, Yi Zhang wrote:
> Hi Keith
> I found blktest block/019 also can lead my NVMe server hang with 4.17.0-rc7, 
> let me know if you need more info, thanks. 
> 
> Server: Dell R730xd
> NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd 
> NVMe SSD Controller 172X (rev 01)
> 
> Console log:
> Kernel 4.17.0-rc7 on an x86_64
> 
> storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 
> 03:16:34
> [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware 
> Error Source: 3
> [ 6049.108478] {1}[Hardware Error]: event severity: fatal
> [ 6049.108479] {1}[Hardware Error]:  Error 0, type: fatal
> [ 6049.108481] {1}[Hardware Error]:   section_type: PCIe error
> [ 6049.108482] {1}[Hardware Error]:   port_type: 6, downstream switch port
> [ 6049.108483] {1}[Hardware Error]:   version: 1.16
> [ 6049.108484] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
> [ 6049.108485] {1}[Hardware Error]:   device_id: :83:05.0
> [ 6049.108486] {1}[Hardware Error]:   slot: 0
> [ 6049.108487] {1}[Hardware Error]:   secondary_bus: 0x85
> [ 6049.108488] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8734
> [ 6049.108489] {1}[Hardware Error]:   class_code: 000406
> [ 6049.108489] {1}[Hardware Error]:   bridge: secondary_status: 0x, 
> control: 0x0003
> [ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
> [ 6049.108514] Kernel Offset: 0x2580 from 0x8100 (relocation 
> range: 0x8000-0xbfff)

Sounds like your platform fundamentally doesn't support surprise link
down if it considers the event a fatal error. That's sort of what this
test was supposed to help catch, so we know which platforms can handle
this versus ones that can't.

The test does check that the slot is hotplug capable before running,
so it's supposed to only run the test on slots that claim to be capable
of handling the event. I just don't know of a good way to query platform
firmware to know what it will do in response to such an event.
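
For what it's worth, that pre-test check boils down to reading the Slot
Capabilities register on the downstream port and requiring both the Hot-Plug
Capable and Hot-Plug Surprise bits. A minimal kernel-style sketch of the
check (the test does the equivalent from user space):

#include <linux/pci.h>

/* Sketch only: does the slot claim it can handle surprise removal? */
static bool slot_claims_surprise_hotplug(struct pci_dev *port)
{
	u32 sltcap = 0;

	if (pcie_capability_read_dword(port, PCI_EXP_SLTCAP, &sltcap))
		return false;

	return (sltcap & PCI_EXP_SLTCAP_HPC) &&	/* HotPlug+  */
	       (sltcap & PCI_EXP_SLTCAP_HPS);	/* Surprise+ */
}

What it can't tell us is whether platform firmware will escalate the
resulting link-down into a fatal APEI event, which is exactly the gap
discussed above.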


Re: [PATCH 13/14] nvme: use the changed namespaces list log to clear ns data changed AENs

2018-06-05 Thread Keith Busch
On Tue, Jun 05, 2018 at 06:47:33AM +0200, Christoph Hellwig wrote:
> But let's assume we don't want to use the list due to this concern:
> we'd still have to read the log page, as per the NVMe spec the only
> thing clearing a pending AEN is reading the associated log page.
> We'd then need to still do the full scan using identify.  Is this what
> we want?  If you think this is important for reliability we could
> just ignore the log page.

That sounds good. Let's have the driver read the log page to re-arm
the event as you've done, but don't rely on the contents for namespace
enumeration.
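
To make that concrete, a rough sketch of the agreed behavior (not the actual
driver code; nvme_get_log_page() and nvme_queue_scan() here stand in for
whatever log-read and rescan primitives the driver provides):

static void nvme_handle_ns_changed_aen(struct nvme_ctrl *ctrl)
{
	void *log = kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);

	/* Read the Changed Namespace List log only to clear/re-arm the AEN. */
	if (log)
		nvme_get_log_page(ctrl, NVME_LOG_CHANGED_NS, log,
				  NVME_IDENTIFY_DATA_SIZE);
	kfree(log);

	/* Don't trust the (possibly racy) contents; revalidate everything. */
	nvme_queue_scan(ctrl);
}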

The rest of your series looks good to me.


[blktests PATCHv2] Fix block/011 to not use sysfs for device disabling

2018-06-04 Thread Keith Busch
The PCI sysfs interface may not be a dependable way to toggle the PCI
device state in order to trigger the timeouts. This patch instead writes
the device's config space directly to induce the failure.

Signed-off-by: Keith Busch 
---
v1 -> v2:

  Toggling only PCI Command Register BME bit, rather than including MEM.

 tests/block/011 | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/block/011 b/tests/block/011
index 62e89f7..2461442 100755
--- a/tests/block/011
+++ b/tests/block/011
@@ -21,7 +21,7 @@ DESCRIPTION="disable PCI device while doing I/O"
 TIMED=1
 
 requires() {
-   _have_fio
+   _have_fio && _have_program setpci
 }
 
 device_requires() {
@@ -43,10 +43,11 @@ test_device() {
_run_fio_rand_io --filename="$TEST_DEV" --size="$size" \
--ignore_error=EIO,ENXIO,ENODEV &
 
+   # toggle the PCI Command Register's Bus Master Enable (BME) bit
while kill -0 $! 2>/dev/null; do
-   echo 0 > "/sys/bus/pci/devices/${pdev}/enable"
+   setpci -s "${pdev}" 4.w=0:4
sleep .2
-   echo 1 > "/sys/bus/pci/devices/${pdev}/enable"
+   setpci -s "${pdev}" 4.w=4:4
sleep .2
done
 
-- 
2.14.3
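
For readers unfamiliar with the setpci mask syntax above: "4.w=0:4" writes
the 16-bit Command register at config offset 0x04 with value 0 under mask
0x4, i.e. it clears only the Bus Master Enable bit, and "4.w=4:4" sets that
bit again. A roughly equivalent user-space sketch using libpci (illustrative
only; assumes the pciutils development library):

#include <pci/pci.h>

/* Illustrative libpci equivalent of the setpci calls in the test. */
static void toggle_bus_master(struct pci_dev *dev, int enable)
{
	u16 cmd = pci_read_word(dev, PCI_COMMAND);

	if (enable)
		cmd |= PCI_COMMAND_MASTER;
	else
		cmd &= ~PCI_COMMAND_MASTER;

	pci_write_word(dev, PCI_COMMAND, cmd);
}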



Re: [PATCH 13/14] nvme: use the changed namespaces list log to clear ns data changed AENs

2018-06-04 Thread Keith Busch
On Sat, May 26, 2018 at 12:27:34PM +0200, Christoph Hellwig wrote:
> Per section 5.2 we need to issue the corresponding log page to clear an
> AEN, so for a namespace data changed AEN we need to read the changed
> namespace list log.  And once we read that log anyway we might as well
> use it to optimize the rescan.
> 
> Signed-off-by: Christoph Hellwig 

I'm a little concerned about this. Userspace might be reading the same
log page. Since the contents of the page may change each time it's read,
it's possible the driver will see only a subset of changed namespaces
at the point it gets to read the log page, missing a chance to
revalidate others.


Re: [PATCH blktests] Fix block/011 to not use sysfs for device disabling

2018-06-04 Thread Keith Busch
On Tue, May 29, 2018 at 12:54:28PM -0700, Omar Sandoval wrote:
> What's the plan for this test? Do you have a v2 coming?

Sorry for the delay. I've been out on holiday, but I'm catching up
quickly and will send a v2 shortly.

Thanks,
Keith


Re: [PATCH V6 02/11] nvme: pci: cover timeout for admin commands running in EH

2018-05-24 Thread Keith Busch
On Wed, May 16, 2018 at 12:03:04PM +0800, Ming Lei wrote:
> +static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
> +{
> + struct completion *waiting = rq->end_io_data;
> +
> + rq->end_io_data = NULL;
> + blk_mq_free_request(rq);
> +
> + /*
> +  * complete last, if this is a stack request the process (and thus
> +  * the rq pointer) could be invalid right after this complete()
> +  */
> + complete(waiting);
> +}
> +
> +/*
> + * This function can only be used inside nvme_dev_disable() when timeout
> + * may not work, then this function has to cover the timeout by itself.
> + *
> + * When wait_for_completion_io_timeout() returns 0 and timeout happens,
> + * this request will be completed after controller is shutdown.
> + */
> +static int nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
> +{
> + DECLARE_COMPLETION_ONSTACK(wait);
> + struct nvme_command c;
> + struct request_queue *q = dev->ctrl.admin_q;
> + struct request *req;
> + int ret;
> +
> + nvme_init_set_host_mem_cmd(dev, &c, bits);
> +
> + req = nvme_alloc_request(q, &c, 0, NVME_QID_ANY);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> +
> + req->timeout = ADMIN_TIMEOUT;
> + req->end_io_data = &wait;
> +
> + blk_execute_rq_nowait(q, NULL, req, false,
> + nvme_set_host_mem_end_io);
> + ret = wait_for_completion_io_timeout(&wait, ADMIN_TIMEOUT);
> + if (ret > 0) {
> + if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
> + ret = -EINTR;
> + else
> + ret = nvme_req(req)->status;
> + } else
> + ret = -EINTR;
> +
> + return ret;
> +}

This function doesn't need to return anything at all. The caller doesn't
use the result to decide what to do next.

Now let's say it does time out. The command will be completed later
through nvme_cancel_request(), but your completion callback will reference
a now-invalid stack variable.
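
One way to avoid that dangling stack reference is a small heap-allocated,
reference-counted context that both the submitter and the end_io callback
release; whichever side runs last frees it, even if the command only
completes long after the waiter timed out and went away. A sketch along
those lines, reusing the helpers from the patch above (illustrative, not
tested):

#include <linux/completion.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct set_host_mem_ctx {
	struct completion done;
	refcount_t ref;
};

static void set_host_mem_ctx_put(struct set_host_mem_ctx *ctx)
{
	if (refcount_dec_and_test(&ctx->ref))
		kfree(ctx);
}

static void nvme_set_host_mem_end_io(struct request *rq, blk_status_t sts)
{
	struct set_host_mem_ctx *ctx = rq->end_io_data;

	blk_mq_free_request(rq);
	complete(&ctx->done);
	set_host_mem_ctx_put(ctx);	/* drop the callback's reference */
}

static void nvme_set_host_mem_timeout(struct nvme_dev *dev, u32 bits)
{
	struct set_host_mem_ctx *ctx;
	struct nvme_command c;
	struct request *req;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return;
	init_completion(&ctx->done);
	refcount_set(&ctx->ref, 2);	/* submitter + end_io */

	nvme_init_set_host_mem_cmd(dev, &c, bits);
	req = nvme_alloc_request(dev->ctrl.admin_q, &c, 0, NVME_QID_ANY);
	if (IS_ERR(req)) {
		kfree(ctx);
		return;
	}
	req->timeout = ADMIN_TIMEOUT;
	req->end_io_data = ctx;
	blk_execute_rq_nowait(dev->ctrl.admin_q, NULL, req, false,
			      nvme_set_host_mem_end_io);

	/* If this times out, end_io still holds its own reference. */
	wait_for_completion_io_timeout(&ctx->done, ADMIN_TIMEOUT);
	set_host_mem_ctx_put(ctx);	/* drop the submitter's reference */
}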

>  static void nvme_free_host_mem(struct nvme_dev *dev)
>  {
>   int i;
> @@ -2225,7 +2284,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
> shutdown)
>* but I'd rather be safe than sorry..
>*/
>   if (dev->host_mem_descs)
> - nvme_set_host_mem(dev, 0);
> + nvme_set_host_mem_timeout(dev, 0);
>   nvme_disable_io_queues(dev);
>   nvme_disable_admin_queue(dev, shutdown);
>   }


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-23 Thread Keith Busch
On Wed, May 23, 2018 at 08:34:48AM +0800, Ming Lei wrote:
> Let's consider the normal NVMe timeout code path:
> 
> 1) one request is timed out;
> 
> 2) controller is shutdown, this timed-out request is requeued from
> nvme_cancel_request(), but can't dispatch because queues are quiesced
> 
> 3) reset is done from another context, and this request is dispatched
> again, and completed exactly before returning EH_HANDLED to blk-mq, but
> its state isn't updated to COMPLETE yet.
> 
> 4) then double completions are done from both normal completion and timeout
> path.

We're definitely fixing this, but I must admit that's an impressive
cognitive traversal across 5 thread contexts to arrive at that race. :)


Re: [PATCH 13/14] blk-mq: Remove generation sequence

2018-05-23 Thread Keith Busch
On Wed, May 23, 2018 at 02:19:40PM +0200, Christoph Hellwig wrote:
> -static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
> - __releases(hctx->srcu)
> -{
> - if (!(hctx->flags & BLK_MQ_F_BLOCKING))
> - rcu_read_unlock();
> - else
> - srcu_read_unlock(hctx->srcu, srcu_idx);
> -}
> -
> -static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
> - __acquires(hctx->srcu)
> -{
> - if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
> - /* shut up gcc false positive */
> - *srcu_idx = 0;
> - rcu_read_lock();
> - } else
> - *srcu_idx = srcu_read_lock(hctx->srcu);
> -}

I may have been too ambitious in removing hctx_lock. That actually has
nothing to do with the completion issues, but it is necessary for quiesce
to work reliably. Otherwise, I still think there is some promise to the
rest of the approach.
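
To spell out the quiesce dependency: dispatch runs under hctx_lock(), so
quiescing sets the QUIESCED flag and then waits out every in-flight
->queue_rq() call with an (S)RCU grace period. Roughly what the current
blk_mq_quiesce_queue() does, paraphrased here just to show why the read
side provided by hctx_lock() matters:

void blk_mq_quiesce_queue(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i;
	bool rcu = false;

	blk_mq_quiesce_queue_nowait(q);		/* set QUIESCED */

	queue_for_each_hw_ctx(q, hctx, i) {
		if (hctx->flags & BLK_MQ_F_BLOCKING)
			synchronize_srcu(hctx->srcu);	/* readers: hctx_lock() */
		else
			rcu = true;
	}
	if (rcu)
		synchronize_rcu();		/* readers: hctx_lock() */
}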


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-23 Thread Keith Busch
On Wed, May 23, 2018 at 08:02:31AM -0600, Keith Busch wrote:
> Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
> otherwise it's guaranteed to return 'true' and there's no point to the
> loop and 'if' check.

And I see v14 is already posted with that fix! :)


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-23 Thread Keith Busch
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> +static bool blk_mq_change_rq_state(struct request *rq,
> +enum mq_rq_state old_state,
> +enum mq_rq_state new_state)
> +{
> + union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
> + union blk_generation_and_state old_val = gstate;
> + union blk_generation_and_state new_val = gstate;
> +
> + old_val.state = old_state;
> + new_val.state = new_state;
> + if (new_state == MQ_RQ_IN_FLIGHT)
> + new_val.generation++;
> + /*
> +  * For transitions from state in-flight to another state cmpxchg()
> +  * must be used. For other state transitions it is safe to use
> +  * WRITE_ONCE().
> +  */
> + if (old_state != MQ_RQ_IN_FLIGHT) {
> + WRITE_ONCE(rq->gstate.val, new_val.val);
> + return true;
> + }
> + return blk_mq_set_rq_state(rq, old_val, new_val);
> +}



>  void blk_mq_complete_request(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> - int srcu_idx;
>  
>   if (unlikely(blk_should_fake_timeout(q)))
>   return;
>  
> - /*
> -  * If @rq->aborted_gstate equals the current instance, timeout is
> -  * claiming @rq and we lost.  This is synchronized through
> -  * hctx_lock().  See blk_mq_timeout_work() for details.
> -  *
> -  * Completion path never blocks and we can directly use RCU here
> -  * instead of hctx_lock() which can be either RCU or SRCU.
> -  * However, that would complicate paths which want to synchronize
> -  * against us.  Let stay in sync with the issue path so that
> -  * hctx_lock() covers both issue and completion paths.
> -  */
> - hctx_lock(hctx, &srcu_idx);
> - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> - __blk_mq_complete_request(rq);
> - hctx_unlock(hctx, srcu_idx);
> + /* The loop is for the unlikely case of a race with the timeout code. */
> + while (true) {
> + if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
> +MQ_RQ_COMPLETE)) {
> + __blk_mq_complete_request(rq);
> + break;
> + }
> + if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
> + break;
> + }
>  }

Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
otherwise it's guaranteed to return 'true' and there's no point to the
loop and 'if' check.
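
Concretely, something along these lines (a sketch against this patch's
helpers, not tested): any old state that can race with another writer
(IN_FLIGHT with the timeout path, TIMED_OUT with the completion path) has
to go through the cmpxchg, and only the remaining transitions may use the
plain WRITE_ONCE() fast path.

static bool blk_mq_change_rq_state(struct request *rq,
				   enum mq_rq_state old_state,
				   enum mq_rq_state new_state)
{
	union blk_generation_and_state old_val = READ_ONCE(rq->gstate);
	union blk_generation_and_state new_val = old_val;

	old_val.state = old_state;
	new_val.state = new_state;
	if (new_state == MQ_RQ_IN_FLIGHT)
		new_val.generation++;

	/* These old states can race with another writer: use cmpxchg. */
	if (old_state == MQ_RQ_IN_FLIGHT || old_state == MQ_RQ_TIMED_OUT)
		return blk_mq_set_rq_state(rq, old_val, new_val);

	WRITE_ONCE(rq->gstate.val, new_val.val);
	return true;
}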


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> 
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
status to return if the driver knows blk_mq_complete_request() was called
prior to returning from the timeout handler, so we need a similar check
there, right?


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Right, my mistake. I noticed that when I saw your reply on the EH_HANDLED
case, so it looks fine.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>   case BLK_EH_RESET_TIMER:
> + blk_mq_add_timer(req);
>   /*
> +  * The loop is for the unlikely case of a race with the
> +  * completion code. There is no need to reset the deadline
> +  * if the transition to the in-flight state fails because
> +  * the deadline only matters in the in-flight state.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
> - blk_add_timer(req);
> + while (true) {
> + if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> +MQ_RQ_IN_FLIGHT))
> + break;
> + if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> + __blk_mq_complete_request(req);
> + break;
> + }
> + }

I'm having some trouble triggering this case where the state is already
MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.

It looks like the new blk_mq_complete_request() already called
__blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
so doing that again completes it a second time.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
> case BLK_EH_HANDLED:
>         if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
>                                    MQ_RQ_COMPLETE))
>                 __blk_mq_complete_request(req);
>         break;
> 
> This works for me.

Works for me as well on manual fault injection tests.

I think the change above goes back to Christoph's earlier point about the
usage of BLK_EH_HANDLED: is the driver supposed to return BLK_EH_NOT_HANDLED
when it knows the request has already been completed before it returns
the status?


Re: [RFC PATCH 0/3] blk-mq: Timeout rework

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 04:30:41PM +, Bart Van Assche wrote:
> 
> Please have a look at v13 of the timeout handling rework patch that I posted.
> That patch should not introduce any new race conditions and should also handle
> the scenario fine in which blk_mq_complete_request() is called while the NVMe
> timeout handling function is in progress.

Thanks for the notice. That sounds very interesting, and I'll be happy to
take a look.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 04:29:07PM +, Bart Van Assche wrote:
> Please have another look at the current code that handles request timeouts
> and completions. The current implementation guarantees that no double
> completions can occur but your patch removes essential aspects of that
> implementation.

How does the current implementation guarantee a double completion doesn't
happen when the request is allocated for a new command?


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 11:07:07PM +0800, Ming Lei wrote:
> > At approximately the same time, you're saying the driver that returned
> > EH_HANDLED may then call blk_mq_complete_request() in reference to the
> > *old* request that it returned EH_HANDLED for, incorrectly completing
> 
> Because this request has been completed above by blk-mq timeout,
> then this old request won't be completed any more via 
> blk_mq_complete_request()
> either from normal path or what ever, such as cancel.
 
> > the new request before it is done. That will inevitably lead to data
> > corruption. Is that happening today in any driver?
> 
> No such issue since current blk-mq makes sure one req is only completed
> once, but your patch changes to depend on driver to make sure that.

The blk-mq timeout complete makes the request available for allocation
as a new command, at which point blk_mq_complete_request can be called
again. If a driver is somehow relying on blk-mq to prevent a double
completion for a previously completed request context, they're already
in a lot of trouble.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 10:57:40PM +0800, Ming Lei wrote:
> OK, that still depends on driver's behaviour, even though it is true
> for NVMe,  you still have to audit other drivers about this sync
> because it is required by your patch.

Okay, forget about nvme for a moment here. Please run this thought
experiment to decide if what you're saying is even plausible for any
block driver with the existing implementation:

Your scenario has a driver return EH_HANDLED for a timed out request. The
timeout work then completes the request. The request may then be
reallocated for a new command and dispatched.

At approximately the same time, you're saying the driver that returned
EH_HANDLED may then call blk_mq_complete_request() in reference to the
*old* request that it returned EH_HANDLED for, incorrectly completing
the new request before it is done. That will inevitably lead to data
corruption. Is that happening today in any driver?


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 10:37:32PM +0800, Ming Lei wrote:
> On Tue, May 22, 2018 at 10:20 PM, Keith Busch
> <keith.bu...@linux.intel.com> wrote:
> > In the event the driver requests a normal completion, the timeout work
> > releasing the last reference doesn't do a second completion: it only
> 
> The reference only covers request's lifetime, not related with completion.
> 
> It isn't the last reference. When driver returns EH_HANDLED, blk-mq
> will complete this request on extra time.
> 
> Yes, if driver's timeout code and normal completion code can sync
> about this completion, that should be fine, but the current behaviour
> doesn't depend driver's sync since the req is always completed atomically
> via the following way:
> 
> 1) timeout
> 
> if (mark_completed(rq))
>timed_out(rq)
> 
> 2) normal completion
> if (mark_completed(rq))
> complete(rq)
> 
> For example, before nvme_timeout() is trying to run nvme_dev_disable(),
> irq comes and this req is completed from normal completion path, but
> nvme_timeout() still returns EH_HANDLED, and blk-mq may complete
> the req one extra time since the normal completion path may not update
> req's state yet.

nvme_dev_disable tears down the IRQs, meaning their handling is already
synced before nvme_dev_disable can proceed. Whether the completion comes
through nvme_irq or through nvme_dev_disable, there is no way nvme's
timeout handler can return EH_HANDLED if the state was not updated prior
to returning that status.


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 10:49:11AM +0800, Ming Lei wrote:
> On Mon, May 21, 2018 at 05:11:31PM -0600, Keith Busch wrote:
> > -static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
> > +static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> > struct request *rq, void *priv, bool reserved)
> >  {
> > +   unsigned long *next = priv;
> > +
> > /*
> > -* We marked @rq->aborted_gstate and waited for RCU.  If there were
> > -* completions that we lost to, they would have finished and
> > -* updated @rq->gstate by now; otherwise, the completion path is
> > -* now guaranteed to see @rq->aborted_gstate and yield.  If
> > -* @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
> > +* Just do a quick check if it is expired before locking the request in
> > +* so we're not unnecessarilly synchronizing across CPUs.
> >  */
> > -   if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
> > -   READ_ONCE(rq->gstate) == rq->aborted_gstate)
> > +   if (!blk_mq_req_expired(rq, next))
> > +   return;
> > +
> > +   /*
> > +* We have reason to believe the request may be expired. Take a
> > +* reference on the request to lock this request lifetime into its
> > +* currently allocated context to prevent it from being reallocated in
> > +* the event the completion by-passes this timeout handler.
> > +* 
> > +* If the reference was already released, then the driver beat the
> > +* timeout handler to posting a natural completion.
> > +*/
> > +   if (!kref_get_unless_zero(&rq->ref))
> > +   return;
> 
> If this request is just completed in normal path and its state isn't
> updated yet, timeout will hold the request, and may complete this
> request again, then this req can be completed two times.

Hi Ming,

In the event the driver requests a normal completion, the timeout work
releasing the last reference doesn't do a second completion: it only
releases the request's tag back for re-allocation.
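
In other words, the timeout side ends up looking roughly like this (a
simplified paraphrase of the RFC; the real patches split this between the
timed-out handler and the tag iterator):

static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
				     struct request *rq, void *priv,
				     bool reserved)
{
	/*
	 * A reference was taken in blk_mq_check_expired(), so rq cannot be
	 * reallocated underneath us even if the driver completes it now.
	 */
	enum blk_eh_timer_return ret = rq->q->mq_ops->timeout(rq, reserved);

	/*
	 * Complete here only if the driver asked for it and the request is
	 * still in flight; a normal completion that raced with us already
	 * moved the state, so there is no second completion.
	 */
	if (ret == BLK_EH_HANDLED && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
		__blk_mq_complete_request(rq);

	/* Dropping the last reference only returns the tag for reuse. */
	blk_mq_put_request(rq);
}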

Thanks,
Keith


Re: [RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-22 Thread Keith Busch
On Mon, May 21, 2018 at 11:29:06PM +, Bart Van Assche wrote:
> On Mon, 2018-05-21 at 17:11 -0600, Keith Busch wrote:
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > -   __blk_mq_complete_request(req);
> > -   break;
> > -   case BLK_EH_RESET_TIMER:
> > /*
> > -* As nothing prevents from completion happening while
> > -* ->aborted_gstate is set, this may lead to ignored
> > -* completions and further spurious timeouts.
> > +* If the request is still in flight, the driver is requesting
> > +* blk-mq complete it.
> >  */
> > -   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   if (blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT)
> > +   __blk_mq_complete_request(req);
> > +   break;
> > +   case BLK_EH_RESET_TIMER:
> > blk_add_timer(req);
> > break;
> > case BLK_EH_NOT_HANDLED:
> > @@ -880,64 +782,64 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> > }
> >  }
> 
> I think the above changes can lead to concurrent calls of
> __blk_mq_complete_request() from the regular completion path and the timeout
> path. That's wrong: the __blk_mq_complete_request() caller should guarantee
> that no concurrent calls from another context to that function can occur.

Hi Bart,

This shouldn't be introducing any new concurrent calls to
__blk_mq_complete_request if they don't already exist. The timeout work
calls it only if the driver's timeout returns BLK_EH_HANDLED, which means
the driver is claiming the command is now done, but the driver didn't call
blk_mq_complete_request as indicated by the request's IN_FLIGHT state.

In order to get a second call to __blk_mq_complete_request(), then,
the driver would have to end up calling blk_mq_complete_request()
in another context. But there's nothing stopping that from happening
today, and would be bad if any driver actually did that: the request
may have been re-allocated and issued as a new command, and calling
blk_mq_complete_request() the second time introduces data corruption.

Thanks,
Keith


Re: [RFC PATCH 0/3] blk-mq: Timeout rework

2018-05-22 Thread Keith Busch
On Mon, May 21, 2018 at 11:29:21PM +, Bart Van Assche wrote:
> Can you explain why the NVMe driver needs reference counting of requests but
> no other block driver needs this? Additionally, why is it that for all block
> drivers except NVMe the current block layer API is sufficient
> (blk_get_request()/blk_execute_rq()/blk_mq_start_request()/
> blk_mq_complete_request()/blk_mq_end_request())?

Hi Bart,

I'm pretty sure NVMe isn't the only driver where a call to
blk_mq_complete_request silently fails to transition the request to
COMPLETE, forcing unnecessary error handling. This patch isn't so
much about NVMe as it is about removing that silent exception from the
block API.

Thanks,
Keith


Re: [PATCH 3/6] nvme: Move all IO out of controller reset

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 09:46:32AM +0800, Ming Lei wrote:
> It isn't a good practice to rely on timing for avoiding race, IMO.

Sure thing, and it's easy enough to avoid this one.
 
> OK, please fix other issues mentioned in the following comment together,
> then I will review further and see if it can work well.
> 
> https://marc.info/?l=linux-block=152668465710591=2

Will do, v2 coming up by end of today.


[RFC PATCH 0/3] blk-mq: Timeout rework

2018-05-21 Thread Keith Busch
The current blk-mq code can potentially lock requests out of completion
by the thousands, making drivers jump through hoops to handle them. This
patch set allows drivers to complete their requests whenever they are
done, with minimal synchronization and without requiring the drivers to
know anything about the timeout code.

Other proposals under current consideration still have moments that
prevent a driver from progressing a request to the completed state.

The timeout is ultimately made safe by taking a reference on the request
when timeout handling claims it. While that reference is held, we don't
need any tricks to prevent a driver from completing the request out from
under the timeout handler: the recorded state can simply follow the true
state, and drivers don't need to be aware any of this is happening.

In order to make the overhead as minimal as possible, the request's
reference is taken only when it appears that actual timeout handling
needs to be done.

Keith Busch (3):
  blk-mq: Reference count request usage
  blk-mq: Fix timeout and state order
  blk-mq: Remove generation sequence

 block/blk-core.c   |   6 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 291 +
 block/blk-mq.h |  20 +---
 block/blk-timeout.c|   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 83 insertions(+), 262 deletions(-)

-- 
2.14.3



[RFC PATCH 2/3] blk-mq: Fix timeout and state order

2018-05-21 Thread Keith Busch
The block layer had been setting the state to in-flight prior to updating
the timer. This is the wrong order since the timeout handler could observe
the in-flight state with the older timeout, believing the request had
expired when in fact it is just getting started.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b370ed75605..66e5c768803f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -713,8 +713,8 @@ void blk_mq_start_request(struct request *rq)
preempt_disable();
write_seqcount_begin(&rq->gstate_seq);
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 
write_seqcount_end(&rq->gstate_seq);
preempt_enable();
-- 
2.14.3



[RFC PATCH 3/3] blk-mq: Remove generation sequence

2018-05-21 Thread Keith Busch
This patch simplifies the timeout handling by relying on the request
reference counting to ensure the iterator is operating on an in-flight
and truly timed out request. Since the reference counting prevents the
tag from being reallocated, the block layer no longer needs to prevent
drivers from completing their requests while the timeout handler is
operating on them: a driver completing a request is allowed to proceed
to the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers, since the
request cannot be reallocated as a new sequence while timeout handling
is operating on it.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 259 ++---
 block/blk-mq.h |  20 +---
 block/blk-timeout.c|   1 -
 include/linux/blkdev.h |  26 +
 6 files changed, 58 insertions(+), 255 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->internal_tag = -1;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 66e5c768803f..4858876fd364 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -589,56 +589,6 @@ static void __blk_mq_complete_request(struct request *rq)
put_cpu();
 }
 
-static void hctx_unlock(struct blk_mq_hw_ctx *hctx, int srcu_idx)
-   __releases(hctx->srcu)
-{
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-   rcu_read_unlock();
-   else
-   srcu_read_unlock(hctx->srcu, srcu_idx);
-}
-
-static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
-   __acquires(hctx->srcu)
-{
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
-   /* shut up gcc false positive */
-   *srcu_idx = 0;
-   rcu_read_lock();
-   } else
-   *srcu_idx = srcu_read_lock(hctx->srcu);
-}
-
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-   unsigned long flags;
-
-   /*
-* blk_mq_rq_aborted_gstate() is used from the completion path and
-* can thus be called from irq context.  u64_stats_fetch in the
-* middle of update on the same CPU leads to lockup.  Disable irq
-* while updating.
-*/
-   local_irq_save(flags);
-   u64_stats_update_begin(&rq->aborted_gstate_sync);
-   rq->aborted_gstate = gstate;
-   u64_stats_update_end(&rq->aborted_gstate_sync);
-   local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-   unsigned int start;
-   u64 aborted_gstate;
-
-   do {
-   start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-   aborted_gstate = rq->aborted_gstate;
-   } while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-   return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
 * @rq: the request being processed
@@ -650,27 +600,10 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
-   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-   int srcu_idx;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
-
-   /*
-* If @rq->aborted_gstate equals the current instance, timeout is
-* claiming @rq and we lost.  This is synchronized through
-* hctx_lock().  See blk_mq_timeout_work() for details.
-*
-* Completion path never blocks and we can directly use RCU here
-* instead of hctx_lock() which can be either RCU or SRCU.
-* However, that would complicate paths which want to synchronize
-* against us.  Let stay in sync with the issue path so that
-* hctx_lock() covers both issue and completion paths.
-*/
-   hctx_lock(hctx, &srcu_idx);
-   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate

[RFC PATCH 1/3] blk-mq: Reference count request usage

2018-05-21 Thread Keith Busch
This patch adds a struct kref to struct request so that request users
can be sure they're operating on the same request without it changing
while they're processing it. The request's tag won't be released for
reuse until the last user is done with it.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 block/blk-mq.c | 30 +++---
 include/linux/blkdev.h |  2 ++
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4cbfd784e837..8b370ed75605 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,6 +332,7 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
 #endif
 
data->ctx->rq_dispatched[op_is_sync(op)]++;
+   kref_init(&rq->ref);
return rq;
 }
 
@@ -465,13 +466,33 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+static void blk_mq_exit_request(struct kref *ref)
+{
+   struct request *rq = container_of(ref, struct request, ref);
+   struct request_queue *q = rq->q;
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+   const int sched_tag = rq->internal_tag;
+
+   if (rq->tag != -1)
+   blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+   if (sched_tag != -1)
+   blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+   blk_mq_sched_restart(hctx);
+   blk_queue_exit(q);
+}
+
+static void blk_mq_put_request(struct request *rq)
+{
+   kref_put(&rq->ref, blk_mq_exit_request);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-   const int sched_tag = rq->internal_tag;
 
if (rq->rq_flags & RQF_ELVPRIV) {
if (e && e->type->ops.mq.finish_request)
@@ -495,12 +516,7 @@ void blk_mq_free_request(struct request *rq)
blk_put_rl(blk_rq_rl(rq));
 
blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
-   if (rq->tag != -1)
-   blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-   if (sched_tag != -1)
-   blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-   blk_mq_sched_restart(hctx);
-   blk_queue_exit(q);
+   blk_mq_put_request(rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..26bf2c1e3502 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -257,6 +257,8 @@ struct request {
struct u64_stats_sync aborted_gstate_sync;
u64 aborted_gstate;
 
+   struct kref ref;
+
/* access through blk_rq_set_deadline, blk_rq_deadline */
unsigned long __deadline;
 
-- 
2.14.3



Re: [PATCH 1/6] nvme: Sync request queues on reset

2018-05-21 Thread Keith Busch
On Tue, May 22, 2018 at 12:08:37AM +0800, Ming Lei wrote:
> Please take a look at blk_mq_complete_request(). Even with Bart's
> change, the request still won't be completed by driver. The request can
> only be completed by either driver or blk-mq, not both.

So you're saying blk-mq can't complete a request the driver returned to
blk-mq to complete. And that's the nvme driver's problem to fix?

