Re: [dm-devel] [patch 5/5] block: use a driver-specific handler for the "inflight" value

2018-11-14 Thread Mikulas Patocka



On Wed, 14 Nov 2018, Christoph Hellwig wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> > 
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> > 
> > And yes, definitely should've cc'd linux-block (now added).
> 
> So how is dm different from the other 3 handfuls of drivers using
> the make_request interface, such that the per-cpu counters work for dm and
> not the others?

We want to make dm-linear (and dm-stripe) completely lockless, because it 
is used often and we don't want it to degrade performance.

DM already uses SRCU to handle table changes, so the fast path 
doesn't take any locks. The only remaining "lock" is the 
"in_flight" variable.

As for other drivers, md-raid0 could probably be lockless too (using 
percpu counting similar to dm). The other raid levels can't be lockless 
because they need to check the status of the stripe that is being 
accessed.
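
For reference, here is a minimal sketch of the kind of SRCU-protected fast
path meant above (illustrative names only, not the actual dm-linear code):
the live table is picked up under srcu_read_lock(), so remapping a bio never
takes a spinlock, and a table reload on the slow path simply publishes a new
table and waits for an SRCU grace period before freeing the old one.

/* Hedged sketch, not real dm code: all names here are made up. */
#include <linux/srcu.h>
#include <linux/bio.h>
#include <linux/types.h>

struct example_table {
	sector_t sector_offset;			/* start of the mapped range */
};

struct example_dev {
	struct srcu_struct table_srcu;		/* protects 'live_table' */
	struct example_table __rcu *live_table;	/* replaced on table reload */
};

static void example_map_bio(struct example_dev *ed, struct bio *bio)
{
	int idx = srcu_read_lock(&ed->table_srcu);	/* fast path: no locks */
	struct example_table *t = srcu_dereference(ed->live_table, &ed->table_srcu);

	bio->bi_iter.bi_sector += t->sector_offset;	/* dm-linear style remap */
	generic_make_request(bio);

	srcu_read_unlock(&ed->table_srcu, idx);
}

/*
 * Slow path (table reload): rcu_assign_pointer(ed->live_table, new_table),
 * then synchronize_srcu(&ed->table_srcu) before freeing the old table.
 */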

Mikulas



Re: [dm-devel] [patch 5/5] block: use a driver-specific handler for the "inflight" value

2018-11-14 Thread Mikulas Patocka



On Thu, 8 Nov 2018, Christoph Hellwig wrote:

> On Tue, Nov 06, 2018 at 10:35:03PM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file, we need a custom
> > callback that sums the percpu counters.
> 
> The attribute that calls this is per-partition, while your method
> is per-queue, so there is some impedance mismatch here.

Device mapper doesn't use partitions.

> Is there any way you could look into just making the generic code
> use percpu counters?

In the next merge window, single-queue request-based block drivers will be 
eliminated - all the drivers will be multiqueue. So, they won't use 
the in_flight variables at all.

in_flight will only be used by bio-based stacking drivers like md.

> Also please cc linux-block on patches that touch the generic block
> code.

Mikulas



Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Mike Snitzer
On Wed, Nov 14 2018 at  1:51pm -0500,
Hannes Reinecke  wrote:

> On 11/14/18 6:47 PM, Mike Snitzer wrote:
> > On Wed, Nov 14 2018 at  2:49am -0500,
> > Hannes Reinecke  wrote:
> > 
> >> On 11/14/18 6:38 AM, Mike Snitzer wrote:
> >>> On Tue, Nov 13 2018 at  1:00pm -0500,
> >>> Mike Snitzer  wrote:
> >>>
>  [1]: 
>  http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html
>  [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html
> >>> ...
> >>>
> >>> I knew there had to be a pretty tight coupling between the NVMe driver's
> >>> native multipathing and ANA support... and that the simplicity of
> >>> Hannes' patch [1] was too good to be true.
> >>>
> >>> The real justification for not making Hannes' change is it'd effectively
> >>> be useless without first splitting out the ANA handling done during NVMe
> >>> request completion (NVME_SC_ANA_* cases in nvme_failover_req) that
> >>> triggers re-reading the ANA log page accordingly.
> >>>
> >>> So without the ability to drive the ANA workqueue to trigger
> >>> nvme_read_ana_log() from the nvme driver's completion path -- even if
> >>> nvme_core.multipath=N -- it really doesn't buy multipath-tools anything
> >>> to have the NVMe driver export the ana state via sysfs, because that ANA
> >>> state will never get updated.
> >>>
> >> Hmm. Indeed, I was more focussed on having the sysfs attributes
> >> displayed, so yes, indeed it needs some more work.
> > ...
> >>> Not holding my breath BUT:
> >>> if decoupling the reading of ANA state from native NVMe multipathing
> >>> specific work during nvme request completion were an acceptable
> >>> advancement I'd gladly do the work.
> >>>
> >> I'd be happy to work on that, given that we'll have to have 'real'
> >> ANA support for device-mapper anyway for SLE12 SP4 etc.
> > 
> > I had a close enough look yesterday that I figured I'd just implement
> > what I reasoned through as one way forward, compile tested only (patch
> > relative to Jens' for-4.21/block):
> > 
> >  drivers/nvme/host/core.c  | 14 +++---
> >  drivers/nvme/host/multipath.c | 65 ++-
> >  drivers/nvme/host/nvme.h  |  4 +++
> >  3 files changed, 54 insertions(+), 29 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index f172d63db2b5..05313ab5d91e 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req)
> > trace_nvme_complete_rq(req);
> >  
> > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
> > -   if ((req->cmd_flags & REQ_NVME_MPATH) &&
> > -   blk_path_error(status)) {
> > -   nvme_failover_req(req);
> > -   return;
> > +   if (blk_path_error(status)) {
> > +   struct nvme_ns *ns = req->q->queuedata;
> > +   u16 nvme_status = nvme_req(req)->status;
> > +
> > +   if (req->cmd_flags & REQ_NVME_MPATH) {
> > +   nvme_failover_req(req);
> > +   nvme_update_ana(ns, nvme_status);
> > +   return;
> > +   }
> > +   nvme_update_ana(ns, nvme_status);
> > }
> >  
> > if (!blk_queue_dying(req->q)) {
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index 5e3cc8c59a39..f7fbc161dc8c 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath,
> >  
> >  inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
> >  {
> > -   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
> > +   return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
> >  }
> >  
> >  /*
> > @@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
> > }
> >  }
> >  
> > +static bool nvme_ana_error(u16 status)
> > +{
> > +   switch (status & 0x7ff) {
> > +   case NVME_SC_ANA_TRANSITION:
> > +   case NVME_SC_ANA_INACCESSIBLE:
> > +   case NVME_SC_ANA_PERSISTENT_LOSS:
> > +   return true;
> > +   }
> > +   return false;
> > +}
> > +
> >  void nvme_failover_req(struct request *req)
> >  {
> > struct nvme_ns *ns = req->q->queuedata;
> > @@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req)
> > spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> > blk_mq_end_request(req, 0);
> >  
> > -   switch (status & 0x7ff) {
> > -   case NVME_SC_ANA_TRANSITION:
> > -   case NVME_SC_ANA_INACCESSIBLE:
> > -   case NVME_SC_ANA_PERSISTENT_LOSS:
> > +   if (nvme_ana_error(status)) {
> > /*
> >  * If we got back an ANA error we know the controller is alive,
> >  * but not ready to serve this namespaces.  The spec suggests
> > @@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req)
> >  * that the 

Re: [dm-devel] multipath-tools: Bump version to 0.7.9

2018-11-14 Thread Christophe Varoqui
This commit is now tagged.
Thanks.

On Wed, Nov 14, 2018 at 7:45 PM Xose Vazquez Perez wrote:

> Christophe Varoqui wrote:
>
> > https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=5c67a8b5944dd13542e6b44fa2ae9803e0cc4282
>
> Untagged: https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=tags
>

Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Hannes Reinecke
On 11/14/18 6:47 PM, Mike Snitzer wrote:
> On Wed, Nov 14 2018 at  2:49am -0500,
> Hannes Reinecke  wrote:
> 
>> On 11/14/18 6:38 AM, Mike Snitzer wrote:
>>> On Tue, Nov 13 2018 at  1:00pm -0500,
>>> Mike Snitzer  wrote:
>>>
 [1]: 
 http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html
 [2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html
>>> ...
>>>
>>> I knew there had to be a pretty tight coupling between the NVMe driver's
>>> native multipathing and ANA support... and that the simplicity of
>>> Hannes' patch [1] was too good to be true.
>>>
>>> The real justification for not making Hannes' change is it'd effectively
>>> be useless without first splitting out the ANA handling done during NVMe
>>> request completion (NVME_SC_ANA_* cases in nvme_failover_req) that
>>> triggers re-reading the ANA log page accordingly.
>>>
>>> So without the ability to drive the ANA workqueue to trigger
>>> nvme_read_ana_log() from the nvme driver's completion path -- even if
>>> nvme_core.multipath=N -- it really doesn't buy multipath-tools anything
>>> to have the NVMe driver export the ana state via sysfs, because that ANA
>>> state will never get updated.
>>>
>> Hmm. Indeed, I was more focussed on having the sysfs attributes
>> displayed, so yes, indeed it needs some more work.
> ...
>>> Not holding my breath BUT:
>>> if decoupling the reading of ANA state from native NVMe multipathing
>>> specific work during nvme request completion were an acceptable
>>> advancement I'd gladly do the work.
>>>
>> I'd be happy to work on that, given that we'll have to have 'real'
>> ANA support for device-mapper anyway for SLE12 SP4 etc.
> 
> I had a close enough look yesterday that I figured I'd just implement
> what I reasoned through as one way forward, compile tested only (patch
> relative to Jens' for-4.21/block):
> 
>  drivers/nvme/host/core.c  | 14 +++---
>  drivers/nvme/host/multipath.c | 65 ++-
>  drivers/nvme/host/nvme.h  |  4 +++
>  3 files changed, 54 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f172d63db2b5..05313ab5d91e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req)
>   trace_nvme_complete_rq(req);
>  
>   if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
> - if ((req->cmd_flags & REQ_NVME_MPATH) &&
> - blk_path_error(status)) {
> - nvme_failover_req(req);
> - return;
> + if (blk_path_error(status)) {
> + struct nvme_ns *ns = req->q->queuedata;
> + u16 nvme_status = nvme_req(req)->status;
> +
> + if (req->cmd_flags & REQ_NVME_MPATH) {
> + nvme_failover_req(req);
> + nvme_update_ana(ns, nvme_status);
> + return;
> + }
> + nvme_update_ana(ns, nvme_status);
>   }
>  
>   if (!blk_queue_dying(req->q)) {
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 5e3cc8c59a39..f7fbc161dc8c 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath,
>  
>  inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
>  {
> - return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
> + return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
>  }
>  
>  /*
> @@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
>   }
>  }
>  
> +static bool nvme_ana_error(u16 status)
> +{
> + switch (status & 0x7ff) {
> + case NVME_SC_ANA_TRANSITION:
> + case NVME_SC_ANA_INACCESSIBLE:
> + case NVME_SC_ANA_PERSISTENT_LOSS:
> + return true;
> + }
> + return false;
> +}
> +
>  void nvme_failover_req(struct request *req)
>  {
>   struct nvme_ns *ns = req->q->queuedata;
> @@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req)
> spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
>   blk_mq_end_request(req, 0);
>  
> - switch (status & 0x7ff) {
> - case NVME_SC_ANA_TRANSITION:
> - case NVME_SC_ANA_INACCESSIBLE:
> - case NVME_SC_ANA_PERSISTENT_LOSS:
> + if (nvme_ana_error(status)) {
>   /*
>* If we got back an ANA error we know the controller is alive,
>* but not ready to serve this namespaces.  The spec suggests
> @@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req)
>* that the admin and I/O queues are not serialized that is
>* fundamentally racy.  So instead just clear the current path,
>* mark the the path as pending and kick of a re-read of the ANA
> -  

Re: [dm-devel] multipath-tools: Bump version to 0.7.9

2018-11-14 Thread Xose Vazquez Perez
Christophe Varoqui wrote:

> https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=5c67a8b5944dd13542e6b44fa2ae9803e0cc4282

Untagged: https://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=tags



[dm-devel] [PATCH] fix infinite loop in __blkdev_issue_discard

2018-11-14 Thread Mikulas Patocka
The "min_t(unsigned int" macro truncates both arguments to unsigned int. 
So, if we are running discard on a very big device with 2^32 or more 
sectors, the truncation may produce zero, resulting in infinite loop.
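
A quick userspace illustration of that truncation (the macro below just
mimics min_t's cast-then-compare semantics; the numbers are made up):

/* Userspace demo only, not kernel code. */
#include <stdio.h>
#include <stdint.h>

#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	uint64_t nr_sects = 1ULL << 32;		/* a >= 2 TiB discard, in 512-byte sectors */
	unsigned int max_sectors = 4194303;	/* e.g. what bio_allowed_max_sectors() returns */

	/* buggy: both sides are cast to 32 bits first, so 2^32 wraps to 0 */
	unsigned int bad = min_t(unsigned int, nr_sects, max_sectors);
	/* fixed: compare in 64 bits (sector_t); the result still fits in 32 bits */
	unsigned int good = min_t(uint64_t, nr_sects, max_sectors);

	printf("bad=%u good=%u\n", bad, good);	/* prints bad=0 good=4194303 */
	return 0;
}

With req_sects stuck at zero, the loop in __blkdev_issue_discard() never
makes progress, which is the infinite loop described above.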

This patch fixes the infinite loop in the lvm test lvcreate-large-raid.sh

BTW, the patch 744889b7cbb56a64f957e65ade7cb65fe3f35714 that was committed 
between v4.19-rc8 and v4.19 also breaks discard by truncating sector_t to 
unsigned int (but it won't result in an infinite loop, it will result in 
an error instead). Should it be pulled out from the 4.19 long term branch? 
Or should we backport all the subsequent patches on top of it?

Signed-off-by: Mikulas Patocka 
Reported-by: Zdenek Kabelac 
Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard()")
Fixes: 744889b7cbb5 ("block: don't deal with discard limit in blkdev_issue_discard()")
Cc: sta...@vger.kernel.org  # v4.19

---
 block/blk-lib.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/blk-lib.c
===
--- linux-2.6.orig/block/blk-lib.c  2018-11-14 18:25:28.0 +0100
+++ linux-2.6/block/blk-lib.c   2018-11-14 18:33:04.0 +0100
@@ -55,7 +55,7 @@ int __blkdev_issue_discard(struct block_
return -EINVAL;
 
while (nr_sects) {
-   unsigned int req_sects = min_t(unsigned int, nr_sects,
+   unsigned int req_sects = min_t(sector_t, nr_sects,
bio_allowed_max_sectors(q));
 
bio = blk_next_bio(bio, 0, gfp_mask);



Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Mike Snitzer
On Wed, Nov 14 2018 at  2:49am -0500,
Hannes Reinecke  wrote:

> On 11/14/18 6:38 AM, Mike Snitzer wrote:
> >On Tue, Nov 13 2018 at  1:00pm -0500,
> >Mike Snitzer  wrote:
> >
> >>[1]: 
> >>http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html
> >>[2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html
> >...
> >
> >I knew there had to be a pretty tight coupling between the NVMe driver's
> >native multipathing and ANA support... and that the simplicity of
> >Hannes' patch [1] was too good to be true.
> >
> >The real justification for not making Hannes' change is it'd effectively
> >be useless without first splitting out the ANA handling done during NVMe
> >request completion (NVME_SC_ANA_* cases in nvme_failover_req) that
> >triggers re-reading the ANA log page accordingly.
> >
> >So without the ability to drive the ANA workqueue to trigger
> >nvme_read_ana_log() from the nvme driver's completion path -- even if
> >nvme_core.multipath=N -- it really doesn't buy multipath-tools anything
> >to have the NVMe driver export the ana state via sysfs, because that ANA
> >state will never get updated.
> >
> Hmm. Indeed, I was more focussed on having the sysfs attributes
> displayed, so yes, indeed it needs some more work.
...
> >Not holding my breath BUT:
> >if decoupling the reading of ANA state from native NVMe multipathing
> >specific work during nvme request completion were an acceptable
> >advancement I'd gladly do the work.
> >
> I'd be happy to work on that, given that we'll have to have 'real'
> ANA support for device-mapper anyway for SLE12 SP4 etc.

I had a close enough look yesterday that I figured I'd just implement
what I reasoned through as one way forward, compile tested only (patch
relative to Jens' for-4.21/block):

 drivers/nvme/host/core.c  | 14 +++---
 drivers/nvme/host/multipath.c | 65 ++-
 drivers/nvme/host/nvme.h  |  4 +++
 3 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f172d63db2b5..05313ab5d91e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -252,10 +252,16 @@ void nvme_complete_rq(struct request *req)
trace_nvme_complete_rq(req);
 
if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
-   if ((req->cmd_flags & REQ_NVME_MPATH) &&
-   blk_path_error(status)) {
-   nvme_failover_req(req);
-   return;
+   if (blk_path_error(status)) {
+   struct nvme_ns *ns = req->q->queuedata;
+   u16 nvme_status = nvme_req(req)->status;
+
+   if (req->cmd_flags & REQ_NVME_MPATH) {
+   nvme_failover_req(req);
+   nvme_update_ana(ns, nvme_status);
+   return;
+   }
+   nvme_update_ana(ns, nvme_status);
}
 
if (!blk_queue_dying(req->q)) {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5e3cc8c59a39..f7fbc161dc8c 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath,
 
 inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
-   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
+   return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
 }
 
 /*
@@ -47,6 +47,17 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
}
 }
 
+static bool nvme_ana_error(u16 status)
+{
+   switch (status & 0x7ff) {
+   case NVME_SC_ANA_TRANSITION:
+   case NVME_SC_ANA_INACCESSIBLE:
+   case NVME_SC_ANA_PERSISTENT_LOSS:
+   return true;
+   }
+   return false;
+}
+
 void nvme_failover_req(struct request *req)
 {
struct nvme_ns *ns = req->q->queuedata;
@@ -58,10 +69,7 @@ void nvme_failover_req(struct request *req)
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
blk_mq_end_request(req, 0);
 
-   switch (status & 0x7ff) {
-   case NVME_SC_ANA_TRANSITION:
-   case NVME_SC_ANA_INACCESSIBLE:
-   case NVME_SC_ANA_PERSISTENT_LOSS:
+   if (nvme_ana_error(status)) {
/*
 * If we got back an ANA error we know the controller is alive,
 * but not ready to serve this namespaces.  The spec suggests
@@ -69,31 +77,38 @@ void nvme_failover_req(struct request *req)
 * that the admin and I/O queues are not serialized that is
 * fundamentally racy.  So instead just clear the current path,
 * mark the the path as pending and kick of a re-read of the ANA
-* log page ASAP.
+* log page ASAP (see nvme_update_ana() below).
 */
nvme_mpath_clear_current_path(ns);
-   

Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Mike Snitzer
On Wed, Nov 14 2018 at 10:35am -0500,
Christoph Hellwig  wrote:

> On Wed, Nov 14, 2018 at 08:24:05AM +0100, Hannes Reinecke wrote:
> > My argument here is that _ANA_ support should not be tied to the NVME 
> > native multipathing at all.
> 
> It should.  Because nvme driver multipathing is the only sanctioned
> way to use it.  All other ways aren't supported and might break at
> any time.

Quite a few of us who are multipath-tools oriented would like the proper
separation rather than your more pragmatic isolated approach.  And we'd
fix it if it broke in the future.

> > So personally I don't see why the 'raw' ANA support (parsing log pages, 
> > figuring out path states etc) needs to be tied in with native NVMe 
> > multipathing. _Especially_ not as my patch really is trivial.
> 
> And not actually usable for anything...  The only thing you do is to
> increase the code size for embedded nvme users that don't need any
> multipathing.

Isn't that why CONFIG_NVME_MULTIPATH exists?  Embedded nvme users
wouldn't set that.



Re: [dm-devel] [patch 5/5] block: use a driver-specific handler for the "inflight" value

2018-11-14 Thread Mike Snitzer
On Wed, Nov 14 2018 at 10:18am -0500,
Christoph Hellwig  wrote:

> On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> > Discussed doing that with Jens and reported as much here:
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> > 
> > And Jens gave additional context for why yet another attempt to switch
> > block core's in_flight to percpu counters is doomed (having already been
> > proposed and rejected twice):
> > 
> > https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> > 
> > And yes, definitely should've cc'd linux-block (now added).
> 
> So how is dm different from the other 3 handfuls of drivers using
> the make_request interface, such that the per-cpu counters work for dm and
> not the others?

I think the big part of the historic reluctance to switch to percpu
in_flight counters was that until now (4.21) the legacy request path was
also using the in_flight counters.

Now that they are only used by bio-based (make_request) drivers we likely
have more latitude (hopefully?).  Though I cannot say for sure why they
performed so well in Mikulas' testing... you'd think all the percpu
summing on every jiffy during IO completion would've still been
costly... but Mikulas saw great results.

Mikulas and I have discussed a new way forward and he is actively
working on implementing it.  Basically he'll still switch to percpu
in_flight counters, but he'll change the algorithm for IO accounting
during completion so that it is more of an approximation of the
historically precise in_flight counters and io_ticks (io_ticks is
another problematic component that gets in the way of performance).
The accounting done during IO completion would be much, much faster.
A big part of this is that the summation of the percpu in_flight
counters would happen on demand (via sysfs or /proc/diskstats access).
I could look back at my logs from my chat with Mikulas to give you more
details, or we could just wait for Mikulas to post the patches (hopefully
within a week).  Your call.
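
To make that concrete, here is a hedged sketch (illustrative names, not
Mikulas' actual patches) of what "sum on demand" means: the submission and
completion hot paths touch nothing but a per-cpu counter, and the expensive
walk over all CPUs only happens when the sysfs "inflight" file or
/proc/diskstats is actually read.

/* Hedged sketch only; names are illustrative. */
#include <linux/percpu.h>
#include <linux/cpumask.h>

struct example_stats {
	unsigned int __percpu *in_flight[2];	/* [0] = read, [1] = write */
};

/* hot path: no atomics, no shared cacheline */
static inline void example_start_io(struct example_stats *s, int rw)
{
	this_cpu_inc(*s->in_flight[rw]);
}

static inline void example_end_io(struct example_stats *s, int rw)
{
	this_cpu_dec(*s->in_flight[rw]);
}

/* only called from the sysfs "inflight" / /proc/diskstats read path */
static unsigned int example_sum_in_flight(struct example_stats *s, int rw)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(s->in_flight[rw], cpu);
	return sum;
}

A single per-cpu value can wrap (an IO may start on one CPU and complete on
another), but the unsigned sum across all CPUs still comes out right.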

Coming off my Monday discussion with Mikulas I really think the approach
will work nicely and offer a nice performance win for bio-based.

Mike



Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Christoph Hellwig
On Wed, Nov 14, 2018 at 08:24:05AM +0100, Hannes Reinecke wrote:
> My argument here is that _ANA_ support should not be tied to the NVME 
> native multipathing at all.

It should.  Because nvme driver multipathing is the only sanctioned
way to use it.  All other ways aren't supported and might break at
any time.

> So personally I don't see why the 'raw' ANA support (parsing log pages, 
> figuring out path states etc) needs to be tied in with native NVMe 
> multipathing. _Especially_ not as my patch really is trivial.

And not actually usable for anything...  The only thing you do is to
increase the code size for embedded nvme users that don't need any
multipathing.



Re: [dm-devel] [patch 5/5] block: use a driver-specific handler for the "inflight" value

2018-11-14 Thread Christoph Hellwig
On Thu, Nov 08, 2018 at 12:07:01PM -0500, Mike Snitzer wrote:
> Discussed doing that with Jens and reported as much here:
> 
> https://www.redhat.com/archives/dm-devel/2018-November/msg00068.html
> 
> And Jens gave additional context for why yet another attempt to switch
> block core's in_flight to percpu counters is doomed (having already been
> proposed and rejected twice):
> 
> https://www.redhat.com/archives/dm-devel/2018-November/msg00071.html
> 
> And yes, definitely should've cc'd linux-block (now added).

So how is dm different from the other 3 handfuls of drivers using
the make_request interface, such that the per-cpu counters work for dm and
not the others?



Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Martin Wilck
On Wed, 2018-11-14 at 08:49 +0100, Hannes Reinecke wrote:
> On 11/14/18 6:38 AM, Mike Snitzer wrote:
> > 
> > Not holding my breath BUT:
> > if decoupling the reading of ANA state from native NVMe multipathing
> > specific work during nvme request completion were an acceptable
> > advancement I'd gladly do the work.
> > 
> I'd be happy to work on that, given that we'll have to have 'real' ANA
> support for device-mapper anyway for SLE12 SP4 etc.

So, what's the way ahead for multipath-tools? 

Given that, IIUC, at least one official kernel (4.19) has been released
with ANA support but without the ANA sysfs attributes exported to user
space, multipath-tools can't rely on them being present. I for one have
no issue with copying code from nvme-cli and doing NVMe ioctls from
multipath-tools, as in Lijie's patch. When those sysfs attributes are
added (and work as expected), we will use them, but IMO we'll need to
fall back to ioctls until then, and also afterwards, because we can't
assume to be always running on the latest kernel.
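
For illustration, a hedged sketch of that ioctl fallback (loosely modelled
on what nvme-cli does, heavily simplified, and the device path and buffer
size are just examples): issue a Get Log Page command (opcode 0x02) for the
ANA log (log identifier 0x0c) through NVME_IOCTL_ADMIN_CMD and read the
group count out of the log header.

/* Userspace sketch only; no error handling, no descriptor parsing. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define NVME_LOG_ANA		0x0c
#define NVME_ADMIN_GET_LOG_PAGE	0x02

static int read_ana_log(int fd, void *buf, uint32_t len)
{
	uint32_t numd = (len >> 2) - 1;		/* number of dwords, zero-based */
	struct nvme_admin_cmd cmd = {
		.opcode   = NVME_ADMIN_GET_LOG_PAGE,
		.nsid     = 0xffffffff,		/* controller-wide log */
		.addr     = (uintptr_t)buf,
		.data_len = len,
		.cdw10    = NVME_LOG_ANA | ((numd & 0xffff) << 16),
		.cdw11    = numd >> 16,
	};

	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
}

int main(void)
{
	uint8_t log[4096];			/* enough for a small ANA log */
	int fd = open("/dev/nvme0", O_RDONLY);	/* example controller node */

	if (fd < 0 || read_ana_log(fd, log, sizeof(log)) != 0) {
		perror("ana log");
		return 1;
	}
	/* header: 8-byte change count, then a 16-bit group descriptor count */
	printf("ANA groups: %u\n", (unsigned int)(log[8] | (log[9] << 8)));
	close(fd);
	return 0;
}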

Best,
Martin




Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-14 Thread Hannes Reinecke

On 11/14/18 6:38 AM, Mike Snitzer wrote:

On Tue, Nov 13 2018 at  1:00pm -0500,
Mike Snitzer  wrote:


On Tue, Nov 13 2018 at 11:18am -0500,
Keith Busch  wrote:


On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote:

On Mon, Nov 12 2018 at 11:23am -0500,
Martin Wilck  wrote:


Hello Lijie,

On Thu, 2018-11-08 at 14:09 +0800, lijie wrote:

Add support for Asymmetric Namespace Access (ANA) as specified in NVMe 1.3
TP 4004. The states are updated through reading the ANA log page.

By default, the native nvme multipath takes over the nvme device.
We can pass false to the 'multipath' parameter of the nvme-core.ko
module when we want to use multipath-tools.


Thank you for the patch. It looks quite good to me. I've tested it with
a Linux target and found no problems so far.

I have a few questions and comments inline below.

I suggest you also have a look at detect_prio(); it seems to make sense
to use the ana prioritizer for NVMe paths automatically if ANA is
supported (with your patch, "detect_prio no" and "prio ana" have to be
configured explicitly). But that can be done in a later patch.
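
For reference, that explicit configuration would look roughly like the
multipath.conf excerpt below (an illustration only; the vendor/product
matching may differ per setup):

devices {
	device {
		vendor      "NVME"
		product     ".*"
		detect_prio no
		prio        ana
	}
}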


I (and others) think it makes sense to at least triple check with the
NVMe developers (now cc'd) to see if we could get agreement on the nvme
driver providing the ANA state via sysfs (when modparam
nvme_core.multipath=N is set), like Hannes proposed here:
http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html

Then the userspace multipath-tools ANA support could just read sysfs
rather than reinvent harvesting the ANA state via ioctl.


I'd prefer not duplicating the log page parsing. Maybe nvme's ANA support
shouldn't even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath'
param isn't even an issue.


I like your instincts, we just need to take them a bit further.

Splitting out the kernel's ANA log page parsing won't buy us much given
it is userspace (multipath-tools) that needs to consume it.  The less
work userspace needs to do (because kernel has already done it) the
better.

If the NVMe driver is made to always track and export the ANA state via
sysfs [1] we'd avoid userspace parsing duplication "for free".  This
should occur regardless of what layer is reacting to the ANA state
changes (be it NVMe's native multipathing or multipath-tools).

ANA and NVMe multipathing really are disjoint; making them tightly
coupled only serves to force NVMe-driver-provided multipathing _or_
userspace ANA state tracking duplication that really isn't ideal [2].

We need a reasoned answer to the primary question of whether the NVMe
maintainers are willing to cooperate by providing this basic ANA sysfs
export even if nvme_core.multipath=N [1].

Christoph said "No" [3], but offered little _real_ justification for why
this isn't the right thing for NVMe in general.

...

[1]: http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html
[2]: https://www.redhat.com/archives/dm-devel/2018-November/msg00072.html

...

I knew there had to be a pretty tight coupling between the NVMe driver's
native multipathing and ANA support... and that the simplicity of
Hannes' patch [1] was too good to be true.

The real justification for not making Hannes' change is it'd effectively
be useless without first splitting out the ANA handling done during NVMe
request completion (NVME_SC_ANA_* cases in nvme_failover_req) that
triggers re-reading the ANA log page accordingly.

So without the ability to drive the ANA workqueue to trigger
nvme_read_ana_log() from the nvme driver's completion path -- even if
nvme_core.multipath=N -- it really doesn't buy multipath-tools anything
to have the NVMe driver export the ana state via sysfs, because that ANA
state will never get updated.

Hmm. Indeed, I was more focussed on having the sysfs attributes 
displayed, so yes, indeed it needs some more work.



The inability to provide proper justification for rejecting a patch
(that already had one co-maintainer's Reviewed-by [5]) _should_ render
that rejection baseless, and the patch should be applied (especially if
there is contributing subsystem developer interest in maintaining this
support over time, which there is).  At least that is what would happen
in a properly maintained kernel subsystem.

It'd really go a long way if senior Linux NVMe maintainers took steps to
accept reasonable changes.


Even though I'm frustrated I was clearly too harsh and regret my tone.
I promise to _try_ to suck less.

This dynamic of terse responses or no responses at all whenever NVMe
driver changes to ease multipath-tools NVMe support are floated is the
depressing gift that keeps on giving.  But enough excuses...

Not holding my breath BUT:
if decoupling the reading of ANA state from native NVMe multipathing
specific work during nvme request completion were an acceptable
advancement I'd gladly do the work.

I'd be happy to work on that, given that we'll have to have 'real' ANA 
support for device-mapper anyway for SLE12 SP4 etc.


Cheers,

Hannes
--
Dr. Hannes Reinecke 

Re: [dm-devel] [PATCH 0/2] two multipathd fixes

2018-11-14 Thread Christophe Varoqui
Merged.
Thanks.

On Tue, Nov 13, 2018 at 10:31 PM Martin Wilck  wrote:

> Hi Christophe,
>
> Two small fixes, based on my prior submissions which have been
> acked by Ben.
>
> Martin
>
> Martin Wilck (2):
>   multipathd: fix irritating "minor number mismatch" message
>   multipathd: reset delay_wait_checks counter on failure
>
>  multipathd/main.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> --
> 2.19.1
>
>