Re: [dm-devel] [PATCH 01/32] block: Provide blkdev_get_handle_* functions

2023-07-04 Thread Keith Busch
On Tue, Jul 04, 2023 at 02:21:28PM +0200, Jan Kara wrote:
> +struct bdev_handle *blkdev_get_handle_by_dev(dev_t dev, blk_mode_t mode,
> + void *holder, const struct blk_holder_ops *hops)
> +{
> + struct bdev_handle *handle = kmalloc(sizeof(struct bdev_handle),
> +  GFP_KERNEL);

I believe 'sizeof(*handle)' is the preferred style.

> + struct block_device *bdev;
> +
> + if (!handle)
> + return ERR_PTR(-ENOMEM);
> + bdev = blkdev_get_by_dev(dev, mode, holder, hops);
> + if (IS_ERR(bdev))
> + return ERR_CAST(bdev);

Need a 'kfree(handle)' before the error return. Or would it be simpler
to get the bdev first so you can check the mode settings against a
read-only bdev prior to the kmalloc?
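For illustration, a sketch of the reordered variant (assuming blkdev_put(bdev,
holder) is the matching release in this series):

	struct bdev_handle *handle;
	struct block_device *bdev;

	/* open first, so a read-only check could happen before allocating */
	bdev = blkdev_get_by_dev(dev, mode, holder, hops);
	if (IS_ERR(bdev))
		return ERR_CAST(bdev);

	handle = kmalloc(sizeof(*handle), GFP_KERNEL);
	if (!handle) {
		blkdev_put(bdev, holder);	/* undo the open */
		return ERR_PTR(-ENOMEM);
	}
	handle->bdev = bdev;
	handle->holder = holder;
	return handle;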

> + handle->bdev = bdev;
> + handle->holder = holder;
> + return handle;
> +}
> +EXPORT_SYMBOL(blkdev_get_handle_by_dev);
> +
>  /**
>   * blkdev_get_by_path - open a block device by name
>   * @path: path to the block device to open
> @@ -884,6 +902,28 @@ struct block_device *blkdev_get_by_path(const char 
> *path, blk_mode_t mode,
>  }
>  EXPORT_SYMBOL(blkdev_get_by_path);
>  
> +struct bdev_handle *blkdev_get_handle_by_path(const char *path, blk_mode_t 
> mode,
> + void *holder, const struct blk_holder_ops *hops)
> +{
> + struct bdev_handle *handle;
> + dev_t dev;
> + int error;
> +
> + error = lookup_bdev(path, &dev);
> + if (error)
> + return ERR_PTR(error);
> +
> + handle = blkdev_get_handle_by_dev(dev, mode, holder, hops);
> + if (!IS_ERR(handle) && (mode & BLK_OPEN_WRITE) &&
> + bdev_read_only(handle->bdev)) {
> + blkdev_handle_put(handle);
> + return ERR_PTR(-EACCES);
> + }
> +
> + return handle;
> +}
> +EXPORT_SYMBOL(blkdev_get_handle_by_path);

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 21/30] nvme: replace the fmode_t argument to the nvme ioctl handlers with a simple bool

2023-06-09 Thread Keith Busch
Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v4 09/18] nvme: Move pr code to it's own file

2023-03-14 Thread Keith Busch
On Tue, Mar 14, 2023 at 06:13:22PM +0100, Christoph Hellwig wrote:
> > +++ b/drivers/nvme/host/pr.c
> > @@ -0,0 +1,155 @@
> > +// SPDX-License-Identifier: GPL-2.0
> 
> I'd feel much more comfortable if we had a copyright notice code
> here.  This code was written by Keith, maybe he can help to fill
> in what the proper notice should be?

Okay, this was initially introduced with 1d277a637a711a while I was employed at
Intel, so let's add this for the history:

/*
 * Copyright (c) 2015 Intel Corporation 
 *  Keith Busch 
 */

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-02-28 Thread Keith Busch
On Tue, Feb 28, 2023 at 05:06:55PM -0700, Uday Shankar wrote:
> The block layer might merge together discard requests up until the
> max_discard_segments limit is hit, but blk_insert_cloned_request checks
> the segment count against max_segments regardless of the req op. This
> can result in errors like the following when discards are issued through
> a DM device and max_discard_segments exceeds max_segments for the queue
> of the chosen underlying device.
> 
> blk_insert_cloned_request: over max segments limit. (256 > 129)
> 
> Fix this by looking at the req_op and enforcing the appropriate segment
> limit - max_discard_segments for REQ_OP_DISCARDs and max_segments for
> everything else.

Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2] blk-mq: enforce op-specific segment limits in blk_insert_cloned_request

2023-02-22 Thread Keith Busch
On Wed, Feb 22, 2023 at 11:52:25AM -0700, Uday Shankar wrote:
>  static inline unsigned int blk_rq_get_max_segments(struct request *rq)
>  {
> - if (req_op(rq) == REQ_OP_DISCARD)
> - return queue_max_discard_segments(rq->q);
> - return queue_max_segments(rq->q);
> + return blk_queue_get_max_segments(rq->q, req_op(rq));
>  }

I think you should just move this function to blk.h instead of
introducing a new one.
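
That is, something like this sketch, with the body unchanged from the quoted
helper:

/* block/blk.h */
static inline unsigned int blk_rq_get_max_segments(struct request *rq)
{
	if (req_op(rq) == REQ_OP_DISCARD)
		return queue_max_discard_segments(rq->q);
	return queue_max_segments(rq->q);
}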

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [PATCH for-6.2/block V3 2/2] block: Change the granularity of io ticks from ms to ns

2022-12-21 Thread Keith Busch
On Wed, Dec 21, 2022 at 04:05:06AM +, Gulam Mohamed wrote:
> +u64  blk_get_iostat_ticks(struct request_queue *q)
> +{
> +   return (blk_queue_precise_io_stat(q) ? ktime_get_ns() : jiffies);
> +}
> +EXPORT_SYMBOL_GPL(blk_get_iostat_ticks);
> +
>  void update_io_ticks(struct block_device *part, u64 now, bool end)
>  {
>   u64 stamp;
> @@ -968,20 +980,26 @@ EXPORT_SYMBOL(bdev_start_io_acct);
>  u64 bio_start_io_acct(struct bio *bio)
>  {
>   return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
> -   bio_op(bio), jiffies);
> +   bio_op(bio),
> +   blk_get_iostat_ticks(bio->bi_bdev->bd_queue));
>  }
>  EXPORT_SYMBOL_GPL(bio_start_io_acct);
>  
>  void bdev_end_io_acct(struct block_device *bdev, enum req_op op,
> u64 start_time)
>  {
> + u64 now;
> + u64 duration;
> + struct request_queue *q = bdev_get_queue(bdev);
>   const int sgrp = op_stat_group(op);
> - u64 now = READ_ONCE(jiffies);
> - u64 duration = (unsigned long)now -(unsigned long) start_time;
> + now = blk_get_iostat_ticks(q);

I don't think you can rely on the blk_queue_precise_io_stat() flag in
the completion side. The user can toggle this with data in flight, which
means the completion may use different tick units than the start. I
think you'll need to add a flag to the request in the start accounting
side to know which method to use for the completion.
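
Roughly this idea (RQF_PRECISE_IOSTAT is a hypothetical flag name):

	/* start side: record which clock stamped this request */
	if (blk_queue_precise_io_stat(q))
		rq->rq_flags |= RQF_PRECISE_IOSTAT;

	/* completion side: use the unit recorded at start, not the current mode */
	now = (rq->rq_flags & RQF_PRECISE_IOSTAT) ? ktime_get_ns() : jiffies;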

> @@ -951,6 +951,7 @@ ssize_t part_stat_show(struct device *dev,
>   struct request_queue *q = bdev_get_queue(bdev);
>   struct disk_stats stat;
>   unsigned int inflight;
> + u64 stat_ioticks;
>  
>   if (queue_is_mq(q))
>   inflight = blk_mq_in_flight(q, bdev);
> @@ -959,10 +960,13 @@ ssize_t part_stat_show(struct device *dev,
>  
>   if (inflight) {
>   part_stat_lock();
> - update_io_ticks(bdev, jiffies, true);
> + update_io_ticks(bdev, blk_get_iostat_ticks(q), true);
>   part_stat_unlock();
>   }
>   part_stat_read_all(bdev, &stat);
> + stat_ioticks = (blk_queue_precise_io_stat(q) ?
> + div_u64(stat.io_ticks, NSEC_PER_MSEC) :
> + jiffies_to_msecs(stat.io_ticks));


With the user able to change the precision at will, I think these
io_ticks need to have a consistent unit size. Mixing jiffies and nsecs
is going to create garbage stats output. Could existing io_ticks using
jiffies be converted to jiffies_to_nsecs() so that you only have one
unit to consider?
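
For instance (a sketch; jiffies_to_nsecs() already exists in linux/jiffies.h):

	/* account io_ticks in nanoseconds regardless of the mode */
	update_io_ticks(bdev, blk_queue_precise_io_stat(q) ?
			ktime_get_ns() : jiffies_to_nsecs(jiffies), true);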



Re: [RFC for-6.2/block V2] block: Change the granularity of io ticks from ms to ns

2022-12-07 Thread Keith Busch
On Wed, Dec 07, 2022 at 11:17:12PM +, Chaitanya Kulkarni wrote:
> On 12/7/22 15:08, Jens Axboe wrote:
> > 
> > My default peak testing runs at 122M IOPS. That's also the peak IOPS of
> > the devices combined, and with iostats disabled. If I enabled iostats,
> > then the performance drops to 112M IOPS. It's no longer device limited,
> > that's a drop of about 8.2%.
> > 
> 
> Wow, clearly not acceptable; that's exactly why I asked for perf
> numbers :).

For the record, we did say per-io ktime_get() has a measurable
performance harm and should be aggregated.

  https://www.spinics.net/lists/linux-block/msg89937.html



Re: [dm-devel] [PATCH AUTOSEL 5.15 25/31] dm-log-writes: set dma_alignment limit in io_hints

2022-11-23 Thread Keith Busch
On Wed, Nov 23, 2022 at 07:42:26AM -0500, Sasha Levin wrote:
> From: Keith Busch 
> 
> [ Upstream commit 50a893359cd2643ee1afc96eedc9e7084cab49fa ]
> 
> This device mapper needs bio vectors to be sized and memory aligned to
> the logical block size. Set the minimum required queue limit
> accordingly.

Probably harmless, but these dm dma_alignment patches are not needed
prior to 6.0.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCHv2 0/5] fix direct io device mapper errors

2022-11-15 Thread Keith Busch
On Mon, Nov 14, 2022 at 06:31:36AM -0500, Mikulas Patocka wrote:
> 
> 
> On Fri, 11 Nov 2022, Keith Busch wrote:
> 
> > > There are other DM targets that override logical_block_size in their
> > > .io_hints hook (writecache, ebs, zoned). Have you reasoned through why
> > > those do _not_ need updating too?
> > 
> > Yeah, that's a good question. The ones that have a problem all make
> > assumptions about a bio's bv_offset being logical block size aligned,
> > and each of those is accounted for here. Everything else looks fine with
> > respect to handling offsets.
> 
> Unaligned bv_offset should work - because XFS is sending such bios. If you 
> compile the kernel with memory debugging, kmalloc returns unaligned 
> memory. XFS will allocate a buffer with kmalloc and test if it crosses a page
> boundary; if not, it uses the buffer; if it does, it frees the buffer and
> allocates a full page.
> 
> There have been device mapper problems about unaligned bv_offset in the 
> past and I have fixed them.
> 
> Unaligned bv_length is a problem for the affected targets.

kmalloc is physically contiguous, and bios support multi-page bvecs for
these, so the bv_len is always aligned if the kmalloc is sized
correctly. The unaligned offsets become a problem with virtually
contiguous buffers since individual bv lengths might not be block size
aligned when bv offsets exist.
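
The XFS pattern Mikulas describes looks roughly like this (a sketch of the
idea, not the actual fs/xfs code; assumes size <= PAGE_SIZE):

	buf = kmalloc(size, GFP_KERNEL);
	if (buf && (((unsigned long)buf ^
		     ((unsigned long)buf + size - 1)) & PAGE_MASK)) {
		/* the buffer straddles a page boundary: fall back to a page */
		kfree(buf);
		buf = (void *)__get_free_page(GFP_KERNEL);
	}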

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCHv2 0/5] fix direct io device mapper errors

2022-11-13 Thread Keith Busch
On Fri, Nov 11, 2022 at 01:07:05PM -0500, Mike Snitzer wrote:
> On Thu, Nov 10 2022 at  1:44P -0500,
> Keith Busch  wrote:
> 
> > From: Keith Busch 
> > 
> > The 6.0 kernel made some changes to the direct io interface to allow
> > offsets in user addresses. This is based on the hardware's capabilities
> > reported in the request_queue's dma_alignment attribute.
> > 
> > dm-crypt, -log-writes and -integrity require direct io be aligned to the
> > block size. Since they were only ever getting the default 511 dma mask, this
> > requirement may fail if formatted to something larger, like 4k, which
> > will result in unexpected behavior with direct-io.
> > 
> > Changes since v1: Added the same fix for -integrity and -log-writes
> > 
> > The first three were reported as successfully tested by Dmitrii Tcvetkov,
> > but I don't have an official Tested-by: tag.
> > 
> >   
> > https://lore.kernel.org/linux-block/20221103194140.06ce3...@xps.demsh.org/T/#mba1d0b13374541cdad3b669ec4257a11301d1860
> > 
> > Keith Busch (5):
> >   block: make dma_alignment a stacking queue_limit
> >   dm-crypt: provide dma_alignment limit in io_hints
> >   block: make blk_set_default_limits() private
> >   dm-integrity: set dma_alignment limit in io_hints
> >   dm-log-writes: set dma_alignment limit in io_hints
> > 
> >  block/blk-core.c   |  1 -
> >  block/blk-settings.c   |  9 +
> >  block/blk.h|  1 +
> >  drivers/md/dm-crypt.c  |  1 +
> >  drivers/md/dm-integrity.c  |  1 +
> >  drivers/md/dm-log-writes.c |  1 +
> >  include/linux/blkdev.h | 16 
> >  7 files changed, 17 insertions(+), 13 deletions(-)
> > 
> > -- 
> > 2.30.2
> > 
> 
> There are other DM targets that override logical_block_size in their
> .io_hints hook (writecache, ebs, zoned). Have you reasoned through why
> those do _not_ need updating too?

Yeah, that's a good question. The ones that have a problem all make
assumptions about a bio's bv_offset being logical block size aligned,
and each of those is accounted for here. Everything else looks fine with
respect to handling offsets.

> Is there any risk of just introducing a finalization method in block
> core (that DM's .io_hints would call) that would ensure limits that
> are a function of another are always in sync?  Would avoid whack-a-mole
> issues in the future.

I don't think we can do that for this particular limit issue. A lot of drivers and
devices can use a smaller dma_alignment than the logical block size, so
the generic block layer wouldn't really know the driver's intentions for
the relationship of these limits.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 1/5] block: make dma_alignment a stacking queue_limit

2022-11-10 Thread Keith Busch
From: Keith Busch 

Device mappers have always been getting the default 511 dma mask, but
the underlying device might have a larger alignment requirement. Since
this value is used to determine allowable direct-io alignment, this
needs to be a stackable limit.

Signed-off-by: Keith Busch 
Reviewed-by: Christoph Hellwig 
---
 block/blk-core.c   |  1 -
 block/blk-settings.c   |  8 +---
 include/linux/blkdev.h | 15 ---
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17667159482e..5487912befe8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -425,7 +425,6 @@ struct request_queue *blk_alloc_queue(int node_id, bool 
alloc_srcu)
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
goto fail_stats;
 
-   blk_queue_dma_alignment(q, 511);
blk_set_default_limits(&q->limits);
q->nr_requests = BLKDEV_DEFAULT_RQ;
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8bb9eef5310e..4949ed3ce7c9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -57,6 +57,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->misaligned = 0;
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
+   lim->dma_alignment = 511;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
@@ -600,6 +601,7 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
 
t->io_min = max(t->io_min, b->io_min);
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+   t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
 
/* Set non-power-of-2 compatible chunk_sectors boundary */
if (b->chunk_sectors)
@@ -773,7 +775,7 @@ EXPORT_SYMBOL(blk_queue_virt_boundary);
  **/
 void blk_queue_dma_alignment(struct request_queue *q, int mask)
 {
-   q->dma_alignment = mask;
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
@@ -795,8 +797,8 @@ void blk_queue_update_dma_alignment(struct request_queue 
*q, int mask)
 {
BUG_ON(mask > PAGE_SIZE);
 
-   if (mask > q->dma_alignment)
-   q->dma_alignment = mask;
+   if (mask > q->limits.dma_alignment)
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50e358a19d98..9898e34b2c2d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -311,6 +311,13 @@ struct queue_limits {
unsigned char   discard_misaligned;
unsigned char   raid_partial_stripes_expensive;
enum blk_zoned_modelzoned;
+
+   /*
+* Drivers that set dma_alignment to less than 511 must be prepared to
+* handle individual bvec's that are not a multiple of a SECTOR_SIZE
+* due to possible offsets.
+*/
+   unsigned intdma_alignment;
 };
 
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
@@ -456,12 +463,6 @@ struct request_queue {
unsigned long   nr_requests;/* Max # of requests */
 
unsigned intdma_pad_mask;
-   /*
-* Drivers that set dma_alignment to less than 511 must be prepared to
-* handle individual bvec's that are not a multiple of a SECTOR_SIZE
-* due to possible offsets.
-*/
-   unsigned intdma_alignment;
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
struct blk_crypto_profile *crypto_profile;
@@ -1324,7 +1325,7 @@ static inline sector_t bdev_zone_sectors(struct 
block_device *bdev)
 
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
-   return q ? q->dma_alignment : 511;
+   return q ? q->limits.dma_alignment : 511;
 }
 
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 5/5] dm-log-writes: set dma_alignment limit in io_hints

2022-11-10 Thread Keith Busch
From: Keith Busch 

This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.

Signed-off-by: Keith Busch 
---
 drivers/md/dm-log-writes.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 20fd688f72e7..178e13a5b059 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -875,6 +875,7 @@ static void log_writes_io_hints(struct dm_target *ti, 
struct queue_limits *limit
limits->logical_block_size = bdev_logical_block_size(lc->dev->bdev);
limits->physical_block_size = bdev_physical_block_size(lc->dev->bdev);
limits->io_min = limits->physical_block_size;
+   limits->dma_alignment = limits->logical_block_size - 1;
 }
 
 #if IS_ENABLED(CONFIG_FS_DAX)
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 2/5] dm-crypt: provide dma_alignment limit in io_hints

2022-11-10 Thread Keith Busch
From: Keith Busch 

This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.

Link: https://lore.kernel.org/linux-block/20221101001558.648ee...@xps.demsh.org/
Fixes: b1a000d3b8ec5 ("block: relax direct io memory alignment")
Reported-by: Eric Biggers 
Reported-by: Dmitrii Tcvetkov 
Signed-off-by: Keith Busch 
---
 drivers/md/dm-crypt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 159c6806c19b..2653516bcdef 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3630,6 +3630,7 @@ static void crypt_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits->physical_block_size =
max_t(unsigned, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size);
+   limits->dma_alignment = limits->logical_block_size - 1;
 }
 
 static struct target_type crypt_target = {
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 4/5] dm-integrity: set dma_alignment limit in io_hints

2022-11-10 Thread Keith Busch
From: Keith Busch 

This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.

Signed-off-by: Keith Busch 
---
 drivers/md/dm-integrity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index aaf2472df6e5..e1e7b205573f 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -3370,6 +3370,7 @@ static void dm_integrity_io_hints(struct dm_target *ti, 
struct queue_limits *lim
limits->logical_block_size = ic->sectors_per_block << 
SECTOR_SHIFT;
limits->physical_block_size = ic->sectors_per_block << 
SECTOR_SHIFT;
blk_limits_io_min(limits, ic->sectors_per_block << 
SECTOR_SHIFT);
+   limits->dma_alignment = limits->logical_block_size - 1;
}
 }
 
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/3] fix direct io errors on dm-crypt

2022-11-10 Thread Keith Busch
On Thu, Nov 10, 2022 at 06:24:03PM +, Eric Biggers wrote:
> On Thu, Nov 03, 2022 at 08:25:56AM -0700, Keith Busch wrote:
> > From: Keith Busch 
> > 
> > The 6.0 kernel made some changes to the direct io interface to allow
> > offsets in user addresses. This is based on the hardware's capabilities
> > reported in the request_queue's dma_alignment attribute.
> > 
> > dm-crypt requires direct io be aligned to the block size. Since it was
> > only ever using the default 511 dma mask, this requirement may fail if
> > formatted to something larger, like 4k, which will result in unexpected
> > behavior with direct-io.
> > 
> > There are two parts to fixing this:
> > 
> >   First, the attribute needs to be moved to the queue_limit so that it
> >   can properly stack with device mappers.
> > 
> >   Second, dm-crypt provides its minimum required limit to match the
> >   logical block size.
> > 
> > Keith Busch (3):
> >   block: make dma_alignment a stacking queue_limit
> >   dm-crypt: provide dma_alignment limit in io_hints
> >   block: make blk_set_default_limits() private
> 
> Hi Keith, can you send out an updated version of this patch series that
> addresses the feedback?
> 
> I'd really like for this bug to be fixed before 6.1 is released, so that there
> isn't a known bug in STATX_DIOALIGN already upon release.

Sorry for the delay, v2 sent.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 3/5] block: make blk_set_default_limits() private

2022-11-10 Thread Keith Busch
From: Keith Busch 

There are no external users of this function.

Signed-off-by: Keith Busch 
Reviewed-by: Christoph Hellwig 
---
 block/blk-settings.c   | 1 -
 block/blk.h| 1 +
 include/linux/blkdev.h | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4949ed3ce7c9..8ac1038d0c79 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,7 +59,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
 }
-EXPORT_SYMBOL(blk_set_default_limits);
 
 /**
  * blk_set_stacking_limits - set default limits for stacking devices
diff --git a/block/blk.h b/block/blk.h
index d6ea0d1a6db0..a186ea20f39d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -331,6 +331,7 @@ void blk_rq_set_mixed_merge(struct request *rq);
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
 
+void blk_set_default_limits(struct queue_limits *lim);
 int blk_dev_init(void);
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9898e34b2c2d..891f8cbcd043 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -945,7 +945,6 @@ extern void blk_queue_io_min(struct request_queue *q, 
unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
 extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
-extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
sector_t offset);
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCHv2 0/5] fix direct io device mapper errors

2022-11-10 Thread Keith Busch
From: Keith Busch 

The 6.0 kernel made some changes to the direct io interface to allow
offsets in user addresses. This is based on the hardware's capabilities
reported in the request_queue's dma_alignment attribute.

dm-crypt, -log-writes and -integrity require direct io be aligned to the
block size. Since they were only ever getting the default 511 dma mask, this
requirement may fail if formatted to something larger, like 4k, which
will result in unexpected behavior with direct-io.

Changes since v1: Added the same fix for -integrity and -log-writes

The first three were reported as successfully tested by Dmitrii Tcvetkov,
but I don't have an official Tested-by: tag.

  
https://lore.kernel.org/linux-block/20221103194140.06ce3...@xps.demsh.org/T/#mba1d0b13374541cdad3b669ec4257a11301d1860

Keith Busch (5):
  block: make dma_alignment a stacking queue_limit
  dm-crypt: provide dma_alignment limit in io_hints
  block: make blk_set_default_limits() private
  dm-integrity: set dma_alignment limit in io_hints
  dm-log-writes: set dma_alignment limit in io_hints

 block/blk-core.c   |  1 -
 block/blk-settings.c   |  9 +
 block/blk.h|  1 +
 drivers/md/dm-crypt.c  |  1 +
 drivers/md/dm-integrity.c  |  1 +
 drivers/md/dm-log-writes.c |  1 +
 include/linux/blkdev.h | 16 
 7 files changed, 17 insertions(+), 13 deletions(-)

-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 0/3] fix direct io errors on dm-crypt

2022-11-04 Thread Keith Busch
From: Keith Busch 

The 6.0 kernel made some changes to the direct io interface to allow
offsets in user addresses. This is based on the hardware's capabilities
reported in the request_queue's dma_alignment attribute.

dm-crypt requires direct io be aligned to the block size. Since it was
only ever using the default 511 dma mask, this requirement may fail if
formatted to something larger, like 4k, which will result in unexpected
behavior with direct-io.

There are two parts to fixing this:

  First, the attribute needs to be moved to the queue_limit so that it
  can properly stack with device mappers.

  Second, dm-crypt provides its minimum required limit to match the
  logical block size.

Keith Busch (3):
  block: make dma_alignment a stacking queue_limit
  dm-crypt: provide dma_alignment limit in io_hints
  block: make blk_set_default_limits() private

 block/blk-core.c   |  1 -
 block/blk-settings.c   |  9 +
 block/blk.h|  1 +
 drivers/md/dm-crypt.c  |  1 +
 include/linux/blkdev.h | 16 
 5 files changed, 15 insertions(+), 13 deletions(-)

-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 0/3] fix direct io errors on dm-crypt

2022-11-04 Thread Keith Busch
On Thu, Nov 03, 2022 at 12:33:19PM -0400, Mikulas Patocka wrote:
> Hi
> 
> The patchset seems OK - but dm-integrity also has a limitation that the 
> bio vectors must be aligned on logical block size.
> 
> dm-writecache and dm-verity seem to handle unaligned bioset, but you 
> should check them anyway.
> 
> I'm not sure about dm-log-writes.

Yeah, dm-integrity definitely needs it too. It's easy enough to reuse
the same io_hint that I added for dm-crypt.

dm-log-writes is doing some weird things with writes that I don't think
would work with some low level drivers without the same alignment
constraint.

The other two appear to handle this fine, but I'll run everything
through some focused test cases to be sure.

In the meantime, do you want to see the remaining mappers fixed in a v2,
or are you okay if they follow after this series?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 3/3] block: make blk_set_default_limits() private

2022-11-04 Thread Keith Busch
From: Keith Busch 

There are no external users of this function.

Signed-off-by: Keith Busch 
---
 block/blk-settings.c   | 1 -
 block/blk.h| 1 +
 include/linux/blkdev.h | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 4949ed3ce7c9..8ac1038d0c79 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,7 +59,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->zone_write_granularity = 0;
lim->dma_alignment = 511;
 }
-EXPORT_SYMBOL(blk_set_default_limits);
 
 /**
  * blk_set_stacking_limits - set default limits for stacking devices
diff --git a/block/blk.h b/block/blk.h
index d259d1da4cb4..4849a2efa4c5 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -330,6 +330,7 @@ void blk_rq_set_mixed_merge(struct request *rq);
 bool blk_rq_merge_ok(struct request *rq, struct bio *bio);
 enum elv_merge blk_try_merge(struct request *rq, struct bio *bio);
 
+void blk_set_default_limits(struct queue_limits *lim);
 int blk_dev_init(void);
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69ee5ea29e2f..704bc175a732 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -939,7 +939,6 @@ extern void blk_queue_io_min(struct request_queue *q, 
unsigned int min);
 extern void blk_limits_io_opt(struct queue_limits *limits, unsigned int opt);
 extern void blk_queue_io_opt(struct request_queue *q, unsigned int opt);
 extern void blk_set_queue_depth(struct request_queue *q, unsigned int depth);
-extern void blk_set_default_limits(struct queue_limits *lim);
 extern void blk_set_stacking_limits(struct queue_limits *lim);
 extern int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
sector_t offset);
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 1/3] block: make dma_alignment a stacking queue_limit

2022-11-04 Thread Keith Busch
From: Keith Busch 

Device mappers have always been getting the default 511 dma mask, but
the underlying device might have a larger alignment requirement. Since
this value is used to determine allowable direct-io alignment, this
needs to be a stackable limit.

Signed-off-by: Keith Busch 
---
 block/blk-core.c   |  1 -
 block/blk-settings.c   |  8 +---
 include/linux/blkdev.h | 15 ---
 3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3a2ed8dadf73..9d6a947024ea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -418,7 +418,6 @@ struct request_queue *blk_alloc_queue(int node_id)
PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
goto fail_stats;
 
-   blk_queue_dma_alignment(q, 511);
blk_set_default_limits(&q->limits);
q->nr_requests = BLKDEV_DEFAULT_RQ;
 
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8bb9eef5310e..4949ed3ce7c9 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -57,6 +57,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->misaligned = 0;
lim->zoned = BLK_ZONED_NONE;
lim->zone_write_granularity = 0;
+   lim->dma_alignment = 511;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
 
@@ -600,6 +601,7 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
 
t->io_min = max(t->io_min, b->io_min);
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+   t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
 
/* Set non-power-of-2 compatible chunk_sectors boundary */
if (b->chunk_sectors)
@@ -773,7 +775,7 @@ EXPORT_SYMBOL(blk_queue_virt_boundary);
  **/
 void blk_queue_dma_alignment(struct request_queue *q, int mask)
 {
-   q->dma_alignment = mask;
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
@@ -795,8 +797,8 @@ void blk_queue_update_dma_alignment(struct request_queue 
*q, int mask)
 {
BUG_ON(mask > PAGE_SIZE);
 
-   if (mask > q->dma_alignment)
-   q->dma_alignment = mask;
+   if (mask > q->limits.dma_alignment)
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 854b4745cdd1..69ee5ea29e2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -310,6 +310,13 @@ struct queue_limits {
unsigned char   discard_misaligned;
unsigned char   raid_partial_stripes_expensive;
enum blk_zoned_modelzoned;
+
+   /*
+* Drivers that set dma_alignment to less than 511 must be prepared to
+* handle individual bvec's that are not a multiple of a SECTOR_SIZE
+* due to possible offsets.
+*/
+   unsigned intdma_alignment;
 };
 
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
@@ -455,12 +462,6 @@ struct request_queue {
unsigned long   nr_requests;/* Max # of requests */
 
unsigned intdma_pad_mask;
-   /*
-* Drivers that set dma_alignment to less than 511 must be prepared to
-* handle individual bvec's that are not a multiple of a SECTOR_SIZE
-* due to possible offsets.
-*/
-   unsigned intdma_alignment;
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
struct blk_crypto_profile *crypto_profile;
@@ -1318,7 +1319,7 @@ static inline sector_t bdev_zone_sectors(struct 
block_device *bdev)
 
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
-   return q ? q->dma_alignment : 511;
+   return q ? q->limits.dma_alignment : 511;
 }
 
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



[dm-devel] [PATCH 2/3] dm-crypt: provide dma_alignment limit in io_hints

2022-11-04 Thread Keith Busch
From: Keith Busch 

This device mapper needs bio vectors to be sized and memory aligned to
the logical block size. Set the minimum required queue limit
accordingly.

Fixes: b1a000d3b8ec5 ("block: relax direct io memory alignment")
Reported-by: Eric Biggers 
Reported-by: Dmitrii Tcvetkov 
Signed-off-by: Keith Busch 
---
 drivers/md/dm-crypt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 159c6806c19b..2653516bcdef 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3630,6 +3630,7 @@ static void crypt_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits->physical_block_size =
max_t(unsigned, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size);
+   limits->dma_alignment = limits->logical_block_size - 1;
 }
 
 static struct target_type crypt_target = {
-- 
2.30.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Regression: wrong DIO alignment check with dm-crypt

2022-11-03 Thread Keith Busch
On Wed, Nov 02, 2022 at 02:45:10PM -0400, Mikulas Patocka wrote:
> On Tue, 1 Nov 2022, Eric Biggers wrote:
> > Hi,
> > 
> > I happened to notice the following QEMU bug report:
> > 
> > https://gitlab.com/qemu-project/qemu/-/issues/1290
> > 
> > I believe it's a regression from the following kernel commit:
> > 
> > commit b1a000d3b8ec582da64bb644be633e5a0beffcbf
> > Author: Keith Busch 
> > Date:   Fri Jun 10 12:58:29 2022 -0700
> > 
> > block: relax direct io memory alignment
> 
> I suggest to revert this patch.

I hope we can make that option a last resort.
 
> > The bug is that if a dm-crypt device is set up with a crypto sector size 
> > (and
> > thus also a logical_block_size) of 4096, then the block layer now lets 
> > through
> > direct I/O requests to dm-crypt when the user buffer has only 512-byte
> > alignment, instead of the 4096-bytes expected by dm-crypt in that case.  
> > This is
> > because the dma_alignment of the device-mapper device is only 511 bytes.
> 
> Propagating dma_alignment through the device mapper stack would be hard 
> (because it is not in struct queue_limits). Perhaps we could set 
> dma_alignment to be the equivalent to logical_block_size, if the above 
> patch could not be reverted - but the we would hit the issue again when 
> someone stacks md or other devices over dm.

It looks straightforward to relocate the attribute to the
queue_limits. If it stacks correctly, then no one would encounter a
problem no matter what md/dm combination you have.

I have something that looks promising, but I'm trying to give it a
thorough test before sending out another incomplete patch. Hopefully
ready by end of the day.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Regression: wrong DIO alignment check with dm-crypt

2022-11-03 Thread Keith Busch
[Cc'ing Dmitrii, who also reported the same issue]

On Tue, Nov 01, 2022 at 08:11:15PM -0700, Eric Biggers wrote:
> Hi,
> 
> I happened to notice the following QEMU bug report:
> 
> https://gitlab.com/qemu-project/qemu/-/issues/1290
> 
> I believe it's a regression from the following kernel commit:
> 
> commit b1a000d3b8ec582da64bb644be633e5a0beffcbf
> Author: Keith Busch 
> Date:   Fri Jun 10 12:58:29 2022 -0700
> 
> block: relax direct io memory alignment
> 
> The bug is that if a dm-crypt device is set up with a crypto sector size (and
> thus also a logical_block_size) of 4096, then the block layer now lets through
> direct I/O requests to dm-crypt when the user buffer has only 512-byte
> alignment, instead of the 4096-bytes expected by dm-crypt in that case.  This 
> is
> because the dma_alignment of the device-mapper device is only 511 bytes.
> 
> This has two effects in this case:
> 
> - The error code for DIO with a misaligned buffer is now EIO, instead of
>   EINVAL as expected and documented.  This is because the I/O reaches
>   dm-crypt instead of being rejected by the block layer.
> 
> - STATX_DIOALIGN reports 512 bytes for stx_dio_mem_align, instead of the
>   correct value of 4096.  (Technically not a regression since 
> STATX_DIOALIGN
>   is new in v6.1, but still a bug.)
> 
> Any thoughts on what the correct fix is here?  Maybe the device-mapper layer
> needs to set dma_alignment correctly?  Or maybe the block layer needs to set 
> it
> to 'logical_block_size - 1' by default?

I think the quick fix is to have the device mapper override the default
queue stacking limits to align the dma mask to logical block size.

Does dm-crypt strictly require memory alignment to match the block size,
or is this just how the current software works, something we could change?
It may take me a moment to get to the bottom of that, but after a quick
glance, it looks like dm-crypt will work fine with the 512 offsets if we
happen to have a physically contiguous multi-page bvec, and will fail
otherwise due to a predetermined set of sg segments (looking at
crypt_convert_block_aead()).

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Regression: wrong DIO alignment check with dm-crypt

2022-11-03 Thread Keith Busch
On Wed, Nov 02, 2022 at 08:03:45PM +0300, Dmitrii Tcvetkov wrote:
> 
> Applied on top of 6.1-rc3, the issue still reproduces.

Yeah, I see that now. I needed to run a dm-crypt setup to figure out how
they're actually doing this, so now I have that up and running.

I think this type of usage will require the dma_alignment to move from
the request_queue to the queue_limits. Here's that, and I've
successfully tested this with cryptsetup. I still need more work to get
mdraid atop that working on my dev machine, but I'll post this now since
it's looking better:

---
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8bb9eef5310e..e84304959318 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -81,6 +81,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
lim->max_dev_sectors = UINT_MAX;
lim->max_write_zeroes_sectors = UINT_MAX;
lim->max_zone_append_sectors = UINT_MAX;
+   lim->dma_alignment = 511;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -600,6 +601,7 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
 
t->io_min = max(t->io_min, b->io_min);
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
+   t->dma_alignment = max(t->dma_alignment, b->dma_alignment);
 
/* Set non-power-of-2 compatible chunk_sectors boundary */
if (b->chunk_sectors)
@@ -773,7 +775,7 @@ EXPORT_SYMBOL(blk_queue_virt_boundary);
  **/
 void blk_queue_dma_alignment(struct request_queue *q, int mask)
 {
-   q->dma_alignment = mask;
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
@@ -795,8 +797,8 @@ void blk_queue_update_dma_alignment(struct request_queue 
*q, int mask)
 {
BUG_ON(mask > PAGE_SIZE);
 
-   if (mask > q->dma_alignment)
-   q->dma_alignment = mask;
+   if (mask > q->limits.dma_alignment)
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 159c6806c19b..2653516bcdef 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -3630,6 +3630,7 @@ static void crypt_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits->physical_block_size =
max_t(unsigned, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size);
+   limits->dma_alignment = limits->logical_block_size - 1;
 }
 
 static struct target_type crypt_target = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 854b4745cdd1..69ee5ea29e2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -310,6 +310,13 @@ struct queue_limits {
unsigned char   discard_misaligned;
unsigned char   raid_partial_stripes_expensive;
enum blk_zoned_modelzoned;
+
+   /*
+* Drivers that set dma_alignment to less than 511 must be prepared to
+* handle individual bvec's that are not a multiple of a SECTOR_SIZE
+* due to possible offsets.
+*/
+   unsigned intdma_alignment;
 };
 
 typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
@@ -455,12 +462,6 @@ struct request_queue {
unsigned long   nr_requests;/* Max # of requests */
 
unsigned intdma_pad_mask;
-   /*
-* Drivers that set dma_alignment to less than 511 must be prepared to
-* handle individual bvec's that are not a multiple of a SECTOR_SIZE
-* due to possible offsets.
-*/
-   unsigned intdma_alignment;
 
 #ifdef CONFIG_BLK_INLINE_ENCRYPTION
struct blk_crypto_profile *crypto_profile;
@@ -1318,7 +1319,7 @@ static inline sector_t bdev_zone_sectors(struct 
block_device *bdev)
 
 static inline int queue_dma_alignment(const struct request_queue *q)
 {
-   return q ? q->dma_alignment : 511;
+   return q ? q->limits.dma_alignment : 511;
 }
 
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
--

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] Regression: wrong DIO alignment check with dm-crypt

2022-11-03 Thread Keith Busch
On Wed, Nov 02, 2022 at 08:52:15AM -0600, Keith Busch wrote:
> [Cc'ing Dmitrii, who also reported the same issue]
> 
> On Tue, Nov 01, 2022 at 08:11:15PM -0700, Eric Biggers wrote:
> > Hi,
> > 
> > I happened to notice the following QEMU bug report:
> > 
> > https://gitlab.com/qemu-project/qemu/-/issues/1290
> > 
> > I believe it's a regression from the following kernel commit:
> > 
> > commit b1a000d3b8ec582da64bb644be633e5a0beffcbf
> > Author: Keith Busch 
> > Date:   Fri Jun 10 12:58:29 2022 -0700
> > 
> > block: relax direct io memory alignment
> > 
> > The bug is that if a dm-crypt device is set up with a crypto sector size 
> > (and
> > thus also a logical_block_size) of 4096, then the block layer now lets 
> > through
> > direct I/O requests to dm-crypt when the user buffer has only 512-byte
> > alignment, instead of the 4096-bytes expected by dm-crypt in that case.  
> > This is
> > because the dma_alignment of the device-mapper device is only 511 bytes.
> > 
> > This has two effects in this case:
> > 
> > - The error code for DIO with a misaligned buffer is now EIO, instead of
> >   EINVAL as expected and documented.  This is because the I/O reaches
> >   dm-crypt instead of being rejected by the block layer.
> > 
> > - STATX_DIOALIGN reports 512 bytes for stx_dio_mem_align, instead of the
> >   correct value of 4096.  (Technically not a regression since 
> > STATX_DIOALIGN
> >   is new in v6.1, but still a bug.)
> > 
> > Any thoughts on what the correct fix is here?  Maybe the device-mapper layer
> > needs to set dma_alignment correctly?  Or maybe the block layer needs to 
> > set it
> > to 'logical_block_size - 1' by default?
> 
> I think the quick fix is to have the device mapper override the default
> queue stacking limits to align the dma mask to logical block size.

This is what I'm coming up with. Only compile tested (still setting up
an environment to actually run it).

---
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 159c6806c19b..9334e58a4c9f 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -43,6 +43,7 @@
 #include 
 
 #include "dm-audit.h"
+#include "dm-core.h"
 
 #define DM_MSG_PREFIX "crypt"
 
@@ -3615,7 +3616,9 @@ static int crypt_iterate_devices(struct dm_target *ti,
 
 static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
+   struct mapped_device *md = dm_table_get_md(ti->table);
struct crypt_config *cc = ti->private;
+   struct request_queue *q = md->queue;
 
/*
 * Unfortunate constraint that is required to avoid the potential
@@ -3630,6 +3633,8 @@ static void crypt_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits->physical_block_size =
max_t(unsigned, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size);
+
+   blk_queue_dma_alignment(q, limits->logical_block_size - 1);
 }
 
 static struct target_type crypt_target = {
--

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 08/19] nvme: Move pr code to it's own file

2022-10-31 Thread Keith Busch
On Fri, Oct 28, 2022 at 11:06:29AM -0500, Mike Christie wrote:
> On 10/27/22 12:06 PM, Keith Busch wrote:
> > On Wed, Oct 26, 2022 at 06:19:34PM -0500, Mike Christie wrote:
> >> This patch moves the pr code to it's own file because I'm going to be
> >> adding more functions and core.c is getting bigger.
> >>
> >> Signed-off-by: Mike Christie 
> > 
> > Good idea.
> 
> Credit goes to Chaitanya.
> 
> One question for you. I wasn't sure about the copyright. I saw
> you added the pr_ops code in 2015 when you were at Intel. Do you
> want me to add:
> 
> Copyright (c) 2015, Intel Corporation.
> 
> to the new pr.c file?

I think I was just the last person to touch the code, but it was mostly
developed elsewhere. So "no", don't bother propagating the (c).

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 10/19] nvme: Move NVMe and Block PR types to an array

2022-10-31 Thread Keith Busch
On Wed, Oct 26, 2022 at 06:19:36PM -0500, Mike Christie wrote:
> For Reservation Report support we need to also convert from the NVMe spec
> PR type back to the block PR definition. This moves us to an array, so in
> the next patch we can add another helper to do the conversion without
> having to manage 2 switches.
> 
> Signed-off-by: Mike Christie 
> ---
>  drivers/nvme/host/pr.c | 42 +++---
>  include/linux/nvme.h   |  9 +
>  2 files changed, 32 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c
> index df7eb2440c67..5c4611d15d9c 100644
> --- a/drivers/nvme/host/pr.c
> +++ b/drivers/nvme/host/pr.c
> @@ -6,24 +6,28 @@
>  
>  #include "nvme.h"
>  
> -static char nvme_pr_type(enum pr_type type)
> +static const struct {
> + enum nvme_pr_type   nvme_type;
> + enum pr_typeblk_type;
> +} nvme_pr_types[] = {
> + { NVME_PR_WRITE_EXCLUSIVE, PR_WRITE_EXCLUSIVE },
> + { NVME_PR_EXCLUSIVE_ACCESS, PR_EXCLUSIVE_ACCESS },
> + { NVME_PR_WRITE_EXCLUSIVE_REG_ONLY, PR_WRITE_EXCLUSIVE_REG_ONLY },
> + { NVME_PR_EXCLUSIVE_ACCESS_REG_ONLY, PR_EXCLUSIVE_ACCESS_REG_ONLY },
> + { NVME_PR_WRITE_EXCLUSIVE_ALL_REGS, PR_WRITE_EXCLUSIVE_ALL_REGS },
> + { NVME_PR_EXCLUSIVE_ACCESS_ALL_REGS, PR_EXCLUSIVE_ACCESS_ALL_REGS },
> +};

Wouldn't it be easier to use the block type as the array index to avoid
the whole looped lookup?

  enum nvme_pr_type types[] = {
	[PR_WRITE_EXCLUSIVE]  = NVME_PR_WRITE_EXCLUSIVE,
	[PR_EXCLUSIVE_ACCESS] = NVME_PR_EXCLUSIVE_ACCESS,
	...
  };

?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 06/19] nvme: Fix reservation status related structs

2022-10-31 Thread Keith Busch
On Wed, Oct 26, 2022 at 06:19:32PM -0500, Mike Christie wrote:
> This fixes the following issues with the reservation status structs:
> 
> 1. resv10 is bytes 23:10 so it should be 14 bytes.
> 2. regctl_ds only supports 64 bit host IDs.
> 
> These are not currently used, but will be in this patchset which adds
> support for the reservation report command.
> 
> Signed-off-by: Mike Christie 

Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 10/19] nvme: Move NVMe and Block PR types to an array

2022-10-31 Thread Keith Busch
On Thu, Oct 27, 2022 at 12:13:06PM -0500, michael.chris...@oracle.com wrote:
> Oh wait there was also a
> 
> 3. The pr_types come from userspace so if it passes us 10
> and we just do:
> 
> types[pr_type]
> 
> then we would crash due to an out of bounds error.
> 
> Similarly I thought there could be a bad target that does the
> same thing.

Well, you'd of course have to check the boundaries before accessing if
you were to implement this scheme. :)

But considering this isn't a performance path, perhaps these kinds of
optimizations are not worth it.
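
For reference, a sketch of the bounds-checked direct lookup (array entries per
the patch; the check itself is only an illustration):

	static const enum nvme_pr_type nvme_pr_types[] = {
		[PR_WRITE_EXCLUSIVE]	= NVME_PR_WRITE_EXCLUSIVE,
		[PR_EXCLUSIVE_ACCESS]	= NVME_PR_EXCLUSIVE_ACCESS,
		/* ... remaining pairs as in the patch ... */
	};

	/* pr_type comes from userspace: reject anything out of range */
	if ((size_t)type >= ARRAY_SIZE(nvme_pr_types) || !nvme_pr_types[type])
		return -EINVAL;
	return nvme_pr_types[type];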

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 07/19] nvme: Don't hardcode the data len for pr commands

2022-10-31 Thread Keith Busch
On Wed, Oct 26, 2022 at 06:19:33PM -0500, Mike Christie wrote:
> Reservation Report support needs to pass in a variable sized buffer, so
> this patch has the pr command helpers take a data length argument.
> 
> Signed-off-by: Mike Christie 
> Reviewed-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v3 08/19] nvme: Move pr code to it's own file

2022-10-31 Thread Keith Busch
On Wed, Oct 26, 2022 at 06:19:34PM -0500, Mike Christie wrote:
> This patch moves the pr code to it's own file because I'm going to be
> adding more functions and core.c is getting bigger.
> 
> Signed-off-by: Mike Christie 

Good idea. The nvme core file is getting a bit too big and too diverse
in its responsibilities.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2 09/20] nvme: Add helper to execute Reservation Report

2022-08-09 Thread Keith Busch
On Tue, Aug 09, 2022 at 10:56:55AM +, Chaitanya Kulkarni wrote:
> On 8/8/22 17:04, Mike Christie wrote:
> > +
> > +   c.common.opcode = nvme_cmd_resv_report;
> > +   c.common.cdw10 = cpu_to_le32(nvme_bytes_to_numd(data_len));
> > +   c.common.cdw11 = 1;
> > +   *eds = true;
> > +
> > +retry:
> > +   if (IS_ENABLED(CONFIG_NVME_MULTIPATH) &&
> > +   bdev->bd_disk->fops == &nvme_ns_head_ops)
> > +   ret = nvme_send_ns_head_pr_command(bdev, &c, data, data_len);
> > +   else
> > +   ret = nvme_send_ns_pr_command(bdev->bd_disk->private_data, &c,
> > + data, data_len);
> > +   if (ret == NVME_SC_HOST_ID_INCONSIST && c.common.cdw11) {
> > +   c.common.cdw11 = 0;
> > +   *eds = false;
> > +   goto retry;
> 
> Unconditional retries without any limit can create problems,
> perhaps consider adding some soft limits.

It's already conditioned on cdw11, which is cleared to 0 on the 2nd try. Not
that that's particularly clear. I'd suggest naming an enum value for it so the
code tells us what the significance of cdw11 is in this context (it's the
Extended Data Structure control flag).
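
Something like this, where the name is just a suggestion:

	enum {
		NVME_EXTENDED_DATA_STRUCT = 1 << 0, /* cdw11: Extended Data Structure */
	};

	c.common.cdw11 = cpu_to_le32(NVME_EXTENDED_DATA_STRUCT);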

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2 09/20] nvme: Add helper to execute Reservation Report

2022-08-09 Thread Keith Busch
On Wed, Aug 10, 2022 at 01:45:48AM +, Chaitanya Kulkarni wrote:
> On 8/9/22 09:21, Mike Christie wrote:
> > On 8/9/22 9:51 AM, Keith Busch wrote:
> >> On Tue, Aug 09, 2022 at 10:56:55AM +, Chaitanya Kulkarni wrote:
> >>> On 8/8/22 17:04, Mike Christie wrote:
> >>>> +
> >>>> +c.common.opcode = nvme_cmd_resv_report;
> >>>> +c.common.cdw10 = cpu_to_le32(nvme_bytes_to_numd(data_len));
> >>>> +c.common.cdw11 = 1;
> >>>> +*eds = true;
> >>>> +
> >>>> +retry:
> >>>> +if (IS_ENABLED(CONFIG_NVME_MULTIPATH) &&
> >>>> +bdev->bd_disk->fops == &nvme_ns_head_ops)
> >>>> +ret = nvme_send_ns_head_pr_command(bdev, &c, data, 
> >>>> data_len);
> >>>> +else
> >>>> +ret = 
> >>>> nvme_send_ns_pr_command(bdev->bd_disk->private_data, &c,
> >>>> +  data, data_len);
> >>>> +if (ret == NVME_SC_HOST_ID_INCONSIST && c.common.cdw11) {
> >>>> +c.common.cdw11 = 0;
> >>>> +*eds = false;
> >>>> +goto retry;
> >>>
> >>> Unconditional retries without any limit can create problems,
> >>> perhaps consider adding some soft limits.
> >>
> >> It's already conditioned on cdw11, which is cleared to 0 on the 2nd try. 
> >> Not
> >> that that's particularly clear. I'd suggest naming an enum value for it so 
> >> the
> >> code tells us what the significance of cdw11 is in this context (it's the
> >> Extended Data Structure control flag).
> > 
> 
> true, my concern is that if the controller went bad (not a common case, but
> it is H/W after all) then we should have some soft limit to avoid infinite
> retries.

cdw11 is '0' on the 2nd try, and the 'goto' is conditioned on cdw11 being
non-zero. There's no infinite retry here.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 09/11] block, nvme: Add error for reservation conflicts.

2022-06-06 Thread Keith Busch
On Fri, Jun 03, 2022 at 01:55:34AM -0500, Mike Christie wrote:
> @@ -171,6 +171,7 @@ static const struct {
>   /* zone device specific errors */
>   [BLK_STS_ZONE_OPEN_RESOURCE]= { -ETOOMANYREFS, "open zones 
> exceeded" },
>   [BLK_STS_ZONE_ACTIVE_RESOURCE]  = { -EOVERFLOW, "active zones exceeded" 
> },
> + [BLK_STS_RSV_CONFLICT]  = { -EBADE, "resevation conflict" },

You misspelled "reservation". :)

And since you want a different error, why reuse EBADE for the errno? That is
already used for BLK_STS_NEXUS that you're trying to differentiate from, right?
At least for nvme, this error code is returned when the host lacks sufficient
rights, so something like EACCES might make sense.

Looks good otherwise.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 24/27] block: remove QUEUE_FLAG_DISCARD

2022-04-12 Thread Keith Busch
On Sat, Apr 09, 2022 at 06:50:40AM +0200, Christoph Hellwig wrote:
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index efb85c6d8e2d5..7e07dd69262a7 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1607,10 +1607,8 @@ static void nvme_config_discard(struct gendisk *disk, 
> struct nvme_ns *ns)
>   struct request_queue *queue = disk->queue;
>   u32 size = queue_logical_block_size(queue);
>  
> - if (ctrl->max_discard_sectors == 0) {
> - blk_queue_flag_clear(QUEUE_FLAG_DISCARD, queue);
> + if (ctrl->max_discard_sectors == 0)
>   return;
> - }

I think we need to update the queue limit in this condition. While unlikely,
the flag was cleared here in case the device changed support for discard from
the previous reset. 
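
Something along these lines (sketch):

	if (ctrl->max_discard_sectors == 0) {
		/* a reset may have revoked discard support: zero the limit */
		blk_queue_max_discard_sectors(queue, 0);
		return;
	}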

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 3/3] nvme: add the "debug" host driver

2022-02-07 Thread Keith Busch
On Fri, Feb 04, 2022 at 03:15:02PM +0100, Hannes Reinecke wrote:
> On 2/4/22 10:58, Chaitanya Kulkarni wrote:
> 
> > and if that is the case, why don't we have a ZNS NVMeOF target
> > memory backed emulation? Isn't that a bigger and more
> > complicated feature than Simple Copy, where controller states
> > are involved with AENs?
> > 
> > ZNS kernel code testing is also done on QEMU; I've also fixed
> > bugs in the ZNS kernel code which were discovered on QEMU, and I've not
> > seen any issues with that. Given that the simple copy feature is way smaller
> > than ZNS, it is less likely to suffer from slowness etc. (listed
> > above) in QEMU.
> > 
> > my point is if we allow one, we will be opening floodgates and we need
> > to be careful not to bloat the code unless it is _absolutely
> > necessary_ which I don't think it is based on the simple copy
> > specification.
> > 
> 
> I do have a slightly different view on the nvme target code; it should
> provide the necessary means to test the nvme host code.
> And simple copy is one of these features, especially as it will operate as an
> exploiter of the new functionality.

The threshold to determine if the in-kernel fabrics target ought to
implement a feature should be whether it's useful in production.

Are users interested in copying data without using fabric bandwidth?
Yes.

Does anyone want a mocked-up ZNS that has all the constraints and none of
the benefits? No.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support

2022-02-04 Thread Keith Busch
On Thu, Feb 03, 2022 at 01:50:06PM -0500, Mikulas Patocka wrote:
> On Tue, 1 Feb 2022, Bart Van Assche wrote:
> > Only supporting copying between contiguous LBA ranges seems restrictive to 
> > me.
> > I expect garbage collection by filesystems for UFS devices to perform better
> > if multiple LBA ranges are submitted as a single SCSI XCOPY command.
> 
> NVMe has a possibility to copy multiple source ranges into one destination 
> range. But I think that leveraging this capability would just make the 
> code excessively complex.

The point is to defrag discontiguous blocks into a single range. The
capability loses a major value proposition without multiple sources.
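
For example, host managed garbage collection typically wants to gather
valid blocks scattered across, say, LBA ranges 100-107, 400-403, and
900-911 into one contiguous destination extent. With only contiguous
source ranges that becomes three separate copy commands plus the
bookkeeping to sequence them, instead of one.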

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 3/3] nvme: add the "debug" host driver

2022-02-04 Thread Keith Busch
On Thu, Feb 03, 2022 at 07:38:43AM -0800, Luis Chamberlain wrote:
> On Wed, Feb 02, 2022 at 08:00:12AM +, Chaitanya Kulkarni wrote:
> > Mikulas,
> > 
> > On 2/1/22 10:33 AM, Mikulas Patocka wrote:
> > > External email: Use caution opening links or attachments
> > > 
> > > 
> > > This patch adds a new driver "nvme-debug". It uses memory as a backing
> > > store and it is used to test the copy offload functionality.
> > > 
> > > Signed-off-by: Mikulas Patocka 
> > > 
> > 
> > 
> > NVMe Controller specific memory backed features needs to go into
> > QEMU which are targeted for testing and debugging, just like what
> > we have done for NVMe ZNS QEMU support and not in kernel.
> > 
> > I don't see any special reason to make copy offload an exception.
> 
> One can instantiate scsi devices with qemu by using fake scsi devices,
> but one can also just use scsi_debug to do the same. I see both efforts
> as desirable, so long as someone mantains this.
> 
> For instance, blktests uses scsi_debug for simplicity.
> 
> In the end you decide what you want to use.

Can we use the nvme-loop target instead?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support

2022-02-02 Thread Keith Busch
On Tue, Feb 01, 2022 at 01:32:29PM -0500, Mikulas Patocka wrote:
> +int blkdev_issue_copy(struct block_device *bdev1, sector_t sector1,
> +   struct block_device *bdev2, sector_t sector2,
> +   sector_t nr_sects, sector_t *copied, gfp_t gfp_mask)
> +{
> + struct page *token;
> + sector_t m;
> + int r = 0;
> + struct completion comp;
> +
> + *copied = 0;
> +
> + m = min(bdev_max_copy_sectors(bdev1), bdev_max_copy_sectors(bdev2));
> + if (!m)
> + return -EOPNOTSUPP;
> + m = min(m, (sector_t)round_down(UINT_MAX, PAGE_SIZE) >> 9);
> +
> + if (unlikely(bdev_read_only(bdev2)))
> + return -EPERM;
> +
> + token = alloc_page(gfp_mask);
> + if (unlikely(!token))
> + return -ENOMEM;
> +
> + while (nr_sects) {
> + struct bio *read_bio, *write_bio;
> + sector_t this_step = min(nr_sects, m);
> +
> + read_bio = bio_alloc(gfp_mask, 1);
> + if (unlikely(!read_bio)) {
> + r = -ENOMEM;
> + break;
> + }
> + bio_set_op_attrs(read_bio, REQ_OP_COPY_READ_TOKEN, REQ_NOMERGE);
> + bio_set_dev(read_bio, bdev1);
> + __bio_add_page(read_bio, token, PAGE_SIZE, 0);

You have this "token" payload as driver specific data, but there's no
check that bdev1 and bdev2 subscribe to the same driver specific format.

I thought we discussed defining something like a "copy domain" that
establishes which block devices can offload copy operations to/from each
other, and that should be checked before proceeding with the copy
operation.
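
As a strawman (nothing below exists; "copy_domain" is a made-up queue
attribute purely to illustrate the check):

	static bool blk_copy_offload_ok(struct block_device *bdev1,
					struct block_device *bdev2)
	{
		struct request_queue *q1 = bdev_get_queue(bdev1);
		struct request_queue *q2 = bdev_get_queue(bdev2);

		/* both queues must advertise the same copy domain */
		return q1->copy_domain && q1->copy_domain == q2->copy_domain;
	}

with blkdev_issue_copy() presumably falling back to a read/write loop
rather than handing bdev1's token to bdev2 when the check fails.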

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 0/8] block: add support for REQ_OP_VERIFY

2021-11-05 Thread Keith Busch
On Thu, Nov 04, 2021 at 06:34:31PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 04, 2021 at 10:32:35AM -0700, Darrick J. Wong wrote:
> > I also wonder if it would be useful (since we're already having a
> > discussion elsewhere about data integrity syscalls for pmem) to be able
> > to call this sort of thing against files?  In which case we'd want
> > another preadv2 flag or something, and then plumb all that through the
> > vfs/iomap as needed?
> 
> IFF we do this (can't answer if there is a need) we should not
> overload read with it.  It is an operation that does not return
> data but just a status, so let's not get into that mess.

If there is a need for this, a new io_uring opcode seems like the
appropriate user facing interface for it.
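
Hypothetically, with liburing (IORING_OP_VERIFY is made up here; no such
opcode exists):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	io_uring_prep_rw(IORING_OP_VERIFY, sqe, fd, NULL, nbytes, offset);
	io_uring_submit(&ring);

and the CQE would carry only a status, no data, which fits the operation
better than overloading read.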

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH 3/8] nvme: add support for the Verify command

2021-11-05 Thread Keith Busch
On Wed, Nov 03, 2021 at 11:46:29PM -0700, Chaitanya Kulkarni wrote:
> +static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
> + struct request *req, struct nvme_command *cmnd)
> +{

Due to recent driver changes, you need a "memset(cmnd, 0, sizeof(*cmnd))"
prior to setting up the rest of the command, or you need to set each
command dword individually.
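
I.e., start with:

	memset(cmnd, 0, sizeof(*cmnd));

before filling in the verify fields (which would also make the explicit
"cmnd->verify.control = 0" below unnecessary).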

> + cmnd->verify.opcode = nvme_cmd_verify;
> + cmnd->verify.nsid = cpu_to_le32(ns->head->ns_id);
> + cmnd->verify.slba =
> + cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
> + cmnd->verify.length =
> + cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
> + cmnd->verify.control = 0;
> + return BLK_STS_OK;
> +}

> +static void nvme_config_verify(struct gendisk *disk, struct nvme_ns *ns)
> +{
> + u64 max_blocks;
> +
> + if (!(ns->ctrl->oncs & NVME_CTRL_ONCS_VERIFY))
> + return;
> +
> + if (ns->ctrl->max_hw_sectors == UINT_MAX)
> + max_blocks = (u64)USHRT_MAX + 1;
> + else
> + max_blocks = ns->ctrl->max_hw_sectors + 1;

If supported by the controller, this maximum is defined in the non-mdts
command limits in NVM command set specific Identify Controller VSL field.
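
Something like this, hand-waving (assumes ->vsl was cached from the NVM
command set specific Identify Controller data, and that VSL uses the
MDTS-style encoding, a power of two in CAP.MPSMIN units):

	if (ns->ctrl->vsl)
		max_bytes = 1ull << (ns->ctrl->vsl + 12 +
				     NVME_CAP_MPSMIN(ns->ctrl->cap));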

> +
> + /* keep same as discard */
> + if (blk_queue_flag_test_and_set(QUEUE_FLAG_VERIFY, disk->queue))
> + return;
> +
> + blk_queue_max_verify_sectors(disk->queue,
> +  nvme_lba_to_sect(ns, max_blocks));
> +
> +}

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-11-03 Thread Keith Busch
On Fri, Oct 29, 2021 at 09:15:43AM -0700, Bart Van Assche wrote:
> On 10/28/21 10:51 PM, Hannes Reinecke wrote:
> > Also Keith presented his work on a simple zone-based remapping block 
> > device, which included an in-kernel copy offload facility.
> > Idea is to lift that as a standalone patch such that we can use it a 
> > fallback (ie software) implementation if no other copy offload mechanism is 
> > available.
> 
> Is a link to the presentation available?

Thanks for the interest.

I didn't post them online as the conference didn't provide it, and I
don't think the slides would be particularly interesting without the
prepared speech anyway.

The presentation described a simple prototype implementing a redirection
table on zone block devices. There was one bullet point explaining how a
generic kernel implementation would be an improvement. For zoned block
devices, an "append" like copy offload would be an even better option.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 11/29] btrfs: use bdev_nr_sectors instead of open coding it

2021-10-14 Thread Keith Busch
On Wed, Oct 13, 2021 at 07:10:24AM +0200, Christoph Hellwig wrote:
> Use the proper helper to read the block device size.

Just IMO, this patch looks like it wants a new bdev_nr_bytes() helper
instead of using the double shifting sectors back to bytes.
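
A trivial helper along those lines:

	static inline loff_t bdev_nr_bytes(struct block_device *bdev)
	{
		return i_size_read(bdev->bd_inode);
	}

so callers that want bytes don't shift the sector count left again.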

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 15/15] nvme: use bvec_virt

2021-08-09 Thread Keith Busch
On Wed, Aug 04, 2021 at 11:56:34AM +0200, Christoph Hellwig wrote:
> Use bvec_virt instead of open coding it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index dfd9dec0c1f6..02ce94b2906b 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -968,12 +968,11 @@ void nvme_cleanup_cmd(struct request *req)
>  {
>   if (req->rq_flags & RQF_SPECIAL_PAYLOAD) {
>   struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
> - struct page *page = req->special_vec.bv_page;
>  
> - if (page == ctrl->discard_page)
> + if (req->special_vec.bv_page == ctrl->discard_page)
>   clear_bit_unlock(0, &ctrl->discard_page_busy);
>   else
> - kfree(page_address(page) + req->special_vec.bv_offset);
> + kfree(bvec_virt(&req->special_vec));
>   }
>  }
>  EXPORT_SYMBOL_GPL(nvme_cleanup_cmd);

Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v4 1/3] scsi: scsi_ioctl: export __scsi_result_to_blk_status()

2021-06-30 Thread Keith Busch
On Tue, Jun 29, 2021 at 09:23:18PM +0200, Martin Wilck wrote:
> On Di, 2021-06-29 at 14:59 +0200, Christoph Hellwig wrote:
> > On Mon, Jun 28, 2021 at 04:57:33PM +0200, Martin Wilck wrote:
> > 
> > > The sg_io-on-multipath code needs to answer the question "is this a
> > > path or a target error?". Therefore it calls blk_path_error(),
> > > which
> > > requires obtaining a blk_status_t first. But that's not available
> > > in
> > > the sg_io code path. So how should I deal with this situation?
> > 
> > Not by exporting random crap from the SCSI code.
> 
> So, you'd prefer inlining scsi_result_to_blk_status()?

I don't think you need to. The only scsi_result_to_blk_status() caller
is sg_io_to_blk_status(). That's already in the same file as
scsi_result_to_blk_status() so no need to export it. You could even make
it static.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH V3 03/13] block: add helper of blk_create_io_context

2021-03-25 Thread Keith Busch
On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> +static inline void blk_create_io_context(struct request_queue *q)
> +{
> + /*
> +  * Various block parts want %current->io_context, so allocate it up
> +  * front rather than dealing with lots of pain to allocate it only
> +  * where needed. This may fail and the block layer knows how to live
> +  * with it.
> +  */

I think this comment would make more sense if it were placed above the
caller rather than within this function. 

> + if (unlikely(!current->io_context))
> + create_task_io_context(current, GFP_ATOMIC, q->node);
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>   struct block_device *bdev = bio->bi_bdev;
> @@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>   }
>   }
>  
> + blk_create_io_context(q);
> +
>   if (!blk_queue_poll(q))
>   bio->bi_opf &= ~REQ_HIPRI;
>  
> @@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>   break;
>   }
>  
> - /*
> -  * Various block parts want %current->io_context, so allocate it up
> -  * front rather than dealing with lots of pain to allocate it only
> -  * where needed. This may fail and the block layer knows how to live
> -  * with it.
> -  */
> - if (unlikely(!current->io_context))
> - create_task_io_context(current, GFP_ATOMIC, q->node);
> -
>   if (blk_throtl_bio(bio)) {
>   blkcg_bio_issue_init(bio);
>   return false;
> -- 
> 2.29.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH v5 0/4] add simple copy support

2021-02-22 Thread Keith Busch
On Sat, Feb 20, 2021 at 06:01:56PM +, David Laight wrote:
> From: SelvaKumar S
> > Sent: 19 February 2021 12:45
> > 
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> 
> Sounds to me like the real reason is that the copy just ends up changing
> some indirect block pointers rather than having to actually copy the data.

I guess an implementation could do that, but I think that's missing the
point of the command. The intention is to copy the data to a new
location on the media for host managed garbage collection. 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH v3 1/2] block: add simple copy support

2020-12-14 Thread Keith Busch
On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote:
> +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload *payload,
> + gfp_t gfp_mask)
> +{
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct bio *bio;
> + void *buf = NULL;
> + int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> +
> + nr_srcs = payload->copy_range;
> + max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;

The default value for this limit is 0, and this is the function for when
the device doesn't support copy. Are we expecting drivers to set this
value to something else for that case?

> + cur_dest = payload->dest;
> + buf = kvmalloc(max_range_len, GFP_ATOMIC);
> + if (!buf)
> + return -ENOMEM;
> +
> + for (i = 0; i < nr_srcs; i++) {
> + bio = bio_alloc(gfp_mask, 1);
> + bio->bi_iter.bi_sector = payload->range[i].src;
> + bio->bi_opf = REQ_OP_READ;
> + bio_set_dev(bio, bdev);
> +
> + cur_size = payload->range[i].len << SECTOR_SHIFT;
> + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +offset_in_page(payload));

'buf' is vmalloc'ed, so we don't necessarily have contiguous pages. I
think you need to allocate the bio with bio_map_kern() or something like
that instead with that kind of memory.
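
E.g. (sketch only):

	bio = bio_map_kern(q, buf, cur_size, gfp_mask);
	if (IS_ERR(bio)) {
		ret = PTR_ERR(bio);
		goto out;
	}
	bio->bi_iter.bi_sector = payload->range[i].src;
	bio->bi_opf = REQ_OP_READ;
	bio_set_dev(bio, bdev);

which maps the buffer page by page rather than assuming it is physically
contiguous.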

> + if (ret != cur_size) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> + if (ret)
> + goto out;
> +
> + bio = bio_alloc(gfp_mask, 1);
> + bio_set_dev(bio, bdev);
> + bio->bi_opf = REQ_OP_WRITE;
> + bio->bi_iter.bi_sector = cur_dest;
> + ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +offset_in_page(payload));
> + if (ret != cur_size) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + ret = submit_bio_wait(bio);
> + bio_put(bio);
> + if (ret)
> + goto out;
> +
> + cur_dest += payload->range[i].len;
> + }

I think this would be a faster implementation if the reads were
asynchronous with a payload buffer allocated specific to that read, and
the callback can enqueue the write part. This would allow you to
accumulate all the read data and write it in a single call. 
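
Very roughly (names invented, error handling omitted):

	struct copy_ctx {
		struct work_struct work;	/* issues the write part */
		void *buf;			/* per-range read buffer */
	};

	static void copy_read_endio(struct bio *bio)
	{
		struct copy_ctx *ctx = bio->bi_private;

		/* can't submit the write from irq context; punt */
		queue_work(copy_wq, &ctx->work);
		bio_put(bio);
	}

so all the reads are in flight at once and the completion side batches
the writes instead of round-tripping one range at a time.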

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 6/6] nvme: allow revalidate to set a namespace read-only

2020-12-09 Thread Keith Busch
On Mon, Dec 07, 2020 at 02:19:18PM +0100, Christoph Hellwig wrote:
> Unconditionally call set_disk_ro now that it only updates the hardware
> state.  This allows to properly set up the Linux devices read-only when
> the controller turns a previously writable namespace read-only.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Keith Busch 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [RFC PATCH v2 0/2] add simple copy support

2020-12-07 Thread Keith Busch
On Fri, Dec 04, 2020 at 11:25:12AM +, Damien Le Moal wrote:
> On 2020/12/04 20:02, SelvaKumar S wrote:
> > This patchset tries to add support for TP4065a ("Simple Copy Command"),
> > v2020.05.04 ("Ratified")
> > 
> > The Specification can be found in following link.
> > https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip
> > 
> > This is an RFC. Looking forward for any feedbacks or other alternate
> > designs for plumbing simple copy to IO stack.
> > 
> > Simple copy command is a copy offloading operation and is  used to copy
> > multiple contiguous ranges (source_ranges) of LBA's to a single destination
> > LBA within the device reducing traffic between host and device.
> > 
> > This implementation accepts destination, no of sources and arrays of
> > source ranges from application and attach it as payload to the bio and
> > submits to the device.
> > 
> > Following limits are added to queue limits and are exposed in sysfs
> > to userspace
> > - *max_copy_sectors* limits the sum of all source_range length
> > - *max_copy_nr_ranges* limits the number of source ranges
> > - *max_copy_range_sectors* limits the maximum number of sectors
> > that can constitute a single source range.
> 
> Same comment as before. I think this is a good start, but for this to be
> really useful to users and kernel components alike, this really needs copy
> emulation for drives that do not have a native copy feature, similarly to
> what write zeroes handling does, for instance: if the drive does not have
> a copy command (simple copy for NVMe or XCOPY for scsi), then the block
> layer should issue read/write commands to seamlessly execute the copy.
> Otherwise, this will only serve a small niche for users and will not be
> optimal for FS and DM drivers that could be simplified with a generic
> block layer copy functionality.
> 
> This is my 10 cents though, others may differ about this.

Yes, I agree that copy emulation support should be included with the
hardware enabled solution.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 3/9] block: store a block_device pointer in struct bio

2020-12-04 Thread Keith Busch
On Tue, Dec 01, 2020 at 05:54:18PM +0100, Christoph Hellwig wrote:
> diff --git a/block/blk.h b/block/blk.h
> index 98f0b1ae264120..64dc8e5a3f44cb 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -99,8 +99,8 @@ static inline void blk_rq_bio_prep(struct request *rq, 
> struct bio *bio,
>   rq->bio = rq->biotail = bio;
>   rq->ioprio = bio_prio(bio);
>  
> - if (bio->bi_disk)
> - rq->rq_disk = bio->bi_disk;
> + if (bio->bi_bdev)
> + rq->rq_disk = bio->bi_bdev->bd_disk;
>  }

Just fyi, this function has been relocated to include/linux/blk-mq.h

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH v2] block: use gcd() to fix chunk_sectors limit stacking

2020-12-04 Thread Keith Busch
On Thu, Dec 03, 2020 at 09:33:59AM -0500, Mike Snitzer wrote:
> On Wed, Dec 02 2020 at 10:26pm -0500,
> Ming Lei  wrote:
> 
> > I understand it isn't related with correctness, because the underlying
> > queue can split by its own chunk_sectors limit further. So is the issue
> > too many further-splitting on queue with chunk_sectors 8? then CPU
> > utilization is increased? Or other issue?
> 
> No, this is all about correctness.
> 
> Seems you're confining the definition of the possible stacking so that
> the top-level device isn't allowed to have its own hard requirements on
> IO sizes it sends to its internal implementation.  Just because the
> underlying device can split further doesn't mean that the top-level
> virtual driver can service larger IO sizes (not if the chunk_sectors
> stacking throws away the hint the virtual driver provided because it
> used lcm_not_zero).

I may be missing something obvious here, but if the lower layers split
to their desired boundary already, why does this limit need to stack?
Won't it also work if each layer sets their desired chunk_sectors
without considering their lower layers? The commit that initially
stacked chunk_sectors doesn't provide any explanation.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 3/3] block: allow 'chunk_sectors' to be non-power-of-2

2020-09-14 Thread Keith Busch
On Sat, Sep 12, 2020 at 10:06:30PM +0800, Ming Lei wrote:
> On Fri, Sep 11, 2020 at 05:53:38PM -0400, Mike Snitzer wrote:
> > It is possible for a block device to use a non power-of-2 for chunk
> > size which results in a full-stripe size that is also a non
> > power-of-2.
> > 
> > Update blk_queue_chunk_sectors() and blk_max_size_offset() to
> > accommodate drivers that need a non power-of-2 chunk_sectors.
> > 
> > Signed-off-by: Mike Snitzer 
> > ---
> >  block/blk-settings.c   | 10 --
> >  include/linux/blkdev.h | 12 +---
> >  2 files changed, 13 insertions(+), 9 deletions(-)
> > 
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index b09642d5f15e..e40a162cc946 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -172,15 +172,13 @@ EXPORT_SYMBOL(blk_queue_max_hw_sectors);
> >   *
> >   * Description:
> >   *If a driver doesn't want IOs to cross a given chunk size, it can set
> > - *this limit and prevent merging across chunks. Note that the chunk 
> > size
> > - *must currently be a power-of-2 in sectors. Also note that the block
> > - *layer must accept a page worth of data at any offset. So if the
> > - *crossing of chunks is a hard limitation in the driver, it must still 
> > be
> > - *prepared to split single page bios.
> > + *this limit and prevent merging across chunks. Note that the block 
> > layer
> > + *must accept a page worth of data at any offset. So if the crossing of
> > + *chunks is a hard limitation in the driver, it must still be prepared
> > + *to split single page bios.
> >   **/
> >  void blk_queue_chunk_sectors(struct request_queue *q, unsigned int chunk_sectors)
> >  {
> > -   BUG_ON(!is_power_of_2(chunk_sectors));
> > q->limits.chunk_sectors = chunk_sectors;
> >  }
> >  EXPORT_SYMBOL(blk_queue_chunk_sectors);
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 453a3d735d66..e72bcce22143 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1059,11 +1059,17 @@ static inline unsigned int blk_max_size_offset(struct request_queue *q,
> >  static inline unsigned int blk_max_size_offset(struct request_queue *q,
> >sector_t offset)
> >  {
> > -   if (!q->limits.chunk_sectors)
> > +   unsigned int chunk_sectors = q->limits.chunk_sectors;
> > +
> > +   if (!chunk_sectors)
> > return q->limits.max_sectors;
> >  
> > -   return min(q->limits.max_sectors, (unsigned int)(q->limits.chunk_sectors -
> > -   (offset & (q->limits.chunk_sectors - 1))));
> > +   if (is_power_of_2(chunk_sectors))
> > +   chunk_sectors -= (offset & (chunk_sectors - 1));
> > +   else
> > +   chunk_sectors -= sector_div(offset, chunk_sectors);
> > +
> > +   return min(q->limits.max_sectors, chunk_sectors);
> >  }
> >  
> >  static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
> > -- 
> > 2.15.0
> > 
> 
> is_power_of_2() is cheap enough for fast path, so looks fine to support
> non-power-of-2 chunk sectors.
> 
> Maybe NVMe PCI can remove the power_of_2() limit too.

I'd need to see the justification for that. The boundary is just a
suggestion in NVMe. The majority of IO never crosses it so the
calculation is usually wasted CPU cycles. Crossing the boundary is going
to have to be very costly on the device side in order to justify the
host side per-IO overhead for a non-power-of-2 split. 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-10 Thread Keith Busch
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote:
> diff --git a/crypto/tcrypt.c b/crypto/tcrypt.c
> index eea0f453cfb6..8aac5bc60f4c 100644
> --- a/crypto/tcrypt.c
> +++ b/crypto/tcrypt.c
> @@ -2464,7 +2464,7 @@ static int do_test(const char *alg, u32 type, u32 mask, int m, u32 num_mb)
>   test_hash_speed("streebog512", sec,
>   generic_hash_speed_template);
>   if (mode > 300 && mode < 400) break;
> - fallthrough;
> + break;
>   case 399:
>   break;

Just imho, this change makes the preceding 'if' look even more
pointless. Maybe the fallthrough was a deliberate choice? Not that my
opinion matters here as I don't know this module, but it looked a bit
odd to me.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH] block: Improve io_opt limit stacking

2020-05-25 Thread Keith Busch
On Fri, May 22, 2020 at 09:36:18AM -0400, Martin K. Petersen wrote:
> 
> >>> + if (t->io_opt & (t->physical_block_size - 1))
> >>> + t->io_opt = lcm(t->io_opt, t->physical_block_size);
> >
> >> Any comment on this patch?  Note: the patch "nvme: Fix
> >> io_opt limit setting" is already queued for 5.8.
> >
> > Setting io_opt to the physical block size is not correct.
> 
> Oh, missed the lcm(). But I'm still concerned about twiddling io_opt to
> a value different than the one reported by an underlying device.
>
> Setting io_opt to something that's less than a full stripe width in a
> RAID, for instance, doesn't produce the expected result. So I think I'd
> prefer not to set io_opt at all if it isn't consistent across all the
> stacked devices.

We already modify the stacked io_opt value if the two request_queues
report different io_opt values. If, however, only one reports an io_opt value,
and it happens to not align with the other's physical block size, the
code currently rejects stacking those devices. Damien's patch should
just set a larger io_opt value that aligns with both, so if one io_opt
is a RAID stripe size, the stacked result will be multiple stripes.
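
For example, say one device reports an io_opt of 30k and the other has a
4k physical block size: 30k is not a multiple of 4k, so today the stacking
is rejected, while with the lcm() the result is io_opt = lcm(30k, 4k) =
60k, a whole multiple of both.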

I think that makes sense, but please do let us know if you think such
mismatched devices just shouldn't stack.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 1/4] block: Enhance blk_revalidate_disk_zones()

2019-10-24 Thread Keith Busch
On Thu, Oct 24, 2019 at 03:50:03PM +0900, Damien Le Moal wrote:
> - /* Do a report zone to get max_lba and the same field */
> - ret = sd_zbc_do_report_zones(sdkp, buf, bufsize, 0, false);
> + /* Do a report zone to get max_lba and the size of the first zone */
> + ret = sd_zbc_do_report_zones(sdkp, buf, SD_BUF_SIZE, 0, false);

This is no longer reading all the zones here, so you could set the
'partial' field to true. And then since this was the only caller that
had set partial to false, you can also remove that parameter.
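
I.e., this call site becomes just:

	ret = sd_zbc_do_report_zones(sdkp, buf, SD_BUF_SIZE, 0);

once the always-true argument is dropped.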

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] multipath-tools: add ANA support for NVMe device

2018-11-13 Thread Keith Busch
On Mon, Nov 12, 2018 at 04:53:23PM -0500, Mike Snitzer wrote:
> On Mon, Nov 12 2018 at 11:23am -0500,
> Martin Wilck  wrote:
> 
> > Hello Lijie,
> > 
> > On Thu, 2018-11-08 at 14:09 +0800, lijie wrote:
> > > Add support for Asynchronous Namespace Access as specified in NVMe
> > > 1.3
> > > TP 4004. The states are updated through reading the ANA log page.
> > > 
> > > By default, the native nvme multipath takes over the nvme device.
> > > We can pass a false to the parameter 'multipath' of the nvme-core.ko
> > > module,when we want to use multipath-tools.
> > 
> > Thank you for the patch. It looks quite good to me. I've tested it with
> > a Linux target and found no problems so far.
> > 
> > I have a few questions and comments inline below.
> > 
> > I suggest you also have a look at detect_prio(); it seems to make sense
> > to use the ana prioritizer for NVMe paths automatically if ANA is
> > supported (with your patch, "detect_prio no" and "prio ana" have to be
> > configured explicitly). But that can be done in a later patch.
> 
> I (and others) think it makes sense to at least triple check with the
> NVMe developers (now cc'd) to see if we could get agreement on the nvme
> driver providing the ANA state via sysfs (when modparam
> nvme_core.multipath=N is set), like Hannes proposed here:
> http://lists.infradead.org/pipermail/linux-nvme/2018-November/020765.html
> 
> Then the userspace multipath-tools ANA support could just read sysfs
> rather than reinvent harvesting the ANA state via ioctl.

I'd prefer not duplicating the log page parsing. Maybe nvme's ANA support
shouldn't even be tied to CONFIG_NVME_MULTIPATH so that the 'multipath'
param isn't even an issue.
 
> But if we continue to hit a wall on that type of sharing of the nvme
> driver's state then I'm fine with reinventing ANA state inquiry and
> tracking like has been proposed here.
> 
> Mike
> 
> p.s. thanks for your review Martin, we really need to determine the way
> forward for full multipath-tools support of NVMe with ANA.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2 3/5] block: Provide blk_status_t decoding for path errors

2018-01-09 Thread Keith Busch
This patch provides a common decoder for path-related block status errors
that may be retried, so the various entities wishing to consult this do
not have to duplicate the decision.

Acked-by: Mike Snitzer <snit...@redhat.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 include/linux/blk_types.h | 28 
 1 file changed, 28 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1e628e032da..2d973ac54b09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,34 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN  ((__force blk_status_t)12)
 
+/**
+ * blk_path_error - returns true if error may be path related
+ * @error: status the request was completed with
+ *
+ * Description:
+ * This classifies block error status into non-retryable errors and ones
+ * that may be successful if retried on a failover path.
+ *
+ * Return:
+ * %false - retrying failover path will not help
+ * %true  - may succeed if retried
+ */
+static inline bool blk_path_error(blk_status_t error)
+{
+   switch (error) {
+   case BLK_STS_NOTSUPP:
+   case BLK_STS_NOSPC:
+   case BLK_STS_TARGET:
+   case BLK_STS_NEXUS:
+   case BLK_STS_MEDIUM:
+   case BLK_STS_PROTECTION:
+   return false;
+   }
+
+   /* Anything else could be a path failure, so should be retried */
+   return true;
+}
+
 struct blk_issue_stat {
u64 stat;
 };
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2 5/5] dm mpath: Use blk_path_error

2018-01-09 Thread Keith Busch
Uses common code for determining if an error should be retried on
alternate path.

Acked-by: Mike Snitzer <snit...@redhat.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/md/dm-mpath.c | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..ef57c6d1c887 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1475,21 +1475,6 @@ static void activate_path_work(struct work_struct *work)
activate_or_offline_path(pgpath);
 }
 
-static int noretry_error(blk_status_t error)
-{
-   switch (error) {
-   case BLK_STS_NOTSUPP:
-   case BLK_STS_NOSPC:
-   case BLK_STS_TARGET:
-   case BLK_STS_NEXUS:
-   case BLK_STS_MEDIUM:
-   return 1;
-   }
-
-   /* Anything else could be a path failure, so should be retried */
-   return 0;
-}
-
 static int multipath_end_io(struct dm_target *ti, struct request *clone,
blk_status_t error, union map_info *map_context)
 {
@@ -1508,7 +1493,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 * request into dm core, which will remake a clone request and
 * clone bios for it and resubmit it later.
 */
-   if (error && !noretry_error(error)) {
+   if (error && blk_path_error(error)) {
struct multipath *m = ti->private;
 
r = DM_ENDIO_REQUEUE;
@@ -1544,7 +1529,7 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
 
-   if (!*error || noretry_error(*error))
+   if (!*error || !blk_path_error(*error))
goto done;
 
if (pgpath)
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2 2/5] nvme/multipath: Consult blk_status_t for failover

2018-01-09 Thread Keith Busch
This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was decoded earlier. This
abstraction from the raw NVMe status means all status decoding exists
in one place.

Acked-by: Mike Snitzer <snit...@redhat.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/core.c  |  9 +
 drivers/nvme/host/multipath.c | 44 ---
 drivers/nvme/host/nvme.h  |  5 +++--
 3 files changed, 16 insertions(+), 42 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 499f3141e365..82fb5bcfbf91 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -197,8 +197,10 @@ static inline bool nvme_req_needs_retry(struct request *req)
 
 void nvme_complete_rq(struct request *req)
 {
-   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
-   if (nvme_req_needs_failover(req)) {
+   blk_status_t status = nvme_error_status(req);
+
+   if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
+   if (nvme_req_needs_failover(req, status)) {
nvme_failover_req(req);
return;
}
@@ -209,8 +211,7 @@ void nvme_complete_rq(struct request *req)
return;
}
}
-
-   blk_mq_end_request(req, nvme_error_status(req));
+   blk_mq_end_request(req, status);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..ae9abb600c0f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -33,46 +33,18 @@ void nvme_failover_req(struct request *req)
kblockd_schedule_work(>head->requeue_work);
 }
 
-bool nvme_req_needs_failover(struct request *req)
+bool nvme_req_needs_failover(struct request *req, blk_status_t error)
 {
if (!(req->cmd_flags & REQ_NVME_MPATH))
return false;
 
-   switch (nvme_req(req)->status & 0x7ff) {
-   /*
-* Generic command status:
-*/
-   case NVME_SC_INVALID_OPCODE:
-   case NVME_SC_INVALID_FIELD:
-   case NVME_SC_INVALID_NS:
-   case NVME_SC_LBA_RANGE:
-   case NVME_SC_CAP_EXCEEDED:
-   case NVME_SC_RESERVATION_CONFLICT:
-   return false;
-
-   /*
-* I/O command set specific error.  Unfortunately these values are
-* reused for fabrics commands, but those should never get here.
-*/
-   case NVME_SC_BAD_ATTRIBUTES:
-   case NVME_SC_INVALID_PI:
-   case NVME_SC_READ_ONLY:
-   case NVME_SC_ONCS_NOT_SUPPORTED:
-   WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
-   nvme_fabrics_command);
-   return false;
-
-   /*
-* Media and Data Integrity Errors:
-*/
-   case NVME_SC_WRITE_FAULT:
-   case NVME_SC_READ_ERROR:
-   case NVME_SC_GUARD_CHECK:
-   case NVME_SC_APPTAG_CHECK:
-   case NVME_SC_REFTAG_CHECK:
-   case NVME_SC_COMPARE_FAILED:
-   case NVME_SC_ACCESS_DENIED:
-   case NVME_SC_UNWRITTEN_BLOCK:
+   switch (error) {
+   case BLK_STS_NOTSUPP:
+   case BLK_STS_NOSPC:
+   case BLK_STS_TARGET:
+   case BLK_STS_NEXUS:
+   case BLK_STS_MEDIUM:
+   case BLK_STS_PROTECTION:
return false;
}
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..8d18cfa36093 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -401,7 +401,7 @@ extern const struct block_device_operations nvme_ns_head_ops;
 
 #ifdef CONFIG_NVME_MULTIPATH
 void nvme_failover_req(struct request *req);
-bool nvme_req_needs_failover(struct request *req);
+bool nvme_req_needs_failover(struct request *req, blk_status_t error);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_disk(struct nvme_ns_head *head);
@@ -421,7 +421,8 @@ struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 static inline void nvme_failover_req(struct request *req)
 {
 }
-static inline bool nvme_req_needs_failover(struct request *req)
+static inline bool nvme_req_needs_failover(struct request *req,
+  blk_status_t error)
 {
return false;
 }
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2 0/5] nvme/dm failover unification

2018-01-09 Thread Keith Busch
Native nvme multipath provided a separate NVMe status decoder,
complicating maintenance as new statuses need to be accounted for. This
was already diverging from the generic nvme status decoder, which has
implications for other components that rely on accurate generic block
errors.

This series unifies common code among nvme and device-mapper multipath
so user experience regarding the failover fate of a command is the same.

v1 -> v2:

  Fixed blk_status_t used for NVME_SC_LBA_RANGE.

  Fixed line break formatting.

  Changed name of new block API for path related errors and added kernel
  doc for it. 

  Added reviews and acks.

Keith Busch (5):
  nvme: Add more command status translation
  nvme/multipath: Consult blk_status_t for failover
  block: Provide blk_status_t decoding for path errors
  nvme/multipath: Use blk_path_error
  dm mpath: Use blk_path_error

 drivers/md/dm-mpath.c | 19 ++-
 drivers/nvme/host/core.c  | 16 
 drivers/nvme/host/multipath.c | 44 ++-
 drivers/nvme/host/nvme.h  |  5 +++--
 include/linux/blk_types.h | 28 +++
 5 files changed, 47 insertions(+), 65 deletions(-)

-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2 1/5] nvme: Add more command status translation

2018-01-09 Thread Keith Busch
This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.

Acked-by: Mike Snitzer <snit...@redhat.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2a69d735efbc..499f3141e365 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -157,13 +157,20 @@ static blk_status_t nvme_error_status(struct request *req)
return BLK_STS_OK;
case NVME_SC_CAP_EXCEEDED:
return BLK_STS_NOSPC;
+   case NVME_SC_LBA_RANGE:
+   return BLK_STS_TARGET;
+   case NVME_SC_BAD_ATTRIBUTES:
case NVME_SC_ONCS_NOT_SUPPORTED:
+   case NVME_SC_INVALID_OPCODE:
+   case NVME_SC_INVALID_FIELD:
+   case NVME_SC_INVALID_NS:
return BLK_STS_NOTSUPP;
case NVME_SC_WRITE_FAULT:
case NVME_SC_READ_ERROR:
case NVME_SC_UNWRITTEN_BLOCK:
case NVME_SC_ACCESS_DENIED:
case NVME_SC_READ_ONLY:
+   case NVME_SC_COMPARE_FAILED:
return BLK_STS_MEDIUM;
case NVME_SC_GUARD_CHECK:
case NVME_SC_APPTAG_CHECK:
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/5] nvme/multipath: Consult blk_status_t for failover

2018-01-09 Thread Keith Busch
On Mon, Jan 08, 2018 at 01:57:07AM -0800, Christoph Hellwig wrote:
> > -   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
> > -   if (nvme_req_needs_failover(req)) {
> > +   blk_status_t status = nvme_error_status(req);
> > +
> > +   if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
> > +   if (nvme_req_needs_failover(req, status)) {
> 
> We don't really need to call nvme_error_status if nvme_req(req)->status
> is zero.

We are already calling nvme_error_status unconditionally for
blk_mq_end_request, so we currently read nvme_req(req)->status multiple
times in the completion path. I think we'd prefer to read it just once.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/5] nvme: Add more command status translation

2018-01-08 Thread Keith Busch
On Mon, Jan 08, 2018 at 04:34:36PM +0100, Christoph Hellwig wrote:
> It's basically a kernel bug as it tries to access lbas that do not
> exist.  BLK_STS_TARGET should be fine.

Okay, I'll fix this and address your other comments, and resend. Thanks
for the feedback.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 0/5] Failover criteria unification

2018-01-04 Thread Keith Busch
On Thu, Jan 04, 2018 at 06:36:27PM -0500, Mike Snitzer wrote:
> Right, I dropped that patch since it'd have only resulted in conflicts
> come merge time.  As such, this series can easily go through the nvme
> tree to Jens.

It looks like you can also touch up dm to allow it to multipath nvme
even if CONFIG_NVME_MULTIPATH is set. It may be useful since native NVMe
doesn't multipath namespaces across subsystems, and some crack smoking
people want to create something that requires that.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 5/5] dm mpath: Use blk_retryable

2018-01-04 Thread Keith Busch
Uses common code for determining if an error should be retried on
alternate path.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/md/dm-mpath.c | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..3eb9500db1b3 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1475,21 +1475,6 @@ static void activate_path_work(struct work_struct *work)
activate_or_offline_path(pgpath);
 }
 
-static int noretry_error(blk_status_t error)
-{
-   switch (error) {
-   case BLK_STS_NOTSUPP:
-   case BLK_STS_NOSPC:
-   case BLK_STS_TARGET:
-   case BLK_STS_NEXUS:
-   case BLK_STS_MEDIUM:
-   return 1;
-   }
-
-   /* Anything else could be a path failure, so should be retried */
-   return 0;
-}
-
 static int multipath_end_io(struct dm_target *ti, struct request *clone,
blk_status_t error, union map_info *map_context)
 {
@@ -1508,7 +1493,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 * request into dm core, which will remake a clone request and
 * clone bios for it and resubmit it later.
 */
-   if (error && !noretry_error(error)) {
+   if (error && blk_retryable(error)) {
struct multipath *m = ti->private;
 
r = DM_ENDIO_REQUEUE;
@@ -1544,7 +1529,7 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
 
-   if (!*error || noretry_error(*error))
+   if (!*error || !blk_retryable(*error))
goto done;
 
if (pgpath)
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 4/5] nvme/multipath: Use blk_retryable

2018-01-04 Thread Keith Busch
Uses common code for determining if an error should be retried on
alternate path.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/multipath.c | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ae9abb600c0f..93bb72b6efb6 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -37,19 +37,7 @@ bool nvme_req_needs_failover(struct request *req, blk_status_t error)
 {
if (!(req->cmd_flags & REQ_NVME_MPATH))
return false;
-
-   switch (error) {
-   case BLK_STS_NOTSUPP:
-   case BLK_STS_NOSPC:
-   case BLK_STS_TARGET:
-   case BLK_STS_NEXUS:
-   case BLK_STS_MEDIUM:
-   case BLK_STS_PROTECTION:
-   return false;
-   }
-
-   /* Everything else could be a path failure, so should be retried */
-   return true;
+   return blk_retryable(error);
 }
 
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 1/5] nvme: Add more command status translation

2018-01-04 Thread Keith Busch
This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2a69d735efbc..f1cf1f6c5b28 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -156,14 +156,20 @@ static blk_status_t nvme_error_status(struct request *req)
case NVME_SC_SUCCESS:
return BLK_STS_OK;
case NVME_SC_CAP_EXCEEDED:
+   case NVME_SC_LBA_RANGE:
return BLK_STS_NOSPC;
+   case NVME_SC_BAD_ATTRIBUTES:
case NVME_SC_ONCS_NOT_SUPPORTED:
+   case NVME_SC_INVALID_OPCODE:
+   case NVME_SC_INVALID_FIELD:
+   case NVME_SC_INVALID_NS:
return BLK_STS_NOTSUPP;
case NVME_SC_WRITE_FAULT:
case NVME_SC_READ_ERROR:
case NVME_SC_UNWRITTEN_BLOCK:
case NVME_SC_ACCESS_DENIED:
case NVME_SC_READ_ONLY:
+   case NVME_SC_COMPARE_FAILED:
return BLK_STS_MEDIUM;
case NVME_SC_GUARD_CHECK:
case NVME_SC_APPTAG_CHECK:
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 3/5] block: Provide blk_status_t decoding for retryable errors

2018-01-04 Thread Keith Busch
This patch provides a common decoder for block status that may be retried,
so various entities wishing to consult this do not have to duplicate
this decision.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 include/linux/blk_types.h | 16 
 1 file changed, 16 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1e628e032da..b6a8723b493c 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -39,6 +39,22 @@ typedef u8 __bitwise blk_status_t;
 
 #define BLK_STS_AGAIN  ((__force blk_status_t)12)
 
+static inline bool blk_retryable(blk_status_t error)
+{
+   switch (error) {
+   case BLK_STS_NOTSUPP:
+   case BLK_STS_NOSPC:
+   case BLK_STS_TARGET:
+   case BLK_STS_NEXUS:
+   case BLK_STS_MEDIUM:
+   case BLK_STS_PROTECTION:
+   return false;
+   }
+
+   /* Anything else could be a path failure, so should be retried */
+   return true;
+}
+
 struct blk_issue_stat {
u64 stat;
 };
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 2/5] nvme/multipath: Consult blk_status_t for failover

2018-01-04 Thread Keith Busch
This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was translated earlier. This
abstraction from the raw NVMe status means nvme status decoding exists
in just one place.

Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
 drivers/nvme/host/core.c  |  9 +
 drivers/nvme/host/multipath.c | 44 ---
 drivers/nvme/host/nvme.h  |  4 ++--
 3 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f1cf1f6c5b28..4afc681f7089 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -196,8 +196,10 @@ static inline bool nvme_req_needs_retry(struct request *req)
 
 void nvme_complete_rq(struct request *req)
 {
-   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
-   if (nvme_req_needs_failover(req)) {
+   blk_status_t status = nvme_error_status(req);
+
+   if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
+   if (nvme_req_needs_failover(req, status)) {
nvme_failover_req(req);
return;
}
@@ -208,8 +210,7 @@ void nvme_complete_rq(struct request *req)
return;
}
}
-
-   blk_mq_end_request(req, nvme_error_status(req));
+   blk_mq_end_request(req, status);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..ae9abb600c0f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -33,46 +33,18 @@ void nvme_failover_req(struct request *req)
kblockd_schedule_work(&ns->head->requeue_work);
 }
 
-bool nvme_req_needs_failover(struct request *req)
+bool nvme_req_needs_failover(struct request *req, blk_status_t error)
 {
if (!(req->cmd_flags & REQ_NVME_MPATH))
return false;
 
-   switch (nvme_req(req)->status & 0x7ff) {
-   /*
-* Generic command status:
-*/
-   case NVME_SC_INVALID_OPCODE:
-   case NVME_SC_INVALID_FIELD:
-   case NVME_SC_INVALID_NS:
-   case NVME_SC_LBA_RANGE:
-   case NVME_SC_CAP_EXCEEDED:
-   case NVME_SC_RESERVATION_CONFLICT:
-   return false;
-
-   /*
-* I/O command set specific error.  Unfortunately these values are
-* reused for fabrics commands, but those should never get here.
-*/
-   case NVME_SC_BAD_ATTRIBUTES:
-   case NVME_SC_INVALID_PI:
-   case NVME_SC_READ_ONLY:
-   case NVME_SC_ONCS_NOT_SUPPORTED:
-   WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
-   nvme_fabrics_command);
-   return false;
-
-   /*
-* Media and Data Integrity Errors:
-*/
-   case NVME_SC_WRITE_FAULT:
-   case NVME_SC_READ_ERROR:
-   case NVME_SC_GUARD_CHECK:
-   case NVME_SC_APPTAG_CHECK:
-   case NVME_SC_REFTAG_CHECK:
-   case NVME_SC_COMPARE_FAILED:
-   case NVME_SC_ACCESS_DENIED:
-   case NVME_SC_UNWRITTEN_BLOCK:
+   switch (error) {
+   case BLK_STS_NOTSUPP:
+   case BLK_STS_NOSPC:
+   case BLK_STS_TARGET:
+   case BLK_STS_NEXUS:
+   case BLK_STS_MEDIUM:
+   case BLK_STS_PROTECTION:
return false;
}
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa5283e8e..cf8548c1d7ec 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -401,7 +401,7 @@ extern const struct block_device_operations nvme_ns_head_ops;
 
 #ifdef CONFIG_NVME_MULTIPATH
 void nvme_failover_req(struct request *req);
-bool nvme_req_needs_failover(struct request *req);
+bool nvme_req_needs_failover(struct request *req, blk_status_t error);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_disk(struct nvme_ns_head *head);
@@ -421,7 +421,7 @@ struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
 static inline void nvme_failover_req(struct request *req)
 {
 }
-static inline bool nvme_req_needs_failover(struct request *req)
+static inline bool nvme_req_needs_failover(struct request *req, blk_status_t error)
 {
return false;
 }
-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH 0/5] Failover criteria unification

2018-01-04 Thread Keith Busch
The nvme native multipath provided a separate NVMe status decoder,
complicating maintenance as new statuses need to be accounted for. This
was already diverging from the generic nvme status decoder, which has
implications for other components that rely on accurate generic block
errors.

This series unifies common code among nvme and device-mapper multipath
so that user experience regarding the failover fate of a command is the
same.

Mike:

I split this up because I thought there'd be trouble merging the dm
mpath update with the inverted retry logic, but I think you may have
rebased the dm-4.16 without that patch, as I'm not seeing it in the most
current branch.

Keith Busch (5):
  nvme: Add more command status translation
  nvme/multipath: Consult blk_status_t for failover
  block: Provide blk_status_t decoding for retryable errors
  nvme/multipath: Use blk_retryable
  dm mpath: Use blk_retryable

 drivers/md/dm-mpath.c | 19 ++-
 drivers/nvme/host/core.c  | 15 +++
 drivers/nvme/host/multipath.c | 44 ++-
 drivers/nvme/host/nvme.h  |  4 ++--
 include/linux/blk_types.h | 16 
 5 files changed, 33 insertions(+), 65 deletions(-)

-- 
2.13.6

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [for-4.16 PATCH v2 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler

2018-01-02 Thread Keith Busch
Instead of hiding NVMe path related errors, the NVMe driver needs to
encode an appropriate generic block status from an NVMe status.

We already do this translation whether or not CONFIG_NVME_MULTIPATH is
set, so I think it's silly NVMe native multipathing has a second status
decoder. This just doubles the work if we need to handle any new NVMe
status codes in the future.

I have a counter-proposal below that unifies NVMe-to-block status
translations, and combines common code for determining if an error is a
path failure. This should work for both NVMe and DM, and DM won't need
NVMe specifics.

I can split this into a series if there's indication this is ok and
satisfies the need.

---
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index f7810cc869ac..f6f7a1aefc53 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1475,21 +1475,6 @@ static void activate_path_work(struct work_struct *work)
activate_or_offline_path(pgpath);
 }
 
-static int noretry_error(blk_status_t error)
-{
-   switch (error) {
-   case BLK_STS_NOTSUPP:
-   case BLK_STS_NOSPC:
-   case BLK_STS_TARGET:
-   case BLK_STS_NEXUS:
-   case BLK_STS_MEDIUM:
-   return 1;
-   }
-
-   /* Anything else could be a path failure, so should be retried */
-   return 0;
-}
-
 static int multipath_end_io(struct dm_target *ti, struct request *clone,
blk_status_t error, union map_info *map_context)
 {
@@ -1508,7 +1493,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 * request into dm core, which will remake a clone request and
 * clone bios for it and resubmit it later.
 */
-   if (error && !noretry_error(error)) {
+   if (error && blk_path_failure(error)) {
struct multipath *m = ti->private;
 
r = DM_ENDIO_REQUEUE;
@@ -1544,7 +1529,7 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
unsigned long flags;
int r = DM_ENDIO_DONE;
 
-   if (!*error || noretry_error(*error))
+   if (!*error || !blk_path_failure(*error))
goto done;
 
if (pgpath)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f837d666cbd4..25ef349fd4e4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -190,20 +190,18 @@ static inline bool nvme_req_needs_retry(struct request *req)
 
 void nvme_complete_rq(struct request *req)
 {
-   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
-   if (nvme_req_needs_failover(req)) {
-   nvme_failover_req(req);
-   return;
-   }
+   blk_status_t status = nvme_error_status(req);
 
+   if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
+   if (nvme_failover_req(req, status))
+   return;
if (!blk_queue_dying(req->q)) {
nvme_req(req)->retries++;
blk_mq_requeue_request(req, true);
return;
}
}
-
-   blk_mq_end_request(req, nvme_error_status(req));
+   blk_mq_end_request(req, status);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1218a9fca846..fa3d96780258 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -19,11 +19,16 @@ module_param(multipath, bool, 0644);
 MODULE_PARM_DESC(multipath,
"turn on native support for multiple controllers per subsystem");
 
-void nvme_failover_req(struct request *req)
+bool nvme_failover_req(struct request *req, blk_status_t status)
 {
struct nvme_ns *ns = req->q->queuedata;
unsigned long flags;
 
+   if (!(req->cmd_flags & REQ_NVME_MPATH))
+   return false;
+   if (!blk_path_failure(status))
+   return false;
+
spin_lock_irqsave(&ns->head->requeue_lock, flags);
blk_steal_bios(&ns->head->requeue_list, req);
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
@@ -31,52 +36,6 @@ void nvme_failover_req(struct request *req)
 
nvme_reset_ctrl(ns->ctrl);
kblockd_schedule_work(&ns->head->requeue_work);
-}
-
-bool nvme_req_needs_failover(struct request *req)
-{
-   if (!(req->cmd_flags & REQ_NVME_MPATH))
-   return false;
-
-   switch (nvme_req(req)->status & 0x7ff) {
-   /*
-* Generic command status:
-*/
-   case NVME_SC_INVALID_OPCODE:
-   case NVME_SC_INVALID_FIELD:
-   case NVME_SC_INVALID_NS:
-   case NVME_SC_LBA_RANGE:
-   case NVME_SC_CAP_EXCEEDED:
-   case NVME_SC_RESERVATION_CONFLICT:
-   return false;
-
-   /*
-* I/O command set specific error.  Unfortunately these values are
-* reused for fabrics commands, but those should never get here.
-*/
-   case 

Re: [dm-devel] [for-4.16 PATCH 0/5] block, nvme, dm: allow DM multipath to use NVMe's error handler

2017-12-26 Thread Keith Busch
On Tue, Dec 19, 2017 at 04:05:41PM -0500, Mike Snitzer wrote:
> These patches enable DM multipath to work well on NVMe over Fabrics
> devices.  Currently that implies CONFIG_NVME_MULTIPATH is _not_ set.
> 
> But follow-on work will be to make it so that native NVMe multipath
> and DM multipath can be made to co-exist (e.g. blacklisting certain
> NVMe devices from being consumed by native NVMe multipath?)

Hi Mike,

I've reviewed the series and I support the goal. I'm not a big fan,
though, of having yet-another-field to set in bio and req on each IO.

Unless I'm missing something, I think we can make this simpler if you add
the new 'failover_req_fn' as an attribute of the struct request_queue
instead of threading it through bio and request. Native nvme multipath
can set the field directly in the nvme driver, and dm-mpath can set it
in each path when not using the nvme mpath. What do you think?
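
Roughly, I'm picturing something like the sketch below; the per-queue
hook is hypothetical, not code from this series:

struct request_queue {
	/* ... existing fields ... */
	bool (*failover_req_fn)(struct request *req, blk_status_t status);
};

/*
 * Native NVMe multipath would set the hook once at namespace queue setup
 * (ns->queue->failover_req_fn = nvme_failover_req), dm-mpath on each
 * underlying path's queue, and the completion path would then need no
 * per-request flag:
 */
void nvme_complete_rq(struct request *req)
{
	blk_status_t status = nvme_error_status(req);

	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
		if (req->q->failover_req_fn &&
		    req->q->failover_req_fn(req, status))
			return;
		/* ... requeue/retry as before ... */
	}
	blk_mq_end_request(req, status);
}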

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 2/2] dm unstripe: Add documentation for unstripe target

2017-12-12 Thread Keith Busch
On Tue, Dec 12, 2017 at 01:35:13PM +0200, Nikolay Borisov wrote:
> On 11.12.2017 18:00, Scott Bauer wrote:
> > +As an example:
> > +
> > +Intel NVMe drives contain two cores on the physical device.
> > +Each core of the drive has segregated access to its LBA range.
> > +The current LBA model has a RAID 0 128k stripe across the two cores:
> > +
> > +   Core 0:Core 1:
> > +  ____
> > +  | LBA 511|| LBA 768|
> > +  | LBA 0  || LBA 256|
> > +  ⎻⎻⎻⎻
> 
> If it's 128k stripe shouldn't it be LBAs 0/256 on core0 and LBAs 128/511
> on core1?

Ah, this device's makers call what should properly be called the "chunk"
size a "stripe". The device has a 128k chunk per core with two cores, so
the full stripe is 256k. The diagram above should also show core 0 owning
LBA 512 rather than 511 (assuming a 512b LBA format).
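
Put concretely, core selection is just modular arithmetic on the chunk
number; a sketch of the math only (assuming a 512b LBA format), not the
device's firmware:

#define LBAS_PER_CHUNK	256	/* 128k chunk / 512b LBA */
#define NUM_CORES	2

static unsigned int lba_to_core(unsigned long long lba)
{
	return (lba / LBAS_PER_CHUNK) % NUM_CORES;
}

/*
 * lba_to_core(0)   == 0, lba_to_core(255) == 0,
 * lba_to_core(256) == 1, lba_to_core(511) == 1,
 * lba_to_core(512) == 0, and so on.
 */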

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH v2 2/2] dm unstripe: Add documentation for unstripe target

2017-12-11 Thread Keith Busch
On Mon, Dec 11, 2017 at 09:00:19AM -0700, Scott Bauer wrote:
> +Example scripts:
> +
> +
> +dmsetup create nvmset1 --table '0 1 dm-unstripe /dev/nvme0n1 1 2 0'
> +dmsetup create nvmset0 --table '0 1 dm-unstripe /dev/nvme0n1 0 2 0'
> +
> +There will now be two mappers:
> +/dev/mapper/nvmset1
> +/dev/mapper/nvmset0
> +
> +that will expose core 0 and core 1.
> +
> +
> +In a Raid 0 with 4 drives of stripe size 128K:
> +dmsetup create raid_disk0 --table '0 1 dm-unstripe /dev/nvme0n1 0 4 256'
> +dmsetup create raid_disk1 --table '0 1 dm-unstripe /dev/nvme0n1 1 4 256'
> +dmsetup create raid_disk2 --table '0 1 dm-unstripe /dev/nvme0n1 2 4 256'
> +dmsetup create raid_disk3 --table '0 1 dm-unstripe /dev/nvme0n1 3 4 256'

While this device mapper is intended for H/W RAID where the member disks
are hidden, we can test it using DM software striping so we don't need
any particular hardware.

Here's a little test script I wrote for that. It sets up a striped
device backed by files, unstripes it into different sets, then compares
each to its original backing file after writing random data to it,
cleaning up the test artifacts before exiting. The parameters at the
top can be modified to test different striping scenarios.

---
#!/bin/bash

# Test parameters: 4 members striped with a 256-sector (128k) chunk.
MEMBER_SIZE=$((128 * 1024 * 1024))
NUM=4
SEQ_END=$((${NUM}-1))
CHUNK=256
BS=4096

RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
COUNT=$((${MEMBER_SIZE} / ${BS}))

# Create the backing files and attach each one to a loop device.
for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
  losetup /dev/loop${i} member-${i}
  DM_PARMS+=" /dev/loop${i} 0"
done

# Stripe across the loop devices, then unstripe each set back out.
echo $DM_PARMS | dmsetup create raid0
for i in $(seq 0 ${SEQ_END}); do
  echo "0 1 dm-unstripe /dev/mapper/raid0 ${i} ${NUM} ${CHUNK}" | dmsetup create set-${i}
done;

# Write random data through each unstriped set and verify it matches the
# corresponding backing file.
for i in $(seq 0 ${SEQ_END}); do
  dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
  diff /dev/mapper/set-${i} member-${i}
done;

# Clean up the test artifacts.
for i in $(seq 0 ${SEQ_END}); do
  dmsetup remove set-${i}
done
dmsetup remove raid0

for i in $(seq 0 ${SEQ_END}); do
  losetup -d /dev/loop${i}
  rm -f member-${i}
done
--
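
For reference, the remapping each set-${i} device has to perform boils
down to the arithmetic below; this is a sketch of the math only, not the
target's actual implementation:

#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Map a sector of unstriped set 'unstripe' to its location on the
 * underlying striped device (num sets, chunk size in 512b sectors).
 */
static sector_t unstripe_map(sector_t sector, unsigned int unstripe,
			     unsigned int num, unsigned int chunk)
{
	sector_t chunk_idx = sector / chunk;	/* which chunk of this set */
	sector_t offset = sector % chunk;	/* offset within the chunk */

	return (chunk_idx * num + unstripe) * chunk + offset;
}

/* e.g. with num=4, chunk=256: sector 0 of set-1 maps to sector 256. */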

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 1/2] dm-unstripe: unstripe of IO across RAID 0

2017-12-11 Thread Keith Busch
On Mon, Dec 11, 2017 at 09:00:18AM -0700, Scott Bauer wrote:
> +
> + dm_set_target_max_io_len(ti, target->max_hw_sectors);

The return for this function has "__must_check", so it's currently
throwing a compiler warning.
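
A minimal sketch of the kind of check that would quiet it (the error
handling shown is made up; use whatever cleanup the constructor already
does):

	int ret;

	ret = dm_set_target_max_io_len(ti, target->max_hw_sectors);
	if (ret)
		return ret;	/* or goto the ctr's cleanup label */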

Otherwise, this looks like it's doing what you want, and tests
successfully on my synthetic workloads.

Acked-by: Keith Busch <keith.bu...@intel.com>

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm: submit stacked requests in irq enabled context

2017-05-09 Thread Keith Busch
On Tue, May 09, 2017 at 03:19:31PM +0530, Neeraj Soni wrote:
> + Alasdair and dm-devel for awareness and inputs.
>
> On 5/9/2017 12:26 PM, Neeraj Soni wrote:
>
> Hi Keith/Snitzer,
>
> I have recently started using kernel 4.4 on an Android device and ran
> Androbench to check storage read/write performance. I found that the
> random read (RR) and random write (RW) performance with Full Disk
> Encryption is degraded compared to no encryption. Initially I thought
> this must be an issue with the storage part used, so I compared the
> performance of a similar storage part on a device running Android with
> kernel 3.18. I found that with no encryption the performance was
> equivalent to the device using 4.4, but with Full Disk Encryption there
> was degradation in RR (~20%) and RW (~10%).
>
> I then compared the changes kernel 4.4 brought to the Full Disk
> Encryption path. I came across the patch mentioned in the subject and
> found that a worker thread is now scheduled in dm_request_fn() to
> process requests instead of map_request() being invoked directly, as
> was done in kernel 3.18.
>
> I reverted this patch and found that the RR and RW performance was now
> closer to what we have without encryption. From the commit message I
> understand that this change is significant and will be required for
> blk-mq support, but have you come across such a degradation with your
> patch, and is any fix for it available?

Just for comparison, could you check performance on a more recent stable
kernel?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCHv2] Fill NVMe specific path info

2017-02-21 Thread Keith Busch
Signed-off-by: Keith Busch <keith.bu...@intel.com>
---
v1->v2:
  Removed explicitly setting the wwid path information. We get that with
  through exported udev attributes.

  Added a default of retain_hwhandler off for NVMe devices. This keeps
  the kernel from calling into SCSI-specific APIs that don't exist for
  NVMe.

 libmultipath/discovery.c | 35 +++
 libmultipath/hwtable.c   | 10 ++
 libmultipath/structs.h   |  1 +
 3 files changed, 46 insertions(+)

diff --git a/libmultipath/discovery.c b/libmultipath/discovery.c
index d1aec31..39dd92a 100644
--- a/libmultipath/discovery.c
+++ b/libmultipath/discovery.c
@@ -1187,6 +1187,36 @@ scsi_sysfs_pathinfo (struct path * pp, vector hwtable)
 }
 
 static int
+nvme_sysfs_pathinfo (struct path * pp, vector hwtable)
+{
+   struct udev_device *parent, *nvme = NULL;
+
+   parent = pp->udev;
+   while (parent) {
+   const char *subsys = udev_device_get_subsystem(parent);
+
+   if (subsys && !strncmp(subsys, "nvme", 4)) {
+   nvme = parent;
+   break;
+   }
+   parent = udev_device_get_parent(parent);
+   }
+   if (!nvme)
+   return 1;
+
+   snprintf(pp->vendor_id, SCSI_VENDOR_SIZE, "NVME");
+   snprintf(pp->product_id, SCSI_PRODUCT_SIZE, "%s", udev_device_get_sysattr_value(nvme, "model"));
+   snprintf(pp->serial, SERIAL_SIZE, "%s", udev_device_get_sysattr_value(nvme, "serial"));
+   snprintf(pp->rev, SCSI_REV_SIZE, "%s", udev_device_get_sysattr_value(nvme, "firmware_rev"));
+
+   condlog(3, "%s: vendor:%s product:%s serial:%s rev:%s", pp->dev,
+   pp->vendor_id, pp->product_id, pp->serial, pp->rev);
+   pp->hwe = find_hwe(hwtable, pp->vendor_id, pp->product_id, NULL);
+
+   return 0;
+}
+
+static int
 rbd_sysfs_pathinfo (struct path * pp, vector hwtable)
 {
sprintf(pp->vendor_id, "Ceph");
@@ -1405,6 +1435,8 @@ sysfs_pathinfo(struct path * pp, vector hwtable)
pp->bus = SYSFS_BUS_SCSI;
if (!strncmp(pp->dev,"rbd", 3))
pp->bus = SYSFS_BUS_RBD;
+   if (!strncmp(pp->dev,"nvme", 4))
+   pp->bus = SYSFS_BUS_NVME;
 
if (pp->bus == SYSFS_BUS_UNDEF)
return 0;
@@ -1420,6 +1452,9 @@ sysfs_pathinfo(struct path * pp, vector hwtable)
} else if (pp->bus == SYSFS_BUS_RBD) {
if (rbd_sysfs_pathinfo(pp, hwtable))
return 1;
+   } else if (pp->bus == SYSFS_BUS_NVME) {
+   if (nvme_sysfs_pathinfo(pp, hwtable))
+   return 1;
}
return 0;
 }
diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
index f5a5f7b..8409261 100644
--- a/libmultipath/hwtable.c
+++ b/libmultipath/hwtable.c
@@ -1073,6 +1073,16 @@ static struct hwentry default_hw[] = {
.prio_name = PRIO_ALUA,
.no_path_retry = 30,
},
+   /*
+* Generic NVMe devices
+*/
+   {
+   .vendor= "NVME",
+   .product   = ".*",
+   .uid_attribute = "ID_WWN",
+   .checker_name  = DIRECTIO,
+   .retain_hwhandler = RETAIN_HWHANDLER_OFF,
+   },
 #if 0
/*
 * Copy this TEMPLATE to add new hardware.
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 6edd927..dfd65ae 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -53,6 +53,7 @@ enum sysfs_buses {
SYSFS_BUS_CCW,
SYSFS_BUS_CCISS,
SYSFS_BUS_RBD,
+   SYSFS_BUS_NVME,
 };
 
 enum pathstates {
-- 
2.5.5
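
For anyone testing without the tools, the attributes nvme_sysfs_pathinfo()
consults are plain sysfs files. A quick standalone check, as a sketch
only; "nvme0" is just an example controller name:

#include <stdio.h>

int main(void)
{
	static const char *attrs[] = { "model", "serial", "firmware_rev" };
	char path[64], buf[128];
	unsigned int i;

	for (i = 0; i < 3; i++) {
		FILE *f;

		snprintf(path, sizeof(path), "/sys/class/nvme/nvme0/%s",
			 attrs[i]);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("%s: %s", attrs[i], buf);
		fclose(f);
	}
	return 0;
}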

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] Fill NVMe specific path info

2017-02-21 Thread Keith Busch
On Mon, Feb 20, 2017 at 11:57:59AM -0600, Benjamin Marzinski wrote:
> > +
> > +   snprintf(pp->vendor_id, SCSI_VENDOR_SIZE, "NVME");
> > +   snprintf(pp->product_id, SCSI_PRODUCT_SIZE, "%s", udev_device_get_sysattr_value(nvme, "model"));
> > +   snprintf(pp->serial, SERIAL_SIZE, "%s", udev_device_get_sysattr_value(nvme, "serial"));
> > +   snprintf(pp->rev, SCSI_REV_SIZE, "%s", udev_device_get_sysattr_value(nvme, "firmware_rev"));
> > +   snprintf(pp->wwid, WWID_SIZE, "%s", udev_device_get_sysattr_value(pp->udev, "wwid"));
> 
> With your udev rules change, you shouldn't need to set the wwid here. It
> should just get dealt with by the get_uid call in pathinfo (or with
> the new uevent merging code, before multipathd ever calls pathinfo).

Oh right, you are correct. I can spin this with that removed.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 01:21:29PM -0500, Mike Snitzer wrote:
> Then undeprecate them.  Decisions like marking a path checker deprecated
> were _not_ made with NVMe in mind.  They must predate NVMe.
> 
> multipath-tools has tables that specify all the defaults for a given
> target backend.  NVMe will just be yet another.  Yes some user _could_
> shoot themselves in the foot by overriding the proper configuration but
> since when are we motivated by _not_ giving users the power to hang
> themselves?
> 
> As for configurability (choosing between N valid configs/settings): At
> some point the user will want one behaviour vs another.  Thinking
> otherwise is just naive.  Think error timeouts, etc.  Any multipath
> kernel implementation (which dm-multipath is BTW) will eventually find
> itself at a crossroads where the underlying fabric could be tweaked in
> different ways.  Thinking you can just hardcode these attributes and
> settings is foolish.

Roger that, and I absolutely want to see this work with the existing
framework.

I just think it'd be easier for everyone if multipath were more like
the generic block layer, in that devices are surfaced with configurable
policies without userspace having to tell it which to use. The kernel knowing
safe defaults for a particular device is probably the more common case,
and userspace can still tune them as needed. Of course, I accept you're
in a better position to know if this is folly.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 05:37:41PM +, Bart Van Assche wrote:
> On Thu, 2017-02-16 at 12:38 -0500, Keith Busch wrote:
> > Maybe I'm not seeing the bigger picture. Is there some part to multipath
> > that the kernel is not in a better position to handle?
> 
> Does this mean that the code to parse /etc/multipath.conf will be moved into
> the kernel? Or will we lose the ability to configure the policies that
> /etc/multipath.conf allows to configure?

No, I'm just considering the settings for a device that won't work
at all if multipath.conf is wrong. For example, the uuid attributes,
path priority, or path checker. These can't be considered configurable
policies if all but one of them are invalid for a specific device type.

It shouldn't even be an option to let a user select the TUR path checker
for NVMe, and the only checkers multipath-tools provides that even make
sense for NVMe are deprecated.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] hch's native NVMe multipathing [was: Re: [PATCH 1/2] Don't blacklist nvme]

2017-02-16 Thread Keith Busch
On Thu, Feb 16, 2017 at 10:13:37AM -0500, Mike Snitzer wrote:
> On Thu, Feb 16 2017 at  9:26am -0500,
> Christoph Hellwig  wrote:
>
> > just a little new code in the block layer, and a move of the path
> > selectors from dm to the block layer.  I would not call this
> > fragmentation.
> 
> I'm fine with the path selectors getting moved out; maybe it'll
> encourage new path selectors to be developed.
> 
> But there will need to be some userspace interface stood up to support
> your native NVMe multipathing (you may not think it needed but think in
> time there will be a need to configure _something_).  That is the
> fragmentation I'm referring to.

I'm not sure what Christoph's proposal looks like, but I have to agree
that multipath support directly in the kernel without requiring user
space to set up the mpath block device is easier for everyone. The only
NVMe-specific part, though, just needs to be how it reports unique
identifiers to the multipath layer.

Maybe I'm not seeing the bigger picture. Is there some part to multipath
that the kernel is not in a better position to handle?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/2] Don't blacklist nvme

2017-02-15 Thread Keith Busch
On Wed, Feb 15, 2017 at 06:57:21AM -0800, Christoph Hellwig wrote:
> On Tue, Feb 14, 2017 at 06:00:23PM -0500, Keith Busch wrote:
> > Good point. I was unknowingly running with CONFIG_SCSI_DH disabled,
> > and blissfully unaware of its existence! After enabling that option,
> > I see what you mean.
> 
> Someone needs to fix that crash ASAP.  I though Hannes had patches for
> them a long time ago in the SuSE tree, but they never went anywhere.

Adding Hannes.

I can work around this by setting NVMe-specific parameters from
multipath-tools, but that seems wrong. Blindly calling SCSI-specific code
directly from dm-mpath isn't a good idea.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/2] Don't blacklist nvme

2017-02-14 Thread Keith Busch
On Tue, Feb 14, 2017 at 01:35:45PM -0800, Bart Van Assche wrote:
> On 02/14/2017 01:19 PM, Keith Busch wrote:
> > These devices are mulitpath capable, and have been able to stack with
> > dm-mpath since kernel 4.2.
> > 
> > -   str = STRDUP("^(ram|raw|loop|fd|md|dm-|sr|scd|st|dcssblk|nvme)[0-9]");
> > +   str = STRDUP("^(ram|raw|loop|fd|md|dm-|sr|scd|st|dcssblk)[0-9]");
> 
> Have you checked whether dm-mpath works properly with nvme? Last time I
> tried that dm-mpath crashed when it called scsi_dh_attach() because that
> last function assumes that it is passed a SCSI device queue instead of
> checking whether or not the queue passed to that function is a SCSI
> device queue.

Good point. I was unknowingly running with CONFIG_SCSI_DH disabled,
and blissfully unaware of its existence! After enabling that option,
I see what you mean.

If we don't want to mess with the kernel, I can change the multipath-tools
to get around that by appending the following to the NVMe hwentry in the
second patch in this series:

.retain_hwhandler = RETAIN_HWHANDLER_OFF,

And the problem goes away.

I'm not sure yet what the other implications are of setting that (I will
look into it, but I'm out of time today).

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] NVMeoF multi-path setup

2016-06-30 Thread Keith Busch
On Thu, Jun 30, 2016 at 06:52:07PM -0400, Mike Snitzer wrote:
> AFAIK, hch had Intel disable that by default in the hopes of avoiding
> people having dm-multipath "just work" with NVMeoF.  (Makes me wonder
> what other unpleasant unilateral decisions were made because some
> non-existant NVMe specific multipath capabilities would be forthcoming
> but I digress).

For the record, Intel was okay with making SCSI a separate config option,
but I was pretty clear about our wish to let it default to 'Y', which
didn't happen. :)

To be fair, NVMe's SCSI translation is a bit of a kludge, and we have
better ways to get device identification now. Specifically, the block
device exposes an 'ATTR{wwid}' for all NVMe namespaces in existing
kernel releases.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel