Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-14 Thread Bart Van Assche

On 8/14/23 05:18, Nitesh Shetty wrote:

On 23/08/11 02:25PM, Bart Van Assche wrote:

On 8/11/23 03:52, Nitesh Shetty wrote:

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0bad62cca3d0..de0ad7a0d571 100644
+static inline bool op_is_copy(blk_opf_t op)
+{
+    return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
+    (op & REQ_OP_MASK) == REQ_OP_COPY_DST);
+}
+


The above function should be moved into include/linux/blk-mq.h below the
definition of req_op() such that it can use req_op() instead of 
open-coding it.


We use this later for the dm patches (patch 9) as well, and we don't have
a request at that time.


My understanding is that include/linux/blk_types.h should only contain
data types and constants and hence that inline functions like
op_is_copy() should be moved elsewhere.

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v14 04/11] block: add emulation for copy

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

+   schedule_work(&emulation_io->emulation_work);


schedule_work() uses system_wq. This won't work for all users since 
there are no latency guarantees for system_wq.


Thanks,

Bart.
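One way to address this (a sketch only, not code from the patch set) is a dedicated workqueue so the emulation work is not at the mercy of system_wq. The name and flag choices below are assumptions; in particular WQ_MEM_RECLAIM only matters if this work can sit in a memory-reclaim path:

```c
/* Sketch: a dedicated workqueue for copy emulation, allocated at init
 * time instead of relying on system_wq. */
static struct workqueue_struct *blkdev_copy_wq;

static int __init blkdev_copy_init(void)
{
	blkdev_copy_wq = alloc_workqueue("blkdev_copy",
					 WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	if (!blkdev_copy_wq)
		return -ENOMEM;
	return 0;
}

/* ...and submission then becomes: */
	queue_work(blkdev_copy_wq, &emulation_io->emulation_work);
```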




Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

+   if (rem != chunk)
+   atomic_inc(&cio->refcount);


This code will be easier to read if the above if-test is left out
and if the following code is added below the for-loop:

if (atomic_dec_and_test(&cio->refcount))
	blkdev_copy_endio(cio);

Thanks,

Bart.




Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

We expect the caller to take a plug and send a bio with source information,
followed by a bio with destination information.
Once the src bio arrives we form a request and wait for the destination
bio. Upon arrival of the destination bio we merge these two bios and send
the corresponding request down to the device driver.


Is the above description up-to-date? In the cover letter there is a 
different description of how copy offloading works.


Thanks,

Bart.




Re: [dm-devel] [PATCH v14 01/11] block: Introduce queue limits and sysfs for copy-offload support

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

+/* maximum copy offload length, this is set to 128MB based on current testing */
+#define COPY_MAX_BYTES (1 << 27)


Since the COPY_MAX_BYTES constant is only used in source file
block/blk-settings.c it should be moved into that file. If you really
want to keep it in include/linux/blkdev.h, a BLK_ prefix should
be added.

Thanks,

Bart.




Re: [dm-devel] [PATCH v14 00/11] Implement copy offload support

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

We achieve copy offload by sending two bios with source and destination
info and merging them to form a request. This request is sent to the
driver, so this design works only for request-based storage drivers.


[ ... ]


Overall series supports:

1. Driver
   - NVMe Copy command (single NS, TP 4065), including support
     in nvme-target (for block and file back ends)

2. Block layer
   - Block-generic copy (REQ_OP_COPY_DST/SRC) operation, with an
     interface accommodating two block devices
   - Merging of copy requests in the request layer
   - Emulation, for in-kernel users when native offload is absent
   - dm-linear support (for cases not requiring a split)

3. User-interface
   - copy_file_range


Is this sufficient? The combination of dm-crypt, dm-linear and the NVMe 
driver is very common. What is the plan for supporting dm-crypt? 
Shouldn't bio splitting be supported for dm-linear?


Thanks,

Bart.




Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

+ * Description:
+ * Copy source offset to destination offset within block device, using
+ * device's native copy offload feature.


Offloading the copy operation is not guaranteed so I think that needs to 
be reflected in the above comment.



+ * We perform copy operation by sending 2 bio's.
+ * 1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
+ * sector and length. Once this bio reaches request layer, we form a
+ * request and wait for dst bio to arrive.


What will happen if the queue depth of the request queue at the bottom 
is one?



+   blk_start_plug(&plug);
+   dst_bio = blk_next_bio(src_bio, bdev, 0, REQ_OP_COPY_DST, gfp);


blk_next_bio() can return NULL so its return value should be checked.
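The check would look something like the sketch below; the unwind label is hypothetical, since the actual cleanup depends on what the function has allocated by that point:

```c
	blk_start_plug(&plug);
	dst_bio = blk_next_bio(src_bio, bdev, 0, REQ_OP_COPY_DST, gfp);
	if (!dst_bio) {
		ret = -ENOMEM;
		goto err_put_src_bio;	/* hypothetical unwind label */
	}
```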


+   dst_bio->bi_iter.bi_size = chunk;
+   dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
+   dst_bio->bi_end_io = blkdev_copy_offload_dst_endio;
+   dst_bio->bi_private = offload_io;


Thanks,

Bart.




Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.

2023-08-11 Thread Bart Van Assche

On 8/11/23 03:52, Nitesh Shetty wrote:

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 0bad62cca3d0..de0ad7a0d571 100644
+static inline bool op_is_copy(blk_opf_t op)
+{
+   return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
+   (op & REQ_OP_MASK) == REQ_OP_COPY_DST);
+}
+


The above function should be moved into include/linux/blk-mq.h below the
definition of req_op() such that it can use req_op() instead of 
open-coding it.


Thanks,

Bart.




Re: [dm-devel] [PATCH 01/32] block: Provide blkdev_get_handle_* functions

2023-07-09 Thread Bart Van Assche

On 7/4/23 09:14, Matthew Wilcox wrote:

On Tue, Jul 04, 2023 at 07:06:26AM -0700, Bart Van Assche wrote:

On 7/4/23 05:21, Jan Kara wrote:

+struct bdev_handle {
+   struct block_device *bdev;
+   void *holder;
+};


Please explain in the patch description why a holder pointer is introduced
in struct bdev_handle and how it relates to the bd_holder pointer in struct
block_device. Is one of the purposes of this patch series perhaps to add
support for multiple holders per block device?


That is all in patch 0/32.  Why repeat it?


This cover letter: 
https://lore.kernel.org/linux-block/20230629165206.383-1-j...@suse.cz/T/#t?

The word "holder" doesn't even occur in that cover letter so how could the
answer to my question be present in the cover letter?

Bart.




Re: [dm-devel] [PATCH 01/32] block: Provide blkdev_get_handle_* functions

2023-07-04 Thread Bart Van Assche

On 7/4/23 05:21, Jan Kara wrote:

+struct bdev_handle {
+   struct block_device *bdev;
+   void *holder;
+};


Please explain in the patch description why a holder pointer is 
introduced in struct bdev_handle and how it relates to the bd_holder 
pointer in struct block_device. Is one of the purposes of this patch 
series perhaps to add support for multiple holders per block device?


Thanks,

Bart.




[dm-devel] [PATCH v4 6/7] dm: Inline __dm_mq_kick_requeue_list()

2023-06-21 Thread Bart Van Assche
Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") the function __dm_mq_kick_requeue_list() is too short to keep it
as a separate function. Hence, inline this function.

Reviewed-by: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 drivers/md/dm-rq.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, blk_status_t error)
rq_completed(md);
 }
 
-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long msecs)
-{
-   blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-   __dm_mq_kick_requeue_list(md->queue, 0);
+   blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);
 
static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
blk_mq_requeue_request(rq, false);
-   __dm_mq_kick_requeue_list(rq->q, msecs);
+   blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }
 
static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)




[dm-devel] [PATCH v4 7/7] block: Inline blk_mq_{, delay_}kick_requeue_list()

2023-06-21 Thread Bart Van Assche
Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep these as separate functions.

Acked-by: Vineeth Vijayan  [ for the s390 changes ]
Cc: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 block/blk-flush.c|  4 ++--
 block/blk-mq-debugfs.c   |  2 +-
 block/blk-mq.c   | 15 +--
 drivers/block/ublk_drv.c |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c   |  6 +++---
 drivers/nvme/host/core.c |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c  |  2 +-
 include/linux/blk-mq.h   |  2 --
 10 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4bfb92f58aa9..157b86fd9ccb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -192,7 +192,7 @@ static void blk_flush_complete_seq(struct request *rq,
spin_lock(&hctx->requeue_lock);
list_add_tail(&rq->queuelist, &hctx->flush_list);
spin_unlock(&hctx->requeue_lock);
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
break;
 
case REQ_FSEQ_DONE:
@@ -354,7 +354,7 @@ static void blk_kick_flush(struct blk_mq_hw_ctx *hctx,
list_add_tail(&flush_rq->queuelist, &hctx->flush_list);
spin_unlock(&hctx->requeue_lock);
 
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 
 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 787bdff3cc64..76792ebab935 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -114,7 +114,7 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
} else if (strcmp(op, "start") == 0) {
blk_mq_start_stopped_hw_queues(q, true);
} else if (strcmp(op, "kick") == 0) {
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
} else {
pr_err("%s: unsupported operation '%s'\n", __func__, op);
 inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index de39984d17c4..12fd8b65b930 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1436,7 +1436,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
spin_unlock_irqrestore(&hctx->requeue_lock, flags);
 
if (kick_requeue_list)
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);
 
@@ -1473,19 +1473,6 @@ static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
}
 }
 
-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-   blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-   unsigned long msecs)
-{
-   blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
/*
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1c823750c95a..cddbbdc9b199 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -902,7 +902,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 */
if (unlikely(!mapped_bytes)) {
blk_mq_requeue_request(req, false);
-   blk_mq_delay_kick_requeue_list(req->q,
+   blk_mq_delay_run_hw_queues(req->q,
UBLK_REQUEUE_DELAY_MS);
return;
}
@@ -1297,7 +1297,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 
blk_mq_unquiesce_queue(ub->ub_disk->queue);
/* We may have requeued some rqs in ublk_quiesce_queue() */
-   blk_mq_kick_requeue_list(ub->ub_disk->queue);
+   blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }
 
 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2341,7 +2341,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
blk_mq_unquiesce_queue(ub->ub_disk->queue);
pr_devel("%s: queue unquiesced, dev id %d.\n",
__func__, header->dev_id);
-   blk_mq_kick_requeue_list(ub->ub_disk->queue);
+   blk_mq_run_hw_queues(ub->ub_disk->queue, true);
ub->dev_info.state = UBLK_S_DEV_LIVE;
schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON

Re: [dm-devel] [PATCH] blk-crypto: use dynamic lock class for blk_crypto_profile::lock

2023-06-12 Thread Bart Van Assche

On 6/9/23 23:11, Eric Biggers wrote:

When a device-mapper device is passing through the inline encryption
support of an underlying device, calls to blk_crypto_evict_key() take
the blk_crypto_profile::lock of the device-mapper device, then take the
blk_crypto_profile::lock of the underlying device (nested).  This isn't
a real deadlock, but it causes a lockdep report because there is only
one lock class for all instances of this lock.

Lockdep subclasses don't really work here because the hierarchy of block
devices is dynamic and could have more than 2 levels.

Instead, register a dynamic lock class for each blk_crypto_profile, and
associate that with the lock.


Reviewed-by: Bart Van Assche 




[dm-devel] [PATCH v3 7/7] block: Inline blk_mq_{, delay_}kick_requeue_list()

2023-05-22 Thread Bart Van Assche
Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep these as separate functions.

Cc: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 block/blk-flush.c|  4 ++--
 block/blk-mq-debugfs.c   |  2 +-
 block/blk-mq.c   | 16 +---
 drivers/block/ublk_drv.c |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c   |  6 +++---
 drivers/nvme/host/core.c |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c  |  2 +-
 include/linux/blk-mq.h   |  2 --
 10 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22be..22170036ddcb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -191,7 +191,7 @@ static void blk_flush_complete_seq(struct request *rq,
spin_lock(&q->requeue_lock);
list_add_tail(&rq->queuelist, &q->flush_list);
spin_unlock(&q->requeue_lock);
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
break;
 
case REQ_FSEQ_DONE:
@@ -352,7 +352,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
list_add_tail(&flush_rq->queuelist, &q->flush_list);
spin_unlock(&q->requeue_lock);
 
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 
 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 68165a50951b..869cc62ed50f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -145,7 +145,7 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
} else if (strcmp(op, "start") == 0) {
blk_mq_start_stopped_hw_queues(q, true);
} else if (strcmp(op, "kick") == 0) {
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
} else {
pr_err("%s: unsupported operation '%s'\n", __func__, op);
 inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 52dffdc70480..34dcfc84d902 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1430,7 +1430,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
spin_unlock_irqrestore(&q->requeue_lock, flags);
 
if (kick_requeue_list)
-   blk_mq_kick_requeue_list(q);
+   blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);
 
@@ -1470,19 +1470,6 @@ static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
blk_mq_insert_request(rq, 0);
 }
 
-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-   blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-   unsigned long msecs)
-{
-   blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
/*
@@ -3537,7 +3524,6 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
list_for_each_entry_safe(rq, next, &tmp, queuelist)
blk_mq_requeue_request(rq, false);
-   blk_mq_kick_requeue_list(hctx->queue);
 
blk_mq_run_hw_queue(hctx, true);
return 0;
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 539eada32861..4a3d579a25b5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -900,7 +900,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 */
if (unlikely(!mapped_bytes)) {
blk_mq_requeue_request(req, false);
-   blk_mq_delay_kick_requeue_list(req->q,
+   blk_mq_delay_run_hw_queues(req->q,
UBLK_REQUEUE_DELAY_MS);
return;
}
@@ -1290,7 +1290,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 
blk_mq_unquiesce_queue(ub->ub_disk->queue);
/* We may have requeued some rqs in ublk_quiesce_queue() */
-   blk_mq_kick_requeue_list(ub->ub_disk->queue);
+   blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }
 
 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2334,7 +2334,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
blk_mq_unquiesce_queue(ub->ub_disk->queue);
pr_dev

[dm-devel] [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list()

2023-05-22 Thread Bart Van Assche
Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") the function __dm_mq_kick_requeue_list() is too short to keep it
as a separate function. Hence, inline this function.

Cc: Christoph Hellwig 
Cc: Damien Le Moal 
Cc: Ming Lei 
Cc: Mike Snitzer 
Signed-off-by: Bart Van Assche 
---
 drivers/md/dm-rq.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, blk_status_t error)
rq_completed(md);
 }
 
-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long msecs)
-{
-   blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-   __dm_mq_kick_requeue_list(md->queue, 0);
+   blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);
 
static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
blk_mq_requeue_request(rq, false);
-   __dm_mq_kick_requeue_list(rq->q, msecs);
+   blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }
 
static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)




Re: [dm-devel] [PATCH v4 1/4] block: Introduce provisioning primitives

2023-04-18 Thread Bart Van Assche

On 4/18/23 15:12, Sarthak Kukreti wrote:

/* Fail if we don't recognize the flags. */
-   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+   if (mode != 0 && mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
return -EOPNOTSUPP;


Is this change necessary? Doesn't (mode & ~BLKDEV_FALLOC_FL_SUPPORTED) 
!= 0 imply that mode != 0?


Thanks,

Bart.




Re: [dm-devel] [PATCH v6 04/18] scsi: Move sd_pr_type to scsi_common

2023-04-07 Thread Bart Van Assche

On 4/7/23 13:05, Mike Christie wrote:

LIO is going to want to do the same block to/from SCSI pr types as sd.c
so this moves the sd_pr_type helper to scsi_common and renames it. The
next patch will then also add a helper to go from the SCSI value to the
block one for use with PERSISTENT_RESERVE_IN commands.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v5 03/18] scsi: Rename sd_pr_command

2023-03-24 Thread Bart Van Assche

On 3/24/23 11:17, Mike Christie wrote:

Rename sd_pr_command to sd_pr_out_command to match a
sd_pr_in_command helper added in the next patches.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v5 02/18] block: Rename BLK_STS_NEXUS to BLK_STS_RESV_CONFLICT

2023-03-24 Thread Bart Van Assche

On 3/24/23 11:17, Mike Christie wrote:

BLK_STS_NEXUS is used for NVMe/SCSI reservation conflicts and DASD's
locking feature which works similar to NVMe/SCSI reservations where a
host can get a lock on a device and when the lock is taken it will get
failures.

This patch renames BLK_STS_NEXUS so it better reflects this type of
use.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v5 01/18] block: Add PR callouts for read keys and reservation

2023-03-24 Thread Bart Van Assche

On 3/24/23 11:17, Mike Christie wrote:

Add callouts for reading keys and reservations. This allows LIO to support
the READ_KEYS and READ_RESERVATION commands and will allow dm-multipath
to optimize its error handling so it can check whether it's getting an
error because there's an existing reservation or whether we need to retry
different paths.

Note: This only initially adds the struct definitions in the kernel as I'm
not sure if we wanted to export the interface to userspace yet. read_keys
and read_reservation are exactly what dm-multipath and LIO need, but for a
userspace interface we may want something like SCSI's READ_FULL_STATUS and
NVMe's report reservation commands. Those are overkill for dm/LIO and
READ_FULL_STATUS is sometimes broken for SCSI devices.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v5 04/18] scsi: Move sd_pr_type to header to share

2023-03-24 Thread Bart Van Assche

On 3/24/23 11:17, Mike Christie wrote:

diff --git a/include/scsi/scsi_block_pr.h b/include/scsi/scsi_block_pr.h
new file mode 100644
index ..44766d7a81d8
--- /dev/null
+++ b/include/scsi/scsi_block_pr.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _SCSI_BLOCK_PR_H
+#define _SCSI_BLOCK_PR_H
+
+#include 
+
+enum scsi_pr_type {
+   SCSI_PR_WRITE_EXCLUSIVE = 0x01,
+   SCSI_PR_EXCLUSIVE_ACCESS= 0x03,
+   SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY= 0x05,
+   SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY   = 0x06,
+   SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS= 0x07,
+   SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS   = 0x08,
+};
+
+static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
+{
+   switch (type) {
+   case PR_WRITE_EXCLUSIVE:
+   return SCSI_PR_WRITE_EXCLUSIVE;
+   case PR_EXCLUSIVE_ACCESS:
+   return SCSI_PR_EXCLUSIVE_ACCESS;
+   case PR_WRITE_EXCLUSIVE_REG_ONLY:
+   return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
+   case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+   return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
+   case PR_WRITE_EXCLUSIVE_ALL_REGS:
+   return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
+   case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+   return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+   }
+
+   return 0;
+}
+
+#endif


Hi Mike,

Has it been considered to move enum scsi_pr_type into 
include/scsi/scsi_common.h and block_pr_type_to_scsi() into 
drivers/scsi/scsi_common.c? Other definitions that are shared between 
SCSI initiator and SCSI target code exist in these files.


Thanks,

Bart.




Re: [dm-devel] dm-crypt: fix softlockup in dmcrypt_write

2023-03-06 Thread Bart Van Assche

On 3/6/23 10:02, Mikulas Patocka wrote:

On Tue, 28 Feb 2023, yangerkun wrote:

It's ok to fix the softlockup, but for async write encrypt,
kcryptd_crypt_write_io_submit will add bio to write_tree, and once we
call cond_resched before every kcryptd_io_write, the write performance
may be poor while we meet a high cpu usage scene.


Hi

To fix this problem, find the PID of the process "dmcrypt_write" and
change its priority to -20, for example "renice -n -20 -p 34748".

This is the proper way to fix it; locking up the process for one
second is not.

We used to have high-priority workqueues by default, but it caused audio
playback skipping, so we had to revert it - see
f612b2132db529feac4f965f28a1b9258ea7c22b.

Perhaps we should add an option to have high-priority kernel threads?


Would calling cond_resched() every n iterations instead of every 
iteration help? From mm/swapfile.c:


if (unlikely(--latency_ration < 0)) {
cond_resched();
latency_ration = LATENCY_LIMIT;
}

Thanks,

Bart.
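Applied to the dmcrypt_write() submission loop, the swapfile.c pattern would look roughly like the sketch below (LATENCY_LIMIT is the same kind of tunable as in mm/swapfile.c; the loop body is elided):

```c
/* Sketch: yield only every LATENCY_LIMIT iterations instead of on
 * every submitted bio, bounding both latency and rescheduling cost. */
#define LATENCY_LIMIT	256

	int latency_ration = LATENCY_LIMIT;

	while (io) {			/* per-bio submission loop, elided */
		kcryptd_io_write(io);
		if (unlikely(--latency_ration < 0)) {
			cond_resched();
			latency_ration = LATENCY_LIMIT;
		}
		io = next_io;
	}
```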




Re: [dm-devel] [PATCH] dm-crypt: fix softlockup in dmcrypt_write

2023-02-25 Thread Bart Van Assche

On 2/22/23 19:19, yangerkun wrote:

@@ -1924,6 +1926,10 @@ static int dmcrypt_write(void *data)
  
  		BUG_ON(rb_parent(write_tree.rb_node));
  
+		if (time_is_before_jiffies(start_time + HZ)) {
+			schedule();
+			start_time = jiffies;
+		}


Why schedule() instead of cond_resched()?

Thanks,

Bart.




Re: [dm-devel] [PATCH 0/9] Documentation: correct lots of spelling errors (series 2)

2023-02-02 Thread Bart Van Assche

On 2/2/23 10:33, Randy Dunlap wrote:

On 2/2/23 10:09, Jonathan Corbet wrote:

Randy Dunlap  writes:

  [PATCH 6/9] Documentation: scsi/ChangeLog*: correct spelling
  [PATCH 7/9] Documentation: scsi: correct spelling


I've left these for the SCSI folks for now.  Do we *really* want to be
fixing spelling in ChangeLog files from almost 20 years ago?


That's why I made it a separate patch -- so the SCSI folks can decide that...


How about removing the Documentation/scsi/ChangeLog.* files? I'm not
sure these changelogs are still useful since they duplicate information
that is already available in the output of git log ${driver_directory}.


Thanks,

Bart.





Re: [dm-devel] [PATCH 3/7] nvmet: introduce bdev_zone_no helper

2023-01-06 Thread Bart Van Assche

On 1/6/23 00:33, Pankaj Raghav wrote:

A generic bdev_zone_no() helper is added to calculate zone number for a
given sector in a block device. This helper internally uses disk_zone_no()
to find the zone number.

Use the helper bdev_zone_no() to calculate nr of zones. This let's us
make modifications to the math if needed in one place.


Please use the imperative mood in patch descriptions (... is added -> 
add ...).


let's -> lets

If these two issues are addressed, please add:

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 2/7] block: add a new helper bdev_{is_zone_start, offset_from_zone_start}

2023-01-06 Thread Bart Van Assche

On 1/6/23 00:33, Pankaj Raghav wrote:

+static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
+  sector_t sec)
+{
+   if (!bdev_is_zoned(bdev))
+   return 0;
+
+   return sec & (bdev_zone_sectors(bdev) - 1);
+}
+
+static inline bool bdev_is_zone_start(struct block_device *bdev, sector_t sec)
+{
+   if (!bdev_is_zoned(bdev))
+   return false;
+
+   return bdev_offset_from_zone_start(bdev, sec) == 0;
+}


A nit: 'sector_t sector' is much more common in the block layer than 
'sector_t sec'. Please consider changing 'sec' into 'sector'.


Thanks,

Bart.




Re: [dm-devel] [PATCH 1/7] block: remove superfluous check for request queue in bdev_is_zoned

2023-01-06 Thread Bart Van Assche

On 1/6/23 00:33, Pankaj Raghav wrote:

Remove the superfluous request queue check in bdev_is_zoned() as the
bdev_get_queue can never return NULL.


the bdev_get_queue -> bdev_get_queue()

Anyway:

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH] block: remove bio_set_op_attrs

2022-12-06 Thread Bart Van Assche

On 12/6/22 06:40, Christoph Hellwig wrote:

This macro is obsolete, so replace the last few uses with open coded
bi_opf assignments.


For the entire patch:

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v3] virtio_blk: add VIRTIO_BLK_F_LIFETIME feature support

2022-12-06 Thread Bart Van Assche

On 12/5/22 08:20, Alvaro Karsz wrote:

+/* Get lifetime information struct for each request */
+struct virtio_blk_lifetime {
+   /*
+* specifies the percentage of reserved blocks that are consumed.
+* optional values following virtio spec:
+* 0 - undefined
+* 1 - normal, < 80% of reserved blocks are consumed
+* 2 - warning, 80% of reserved blocks are consumed
+* 3 - urgent, 90% of reserved blocks are consumed
+*/
+   __le16 pre_eol_info;
+   /*
+* this field refers to wear of SLC cells and is provided in increments 
of 10used,
+* and so on, thru to 11 meaning estimated lifetime exceeded. All 
values above 11
+* are reserved
+*/
+   __le16 device_lifetime_est_typ_a;
+   /*
+* this field refers to wear of MLC cells and is provided with the same 
semantics as
+* device_lifetime_est_typ_a
+*/
+   __le16 device_lifetime_est_typ_b;
+};


Why does the above data structure only refer to SLC and MLC but not to
TLC or QLC?

How will this data structure be extended without having to introduce a
new ioctl?

Thanks,

Bart.




Re: [dm-devel] [PATCH v3 03/19] scsi: Move sd_pr_type to header to share

2022-11-03 Thread Bart Van Assche

On 11/2/22 19:13, Mike Christie wrote:

On 11/2/22 5:47 PM, Bart Van Assche wrote:

On 10/26/22 16:19, Mike Christie wrote:

+static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
+{
+    switch (type) {
+    case PR_WRITE_EXCLUSIVE:
+    return SCSI_PR_WRITE_EXCLUSIVE;
+    case PR_EXCLUSIVE_ACCESS:
+    return SCSI_PR_EXCLUSIVE_ACCESS;
+    case PR_WRITE_EXCLUSIVE_REG_ONLY:
+    return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
+    case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+    return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
+    case PR_WRITE_EXCLUSIVE_ALL_REGS:
+    return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
+    case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+    return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+    default:
+    return 0;
+    }
+};


Please leave out "default: return 0;" from the switch statement and add "return 
0;" after the switch statement. That will make the compiler emit a warning if a value is added 
in enum pr_type but not in the above function.


Did you want that compiler warning functionality in the future code
or are you asking me to do the above only if we go the switch based
route?

Chaitanya requested me to make this array based in nvme/scsi. For this
approach, I can add a WARN or printk warning if the pr_type passed in
does not match a value in the array. I don't think I can do a compiler
warning though.

I didn't care, but I think the compiler warning route might be better
though.


Hi Mike,

Although I do not have a strong opinion about this, keeping the switch 
statement and moving the return statement out of the switch statement 
has the advantage that the compiler will warn if the switch statement is 
incomplete. I don't think that a similar warning would be emitted if the 
switch statement would be converted into an array lookup.


Thanks,

Bart.



Re: [dm-devel] [PATCH v3 04/19] scsi: Add support for block PR read keys/reservation

2022-11-02 Thread Bart Van Assche

On 10/26/22 16:19, Mike Christie wrote:

+static inline enum pr_type scsi_pr_type_to_block(enum scsi_pr_type type)
+{
+   switch (type) {
+   case SCSI_PR_WRITE_EXCLUSIVE:
+   return PR_WRITE_EXCLUSIVE;
+   case SCSI_PR_EXCLUSIVE_ACCESS:
+   return PR_EXCLUSIVE_ACCESS;
+   case SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY:
+   return PR_WRITE_EXCLUSIVE_REG_ONLY;
+   case SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY:
+   return PR_EXCLUSIVE_ACCESS_REG_ONLY;
+   case SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS:
+   return PR_WRITE_EXCLUSIVE_ALL_REGS;
+   case SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS:
+   return PR_EXCLUSIVE_ACCESS_ALL_REGS;
+   default:
+   return 0;
+   }
+}


Also for this function, how about moving the "return 0" statement out of 
the switch statement?


Thanks,

Bart.




Re: [dm-devel] [PATCH v3 01/19] block: Add PR callouts for read keys and reservation

2022-11-02 Thread Bart Van Assche

On 10/26/22 16:19, Mike Christie wrote:

+struct pr_keys {
+   u32 generation;
+   u32 num_keys;
+   u64 keys[];
+};

Is my understanding correct that keys[] is treated as opaque data by the 
kernel? If so, is it necessary to convert the persistent reservation 
keys from big endian to CPU endianness? Some SCSI stacks keep 
reservation keys as __be64 format.


Thanks,

Bart.




Re: [dm-devel] [PATCH v3 01/19] block: Add PR callouts for read keys and reservation

2022-11-02 Thread Bart Van Assche

On 10/26/22 16:19, Mike Christie wrote:

+struct pr_keys {
+   u32 generation;
+   u32 num_keys;
+   u64 keys[];
+};
+
+struct pr_held_reservation {
+   u64 key;
+   u32 generation;
+   enum pr_typetype;
+};
+
  struct pr_ops {
int (*pr_register)(struct block_device *bdev, u64 old_key, u64 new_key,
u32 flags);
@@ -14,6 +26,18 @@ struct pr_ops {
int (*pr_preempt)(struct block_device *bdev, u64 old_key, u64 new_key,
enum pr_type type, bool abort);
int (*pr_clear)(struct block_device *bdev, u64 key);
+   /*
+* pr_read_keys - Read the registered keys and return them in the
+* pr_keys->keys array. The keys array will have been allocated at the
+* end of the pr_keys struct and is keys_len bytes. If there are more
+* keys than can fit in the array, success will still be returned and
+* pr_keys->num_keys will reflect the total number of keys the device
+* contains, so the caller can retry with a larger array.
+*/
+   int (*pr_read_keys)(struct block_device *bdev,
+   struct pr_keys *keys_info, u32 keys_len);
+   int (*pr_read_reservation)(struct block_device *bdev,
+   struct pr_held_reservation *rsv);
  };


Is there any pr_read_keys() implementation that won't have to divide 
@keys_len by 8? How about leaving out that argument and making callers 
store the number of elements in the keys[] array in the num_keys member 
before calling pr_read_keys()?


Thanks,

Bart.




Re: [dm-devel] [PATCH v3 03/19] scsi: Move sd_pr_type to header to share

2022-11-02 Thread Bart Van Assche

On 10/26/22 16:19, Mike Christie wrote:

+static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
+{
+   switch (type) {
+   case PR_WRITE_EXCLUSIVE:
+   return SCSI_PR_WRITE_EXCLUSIVE;
+   case PR_EXCLUSIVE_ACCESS:
+   return SCSI_PR_EXCLUSIVE_ACCESS;
+   case PR_WRITE_EXCLUSIVE_REG_ONLY:
+   return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
+   case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+   return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
+   case PR_WRITE_EXCLUSIVE_ALL_REGS:
+   return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
+   case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+   return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
+   default:
+   return 0;
+   }
+};


Please leave out "default: return 0;" from the switch statement and add 
"return 0;" after the switch statement. That will make the compiler emit 
a warning if a value is added in enum pr_type but not in the above function.


Thanks,

Bart.




Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

2022-10-24 Thread Bart Van Assche

On 9/30/22 14:24, Jens Axboe wrote:

Noted. I'll find some time to review this as well separately, once we're
on the other side of the merge window.


Hi Jens,

Now that we are on the other side of the merge window: do you perhaps 
want Pankaj to repost this patch series? From what I have heard in 
several fora (JEDEC, SNIA) all flash storage vendors except one (WDC) 
are in favor of a contiguous LBA space and hence are in favor of 
supporting zone sizes that are not a power of two.


As you may know in JEDEC we are working on standardizing zoned storage 
for UFS devices. We (JEDEC JC-64.1 committee members) would like to know 
whether or not we should require that the UFS zone size should be a 
power of two.


Thank you,

Bart.





Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

2022-09-30 Thread Bart Van Assche

On 9/30/22 17:45, Damien Le Moal wrote:

On 10/1/22 04:38, Bart Van Assche wrote:

Since this has not been mentioned in the cover letter, I want to add
that in the near future we will need these patches for Android devices.
JEDEC is working on supporting zoned storage for UFS devices, the
storage devices used in all modern Android phones. Although it would be
possible to make the offset between zone starts a power of two by
inserting gap zones between data zones, UFS vendors asked not to do this
and hence need support for zone sizes that are not a power of two. An
advantage of not having to deal with gap zones is better filesystem
performance since filesystem extents cannot span gap zones. Having to
split filesystem extents because of gap zones reduces filesystem
performance.


As mentioned many times, my opinion is that a good implementation should
*not* have any extent span zone boundaries. So personally, I do not
consider such argument as a valid justification for the non-power-of-2
zone size support.


Hi Damien,

Although the filesystem extent issue probably can be solved in software, 
the argument that UFS vendors strongly prefer not to have gap zones and 
hence need support for zone sizes that are not a power of two remains.


Thanks,

Bart.




Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes

2022-09-30 Thread Bart Van Assche

On 9/30/22 08:13, Jens Axboe wrote:

On 9/29/22 12:31 AM, Pankaj Raghav wrote:

Hi Jens,
   Please consider this patch series for the 6.1 release.



Hi Jens, Christoph, and Keith,
  All the patches have a Reviewed-by tag at this point. Can we queue this up
for 6.1?


It's getting pretty late for 6.1 and I'd really like to have both Christoph
and Martin sign off on these changes.


Hi Jens,

Agreed that it's getting late for 6.1.

Since this has not been mentioned in the cover letter, I want to add 
that in the near future we will need these patches for Android devices. 
JEDEC is working on supporting zoned storage for UFS devices, the 
storage devices used in all modern Android phones. Although it would be 
possible to make the offset between zone starts a power of two by 
inserting gap zones between data zones, UFS vendors asked not to do this 
and hence need support for zone sizes that are not a power of two. An 
advantage of not having to deal with gap zones is better filesystem 
performance since filesystem extents cannot span gap zones. Having to 
split filesystem extents because of gap zones reduces filesystem 
performance.


Thanks,

Bart.





Re: [dm-devel] [PATCH v15 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices

2022-09-28 Thread Bart Van Assche

On 9/23/22 10:36, Pankaj Raghav wrote:

dm_zone_endio() updates the bi_sector of orig bio for zoned devices that
uses either native append or append emulation, and it is called before the
endio of the target. But target endio can still update the clone bio
after dm_zone_endio is called, thereby, the orig bio does not contain
the updated information anymore.

Currently, this is not a problem as the targets that support zoned devices
such as dm-zoned, dm-linear, and dm-crypt do not have an endio function,
and even if they do (such as dm-flakey), they don't modify the
bio->bi_iter.bi_sector of the cloned bio that is used to update the
orig_bio's bi_sector in dm_zone_endio function.

This is a prep patch for the new dm-po2zoned target as it modifies
bi_sector in the endio callback.

Call dm_zone_endio for zoned devices after calling the target's endio
function.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v15 07/13] zonefs: allow non power of 2 zoned devices

2022-09-28 Thread Bart Van Assche

On 9/23/22 10:36, Pankaj Raghav wrote:

The zone size shift variable is useful only if the zone sizes are known
to be power of 2. Remove that variable and use generic helpers from
block layer to calculate zone index in zonefs.


Reviewed-by: Bart Van Assche 





Re: [dm-devel] Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

2022-09-23 Thread Bart Van Assche

On 9/22/22 23:29, Matias Bjørling wrote:

With UFS, in the proposed copy I have (may been changed) - there's
the concept of gap zones, which is zones that cannot be accessed by
the host. The gap zones are essentially "LBA fillers", enabling the
next writeable zone to start at a X * pow2 size offset. My
understanding is that this specific approach was chosen to simplify
standardization in UFS and avoid updating T10's ZBC with zone
capacity support.

While UFS would technically expose non-power of 2 zone sizes, they're
also, due to the gap zones, could also be considered power of 2 zones
if one considers the seq. write zone + the gap zone as a single
unit.

When I think about having UFS support in the kernel, the SWR and the
gap zone could be represented as a single unit. For example:

UFS - Zone Report
   Zone 0: SWR, LBA 0-11
   Zone 1: Gap, LBA 12-15
   Zone 2: SWR, LBA 16-27
   Zone 3: Gap, LBA 28-31
   ...

Kernel representation - Zone Report (as supported today)
   Zone 0: SWR, LBA 0-15, Zone Capacity 12
   Zone 1: SWR, LBA 16-31, Zone Capacity 12
   ...

If doing it this way, it removes the need for filesystems,
device-mappers, user-space applications having to understand gap
zones, and allows UFS to work out of the box with no changes to the
rest of the zoned storage eco-system.

Has the above representation been considered?


Hi Matias,

What has been described above is the approach from the first version of 
the zoned storage for UFS (ZUFS) draft standard. Support for this 
approach is available in the upstream kernel. See also "[PATCH v2 0/9] 
Support zoned devices with gap zones", 2022-04-21 
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanass...@acm.org/).


Since F2FS extents must be split at gap zones, gap zones negatively 
affect sequential read and write performance. So we abandoned the gap 
zone approach. The current approach is as follows:
* The power-of-two restriction for the offset between zone starts has 
been removed. Gap zones are no longer required. Hence, we will need the 
patches that add support for zone sizes that are not a power of two.
* The Sequential Write Required (SWR) and Sequential Write Preferred 
(SWP) zone types are supported. The feedback we received from UFS 
vendors is that which zone type works best depends on their firmware and 
ASIC design.
* We need a queue depth larger than one (QD > 1) for writes to achieve 
the full sequential write bandwidth. We plan to support QD > 1 as follows:

  - If writes have to be serialized, submit these to the same hardware
queue. According to the UFS host controller interface (UFSHCI)
standard, UFS host controllers are not allowed to reorder SCSI
commands that are submitted to the same hardware queue. A source of
command reordering that remains is the SCSI retry mechanism. Retries
happen e.g. after a command timeout.
  - For SWP zones, require the UFS device firmware to use its garbage
collection mechanism to reorder data in the unlikely case that
out-of-order writes happened.
  - For SWR zones, retry writes that failed because these were received
out-of-order by a UFS device. ZBC-1 requires compliant devices to
respond with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND to out-of-
order writes.

We have considered the zone append approach but decided not to use it 
because if zone append commands get reordered the data ends up 
permanently out-of-order on the storage medium. This affects sequential 
read performance negatively.


Bart.



Re: [dm-devel] Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

2022-09-22 Thread Bart Van Assche

On 9/21/22 16:55, Damien Le Moal wrote:

But again, that all depends on if Pankaj patch series is accepted, that
is, on everybody accepting that we lift the power-of-2 zone size constraint.


The companies that are busy with implementing zoned storage for UFS 
devices are asking for kernel support for non-power-of-2 zone sizes.


Thanks,

Bart.




Re: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage

2022-09-16 Thread Bart Van Assche

On 9/16/22 11:48, Sarthak Kukreti wrote:

Yes. On ChromiumOS, we regularly deal with storage devices that don't
support WRITE_ZEROES or that need to have it disabled, via a quirk,
due to a bug in the vendor's implementation. Using WRITE_ZEROES for
allocation makes the allocation path quite slow for such devices (not
to mention the effect on storage lifetime), so having a separate
provisioning construct is very appealing. Even for devices that do
support an efficient WRITE_ZEROES implementation but don't support
logical provisioning per-se, I suppose that the allocation path might
be a bit faster (the device driver's request queue would report
'max_provision_sectors'=0 and the request would be short circuited
there) although I haven't benchmarked the difference.


Some background information about why ChromiumOS uses thin provisioning 
instead of a single filesystem across the entire storage device would be 
welcome. Although UFS devices support thin provisioning I am not aware 
of any use cases in Android that would benefit from UFS thin 
provisioning support.


Thanks,

Bart.




Re: [dm-devel] [PATCH v12 09/13] dm-zone: use generic helpers to calculate offset from zone start

2022-08-25 Thread Bart Van Assche

On 8/23/22 05:18, Pankaj Raghav wrote:

Use the bdev_offset_from_zone_start() helper function to calculate
the offset from zone start instead of using power of 2 based
calculation.

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v12 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed

2022-08-25 Thread Bart Van Assche

On 8/23/22 05:18, Pankaj Raghav wrote:

From: Luis Chamberlain 

dm-zoned relies on the assumption that the zone size is a
power-of-2(po2) and the zone capacity is same as the zone size.

Ensure only po2 devices can be used as dm-zoned target until a native
support for zoned devices with non-po2 zone size is added.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v12 06/13] null_blk: allow zoned devices with non power-of-2 zone sizes

2022-08-25 Thread Bart Van Assche

On 8/23/22 05:18, Pankaj Raghav wrote:

Convert the power-of-2(po2) based calculation with zone size to be generic
in null_zone_no with optimization for po2 zone sizes.

The nr_zones calculation in null_init_zoned_dev has been replaced with a
division without special handling for po2 zone sizes as this function is
called only during the initialization and will not be invoked in the hot
path.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v12 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

2022-08-25 Thread Bart Van Assche

On 8/23/22 05:18, Pankaj Raghav wrote:

Remove the condition which disallows non-power_of_2 zone size ZNS drive
to be updated and use generic method to calculate number of zones
instead of relying on log and shift based calculation on zone size.

The power_of_2 calculation has been replaced directly with generic
calculation without special handling. Both modified functions are not
used in hot paths, they are only used during initialization &
revalidation of the ZNS device.

As rounddown macro from math.h does not work for 32 bit architectures,
round down operation is open coded.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v12 02/13] block:rearrange bdev_{is_zoned, zone_sectors, get_queue} helpers in blkdev.h

2022-08-25 Thread Bart Van Assche

On 8/23/22 05:18, Pankaj Raghav wrote:

Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
superfluous NULL check for request queue while we are at it.

This commit has no functional change, and it is a prep patch for allowing
zoned devices with non-power-of-2 zone sizes in the block layer.


It seems like a space is missing in the patch subject after the colon? 
Anyway:


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [ANNOUNCE] CFP: Zoned Storage Microconference - Linux Plumbers Conference 2022

2022-08-24 Thread Bart Van Assche

On 5/22/22 15:01, Adam Manzanares wrote:

Zoned Storage Devices (SMR HDDs and ZNS SSDs) have demonstrated that they can
improve storage capacity, throughput, and latency over conventional storage
devices for many workloads. Zoned storage technology is deployed at scale in
some of the largest data centers in the world. There's already a
well-established set of storage vendors with increasing device availability and
a mature software foundation for interacting with zoned storage devices is
available. Zoned storage software support is evolving and there is room for
increased file-system support and additional userspace applications.

The Zoned Storage microconference focuses on evolving the Linux zoned
storage ecosystem by improving kernel support, file systems, and applications.
In addition, the forum allows us to open the discussion to incorporate and grow
the zoned storage community making sure to meet everyone's needs and
expectations. Finally, it is an excellent opportunity for anyone interested in
zoned storage devices to meet and discuss how we can move the ecosystem forward
together.


Hi Adam,

On https://lpc.events/event/16/contributions/1147/ I see four speakers 
but no agenda? Will an agenda be added before the microconference starts?


Thanks,

Bart.




Re: [dm-devel] [PATCH v2 12/20] block, nvme, scsi, dm: Add blk_status to pr_ops callouts.

2022-08-09 Thread Bart Van Assche

On 8/9/22 11:08, Mike Christie wrote:

On 8/9/22 2:21 AM, Christoph Hellwig wrote:

On Mon, Aug 08, 2022 at 07:04:11PM -0500, Mike Christie wrote:

To handle both cases, this patch adds a blk_status_t arg to the pr_ops
callouts. The lower levels will convert their device specific error to
the blk_status_t then the upper levels can easily check that code
without knowing the device type. It also allows us to keep userspace
compat where it expects a negative -Exyz error code if the command fails
before it's sent to the device or a device/tranport specific value if the
error is > 0.


Why do we need two return values here?


I know the 2 return values are gross :) I can do it in one, but I wasn't sure
what's worse. See below for the other possible solutions. I think they are all
bad.


0. Convert device specific conflict error to -EBADE then back:

sd_pr_command()

.

/* would add similar check for NVME_SC_RESERVATION_CONFLICT in nvme */
if (result == SAM_STAT_CHECK_CONDITION)
return -EBADE;
else
return result;


LIO then just checks for -EBADE but when going to userspace we have to
convert:


blkdev_pr_register()

...
result = ops->pr_register()
if (result < 0) {
/* For compat we must convert back to the nvme/scsi code */
if (result == -EBADE) {
/* need some helper for this that calls down the stack */
if (bdev == SCSI)
return SAM_STAT_RESERVATION_CONFLICT
else
return NVME_SC_RESERVATION_CONFLICT
} else
return blk_status_to_str(result)
} else
return result;


The conversion is kind of gross and I was thinking in the future it's going
to get worse. I'm going to want to have more advanced error handling in LIO
and dm-multipath. Like dm-multipath wants to know if an pr_op failed because
of a path failure, so it can retry another one, or a hard device/target error.
It would be nice for LIO if an PGR had bad/illegal values and the device
returned an error than I could detect that.


1. Drop the -Exyz error type and use blk_status_t in the kernel:

sd_pr_command()

.
if (result < 0)
return -errno_to_blk_status(result);
else if (result == SAM_STAT_CHECK_CONDITION)
return -BLK_STS_NEXUS;
else
return result;

blkdev_pr_register()

...
result = ops->pr_register()
if (result < 0) {
/* For compat we must convert back to the nvme/scsi code */
if (result == -BLK_STS_NEXUS) {
/* need some helper for this that calls down the stack */
if (bdev == SCSI)
return SAM_STAT_RESERVATION_CONFLICT
else
return NVME_SC_RESERVATION_CONFLICT
} else
return blk_status_to_str(result)
} else
return result;

This has similar issues as #0 where we have to convert before returning to
userspace.


Note: In this case, if the block layer uses an -Exyz error code there's no
BLK_STS for, then we would return -EIO to userspace now. I was thinking
that might not be ok but I could also just add a BLK_STS error code
for errors like EINVAL, EWOULDBLOCK, ENOMEM, etc so that doesn't happen.


2. We could do something like below where the low levels are not changed but the
caller converts:

sd_pr_command()
/* no changes */

lio()
result = ops->pr_register()
if (result > 0) {
/* add some stacked helper again that goes through dm and
 * to the low level device
 */
if (bdev == SCSI) {
result = scsi_result_to_blk_status(result)
else
result = nvme_error_status(result)


This looks simple, but it felt wrong having upper layers having to
know the device type and calling conversion functions.


Has it been considered to introduce a new enumeration type instead of 
choosing (0), (1) or (2)?


Thanks,

Bart.




Re: [dm-devel] [PATCH v2 14/20] scsi: Retry pr_ops commands if a UA is returned.

2022-08-09 Thread Bart Van Assche

On 8/9/22 09:24, Mike Christie wrote:

On 8/9/22 2:16 AM, Christoph Hellwig wrote:

On Mon, Aug 08, 2022 at 07:04:13PM -0500, Mike Christie wrote:

It's common to get a UA when doing PR commands. It could be due to a
target restarting, transport level relogin or other PR commands like a
release causing it. The upper layers don't get the sense and in some cases
have no idea if it's a SCSI device, so this has the sd layer retry.


This seems like another case for the generic in-kernel passthrugh
command retry discussed in the other thread.


It is.


Has it been considered to introduce a flag that makes scsi_noretry_cmd() 
retry passthrough commands?


Thanks,

Bart.




Re: [dm-devel] [PATCH v2 04/20] scsi: Add support for block PR read keys/reservation.

2022-08-09 Thread Bart Van Assche

On 8/8/22 17:04, Mike Christie wrote:

+static int sd_pr_in_command(struct block_device *bdev, u8 sa,
+   unsigned char *data, int data_len)
+{
+   struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk);
+   struct scsi_device *sdev = sdkp->device;
+   struct scsi_sense_hdr sshdr;
+   u8 cmd[10] = { 0, };
+   int result;


Isn't "{ }" instead of "{ 0, }" the preferred way to zero-initialize a 
data structure?



+
+   cmd[0] = PERSISTENT_RESERVE_IN;
+   cmd[1] = sa;


Can the above two assignments be moved into the initializer of cmd[]?

Thanks,

Bart.




Re: [dm-devel] [PATCH v2 02/20] scsi: Rename sd_pr_command.

2022-08-09 Thread Bart Van Assche

On 8/8/22 17:04, Mike Christie wrote:

Rename sd_pr_command to sd_pr_out_command to match a
sd_pr_in_command helper added in the next patches.


No trailing dots at the end of the patch subject please (this patch and 
other patches). Otherwise this patch looks good to me.


Thanks,

Bart.




Re: [dm-devel] [PATCH v9 10/13] dm-table: allow zoned devices with non power-of-2 zone sizes

2022-08-03 Thread Bart Van Assche

On 8/3/22 02:47, Pankaj Raghav wrote:

Allow dm to support zoned devices with non power-of-2(po2) zone sizes as
the block layer now supports it.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v9 09/13] dm-zone: use generic helpers to calculate offset from zone start

2022-08-03 Thread Bart Van Assche

On 8/3/22 02:47, Pankaj Raghav wrote:

-   if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
+   if ((bdev_offset_from_zone_start(md->disk->part0,
+clone->bi_iter.bi_sector)) != zwp_offset)


If this patch series needs to be reposted, please leave out the 
superfluous parentheses from the above statement.


Thanks,

Bart.




Re: [dm-devel] [PATCH v9 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

2022-08-03 Thread Bart Van Assche

On 8/3/22 02:47, Pankaj Raghav wrote:

-   sector &= ~(ns->zsze - 1);
+   /*
+* Round down the sector value to the nearest zone start
+*/
+   div64_u64_rem(sector, ns->zsze, &remainder);
+   sector -= remainder;


Could bdev_offset_from_zone_start() be used here?

Thanks,

Bart.




Re: [dm-devel] [PATCH v9 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size

2022-08-03 Thread Bart Van Assche

On 8/3/22 02:47, Pankaj Raghav wrote:

+   if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+   pr_err("%s: Invalid zone capacity: %lld with non power-of-2 zone size: %lld",
+  disk->disk_name, zone->capacity, zone->len);


If this patch series needs to be reposted, please leave out the second 
and third colon (:) from the above error message. Once this comment has 
been addressed, feel free to add:


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v9 02/13] block:rearrange bdev_{is_zoned, zone_sectors, get_queue} helpers in blkdev.h

2022-08-03 Thread Bart Van Assche

On 8/3/22 02:47, Pankaj Raghav wrote:

Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
in the blkdev.h include file.

This commit has no functional change, and it is a prep patch for allowing
zoned devices with non-power-of-2 zone sizes in the block layer.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device

2022-07-28 Thread Bart Van Assche

On 7/28/22 04:57, Pankaj Raghav wrote:

Hi Bart,

On 2022-07-28 01:19, Bart Van Assche wrote:

On 7/27/22 09:22, Pankaj Raghav wrote:

This series adds support to npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that does not have native npo2 zone support.


Should any SCSI changes be included in this patch series? From sd_zbc.c:

	if (!is_power_of_2(zone_blocks)) {
		sd_printk(KERN_ERR, sdkp,
			  "Zone size %llu is not a power of two.\n",
			  zone_blocks);
		return -EINVAL;
	}


I would keep these changes out of the current patch series because it
will also increase the test scope. I think once the block layer
constraint is removed as a part of this series, we can work on the SCSI
changes in the next cycle.


That's fine with me.

Thanks,

Bart.



Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size

2022-07-28 Thread Bart Van Assche

On 7/28/22 05:11, Pankaj Raghav wrote:

On 2022-07-28 01:16, Bart Van Assche wrote:

The bdev_is_zone_start() name seems more clear to me than
bdev_is_zone_aligned(). Has there already been a discussion about which
name to use for this function?


The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
name makes more sense for also checking if a given size is a multiple of
zone sectors for e.g., used in PATCH 9:

-   if (len & (zone_sectors - 1)) {
+   if (!bdev_is_zone_aligned(bdev, len)) {

I felt `bdev_is_zone_aligned` fits the use case of checking if the
sector starts at the start of a zone and also check if a given length of
sectors also align with the zone sectors. bdev_is_zone_start does not
make the intention clear for the latter use case IMO.

But I am fine with going back to bdev_is_zone_start if you and Damien
feel strongly otherwise.
The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS, but 
"zone aligned" does not. I prefer "zone start" because it is clear, 
unambiguous, and has the same meaning as in the corresponding standards 
documents. I propose to proceed as follows for checking whether a 
number of LBAs is a multiple of the zone length:

* Either use bdev_is_zone_start() directly.
* Or introduce a synonym for bdev_is_zone_start() with an appropriate 
name, e.g. bdev_is_zone_len_multiple().
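The two options above can be sketched in plain C. This is an illustrative sketch only: sector_t is typedef'ed locally and a global stands in for bdev_zone_sectors(bdev); the modulo form covers both power-of-2 and non-power-of-2 zone sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Stand-in for bdev_zone_sectors(bdev); may be non-power-of-2. */
static sector_t zone_sectors;

/* True if @sec is a "zone start LBA" in the ZBC/ZNS sense. */
static bool bdev_is_zone_start(sector_t sec)
{
	return sec % zone_sectors == 0;
}

/* Proposed synonym for checking that a length is a multiple of the zone size. */
static bool bdev_is_zone_len_multiple(sector_t len)
{
	return bdev_is_zone_start(len);
}
```

In kernel code the modulo would go through sector_div() or div64_u64_rem() to stay safe on 32-bit builds.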


Thanks,

Bart.




Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device

2022-07-27 Thread Bart Van Assche

On 7/27/22 18:52, Damien Le Moal wrote:

On 7/28/22 08:19, Bart Van Assche wrote:

On 7/27/22 09:22, Pankaj Raghav wrote:

This series adds support for npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that do not have native npo2 zone support.


Should any SCSI changes be included in this patch series? From sd_zbc.c:

if (!is_power_of_2(zone_blocks)) {
sd_printk(KERN_ERR, sdkp,
  "Zone size %llu is not a power of two.\n",
  zone_blocks);
return -EINVAL;
}


There are no non-power of 2 SMR drives on the market and no plans to have
any as far as I know. Users want power of 2 zone size. So I think it is
better to leave sd_zbc & scsi_debug as is for now.


Zoned UFS devices will support ZBC and may have a zone size that is not 
a power of two.


Thanks,

Bart.




Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size

2022-07-27 Thread Bart Van Assche

On 7/27/22 09:22, Pankaj Raghav wrote:

Checking if a given sector is aligned to a zone is a common
operation that is performed for zoned devices. Add
bdev_is_zone_start helper to check for this instead of opencoding it
everywhere.


I can't find the bdev_is_zone_start() function in this patch?


To make this work bdev_get_queue(), bdev_zone_sectors() and
bdev_is_zoned() are moved earlier without modifications.


Can that change perhaps be isolated into a separate patch?


diff --git a/block/blk-core.c b/block/blk-core.c
index 3d286a256d3d..1f7e9a90e198 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 		return BLK_STS_NOTSUPP;
 
 	/* The bio sector must point to the start of a sequential zone */
-	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
+	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
 	    !bio_zone_is_seq(bio))
 		return BLK_STS_IOERR;


The bdev_is_zone_start() name seems more clear to me than 
bdev_is_zone_aligned(). Has there already been a discussion about which 
name to use for this function?



+	/*
+	 * Non power-of-2 zone size support was added to remove the
+	 * gap between zone capacity and zone size. Though it is technically
+	 * possible to have gaps in a non power-of-2 device, Linux requires
+	 * the zone size to be equal to zone capacity for non power-of-2
+	 * zoned devices.
+	 */
+	if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+		pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
+			disk->disk_name);


Given the severity of this error, shouldn't the zone capacity and length 
be reported in the error message?


Thanks,

Bart.




Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device

2022-07-27 Thread Bart Van Assche

On 7/27/22 09:22, Pankaj Raghav wrote:

This series adds support for npo2 zoned devices in the block and nvme
layer and a new **dm target** is added: dm-po2z-target. This new
target will be initially used for filesystems such as btrfs and
f2fs that do not have native npo2 zone support.


Should any SCSI changes be included in this patch series? From sd_zbc.c:

if (!is_power_of_2(zone_blocks)) {
sd_printk(KERN_ERR, sdkp,
  "Zone size %llu is not a power of two.\n",
  zone_blocks);
return -EINVAL;
}

Thanks,

Bart.




Re: [dm-devel] [PATCH 0/3] dm-verity: optionally use tasklets in dm-verity

2022-07-22 Thread Bart Van Assche

On 7/22/22 09:41, Christoph Hellwig wrote:

We've been trying to kill off tasklets for about 15 years.  I don't
think adding new users will make you a whole lot of friends...


+1 for not using tasklets. At least in Android the real-time thread 
scheduling latency is important. Tasklets are not visible to the 
scheduler and hence cause latency spikes for real-time threads. These 
latency spikes can be observed by users, e.g. if the real-time thread is 
involved in audio playback.


Bart.




Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-29 Thread Bart Van Assche

On 6/29/22 12:05, Kent Overstreet wrote:

On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote:

On 6/29/22 11:40, Kent Overstreet wrote:

But Jens, to be blunt - I know we have different priorities in the way we write 
code.


Please stay professional in your communication and focus on the technical
issues instead of on the people involved.

BTW, I remember that some time ago I bisected a kernel bug to one of your
commits and that you refused to fix the bug introduced by that commit. I had
to spend my time on root-causing it and sending a fix upstream.


I'd be genuinely appreciative if you'd refresh my memory on what it was. Because
yeah, if I did that that was my fuckup and I want to learn from my mistakes.


I was referring to the following two conversations from May 2018:
* [PATCH] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanass...@wdc.com/)
* [PATCH v2] Revert "block: Add warning for bi_next not NULL in 
bio_endio()" 
(https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanass...@wdc.com/)


Bart.




Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API

2022-06-29 Thread Bart Van Assche

On 6/29/22 11:40, Kent Overstreet wrote:

But Jens, to be blunt - I know we have different priorities in the way we write 
code.


Please stay professional in your communication and focus on the 
technical issues instead of on the people involved.


BTW, I remember that some time ago I bisected a kernel bug to one of 
your commits and that you refused to fix the bug introduced by that 
commit. I had to spend my time on root-causing it and sending a fix 
upstream.


Bart.




Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size

2022-06-15 Thread Bart Van Assche

On 6/15/22 03:19, Pankaj Raghav wrote:

@@ -489,14 +489,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 	 * smaller last zone.
 	 */
 	if (zone->start == 0) {
-		if (zone->len == 0 || !is_power_of_2(zone->len)) {
-			pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n",
-				disk->disk_name, zone->len);
+		if (zone->len == 0) {
+			pr_warn("%s: Invalid zone size", disk->disk_name);
+			return -ENODEV;
+		}
+
+		/*
+		 * Don't allow zoned device with non power_of_2 zone size with
+		 * zone capacity less than zone size.
+		 */


Please change "power_of_2" into "power-of-2".


+		if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
+			pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
+				disk->disk_name);
 			return -ENODEV;
 		}


The above check seems wrong to me. I don't see why devices that report a 
capacity that is less than the zone size should be rejected.



+   /*
+* Division is used to calculate nr_zones for both power_of_2
+* and non power_of_2 zone sizes as it is not in the hot path.
+*/


Shouldn't the above comment be moved to the patch description? I'm not 
sure whether having such a comment in the source code is valuable.



+static inline sector_t blk_queue_offset_from_zone_start(struct request_queue *q,
+							sector_t sec)
+{
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+	u64 remainder = 0;
+
+	if (!blk_queue_is_zoned(q))
+		return false;


"return false" should only occur in functions returning a boolean. This 
function returns type sector_t.
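A corrected sketch of the helper, returning 0 rather than false for the non-zoned early return, could look as follows. This is plain C: the queue state is modeled by two globals standing in for blk_queue_is_zoned()/blk_queue_zone_sectors(), and the kernel's div64_u64_rem() is reduced to a modulo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

static sector_t queue_zone_sectors; /* stand-in for blk_queue_zone_sectors(q) */
static bool queue_is_zoned;         /* stand-in for blk_queue_is_zoned(q) */

static sector_t blk_queue_offset_from_zone_start(sector_t sec)
{
	if (!queue_is_zoned)
		return 0;	/* 0, not false: the return type is sector_t */

	/* div64_u64_rem() in the kernel; a plain modulo here */
	return sec % queue_zone_sectors;
}
```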


Thanks,

Bart.




Re: [dm-devel] [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

2022-06-15 Thread Bart Van Assche

On 6/15/22 03:19, Pankaj Raghav wrote:

Adapt the blkdev_nr_zones and blk_queue_zone_no functions so that they
can also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 5/6] block: fold blk_max_size_offset into get_max_io_size

2022-06-14 Thread Bart Van Assche

On 6/14/22 02:09, Christoph Hellwig wrote:

Now that blk_max_size_offset has a single caller left, fold it into that
and clean up the naming convention for the local variables there.


Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 4/6] block: cleanup variable naming in get_max_io_size

2022-06-14 Thread Bart Van Assche

On 6/14/22 02:09, Christoph Hellwig wrote:

get_max_io_size has a very odd choice of variable names and
initialization patterns.  Switch to more descriptive names and more
clear initialization of them.


Hmm ... what is so odd about the variable names? I have done my best to 
chose clear and descriptive names when I introduced these names.


Bart.




Re: [dm-devel] [PATCH 6/6] block: move blk_queue_get_max_sectors to blk.h

2022-06-14 Thread Bart Van Assche

On 6/14/22 02:09, Christoph Hellwig wrote:

blk_queue_get_max_sectors is private to the block layer, so move it out
of blkdev.h.

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 3/6] block: open code blk_max_size_offset in blk_rq_get_max_sectors

2022-06-14 Thread Bart Van Assche

On 6/14/22 09:43, Bart Van Assche wrote:

On 6/14/22 02:09, Christoph Hellwig wrote:

blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the
chunk_sectors argument, and already checks for max_sectors through the
call to blk_queue_get_max_sectors.  That means much of
blk_max_size_offset is not needed and open coding it simplifies the code.

Signed-off-by: Christoph Hellwig 
---
  block/blk-merge.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index db2e03c8af7f4..df003ecfbd474 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -566,17 +566,18 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
+	unsigned int max_sectors;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
+	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
-
-	return min(blk_max_size_offset(q, offset, 0),
-		   blk_queue_get_max_sectors(q, req_op(rq)));
+		return max_sectors;
+	return min(max_sectors,
+		   blk_chunk_sectors_left(offset, q->limits.chunk_sectors));
 }


blk_set_default_limits() initializes chunk_sectors to zero and 
blk_chunk_sectors_left() triggers a division by zero if a zero is passed 
as the second argument. What am I missing?


Answering my own question: I overlooked one of the return statements.

Reviewed-by: Bart Van Assche 
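For reference, the power-of-2 fast path of blk_chunk_sectors_left() that makes the overlooked early return matter can be sketched in plain C. This sketch assumes chunk_sectors is non-zero and a power of 2, which is exactly what the early return in blk_rq_get_max_sectors() guarantees before the helper is reached.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Sectors remaining until the next chunk boundary starting at @offset.
 * Only valid for a non-zero power-of-2 @chunk_sectors: the mask
 * (chunk_sectors - 1) would be wrong otherwise, and a zero value
 * would make the result meaningless.
 */
static unsigned int blk_chunk_sectors_left(sector_t offset,
					   unsigned int chunk_sectors)
{
	return chunk_sectors - (offset & (chunk_sectors - 1));
}
```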



Re: [dm-devel] [PATCH 3/6] block: open code blk_max_size_offset in blk_rq_get_max_sectors

2022-06-14 Thread Bart Van Assche

On 6/14/22 02:09, Christoph Hellwig wrote:

blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the
chunk_sectors argument, and already checks for max_sectors through the
call to blk_queue_get_max_sectors.  That means much of
blk_max_size_offset is not needed and open coding it simplifies the code.

Signed-off-by: Christoph Hellwig 
---
  block/blk-merge.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index db2e03c8af7f4..df003ecfbd474 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -566,17 +566,18 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 						  sector_t offset)
 {
 	struct request_queue *q = rq->q;
+	unsigned int max_sectors;
 
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
+	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
-		return blk_queue_get_max_sectors(q, req_op(rq));
-
-	return min(blk_max_size_offset(q, offset, 0),
-		   blk_queue_get_max_sectors(q, req_op(rq)));
+		return max_sectors;
+	return min(max_sectors,
+		   blk_chunk_sectors_left(offset, q->limits.chunk_sectors));
 }


blk_set_default_limits() initializes chunk_sectors to zero and 
blk_chunk_sectors_left() triggers a division by zero if a zero is passed 
as the second argument. What am I missing?


Thanks,

Bart.




Re: [dm-devel] [PATCH 1/6] block: factor out a chunk_size_left helper

2022-06-14 Thread Bart Van Assche

On 6/14/22 02:09, Christoph Hellwig wrote:

Factor out a helper from blk_max_size_offset so that it can be reused
independently. 

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 0/8] Use block pr_ops in LIO

2022-06-05 Thread Bart Van Assche

On 6/5/22 09:55, Mike Christie wrote:

libiscsi is not suitable for this type of setup.

I think that this setup can be tested as follows with libiscsi:
* Configure the backend storage.
* Configure LIO to use the backend storage on two different servers.
* On a third server, log in with the iSCSI initiator to both servers.
* Run the libiscsi iscsi-test-cu test software on the third server and
  pass the two IQNs that refer to the LIO servers as arguments.

From the iscsi-test-cu -h output:

iscsi-test-cu [OPTIONS]  [multipath-iscsi-url]

Did I perhaps overlook or misunderstand something?

Thanks,

Bart.




Re: [dm-devel] [PATCH 0/8] Use block pr_ops in LIO

2022-06-04 Thread Bart Van Assche

On 6/2/22 23:55, Mike Christie wrote:

The following patches were built over Linus's tree. They allow us to use
the block pr_ops with LIO's target_core_iblock module to support cluster
applications in VMs.

Currently, to use something like windows clustering in VMs with LIO and
vhost-scsi, you have to use tcmu or pscsi or use a cluster aware
FS/framework for the LIO pr file. Setting up a cluster FS/framework is
pain and waste when your real backend device is already a distributed
device, and pscsi and tcmu are nice for specific use cases, but iblock
gives you the best performance and allows you to use stacked devices
like dm-multipath. So these patches allow iblock to work like pscsi/tcmu
where they can pass a PR command to the backend module. And then iblock
will use the pr_ops to pass the PR command to the real devices similar
to what we do for unmap today.

Note that this patchset does not attempt to support every PR SCSI
feature in iblock. It has the same limitations as tcmu and pscsi where
you can have a single I_T nexus per device and only supports what is
needed for windows clustering right now.


How has this patch series been tested? Does LIO pass the libiscsi 
persistent reservation tests with this patch series applied?


Thanks,

Bart.




Re: [dm-devel] [PATCH 09/11] block, nvme: Add error for reservation conflicts.

2022-06-04 Thread Bart Van Assche

On 6/2/22 23:55, Mike Christie wrote:

BLK_STS_NEXUS is used for nvme/scsi reservation conflicts and also
general nexus failures. For the pr_ops use we want to know if an IO failed
for specifically a reservation conflict so we can report that error upwards
to a VM. This patch adds a new error code for this case and converts nvme.
The next patch converts scsi because it's a little more complicated.

Signed-off-by: Mike Christie 
---
  block/blk-core.c  | 1 +
  drivers/nvme/host/core.c  | 2 +-
  include/linux/blk_types.h | 4 
  3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bc0506772152..3908ac4a70b6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -171,6 +171,7 @@ static const struct {
 	/* zone device specific errors */
 	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
 	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
+	[BLK_STS_RSV_CONFLICT]		= { -EBADE, "resevation conflict" },

 ^^

Please fix the spelling of "reservation".

Thanks,

Bart.




Re: [dm-devel] [PATCH 03/11] scsi: Move sd_pr_type to header to share.

2022-06-04 Thread Bart Van Assche

On 6/2/22 23:55, Mike Christie wrote:

+static inline char block_pr_type_to_scsi(enum pr_type type)
+{
+   switch (type) {
+   case PR_WRITE_EXCLUSIVE:
+   return 0x01;
+   case PR_EXCLUSIVE_ACCESS:
+   return 0x03;
+   case PR_WRITE_EXCLUSIVE_REG_ONLY:
+   return 0x05;
+   case PR_EXCLUSIVE_ACCESS_REG_ONLY:
+   return 0x06;
+   case PR_WRITE_EXCLUSIVE_ALL_REGS:
+   return 0x07;
+   case PR_EXCLUSIVE_ACCESS_ALL_REGS:
+   return 0x08;
+   default:
+   return 0;
+   }
+};
+
+static inline enum pr_type scsi_pr_type_to_block(char type)
+{
+   switch (type) {
+   case 0x01:
+   return PR_WRITE_EXCLUSIVE;
+   case 0x03:
+   return PR_EXCLUSIVE_ACCESS;
+   case 0x05:
+   return PR_WRITE_EXCLUSIVE_REG_ONLY;
+   case 0x06:
+   return PR_EXCLUSIVE_ACCESS_REG_ONLY;
+   case 0x07:
+   return PR_WRITE_EXCLUSIVE_ALL_REGS;
+   case 0x08:
+   return PR_EXCLUSIVE_ACCESS_ALL_REGS;
+   default:
+   return 0;
+   }
+}


Since 'char' is used above to represent a single byte, please use u8 or 
uint8_t instead.
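With the suggested type change, the mapping round-trips cleanly. Below is a compile-checkable sketch: the pr_type enumerator values are local stand-ins for include/uapi/linux/pr.h, and u8 is typedef'ed locally.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u8;

/* Stand-in for the enumerators in include/uapi/linux/pr.h */
enum pr_type {
	PR_WRITE_EXCLUSIVE		= 1,
	PR_EXCLUSIVE_ACCESS		= 2,
	PR_WRITE_EXCLUSIVE_REG_ONLY	= 3,
	PR_EXCLUSIVE_ACCESS_REG_ONLY	= 4,
	PR_WRITE_EXCLUSIVE_ALL_REGS	= 5,
	PR_EXCLUSIVE_ACCESS_ALL_REGS	= 6,
};

/* u8 instead of char: these are raw SCSI PR TYPE field values. */
static u8 block_pr_type_to_scsi(enum pr_type type)
{
	switch (type) {
	case PR_WRITE_EXCLUSIVE:		return 0x01;
	case PR_EXCLUSIVE_ACCESS:		return 0x03;
	case PR_WRITE_EXCLUSIVE_REG_ONLY:	return 0x05;
	case PR_EXCLUSIVE_ACCESS_REG_ONLY:	return 0x06;
	case PR_WRITE_EXCLUSIVE_ALL_REGS:	return 0x07;
	case PR_EXCLUSIVE_ACCESS_ALL_REGS:	return 0x08;
	default:				return 0;
	}
}

static enum pr_type scsi_pr_type_to_block(u8 type)
{
	switch (type) {
	case 0x01: return PR_WRITE_EXCLUSIVE;
	case 0x03: return PR_EXCLUSIVE_ACCESS;
	case 0x05: return PR_WRITE_EXCLUSIVE_REG_ONLY;
	case 0x06: return PR_EXCLUSIVE_ACCESS_REG_ONLY;
	case 0x07: return PR_WRITE_EXCLUSIVE_ALL_REGS;
	case 0x08: return PR_EXCLUSIVE_ACCESS_ALL_REGS;
	default:   return (enum pr_type)0;
	}
}
```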


Thanks,

Bart.




Re: [dm-devel] [PATCH v3 10/11] null_blk: allow non power of 2 zoned devices

2022-05-12 Thread Bart Van Assche

On 5/9/22 04:56, Pankaj Raghav wrote:

Even though I am not sure if this optimization will directly add value
looking at my experiments with the current change, I can fold this in
with a comment on top of zone_size_sect_shifts variable stating that
size can be npo2 and this variable is only meaningful for the po2 size
scenario.


Have these experiments perhaps been run on an x86_64 CPU? These CPUs 
only need a single instruction to calculate ilog2(). No equivalent of 
that instruction is available on ARM CPUs as far as I know. I think the 
optimization Damien proposed will help on ARM CPUs.


Thanks,

Bart.




Re: [dm-devel] [PATCH 13/16] null_blk: allow non power of 2 zoned devices

2022-05-03 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index c441a4972064..82a62b543782 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev)
 		dev->mbps = 0;
 
 	if (dev->zoned &&
-	    (!dev->zone_size || !is_power_of_2(dev->zone_size))) {
-		pr_err("zone_size must be power-of-two\n");
+	    (!dev->zone_size)) {
+		pr_err("zone_size must not be zero\n");
 		return -EINVAL;
 	}


Please combine "if (dev->zoned &&" and "(!dev->zone_size)) {" into a 
single line and leave out the parentheses that became superfluous.
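The combined check would then read as in the sketch below, where struct nullb_device is reduced to the two fields involved and -EINVAL is written out as its numeric value so the sketch stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for struct nullb_device */
struct nullb_device {
	bool zoned;
	unsigned long zone_size;
};

static int null_validate_zone_size(const struct nullb_device *dev)
{
	/* One line, no superfluous parentheses */
	if (dev->zoned && !dev->zone_size)
		return -22;	/* -EINVAL */
	return 0;
}
```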


Thanks,

Bart.




Re: [dm-devel] [PATCH 06/16] nvmet: use blk_queue_zone_no()

2022-05-03 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

From: Luis Chamberlain 

Instead of open coding the number of zones given a sector, use the helper

 ^^
I can't parse this. Please rephrase this.


blk_queue_zone_no(). This lets us make modifications to the math if
needed in one place and now adds support for npo2 zone devices.


But since the code looks fine:

Reviewed-by: Bart Van Assche 




Re: [dm-devel] [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size

2022-05-03 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

-   sector &= ~(ns->zsze - 1);
+   sector = rounddown(sector, ns->zsze);


The above change breaks 32-bit builds since ns->zsze is 64 bits wide and 
since rounddown() uses the C division operator instead of div64_u64().
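A 32-bit-safe alternative can be sketched in plain C. Here div64_u64() is modeled with the native 64-bit division operator, which matches what the kernel's 64-bit implementation reduces to; on 32-bit kernels the real div64_u64() avoids the libgcc 64-by-64 division that the plain `/` in rounddown() would require.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint64_t sector_t;

/* Model of div64_u64(): on 64-bit this is just native division. */
static u64 div64_u64(u64 dividend, u64 divisor)
{
	return dividend / divisor;
}

/* 32-bit-safe replacement for rounddown(sector, ns->zsze) */
static sector_t round_down_to_zone(sector_t sector, sector_t zsze)
{
	return div64_u64(sector, zsze) * zsze;
}
```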


Thanks,

Bart.




Re: [dm-devel] [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze

2022-05-03 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

Adapt the blkdev_nr_zones and blk_queue_zone_no functions so that they
can also work for non-power-of-2 zone sizes.

As the existing deployments of zoned devices had power-of-2
assumption, power-of-2 optimized calculation is kept for those devices.

There are no direct hot paths modified and the changes just
introduce one new branch per call.

Reviewed-by: Luis Chamberlain 
Signed-off-by: Pankaj Raghav 
---
  block/blk-zoned.c  | 8 +++-
  include/linux/blkdev.h | 8 +++-
  2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 38cd840d8838..1dff4a8bd51d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock);
 unsigned int blkdev_nr_zones(struct gendisk *disk)
 {
 	sector_t zone_sectors = blk_queue_zone_sectors(disk->queue);
+	sector_t capacity = get_capacity(disk);
 
 	if (!blk_queue_is_zoned(disk->queue))
 		return 0;
-	return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return (capacity + zone_sectors - 1) >>
+		       ilog2(zone_sectors);
+
+	return div64_u64(capacity + zone_sectors - 1, zone_sectors);
 }
 EXPORT_SYMBOL_GPL(blkdev_nr_zones);


Does anyone need support for more than 4 billion sectors per zone? If 
not, do_div() should be sufficient.



diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..c4e4c7071b7b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
 static inline unsigned int blk_queue_zone_no(struct request_queue *q,
 					     sector_t sector)
 {
+	sector_t zone_sectors = blk_queue_zone_sectors(q);
+
 	if (!blk_queue_is_zoned(q))
 		return 0;
-	return sector >> ilog2(q->limits.chunk_sectors);
+
+	if (is_power_of_2(zone_sectors))
+		return sector >> ilog2(zone_sectors);
+
+	return div64_u64(sector, zone_sectors);
 }


Same comment here.
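The do_div() suggestion can be illustrated in plain C. In the kernel do_div() is a macro that divides a 64-bit dividend by a 32-bit divisor in place and returns the remainder, which is cheaper than a 64-by-64 div64_u64() on 32-bit architectures; it is modeled here as a function that updates the dividend through a pointer.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint64_t sector_t;
typedef uint32_t u32;

/*
 * Model of the kernel's do_div(): 64-bit dividend, 32-bit divisor,
 * divides in place and returns the remainder.
 */
static u32 do_div_model(u64 *n, u32 base)
{
	u32 rem = (u32)(*n % base);

	*n /= base;
	return rem;
}

/* blkdev_nr_zones() fallback with do_div: zone_sectors fits in 32 bits */
static unsigned int nr_zones(sector_t capacity, u32 zone_sectors)
{
	u64 n = capacity + zone_sectors - 1;

	do_div_model(&n, zone_sectors);
	return (unsigned int)n;
}
```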

Thanks,

Bart.




Re: [dm-devel] [PATCH v4 00/10] Add Copy offload support

2022-04-28 Thread Bart Van Assche

On 4/28/22 14:37, Damien Le Moal wrote:

On 4/28/22 16:49, Nitesh Shetty wrote:

On Thu, Apr 28, 2022 at 07:05:32AM +0900, Damien Le Moal wrote:

On 4/27/22 21:49, Nitesh Shetty wrote:

On Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote:

On 4/26/22 19:12, Nitesh Shetty wrote:

The patch series covers the points discussed in November 2021 virtual call
[LSF/MM/BFP TOPIC] Storage: Copy Offload[0].
We have covered the Initial agreed requirements in this patchset.
Patchset borrows Mikulas's token based approach for 2 bdev
implementation.

Overall series supports –

1. Driver
- NVMe Copy command (single NS), including support in nvme-target (for
 block and file backend)


It would also be nice to have copy offload emulation in null_blk for testing.



We can plan this in next phase of copy support, once this series settles down.


So how can people test your series ? Not a lot of drives out there with
copy support.



Yeah not many drives at present, Qemu can be used to test NVMe copy.


Upstream QEMU ? What is the command line options ? An example would be
nice. But I still think null_blk support would be easiest.


+1 for adding copy offloading support in null_blk. That enables running 
copy offloading tests without depending on Qemu.


Thanks,

Bart.



Re: [dm-devel] [PATCH 03/16] block: add bdev_zone_no helper

2022-04-27 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

+static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+
+   if (q)
+   return blk_queue_zone_no(q, sec);
+   return 0;
+}


This patch series has been split incorrectly: the same patch that 
introduces a new function should also introduce a caller to that function.


Thanks,

Bart.




Re: [dm-devel] [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper

2022-04-27 Thread Bart Van Assche

On 4/27/22 09:02, Pankaj Raghav wrote:

+static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec)
+{
+   struct request_queue *q = bdev_get_queue(bdev);
+
+   if (q)
+   return blk_queue_zone_aligned(q, sec);
+   return false;
+}


Which patch uses this function? I can't find any patch in this series 
that introduces a call to this function.


Thanks,

Bart.





Re: [dm-devel] [PATCH 01/11] blk-mq: Add blk_mq_init_queue_ops()

2022-03-22 Thread Bart Van Assche

On 3/22/22 03:39, John Garry wrote:

Add an API to allocate a request queue which accepts a custom set of
blk_mq_ops for that request queue.

The reason which we may want custom ops is for queuing requests which we
don't want to go through the normal queuing path.


Custom ops shouldn't be required for this. See e.g. how tmf_queue
is used in the UFS driver for an example of a queue implementation
with custom operations that does not require changes to the block
layer core.

Thanks,

Bart.




Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support

2022-02-03 Thread Bart Van Assche

On 2/3/22 10:50, Mikulas Patocka wrote:

On Tue, 1 Feb 2022, Bart Van Assche wrote:

On 2/1/22 10:32, Mikulas Patocka wrote:

   /**
+ * blk_queue_max_copy_sectors - set maximum copy offload sectors for the queue
+ * @q:  the request queue for the device
+ * @size:  the maximum copy offload sectors
+ */
+void blk_queue_max_copy_sectors(struct request_queue *q, unsigned int size)
+{
+	q->limits.max_copy_sectors = size;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors);


Please either change the unit of 'size' into bytes or change its type into
sector_t.


blk_queue_chunk_sectors, blk_queue_max_discard_sectors,
blk_queue_max_write_same_sectors, blk_queue_max_write_zeroes_sectors,
blk_queue_max_zone_append_sectors also have the unit of sectors and the
argument is "unsigned int". Should blk_queue_max_copy_sectors be
different?


As far as I know using the type sector_t for variables that represent a 
number of sectors is a widely followed convention:


$ git grep -w sector_t | wc -l
2575

I would appreciate it if that convention would be used consistently, 
even if that means modifying existing code.


Thanks,

Bart.




Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support

2022-02-01 Thread Bart Van Assche

On 2/1/22 10:32, Mikulas Patocka wrote:

  /**
+ * blk_queue_max_copy_sectors - set maximum copy offload sectors for the queue
+ * @q:  the request queue for the device
+ * @size:  the maximum copy offload sectors
+ */
+void blk_queue_max_copy_sectors(struct request_queue *q, unsigned int size)
+{
+   q->limits.max_copy_sectors = size;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors);


Please either change the unit of 'size' into bytes or change its type 
into sector_t.



+extern int blkdev_issue_copy(struct block_device *bdev1, sector_t sector1,
+ struct block_device *bdev2, sector_t sector2,
+ sector_t nr_sects, sector_t *copied, gfp_t gfp_mask);
+


Only supporting copying between contiguous LBA ranges seems restrictive 
to me. I expect garbage collection by filesystems for UFS devices to 
perform better if multiple LBA ranges are submitted as a single SCSI 
XCOPY command.


A general comment about the approach: encoding the LBA range information 
in a bio payload is not compatible with bio splitting. How can the dm 
driver implement copy offloading without the ability to split copy 
offload bio's?



+int blkdev_issue_copy(struct block_device *bdev1, sector_t sector1,
+ struct block_device *bdev2, sector_t sector2,
+ sector_t nr_sects, sector_t *copied, gfp_t gfp_mask)
+{
+   struct page *token;
+   sector_t m;
+   int r = 0;
+   struct completion comp;


Consider using DECLARE_COMPLETION_ONSTACK() instead of a separate 
declaration and init_completion() call.


Thanks,

Bart.




Re: [dm-devel] [RFC PATCH 2/3] nvme: add copy offload support

2022-02-01 Thread Bart Van Assche

On 2/1/22 10:33, Mikulas Patocka wrote:

+static inline blk_status_t nvme_setup_read_token(struct nvme_ns *ns, struct request *req)
+{
+	struct bio *bio = req->bio;
+	struct nvme_copy_token *token = page_to_virt(bio->bi_io_vec[0].bv_page) +
+					bio->bi_io_vec[0].bv_offset;


Hmm ... shouldn't this function use bvec_kmap_local() instead of 
page_to_virt()?
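A sketch of the suggested alternative (assuming the token fits entirely within the first bvec): unlike page_to_virt(), bvec_kmap_local() also works for highmem pages, and the mapping must be released with kunmap_local():

```c
static inline blk_status_t nvme_setup_read_token(struct nvme_ns *ns,
						 struct request *req)
{
	struct bio *bio = req->bio;
	/* bvec_kmap_local() maps the page and applies bv_offset. */
	struct nvme_copy_token *token = bvec_kmap_local(&bio->bi_io_vec[0]);

	/* ... copy the fields that are needed out of *token ... */

	kunmap_local(token);
	return BLK_STS_OK;
}
```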


Thanks,

Bart.




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2022-01-31 Thread Bart Van Assche

On 1/26/22 23:14, Chaitanya Kulkarni wrote:

[1] https://content.riscv.org/wp-content/uploads/2018/12/A-New-Golden-Age-for-Computer-Architecture-History-Challenges-and-Opportunities-David-Patterson-.pdf
[2] https://www.snia.org/computational
https://www.napatech.com/support/resources/solution-descriptions/napatech-smartnic-solution-for-hardware-offload/
https://www.eideticom.com/products.html
https://www.xilinx.com/applications/data-center/computational-storage.html
[3] git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git xcopy
[4] https://www.spinics.net/lists/linux-block/msg00599.html
[5] https://lwn.net/Articles/793585/
[6] https://nvmexpress.org/new-nvmetm-specification-defines-zoned-namespaces-zns-as-go-to-industry-technology/
[7] https://github.com/sbates130272/linux-p2pmem
[8] https://kernel.dk/io_uring.pdf


Please consider adding the following link to the above list:
https://github.com/bvanassche/linux-kernel-copy-offload

Thanks,

Bart.




Re: [dm-devel] [RFC PATCH 0/1] Add inline encryption support for dm-crypt

2022-01-13 Thread Bart Van Assche

On 1/13/22 10:09, Israel Rukshin wrote:

I am working to add support for inline encryption/decryption
at storage protocols like nvmf over RDMA. The HW that I am
working with is ConnecX-6 Dx, which supports inline crypto
and can send the data on the fabric at the same time.

This patchset is based on v5.16-rc4 with Eric Biggers patches
of the HW wrapped keys.
It can be retrieved from branch "wip-wrapped-keys" at
https://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git

I tested this patch with modified nvme-rdma as the block device
and created a DM crypt on top of it. I got performance enhancement
compared to SW crypto. I tested the HW wrapped inline mode with
the HW and the standard mode with a fallback algorithm.


Hi Israel,

Thank you for your work. For future patches related to inline 
encryption, please Cc Eric Biggers.


Thanks,

Bart.




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-11-19 Thread Bart Van Assche

On 11/19/21 02:47, Kanchan Joshi wrote:

Given the multitude of things accumulated on this topic, Martin
suggested to have a table/matrix.
Some of those should go in the initial patchset, and the remaining are
to be staged for subsequent work.
Here is the attempt to split the stuff into two buckets. Please change
if something needs to be changed below.

1. Driver
*
Initial: NVMe Copy command (single NS)
Subsequent: Multi NS copy, XCopy/Token-based Copy

2. Block layer
**
Initial:
- Block-generic copy (REQ_OP_COPY), with interface accommodating two block-devs
- Emulation, when offload is natively absent
- DM support (at least dm-linear)

3. User-interface
*
Initial: new ioctl or io_uring opcode

4. In-kernel user
**
Initial: at least one user
- dm-kcopyd user (e.g. dm-clone), or FS requiring GC (F2FS/Btrfs)

Subsequent:
- copy_file_range


Integrity support and inline encryption support are missing from the above
overview. Both are supported by the block layer. See also block/blk-integrity.c
and include/linux/blk-crypto.h. I'm not claiming that these should be supported
in the first version but I think it would be good to add these to the above
overview.

Thanks,

Bart.




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-11-17 Thread Bart Van Assche

On 11/17/21 04:53, Javier González wrote:

Thanks for sharing this. We will make sure that DM / MD are supported
and then we can cover examples. Hopefully, you guys can help with the
bits for dm-crypt to make the decision to offload when it make sense.


Will ask around to learn who should work on this.


I will update the notes to keep them alive. Maybe we can have them open
in your github page?


Feel free to submit a pull request.

Thanks,

Bart.


Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-11-16 Thread Bart Van Assche

On 11/16/21 05:43, Javier González wrote:
     - Here, we need copy emulation to support encryption 
without dealing with HW issues and garbage


Hi Javier,

Thanks very much for having taken notes and also for having shared 
these. Regarding the above comment, after the meeting I learned that the 
above is not correct. Encryption in Android is LBA independent and hence 
it should be possible to offload F2FS garbage collection in Android once 
the (UFS) storage controller supports this.


For the general case, I propose to let the dm-crypt driver decide 
whether or not to offload data copying since that driver knows whether 
or not data copying can be offloaded.


Thanks,

Bart.


Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-10-29 Thread Bart Van Assche

On 10/28/21 10:51 PM, Hannes Reinecke wrote:

Also Keith presented his work on a simple zone-based remapping block device, 
which included an in-kernel copy offload facility.
The idea is to lift that out as a standalone patch such that we can use it as a 
fallback (i.e. software) implementation if no other copy offload mechanism is available.


Is a link to the presentation available?

Thanks,

Bart.




Re: [dm-devel] [PATCH 1/5] block: factor out a chunk_size_left helper

2021-10-12 Thread Bart Van Assche

On 10/12/21 9:36 AM, Christoph Hellwig wrote:

+/*
+ * Return how much of the chunk sectors is left to be used for an I/O at the
+ * given offset.
+ */
+static inline unsigned int chunk_size_left(sector_t offset,
+   unsigned int chunk_sectors)
+{
+   if (unlikely(!is_power_of_2(chunk_sectors)))
+   return chunk_sectors - sector_div(offset, chunk_sectors);
+   return chunk_sectors - (offset & (chunk_sectors - 1));
+}


No "blk_" prefix for the function name? I think most other functions 
declared or defined in this header file have such a prefix.


Thanks,

Bart.




Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-10-06 Thread Bart Van Assche

On 10/6/21 3:05 AM, Javier González wrote:

I agree that the topic is complex. However, we have not been able to
find a clear path forward in the mailing list.


Hmm ... really? At least Martin Petersen and I consider device mapper 
support essential. How about starting from Mikulas' patch series that 
supports the device mapper? See also 
https://lore.kernel.org/all/alpine.lrh.2.02.2108171630120.30...@file01.intranet.prod.int.rdu2.redhat.com/



What do you think about joining the call to talk very specific next
steps to get a patchset that we can start reviewing in detail.


I can do that.

Thanks,

Bart.


Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

2021-09-30 Thread Bart Van Assche

On 9/28/21 12:13 PM, Javier González wrote:

Since we are not going to be able to talk about this at LSF/MM, a few of
us thought about holding a dedicated virtual discussion about Copy
Offload. I believe we can use Chaitanya's thread as a start. Given the
current state of the current patches, I would propose that we focus on
the next step to get the minimal patchset that can go upstream so that
we can build from there.

Before we try to find a date and a time that fits most of us, who would
be interested in participating?


Given the technical complexity of this topic and also that the people who are
interested live in multiple time zones, I prefer email to discuss the technical
aspects of this work. My attempt to summarize how to implement copy offloading
is available here: https://github.com/bvanassche/linux-kernel-copy-offload.
Feedback on this text is welcome.

Thanks,

Bart.


Re: [dm-devel] dm-init.c: boot up race with partitions

2021-09-23 Thread Bart Van Assche

On 9/22/21 18:33, 李平 wrote:

I encountered the same problem as this one. Is there a solution?

How about creating the verity device from inside a udev rule?

Thanks,

Bart.


Re: [dm-devel] [PATCH 3/7] block: copy offload support infrastructure

2021-08-20 Thread Bart Van Assche

On 8/20/21 3:39 AM, Kanchan Joshi wrote:

Bart, Mikulas

On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche  wrote:


On 8/17/21 3:14 AM, SelvaKumar S wrote:

Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
bio with control information as payload and submit to the device.
Larger copy operation may be divided if necessary by looking at device
limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
submitted to zoned device.
Native copy offload is not supported for stacked devices.


Using a single operation for copy-offloading instead of separate
operations for reading and writing is fundamentally incompatible with
the device mapper. I think we need a copy-offloading implementation that
is compatible with the device mapper.



While each read/write command operates on a single contiguous range of
the device, with simple-copy we get to operate on multiple discontiguous
ranges with a single command.
That seemed like a good opportunity to reduce control-plane traffic
(compared to read/write operations) as well.

With a separate read-and-write bio approach, each source-range will
spawn at least one read, one write and eventually one SCC command. And
it only gets worse as there could be many such discontiguous ranges (for
GC use-case at least) coming from user-space in a single payload.
Overall sequence will be
- Receive a payload from user-space
- Disassemble into many read-write pair bios at block-layer
- Assemble those (somehow) in NVMe to reduce simple-copy commands
- Send commands to device

We thought a payload could be a good way to reduce the
disassembly/assembly work and the traffic between the block layer and NVMe.
How do you see this tradeoff? What seems necessary for the device-mapper
use case appears to be a cost when the device mapper isn't used.
Especially for SCC (since the copy is within a single namespace),
device mappers may not be too compelling anyway.

Must device-mapper support be a requirement for the initial support on top of SCC?
Or do you think it would still be progress if we finalize the
user-space interface to cover all that is foreseeable, and add a
device-mapper-compatible transport between the block layer and NVMe
at a later stage, when NVMe comes up with better copy capabilities?


Hi Kanchan,

These days there might be more systems that run the device mapper on top 
of the NVMe driver or a SCSI driver than systems that do not use the device 
mapper. It is common practice these days to use dm-crypt on personal 
workstations and laptops. LVM (dm-linear) is popular because it is more 
flexible than a traditional partition table. Android phones use 
dm-verity on top of hardware encryption. In other words, not supporting 
the device mapper means that a very large number of use cases is 
excluded. So I think supporting the device mapper from the start is 
important, even if that means combining individual bios at the bottom of 
the storage stack into simple copy commands.


Thanks,

Bart.



