Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.
On 8/14/23 05:18, Nitesh Shetty wrote:
> On 23/08/11 02:25PM, Bart Van Assche wrote:
>> On 8/11/23 03:52, Nitesh Shetty wrote:
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 0bad62cca3d0..de0ad7a0d571 100644
>>> +static inline bool op_is_copy(blk_opf_t op)
>>> +{
>>> +	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
>>> +		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
>>> +}
>>> +
>>
>> The above function should be moved into include/linux/blk-mq.h below
>> the definition of req_op() such that it can use req_op() instead of
>> open-coding it.
>
> We use this later for dm patches(patch 9) as well, and we don't have
> request at that time.

My understanding is that include/linux/blk_types.h should only contain
data types and constants and hence that inline functions like
op_is_copy() should be moved elsewhere.

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v14 04/11] block: add emulation for copy
On 8/11/23 03:52, Nitesh Shetty wrote:
> +	schedule_work(&emulation_io->emulation_work);

schedule_work() uses system_wq. This won't work for all users since
there are no latency guarantees for system_wq.

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support
On 8/11/23 03:52, Nitesh Shetty wrote:
> +	if (rem != chunk)
> +		atomic_inc(&cio->refcount);

This code will be easier to read if the above if-test is left out and if
the following code is added below the for-loop:

	if (atomic_dec_and_test(&cio->refcount))
		blkdev_copy_endio(cio);

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.
On 8/11/23 03:52, Nitesh Shetty wrote:
> We expect caller to take a plug and send bio with source information,
> followed by bio with destination information. Once the src bio arrives
> we form a request and wait for destination bio. Upon arrival of
> destination we merge these two bio's and send corresponding request
> down to device driver.

Is the above description up-to-date? In the cover letter there is a
different description of how copy offloading works.

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 01/11] block: Introduce queue limits and sysfs for copy-offload support
On 8/11/23 03:52, Nitesh Shetty wrote:
> +/* maximum copy offload length, this is set to 128MB based on current testing */
> +#define COPY_MAX_BYTES		(1 << 27)

Since the COPY_MAX_BYTES constant is only used in source file
block/blk-settings.c it should be moved into that file. If you really
want to keep it in include/linux/blkdev.h, a BLK_ prefix should be
added.

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 00/11] Implement copy offload support
On 8/11/23 03:52, Nitesh Shetty wrote:
> We achieve copy offload by sending 2 bio's with source and destination
> info and merge them to form a request. This request is sent to driver.
> So this design works only for request based storage drivers.
> [ ... ]
> Overall series supports:
>
> 1. Driver
> 	- NVMe Copy command (single NS, TP 4065), including support
> 	  in nvme-target (for block and file back end).
>
> 2. Block layer
> 	- Block-generic copy (REQ_OP_COPY_DST/SRC), operation with
> 	  interface accommodating two block-devs
> 	- Merging copy requests in request layer
> 	- Emulation, for in-kernel user when offload is natively absent
> 	- dm-linear support (for cases not requiring split)
>
> 3. User-interface
> 	- copy_file_range

Is this sufficient? The combination of dm-crypt, dm-linear and the NVMe
driver is very common. What is the plan for supporting dm-crypt?
Shouldn't bio splitting be supported for dm-linear?

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 03/11] block: add copy offload support
On 8/11/23 03:52, Nitesh Shetty wrote:
> + * Description:
> + *	Copy source offset to destination offset within block device,
> + *	using device's native copy offload feature.

Offloading the copy operation is not guaranteed so I think that needs to
be reflected in the above comment.

> + * We perform copy operation by sending 2 bio's.
> + * 1. We take a plug and send a REQ_OP_COPY_SRC bio along with source
> + *    sector and length. Once this bio reaches request layer, we form a
> + *    request and wait for dst bio to arrive.

What will happen if the queue depth of the request queue at the bottom
is one?

> +	blk_start_plug(&plug);
> +	dst_bio = blk_next_bio(src_bio, bdev, 0, REQ_OP_COPY_DST, gfp);

blk_next_bio() can return NULL so its return value should be checked.

> +	dst_bio->bi_iter.bi_size = chunk;
> +	dst_bio->bi_iter.bi_sector = pos_out >> SECTOR_SHIFT;
> +	dst_bio->bi_end_io = blkdev_copy_offload_dst_endio;
> +	dst_bio->bi_private = offload_io;

Thanks,

Bart.
Re: [dm-devel] [PATCH v14 02/11] Add infrastructure for copy offload in block and request layer.
On 8/11/23 03:52, Nitesh Shetty wrote:
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 0bad62cca3d0..de0ad7a0d571 100644
> +static inline bool op_is_copy(blk_opf_t op)
> +{
> +	return ((op & REQ_OP_MASK) == REQ_OP_COPY_SRC ||
> +		(op & REQ_OP_MASK) == REQ_OP_COPY_DST);
> +}
> +

The above function should be moved into include/linux/blk-mq.h below the
definition of req_op() such that it can use req_op() instead of
open-coding it.

Thanks,

Bart.
Re: [dm-devel] [PATCH 01/32] block: Provide blkdev_get_handle_* functions
On 7/4/23 09:14, Matthew Wilcox wrote:
> On Tue, Jul 04, 2023 at 07:06:26AM -0700, Bart Van Assche wrote:
>> On 7/4/23 05:21, Jan Kara wrote:
>>> +struct bdev_handle {
>>> +	struct block_device *bdev;
>>> +	void *holder;
>>> +};
>>
>> Please explain in the patch description why a holder pointer is
>> introduced in struct bdev_handle and how it relates to the bd_holder
>> pointer in struct block_device. Is one of the purposes of this patch
>> series perhaps to add support for multiple holders per block device?
>
> That is all in patch 0/32. Why repeat it?

This cover letter:
https://lore.kernel.org/linux-block/20230629165206.383-1-j...@suse.cz/T/#t?
The word "holder" doesn't even occur in that cover letter so how could
the answer to my question be present in the cover letter?

Bart.
Re: [dm-devel] [PATCH 01/32] block: Provide blkdev_get_handle_* functions
On 7/4/23 05:21, Jan Kara wrote:
> +struct bdev_handle {
> +	struct block_device *bdev;
> +	void *holder;
> +};

Please explain in the patch description why a holder pointer is
introduced in struct bdev_handle and how it relates to the bd_holder
pointer in struct block_device. Is one of the purposes of this patch
series perhaps to add support for multiple holders per block device?

Thanks,

Bart.
[dm-devel] [PATCH v4 6/7] dm: Inline __dm_mq_kick_requeue_list()
Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") the function __dm_mq_kick_requeue_list() is too short to keep
it as a separate function. Hence, inline this function.

Reviewed-by: Christoph Hellwig
Cc: Damien Le Moal
Cc: Ming Lei
Cc: Mike Snitzer
Signed-off-by: Bart Van Assche
---
 drivers/md/dm-rq.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 	rq_completed(md);
 }

-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long msecs)
-{
-	blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-	__dm_mq_kick_requeue_list(md->queue, 0);
+	blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);

 static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
 	blk_mq_requeue_request(rq, false);
-	__dm_mq_kick_requeue_list(rq->q, msecs);
+	blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }

 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)
[dm-devel] [PATCH v4 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list()
Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep these as separate functions.

Acked-by: Vineeth Vijayan [ for the s390 changes ]
Cc: Christoph Hellwig
Cc: Damien Le Moal
Cc: Ming Lei
Cc: Mike Snitzer
Signed-off-by: Bart Van Assche
---
 block/blk-flush.c            |  4 ++--
 block/blk-mq-debugfs.c       |  2 +-
 block/blk-mq.c               | 15 +--
 drivers/block/ublk_drv.c     |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c           |  6 +++---
 drivers/nvme/host/core.c     |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c      |  2 +-
 include/linux/blk-mq.h       |  2 --
 10 files changed, 13 insertions(+), 29 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4bfb92f58aa9..157b86fd9ccb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -192,7 +192,7 @@ static void blk_flush_complete_seq(struct request *rq,
 		spin_lock(&hctx->requeue_lock);
 		list_add_tail(&rq->queuelist, &hctx->flush_list);
 		spin_unlock(&hctx->requeue_lock);
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 		break;

 	case REQ_FSEQ_DONE:
@@ -354,7 +354,7 @@ static void blk_kick_flush(struct blk_mq_hw_ctx *hctx,
 	list_add_tail(&flush_rq->queuelist, &hctx->flush_list);
 	spin_unlock(&hctx->requeue_lock);

-	blk_mq_kick_requeue_list(q);
+	blk_mq_run_hw_queues(q, true);
 }

 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 787bdff3cc64..76792ebab935 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -114,7 +114,7 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
 	} else if (strcmp(op, "start") == 0) {
 		blk_mq_start_stopped_hw_queues(q, true);
 	} else if (strcmp(op, "kick") == 0) {
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 	} else {
 		pr_err("%s: unsupported operation '%s'\n", __func__, op);
inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index de39984d17c4..12fd8b65b930 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1436,7 +1436,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 	spin_unlock_irqrestore(&hctx->requeue_lock, flags);

 	if (kick_requeue_list)
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);

@@ -1473,19 +1473,6 @@ static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
 	}
 }

-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-	blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-				    unsigned long msecs)
-{
-	blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
 	/*
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1c823750c95a..cddbbdc9b199 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -902,7 +902,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 	 */
 	if (unlikely(!mapped_bytes)) {
 		blk_mq_requeue_request(req, false);
-		blk_mq_delay_kick_requeue_list(req->q,
+		blk_mq_delay_run_hw_queues(req->q,
 				UBLK_REQUEUE_DELAY_MS);
 		return;
 	}
@@ -1297,7 +1297,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	/* We may have requeued some rqs in ublk_quiesce_queue() */
-	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }

 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2341,7 +2341,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	pr_devel("%s: queue unquiesced, dev id %d.\n", __func__, header->dev_id);
-	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON
Re: [dm-devel] [PATCH] blk-crypto: use dynamic lock class for blk_crypto_profile::lock
On 6/9/23 23:11, Eric Biggers wrote:
> When a device-mapper device is passing through the inline encryption
> support of an underlying device, calls to blk_crypto_evict_key() take
> the blk_crypto_profile::lock of the device-mapper device, then take the
> blk_crypto_profile::lock of the underlying device (nested). This isn't
> a real deadlock, but it causes a lockdep report because there is only
> one lock class for all instances of this lock.
>
> Lockdep subclasses don't really work here because the hierarchy of
> block devices is dynamic and could have more than 2 levels. Instead,
> register a dynamic lock class for each blk_crypto_profile, and
> associate that with the lock.

Reviewed-by: Bart Van Assche
[dm-devel] [PATCH v3 7/7] block: Inline blk_mq_{,delay_}kick_requeue_list()
Patch "block: Preserve the order of requeued requests" changed
blk_mq_kick_requeue_list() and blk_mq_delay_kick_requeue_list() into
blk_mq_run_hw_queues() and blk_mq_delay_run_hw_queues() calls
respectively. Inline blk_mq_{,delay_}kick_requeue_list() because these
functions are now too short to keep these as separate functions.

Cc: Christoph Hellwig
Cc: Damien Le Moal
Cc: Ming Lei
Cc: Mike Snitzer
Signed-off-by: Bart Van Assche
---
 block/blk-flush.c            |  4 ++--
 block/blk-mq-debugfs.c       |  2 +-
 block/blk-mq.c               | 16 +---
 drivers/block/ublk_drv.c     |  6 +++---
 drivers/block/xen-blkfront.c |  1 -
 drivers/md/dm-rq.c           |  6 +++---
 drivers/nvme/host/core.c     |  2 +-
 drivers/s390/block/scm_blk.c |  2 +-
 drivers/scsi/scsi_lib.c      |  2 +-
 include/linux/blk-mq.h       |  2 --
 10 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index dba392cf22be..22170036ddcb 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -191,7 +191,7 @@ static void blk_flush_complete_seq(struct request *rq,
 		spin_lock(&q->requeue_lock);
 		list_add_tail(&rq->queuelist, &q->flush_list);
 		spin_unlock(&q->requeue_lock);
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 		break;

 	case REQ_FSEQ_DONE:
@@ -352,7 +352,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	list_add_tail(&flush_rq->queuelist, &q->flush_list);
 	spin_unlock(&q->requeue_lock);

-	blk_mq_kick_requeue_list(q);
+	blk_mq_run_hw_queues(q, true);
 }

 static enum rq_end_io_ret mq_flush_data_end_io(struct request *rq,
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 68165a50951b..869cc62ed50f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -145,7 +145,7 @@ static ssize_t queue_state_write(void *data, const char __user *buf,
 	} else if (strcmp(op, "start") == 0) {
 		blk_mq_start_stopped_hw_queues(q, true);
 	} else if (strcmp(op, "kick") == 0) {
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 	} else {
 		pr_err("%s: unsupported operation '%s'\n", __func__, op);
inval:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 52dffdc70480..34dcfc84d902 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1430,7 +1430,7 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 	spin_unlock_irqrestore(&q->requeue_lock, flags);

 	if (kick_requeue_list)
-		blk_mq_kick_requeue_list(q);
+		blk_mq_run_hw_queues(q, true);
 }
 EXPORT_SYMBOL(blk_mq_requeue_request);

@@ -1470,19 +1470,6 @@ static void blk_mq_process_requeue_list(struct blk_mq_hw_ctx *hctx)
 		blk_mq_insert_request(rq, 0);
 }

-void blk_mq_kick_requeue_list(struct request_queue *q)
-{
-	blk_mq_run_hw_queues(q, true);
-}
-EXPORT_SYMBOL(blk_mq_kick_requeue_list);
-
-void blk_mq_delay_kick_requeue_list(struct request_queue *q,
-				    unsigned long msecs)
-{
-	blk_mq_delay_run_hw_queues(q, msecs);
-}
-EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
-
 static bool blk_mq_rq_inflight(struct request *rq, void *priv)
 {
 	/*
@@ -3537,7 +3524,6 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	list_for_each_entry_safe(rq, next, &tmp, queuelist)
 		blk_mq_requeue_request(rq, false);
-	blk_mq_kick_requeue_list(hctx->queue);
 	blk_mq_run_hw_queue(hctx, true);

 	return 0;
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 539eada32861..4a3d579a25b5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -900,7 +900,7 @@ static inline void __ublk_rq_task_work(struct request *req,
 	 */
 	if (unlikely(!mapped_bytes)) {
 		blk_mq_requeue_request(req, false);
-		blk_mq_delay_kick_requeue_list(req->q,
+		blk_mq_delay_run_hw_queues(req->q,
 				UBLK_REQUEUE_DELAY_MS);
 		return;
 	}
@@ -1290,7 +1290,7 @@ static void ublk_unquiesce_dev(struct ublk_device *ub)
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	/* We may have requeued some rqs in ublk_quiesce_queue() */
-	blk_mq_kick_requeue_list(ub->ub_disk->queue);
+	blk_mq_run_hw_queues(ub->ub_disk->queue, true);
 }

 static void ublk_stop_dev(struct ublk_device *ub)
@@ -2334,7 +2334,7 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	pr_dev
[dm-devel] [PATCH v3 6/7] dm: Inline __dm_mq_kick_requeue_list()
Since commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") the function __dm_mq_kick_requeue_list() is too short to keep
it as a separate function. Hence, inline this function.

Cc: Christoph Hellwig
Cc: Damien Le Moal
Cc: Ming Lei
Cc: Mike Snitzer
Signed-off-by: Bart Van Assche
---
 drivers/md/dm-rq.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f7e9a3632eb3..bbe1e2ea0aa4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -168,21 +168,16 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 	rq_completed(md);
 }

-static void __dm_mq_kick_requeue_list(struct request_queue *q, unsigned long msecs)
-{
-	blk_mq_delay_kick_requeue_list(q, msecs);
-}
-
 void dm_mq_kick_requeue_list(struct mapped_device *md)
 {
-	__dm_mq_kick_requeue_list(md->queue, 0);
+	blk_mq_kick_requeue_list(md->queue);
 }
 EXPORT_SYMBOL(dm_mq_kick_requeue_list);

 static void dm_mq_delay_requeue_request(struct request *rq, unsigned long msecs)
 {
 	blk_mq_requeue_request(rq, false);
-	__dm_mq_kick_requeue_list(rq->q, msecs);
+	blk_mq_delay_kick_requeue_list(rq->q, msecs);
 }

 static void dm_requeue_original_request(struct dm_rq_target_io *tio, bool delay_requeue)
Re: [dm-devel] [PATCH v4 1/4] block: Introduce provisioning primitives
On 4/18/23 15:12, Sarthak Kukreti wrote:
> 	/* Fail if we don't recognize the flags. */
> -	if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> +	if (mode != 0 && mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
> 		return -EOPNOTSUPP;

Is this change necessary? Doesn't
(mode & ~BLKDEV_FALLOC_FL_SUPPORTED) != 0 imply that mode != 0?

Thanks,

Bart.
Re: [dm-devel] [PATCH v6 04/18] scsi: Move sd_pr_type to scsi_common
On 4/7/23 13:05, Mike Christie wrote:
> LIO is going to want to do the same block to/from SCSI pr types as sd.c
> so this moves the sd_pr_type helper to scsi_common and renames it. The
> next patch will then also add a helper to go from the SCSI value to the
> block one for use with PERSISTENT_RESERVE_IN commands.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v5 03/18] scsi: Rename sd_pr_command
On 3/24/23 11:17, Mike Christie wrote:
> Rename sd_pr_command to sd_pr_out_command to match a sd_pr_in_command
> helper added in the next patches.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v5 02/18] block: Rename BLK_STS_NEXUS to BLK_STS_RESV_CONFLICT
On 3/24/23 11:17, Mike Christie wrote:
> BLK_STS_NEXUS is used for NVMe/SCSI reservation conflicts and DASD's
> locking feature which works similar to NVMe/SCSI reservations where a
> host can get a lock on a device and when the lock is taken it will get
> failures.
>
> This patch renames BLK_STS_NEXUS so it better reflects this type of
> use.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v5 01/18] block: Add PR callouts for read keys and reservation
On 3/24/23 11:17, Mike Christie wrote:
> Add callouts for reading keys and reservations. This allows LIO to
> support the READ_KEYS and READ_RESERVATION commands and will allow
> dm-multipath to optimize it's error handling so it can check if it's
> getting an error because there's an existing reservation or if we need
> to retry different paths.
>
> Note: This only initially adds the struct definitions in the kernel as
> I'm not sure if we wanted to export the interface to userspace yet.
> read_keys and read_reservation are exactly what dm-multipath and LIO
> need, but for a userspace interface we may want something like SCSI's
> READ_FULL_STATUS and NVMe's report reservation commands. Those are
> overkill for dm/LIO and READ_FULL_STATUS is sometimes broken for SCSI
> devices.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v5 04/18] scsi: Move sd_pr_type to header to share
On 3/24/23 11:17, Mike Christie wrote:
> diff --git a/include/scsi/scsi_block_pr.h b/include/scsi/scsi_block_pr.h
> new file mode 100644
> index ..44766d7a81d8
> --- /dev/null
> +++ b/include/scsi/scsi_block_pr.h
> @@ -0,0 +1,36 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _SCSI_BLOCK_PR_H
> +#define _SCSI_BLOCK_PR_H
> +
> +#include
> +
> +enum scsi_pr_type {
> +	SCSI_PR_WRITE_EXCLUSIVE			= 0x01,
> +	SCSI_PR_EXCLUSIVE_ACCESS		= 0x03,
> +	SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY	= 0x05,
> +	SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY	= 0x06,
> +	SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS	= 0x07,
> +	SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS	= 0x08,
> +};
> +
> +static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
> +{
> +	switch (type) {
> +	case PR_WRITE_EXCLUSIVE:
> +		return SCSI_PR_WRITE_EXCLUSIVE;
> +	case PR_EXCLUSIVE_ACCESS:
> +		return SCSI_PR_EXCLUSIVE_ACCESS;
> +	case PR_WRITE_EXCLUSIVE_REG_ONLY:
> +		return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
> +	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
> +		return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
> +	case PR_WRITE_EXCLUSIVE_ALL_REGS:
> +		return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
> +	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
> +		return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
> +	}
> +
> +	return 0;
> +}
> +
> +#endif

Hi Mike,

Has it been considered to move enum scsi_pr_type into
include/scsi/scsi_common.h and block_pr_type_to_scsi() into
drivers/scsi/scsi_common.c? Other definitions that are shared between
SCSI initiator and SCSI target code exist in these files.

Thanks,

Bart.
Re: [dm-devel] dm-crypt: fix softlockup in dmcrypt_write
On 3/6/23 10:02, Mikulas Patocka wrote:
> On Tue, 28 Feb 2023, yangerkun wrote:
>> It's ok to fix the softlockup, but for async write encrypt,
>> kcryptd_crypt_write_io_submit will add bio to write_tree, and once we
>> call cond_resched before every kcryptd_io_write, the write performance
>> may be poor while we meet a high cpu usage scene.
>
> Hi
>
> To fix this problem, find the PID of the process "dmcrypt_write" and
> change its priority to -20, for example "renice -n -20 -p 34748". This
> is the proper way how to fix it; locking up the process for one second
> is not.
>
> We used to have high-priority workqueues by default, but it caused
> audio playback skipping, so we had to revert it - see
> f612b2132db529feac4f965f28a1b9258ea7c22b. Perhaps we should add an
> option to have high-priority kernel threads?

Would calling cond_resched() every n iterations instead of every
iteration help? From mm/swapfile.c:

	if (unlikely(--latency_ration < 0)) {
		cond_resched();
		latency_ration = LATENCY_LIMIT;
	}

Thanks,

Bart.
Re: [dm-devel] [PATCH] dm-crypt: fix softlockup in dmcrypt_write
On 2/22/23 19:19, yangerkun wrote:
> @@ -1924,6 +1926,10 @@ static int dmcrypt_write(void *data)
> 		BUG_ON(rb_parent(write_tree.rb_node));
>
> +		if (time_is_before_jiffies(start_time + HZ)) {
> +			schedule();
> +			start_time = jiffies;
> +		}

Why schedule() instead of cond_resched()?

Thanks,

Bart.
Re: [dm-devel] [PATCH 0/9] Documentation: correct lots of spelling errors (series 2)
On 2/2/23 10:33, Randy Dunlap wrote:
> On 2/2/23 10:09, Jonathan Corbet wrote:
>> Randy Dunlap writes:
>>> [PATCH 6/9] Documentation: scsi/ChangeLog*: correct spelling
>>> [PATCH 7/9] Documentation: scsi: correct spelling
>>
>> I've left these for the SCSI folks for now. Do we *really* want to be
>> fixing spelling in ChangeLog files from almost 20 years ago?
>
> That's why I made it a separate patch -- so the SCSI folks can decide
> that...

How about removing the Documentation/scsi/ChangeLog.* files? I'm not
sure these changelogs are still useful since these duplicate information
that is already available in the output of git log ${driver_directory}.

Thanks,

Bart.
Re: [dm-devel] [PATCH 3/7] nvmet: introduce bdev_zone_no helper
On 1/6/23 00:33, Pankaj Raghav wrote:
> A generic bdev_zone_no() helper is added to calculate zone number for a
> given sector in a block device. This helper internally uses
> disk_zone_no() to find the zone number. Use the helper bdev_zone_no()
> to calculate nr of zones. This let's us make modifications to the math
> if needed in one place.

Please use the imperative mood in patch descriptions (... is added ->
add ...).

let's -> lets

If these two issues are addressed, please add:

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH 2/7] block: add a new helper bdev_{is_zone_start, offset_from_zone_start}
On 1/6/23 00:33, Pankaj Raghav wrote:
> +static inline sector_t bdev_offset_from_zone_start(struct block_device *bdev,
> +						   sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return 0;
> +
> +	return sec & (bdev_zone_sectors(bdev) - 1);
> +}
> +
> +static inline bool bdev_is_zone_start(struct block_device *bdev,
> +				      sector_t sec)
> +{
> +	if (!bdev_is_zoned(bdev))
> +		return false;
> +
> +	return bdev_offset_from_zone_start(bdev, sec) == 0;
> +}

A nit: 'sector_t sector' is much more common in the block layer than
'sector_t sec'. Please consider changing 'sec' into 'sector'.

Thanks,

Bart.
Re: [dm-devel] [PATCH 1/7] block: remove superfluous check for request queue in bdev_is_zoned
On 1/6/23 00:33, Pankaj Raghav wrote:
> Remove the superfluous request queue check in bdev_is_zoned() as
> the bdev_get_queue can never return NULL.

the bdev_get_queue -> bdev_get_queue()

Anyway:

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH] block: remove bio_set_op_attrs
On 12/6/22 06:40, Christoph Hellwig wrote:
> This macro is obsolete, so replace the last few uses with open coded
> bi_opf assignments.

For the entire patch:

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v3] virtio_blk: add VIRTIO_BLK_F_LIFETIME feature support
On 12/5/22 08:20, Alvaro Karsz wrote:
> +/* Get lifetime information struct for each request */
> +struct virtio_blk_lifetime {
> +	/*
> +	 * specifies the percentage of reserved blocks that are consumed.
> +	 * optional values following virtio spec:
> +	 * 0 - undefined
> +	 * 1 - normal, < 80% of reserved blocks are consumed
> +	 * 2 - warning, 80% of reserved blocks are consumed
> +	 * 3 - urgent, 90% of reserved blocks are consumed
> +	 */
> +	__le16 pre_eol_info;
> +	/*
> +	 * this field refers to wear of SLC cells and is provided in
> +	 * increments of 10used, and so on, thru to 11 meaning estimated
> +	 * lifetime exceeded. All values above 11 are reserved
> +	 */
> +	__le16 device_lifetime_est_typ_a;
> +	/*
> +	 * this field refers to wear of MLC cells and is provided with the
> +	 * same semantics as device_lifetime_est_typ_a
> +	 */
> +	__le16 device_lifetime_est_typ_b;
> +};

Why does the above data structure only refer to SLC and MLC but not to
TLC or QLC? How will this data structure be extended without having to
introduce a new ioctl?

Thanks,

Bart.
Re: [dm-devel] [PATCH v3 03/19] scsi: Move sd_pr_type to header to share
On 11/2/22 19:13, Mike Christie wrote:
> On 11/2/22 5:47 PM, Bart Van Assche wrote:
>> On 10/26/22 16:19, Mike Christie wrote:
>>> +static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
>>> +{
>>> +	switch (type) {
>>> +	case PR_WRITE_EXCLUSIVE:
>>> +		return SCSI_PR_WRITE_EXCLUSIVE;
>>> +	case PR_EXCLUSIVE_ACCESS:
>>> +		return SCSI_PR_EXCLUSIVE_ACCESS;
>>> +	case PR_WRITE_EXCLUSIVE_REG_ONLY:
>>> +		return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
>>> +	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
>>> +		return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
>>> +	case PR_WRITE_EXCLUSIVE_ALL_REGS:
>>> +		return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
>>> +	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
>>> +		return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
>>> +	default:
>>> +		return 0;
>>> +	}
>>> +};
>>
>> Please leave out "default: return 0;" from the switch statement and
>> add "return 0;" after the switch statement. That will make the
>> compiler emit a warning if a value is added in enum pr_type but not in
>> the above function.
>
> Did you want that compiler warning functionality in the future code or
> are you asking me to do the above only if we go the switch based route?
>
> Chaitanya requested me to make this array based in nvme/scsi. For this
> approach, I can add a WARN or printk warning if the pr_type passed in
> does not match a value in the array. I don't think I can do a compiler
> warning though.
>
> I didn't care, but I think the compiler warning route might be better
> though.

Hi Mike,

Although I do not have a strong opinion about this, keeping the switch
statement and moving the return statement out of the switch statement
has the advantage that the compiler will warn if the switch statement is
incomplete. I don't think that a similar warning would be emitted if the
switch statement would be converted into an array lookup.

Thanks,

Bart.
Re: [dm-devel] [PATCH v3 04/19] scsi: Add support for block PR read keys/reservation
On 10/26/22 16:19, Mike Christie wrote:
> +static inline enum pr_type scsi_pr_type_to_block(enum scsi_pr_type type)
> +{
> +	switch (type) {
> +	case SCSI_PR_WRITE_EXCLUSIVE:
> +		return PR_WRITE_EXCLUSIVE;
> +	case SCSI_PR_EXCLUSIVE_ACCESS:
> +		return PR_EXCLUSIVE_ACCESS;
> +	case SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY:
> +		return PR_WRITE_EXCLUSIVE_REG_ONLY;
> +	case SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY:
> +		return PR_EXCLUSIVE_ACCESS_REG_ONLY;
> +	case SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS:
> +		return PR_WRITE_EXCLUSIVE_ALL_REGS;
> +	case SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS:
> +		return PR_EXCLUSIVE_ACCESS_ALL_REGS;
> +	default:
> +		return 0;
> +	}
> +}

Also for this function, how about moving the "return 0" statement out of
the switch statement?

Thanks,

Bart.
Re: [dm-devel] [PATCH v3 01/19] block: Add PR callouts for read keys and reservation
On 10/26/22 16:19, Mike Christie wrote:
> +struct pr_keys {
> +	u32	generation;
> +	u32	num_keys;
> +	u64	keys[];
> +};

Is my understanding correct that keys[] is treated as opaque data by the
kernel? If so, is it necessary to convert the persistent reservation keys
from big endian to CPU endianness? Some SCSI stacks keep reservation keys
in __be64 format.

Thanks,

Bart.
Re: [dm-devel] [PATCH v3 01/19] block: Add PR callouts for read keys and reservation
On 10/26/22 16:19, Mike Christie wrote:
> +struct pr_keys {
> +	u32	generation;
> +	u32	num_keys;
> +	u64	keys[];
> +};
> +
> +struct pr_held_reservation {
> +	u64	key;
> +	u32	generation;
> +	enum pr_type	type;
> +};
> +
>  struct pr_ops {
>  	int (*pr_register)(struct block_device *bdev, u64 old_key,
>  			u64 new_key, u32 flags);
> @@ -14,6 +26,18 @@ struct pr_ops {
>  	int (*pr_preempt)(struct block_device *bdev, u64 old_key,
>  			u64 new_key, enum pr_type type, bool abort);
>  	int (*pr_clear)(struct block_device *bdev, u64 key);
> +	/*
> +	 * pr_read_keys - Read the registered keys and return them in the
> +	 * pr_keys->keys array. The keys array will have been allocated at the
> +	 * end of the pr_keys struct and is keys_len bytes. If there are more
> +	 * keys than can fit in the array, success will still be returned and
> +	 * pr_keys->num_keys will reflect the total number of keys the device
> +	 * contains, so the caller can retry with a larger array.
> +	 */
> +	int (*pr_read_keys)(struct block_device *bdev,
> +			struct pr_keys *keys_info, u32 keys_len);
> +	int (*pr_read_reservation)(struct block_device *bdev,
> +			struct pr_held_reservation *rsv);
>  };

Is there any pr_read_keys() implementation that won't have to divide
@keys_len by 8? How about leaving out that argument and making callers
store the number of elements in the keys[] array in the num_keys member
before calling pr_read_keys()?

Thanks,

Bart.
Re: [dm-devel] [PATCH v3 03/19] scsi: Move sd_pr_type to header to share
On 10/26/22 16:19, Mike Christie wrote:
> +static inline enum scsi_pr_type block_pr_type_to_scsi(enum pr_type type)
> +{
> +	switch (type) {
> +	case PR_WRITE_EXCLUSIVE:
> +		return SCSI_PR_WRITE_EXCLUSIVE;
> +	case PR_EXCLUSIVE_ACCESS:
> +		return SCSI_PR_EXCLUSIVE_ACCESS;
> +	case PR_WRITE_EXCLUSIVE_REG_ONLY:
> +		return SCSI_PR_WRITE_EXCLUSIVE_REG_ONLY;
> +	case PR_EXCLUSIVE_ACCESS_REG_ONLY:
> +		return SCSI_PR_EXCLUSIVE_ACCESS_REG_ONLY;
> +	case PR_WRITE_EXCLUSIVE_ALL_REGS:
> +		return SCSI_PR_WRITE_EXCLUSIVE_ALL_REGS;
> +	case PR_EXCLUSIVE_ACCESS_ALL_REGS:
> +		return SCSI_PR_EXCLUSIVE_ACCESS_ALL_REGS;
> +	default:
> +		return 0;
> +	}
> +};

Please leave out "default: return 0;" from the switch statement and add
"return 0;" after the switch statement. That will make the compiler emit a
warning if a value is added in enum pr_type but not in the above function.

Thanks,

Bart.
Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes
On 9/30/22 14:24, Jens Axboe wrote:
> Noted. I'll find some time to review this as well separately, once we're
> on the other side of the merge window.

Hi Jens,

Now that we are on the other side of the merge window: do you perhaps want
Pankaj to repost this patch series? From what I have heard in several fora
(JEDEC, SNIA) all flash storage vendors except one (WDC) are in favor of a
contiguous LBA space and hence are in favor of supporting zone sizes that
are not a power of two. As you may know, in JEDEC we are working on
standardizing zoned storage for UFS devices. We (JEDEC JC-64.1 committee
members) would like to know whether or not we should require that the UFS
zone size should be a power of two.

Thank you,

Bart.
Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes
On 9/30/22 17:45, Damien Le Moal wrote:
> On 10/1/22 04:38, Bart Van Assche wrote:
>> Since this has not been mentioned in the cover letter, I want to add
>> that in the near future we will need these patches for Android devices.
>> JEDEC is working on supporting zoned storage for UFS devices, the
>> storage devices used in all modern Android phones. Although it would be
>> possible to make the offset between zone starts a power of two by
>> inserting gap zones between data zones, UFS vendors asked not to do this
>> and hence need support for zone sizes that are not a power of two. An
>> advantage of not having to deal with gap zones is better filesystem
>> performance since filesystem extents cannot span gap zones. Having to
>> split filesystem extents because of gap zones reduces filesystem
>> performance.
>
> As mentioned many times, my opinion is that a good implementation should
> *not* have any extent span zone boundaries. So personally, I do not
> consider such argument as a valid justification for the non-power-of-2
> zone size support.

Hi Damien,

Although the filesystem extent issue probably can be solved in software,
the argument that UFS vendors strongly prefer not to have gap zones and
hence need support for zone sizes that are not a power of two remains.

Thanks,

Bart.
Re: [dm-devel] [PATCH v15 00/13] support zoned block devices with non-power-of-2 zone sizes
On 9/30/22 08:13, Jens Axboe wrote:
> On 9/29/22 12:31 AM, Pankaj Raghav wrote:
>> Hi Jens,
>>
>> Please consider this patch series for the 6.1 release.
>>
>> Hi Jens, Christoph, and Keith,
>>
>> All the patches have a Reviewed-by tag at this point. Can we queue this
>> up for 6.1?
>
> It's getting pretty late for 6.1 and I'd really like to have both
> Christoph and Martin sign off on these changes.

Hi Jens,

Agreed that it's getting late for 6.1. Since this has not been mentioned in
the cover letter, I want to add that in the near future we will need these
patches for Android devices. JEDEC is working on supporting zoned storage
for UFS devices, the storage devices used in all modern Android phones.
Although it would be possible to make the offset between zone starts a
power of two by inserting gap zones between data zones, UFS vendors asked
not to do this and hence need support for zone sizes that are not a power
of two. An advantage of not having to deal with gap zones is better
filesystem performance since filesystem extents cannot span gap zones.
Having to split filesystem extents because of gap zones reduces filesystem
performance.

Thanks,

Bart.
Re: [dm-devel] [PATCH v15 11/13] dm: call dm_zone_endio after the target endio callback for zoned devices
On 9/23/22 10:36, Pankaj Raghav wrote:
> dm_zone_endio() updates the bi_sector of the orig bio for zoned devices
> that use either native append or append emulation, and it is called
> before the endio of the target. But the target endio can still update the
> clone bio after dm_zone_endio is called; thereby the orig bio no longer
> contains the updated information. Currently this is not a problem, as the
> targets that support zoned devices such as dm-zoned, dm-linear and
> dm-crypt do not have an endio function, and even if they do (such as
> dm-flakey), they don't modify the bio->bi_iter.bi_sector of the cloned
> bio that is used to update the orig_bio's bi_sector in the dm_zone_endio
> function.
>
> This is a prep patch for the new dm-po2zoned target as it modifies
> bi_sector in the endio callback.
>
> Call dm_zone_endio for zoned devices after calling the target's endio
> function.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v15 07/13] zonefs: allow non power of 2 zoned devices
On 9/23/22 10:36, Pankaj Raghav wrote:
> The zone size shift variable is useful only if the zone sizes are known
> to be a power of 2. Remove that variable and use generic helpers from the
> block layer to calculate the zone index in zonefs.

Reviewed-by: Bart Van Assche
Re: [dm-devel] Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
On 9/22/22 23:29, Matias Bjørling wrote:
> With UFS, in the proposed copy I have (may have been changed), there's
> the concept of gap zones: zones that cannot be accessed by the host. The
> gap zones are essentially "LBA fillers", enabling the next writeable zone
> to start at an X * pow2 size offset. My understanding is that this
> specific approach was chosen to simplify standardization in UFS and avoid
> updating T10's ZBC with zone capacity support.
>
> While UFS would technically expose non-power of 2 zone sizes, they could,
> due to the gap zones, also be considered power of 2 zones if one
> considers the seq. write zone + the gap zone as a single unit.
>
> When I think about having UFS support in the kernel, the SWR zone and the
> gap zone could be represented as a single unit. For example:
>
> UFS - Zone Report
>   Zone 0: SWR, LBA 0-11
>   Zone 1: Gap, LBA 12-15
>   Zone 2: SWR, LBA 16-27
>   Zone 3: Gap, LBA 28-31
>   ...
>
> Kernel representation - Zone Report (as supported today)
>   Zone 0: SWR, LBA 0-15, Zone Capacity 12
>   Zone 1: SWR, LBA 16-31, Zone Capacity 12
>   ...
>
> If done this way, it removes the need for filesystems, device mappers and
> user-space applications to understand gap zones, and allows UFS to work
> out of the box with no changes to the rest of the zoned storage
> ecosystem. Has the above representation been considered?

Hi Matias,

What has been described above is the approach from the first version of the
zoned storage for UFS (ZUFS) draft standard. Support for this approach is
available in the upstream kernel. See also "[PATCH v2 0/9] Support zoned
devices with gap zones", 2022-04-21
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanass...@acm.org/).

Since F2FS extents must be split at gap zones, gap zones negatively affect
sequential read and write performance. So we abandoned the gap zone
approach. The current approach is as follows:

* The power-of-two restriction for the offset between zone starts has been
  removed. Gap zones are no longer required. Hence, we will need the
  patches that add support for zone sizes that are not a power of two.
* The Sequential Write Required (SWR) and Sequential Write Preferred (SWP)
  zone types are supported. The feedback we received from UFS vendors is
  that which zone type works best depends on their firmware and ASIC
  design.
* We need a queue depth larger than one (QD > 1) for writes to achieve the
  full sequential write bandwidth. We plan to support QD > 1 as follows:
  - If writes have to be serialized, submit these to the same hardware
    queue. According to the UFS host controller interface (UFSHCI)
    standard, UFS host controllers are not allowed to reorder SCSI
    commands that are submitted to the same hardware queue. A source of
    command reordering that remains is the SCSI retry mechanism. Retries
    happen e.g. after a command timeout.
  - For SWP zones, require the UFS device firmware to use its garbage
    collection mechanism to reorder data in the unlikely case that
    out-of-order writes happened.
  - For SWR zones, retry writes that failed because these were received
    out-of-order by a UFS device. ZBC-1 requires compliant devices to
    respond with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND to out-of-order
    writes.

We have considered the zone append approach but decided not to use it
because if zone append commands get reordered, the data ends up permanently
out-of-order on the storage medium. This affects sequential read
performance negatively.

Bart.
Re: [dm-devel] Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
On 9/21/22 16:55, Damien Le Moal wrote:
> But again, that all depends on if Pankaj's patch series is accepted, that
> is, on everybody accepting that we lift the power-of-2 zone size
> constraint.

The companies that are busy with implementing zoned storage for UFS devices
are asking for kernel support for non-power-of-2 zone sizes.

Thanks,

Bart.
Re: [dm-devel] [PATCH RFC 0/8] Introduce provisioning primitives for thinly provisioned storage
On 9/16/22 11:48, Sarthak Kukreti wrote:
> Yes. On ChromiumOS, we regularly deal with storage devices that don't
> support WRITE_ZEROES or that need to have it disabled, via a quirk, due
> to a bug in the vendor's implementation. Using WRITE_ZEROES for
> allocation makes the allocation path quite slow for such devices (not to
> mention the effect on storage lifetime), so having a separate
> provisioning construct is very appealing. Even for devices that do
> support an efficient WRITE_ZEROES implementation but don't support
> logical provisioning per se, I suppose that the allocation path might be
> a bit faster (the device driver's request queue would report
> 'max_provision_sectors'=0 and the request would be short-circuited
> there), although I haven't benchmarked the difference.

Some background information about why ChromiumOS uses thin provisioning
instead of a single filesystem across the entire storage device would be
welcome. Although UFS devices support thin provisioning, I am not aware of
any use cases in Android that would benefit from UFS thin provisioning
support.

Thanks,

Bart.
Re: [dm-devel] [PATCH v12 09/13] dm-zone: use generic helpers to calculate offset from zone start
On 8/23/22 05:18, Pankaj Raghav wrote:
> Use the bdev_offset_from_zone_start() helper function to calculate the
> offset from the zone start instead of using a power-of-2 based
> calculation.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v12 08/13] dm-zoned: ensure only power of 2 zone sizes are allowed
On 8/23/22 05:18, Pankaj Raghav wrote:
> From: Luis Chamberlain
>
> dm-zoned relies on the assumption that the zone size is a power of 2
> (po2) and that the zone capacity is the same as the zone size. Ensure
> that only po2 devices can be used as a dm-zoned target until native
> support for zoned devices with a non-po2 zone size is added.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v12 06/13] null_blk: allow zoned devices with non power-of-2 zone sizes
On 8/23/22 05:18, Pankaj Raghav wrote:
> Convert the power-of-2 (po2) based calculation with the zone size to be
> generic in null_zone_no, with an optimization for po2 zone sizes. The
> nr_zones calculation in null_init_zoned_dev has been replaced with a
> division without special handling for po2 zone sizes, as this function is
> called only during initialization and will not be invoked in the hot
> path.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v12 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
On 8/23/22 05:18, Pankaj Raghav wrote:
> Remove the condition that disallows updating ZNS drives with a
> non-power-of-2 zone size and use a generic method to calculate the number
> of zones instead of relying on log- and shift-based calculation on the
> zone size. The power-of-2 calculation has been replaced directly with a
> generic calculation without special handling. Both modified functions are
> not used in hot paths; they are only used during initialization and
> revalidation of the ZNS device. As the rounddown macro from math.h does
> not work for 32-bit architectures, the round-down operation is open
> coded.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v12 02/13] block:rearrange bdev_{is_zoned, zone_sectors, get_queue} helpers in blkdev.h
On 8/23/22 05:18, Pankaj Raghav wrote:
> Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
> in the blkdev.h include file. Simplify bdev_is_zoned() by removing the
> superfluous NULL check for the request queue while we are at it. This
> commit has no functional change, and it is a prep patch for allowing
> zoned devices with non-power-of-2 zone sizes in the block layer.

It seems like a space is missing in the patch subject after the colon?
Anyway:

Reviewed-by: Bart Van Assche
Re: [dm-devel] [ANNOUNCE] CFP: Zoned Storage Microconference - Linux Plumbers Conference 2022
On 5/22/22 15:01, Adam Manzanares wrote:
> Zoned storage devices (SMR HDDs and ZNS SSDs) have demonstrated that they
> can improve storage capacity, throughput and latency over conventional
> storage devices for many workloads. Zoned storage technology is deployed
> at scale in some of the largest data centers in the world. There is
> already a well-established set of storage vendors with increasing device
> availability, and a mature software foundation for interacting with zoned
> storage devices is available. Zoned storage software support is evolving
> and there is room for increased filesystem support and additional
> userspace applications.
>
> The Zoned Storage microconference focuses on evolving the Linux zoned
> storage ecosystem by improving kernel support, file systems, and
> applications. In addition, the forum allows us to open the discussion to
> incorporate and grow the zoned storage community, making sure to meet
> everyone's needs and expectations. Finally, it is an excellent
> opportunity for anyone interested in zoned storage devices to meet and
> discuss how we can move the ecosystem forward together.

Hi Adam,

On https://lpc.events/event/16/contributions/1147/ I see four speakers but
no agenda. Will an agenda be added before the microconference starts?

Thanks,

Bart.
Re: [dm-devel] [PATCH v2 12/20] block, nvme, scsi, dm: Add blk_status to pr_ops callouts.
On 8/9/22 11:08, Mike Christie wrote:
> On 8/9/22 2:21 AM, Christoph Hellwig wrote:
>> On Mon, Aug 08, 2022 at 07:04:11PM -0500, Mike Christie wrote:
>>> To handle both cases, this patch adds a blk_status_t arg to the pr_ops
>>> callouts. The lower levels will convert their device-specific error to
>>> the blk_status_t, then the upper levels can easily check that code
>>> without knowing the device type. It also allows us to keep userspace
>>> compat, where it expects a negative -Exyz error code if the command
>>> fails before it's sent to the device, or a device/transport-specific
>>> value if the error is > 0.
>>
>> Why do we need two return values here?
>
> I know the 2 return values are gross :) I can do it in one, but I wasn't
> sure what's worse. See below for the other possible solutions. I think
> they are all bad.
>
> 0. Convert the device-specific conflict error to -EBADE and back:
>
> sd_pr_command()
> 	...
> 	/* would add a similar check for NVME_SC_RESERVATION_CONFLICT in nvme */
> 	if (result == SAM_STAT_CHECK_CONDITION)
> 		return -EBADE;
> 	else
> 		return result;
>
> LIO then just checks for -EBADE, but when going to userspace we have to
> convert:
>
> blkdev_pr_register()
> 	...
> 	result = ops->pr_register()
> 	if (result < 0) {
> 		/* For compat we must convert back to the nvme/scsi code */
> 		if (result == -EBADE) {
> 			/* need some helper for this that calls down the stack */
> 			if (bdev == SCSI)
> 				return SAM_STAT_RESERVATION_CONFLICT;
> 			else
> 				return NVME_SC_RESERVATION_CONFLICT;
> 		} else
> 			return blk_status_to_str(result);
> 	} else
> 		return result;
>
> The conversion is kind of gross and I was thinking in the future it's
> going to get worse. I'm going to want more advanced error handling in LIO
> and dm-multipath. dm-multipath wants to know if a pr_op failed because of
> a path failure, so it can retry on another path, or because of a hard
> device/target error. It would be nice for LIO if, when a PGR had
> bad/illegal values and the device returned an error, I could detect that.
>
> 1. Drop the -Exyz error type and use blk_status_t in the kernel:
>
> sd_pr_command()
> 	...
> 	if (result < 0)
> 		return -errno_to_blk_status(result);
> 	else if (result == SAM_STAT_CHECK_CONDITION)
> 		return -BLK_STS_NEXUS;
> 	else
> 		return result;
>
> blkdev_pr_register()
> 	...
> 	result = ops->pr_register()
> 	if (result < 0) {
> 		/* For compat we must convert back to the nvme/scsi code */
> 		if (result == -BLK_STS_NEXUS) {
> 			/* need some helper for this that calls down the stack */
> 			if (bdev == SCSI)
> 				return SAM_STAT_RESERVATION_CONFLICT;
> 			else
> 				return NVME_SC_RESERVATION_CONFLICT;
> 		} else
> 			return blk_status_to_str(result);
> 	} else
> 		return result;
>
> This has similar issues as #0, where we have to convert before returning
> to userspace. Note: in this case, if the block layer uses an -Exyz error
> code there's no BLK_STS for, then we would return -EIO to userspace now.
> I was thinking that might not be OK, but I could also just add a BLK_STS
> error code for errors like EINVAL, EWOULDBLOCK, ENOMEM, etc. so that
> doesn't happen.
>
> 2. We could do something like below, where the low levels are not changed
> but the caller converts:
>
> sd_pr_command()
> 	/* no changes */
>
> lio()
> 	result = ops->pr_register()
> 	if (result > 0) {
> 		/* add some stacked helper again that goes through dm and
> 		 * to the low level device */
> 		if (bdev == SCSI)
> 			result = scsi_result_to_blk_status(result);
> 		else
> 			result = nvme_error_status(result);
>
> This looks simple, but it felt wrong having upper layers know the device
> type and call conversion functions.

Has it been considered to introduce a new enumeration type instead of
choosing (0), (1) or (2)?

Thanks,

Bart.
Re: [dm-devel] [PATCH v2 14/20] scsi: Retry pr_ops commands if a UA is returned.
On 8/9/22 09:24, Mike Christie wrote:
> On 8/9/22 2:16 AM, Christoph Hellwig wrote:
>> On Mon, Aug 08, 2022 at 07:04:13PM -0500, Mike Christie wrote:
>>> It's common to get a UA when doing PR commands. It could be due to a
>>> target restarting, a transport-level relogin or other PR commands like
>>> a release causing it. The upper layers don't get the sense and in some
>>> cases have no idea if it's a SCSI device, so this has the sd layer
>>> retry.
>>
>> This seems like another case for the generic in-kernel passthrough
>> command retry discussed in the other thread.
>
> It is.

Has it been considered to introduce a flag that makes scsi_noretry_cmd()
retry passthrough commands?

Thanks,

Bart.
Re: [dm-devel] [PATCH v2 04/20] scsi: Add support for block PR read keys/reservation.
On 8/8/22 17:04, Mike Christie wrote:
> +static int sd_pr_in_command(struct block_device *bdev, u8 sa,
> +			    unsigned char *data, int data_len)
> +{
> +	struct scsi_disk *sdkp = scsi_disk(bdev->bd_disk);
> +	struct scsi_device *sdev = sdkp->device;
> +	struct scsi_sense_hdr sshdr;
> +	u8 cmd[10] = { 0, };
> +	int result;

Isn't "{ }" instead of "{ 0, }" the preferred way to zero-initialize a data
structure?

> +
> +	cmd[0] = PERSISTENT_RESERVE_IN;
> +	cmd[1] = sa;

Can the above two assignments be moved into the initializer of cmd[]?

Thanks,

Bart.
Re: [dm-devel] [PATCH v2 02/20] scsi: Rename sd_pr_command.
On 8/8/22 17:04, Mike Christie wrote:
> Rename sd_pr_command to sd_pr_out_command to match an sd_pr_in_command
> helper added in the next patches.

No trailing dots at the end of the patch subject please (this patch and
other patches). Otherwise this patch looks good to me.

Thanks,

Bart.
Re: [dm-devel] [PATCH v9 10/13] dm-table: allow zoned devices with non power-of-2 zone sizes
On 8/3/22 02:47, Pankaj Raghav wrote:
> Allow dm to support zoned devices with non-power-of-2 (po2) zone sizes,
> as the block layer now supports it.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v9 09/13] dm-zone: use generic helpers to calculate offset from zone start
On 8/3/22 02:47, Pankaj Raghav wrote:
> -	if ((clone->bi_iter.bi_sector & (zsectors - 1)) != zwp_offset)
> +	if ((bdev_offset_from_zone_start(md->disk->part0,
> +					 clone->bi_iter.bi_sector)) != zwp_offset)

If this patch series needs to be reposted, please leave out the superfluous
parentheses from the above statement.

Thanks,

Bart.
Re: [dm-devel] [PATCH v9 05/13] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
On 8/3/22 02:47, Pankaj Raghav wrote:
> -	sector &= ~(ns->zsze - 1);
> +	/*
> +	 * Round down the sector value to the nearest zone start
> +	 */
> +	div64_u64_rem(sector, ns->zsze, &remainder);
> +	sector -= remainder;

Could bdev_offset_from_zone_start() be used here?

Thanks,

Bart.
Re: [dm-devel] [PATCH v9 03/13] block: allow blk-zoned devices to have non-power-of-2 zone size
On 8/3/22 02:47, Pankaj Raghav wrote:
> +	if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +		pr_err("%s: Invalid zone capacity: %lld with non power-of-2 zone size: %lld",
> +		       disk->disk_name, zone->capacity, zone->len);

If this patch series needs to be reposted, please leave out the second and
third colon (:) from the above error message. Once this comment has been
addressed, feel free to add:

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v9 02/13] block:rearrange bdev_{is_zoned, zone_sectors, get_queue} helpers in blkdev.h
On 8/3/22 02:47, Pankaj Raghav wrote:
> Define bdev_is_zoned(), bdev_zone_sectors() and bdev_get_queue() earlier
> in the blkdev.h include file. This commit has no functional change, and
> it is a prep patch for allowing zoned devices with non-power-of-2 zone
> sizes in the block layer.

Reviewed-by: Bart Van Assche
Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
On 7/28/22 04:57, Pankaj Raghav wrote:
> Hi Bart,
>
> On 2022-07-28 01:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support for npo2 zoned devices in the block and nvme
>>> layers, and a new **dm target** is added: dm-po2z-target. This new
>>> target will initially be used for filesystems such as btrfs and f2fs
>>> that do not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>> 	if (!is_power_of_2(zone_blocks)) {
>> 		sd_printk(KERN_ERR, sdkp,
>> 			  "Zone size %llu is not a power of two.\n",
>> 			  zone_blocks);
>> 		return -EINVAL;
>> 	}
>
> I would keep these changes out of the current patch series because it
> will also increase the test scope. I think once the block layer
> constraint is removed as a part of this series, we can work on the SCSI
> changes in the next cycle.

That's fine with me.

Thanks,

Bart.
Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
On 7/28/22 05:11, Pankaj Raghav wrote:
> On 2022-07-28 01:16, Bart Van Assche wrote:
>> The bdev_is_zone_start() name seems more clear to me than
>> bdev_is_zone_aligned(). Has there already been a discussion about which
>> name to use for this function?
>
> The reason I did s/bdev_is_zone_start/bdev_is_zone_aligned is that this
> name makes more sense when also checking whether a given size is a
> multiple of the zone sectors, e.g. as used in PATCH 9:
>
> -	if (len & (zone_sectors - 1)) {
> +	if (!bdev_is_zone_aligned(bdev, len)) {
>
> I felt `bdev_is_zone_aligned` fits the use case of checking if a sector
> is at the start of a zone and also of checking if a given length of
> sectors aligns with the zone sectors. bdev_is_zone_start does not make
> the intention clear for the latter use case IMO. But I am fine with going
> back to bdev_is_zone_start if you and Damien feel strongly otherwise.

The "zone start LBA" terminology occurs in ZBC-1, ZBC-2 and ZNS but "zone
aligned" does not. I prefer "zone start" because it is clear, unambiguous
and has the same meaning as in the corresponding standards documents. I
propose to proceed as follows for checking whether a number of LBAs is a
multiple of the zone length:
* Either use bdev_is_zone_start() directly.
* Or introduce a synonym for bdev_is_zone_start() with an appropriate name,
  e.g. bdev_is_zone_len_multiple().

Thanks,

Bart.
Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
On 7/27/22 18:52, Damien Le Moal wrote:
> On 7/28/22 08:19, Bart Van Assche wrote:
>> On 7/27/22 09:22, Pankaj Raghav wrote:
>>> This series adds support for npo2 zoned devices in the block and nvme
>>> layers, and a new **dm target** is added: dm-po2z-target. This new
>>> target will initially be used for filesystems such as btrfs and f2fs
>>> that do not have native npo2 zone support.
>>
>> Should any SCSI changes be included in this patch series? From sd_zbc.c:
>>
>> 	if (!is_power_of_2(zone_blocks)) {
>> 		sd_printk(KERN_ERR, sdkp,
>> 			  "Zone size %llu is not a power of two.\n",
>> 			  zone_blocks);
>> 		return -EINVAL;
>> 	}
>
> There are no non-power-of-2 SMR drives on the market and no plans to have
> any as far as I know. Users want a power-of-2 zone size. So I think it is
> better to leave sd_zbc & scsi_debug as-is for now.

Zoned UFS devices will support ZBC and may have a zone size that is not a
power of two.

Thanks,

Bart.
Re: [dm-devel] [PATCH v8 02/11] block: allow blk-zoned devices to have non-power-of-2 zone size
On 7/27/22 09:22, Pankaj Raghav wrote:
> Checking if a given sector is aligned to a zone is a common operation
> that is performed for zoned devices. Add a bdev_is_zone_start helper to
> check for this instead of open-coding it everywhere.

I can't find the bdev_is_zone_start() function in this patch?

> To make this work, bdev_get_queue(), bdev_zone_sectors() and
> bdev_is_zoned() are moved earlier without modifications.

Can that change perhaps be isolated into a separate patch?

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3d286a256d3d..1f7e9a90e198 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -570,7 +570,7 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  		return BLK_STS_NOTSUPP;
>
>  	/* The bio sector must point to the start of a sequential zone */
> -	if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) ||
> +	if (!bdev_is_zone_aligned(bio->bi_bdev, bio->bi_iter.bi_sector) ||
>  	    !bio_zone_is_seq(bio))
>  		return BLK_STS_IOERR;

The bdev_is_zone_start() name seems more clear to me than
bdev_is_zone_aligned(). Has there already been a discussion about which
name to use for this function?

> +	/*
> +	 * Non power-of-2 zone size support was added to remove the
> +	 * gap between zone capacity and zone size. Though it is technically
> +	 * possible to have gaps in a non power-of-2 device, Linux requires
> +	 * the zone size to be equal to zone capacity for non power-of-2
> +	 * zoned devices.
> +	 */
> +	if (!is_power_of_2(zone->len) && zone->capacity < zone->len) {
> +		pr_warn("%s: Invalid zone capacity for non power of 2 zone size",
> +			disk->disk_name);

Given the severity of this error, shouldn't the zone capacity and length be
reported in the error message?

Thanks,

Bart.
Re: [dm-devel] [PATCH v8 00/11] support non power of 2 zoned device
On 7/27/22 09:22, Pankaj Raghav wrote: This series adds support to npo2 zoned devices in the block and nvme layer and a new **dm target** is added: dm-po2z-target. This new target will be initially used for filesystems such as btrfs and f2fs that does not have native npo2 zone support. Should any SCSI changes be included in this patch series? From sd_zbc.c: if (!is_power_of_2(zone_blocks)) { sd_printk(KERN_ERR, sdkp, "Zone size %llu is not a power of two.\n", zone_blocks); return -EINVAL; } Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 0/3] dm-verity: optionally use tasklets in dm-verity
On 7/22/22 09:41, Christoph Hellwig wrote: We've been trying to kill off tasklets for about 15 years. I don't think adding new users will make you a whole lot of friends.. +1 for not using tasklets. At least in Android the real-time thread scheduling latency is important. Tasklets are not visible to the scheduler and hence cause latency spikes for real-time threads. These latency spikes can be observed by users, e.g. if the real-time thread is involved in audio playback. Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
On 6/29/22 12:05, Kent Overstreet wrote: On Wed, Jun 29, 2022 at 11:51:27AM -0700, Bart Van Assche wrote: On 6/29/22 11:40, Kent Overstreet wrote: But Jens, to be blunt - I know we have different priorities in the way we write code. Please stay professional in your communication and focus on the technical issues instead of on the people involved. BTW, I remember that some time ago I bisected a kernel bug to one of your commits and that you refused to fix the bug introduced by that commit. I had to spend my time on root-causing it and sending a fix upstream. I'd be genuinely appreciative if you'd refresh my memory on what it was. Because yeah, if I did that that was my fuckup and I want to learn from my mistakes. I was referring to the following two conversations from May 2018: * [PATCH] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180522235505.20937-1-bart.vanass...@wdc.com/) * [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()" (https://lore.kernel.org/linux-block/20180619172640.15246-1-bart.vanass...@wdc.com/) Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 5.20 1/4] block: add bio_rewind() API
On 6/29/22 11:40, Kent Overstreet wrote: But Jens, to be blunt - I know we have different priorities in the way we write code. Please stay professional in your communication and focus on the technical issues instead of on the people involved. BTW, I remember that some time ago I bisected a kernel bug to one of your commits and that you refused to fix the bug introduced by that commit. I had to spend my time on root-causing it and sending a fix upstream. Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v7 02/13] block: allow blk-zoned devices to have non-power-of-2 zone size
On 6/15/22 03:19, Pankaj Raghav wrote: @@ -489,14 +489,27 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx, * smaller last zone. */ if (zone->start == 0) { - if (zone->len == 0 || !is_power_of_2(zone->len)) { - pr_warn("%s: Invalid zoned device with non power of two zone size (%llu)\n", - disk->disk_name, zone->len); + if (zone->len == 0) { + pr_warn("%s: Invalid zone size", disk->disk_name); + return -ENODEV; + } + + /* +* Don't allow zoned device with non power_of_2 zone size with +* zone capacity less than zone size. +*/ Please change "power_of_2" into "power-of-2". + if (!is_power_of_2(zone->len) && zone->capacity < zone->len) { + pr_warn("%s: Invalid zone capacity for non power of 2 zone size", + disk->disk_name); return -ENODEV; } The above check seems wrong to me. I don't see why devices that report a capacity that is less than the zone size should be rejected. + /* +* Division is used to calculate nr_zones for both power_of_2 +* and non power_of_2 zone sizes as it is not in the hot path. +*/ Shouldn't the above comment be moved to the patch description? I'm not sure whether having such a comment in the source code is valuable. +static inline sector_t blk_queue_offset_from_zone_start(struct request_queue *q, + sector_t sec) +{ + sector_t zone_sectors = blk_queue_zone_sectors(q); + u64 remainder = 0; + + if (!blk_queue_is_zoned(q)) + return false; "return false" should only occur in functions returning a boolean. This function returns type sector_t. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v7 01/13] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
On 6/15/22 03:19, Pankaj Raghav wrote: Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can also work for non-power-of-2 zone sizes. As the existing deployments of zoned devices had power-of-2 assumption, power-of-2 optimized calculation is kept for those devices. There are no direct hot paths modified and the changes just introduce one new branch per call. Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 5/6] block: fold blk_max_size_offset into get_max_io_size
On 6/14/22 02:09, Christoph Hellwig wrote: Now that blk_max_size_offset has a single caller left, fold it into that and clean up the naming convention for the local variables there. Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 4/6] block: cleanup variable naming in get_max_io_size
On 6/14/22 02:09, Christoph Hellwig wrote: get_max_io_size has a very odd choice of variable names and initialization patterns. Switch to more descriptive names and more clear initialization of them. Hmm ... what is so odd about the variable names? I have done my best to choose clear and descriptive names when I introduced these names. Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 6/6] block: move blk_queue_get_max_sectors to blk.h
On 6/14/22 02:09, Christoph Hellwig wrote: blk_queue_get_max_sectors is private to the block layer, so move it out of blkdev.h. Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 3/6] block: open code blk_max_size_offset in blk_rq_get_max_sectors
On 6/14/22 09:43, Bart Van Assche wrote: On 6/14/22 02:09, Christoph Hellwig wrote: blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the chunk_sectors argument, and already checks for max_sectors through the call to blk_queue_get_max_sectors. That means much of blk_max_size_offset is not needed and open coding it simplifies the code. Signed-off-by: Christoph Hellwig --- block/blk-merge.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index db2e03c8af7f4..df003ecfbd474 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -566,17 +566,18 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, sector_t offset) { struct request_queue *q = rq->q; + unsigned int max_sectors; if (blk_rq_is_passthrough(rq)) return q->limits.max_hw_sectors; + max_sectors = blk_queue_get_max_sectors(q, req_op(rq)); if (!q->limits.chunk_sectors || req_op(rq) == REQ_OP_DISCARD || req_op(rq) == REQ_OP_SECURE_ERASE) - return blk_queue_get_max_sectors(q, req_op(rq)); - - return min(blk_max_size_offset(q, offset, 0), - blk_queue_get_max_sectors(q, req_op(rq))); + return max_sectors; + return min(max_sectors, + blk_chunk_sectors_left(offset, q->limits.chunk_sectors)); } blk_set_default_limits() initializes chunk_sectors to zero and blk_chunk_sectors_left() triggers a division by zero if a zero is passed as the second argument. What am I missing? Answering my own question: I overlooked one of the return statements. Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 3/6] block: open code blk_max_size_offset in blk_rq_get_max_sectors
On 6/14/22 02:09, Christoph Hellwig wrote: blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the chunk_sectors argument, and already checks for max_sectors through the call to blk_queue_get_max_sectors. That means much of blk_max_size_offset is not needed and open coding it simplifies the code. Signed-off-by: Christoph Hellwig --- block/blk-merge.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/block/blk-merge.c b/block/blk-merge.c index db2e03c8af7f4..df003ecfbd474 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -566,17 +566,18 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq, sector_t offset) { struct request_queue *q = rq->q; + unsigned int max_sectors; if (blk_rq_is_passthrough(rq)) return q->limits.max_hw_sectors; + max_sectors = blk_queue_get_max_sectors(q, req_op(rq)); if (!q->limits.chunk_sectors || req_op(rq) == REQ_OP_DISCARD || req_op(rq) == REQ_OP_SECURE_ERASE) - return blk_queue_get_max_sectors(q, req_op(rq)); - - return min(blk_max_size_offset(q, offset, 0), - blk_queue_get_max_sectors(q, req_op(rq))); + return max_sectors; + return min(max_sectors, + blk_chunk_sectors_left(offset, q->limits.chunk_sectors)); } blk_set_default_limits() initializes chunk_sectors to zero and blk_chunk_sectors_left() triggers a division by zero if a zero is passed as the second argument. What am I missing? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 1/6] block: factor out a chunk_size_left helper
On 6/14/22 02:09, Christoph Hellwig wrote: Factor out a helper from blk_max_size_offset so that it can be reused independently. Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 0/8] Use block pr_ops in LIO
On 6/5/22 09:55, Mike Christie wrote: libiscsi is not suitable for this type of setup. I think that this setup can be tested as follows with libiscsi: * Configure the backend storage. * Configure LIO to use the backend storage on two different servers. * On a third server, log in with the iSCSI initiator to both servers. * Run the libiscsi iscsi-test-cu test software on the third server and pass the two IQNs that refer to the LIO servers as arguments. From the iscsi-test-cu -h output: iscsi-test-cu [OPTIONS] [multipath-iscsi-url] Did I perhaps overlook or misunderstand something? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 0/8] Use block pr_ops in LIO
On 6/2/22 23:55, Mike Christie wrote: The following patches were built over Linus's tree. They allow us to use the block pr_ops with LIO's target_core_iblock module to support cluster applications in VMs. Currently, to use something like windows clustering in VMs with LIO and vhost-scsi, you have to use tcmu or pscsi or use a cluster aware FS/framework for the LIO pr file. Setting up a cluster FS/framework is a pain and a waste when your real backend device is already a distributed device, and pscsi and tcmu are nice for specific use cases, but iblock gives you the best performance and allows you to use stacked devices like dm-multipath. So these patches allow iblock to work like pscsi/tcmu where they can pass a PR command to the backend module. And then iblock will use the pr_ops to pass the PR command to the real devices similar to what we do for unmap today. Note that this patchset does not attempt to support every PR SCSI feature in iblock. It has the same limitations as tcmu and pscsi where you can have a single I_T nexus per device and only supports what is needed for windows clustering right now. How has this patch series been tested? Does LIO pass the libiscsi persistent reservation tests with this patch series applied? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 09/11] block, nvme: Add error for reservation conflicts.
On 6/2/22 23:55, Mike Christie wrote: BLK_STS_NEXUS is used for nvme/scsi reservation conflicts and also general nexus failures. For the pr_ops use we want to know if an IO failed for specifically a reservation conflict so we can report that error upwards to a VM. This patch adds a new error code for this case and converts nvme. The next patch converts scsi because it's a little more complicated. Signed-off-by: Mike Christie --- block/blk-core.c | 1 + drivers/nvme/host/core.c | 2 +- include/linux/blk_types.h | 4 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index bc0506772152..3908ac4a70b6 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -171,6 +171,7 @@ static const struct { /* zone device specific errors */ [BLK_STS_ZONE_OPEN_RESOURCE]= { -ETOOMANYREFS, "open zones exceeded" }, [BLK_STS_ZONE_ACTIVE_RESOURCE] = { -EOVERFLOW, "active zones exceeded" }, + [BLK_STS_RSV_CONFLICT] = { -EBADE, "resevation conflict" }, ^^ Please fix the spelling of "reservation". Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 03/11] scsi: Move sd_pr_type to header to share.
On 6/2/22 23:55, Mike Christie wrote: +static inline char block_pr_type_to_scsi(enum pr_type type) +{ + switch (type) { + case PR_WRITE_EXCLUSIVE: + return 0x01; + case PR_EXCLUSIVE_ACCESS: + return 0x03; + case PR_WRITE_EXCLUSIVE_REG_ONLY: + return 0x05; + case PR_EXCLUSIVE_ACCESS_REG_ONLY: + return 0x06; + case PR_WRITE_EXCLUSIVE_ALL_REGS: + return 0x07; + case PR_EXCLUSIVE_ACCESS_ALL_REGS: + return 0x08; + default: + return 0; + } +}; + +static inline enum pr_type scsi_pr_type_to_block(char type) +{ + switch (type) { + case 0x01: + return PR_WRITE_EXCLUSIVE; + case 0x03: + return PR_EXCLUSIVE_ACCESS; + case 0x05: + return PR_WRITE_EXCLUSIVE_REG_ONLY; + case 0x06: + return PR_EXCLUSIVE_ACCESS_REG_ONLY; + case 0x07: + return PR_WRITE_EXCLUSIVE_ALL_REGS; + case 0x08: + return PR_EXCLUSIVE_ACCESS_ALL_REGS; + default: + return 0; + } +} Since 'char' is used above to represent a single byte, please use u8 or uint8_t instead. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v3 10/11] null_blk: allow non power of 2 zoned devices
On 5/9/22 04:56, Pankaj Raghav wrote: Even though I am not sure if this optimization will directly add value looking at my experiments with the current change, I can fold this in with a comment on top of zone_size_sect_shifts variable stating that size can be npo2 and this variable is only meaningful for the po2 size scenario. Have these experiments perhaps been run on an x86_64 CPU? These CPUs only need a single instruction to calculate ilog2(). No equivalent of that instruction is available on ARM CPUs as far as I know. I think the optimization Damien proposed will help on ARM CPUs. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 13/16] null_blk: allow non power of 2 zoned devices
On 4/27/22 09:02, Pankaj Raghav wrote: diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c index c441a4972064..82a62b543782 100644 --- a/drivers/block/null_blk/main.c +++ b/drivers/block/null_blk/main.c @@ -1931,8 +1931,8 @@ static int null_validate_conf(struct nullb_device *dev) dev->mbps = 0; if (dev->zoned && - (!dev->zone_size || !is_power_of_2(dev->zone_size))) { - pr_err("zone_size must be power-of-two\n"); + (!dev->zone_size)) { + pr_err("zone_size must not be zero\n"); return -EINVAL; } Please combine "if (dev->zoned &&" and "(!dev->zone_size)) {" into a single line and leave out the parentheses that became superfluous. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 06/16] nvmet: use blk_queue_zone_no()
On 4/27/22 09:02, Pankaj Raghav wrote: From: Luis Chamberlain Instead of open coding the number of zones given a sector, use the helper ^^ I can't parse this. Please rephrase this. blk_queue_zone_no(). This let's us make modifications to the math if needed in one place and adds now support for npo2 zone devices. But since the code looks fine: Reviewed-by: Bart Van Assche -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 05/16] nvme: zns: Allow ZNS drives that have non-power_of_2 zone size
On 4/27/22 09:02, Pankaj Raghav wrote: - sector &= ~(ns->zsze - 1); + sector = rounddown(sector, ns->zsze); The above change breaks 32-bit builds since ns->zsze is 64 bits wide and since rounddown() uses the C division operator instead of div64_u64(). Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 01/16] block: make blkdev_nr_zones and blk_queue_zone_no generic for npo2 zsze
On 4/27/22 09:02, Pankaj Raghav wrote: Adapt blkdev_nr_zones and blk_queue_zone_no function so that it can also work for non-power-of-2 zone sizes. As the existing deployments of zoned devices had power-of-2 assumption, power-of-2 optimized calculation is kept for those devices. There are no direct hot paths modified and the changes just introduce one new branch per call. Reviewed-by: Luis Chamberlain Signed-off-by: Pankaj Raghav --- block/blk-zoned.c | 8 +++- include/linux/blkdev.h | 8 +++- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/block/blk-zoned.c b/block/blk-zoned.c index 38cd840d8838..1dff4a8bd51d 100644 --- a/block/blk-zoned.c +++ b/block/blk-zoned.c @@ -117,10 +117,16 @@ EXPORT_SYMBOL_GPL(__blk_req_zone_write_unlock); unsigned int blkdev_nr_zones(struct gendisk *disk) { sector_t zone_sectors = blk_queue_zone_sectors(disk->queue); + sector_t capacity = get_capacity(disk); if (!blk_queue_is_zoned(disk->queue)) return 0; - return (get_capacity(disk) + zone_sectors - 1) >> ilog2(zone_sectors); + + if (is_power_of_2(zone_sectors)) + return (capacity + zone_sectors - 1) >> + ilog2(zone_sectors); + + return div64_u64(capacity + zone_sectors - 1, zone_sectors); } EXPORT_SYMBOL_GPL(blkdev_nr_zones); Does anyone need support for more than 4 billion sectors per zone? If not, do_div() should be sufficient. diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 60d016138997..c4e4c7071b7b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -665,9 +665,15 @@ static inline unsigned int blk_queue_nr_zones(struct request_queue *q) static inline unsigned int blk_queue_zone_no(struct request_queue *q, sector_t sector) { + sector_t zone_sectors = blk_queue_zone_sectors(q); + if (!blk_queue_is_zoned(q)) return 0; - return sector >> ilog2(q->limits.chunk_sectors); + + if (is_power_of_2(zone_sectors)) + return sector >> ilog2(zone_sectors); + + return div64_u64(sector, zone_sectors); } Same comment here. Thanks, Bart. 
-- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH v4 00/10] Add Copy offload support
On 4/28/22 14:37, Damien Le Moal wrote: On 4/28/22 16:49, Nitesh Shetty wrote: On Thu, Apr 28, 2022 at 07:05:32AM +0900, Damien Le Moal wrote: On 4/27/22 21:49, Nitesh Shetty wrote: On Wed, Apr 27, 2022 at 11:19:48AM +0900, Damien Le Moal wrote: On 4/26/22 19:12, Nitesh Shetty wrote: The patch series covers the points discussed in November 2021 virtual call [LSF/MM/BFP TOPIC] Storage: Copy Offload[0]. We have covered the Initial agreed requirements in this patchset. Patchset borrows Mikulas's token based approach for 2 bdev implementation. Overall series supports – 1. Driver - NVMe Copy command (single NS), including support in nvme-target (for block and file backend) It would also be nice to have copy offload emulation in null_blk for testing. We can plan this in next phase of copy support, once this series settles down. So how can people test your series? Not a lot of drives out there with copy support. Yeah not many drives at present, Qemu can be used to test NVMe copy. Upstream QEMU? What are the command line options? An example would be nice. But I still think null_blk support would be easiest. +1 for adding copy offloading support in null_blk. That enables running copy offloading tests without depending on Qemu. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 03/16] block: add bdev_zone_no helper
On 4/27/22 09:02, Pankaj Raghav wrote: +static inline unsigned int bdev_zone_no(struct block_device *bdev, sector_t sec) +{ + struct request_queue *q = bdev_get_queue(bdev); + + if (q) + return blk_queue_zone_no(q, sec); + return 0; +} This patch series has been split incorrectly: the same patch that introduces a new function should also introduce a caller to that function. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 02/16] block: add blk_queue_zone_aligned and bdev_zone_aligned helper
On 4/27/22 09:02, Pankaj Raghav wrote: +static inline bool bdev_zone_aligned(struct block_device *bdev, sector_t sec) +{ + struct request_queue *q = bdev_get_queue(bdev); + + if (q) + return blk_queue_zone_aligned(q, sec); + return false; +} Which patch uses this function? I can't find any patch in this series that introduces a call to this function. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 01/11] blk-mq: Add blk_mq_init_queue_ops()
On 3/22/22 03:39, John Garry wrote: Add an API to allocate a request queue which accepts a custom set of blk_mq_ops for that request queue. The reason which we may want custom ops is for queuing requests which we don't want to go through the normal queuing path. Custom ops shouldn't be required for this. See e.g. how tmf_queue is used in the UFS driver for an example of a queue implementation with custom operations and that does not require changes of the block layer core. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support
On 2/3/22 10:50, Mikulas Patocka wrote: On Tue, 1 Feb 2022, Bart Van Assche wrote: On 2/1/22 10:32, Mikulas Patocka wrote: /** + * blk_queue_max_copy_sectors - set maximum copy offload sectors for the queue + * @q: the request queue for the device + * @size: the maximum copy offload sectors + */ +void blk_queue_max_copy_sectors(struct request_queue *q, unsigned int size) +{ + q->limits.max_copy_sectors = size; +} +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors); Please either change the unit of 'size' into bytes or change its type into sector_t. blk_queue_chunk_sectors, blk_queue_max_discard_sectors, blk_queue_max_write_same_sectors, blk_queue_max_write_zeroes_sectors, blk_queue_max_zone_append_sectors also have the unit of sectors and the argument is "unsigned int". Should blk_queue_max_copy_sectors be different? As far as I know using the type sector_t for variables that represent a number of sectors is a widely followed convention: $ git grep -w sector_t | wc -l 2575 I would appreciate it if that convention would be used consistently, even if that means modifying existing code. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH 1/3] block: add copy offload support
On 2/1/22 10:32, Mikulas Patocka wrote: /** + * blk_queue_max_copy_sectors - set maximum copy offload sectors for the queue + * @q: the request queue for the device + * @size: the maximum copy offload sectors + */ +void blk_queue_max_copy_sectors(struct request_queue *q, unsigned int size) +{ + q->limits.max_copy_sectors = size; +} +EXPORT_SYMBOL_GPL(blk_queue_max_copy_sectors); Please either change the unit of 'size' into bytes or change its type into sector_t. +extern int blkdev_issue_copy(struct block_device *bdev1, sector_t sector1, + struct block_device *bdev2, sector_t sector2, + sector_t nr_sects, sector_t *copied, gfp_t gfp_mask); + Only supporting copying between contiguous LBA ranges seems restrictive to me. I expect garbage collection by filesystems for UFS devices to perform better if multiple LBA ranges are submitted as a single SCSI XCOPY command. A general comment about the approach: encoding the LBA range information in a bio payload is not compatible with bio splitting. How can the dm driver implement copy offloading without the ability to split copy offload bio's? +int blkdev_issue_copy(struct block_device *bdev1, sector_t sector1, + struct block_device *bdev2, sector_t sector2, + sector_t nr_sects, sector_t *copied, gfp_t gfp_mask) +{ + struct page *token; + sector_t m; + int r = 0; + struct completion comp; Consider using DECLARE_COMPLETION_ONSTACK() instead of a separate declaration and init_completion() call. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH 2/3] nvme: add copy offload support
On 2/1/22 10:33, Mikulas Patocka wrote: +static inline blk_status_t nvme_setup_read_token(struct nvme_ns *ns, struct request *req) +{ + struct bio *bio = req->bio; + struct nvme_copy_token *token = page_to_virt(bio->bi_io_vec[0].bv_page) + bio->bi_io_vec[0].bv_offset; Hmm ... shouldn't this function use bvec_kmap_local() instead of page_to_virt()? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 1/26/22 23:14, Chaitanya Kulkarni wrote: [1] https://content.riscv.org/wp-content/uploads/2018/12/A-New-Golden-Age-for-Computer-Architecture-History-Challenges-and-Opportunities-David-Patterson-.pdf [2] https://www.snia.org/computational https://www.napatech.com/support/resources/solution-descriptions/napatech-smartnic-solution-for-hardware-offload/ https://www.eideticom.com/products.html https://www.xilinx.com/applications/data-center/computational-storage.html [3] git://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git xcopy [4] https://www.spinics.net/lists/linux-block/msg00599.html [5] https://lwn.net/Articles/793585/ [6] https://nvmexpress.org/new-nvmetm-specification-defines-zoned-namespaces-zns-as-go-to-industry-technology/ [7] https://github.com/sbates130272/linux-p2pmem [8] https://kernel.dk/io_uring.pdf Please consider adding the following link to the above list: https://github.com/bvanassche/linux-kernel-copy-offload Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [RFC PATCH 0/1] Add inline encryption support for dm-crypt
On 1/13/22 10:09, Israel Rukshin wrote: I am working to add support for inline encryption/decryption at storage protocols like nvmf over RDMA. The HW that I am working with is ConnectX-6 Dx, which supports inline crypto and can send the data on the fabric at the same time. This patchset is based on v5.16-rc4 with Eric Biggers' patches for HW-wrapped keys. It can be retrieved from branch "wip-wrapped-keys" at https://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git I tested this patch with modified nvme-rdma as the block device and created a DM crypt on top of it. I got a performance enhancement compared to SW crypto. I tested the HW-wrapped inline mode with the HW and the standard mode with a fallback algorithm. Hi Israel, Thank you for your work. For future patches related to inline encryption, please Cc Eric Biggers. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 11/19/21 02:47, Kanchan Joshi wrote: Given the multitude of things accumulated on this topic, Martin suggested to have a table/matrix. Some of those should go in the initial patchset, and the remaining are to be staged for subsequent work. Here is the attempt to split the stuff into two buckets. Please change if something needs to be changed below. 1. Driver * Initial: NVMe Copy command (single NS) Subsequent: Multi NS copy, XCopy/Token-based Copy 2. Block layer ** Initial: - Block-generic copy (REQ_OP_COPY), with interface accommodating two block-devs - Emulation, when offload is natively absent - DM support (at least dm-linear) 3. User-interface * Initial: new ioctl or io_uring opcode 4. In-kernel user ** Initial: at least one user - dm-kcopyd user (e.g. dm-clone), or FS requiring GC (F2FS/Btrfs) Subsequent: - copy_file_range Integrity support and inline encryption support are missing from the above overview. Both are supported by the block layer. See also block/blk-integrity.c and include/linux/blk-crypto.h. I'm not claiming that these should be supported in the first version but I think it would be good to add these to the above overview. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 11/17/21 04:53, Javier González wrote: Thanks for sharing this. We will make sure that DM / MD are supported and then we can cover examples. Hopefully, you guys can help with the bits for dm-crypt to make the decision to offload when it make sense. Will ask around to learn who should work on this. I will update the notes to keep them alive. Maybe we can have them open in your github page? Feel free to submit a pull request. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 11/16/21 05:43, Javier González wrote: - Here, we need copy emulation to support encryption without dealing with HW issues and garbage Hi Javier, Thanks very much for having taken notes and also for having shared these. Regarding the above comment, after the meeting I learned that the above is not correct. Encryption in Android is LBA independent and hence it should be possible to offload F2FS garbage collection in Android once the (UFS) storage controller supports this. For the general case, I propose to let the dm-crypt driver decide whether or not to offload data copying since that driver knows whether or not data copying can be offloaded. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 10/28/21 10:51 PM, Hannes Reinecke wrote: Also Keith presented his work on a simple zone-based remapping block device, which included an in-kernel copy offload facility. Idea is to lift that as a standalone patch such that we can use it as a fallback (i.e. software) implementation if no other copy offload mechanism is available. Is a link to the presentation available? Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 1/5] block: factor out a chunk_size_left helper
On 10/12/21 9:36 AM, Christoph Hellwig wrote: +/* + * Return how much of the chunk sectors is left to be used for an I/O at the + * given offset. + */ +static inline unsigned int chunk_size_left(sector_t offset, + unsigned int chunk_sectors) +{ + if (unlikely(!is_power_of_2(chunk_sectors))) + return chunk_sectors - sector_div(offset, chunk_sectors); + return chunk_sectors - (offset & (chunk_sectors - 1)); +} No "blk_" prefix for the function name? I think most other functions declared or defined in this header file have such a prefix. Thanks, Bart. -- dm-devel mailing list dm-devel@redhat.com https://listman.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 10/6/21 3:05 AM, Javier González wrote:
> I agree that the topic is complex. However, we have not been able to
> find a clear path forward in the mailing list.

Hmm ... really? At least Martin Petersen and I consider device mapper support essential. How about starting from Mikulas' patch series that supports the device mapper? See also https://lore.kernel.org/all/alpine.lrh.2.02.2108171630120.30...@file01.intranet.prod.int.rdu2.redhat.com/

> What do you think about joining the call to talk very specific next
> steps to get a patchset that we can start reviewing in detail.

I can do that.

Thanks,

Bart.
Re: [dm-devel] [LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
On 9/28/21 12:13 PM, Javier González wrote:
> Since we are not going to be able to talk about this at LSF/MM, a few
> of us thought about holding a dedicated virtual discussion about Copy
> Offload. I believe we can use Chaitanya's thread as a start. Given the
> current state of the patches, I would propose that we focus on the
> next step to get the minimal patchset that can go upstream so that we
> can build from there.
>
> Before we try to find a date and a time that fits most of us, who
> would be interested in participating?

Given the technical complexity of this topic, and also that the people who are interested live in multiple time zones, I prefer email to discuss the technical aspects of this work. My attempt to summarize how to implement copy offloading is available here: https://github.com/bvanassche/linux-kernel-copy-offload. Feedback on this text is welcome.

Thanks,

Bart.
Re: [dm-devel] dm-init.c: boot up race with partitions
On 9/22/21 18:33, 李平 wrote:
> I encountered the same problem as this one. Is there a solution?

How about creating the verity device from inside a udev rule?

Thanks,

Bart.
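As an illustration of that suggestion, such a rule could trigger veritysetup once the backing partition shows up instead of racing dm-init at boot. The partition labels, device-mapper name and root hash below are placeholders, not taken from the original report:

```
# Hypothetical udev rule: open the dm-verity device only after its data
# partition has appeared. All names and the root hash are placeholders.
ACTION=="add", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_NAME}=="vroot-data", \
  RUN+="/sbin/veritysetup open /dev/%k vroot /dev/disk/by-partlabel/vroot-hash <root-hash>"
```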
Re: [dm-devel] [PATCH 3/7] block: copy offload support infrastructure
On 8/20/21 3:39 AM, Kanchan Joshi wrote:
> Bart, Mikulas
>
> On Tue, Aug 17, 2021 at 10:44 PM Bart Van Assche wrote:
>> On 8/17/21 3:14 AM, SelvaKumar S wrote:
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create bio
>>> with control information as payload and submit to the device. Larger
>>> copy operations may be divided if necessary by looking at device
>>> limits. REQ_OP_COPY(19) is a write op and takes zone_write_lock when
>>> submitted to a zoned device. Native copy offload is not supported for
>>> stacked devices.
>>
>> Using a single operation for copy-offloading instead of separate
>> operations for reading and writing is fundamentally incompatible with
>> the device mapper. I think we need a copy-offloading implementation
>> that is compatible with the device mapper.
>
> While each read/write command is for a single contiguous range of the
> device, with simple-copy we get to operate on multiple discontiguous
> ranges with a single command. That seemed like a good opportunity to
> reduce control-plane traffic (compared to read/write operations) as
> well.
>
> With a separate read-and-write bio approach, each source range will
> spawn at least one read, one write and eventually one SCC command. And
> it only gets worse as there could be many such discontiguous ranges
> (for the GC use case at least) coming from user space in a single
> payload. The overall sequence will be:
> - Receive a payload from user space
> - Disassemble it into many read-write bio pairs at the block layer
> - Assemble those (somehow) in NVMe to reduce simple-copy commands
> - Send commands to the device
>
> We thought a payload could be a good way to reduce the
> disassembly/assembly work and traffic between the block layer and
> NVMe. How do you see this tradeoff? What seems necessary for the
> device-mapper use case appears to be a cost when the device mapper
> isn't used. Especially for SCC (since the copy is within a single
> namespace), device mappers may not be too compelling anyway. Must
> device-mapper support be a requirement for the initial support atop
> SCC? Or do you think it will still be progress if we finalize the
> user-space interface to cover all that is foreseeable, and for a
> device-mapper compatible transport between the block layer and NVMe
> we do that in a later stage, when NVMe too comes up with better copy
> capabilities?

Hi Kanchan,

These days there might be more systems that run the device mapper on top of the NVMe driver or a SCSI driver than systems that do not use the device mapper. It is common practice these days to use dm-crypt on personal workstations and laptops. LVM (dm-linear) is popular because it is more flexible than a traditional partition table. Android phones use dm-verity on top of hardware encryption. In other words, not supporting the device mapper means that a very large number of use cases is excluded. So I think supporting the device mapper from the start is important, even if that means combining individual bios at the bottom of the storage stack into simple copy commands.

Thanks,

Bart.