On 4/10/25 1:33 PM, Damien Le Moal wrote:
> When a dm-delay devie is being suspended, the .presuspend() operation is
> first executed (delay_presuspend()) to immediately issue all the BIOs
> present in the delayed list of the device and also sets the device
> may_delay boolean to false. At the same time, any new BIO is issued to
> the device will not be delayed and immediately issued with delay_bio()
> returning DM_MAPIO_REMAPPED. This creates a situation where potentially
> 2 different contexts may be issuing write BIOs to the same zone of a
> zone device without respecting the issuing order from the user, that is,
> a newly issued write BIO may be issued before other write BIOs for the
> same target zone that are in the device delayed list. If such situation
> occurs, write BIOs may be failed by the underlying zoned device due to
> an unaligned write error.
>
> Prevent this situation from happening by always handling newly issued
> write BIOs using the delayed list of BIOs, even when the device is being
> suspended. This is done by forcing the use of the worker kthread for
> zoned devices, and by modifying flush_worker_fn() to always flush all
> delayed BIOs if the device may_delay boolean is false.
>
> Fixes: d43929ef65a6 ("dm-delay: support zoned devices")
> Signed-off-by: Damien Le Moal <[email protected]>
I probably should also add:
Reported-by: Benjamin Marzinski <[email protected]>
> ---
> drivers/md/dm-delay.c | 29 +++++++++++++++++++++--------
> 1 file changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
> index d4cf0ac2a7aa..c665b2ab1115 100644
> --- a/drivers/md/dm-delay.c
> +++ b/drivers/md/dm-delay.c
> @@ -128,7 +128,7 @@ static int flush_worker_fn(void *data)
> struct delay_c *dc = data;
>
> while (!kthread_should_stop()) {
> - flush_delayed_bios(dc, false);
> + flush_delayed_bios(dc, !dc->may_delay);
> spin_lock(&dc->delayed_bios_lock);
> if (unlikely(list_empty(&dc->delayed_bios))) {
> set_current_state(TASK_INTERRUPTIBLE);
> @@ -213,6 +213,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int
> argc, char **argv)
> struct delay_c *dc;
> int ret;
> unsigned int max_delay;
> + bool is_zoned = false;
>
> if (argc != 3 && argc != 6 && argc != 9) {
> ti->error = "Requires exactly 3, 6 or 9 arguments";
> @@ -236,6 +237,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int
> argc, char **argv)
> if (ret)
> goto bad;
> max_delay = dc->read.delay;
> + is_zoned = bdev_is_zoned(dc->read.dev->bdev);
>
> if (argc == 3) {
> ret = delay_class_ctr(ti, &dc->write, argv);
> @@ -251,6 +253,7 @@ static int delay_ctr(struct dm_target *ti, unsigned int
> argc, char **argv)
> if (ret)
> goto bad;
> max_delay = max(max_delay, dc->write.delay);
> + is_zoned = is_zoned || bdev_is_zoned(dc->write.dev->bdev);
>
> if (argc == 6) {
> ret = delay_class_ctr(ti, &dc->flush, argv + 3);
> @@ -263,13 +266,16 @@ static int delay_ctr(struct dm_target *ti, unsigned int
> argc, char **argv)
> if (ret)
> goto bad;
> max_delay = max(max_delay, dc->flush.delay);
> + is_zoned = is_zoned || bdev_is_zoned(dc->flush.dev->bdev);
>
> out:
> - if (max_delay < 50) {
> - /*
> - * In case of small requested delays, use kthread instead of
> - * timers and workqueue to achieve better latency.
> - */
> + /*
> + * In case of small requested delays, use kthread instead of timers and
> + * workqueue to achieve better latency. Also use a kthread for a zoned
> + * device so that we can preserve the order of write operations during
> + * suspend.
> + */
> + if (max_delay < 50 || is_zoned) {
> dc->worker = kthread_run(&flush_worker_fn, dc,
> "dm-delay-flush-worker");
> if (IS_ERR(dc->worker)) {
> ret = PTR_ERR(dc->worker);
> @@ -313,8 +319,15 @@ static int delay_bio(struct delay_c *dc, struct
> delay_class *c, struct bio *bio)
>
> spin_lock(&dc->delayed_bios_lock);
> if (unlikely(!dc->may_delay)) {
> - spin_unlock(&dc->delayed_bios_lock);
> - return DM_MAPIO_REMAPPED;
> + /*
> + * Issue the BIO immediately if the device is not zoned. FOr a
> + * zoned device, preserver the ordering of write operations by
> + * using the delay list.
> + */
> + if (!bdev_is_zoned(c->dev->bdev) || c != &dc->write) {
> + spin_unlock(&dc->delayed_bios_lock);
> + return DM_MAPIO_REMAPPED;
> + }
> }
> c->ops++;
> list_add_tail(&delayed->list, &dc->delayed_bios);
--
Damien Le Moal
Western Digital Research