On Thu, 2025-10-09 at 18:25 -0400, Benjamin Marzinski wrote: > On Thu, Oct 09, 2025 at 11:57:08AM +0200, Martin Wilck wrote: > > On Wed, 2025-10-08 at 23:04 -0400, Benjamin Marzinski wrote: > > > Request-based devices (dm-multipath) queue I/O in blk-mq on > > > noflush > > > suspends. Any queued IO will make it impossible to freeze the > > > queue. > > > If > > > a process attempts to update the queue limits while there is > > > queued > > > IO, > > > it can be get stuck holding the limits lock, while unable to > > > freeze > > > the > > > queue. If device-mapper then attempts to update the limits during > > > a > > > table swap, it will deadlock trying to grab the limits lock while > > > making > > > it impossible to flush the IO. > > > > > > Disallow updating the queue limits during a table swap, when > > > updating > > > an > > > immutable request-based dm device (dm-multipath) during a noflush > > > suspend. It is userspace's responsibility to make sure that the > > > new > > > table uses the same limits as the existing table if it asks for a > > > noflush suspend. > > > > > > Signed-off-by: Benjamin Marzinski <[email protected]> > > > --- > > > drivers/md/dm-table.c | 4 ++++ > > > drivers/md/dm-thin.c | 7 ++----- > > > drivers/md/dm.c | 35 +++++++++++++++++++++++------------ > > > 3 files changed, 29 insertions(+), 17 deletions(-) > > > > > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c > > > index ad0a60a07b93..0522cd700e0e 100644 > > > --- a/drivers/md/dm-table.c > > > +++ b/drivers/md/dm-table.c > > > @@ -2043,6 +2043,10 @@ bool dm_table_supports_size_change(struct > > > dm_table *t, sector_t old_size, > > > return true; > > > } > > > > > > +/* > > > + * This function will be skipped by noflush reloads of immutable > > > request > > > + * based devices (dm-mpath). > > > + */ > > > int dm_table_set_restrictions(struct dm_table *t, struct > > > request_queue *q, > > > struct queue_limits *limits) > > > { > > > diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c > > > index c84149ba4e38..6f98936f0e05 100644 > > > --- a/drivers/md/dm-thin.c > > > +++ b/drivers/md/dm-thin.c > > > @@ -4383,11 +4383,8 @@ static void thin_postsuspend(struct > > > dm_target > > > *ti) > > > { > > > struct thin_c *tc = ti->private; > > > > > > - /* > > > - * The dm_noflush_suspending flag has been cleared by > > > now, > > > so > > > - * unfortunately we must always run this. > > > - */ > > > - noflush_work(tc, do_noflush_stop); > > > + if (dm_noflush_suspending(ti)) > > > + noflush_work(tc, do_noflush_stop); > > > } > > > > > > static int thin_preresume(struct dm_target *ti) > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > > > index 3ede5ba4cf7e..87e65c48dd04 100644 > > > --- a/drivers/md/dm.c > > > +++ b/drivers/md/dm.c > > > @@ -2439,7 +2439,6 @@ static struct dm_table *__bind(struct > > > mapped_device *md, struct dm_table *t, > > > { > > > struct dm_table *old_map; > > > sector_t size, old_size; > > > - int ret; > > > > > > lockdep_assert_held(&md->suspend_lock); > > > > > > @@ -2454,11 +2453,13 @@ static struct dm_table *__bind(struct > > > mapped_device *md, struct dm_table *t, > > > > > > set_capacity(md->disk, size); > > > > > > - ret = dm_table_set_restrictions(t, md->queue, limits); > > > - if (ret) { > > > - set_capacity(md->disk, old_size); > > > - old_map = ERR_PTR(ret); > > > - goto out; > > > + if (limits) { > > > + int ret = dm_table_set_restrictions(t, md- > > > >queue, > > > limits); > > > + if (ret) { > > > + set_capacity(md->disk, old_size); > > > + old_map = ERR_PTR(ret); > > > + goto out; > > > + } > > > } > > > > > > /* > > > @@ -2836,6 +2837,7 @@ static void dm_wq_work(struct work_struct > > > *work) > > > > > > static void dm_queue_flush(struct mapped_device *md) > > > { > > > + clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > clear_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags); > > > smp_mb__after_atomic(); > > > queue_work(md->wq, &md->work); > > > @@ -2848,6 +2850,7 @@ struct dm_table *dm_swap_table(struct > > > mapped_device *md, struct dm_table *table) > > > { > > > struct dm_table *live_map = NULL, *map = ERR_PTR(- > > > EINVAL); > > > struct queue_limits limits; > > > + bool update_limits = true; > > > int r; > > > > > > mutex_lock(&md->suspend_lock); > > > @@ -2856,20 +2859,31 @@ struct dm_table *dm_swap_table(struct > > > mapped_device *md, struct dm_table *table) > > > if (!dm_suspended_md(md)) > > > goto out; > > > > > > + /* > > > + * To avoid a potential deadlock locking the queue > > > limits, > > > disallow > > > + * updating the queue limits during a table swap, when > > > updating an > > > + * immutable request-based dm device (dm-multipath) > > > during a > > > noflush > > > + * suspend. It is userspace's responsibility to make > > > sure > > > that the new > > > + * table uses the same limits as the existing table, if > > > it > > > asks for a > > > + * noflush suspend. > > > + */ > > > + if (dm_request_based(md) && md->immutable_target && > > > + __noflush_suspending(md)) > > > + update_limits = false; > > > /* > > > * If the new table has no data devices, retain the > > > existing > > > limits. > > > * This helps multipath with queue_if_no_path if all > > > paths > > > disappear, > > > * then new I/O is queued based on these limits, and > > > then > > > some paths > > > * reappear. > > > */ > > > - if (dm_table_has_no_data_devices(table)) { > > > + else if (dm_table_has_no_data_devices(table)) { > > > live_map = dm_get_live_table_fast(md); > > > if (live_map) > > > limits = md->queue->limits; > > > dm_put_live_table_fast(md); > > > } > > > > > > - if (!live_map) { > > > + if (update_limits && !live_map) { > > > r = dm_calculate_queue_limits(table, &limits); > > > if (r) { > > > map = ERR_PTR(r); > > > @@ -2877,7 +2891,7 @@ struct dm_table *dm_swap_table(struct > > > mapped_device *md, struct dm_table *table) > > > } > > > } > > > > > > - map = __bind(md, table, &limits); > > > + map = __bind(md, table, update_limits ? &limits : NULL); > > > dm_issue_global_event(); > > > > > > out: > > > @@ -2930,7 +2944,6 @@ static int __dm_suspend(struct > > > mapped_device > > > *md, struct dm_table *map, > > > > > > /* > > > * DMF_NOFLUSH_SUSPENDING must be set before presuspend. > > > - * This flag is cleared before dm_suspend returns. > > > */ > > > if (noflush) > > > set_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > @@ -2993,8 +3006,6 @@ static int __dm_suspend(struct > > > mapped_device > > > *md, struct dm_table *map, > > > if (!r) > > > set_bit(dmf_suspended_flag, &md->flags); > > > > > > - if (noflush) > > > - clear_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); > > > if (map) > > > synchronize_srcu(&md->io_barrier); > > > > This moves the clear_bit() behind synchronize_rcu(). Is that safe? > > It's not just moved behind the synchronize_rcu(). > DMF_NOFLUSH_SUSPENDING > is now set for the entire time the device is suspended. If people > would > like, I'll update the patch to rename it to DMF_NOFLUSH_SUSPEND.
Right. I realize now that there's a smp_mb__after_atomic() in dm_queue_flush(). > I did check to see if holding it for the entire suspend would cause > issues, but I didn't see any case where it would. If I missed a > case where __noflush_suspending() should only return true if we are > actually in the process of suspending, I can easily update that > function to do that. If this is necessary, I agree that the flag an related function should be renamed. But there are already generic DM flags to indicate that a queue is suspend*ed*. Perhaps, instead of changing the semantics of DMF_NOFLUSH_SUSPENDING, it would make more sense to test (__noflush_suspending || test_bit(DMF_BLOCK_IO_FOR_SUSPEND) in dm_swap_table()? Anyway, I am still not convinced that we want to have this specific exception for multipath only. > > In general, I'm wondering whether we need a more generic solution > > to > > this problem. Therefore I've added linux-block to cc. > > > > The way I see it, if a device has queued IO without any means to > > perform the IO, it can't be frozen. We'd either need to fail all > > queued > > IO in this case, or refuse attempts to freeze the queue. > > In general, it's perfectly normal to start freezing the queue while > there are still outstanding requests. Sure, but my point was "without any means to perform the IO". As you pointed out yourself, a freeze attempt in this situation would get stuck. I find Bart's approach very attractive; freezing might not be necessary at all in that case. We'dd just need to avoid a race where paths get reinstated while the operation that would normally have required a freeze is ongoing. Martin
