Re: [dm-devel] ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5

2022-09-29 Thread Donald Buczek
On 5/31/22 5:39 PM, Theodore Ts'o wrote:
> Hmmm. I think this patch should fix your issues.

Thanks a lot. Unfortunately, it didn't, I still occasionally get

[368259.560885] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 344 pages, ino 
279244; err -5


D.


> 
> If the journal has been aborted (which happens as part of the
> shutdown, we will never write out the commit block --- so it should be
> fine to skip the writeback of any dirty inodes in data=ordered mode.
> 
> BTW, if you know that the file system is going to get nuked in this
> way all the time, so you never care about file system after it is shut
> down, you could mount the file system with the mount option
> data=writeback.
> 
>   - Ted
> 
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 8ff4c6545a49..2e18211121f6 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -542,7 +542,10 @@ static int 
> ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
>   int ret;
> + journal_t *journal = EXT4_SB(jinode->i_vfs_inode->i_sb)->s_journal;
>  
> + if (!journal || is_journal_aborted(journal))
> + return 0;
>   if (ext4_should_journal_data(jinode->i_vfs_inode))
>   ret = ext4_journalled_submit_inode_data_buffers(jinode);
>   else
> @@ -554,7 +557,10 @@ static int ext4_journal_submit_inode_data_buffers(struct 
> jbd2_inode *jinode)
>  static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
>  {
>   int ret = 0;
> + journal_t *journal = EXT4_SB(jinode->i_vfs_inode->i_sb)->s_journal;
>  
> + if (!journal || is_journal_aborted(journal))
> + return 0;
>       if (!ext4_should_journal_data(jinode->i_vfs_inode))
>   ret = jbd2_journal_finish_inode_data_buffers(jinode);
> 


-- 
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5

2022-06-01 Thread Donald Buczek

On 5/31/22 12:38 PM, Jan Kara wrote:

Late reply but maybe it is still useful :)


That's very welcome!


On Thu 14-04-22 17:19:49, Donald Buczek wrote:

We have a cluster scheduler which provides each cluster job with a
private scratch filesystem (TMPDIR). These are created when a job starts
and removed when a job completes. The setup works by fallocate, losetup,
mkfs.ext4, mkdir, mount, "losetup -d", rm and the teardown just does a
umount and rmdir.

This works but there is one nuisance: The systems usually have a lot of
memory and some jobs write a lot of data to their scratch filesystems. So
when a job finishes, there often is a lot to sync by umount which
sometimes takes many minutes and wastes a lot of I/O bandwidth.
Additionally, the reserved space can't be returned and reused until the
umount is finished and the backing file is deleted.

So I was looking for a way to avoid that but didn't find something
straightforward. The workaround I've found so far is using a dm-device
(linear target) between the filesystem and the loop device and then use
this sequence for teardown:

- fcntl EXT4_IOC_SHUTDOWN with EXT4_GOING_FLAGS_NOLOGFLUSH
- dmestup reload $dmname --table "0 $sectors zero"
- dmsetup resume $dmname --noflush
- umount $mountpoint
- dmsetup remove --deferred $dmname
- rmdir $mountpoint

This seems to do what I want. The unnecessary flushing of the temporary data is 
redirected from the backing file into the zero target and it works really fast. 
There is one remaining problem though, which might be just a cosmetic one: 
Although ext4 is shut down to prevent it from writing, I sometimes get the 
error message from the subject in the logs:

[2963044.462043] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2963044.686994] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963044.728391] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963055.585198] EXT4-fs (dm-2): shut down requested (2)
[2963064.821246] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963074.838259] EXT4-fs (dm-2): shut down requested (2)
[2963095.979089] EXT4-fs (dm-0): shut down requested (2)
[2963096.066376] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 5120 pages, ino 
11; err -5
[2963108.636648] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963125.194740] EXT4-fs (dm-0): shut down requested (2)
[2963166.708088] EXT4-fs (dm-1): shut down requested (2)
[2963169.334437] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963227.515974] EXT4-fs (dm-0): shut down requested (2)
[2966222.515143] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2966222.523390] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2966222.598071] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)




So I'd like to ask a few questions:

- Is this error message expected or is it a bug?


Well, shutdown is not 100% tuned for clean teardown. It is mostly a testing
/ debugging aid.


- Can it be ignored or is there a leak or something on that error path.


The error recovery path should be cleaning up everything. If not, that
would be a bug :)


- Is there a better way to do what I want? Something I've overlooked?


Why not just rm -rf $mountpoint/*? That will remove all dirty data from
memory without writing it back. It will cost you more in terms of disk IOs
than the above dance with shutdown but unless you have many files, it
should be fast... And it is much more standard path than shutdown :).


This is in fact what we are doing now, after I've rejected the above solution
because of the fear of a leak on the probably not so well tested error path.

You answer encourages me to maybe just try it. The problem is, that if a few 
pages
or inodes keep hanging around only occasionally, the systems would probably 
need month
and month until this manifest as a problem. So I'd never know when to declare 
the
test as finished and trust the solution.

The "rm -r" does help a bit but not as much as I've hoped. It might take half
the time for some workloads but it is still in the same order of magnitude
for typical loads.


- I consider to create a new dm target or add an option to an existing
one, because I feel that "zero" underneath a filesystem asks for problems
because a filesystem expects to read back the data that it wrote, and the
"error" target would trigger lots of errors during the writeback
attempts. What I really want is a target which silently discard writes
and returns errors on reads. Any opinion about that?



- But to use devicemapper to eat away the I/O is also just a workaround
to the fact that we can't parse some flag to umount to say that we are
okay to lose all data and leave the filesystem in a corrupted state if
this was the last reference to it. Would this be a useful feature?


I think something like this might be useful if 

[dm-devel] ext4_writepages: jbd2_start: 5120 pages, ino 11; err -5

2022-04-19 Thread Donald Buczek

We have a cluster scheduler which provides each cluster job with a private scratch 
filesystem (TMPDIR). These are created when a job starts and removed when a job 
completes. The setup works by fallocate, losetup, mkfs.ext4, mkdir, mount, "losetup 
-d", rm and the teardown just does a umount and rmdir.

This works but there is one nuisance: The systems usually have a lot of memory 
and some jobs write a lot of data to their scratch filesystems. So when a job 
finishes, there often is a lot to sync by umount which sometimes takes many 
minutes and wastes a lot of I/O bandwidth. Additionally, the reserved space 
can't be returned and reused until the umount is finished and the backing file 
is deleted.

So I was looking for a way to avoid that but didn't find something 
straightforward. The workaround I've found so far is using a dm-device (linear 
target) between the filesystem and the loop device and then use this sequence 
for teardown:

- fcntl EXT4_IOC_SHUTDOWN with EXT4_GOING_FLAGS_NOLOGFLUSH
- dmestup reload $dmname --table "0 $sectors zero"
- dmsetup resume $dmname --noflush
- umount $mountpoint
- dmsetup remove --deferred $dmname
- rmdir $mountpoint

This seems to do what I want. The unnecessary flushing of the temporary data is 
redirected from the backing file into the zero target and it works really fast. 
There is one remaining problem though, which might be just a cosmetic one: 
Although ext4 is shut down to prevent it from writing, I sometimes get the 
error message from the subject in the logs:

[2963044.462043] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2963044.686994] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963044.728391] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963055.585198] EXT4-fs (dm-2): shut down requested (2)
[2963064.821246] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)
[2963074.838259] EXT4-fs (dm-2): shut down requested (2)
[2963095.979089] EXT4-fs (dm-0): shut down requested (2)
[2963096.066376] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 5120 pages, ino 
11; err -5
[2963108.636648] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963125.194740] EXT4-fs (dm-0): shut down requested (2)
[2963166.708088] EXT4-fs (dm-1): shut down requested (2)
[2963169.334437] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2963227.515974] EXT4-fs (dm-0): shut down requested (2)
[2966222.515143] EXT4-fs (dm-0): mounted filesystem without journal. Opts: 
(null)
[2966222.523390] EXT4-fs (dm-1): mounted filesystem without journal. Opts: 
(null)
[2966222.598071] EXT4-fs (dm-2): mounted filesystem without journal. Opts: 
(null)

So I'd like to ask a few questions:

- Is this error message expected or is it a bug?
- Can it be ignored or is there a leak or something on that error path.
- Is there a better way to do what I want? Something I've overlooked?
- I consider to create a new dm target or add an option to an existing one, because I feel that 
"zero" underneath a filesystem asks for problems because a filesystem expects to read 
back the data that it wrote, and the "error" target would trigger lots of errors during 
the writeback attempts. What I really want is a target which silently discard writes and returns 
errors on reads. Any opinion about that?
- But to use devicemapper to eat away the I/O is also just a workaround to the 
fact that we can't parse some flag to umount to say that we are okay to lose 
all data and leave the filesystem in a corrupted state if this was the last 
reference to it. Would this be a useful feature?

Best
  Donald
--
Donald Buczek
buc...@molgen.mpg.de

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH V2] md: don't unregister sync_thread with reconfig_mutex held

2021-12-14 Thread Donald Buczek

On 14.12.21 11:03, Guoqing Jiang wrote:



On 12/14/21 5:31 PM, Donald Buczek wrote:


  -void md_reap_sync_thread(struct mddev *mddev)
+void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
  {
  struct md_rdev *rdev;
  sector_t old_dev_sectors = mddev->dev_sectors;
  bool is_reshaped = false;
    /* resync has finished, collect result */
+    if (reconfig_mutex_held)
+    mddev_unlock(mddev);



If one thread got here, e.g. via action_store( /* "idle" */ ), now that the 
mutex is unlocked, is there anything which would prevent another thread getting  here as 
well, e.g. via the same path?

If not, they both might call


md_unregister_thread(>sync_thread);


Which is not reentrant:

void md_unregister_thread(struct md_thread **threadp)
{
struct md_thread *thread = *threadp;
if (!thread)
    return;
pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
/* Locking ensures that mddev_unlock does not wake_up a
 * non-existent thread
 */
spin_lock(_lock);
*threadp = NULL;
spin_unlock(_lock);

kthread_stop(thread->tsk);
kfree(thread);
}

This might be a preexisting problem, because the call site in dm-raid.c, which 
you updated to `md_reap_sync_thread(mddev, false);`, didn't hold the mutex 
anyway.

Am I missing something? Probably, I do.

Otherwise: Move the deref of threadp in md_unregister_thread() into the 
spinlock scope?


Good point, I think you are right.

And actually pers_lock does extra service to protect accesses to mddev->thread 
(I think it
also suitable for mddev->sync_thread ) when the mutex can't be held. Care to 
send a patch
for it?


I'm really sorry, but it's one thing to point to a possible problem and another 
thing to come up with a correct solution.


Yes, it is often the reality of life, and we can always correct ourselves if 
there is problem .


While I think it would be easy to avoid the double free with the spinlock (or 
maybe atomic RMW) , we surely don't want to hold the spinlock while we are 
sleeping in kthread_stop(). If we don't hold some kind of lock, what are the 
side effects of another sync thread being started or any other reconfiguration? 
Are the existing flags enough to protect us from this? If we do want to hold 
the lock while waiting for the thread to terminate, should it be made into a 
mutex? If so, it probably shouldn't be static but moved into the mddev 
structure. I'd need weeks if not month to figure that out and to feel bold 
enough to post it.


Thanks for deep thinking about it, I think we can avoid to call kthread_stop 
with spinlock
held. Maybe something like this, but just my raw idea, please have a thorough 
review.

  void md_unregister_thread(struct md_thread **threadp)
  {
-   struct md_thread *thread = *threadp;
-   if (!thread)
+   struct md_thread *thread = READ_ONCE(*threadp);


- The access to *threadp needs to be after the spin_lock(). Otherwise two CPUs 
might read non-NULL here.

- If it was after spin_lock(), I think (!= I know), we don't need the 
READ_ONCE, because spin_lock() implies a compiler barrier.


+
+   spin_lock(_lock);
+   if (!thread) {
+   spin_unlock(_lock);
     return;
+   }
     pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
     /* Locking ensures that mddev_unlock does not wake_up a
  * non-existent thread
  */
-   spin_lock(_lock);
     *threadp = NULL;
     spin_unlock(_lock);

-   kthread_stop(thread->tsk);
+   if (IS_ERR_OR_NULL(thread->tsk)) {


- Test accidentally negated? This test is new. Is this an unrelated change? 
Anyway, I don't get it.


+   kthread_stop(thread->tsk);
+   thread->tsk = NULL;


- This assignment can't be required, because we are going to kfree(thread) in 
the next line.


+   }
     kfree(thread);
  }
  EXPORT_SYMBOL(md_unregister_thread);


I don't want to push work to others, but my own my understanding of md is to 
narrow.


Me either 

Thanks,
Guoqing


Best

  Donald

--
Donald Buczek
buc...@molgen.mpg.de
Tel: +49 30 8413 1433


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH V2] md: don't unregister sync_thread with reconfig_mutex held

2021-12-14 Thread Donald Buczek

On 14.12.21 03:34, Guoqing Jiang wrote:



On 12/10/21 10:16 PM, Donald Buczek wrote:

Dear Guoqing,

On 13.02.21 01:49, Guoqing Jiang wrote:

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
    idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
    which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
    to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc...@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek 
Signed-off-by: Guoqing Jiang 
---
V2:
1. add one parameter to md_reap_sync_thread per Jack's suggestion.

  drivers/md/dm-raid.c |  2 +-
  drivers/md/md.c  | 14 +-
  drivers/md/md.h  |  2 +-
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index cab12b2..0c4cbba 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3668,7 +3668,7 @@ static int raid_message(struct dm_target *ti, unsigned 
int argc, char **argv,
  if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, false);
  }
  } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
  return -EBUSY;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ca40942..0c12b7f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4857,7 +4857,7 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
  mddev_unlock(mddev);
  }
@@ -6234,7 +6234,7 @@ static void __md_stop_writes(struct mddev *mddev)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
    del_timer_sync(>safemode_timer);
@@ -9256,7 +9256,7 @@ void md_check_recovery(struct mddev *mddev)
   * ->spare_active and clear saved_raid_disk
   */
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  clear_bit(MD_RECOVERY_RECOVER, >recovery);
  clear_bit(MD_RECOVERY_NEEDED, >recovery);
  clear_bit(MD_SB_CHANGE_PENDING, >sb_flags);
@@ -9291,7 +9291,7 @@ void md_check_recovery(struct mddev *mddev)
  goto unlock;
  }
  if (mddev->sync_thread) {
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  goto unlock;
  }
  /* Set RUNNING before clearing NEEDED to avoid
@@ -9364,14 +9364,18 @@ void md_check_recovery(struct mddev *mddev)
  }
  EXPORT_SYMBOL(md_check_recovery);
  -void md_reap_sync_thread(struct mddev *mddev)
+void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
  {
  struct md_rdev *rdev;
  sector_t old_dev_sectors = mddev->dev_sectors;
  bool is_reshaped = false;
    /* resync has finished, collect result */
+    if (reconfig_mutex_held)
+    mddev_unlock(mddev);



If one thread got here, e.g. via action_store( /* "idle" */ ), now that the 
mutex is unlocked, is there anything which would prevent another thread getting  here as 
well, e.g. via the same path?

If not, they both might call


md_unregister_thread(>sync_thread);


Which is not reentrant:

void md_unregister_thread(struct md_thread **threadp)
{
struct md_thread *thread = *threadp;
if (!thread)
    return;
pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
/* Locking ensures that mddev_unlock does not wake_up a
 * non-existent thread
 */
spin_lock(_lock);
*threadp = NULL;
spin_unlock(_lock);

kthread_stop(thread->tsk);
kfree(thread);
}

This might be a preexisting problem, because the call site in dm-raid.c, which 
you updated to `md_reap_sync_thread(mddev, false);`, didn't hold the mutex 
anyway.

Am I missing something? Probably, I do.

Otherwise: Move the deref of threadp in md_unregister_t

Re: [dm-devel] [PATCH V2] md: don't unregister sync_thread with reconfig_mutex held

2021-12-12 Thread Donald Buczek

Dear Guoqing,

On 13.02.21 01:49, Guoqing Jiang wrote:

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc...@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek 
Signed-off-by: Guoqing Jiang 
---
V2:
1. add one parameter to md_reap_sync_thread per Jack's suggestion.

  drivers/md/dm-raid.c |  2 +-
  drivers/md/md.c  | 14 +-
  drivers/md/md.h  |  2 +-
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index cab12b2..0c4cbba 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3668,7 +3668,7 @@ static int raid_message(struct dm_target *ti, unsigned 
int argc, char **argv,
if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, >recovery);
-   md_reap_sync_thread(mddev);
+   md_reap_sync_thread(mddev, false);
}
} else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
return -EBUSY;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ca40942..0c12b7f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4857,7 +4857,7 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, >recovery);
-   md_reap_sync_thread(mddev);
+   md_reap_sync_thread(mddev, true);
}
mddev_unlock(mddev);
}
@@ -6234,7 +6234,7 @@ static void __md_stop_writes(struct mddev *mddev)
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, >recovery);
-   md_reap_sync_thread(mddev);
+   md_reap_sync_thread(mddev, true);
}
  
  	del_timer_sync(>safemode_timer);

@@ -9256,7 +9256,7 @@ void md_check_recovery(struct mddev *mddev)
 * ->spare_active and clear saved_raid_disk
 */
set_bit(MD_RECOVERY_INTR, >recovery);
-   md_reap_sync_thread(mddev);
+   md_reap_sync_thread(mddev, true);
clear_bit(MD_RECOVERY_RECOVER, >recovery);
clear_bit(MD_RECOVERY_NEEDED, >recovery);
clear_bit(MD_SB_CHANGE_PENDING, >sb_flags);
@@ -9291,7 +9291,7 @@ void md_check_recovery(struct mddev *mddev)
goto unlock;
}
if (mddev->sync_thread) {
-   md_reap_sync_thread(mddev);
+   md_reap_sync_thread(mddev, true);
goto unlock;
}
/* Set RUNNING before clearing NEEDED to avoid
@@ -9364,14 +9364,18 @@ void md_check_recovery(struct mddev *mddev)
  }
  EXPORT_SYMBOL(md_check_recovery);
  
-void md_reap_sync_thread(struct mddev *mddev)

+void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
  {
struct md_rdev *rdev;
sector_t old_dev_sectors = mddev->dev_sectors;
bool is_reshaped = false;
  
  	/* resync has finished, collect result */

+   if (reconfig_mutex_held)
+   mddev_unlock(mddev);



If one thread got here, e.g. via action_store( /* "idle" */ ), now that the 
mutex is unlocked, is there anything which would prevent another thread getting  here as 
well, e.g. via the same path?

If not, they both might call


md_unregister_thread(>sync_thread);


Which is not reentrant:

void md_unregister_thread(struct md_thread **threadp)
{
struct md_thread *thread = *threadp;
if (!thread)
return;
pr_debug("interrupting MD-thread pid %d\n", task_pid_nr(thread->tsk));
/* Locking ensures that mddev_unlock does not wake_up a
 * non-existent thread
 */
spin_lock(_lock);
*threadp = NULL;
spin_unlock(_lock);

 

Re: [dm-devel] [PATCH V2] md: don't unregister sync_thread with reconfig_mutex held

2021-12-09 Thread Donald Buczek

[Update Guoqing’s email address]

On 15.02.21 12:07, Paul Menzel wrote:

[+cc Donald]

Am 13.02.21 um 01:49 schrieb Guoqing Jiang:

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
    idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
    which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
    to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc...@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek 


Thanks, Paul, for putting me into the cc.

Guoqing, I don't think, I've tested this patch. Please remove the tested-by.

btw: We have the fix I suggested [1] running on 59 production raid6 sets with 
16 disk each with various loads and with monthly mdcheck (paused during 
daytime, so a few transitions each month) on several kernel versions running 
for nearly a year now. Many more transitions during testing. That doesn't mean 
the fix is correct, of course. The configurations of our systems are almost 
identical and we don't do suspend or anything. But maybe you might want to 
reconsider.

[1]: 
https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777...@molgen.mpg.de/

If you want me to test V3 of your patch, please put me in the cc.

Best
  Donald


Signed-off-by: Guoqing Jiang 
---
V2:
1. add one parameter to md_reap_sync_thread per Jack's suggestion.

  drivers/md/dm-raid.c |  2 +-
  drivers/md/md.c  | 14 +-
  drivers/md/md.h  |  2 +-
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index cab12b2..0c4cbba 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3668,7 +3668,7 @@ static int raid_message(struct dm_target *ti, unsigned 
int argc, char **argv,
  if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, false);
  }
  } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
  return -EBUSY;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ca40942..0c12b7f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4857,7 +4857,7 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
  mddev_unlock(mddev);
  }
@@ -6234,7 +6234,7 @@ static void __md_stop_writes(struct mddev *mddev)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
  del_timer_sync(>safemode_timer);
@@ -9256,7 +9256,7 @@ void md_check_recovery(struct mddev *mddev)
   * ->spare_active and clear saved_raid_disk
   */
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  clear_bit(MD_RECOVERY_RECOVER, >recovery);
  clear_bit(MD_RECOVERY_NEEDED, >recovery);
  clear_bit(MD_SB_CHANGE_PENDING, >sb_flags);
@@ -9291,7 +9291,7 @@ void md_check_recovery(struct mddev *mddev)
  goto unlock;
  }
  if (mddev->sync_thread) {
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  goto unlock;
  }
  /* Set RUNNING before clearing NEEDED to avoid
@@ -9364,14 +9364,18 @@ void md_check_recovery(struct mddev *mddev)
  }
  EXPORT_SYMBOL(md_check_recovery);
-void md_reap_sync_thread(struct mddev *mddev)
+void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
  {
  struct md_rdev *rdev;
  sector_t old_dev_sectors = mddev->dev_sectors;
  bool is_reshaped = false;
  /* resync has finished, collect result */
+    if (reconfig_mutex_held)
+    mddev_unlock(mddev);
  md_unregister_thread(>sync_thread);
+    if (reconfig_mutex_held)
+    mddev_lock_nointr(mddev);
  if (!test_bit(MD_RECOVERY_INTR, >recovery) &&
  !test_bit(MD_RECOVERY_REQUESTED, >recovery) &&
  

Re: [dm-devel] [PATCH V2] md: don't unregister sync_thread with reconfig_mutex held

2021-12-09 Thread Donald Buczek

On 15.02.21 12:07, Paul Menzel wrote:

[+cc Donald]

Am 13.02.21 um 01:49 schrieb Guoqing Jiang:

Unregister sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure array.

And it could cause deadlock problem for raid5 as follows:

1. process A tried to reap sync thread with reconfig_mutex held after echo
    idle to sync_action.
2. raid5 sync thread was blocked if there were too many active stripes.
3. SB_CHANGE_PENDING was set (because of write IO comes from upper layer)
    which causes the number of active stripes can't be decreased.
4. SB_CHANGE_PENDING can't be cleared since md_check_recovery was not able
    to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc...@molgen.mpg.de/T/#t

And add one parameter to md_reap_sync_thread since it could be called by
dm-raid which doesn't hold reconfig_mutex.

Reported-and-tested-by: Donald Buczek 


Thanks, Paul, for putting me into the cc.

Guoqing, I don't think, I've tested this patch. Please remove the tested-by.

btw: We have the fix I suggested [1] running on 59 production raid6 sets with 
16 disk each with various loads and with monthly mdcheck (paused during 
daytime, so a few transitions each month) on several kernel versions running 
for nearly a year now. Many more transitions during testing. That doesn't mean 
the fix is correct, of course. The configurations of our systems are almost 
identical and we don't do suspend or anything. But maybe you might want to 
reconsider.

[1]: 
https://lore.kernel.org/linux-raid/bc342de0-98d2-1733-39cd-cc1999777...@molgen.mpg.de/

If you want me to test V3 of your patch, please put me in the cc.

Best
  Donald


Signed-off-by: Guoqing Jiang 
---
V2:
1. add one parameter to md_reap_sync_thread per Jack's suggestion.

  drivers/md/dm-raid.c |  2 +-
  drivers/md/md.c  | 14 +-
  drivers/md/md.h  |  2 +-
  3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index cab12b2..0c4cbba 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3668,7 +3668,7 @@ static int raid_message(struct dm_target *ti, unsigned 
int argc, char **argv,
  if (!strcasecmp(argv[0], "idle") || !strcasecmp(argv[0], "frozen")) {
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, false);
  }
  } else if (decipher_sync_action(mddev, mddev->recovery) != st_idle)
  return -EBUSY;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index ca40942..0c12b7f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4857,7 +4857,7 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
  mddev_unlock(mddev);
  }
@@ -6234,7 +6234,7 @@ static void __md_stop_writes(struct mddev *mddev)
  flush_workqueue(md_misc_wq);
  if (mddev->sync_thread) {
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  }
  del_timer_sync(>safemode_timer);
@@ -9256,7 +9256,7 @@ void md_check_recovery(struct mddev *mddev)
   * ->spare_active and clear saved_raid_disk
   */
  set_bit(MD_RECOVERY_INTR, >recovery);
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  clear_bit(MD_RECOVERY_RECOVER, >recovery);
  clear_bit(MD_RECOVERY_NEEDED, >recovery);
  clear_bit(MD_SB_CHANGE_PENDING, >sb_flags);
@@ -9291,7 +9291,7 @@ void md_check_recovery(struct mddev *mddev)
  goto unlock;
  }
  if (mddev->sync_thread) {
-    md_reap_sync_thread(mddev);
+    md_reap_sync_thread(mddev, true);
  goto unlock;
  }
  /* Set RUNNING before clearing NEEDED to avoid
@@ -9364,14 +9364,18 @@ void md_check_recovery(struct mddev *mddev)
  }
  EXPORT_SYMBOL(md_check_recovery);
-void md_reap_sync_thread(struct mddev *mddev)
+void md_reap_sync_thread(struct mddev *mddev, bool reconfig_mutex_held)
  {
  struct md_rdev *rdev;
  sector_t old_dev_sectors = mddev->dev_sectors;
  bool is_reshaped = false;
  /* resync has finished, collect result */
+    if (reconfig_mutex_held)
+    mddev_unlock(mddev);
  md_unregister_thread(>sync_thread);
+    if (reconfig_mutex_held)
+    mddev_lock_nointr(mddev);
  if (!test_bit(MD_RECOVERY_INTR, >recovery) &&
  !test_bit(MD_RECOVERY_REQUESTED, >recovery) &&
  mddev->degraded != mdde