On 25.01.21 09:54, Donald Buczek wrote:
Dear Guoqing,
a colleague of mine was able to reproduce the issue inside a vm and found a
procedure to drive the vm into the issue within minutes (instead of unreliably
after hours on a physical system, as before). This, of course, helped to
pinpoint the problem.
My current theory of what is happening is:
- MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants
to do a write and the array transitions from "clean" to "active".
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
- Before raid5d gets to write the superblock (it's busy processing active
stripes because of the sync activity), userspace wants to pause the check by
`echo idle > /sys/block/mdX/md/sync_action`.
- action_store() takes the reconfig_mutex before trying to stop the sync
thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689)
The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex
is non-zero, i.e. held.
- raid5d is running in its main loop. raid5d()->handle_active_stripes() returns
a positive batch size
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329),
although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any
stripe because of MD_SB_CHANGE_PENDING
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729).
This is the reason raid5d is busy-looping.
- raid5d()->md_check_recovery() is called by the raid5d main loop. One of its
duties is to write the superblock if a change is pending. However, to do so, it
either needs MD_ALLOW_SB_UPDATE to be set or must be able to take the
reconfig_mutex
(https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967,
https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither
is the case, so the superblock is not written and MD_SB_CHANGE_PENDING is not
cleared (see the abridged sketch after this list).
- (as discussed previously) the sync thread is waiting for the number of active
stripes to go down and doesn't terminate. The userspace thread is waiting for
the sync thread to terminate.
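For reference, a minimal sketch of how I read the md_check_recovery() logic in
question (abridged and paraphrased from the v5.4 code linked above, not a
verbatim quote):
```
void md_check_recovery(struct mddev *mddev)
{
	/* Early path: only taken when MD_ALLOW_SB_UPDATE is set, which
	 * (as far as I can see) only mddev_suspend() arranges for. */
	if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags)
		md_update_sb(mddev, 0);

	/* ... */

	/* Normal path: needs the reconfig_mutex, which action_store() is
	 * currently holding while it waits for the sync thread. */
	if (mddev_trylock(mddev)) {
		/* ... */
		if (mddev->sb_flags)
			md_update_sb(mddev, 0);
		/* ... */
		mddev_unlock(mddev);
	}

	/* In our situation neither path is taken, so MD_SB_CHANGE_PENDING
	 * stays set and raid5d keeps spinning. */
}
```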
Does this make sense?
Just for reference, here is the procedure that triggers the issue on the vm
(with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
```
#! /bin/bash

# Toggle a check/idle cycle on md3's sync thread in the background.
(
    while true; do
        echo "start check"
        echo check > /sys/block/md3/md/sync_action
        sleep 10
        echo "stop check"
        echo idle > /sys/block/md3/md/sync_action
        sleep 2
    done
) &

# Generate continuous file-system write activity on the array.
(
    while true; do
        dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
        sync /mnt/raid_md3/bigfile
        rm /mnt/raid_md3/bigfile
        sleep .1
    done
) &

# Watch the array state; if it stays in write-pending for three consecutive
# samples, consider the deadlock triggered and kill the whole process group.
start="$(date +%s)"
cd /sys/block/md3/md
wp_count=0
while true; do
    array_state=$(cat array_state)
    if [ "$array_state" = write-pending ]; then
        wp_count=$(($wp_count+1))
    else
        wp_count=0
    fi
    echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
    if [ $wp_count -ge 3 ]; then
        kill -- -$$
        exit
    fi
    sleep 1
done
```
The time this needs to trigger the issue varies from under a minute to one
hour, with 5 minutes being typical. The output columns are seconds since start,
sync_action, sync_completed, array_state and stripe_cache_active. It ends like
this:
309 check 6283872 / 8378368 active-idle 4144
310 check 6283872 / 8378368 active 1702
311 check 6807528 / 8378368 active 4152
312 check 7331184 / 8378368 clean 3021
stop check
313 check 7331184 / 8378368 write-pending 3905
314 check 7331184 / 8378368 write-pending 3905
315 check 7331184 / 8378368 write-pending 3905
Terminated
If I add

    pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
             batch_size, released, mddev->sb_flags);

in raid5d() after the call to handle_active_stripes() and enable that debug
site after the deadlock has occurred, I get:
[ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
[ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
If I add
pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__,
__LINE__, mddev->flags, mddev->sb_flags);
at the head of md_check_recovery, I get:
[ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
[ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
More debug lines in md_check_recovery() confirm the control flow: the `if
(mddev_trylock(mddev))` block is not taken.
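In both traces sb_flags is 0x6 which, if I decode enum mddev_sb_flags
correctly, is MD_SB_CHANGE_CLEAN | MD_SB_CHANGE_PENDING, i.e. exactly the state
assumed above. (Such pr_debug sites can be switched on at runtime via dynamic
debug, e.g. `echo 'func md_check_recovery +p' >
/sys/kernel/debug/dynamic_debug/control`, given CONFIG_DYNAMIC_DEBUG and a
mounted debugfs.)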
What approach would you suggest to fix this?
I naively tried the following patch and it seems to fix the problem. The test
procedure didn't trigger the deadlock in 10 hours.
D.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
 			clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
 		if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
 		    mddev_lock(mddev) == 0) {
+			set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			flush_workqueue(md_misc_wq);
 			if (mddev->sync_thread) {
 				set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 				md_reap_sync_thread(mddev);
 			}
+			clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
 			mddev_unlock(mddev);
 		}
 	} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0