Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-09 Thread Donald Buczek
On 09.02.21 01:46, Guoqing Jiang wrote: Great. I will send a formal patch with your reported-by and tested-by. Yes, that's fine. Thanks a lot for your help! Donald Thanks, Guoqing

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-08 Thread Guoqing Jiang
Hi Donald, On 2/8/21 19:41, Donald Buczek wrote: Dear Guoqing, On 08.02.21 15:53, Guoqing Jiang wrote: On 2/8/21 12:38, Donald Buczek wrote: 5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this. /* resync has finished, collect result */

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-08 Thread Donald Buczek
Dear Guoqing, On 08.02.21 15:53, Guoqing Jiang wrote: On 2/8/21 12:38, Donald Buczek wrote: 5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this. /* resync has finished, collect result */ mddev_unlock(mddev);

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-08 Thread Guoqing Jiang
On 2/8/21 12:38, Donald Buczek wrote: 5. maybe don't hold reconfig_mutex when trying to unregister sync_thread, like this. /* resync has finished, collect result */ mddev_unlock(mddev); md_unregister_thread(&mddev->sync_thread); mddev_lock(mddev); As above: While

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-08 Thread Donald Buczek
On 02.02.21 16:42, Guoqing Jiang wrote: Hi Donald, On 1/26/21 17:05, Donald Buczek wrote: Dear Guoqing, On 26.01.21 15:06, Guoqing Jiang wrote: On 1/26/21 13:58, Donald Buczek wrote: Hmm, how about waking the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-02-02 Thread Guoqing Jiang
Hi Donald, On 1/26/21 17:05, Donald Buczek wrote: Dear Guoqing, On 26.01.21 15:06, Guoqing Jiang wrote: On 1/26/21 13:58, Donald Buczek wrote: Hmm, how about waking the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread)

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Donald Buczek
Dear Guoqing, a colleague of mine was able to reproduce the issue inside a VM and found a procedure to run the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint the problem. My current theory of what is

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Donald Buczek
Dear Guoqing, On 26.01.21 15:06, Guoqing Jiang wrote: On 1/26/21 13:58, Donald Buczek wrote: Hmm, how about waking the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) md_check_recovery(mddev);

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Guoqing Jiang
On 1/26/21 13:58, Donald Buczek wrote: Hmm, how about waking the waiter up in the while loop of raid5d? @@ -6520,6 +6532,11 @@ static void raid5d(struct md_thread *thread) md_check_recovery(mddev); spin_lock_irq(&conf->device_lock);

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Donald Buczek
On 26.01.21 12:14, Guoqing Jiang wrote: Hi Donald, On 1/26/21 10:50, Donald Buczek wrote: [...] diff --git a/drivers/md/md.c b/drivers/md/md.c index 2d21c298ffa7..f40429843906 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev,

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Guoqing Jiang
Hi Donald, On 1/26/21 10:50, Donald Buczek wrote: [...] diff --git a/drivers/md/md.c b/drivers/md/md.c index 2d21c298ffa7..f40429843906 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-26 Thread Donald Buczek
Dear Guoqing, On 26.01.21 01:44, Guoqing Jiang wrote: Hi Donald, On 1/25/21 22:32, Donald Buczek wrote: On 25.01.21 09:54, Donald Buczek wrote: Dear Guoqing, a colleague of mine was able to reproduce the issue inside a VM and found a procedure to run the VM into the issue

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-25 Thread Guoqing Jiang
Hi Donald, On 1/25/21 22:32, Donald Buczek wrote: On 25.01.21 09:54, Donald Buczek wrote: Dear Guoqing, a colleague of mine was able to reproduce the issue inside a VM and found a procedure to run the VM into the issue within minutes (rather than unreliably after hours on a physical

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-25 Thread Donald Buczek
On 25.01.21 09:54, Donald Buczek wrote: Dear Guoqing, a colleague of mine was able to reproduce the issue inside a VM and found a procedure to run the VM into the issue within minutes (rather than unreliably after hours on a physical system, as before). This of course helped to pinpoint

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-23 Thread Donald Buczek
Dear Guoqing, On 20.01.21 17:33, Guoqing Jiang wrote: Hi Donald, On 1/19/21 12:30, Donald Buczek wrote: Dear md-raid people, I've reported a problem in this thread in December: "We are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-20 Thread Guoqing Jiang
Hi Donald, On 1/19/21 12:30, Donald Buczek wrote: Dear md-raid people, I've reported a problem in this thread in December: "We are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2021-01-19 Thread Donald Buczek
Dear md-raid people, I've reported a problem in this thread in December: "We are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions." It was clear,

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-12-21 Thread Donald Buczek
Dear Guoqing, I now think that this is not an issue for md. I've driven a system into that situation again and have a clear indication that this is a problem of the member block device driver. With md0 in the described erroneous state (md0_raid6 busy looping, echo idle > .../sync_action

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-12-03 Thread Donald Buczek
Dear Guoqing, On 12/3/20 2:55 AM, Guoqing Jiang wrote: Hi Donald, On 12/2/20 18:28, Donald Buczek wrote: Dear Guoqing, unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used: --- a/drivers/md/md.c +++ b/drivers/md/md.c @@

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-12-02 Thread Guoqing Jiang
Hi Donald, On 12/2/20 18:28, Donald Buczek wrote: Dear Guoqing, unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used: --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-12-02 Thread Donald Buczek
Dear Guoqing, unfortunately the patch didn't fix the problem (unless I messed it up with my logging). This is what I used: --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -9305,6 +9305,14 @@ void md_check_recovery(struct mddev *mddev)

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-12-01 Thread Donald Buczek
On 30.11.20 at 03:06, Guoqing Jiang wrote: On 11/28/20 13:25, Donald Buczek wrote: Dear Linux mdraid people, we are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on

Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-11-29 Thread Guoqing Jiang
On 11/28/20 13:25, Donald Buczek wrote: Dear Linux mdraid people, we are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. The last time

md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

2020-11-28 Thread Donald Buczek
Dear Linux mdraid people, we are using raid6 on several servers. Occasionally we had failures where an mdX_raid6 process seems to go into a busy loop and all I/O to the md device blocks. We've seen this on various kernel versions. The last time this happened (in this case with Linux