2.6.23.1: mdadm/raid5 hung/d-state
# ps auxww | grep D
USER  PID %CPU %MEM  VSZ  RSS TTY STAT START  TIME  COMMAND
root  273  0.0  0.0    0    0  ?  D    Oct21  14:40 [pdflush]
root  274  0.0  0.0    0    0  ?  D    Oct21  13:00 [pdflush]

After several days/weeks, this is the second time this has happened: while doing regular file I/O (decompressing a file), everything on the device went into D-state.

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Wed Aug 22 10:38:53 2007
     Raid Level : raid5
     Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
  Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Sun Nov  4 06:38:29 2007
          State : active
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 1024K

           UUID : e37a12d1:1b0b989a:083fb634:68e9eb49
         Events : 0.4309

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1
       8       8      161        8      active sync   /dev/sdk1
       9       8      177        9      active sync   /dev/sdl1

If I wanted to find out what is causing this, what type of debugging would I have to enable to track it down? Any attempt to read/write files on the devices fails (also going into D-state). Is there any useful information I can get currently, before rebooting the machine?
# pwd
/sys/block/md3/md
# ls
array_state      dev-sdj1/          rd2@               stripe_cache_active
bitmap_set_bits  dev-sdk1/          rd3@               stripe_cache_size
chunk_size       dev-sdl1/          rd4@               suspend_hi
component_size   layout             rd5@               suspend_lo
dev-sdc1/        level              rd6@               sync_action
dev-sdd1/        metadata_version   rd7@               sync_completed
dev-sde1/        mismatch_cnt       rd8@               sync_speed
dev-sdf1/        new_dev            rd9@               sync_speed_max
dev-sdg1/        raid_disks         reshape_position   sync_speed_min
dev-sdh1/        rd0@               resync_start
dev-sdi1/        rd1@               safe_mode_delay
# cat array_state
active-idle
# cat mismatch_cnt
0
# cat stripe_cache_active
1
# cat stripe_cache_size
16384
# cat sync_action
idle
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md1 : active raid1 sdb2[1] sda2[0]
      136448 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      129596288 blocks [2/2] [UU]
md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318680576 blocks level 5, 1024k chunk, algorithm 2 [10/10] [UU]
md0 : active raid1 sdb1[1] sda1[0]
      16787776 blocks [2/2] [UU]

unused devices: <none>
#

Justin.
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:
> # ps auxww | grep D
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]
>
> After several days/weeks, this is the second time this has happened,
> while doing regular file I/O (decompressing a file), everything on the
> device went into D-state.

The next time you come across something like that, do a SysRq-T dump and
post that.  It shows a stack trace of all processes - and in particular,
where exactly each task is stuck.

/mjt
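For anyone following along, a short sketch of the two usual ways to trigger the task-state dump: the 'T' SysRq command from the console keyboard, or the procfs trigger when no keyboard is attached. This assumes the kernel was built with CONFIG_MAGIC_SYSRQ.

```shell
# Enable all magic-SysRq functions (0 disables them entirely)
echo 1 > /proc/sys/kernel/sysrq

# Request the task-state dump ('t') without a keyboard, via procfs;
# equivalent to pressing Alt+SysRq+T on the console
echo t > /proc/sysrq-trigger

# The stack traces land in the kernel ring buffer / syslog
dmesg | less
```

These commands require root, and on a large system the dump can overflow the ring buffer, which is exactly the problem discussed further down in this thread.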
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:
> # ps auxww | grep D
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]
>
> After several days/weeks, this is the second time this has happened,
> while doing regular file I/O (decompressing a file), everything on the
> device went into D-state.

Same observation here (kernel 2.6.23). I can see this bug when I try to
synchronize a raid1 volume over iSCSI (each element is a raid5 volume), or
sometimes with just a 1.5 TB raid5 volume. When this bug occurs, the md
subsystem eats 100% of one CPU and pdflush remains in D state too.

What is your architecture? I use two 32-thread T1000s (sparc64), and I'm
trying to determine if this bug is arch specific.

Regards,

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Sun, 4 Nov 2007, BERTRAND Joël wrote:
> Same observation here (kernel 2.6.23). I can see this bug when I try to
> synchronize a raid1 volume over iSCSI (each element is a raid5 volume), or
> sometimes with just a 1.5 TB raid5 volume. When this bug occurs, the md
> subsystem eats 100% of one CPU and pdflush remains in D state too.
>
> What is your architecture? I use two 32-thread T1000s (sparc64), and I'm
> trying to determine if this bug is arch specific.

Using x86_64 here (Q6600/Intel DG965WH).

Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:
> On Sun, 4 Nov 2007, Michael Tokarev wrote:
[]
>> The next time you come across something like that, do a SysRq-T dump and
>> post that.  It shows a stack trace of all processes - and in particular,
>> where exactly each task is stuck.

> Yes I got it before I rebooted, ran that and then dmesg > file.
>
> Here it is:
>
> [1172609.665902] 80747dc0 80747dc0 80747dc0 80744d80
> [1172609.668768] 80747dc0 81015c3aa918 810091c899b4 810091c899a8

That's only a partial list.  All the kernel threads - which are the most
important in this context - aren't shown.  You ran out of dmesg buffer,
and the most interesting entries were at the beginning.  If your /var/log
partition is working, the stuff should be in /var/log/kern.log or
equivalent.  If it's not working, there is still a way to capture the
info: stop syslogd, cat /proc/kmsg to some tmpfs file, and scp it
elsewhere.

/mjt
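The capture procedure Michael describes can be sketched as follows. The init-script path and remote host are placeholders; adjust for your distribution.

```shell
# Stop syslogd so nothing else drains /proc/kmsg
# (path varies by distro -- this one is an assumption)
/etc/init.d/sysklogd stop

# Trigger the task dump, then stream kernel messages into tmpfs,
# which works even when every block device is wedged in D-state.
# Interrupt the cat with Ctrl-C once the trace has been emitted.
echo t > /proc/sysrq-trigger
cat /proc/kmsg > /dev/shm/kmsg-capture.txt

# Copy the capture off the machine; "otherhost" is a placeholder
scp /dev/shm/kmsg-capture.txt user@otherhost:
```

The point of tmpfs (/dev/shm) is that it lives entirely in RAM, so the capture never touches the hung md device.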
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Sun, 4 Nov 2007, Michael Tokarev wrote:
> That's only a partial list.  All the kernel threads - which are the most
> important in this context - aren't shown.  You ran out of dmesg buffer,
> and the most interesting entries were at the beginning.  If your /var/log
> partition is working, the stuff should be in /var/log/kern.log or
> equivalent.  If it's not working, there is still a way to capture the
> info: stop syslogd, cat /proc/kmsg to some tmpfs file, and scp it
> elsewhere.
>
> /mjt

Will do that the next time it happens, thanks.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Michael Tokarev wrote:
> Justin Piszcz wrote:
>> On Sun, 4 Nov 2007, Michael Tokarev wrote:
> []
>>> The next time you come across something like that, do a SysRq-T dump and
>>> post that.  It shows a stack trace of all processes - and in particular,
>>> where exactly each task is stuck.
>
>> Yes I got it before I rebooted, ran that and then dmesg > file.
>
> That's only a partial list.  All the kernel threads - which are the most
> important in this context - aren't shown.  You ran out of dmesg buffer,
> and the most interesting entries were at the beginning.  If your /var/log
> partition is working, the stuff should be in /var/log/kern.log or
> equivalent.  If it's not working, there is still a way to capture the
> info: stop syslogd, cat /proc/kmsg to some tmpfs file, and scp it
> elsewhere.

or netconsole is actually pretty easy and incredibly useful in this kind
of situation, even if there's no disk at all :)

David
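A minimal netconsole setup, for reference. All addresses, ports, interface names, and the MAC below are placeholders; the kernel needs CONFIG_NETCONSOLE (built-in or as a module).

```shell
# On the receiving machine: listen for the UDP log stream
nc -u -l -p 6666

# On the machine being debugged: load the netconsole module.
# Parameter format: src-port@src-ip/dev,dst-port@dst-ip/dst-mac
modprobe netconsole \
  netconsole=6665@192.168.0.10/eth0,6666@192.168.0.20/00:11:22:33:44:55

# From now on, everything written to the kernel log (including a
# SysRq-T dump) is also sent over the network, bypassing local disks.
echo t > /proc/sysrq-trigger
```

Because the messages go straight out the NIC from interrupt context, this works even when every local filesystem is hung.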
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Sunday November 4, [EMAIL PROTECTED] wrote:
> # ps auxww | grep D
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]
>
> After several days/weeks, this is the second time this has happened, while
> doing regular file I/O (decompressing a file), everything on the device
> went into D-state.

At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

  commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that patch applied badly and needed to be fixed with the
following patch (not in git yet).  These have been sent to stable@ and
should be in the queue for 2.6.23.2.

NeilBrown

Fix misapplied patch in raid5.c

commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.
Signed-off-by: Neil Brown <[EMAIL PROTECTED]>

### Diffstat output
 ./drivers/md/raid5.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c	2007-11-02 12:10:49.0 +1100
+++ ./drivers/md/raid5.c	2007-11-02 12:25:31.0 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
 	struct r5dev *dev;
 	unsigned long pending = 0;

 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
 		atomic_read(&sh->count), sh->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);

 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);

 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */

+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		struct r5dev *dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);

 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
 			"written %p\n", i, dev->flags, dev->toread, dev->read,
 			dev->towrite, dev->written);

 		/* maybe we can request a biofill operation
 		 *
 		 * new wantfill requests are only permitted while
 		 * STRIPE_OP_BIOFILL is clear
 		 */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
 			!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 			set_bit(R5_Wantfill, &dev->flags);

 		/* now count some things */
@@ -2880,47 +2887,40 @@ static void handle_stripe6(struct stripe
 	struct stripe_head_state s;
 	struct r6_state r6s;
 	struct r5dev *dev, *pdev, *qdev;
 	r6s.qd_idx = raid6_next_disk(pd_idx, disks);

 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
 		"pd_idx=%d, qd_idx=%d\n",
 		(unsigned long long)sh->sector, sh->state,
 		atomic_read(&sh->count), pd_idx, r6s.qd_idx);
 	memset(&s, 0, sizeof(s));

 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);

 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */

-	/* clean-up completed biofill operations */
-	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
-	}
-
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);

 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Neil Brown wrote:
> On Sunday November 4, [EMAIL PROTECTED] wrote:
>> # ps auxww | grep D
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
>> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]
>>
>> After several days/weeks, this is the second time this has happened, while
>> doing regular file I/O (decompressing a file), everything on the device
>> went into D-state.
>
> At a guess (I haven't looked closely) I'd say it is the bug that was
> meant to be fixed by
>
>   commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>
> except that patch applied badly and needed to be fixed with the
> following patch (not in git yet).  These have been sent to stable@ and
> should be in the queue for 2.6.23.2.

Ah, thanks Neil, will be updating as soon as it is released, thanks.

Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Neil Brown wrote:
> On Sunday November 4, [EMAIL PROTECTED] wrote:
>> After several days/weeks, this is the second time this has happened, while
>> doing regular file I/O (decompressing a file), everything on the device
>> went into D-state.
>
> At a guess (I haven't looked closely) I'd say it is the bug that was
> meant to be fixed by
>
>   commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>
> except that patch applied badly and needed to be fixed with the
> following patch (not in git yet).  These have been sent to stable@ and
> should be in the queue for 2.6.23.2.

My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

	...
	spin_lock(&sh->lock);
	clear_bit(STRIPE_HANDLE, &sh->state);
	clear_bit(STRIPE_DELAYED, &sh->state);

	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
	/* Now to look around and see what can be done */

	/* clean-up completed biofill operations */
	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
	}

	rcu_read_lock();
	for (i=disks; i--; ) {
		mdk_rdev_t *rdev;
		struct r5dev *dev = &sh->dev[i];
	...

but it doesn't fix this bug.

Regards,

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On 11/4/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
> On Mon, 5 Nov 2007, Neil Brown wrote:
>
>> On Sunday November 4, [EMAIL PROTECTED] wrote:
>>> # ps auxww | grep D
>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>> root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
>>> root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]
>>>
>>> After several days/weeks, this is the second time this has happened, while
>>> doing regular file I/O (decompressing a file), everything on the device
>>> went into D-state.
>>
>> At a guess (I haven't looked closely) I'd say it is the bug that was
>> meant to be fixed by
>>
>>   commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>>
>> except that patch applied badly and needed to be fixed with
>> the following patch (not in git yet).
>> These have been sent to stable@ and should be in the queue for 2.6.23.2
>
> Ah, thanks Neil, will be updating as soon as it is released, thanks.

Are you seeing the same "md thread takes 100% of the CPU" that Joël is
reporting?
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Mon, 5 Nov 2007, Dan Williams wrote:
> On 11/4/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
>> On Mon, 5 Nov 2007, Neil Brown wrote:
>>> At a guess (I haven't looked closely) I'd say it is the bug that was
>>> meant to be fixed by
>>>
>>>   commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>>>
>>> except that patch applied badly and needed to be fixed with the
>>> following patch (not in git yet).  These have been sent to stable@
>>> and should be in the queue for 2.6.23.2
>>
>> Ah, thanks Neil, will be updating as soon as it is released, thanks.
>
> Are you seeing the same "md thread takes 100% of the CPU" that Joël is
> reporting?

Yes, in another e-mail I posted the top output with md3_raid5 at 100%.

Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On 11/5/07, Justin Piszcz <[EMAIL PROTECTED]> wrote:
[..]
>> Are you seeing the same "md thread takes 100% of the CPU" that Joël is
>> reporting?
>
> Yes, in another e-mail I posted the top output with md3_raid5 at 100%.

This seems too similar to Joël's situation for them not to be correlated,
and it shows that iscsi is not a necessary component of the failure.

The attached patch allows the debug statements in MD to be enabled via
sysfs.  Joël, since it is easier for you to reproduce, can you capture
the kernel log output after the raid thread goes into the spin?  It will
help if you have CONFIG_PRINTK_TIME=y set in your kernel configuration.
After the failure, run:

echo 1 > /sys/block/md_d0/md/debug_print_enable; sleep 5; echo 0 > /sys/block/md_d0/md/debug_print_enable

...to enable the print messages for a few seconds.  Please send the
output in a private message if it proves too big for the mailing list.

raid5-debug-print-enable.patch
Description: Binary data
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Done. Here is the obtained output:

[ 1260.967796] for sector 7629696, rmw=0 rcw=0
[ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1260.980606] check 5: state 0x6 toread read write f800ffcffcc0 written
[ 1260.994808] check 4: state 0x6 toread read write f800fdd4e360 written
[ 1261.009325] check 3: state 0x1 toread read write written
[ 1261.244478] check 2: state 0x1 toread read write written
[ 1261.270821] check 1: state 0x6 toread read write f800ff517e40 written
[ 1261.312320] check 0: state 0x6 toread read write f800fd4cae60 written
[ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1261.443120] for sector 7629696, rmw=0 rcw=0
[ 1261.453348] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1261.491538] check 5: state 0x6 toread read write f800ffcffcc0 written
[ 1261.529120] check 4: state 0x6 toread read write f800fdd4e360 written
[ 1261.560151] check 3: state 0x1 toread read write written
[ 1261.599180] check 2: state 0x1 toread read write written
[ 1261.637138] check 1: state 0x6 toread read write f800ff517e40 written
[ 1261.674502] check 0: state 0x6 toread read write f800fd4cae60 written
[ 1261.712589] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1261.864338] for sector 7629696, rmw=0 rcw=0
[ 1261.873475] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1261.907840] check 5: state 0x6 toread read write f800ffcffcc0 written
[ 1261.950770] check 4: state 0x6 toread read write f800fdd4e360 written
[ 1261.989003] check 3: state 0x1 toread read write written
[ 1262.019621] check 2: state 0x1 toread read write written
[ 1262.068705] check 1: state 0x6 toread read write f800ff517e40 written
[ 1262.113265] check 0: state 0x6 toread read write f800fd4cae60 written
[ 1262.150511] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1262.171143] for sector 7629696, rmw=0 rcw=0
[ 1262.179142] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1262.201905] check 5: state 0x6 toread read write f800ffcffcc0 written
[ 1262.252750] check 4: state 0x6 toread read write f800fdd4e360 written
[ 1262.289631] check 3: state 0x1 toread read write written
[ 1262.344709] check 2: state 0x1 toread read write written
[ 1262.400411] check 1: state 0x6 toread read write f800ff517e40 written
[ 1262.437353] check 0: state 0x6 toread read write f800fd4cae60 written
[ 1262.492561] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1262.524993] for sector 7629696, rmw=0 rcw=0
[ 1262.533314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
[ 1262.561900] check 5: state 0x6 toread read write f800ffcffcc0 written
[ 1262.588986] check 4: state 0x6 toread read write f800fdd4e360 written
[ 1262.619455] check 3: state 0x1 toread read write written
[ 1262.671006] check 2: state 0x1 toread read write written
[ 1262.709065] check 1: state 0x6 toread read write f800ff517e40 written
[ 1262.746904] check 0: state 0x6 toread read write f800fd4cae60 written
[ 1262.780203] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
[ 1262.805941] for sector 7629696, rmw=0 rcw=0
[ 1262.815759] handl
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote:
> Done. Here is the obtained output:
>
> [ 1265.899068] check 4: state 0x6 toread read write f800fdd4e360 written
> [ 1265.941328] check 3: state 0x1 toread read write written
> [ 1265.972129] check 2: state 0x1 toread read write written
>
> For information, after crash, I have:
>
> Root poulenc:[/sys/block] > cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
>       1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UU]
>
> Regards,
>
> JKB

After the crash it is not 'resyncing'?

Justin.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:
> On Tue, 6 Nov 2007, BERTRAND Joël wrote:
>> For information, after crash, I have:
>>
>> Root poulenc:[/sys/block] > cat /proc/mdstat
>> Personalities : [raid1] [raid6] [raid5] [raid4]
>> md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
>>       1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UU]
>
> After the crash it is not 'resyncing'?

No, it isn't...

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 6 Nov 2007, BERTRAND Joël wrote:
> Justin Piszcz wrote:
>> On Tue, 6 Nov 2007, BERTRAND Joël wrote:
>>> For information, after crash, I have:
>>>
>>> Root poulenc:[/sys/block] > cat /proc/mdstat
>>> Personalities : [raid1] [raid6] [raid5] [raid4]
>>> md_d0 : active raid5 sdc1[0] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1]
>>>       1464725760 blocks level 5, 64k chunk, algorithm 2 [6/6] [UU]
>>
>> After the crash it is not 'resyncing'?
>
> No, it isn't...
>
> JKB

After any crash/unclean shutdown the RAID should resync; if it doesn't,
that's not good, and I'd suggest running a raid check.

The 'repair' is supposed to clean it; in some cases (md0=swap) it gets
dirty again.

Tue May  8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0...
Tue May  8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1...
Tue May  8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2...
Tue May  8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3...
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 2176
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:09:58 EDT 2007: 0
Tue May  8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched sectors.
Tue May  8 10:09:58 EDT 2007: Executing repair on /dev/md0
Tue May  8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched sectors.
Tue May  8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched sectors.
Tue May  8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched sectors.
Tue May  8 10:20:02 EDT 2007: All devices are clean...
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 2176
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
Tue May  8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
Tue May  8 10:20:02 EDT 2007: 0
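The check/repair cycle behind that log boils down to writing keywords into the array's `sync_action` sysfs file, roughly as follows (md0 is just the example device from the log above):

```shell
# Kick off a background consistency check of md0; reads all stripes
# and counts (but does not fix) parity/copy mismatches
echo check > /sys/block/md0/md/sync_action

# Watch progress until the check finishes
cat /proc/mdstat

# Number of sectors found inconsistent during the last check
cat /sys/block/md0/md/mismatch_cnt

# Rewrite mismatched parity (raid5) or copies (raid1) in place
echo repair > /sys/block/md0/md/sync_action
```

Note that `mismatch_cnt` is only updated by a scan, so after a `repair` it still shows the count from that repair pass; a second `check` is needed to confirm the array is now clean.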
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Justin Piszcz wrote:
> On Tue, 6 Nov 2007, BERTRAND Joël wrote:
>> Justin Piszcz wrote:
>>> After the crash it is not 'resyncing'?
>>
>> No, it isn't...
>>
>> JKB
>
> After any crash/unclean shutdown the RAID should resync; if it doesn't,
> that's not good, and I'd suggest running a raid check.
>
> The 'repair' is supposed to clean it; in some cases (md0=swap) it gets
> dirty again.
>
> Tue May  8 09:19:54 EDT 2007: Executing RAID health check for /dev/md0...
> Tue May  8 09:19:55 EDT 2007: Executing RAID health check for /dev/md1...
> Tue May  8 09:19:56 EDT 2007: Executing RAID health check for /dev/md2...
> Tue May  8 09:19:57 EDT 2007: Executing RAID health check for /dev/md3...
> Tue May  8 10:09:58 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
> Tue May  8 10:09:58 EDT 2007: 2176
> Tue May  8 10:09:58 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
> Tue May  8 10:09:58 EDT 2007: 0
> Tue May  8 10:09:58 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
> Tue May  8 10:09:58 EDT 2007: 0
> Tue May  8 10:09:58 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
> Tue May  8 10:09:58 EDT 2007: 0
> Tue May  8 10:09:58 EDT 2007: The meta-device /dev/md0 has 2176 mismatched sectors.
> Tue May  8 10:09:58 EDT 2007: Executing repair on /dev/md0
> Tue May  8 10:09:59 EDT 2007: The meta-device /dev/md1 has no mismatched sectors.
> Tue May  8 10:10:00 EDT 2007: The meta-device /dev/md2 has no mismatched sectors.
> Tue May  8 10:10:01 EDT 2007: The meta-device /dev/md3 has no mismatched sectors.
> Tue May  8 10:20:02 EDT 2007: All devices are clean...
> Tue May  8 10:20:02 EDT 2007: cat /sys/block/md0/md/mismatch_cnt
> Tue May  8 10:20:02 EDT 2007: 2176
> Tue May  8 10:20:02 EDT 2007: cat /sys/block/md1/md/mismatch_cnt
> Tue May  8 10:20:02 EDT 2007: 0
> Tue May  8 10:20:02 EDT 2007: cat /sys/block/md2/md/mismatch_cnt
> Tue May  8 10:20:02 EDT 2007: 0
> Tue May  8 10:20:02 EDT 2007: cat /sys/block/md3/md/mismatch_cnt
> Tue May  8 10:20:02 EDT 2007: 0

I cannot repair this raid volume. I cannot reboot the server without
sending Stop+A. init 6 stops at "INIT:". After reboot, md0 is
resynchronized.

Regards,

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
> Done. Here is the obtained output:

Much appreciated.

> [ 1260.969314] handling stripe 7629696, state=0x14 cnt=1, pd_idx=2 ops=0:0:0
> [ 1260.980606] check 5: state 0x6 toread read write f800ffcffcc0 written
> [ 1260.994808] check 4: state 0x6 toread read write f800fdd4e360 written
> [ 1261.009325] check 3: state 0x1 toread read write written
> [ 1261.244478] check 2: state 0x1 toread read write written
> [ 1261.270821] check 1: state 0x6 toread read write f800ff517e40 written
> [ 1261.312320] check 0: state 0x6 toread read write f800fd4cae60 written
> [ 1261.361030] locked=4 uptodate=2 to_read=0 to_write=4 failed=0 failed_num=0
> [ 1261.443120] for sector 7629696, rmw=0 rcw=0
[..]

This looks as if the blocks were prepared to be written out, but were
never handled in ops_run_biodrain(), so they remain locked forever.  The
operations flags are all clear, which means handle_stripe thinks nothing
else needs to be done.

The following patch, also attached, cleans up cases where the code looks
at sh->ops.pending when it should be looking at the consistent
stack-based snapshot of the operations flags.
---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }

 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		 unsigned long pending)
 {
 	int disks = sh->disks;
 	int pd_idx = sh->pd_idx, i;
@@ -701,7 +702,7 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	/* check if prexor is active which means only process blocks
 	 * that are part of a read-modify-write (Wantprexor)
 	 */
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);

 	pr_debug("%s: stripe %llu\n", __FUNCTION__,
 		(unsigned long long)sh->sector);
@@ -778,7 +779,8 @@ static void ops_complete_write(void *stripe_head_ref)
 }

 static void
-ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx,
+		unsigned long pending)
 {
 	/* kernel stack size limits the total number of disks */
 	int disks = sh->disks;
@@ -786,7 +788,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	int count = 0, pd_idx = sh->pd_idx, i;
 	struct page *xor_dest;
-	int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending);
+	int prexor = test_bit(STRIPE_OP_PREXOR, &pending);
 	unsigned long flags;
 	dma_async_tx_callback callback;
@@ -813,7 +815,7 @@ ops_run_postxor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 	}
 	/* check whether this postxor is part of a write */
-	callback = test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending) ?
+	callback = test_bit(STRIPE_OP_BIODRAIN, &pending) ?
 		ops_complete_write : ops_complete_postxor;
 	/* 1/ if we prexor'd then the dest is reused as a source
@@ -901,12 +903,12 @@ static void raid5_run_ops(struct stripe_head *sh, unsigned long pending)
 		tx = ops_run_prexor(sh, tx);
 	if (test_bit(STRIPE_OP_BIODRAIN, &pending)) {
-		tx = ops_run_biodrain(sh, tx);
+		tx = ops_run_biodrain(sh, tx, pending);
 		overlap_clear++;
 	}
 	if (test_bit(STRIPE_OP_POSTXOR, &pending))
-		ops_run_postxor(sh, tx);
+		ops_run_postxor(sh, tx, pending);
 	if (test_bit(STRIPE_OP_CHECK, &pending))
 		ops_run_check(sh);

raid5: fix unending write sequence

From: Dan Williams <[EMAIL PROTECTED]>
---

 drivers/md/raid5.c |   16 +++++++++-------
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 496b9a3..e1a3942 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -693,7 +693,8 @@ ops_run_prexor(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 }

 static struct dma_async_tx_descriptor *
-ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
+ops_run_biodrain(struct stripe_head *sh, struct dm
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Dan Williams wrote:
> The following patch, also attached, cleans up cases where the code looks
> at sh->ops.pending when it should be looking at the consistent
> stack-based snapshot of the operations flags.

I tried this patch (against a stock 2.6.23), and it did not work for me. Not only did I/O to the affected RAID5 & XFS partition stop, but also I/O to all other disks. I was not able to capture any debugging information, but I should be able to do that tomorrow when I can hook a serial console to the machine.

I'm not sure if my problem is identical to these others, as mine only seems to manifest with RAID5+XFS. The RAID rebuilds with no problem, and I've not had any problems with RAID5+ext3.

> [...]
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Dan Williams wrote:
> On Tue, 2007-11-06 at 03:19 -0700, BERTRAND Joël wrote:
>> Done. Here is the obtained output:
>
> Much appreciated.
>
> [...]
>
> This looks as if the blocks were prepared to be written out, but were
> never handled in ops_run_biodrain(), so they remain locked forever. The
> operations flags are all clear, which means handle_stripe thinks nothing
> else needs to be done.
>
> The following patch, also attached, cleans up cases where the code looks
> at sh->ops.pending when it should be looking at the consistent
> stack-based snapshot of the operations flags.

Thanks for this patch. I have been testing it for three hours, rebuilding a 1.5 TB raid1 array over iSCSI without any trouble.

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
      1464725632 blocks [2/1] [U_]
      [=>...]  recovery =  6.7% (99484736/1464725632) finish=1450.9min speed=15679K/sec

Without your patch, I never reached 1%... I hope it fixes this bug, and I shall come back when my raid1 volume is resynchronized.

Regards,

JKB
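The resync progress Joël is watching in /proc/mdstat is also exposed through the md sysfs attributes listed earlier in the thread (sync_action, sync_completed). A small sketch of reading them, assuming the 2.6.23-era attribute layout; `md_progress` and the throwaway directory are illustrative, not part of any real tool — on a live system you would point it at /sys/block/mdX/md:

```shell
#!/bin/sh
# Sketch: compute resync/recovery progress from md sysfs attributes.
# md_progress is a hypothetical helper; the fake tree below lets it run
# without touching a real /sys.

md_progress() {
    dir=$1
    action=$(cat "$dir/sync_action")
    if [ "$action" = "idle" ]; then
        echo "idle"
        return 0
    fi
    # sync_completed reads like "198969472 / 2929451264" (sectors done / total)
    set -- $(cat "$dir/sync_completed")
    echo "$action: $(( $1 * 100 / $3 ))% ($1 / $3 sectors)"
}

# Demonstration against a fake sysfs directory:
fake=$(mktemp -d)/md
mkdir -p "$fake"
echo recovery > "$fake/sync_action"
echo "198969472 / 2929451264" > "$fake/sync_completed"
md_progress "$fake"   # prints: recovery: 6% (198969472 / 2929451264 sectors)
```

On a machine in the state described in this thread, comparing successive reads of sync_completed is a quick way to tell a stalled resync from a slow one.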
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Chuck Ebbert wrote:
> On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
>> [...]
>> My linux-2.6.23/drivers/md/raid5.c contains your patch for a long
>> time, but it doesn't fix this bug.
>
> Did that chunk starting with "clean-up completed biofill operations" end
> up where it belongs? The patch with the big context moves it to a
> different place from where the original one puts it when applied to
> 2.6.23... Lately I've seen several problems where the context isn't
> enough to make a patch apply properly when some offsets have changed.
> In some cases a patch won't apply at all because two nearly-identical
> areas are being changed and the first chunk gets applied where the
> second one should, leaving nowhere for the second chunk to apply.

I always apply this kind of patch by hand, never with the patch command. The last patch sent here seems to fix this bug:

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
      1464725632 blocks [2/1] [U_]
      [=>...]  recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec

Regards,

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
> Neil Brown wrote:
>> On Sunday November 4, [EMAIL PROTECTED] wrote:
>>> # ps auxww | grep D
>>> USER  PID %CPU %MEM  VSZ  RSS TTY  STAT START  TIME COMMAND
>>> root  273  0.0  0.0    0    0 ?   D    Oct21  14:40 [pdflush]
>>> root  274  0.0  0.0    0    0 ?   D    Oct21  13:00 [pdflush]
>>>
>>> After several days/weeks, this is the second time this has happened,
>>> while doing regular file I/O (decompressing a file), everything on
>>> the device went into D-state.
>>
>> At a guess (I haven't looked closely) I'd say it is the bug that was
>> meant to be fixed by
>>
>> commit 4ae3f847e49e3787eca91bced31f8fd328d50496
>>
>> except that patch applied badly and needed to be fixed with
>> the following patch (not in git yet).
>> These have been sent to stable@ and should be in the queue for 2.6.23.2
>
> My linux-2.6.23/drivers/md/raid5.c contains your patch for a long
> time:
>
> ...
>         spin_lock(&sh->lock);
>         clear_bit(STRIPE_HANDLE, &sh->state);
>         clear_bit(STRIPE_DELAYED, &sh->state);
>
>         s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
>         s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
>         s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
>         /* Now to look around and see what can be done */
>
>         /* clean-up completed biofill operations */
>         if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
>                 clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
>         }
>
>         rcu_read_lock();
>         for (i=disks; i--; ) {
>                 mdk_rdev_t *rdev;
>                 struct r5dev *dev = &sh->dev[i];
> ...
>
> but it doesn't fix this bug.

Did that chunk starting with "clean-up completed biofill operations" end up where it belongs? The patch with the big context moves it to a different place from where the original one puts it when applied to 2.6.23... Lately I've seen several problems where the context isn't enough to make a patch apply properly when some offsets have changed. In some cases a patch won't apply at all because two nearly-identical areas are being changed and the first chunk gets applied where the second one should, leaving nowhere for the second chunk to apply.
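One way to catch the misapplied-hunk failure described above is to dry-run the patch (watching for "offset"/"fuzz" messages) and then grep to confirm the moved chunk landed inside the intended function. A sketch against a throwaway file — the file name and contents are illustrative stand-ins, not the real raid5.c:

```shell
#!/bin/sh
# Sketch: verify where a hunk lands before trusting a fuzzy apply.
set -e
work=$(mktemp -d); cd "$work"

# Tiny stand-in source file:
cat > raid5.c <<'EOF'
static void handle_stripe5(void)
{
    old_line();
}
EOF

# Tiny stand-in patch adding a line inside that function:
cat > fix.patch <<'EOF'
--- a/raid5.c
+++ b/raid5.c
@@ -1,4 +1,5 @@
 static void handle_stripe5(void)
 {
+    /* clean-up completed biofill operations */
     old_line();
 }
EOF

# 1) dry run: reports "offset N lines" or "fuzz" without touching the tree
patch -p1 --dry-run < fix.patch

# 2) real apply, then confirm the chunk sits in the expected function
patch -p1 < fix.patch
grep -A2 'handle_stripe5' raid5.c | grep -q 'biofill'
echo "hunk applied in the expected place"
```

For a hunk that can land in two nearly identical spots, grepping with a few lines of surrounding context is the cheap way to check it chose the right one.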
Re: 2.6.23.1: mdadm/raid5 hung/d-state
BERTRAND Joël wrote:
> Chuck Ebbert wrote:
>> [...]
>
> I always apply this kind of patch by hand, never with the patch command.
> The last patch sent here seems to fix this bug:
>
> gershwin:[/usr/scripts] > cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md7 : active raid1 sdi1[2] md_d0p1[0]
>       1464725632 blocks [2/1] [U_]
>       [=>...]  recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec

Resync done. The patch fixes this bug.

Regards,

JKB
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, BERTRAND Joël wrote:
> [...]
>
> Resync done. The patch fixes this bug.
>
> Regards,
>
> JKB

Excellent! I cannot easily reproduce the bug on my system, so I will wait for the next stable patch set to include it and let everyone know if it happens again. Thanks.
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Jeff Lessem wrote:
> Dan Williams wrote:
>> The following patch, also attached, cleans up cases where the code looks
>> at sh->ops.pending when it should be looking at the consistent
>> stack-based snapshot of the operations flags.
>
> I tried this patch (against a stock 2.6.23), and it did not work for
> me. Not only did I/O to the affected RAID5 & XFS partition stop, but
> also I/O to all other disks. I was not able to capture any debugging
> information, but I should be able to do that tomorrow when I can hook
> a serial console to the machine.

That can't be good! This is worrisome, because Joël is giddy with joy because it fixes his iSCSI problems. I was going to try it with nbd, but perhaps I'll wait a week or so and see if others have more information. Applying patches before a holiday weekend is a good way to avoid time off. :-(

> I'm not sure if my problem is identical to these others, as mine only
> seems to manifest with RAID5+XFS. The RAID rebuilds with no problem,
> and I've not had any problems with RAID5+ext3.

Hopefully it's not the raid which is the issue.

-- 
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On 11/8/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
> Jeff Lessem wrote:
>> Dan Williams wrote:
>>> The following patch, also attached, cleans up cases where the code looks
>>> at sh->ops.pending when it should be looking at the consistent
>>> stack-based snapshot of the operations flags.
>>
>> I tried this patch (against a stock 2.6.23), and it did not work for
>> me. Not only did I/O to the effected RAID5 & XFS partition stop, but
>> also I/O to all other disks. I was not able to capture any debugging
>> information, but I should be able to do that tomorrow when I can hook
>> a serial console to the machine.
>
> That can't be good! This is worrisome because Joel is giddy with joy
> because it fixes his iSCSI problems. I was going to try it with nbd, but
> perhaps I'll wait a week or so and see if others have more information.
> Applying patches before a holiday weekend is a good way to avoid time
> off. :-(

We need to see more information on the failure that Jeff is seeing, and whether it goes away with the two known patches applied. He applied this most recent patch against stock 2.6.23, which means that the platform was still open to the first biofill flags issue.

--
Dan
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Jeff Lessem ([EMAIL PROTECTED]) wrote on 6 November 2007 22:00:
> Dan Williams wrote:
>> The following patch, also attached, cleans up cases where the code looks
>> at sh->ops.pending when it should be looking at the consistent
>> stack-based snapshot of the operations flags.
>
> I tried this patch (against a stock 2.6.23), and it did not work for
> me. Not only did I/O to the affected RAID5 & XFS partition stop, but
> also I/O to all other disks. I was not able to capture any debugging
> information, but I should be able to do that tomorrow when I can hook
> a serial console to the machine.
>
> I'm not sure if my problem is identical to these others, as mine only
> seems to manifest with RAID5+XFS. The RAID rebuilds with no problem,
> and I've not had any problems with RAID5+ext3.

Us too! We're stuck trying to build a disk server with several disks in a raid5 array, and the rsync from the old machine stops writing to the new filesystem. It only happens under heavy IO. We can make it lock without rsync, using 8 simultaneous dd's to the array. All IO stops, including the resync after a newly created raid or after an unclean reboot.

We could not trigger the problem with ext3 or reiser3; it only happens with xfs.
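The eight-writer reproduction described above is easy to script. A sketch, with placeholder names and deliberately tiny sizes — TARGET should point at a directory on the xfs-on-raid5 array, and the per-writer size there needs to be large enough (gigabytes) to force sustained writeback rather than just filling the page cache:

```shell
#!/bin/sh
# Sketch: N simultaneous dd writers hammering one filesystem, as in the
# reproduction above. TARGET, WRITERS, and MB_PER_WRITER are placeholders.
TARGET=${TARGET:-$(mktemp -d)}
WRITERS=8
MB_PER_WRITER=4        # use thousands of MB on a real array

for i in $(seq 1 $WRITERS); do
    # conv=fsync makes each writer flush to disk before exiting
    dd if=/dev/zero of="$TARGET/stress.$i" bs=1M count=$MB_PER_WRITER \
       conv=fsync 2>/dev/null &
done
wait
ls -l "$TARGET"
```

If the hang reproduces, the dd processes (and pdflush) land in D-state and `wait` never returns, matching the symptoms reported at the top of this thread.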
Re: 2.6.23.1: mdadm/raid5 hung/d-state
On Thu, 8 Nov 2007, Carlos Carvalho wrote:
> [...]
>
> Us too! We're stuck trying to build a disk server with several disks in
> a raid5 array, and the rsync from the old machine stops writing to the
> new filesystem. It only happens under heavy IO. We can make it lock
> without rsync, using 8 simultaneous dd's to the array. All IO stops,
> including the resync after a newly created raid or after an unclean
> reboot.
>
> We could not trigger the problem with ext3 or reiser3; it only happens
> with xfs.

Including the XFS mailing list as well. Can you provide more information to them?
Re: 2.6.23.1: mdadm/raid5 hung/d-state
Dan Williams wrote:
> On 11/8/07, Bill Davidsen <[EMAIL PROTECTED]> wrote:
>> [...]
>
> We need to see more information on the failure that Jeff is seeing,
> and whether it goes away with the two known patches applied. He
> applied this most recent patch against stock 2.6.23 which means that
> the platform was still open to the first biofill flags issue.

I applied both of the patches. The biofill one did not apply cleanly, as it was adding biofill to one section and removing it from another, but it appears that biofill does not need to be removed from a stock 2.6.23 kernel. The second patch applies with a slight offset, but no errors.

I can report success so far with both patches applied. I created an 1100GB RAID5, formatted it XFS, and successfully "tar c | tar x" 895GB of data onto it. I'm also in the process of rsync-ing the 895GB of data from the (slightly changed) original. In the past, I would always get a hang within 0-50GB of data transfer.

For each drive in the RAID I also:

echo 128 > /sys/block/"$i"/queue/max_sectors_kb
echo 512 > /sys/block/"$i"/queue/nr_requests
echo 1 > /sys/block/"$i"/device/queue_depth
blockdev --setra 65536 /dev/md3
echo 16384 > /sys/block/md3/md/stripe_cache_size

These changes appear to improve performance, along with a RAID5 chunk size of 1024k, but these changes alone (without the patches) do not fix the problem.
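Jeff's per-drive settings can be wrapped in a reusable loop. A sketch, where `tune_raid_members` is a hypothetical helper (not an existing tool) and the sysfs root is parameterized so the writes can be exercised against a fake directory tree before being pointed at the real /sys as root:

```shell
#!/bin/sh
# Sketch of the per-drive tuning above as a loop. The values and the md3
# device name are the ones from this thread, not universal recommendations.

tune_raid_members() {
    sys=$1; shift
    for d in "$@"; do
        q="$sys/block/$d/queue"
        [ -d "$q" ] || continue            # skip drives that are not present
        echo 128 > "$q/max_sectors_kb"
        echo 512 > "$q/nr_requests"
        echo 1   > "$sys/block/$d/device/queue_depth"
    done
    echo 16384 > "$sys/block/md3/md/stripe_cache_size"
    # plus, on the real system: blockdev --setra 65536 /dev/md3
}

# Dry run against a fake tree instead of /sys:
fake=$(mktemp -d)
for d in sdc sdd md3; do
    mkdir -p "$fake/block/$d/queue" "$fake/block/$d/device" "$fake/block/$d/md"
done
tune_raid_members "$fake" sdc sdd
cat "$fake/block/sdc/queue/nr_requests"    # prints: 512
```

Note that stripe_cache_size costs memory (stripes × 4 KiB × member count), so 16384 on a 10-disk array pins on the order of hundreds of MB.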
Re: 2.6.23.1: mdadm/raid5 hung/d-state (md3_raid5 stuck in endless loop?)
Time to reboot, before reboot:

top - 07:30:23 up 13 days, 13:33, 10 users,  load average: 16.00, 15.99, 14.96
Tasks: 221 total,   7 running, 209 sleeping,   0 stopped,   5 zombie
Cpu(s):  0.0%us, 25.5%sy,  0.0%ni, 74.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8039432k total,  1744356k used,  6295076k free,      164k buffers
Swap: 16787768k total,      160k used, 16787608k free,   616960k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  688 root      15  -5     0    0    0 R  100  0.0 121:21.43 md3_raid5
  273 root      20   0     0    0    0 D    0  0.0  14:40.68 pdflush
  274 root      20   0     0    0    0 D    0  0.0  13:00.93 pdflush

# cat /proc/fs/xfs/stat
extent_alloc 301974 256068291 310513 240764389
abt 1900173 15346352 738568 731314
blk_map 276979807 235589732 864002 211245834 591619 513439614 0
bmbt 50717 367726 14177 11846
dir 3818065 361561 359723 975628
trans 48452 2648064 570998
ig 6034530 2074424 43153 3960106 0 3869384 460831
log 282781 10454333 3028 399803 173488
push_ail 3267594 0 1620 2611 730365 0 4476 0 10269 0
xstrat 291940 0
rw 61423078 103732605
attr 0 0 0 0
icluster 312958 97323 419837
vnodes 90721 4019823 0 1926744 3929102 3929102 3929102 0
buf 14678900 11027087 3651843 25743 760449 0 0 15775888 280425
xpc 966925905920 1047628533165 1162276949815
debug 0

# cat /proc/meminfo
MemTotal:      8039432 kB
MemFree:       6287000 kB
Buffers:           164 kB
Cached:         617072 kB
SwapCached:          0 kB
Active:         178404 kB
Inactive:       589880 kB
SwapTotal:    16787768 kB
SwapFree:     16787608 kB
Dirty:          494280 kB
Writeback:       86004 kB
AnonPages:      151240 kB
Mapped:          17092 kB
Slab:           259696 kB
SReclaimable:   170876 kB
SUnreclaim:      88820 kB
PageTables:      11448 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  20807484 kB
Committed_AS:   353536 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     15468 kB
VmallocChunk: 34359722699 kB

# echo 3 > /proc/sys/vm/drop_caches
# cat /proc/meminfo
MemTotal:      8039432 kB
MemFree:       6418352 kB
Buffers:            32 kB
Cached:         597908 kB
SwapCached:          0 kB
Active:         172028 kB
Inactive:       579808 kB
SwapTotal:    16787768 kB
SwapFree:     16787608 kB
Dirty:          494312 kB
Writeback:       86004 kB
AnonPages:      154104 kB
Mapped:          17416 kB
Slab:           144072 kB
SReclaimable:    53100 kB
SUnreclaim:      90972 kB
PageTables:      11832 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:  20807484 kB
Committed_AS:   360748 kB
VmallocTotal: 34359738367 kB
VmallocUsed:     15468 kB
VmallocChunk: 34359722699 kB

Nothing is actually happening on the device itself, however:

Device:  rrqm/s  wrqm/s   r/s   w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdd        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sde        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdf        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdg        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdh        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdi        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdj        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdk        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdl        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
md0        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
md3        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
md2        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00
md1        0.00    0.00  0.00  0.00    0.00    0.00      0.00      0.00   0.00   0.00   0.00

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6 0160 6420244 32 60009200 221 22751 1 1 98 0
 6 0160 6420228 32 60012000 0 0 1015 142 0 25 75 0
 6 0160 6420228 32 60012000 0 0 1005 127 0 25 75 0
 6 0160 6420228 32 60012000 041 1022 151 0 26 74 0
 6 0160 6420228