Hello, mingo and Tso,
Please help. I had exactly the same problem, i.e., one of my raid-5 array
disks crashed, and another one was marked bad. I was able to recover
from it, but it is not a perfect recovery: e2fsck is still not working. :-(
In message <[EMAIL PROTECTED]>
mingo wrote:
>> The strange thing is that after one disk failure (sdd1) another one is
>> reported faulty (sda1) without there being an error from the SCSI layer.
>
>this should not happen - but lets first see the startup messages, maybe
>sda1 was faulty already when the array started up?
I have exactly the same problem. My startup messages were as follows:
Sep 7 12:28:41 yfs kernel: running: <sdi1><sdg1><sdf1><sde1><sdd1><sdc1><sdb1><sda1>
Sep 7 12:28:41 yfs kernel: now!
Sep 7 12:28:41 yfs kernel: sdi1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sdg1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sdf1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sde1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sdd1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sdc1's event counter: 00000007
Sep 7 12:28:41 yfs kernel: sdb1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: sda1's event counter: 00000009
Sep 7 12:28:41 yfs kernel: md: superblock update time inconsistency -- using the most recent one
Sep 7 12:28:41 yfs kernel: freshest: sdi1
Sep 7 12:28:41 yfs kernel: md: kicking non-fresh sdc1 from array!
Sep 7 12:28:41 yfs kernel: unbind<sdc1,7>
Sep 7 12:28:41 yfs kernel: export_rdev(sdc1)
Sep 7 12:28:41 yfs kernel: md0: removing former faulty sdc1!
Note that /dev/sdh1 is the disk that actually failed in the above system;
it does not show up in the startup list at all.
>> With two disks down the RAID5 array is beyond automatic recovery.
>> I would like to suggest the following:
>>
>> - When an array is not recoverable shut it down automatically, in my case
>> that means after "skipping faulty sda1"
>
>I followed the policy we do on normal disks - ie. we report the IO error
>back to higher levels and let them decide. Ie. ext2fs might ignore the
>error, or panic, or remount read-only.
>
>> - For enhanced recovery make a tool that can make a superblock backup on
>> another disk (and restore it)
>
>so which superblock would you restore? The problem was not the lack of
>utilities - you can reload the wrong superblock just as easily as you can
>misconfigure /etc/raidtab. But i agree that we should work on avoiding
>situations like yours in the future.
While hacking on the kernel, I realized that my /dev/sdc1 has a good(?)
superblock, which in my case is the one with the oldest event counter.
The stock kernel uses the freshest event counter. I changed md.c to use
event=7, and the array came back. :-) Then I reverted to the standard
2.2.10-raid kernel.
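In case it helps anyone in the same situation, here is a tiny user-space
sketch of the idea behind the hack (the real change was a few lines in
md.c's superblock analysis; the struct and function names below are made
up for illustration and are not the kernel's). The stock rule treats the
device with the highest event counter as the reference, while the forced
rule trusts a hard-coded counter (7 in my case) so that sdc1 is not
kicked as non-fresh:

#include <stdio.h>

struct rdev_sketch {
        const char   *name;    /* e.g. "sdc1" */
        unsigned int  events;  /* event counter from the md superblock */
};

/* Stock behaviour: take the superblock with the highest event counter. */
static const struct rdev_sketch *
pick_freshest(const struct rdev_sketch *devs, int n)
{
        const struct rdev_sketch *best = &devs[0];
        int i;

        for (i = 1; i < n; i++)
                if (devs[i].events > best->events)
                        best = &devs[i];
        return best;
}

/* Hacked behaviour: trust a hard-coded event counter instead, so a
 * device that would normally be kicked as "non-fresh" wins. */
static const struct rdev_sketch *
pick_forced(const struct rdev_sketch *devs, int n, unsigned int forced)
{
        int i;

        for (i = 0; i < n; i++)
                if (devs[i].events == forced)
                        return &devs[i];
        return pick_freshest(devs, n);  /* fall back to the stock rule */
}

int main(void)
{
        /* Event counters as reported in the boot log above. */
        struct rdev_sketch devs[] = {
                { "sdi1", 9 }, { "sdg1", 9 }, { "sdf1", 9 }, { "sde1", 9 },
                { "sdd1", 9 }, { "sdc1", 7 }, { "sdb1", 9 }, { "sda1", 9 },
        };
        int n = (int)(sizeof(devs) / sizeof(devs[0]));

        printf("stock rule picks:   %s\n", pick_freshest(devs, n)->name);
        printf("forced event=7:     %s\n", pick_forced(devs, n, 7)->name);
        return 0;
}

Compiled with a plain cc, the first line prints sdi1 (which matches the
"freshest: sdi1" message above) and the second prints sdc1.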
Now my problem. I tried e2fsck. It ran successfully after a bunch of
complaints. I ran e2fsck again to make sure the recovered data was OK.
However, it still complains with a screen full of lines. I ran it again
and got exactly the same complaints. I ran it after rebooting - same.
I tried once more - same complaints. I am appending the output of
e2fsck on my /dev/md0.
Fortunately, I am copying my 30 GB of data from the partially recovered
array to the older raid-0 server. I estimate the copy will take about
eight hours in total, so it should finish about six hours from now.
It looks like I can recover at least about 99% of the data. :-)
--
sysuh
PS: If the output of dumpe2fs and/or "mkraid --debug" would be useful,
please write me. I will try to run them after my copy finishes.
I will probably rebuild the array tomorrow morning because
many people are waiting for the disk space.
System information:
yfs:~$ df
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/hda2 256590 70736 172601 29% /
/dev/hda3 256590 42191 201146 17% /var
/dev/hda5 2035606 600143 1330239 31% /usr
/dev/hda6 9112340 12688 8626941 0% /home
/dev/hdb1 8119744 513004 7194276 7% /b0
/dev/md0 139974592 31958420 100905876 24% /u4
yfs:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [translucent]
read_ahead 1024 sectors
md0 : active raid5 sdi1[8] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
142205952 blocks level 5, 128k chunk, algorithm 2 [9/8] [UUUUUUU_U]
unused devices: <none>
yfs:/home/sysuh# /sbin/e2fsck -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
/dev/md0: clean, 1539/17776640 files, 8547445/35551488 blocks
yfs:/home/sysuh# /sbin/e2fsck -f -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 135841: 47 47 47
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
File /lost+found/#135841 (inode #135841, mod time Sat Sep 6 11:07:10 2014)
has 3 duplicate block(s), shared with 1 file(s):
<filesystem metadata>
Clone duplicate/bad blocks? yes
clone_file: Ext2 directory block not found returned from clone_file_block
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 1539/17776640 files (12.9% non-contiguous), 8547445/35551488 blocks
### I am running e2fsck once again here ###
yfs:/home/sysuh# /sbin/e2fsck -f -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 135841: 47 47 47
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
File /lost+found/#135841 (inode #135841, mod time Sat Sep 6 11:07:10 2014)
has 3 duplicate block(s), shared with 1 file(s):
<filesystem metadata>
Clone duplicate/bad blocks? yes
clone_file: Ext2 directory block not found returned from clone_file_block
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 1539/17776640 files (12.9% non-contiguous), 8547445/35551488 blocks