Hello, mingo and Tso,
Please help.  I had exactly the same problem, i.e., one of my raid-5 array
disks crashed, and another one was marked bad.  I was able to recover
from it, but the recovery is not perfect: e2fsck is not working. :-(

In message <[EMAIL PROTECTED]>
mingo wrote:
>> The strange thing is that after one disk failure (sdd1) another one is 
>> reported faulty (sda1) without there being an error from the SCSI layer.
>
>this should not happen - but lets first see the startup messages, maybe
>sda1 was faulty already when the array started up?

I have exactly the same problem.  My startup messages were as follows:

  Sep  7 12:28:41 yfs kernel: running: <sdi1><sdg1><sdf1><sde1><sdd1><sdc1><sdb1><sda1>
  Sep  7 12:28:41 yfs kernel: now!
  Sep  7 12:28:41 yfs kernel: sdi1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sdg1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sdf1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sde1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sdd1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sdc1's event counter: 00000007
  Sep  7 12:28:41 yfs kernel: sdb1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: sda1's event counter: 00000009
  Sep  7 12:28:41 yfs kernel: md: superblock update time inconsistency -- using the most recent one
  Sep  7 12:28:41 yfs kernel: freshest: sdi1
  Sep  7 12:28:41 yfs kernel: md: kicking non-fresh sdc1 from array!
  Sep  7 12:28:41 yfs kernel: unbind<sdc1,7>
  Sep  7 12:28:41 yfs kernel: export_rdev(sdc1)
  Sep  7 12:28:41 yfs kernel: md0: removing former faulty sdc1!

Note that /dev/sdh1 is the failed disk in the above system.

>> With two disks down the RAID5 array is beyond automatic recovery.
>> I would like to suggest the following:
>> 
>> - When an array is not recoverable shut it down automatically, in my case
>>   that means after "skipping faulty sda1"
>
>I followed the policy we do on normal disks - ie. we report the IO error
>back to higher levels and let them decide. Ie. ext2fs might ignore the
>error, or panic, or remount read-only.
>
>> - For enhanced recovery make a tool that can make a superblock backup on 
>>   another disk (and restore it)
>
>so which superblock would you restore? The problem was not the lack of
>utilities - you can reload the wrong superblock just as easily as you can
>misconfigure /etc/raidtab. But i agree that we should work on avoiding
>situations like yours in the future.

During the kernel hacking, I realized that my /dev/sdc1 has a good(?)
superblock, which in my case is the one with the oldest event counter.
The stock kernel uses the freshest event counter.  I changed md.c to
use event=7, and the array came back. :-)  Then I reverted to the
standard 2.2.10-raid kernel.
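
Roughly speaking, the change amounts to the following (a standalone
sketch only, not the real md.c code -- the struct and helper names
below are made up): instead of trusting the disk with the highest
event counter, I forced assembly around the superblock whose event
counter is 7.

/* Sketch of the selection change described above (hypothetical names,
 * not a patch against md.c). */
#include <stdio.h>

struct sb_info {                 /* per-disk superblock summary */
        const char *dev;
        unsigned int events;
};

/* stock behaviour: keep the superblock with the highest event counter */
static const struct sb_info *pick_freshest(const struct sb_info *sb, int n)
{
        const struct sb_info *best = &sb[0];
        for (int i = 1; i < n; i++)
                if (sb[i].events > best->events)
                        best = &sb[i];
        return best;
}

/* my temporary hack: force assembly around a known-good event count */
static const struct sb_info *pick_by_event(const struct sb_info *sb, int n,
                                           unsigned int wanted)
{
        for (int i = 0; i < n; i++)
                if (sb[i].events == wanted)
                        return &sb[i];
        return NULL;
}

int main(void)
{
        struct sb_info disks[] = {
                { "sdi1", 9 }, { "sdg1", 9 }, { "sdf1", 9 }, { "sde1", 9 },
                { "sdd1", 9 }, { "sdc1", 7 }, { "sdb1", 9 }, { "sda1", 9 },
        };
        int n = sizeof(disks) / sizeof(disks[0]);

        printf("stock pick : %s\n", pick_freshest(disks, n)->dev);
        printf("forced pick: %s\n", pick_by_event(disks, n, 7)->dev);
        return 0;
}

Obviously the hard-coded event=7 is a one-off hack, good only for
getting the array assembled once.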

Now my problem.  I tried e2fsck.  It ran successfully after a bunch of
complaints.  I ran e2fsck again to make sure the recovered data is OK,
but it produced a screen full of complaints.  I ran it again and got
exactly the same complaints.  I ran it after rebooting - same.  I tried
it once more - same complaints.  I am appending the output of e2fsck
on my /dev/md0.

Fortunately, I am copying my 30 GB of data from the partially recovered
array to the older raid-0 server.  I estimate the copy will take about
eight hours, finishing roughly six hours from now.  It looks like I can
recover at least about 99% of the data. :-)
--
sysuh

PS: If the output of dumpe2fs and/or "mkraid --debug" would be useful,
    please write me.  I will try to run them after my copy finishes.
    I will probably rebuild the array tomorrow morning because
    many people are waiting for the disk space.

System information:

yfs:~$ df
Filesystem         1024-blocks  Used Available Capacity Mounted on
/dev/hda2             256590   70736   172601     29%   /
/dev/hda3             256590   42191   201146     17%   /var
/dev/hda5            2035606  600143  1330239     31%   /usr
/dev/hda6            9112340   12688  8626941      0%   /home
/dev/hdb1            8119744  513004  7194276      7%   /b0
/dev/md0             139974592 31958420  100905876     24%   /u4

yfs:~$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [translucent]
read_ahead 1024 sectors
md0 : active raid5 sdi1[8] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0] 
142205952 blocks level 5, 128k chunk, algorithm 2 [9/8] [UUUUUUU_U]
unused devices: <none>

yfs:/home/sysuh# /sbin/e2fsck -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
/dev/md0: clean, 1539/17776640 files, 8547445/35551488 blocks
 
yfs:/home/sysuh# /sbin/e2fsck -f -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 135841: 47 47 47
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
 
File /lost+found/#135841 (inode #135841, mod time Sat Sep  6 11:07:10 2014)
  has 3 duplicate block(s), shared with 1 file(s):
        <filesystem metadata>
Clone duplicate/bad blocks? yes              
 
clone_file: Ext2 directory block not found returned from clone_file_block
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
 
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 1539/17776640 files (12.9% non-contiguous), 8547445/35551488 blocks
 
### I am running e2fsck once again here ###
yfs:/home/sysuh# /sbin/e2fsck -f -y /dev/md0
e2fsck 1.15, 18-Jul-1999 for EXT2 FS 0.5b, 95/08/09
Pass 1: Checking inodes, blocks, and sizes
Duplicate blocks found... invoking duplicate block passes.
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 135841: 47 47 47
Pass 1C: Scan directories for inodes with dup blocks.
Pass 1D: Reconciling duplicate blocks
(There are 1 inodes containing duplicate/bad blocks.)
 
File /lost+found/#135841 (inode #135841, mod time Sat Sep  6 11:07:10 2014)
  has 3 duplicate block(s), shared with 1 file(s):                                
        <filesystem metadata>
Clone duplicate/bad blocks? yes
 
clone_file: Ext2 directory block not found returned from clone_file_block
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
 
/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 1539/17776640 files (12.9% non-contiguous), 8547445/35551488 blocks     
