Hi,

I've been torture testing my raid setup and encountered one of those "bug in
file md.c" errors. The md device still came up and ran properly, so I
thought everything was okay. But then I rebooted and I could no longer
raidstart the array. I suspect the drive listing in the superblock got
messed up somehow. Here is how it happened:

I've got two drives, sda and sdb, which together hold several raid1 md
devices. I got the system running (without any filesystems mounted),
raidstarted md0, and then killed power to the sda drive. This caused the
raid driver to mark sda as bad in the raid superblock.
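
For reference, the md0 stanza in my /etc/raidtab looks roughly like the
following (quoting from memory, so treat it as a sketch; md1 and md2 have
matching stanzas on other partitions, and the chunk-size lines up with the
CS:131072 value in the superblock dump below):

    raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        persistent-superblock   1
        chunk-size              128
        device                  /dev/sda6
        raid-disk               0
        device                  /dev/sdb6
        raid-disk               1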

I then rebooted with the sda drive powered off, which I can do because each
drive has its own working MBR plus kernel and initrd images. At this point,
the drive that was sdb when I created the array shows up as sda, and there
is no sdb at all. I wanted to see whether the raid driver would detect this
properly. When I ran "raidstart /dev/md0", it reported "device name has
changed from sdb6 to sda6 since last import!". Cool, this is what needed to
happen! Then came a "bug in file md.c, line 1321" error.

Despite the "bug in file" warning, the driver kept going and reported that
the md0 device was up and running in degraded mode. I then mounted the
filesystem read-only and poked around to verify that the md0 device really
was up.

Then I rebooted with both drives working and could not "raidstart" the md0
device. I tried rebooting with only the one drive working, as before, and
had the same problem.

Included are: (1) the /proc/mdstat output showing md0 in degraded mode
while sda and sdb are in their usual positions, (2) a cut-and-paste of the
system console from running "raidstart md0" when sda was the second disk
and there was no sdb, (3) raidstart failing with both drives working, and
(4) raidstart failing with only one drive working. I'm running
raid199811110-2.0.35.

What I need to do is restore a superblock that will force the driver to
view sdb (the second SCSI disk when both are in) as the up-to-date copy and
have it resync sda from it. I'll look into using "mkraid --force-resync" or
mucking with the superblock myself.
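
Before deciding between those two, I'll pull the raw superblocks off both
partitions and compare them by hand. Something like this should do it,
assuming the "sb offset: 208704" the kernel prints is in 1K blocks and that
the 0.90 superblock itself is 4K (both assumptions I still need to verify):

    dd if=/dev/sda6 bs=1k skip=208704 count=4 | od -A d -t x4 > /tmp/sb-sda6.txt
    dd if=/dev/sdb6 bs=1k skip=208704 count=4 | od -A d -t x4 > /tmp/sb-sdb6.txt

That should show me which descriptor entries differ before I overwrite
anything.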

In the future, I think I'm going to keep superblock backups: three for each
device, covering both in sync, sda current with sdb out of date, and sdb
current with sda out of date. That way I can restore the superblocks and
force a sync in whichever direction I want.
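
The backups themselves should just be a dd in each direction; a rough
sketch, with the same untested assumptions about the superblock offset and
size as above:

    # save a copy of md0's superblock from sda6
    dd if=/dev/sda6 of=/root/md0-sda6.sb bs=1k skip=208704 count=4

    # put that copy back later (only after double-checking the offset
    # against the kernel's "sb offset" message for that partition)
    dd if=/root/md0-sda6.sb of=/dev/sda6 bs=1k seek=208704 count=4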

I'll keep you all posted on the solution to this nasty problem.

 - David Harris
   Principal Engineer, DRH Internet Services


First attachment:

bash# cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 128 sectors
md2 : active raid1 sdb9[1] sda9[0] 0 blocks [2/2] [UU]
md1 : active raid1 sdb7[1] sda7[0] 0 blocks [2/2] [UU]
md0 : active raid1 sdb6[1] sda6[0](F) 0 blocks [2/1] [_U]


Second attachment:

bash# raidstart /dev/md0
(read) sda6's sb offset: 208704
bind<sda6,1>
md: sdb6 has zero size, marking faulty!
bind<sdb6,2>
md: auto-running md0.
md: device name has changed from sdb6 to sda6 since last import!
md: bug in file md.c, line 1321

       **********************************
       * <COMPLETE RAID STATE PRINTOUT> *
       **********************************
md0: <sdb6><sda6> array superblock:
  SB: (V:0.90.0) ID:<3ee3a861.d204f956.d7e1be0f.fc18e0f8> CT:36673d63
     L1 S00208640 ND:2 RD:2 md0 LO:0 CS:131072
     UT:366de58e ST:1 AD:1 WD:1 FD:1 SD:0 CSUM:fae8e279
     D  0:  DISK<N:0,sda6(8,6),R:0,S:1>
     D  1:  DISK<N:1,sda6(8,6),R:1,S:6>
     D  2:  DISK<N:2,[dev 00:00](0,0),R:2,S:9>
     D  3:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  4:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  5:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  6:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  7:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  8:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  9:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D 10:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D 11:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     THIS:  DISK<N:1,sdb6(8,22),R:1,S:6>
 rdev sdb6: O:[dev 00:00], SZ:00000000 F:1 DN:-1 no rdev superblock!
 rdev sda6: O:sdb6, SZ:00000000 F:0 DN:1 rdev superblock:
  SB: (V:0.90.0) ID:<3ee3a861.d204f956.d7e1be0f.fc18e0f8> CT:36673d63
     L1 S00208640 ND:2 RD:2 md0 LO:0 CS:131072
     UT:366de58e ST:1 AD:1 WD:1 FD:1 SD:0 CSUM:fae8e279
     D  0:  DISK<N:0,sda6(8,6),R:0,S:1>
     D  1:  DISK<N:1,sdb6(8,22),R:1,S:6>
     D  2:  DISK<N:2,[dev 00:00](0,0),R:2,S:9>
     D  3:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  4:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  5:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  6:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  7:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  8:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D  9:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D 10:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     D 11:  DISK<N:0,[dev 00:00](0,0),R:0,S:0>
     THIS:  DISK<N:1,sda6(8,6),R:1,S:6>
       **********************************

md: dropping descriptor-less faulty sdb6
unbind<sdb6,1>
export_rdev(sdb6)
raid1: device sda6 operational as mirror 1
raid1: md0, not all disks are operational -- trying to recover array
raid1: raid set md0 active with 1 out of 2 mirrors
md: updating md0 RAID superblock on device
sda6(write) sda6's sb offset: 208704
md: recovery thread got woken up ...
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
.
bash# cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 128 sectors
md0 : active raid1 sda6[0] 0 blocks [2/1] [_U]
unused devices: <none>
bash# mkdir /mnt/root
bash# mount /dev/md0 /mnt/root -o ro
bash# ls /mnt/root
bin         dev         lib         mnt1        sbin
boot1       etc         lost+found  mnt2        tmp
boot2       home        misc        proc        usr
bru         initrd      mnt         root        var
bash#
bash# cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 128 sectors
md0 : active raid1 sda6[0] 0 blocks [2/1] [_U]
unused devices: <none>
bash#
bash#


Third attachment:

bash# raidstart /dev/md0
(read) sda6's sb offset: 208704
bind<sda6,1>
(read) sdb6's sb offset: 208704
bind<sdb6,2>
md: auto-running md0.
md: superblock update time inconsistency -- using the most recent one
md: marking non-fresh sda6 faulty!
md: device name has changed from sda6 to sdb6 since last import!
md: marking faulty sdb6 descriptor's rdev faulty too!
md: former device sda6 is unavailable, removing from array!
md: dropping descriptor-less faulty sda6
unbind<sda6,1>
export_rdev(sda6)
md: md0: raid array is not clean -- starting background reconstruction
raid1: disabled mirror sdb6 (errors detected)
raid1: no operational mirrors for md0
pers->run() failed ...
md: auto-running md0 FAILED (error -22).
unbind<sdb6,0>
export_rdev(sdb6)
md0 stopped.
autostart sda6 failed!
/dev/md0: Invalid argument
bash# cat /proc/mdstat
Personalities : [raid1] [raid5]
read_ahead 262144 sectors
unused devices: <none>
bash#


Fourth attachment:

bash# raidstart /dev/md0
(read) sda6's sb offset: 208704
bind<sda6,1>
md: auto-running md0.
md: marking faulty sda6 descriptor's rdev faulty too!
md: former device sda6 is unavailable, removing from array!
md: md0: raid array is not clean -- starting background reconstruction
raid1: disabled mirror sda6 (errors detected)
raid1: no operational mirrors for md0
pers->run() failed ...
md: auto-running md0 FAILED (error -22).
unbind<sda6,0>
export_rdev(sda6)
md0 stopped.
autostart sda6 failed!
/dev/md0: Invalid argument
bash#


