I know I'm resurrecting an old thread here, but I just saw a post on Planet CentOS that seems to have some info on fixing the "mismatch_cnt is not 0" error. Take a look at this blog post, where the author suggests some md actions that can be taken to clear these errors: http://www.arrfab.net/blog/?p=199
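In case the blog is unreachable, here is a rough sketch of the kind of md action I mean. I'm not quoting the post's exact commands. This only uses the kernel's md sysfs files (sync_action and mismatch_cnt, in the same /sys/block/md0/md/ directory Ben looked at). The function names and the optional directory argument are my own invention, so treat them as hypothetical helpers, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helpers (the names are mine, not from the blog post).
# The md directory defaults to md0's sysfs directory; pass another path
# to target a different array, or a scratch directory when testing.

# Start or stop a scrub. "check" only counts mismatches, "repair" also
# rewrites mismatched blocks so the members agree, "idle" stops a scrub.
md_sync_action() {
    action=$1
    md_dir=${2:-/sys/block/md0/md}
    case "$action" in
        check|repair|idle) ;;
        *) echo "unsupported action: $action" >&2; return 2 ;;
    esac
    printf '%s\n' "$action" > "$md_dir/sync_action"
}

# Print the mismatch count left by the most recent check/repair pass.
md_mismatch_cnt() {
    cat "${1:-/sys/block/md0/md}/mismatch_cnt"
}
```

As root, the sequence would be roughly: run md_sync_action repair, wait for /proc/mdstat to show the pass has finished, run md_sync_action check, wait again, then read md_mismatch_cnt. Keep in mind that on a RAID1 a repair only makes the copies agree; md has no way of knowing which member held the "right" data.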
-Shawn

On Sun, Nov 1, 2009 at 10:00 PM, Ben Scott <[email protected]> wrote:
> CentOS 5.4. Running kernel is 2.6.18-92.1.22.el5. The system has
> two disks, each with two partitions, making up two md mirror devices.
> md0 is ~ 509 MB and holds /boot; md1 is ~ 69 GB (the rest of the disk)
> and holds an LVM PE. The following arrived in my mailbox today:
>
> On Sun, Nov 1, 2009 at 4:22 AM, Cron Daemon <[email protected]>
> wrote:
> > /etc/cron.weekly/99-raid-check:
> >
> > WARNING: mismatch_cnt is not 0 on /dev/md0
>
> Investigation finds:
>
> /proc/mdstat reports everything is peachy for both mirrors. "[2/2] [UU]"
>
> Under /sys/block/md0/md/ I find the following:
>
> array_state: clean
> mismatch_cnt: 256
> rd{0,1}/errors: 0
> rd{0,1}/state: in_sync
>
> Google finds lots of people reporting similar, but nothing
> conclusive or particularly pertinent to this situation. Lots of
> people saying that swap can cause this (because swap can commit a
> block to one member, then learn it won't ever re-read that block, and
> so won't bother committing the other member), but this is the /boot
> filesystem, not swap. (swap is in an LV; the md device backing that
> LVM's sole PE reports a mismatch_cnt of zero.)
>
> I did find some people saying this started happening after CentOS
> 5.3 -> 5.4. I did do that recently. One person said the "raid-check"
> was added in 5.4. So I presume this mismatch_cnt might have been
> non-zero for ages, and I just never knew to look before now.
> mdmonitor has been running, but it mainly reports if a RAID member
> goes offline, and as noted, md is reporting all's quiet on the western
> front.
>
> I tried dismounting the /boot filesystem and running some tests.
> (Since it's a separate partition and md device, and outside of LVM, I
> can poke at it without taking the system down.)
>
> "e2fsck -f -n" says /dev/md0 is okay.
>
> I tried stopping the RAID device with "mdadm --stop /dev/md0", then
> sync'ing disks. Then I ran "cmp /dev/sda1 /dev/sdb1". The result:
>
> /dev/sda1 /dev/sdb1 differ: byte 331875867, line 215880
>
> So the two mirror members are **NOT** identical. That's usually bad.
>
> Running "e2fsck -f -n" on each member says no trouble found. That
> implies whatever the mismatch is, it is not in filesystem metadata.
>
> Running a "badblocks" read-only test on each member says no read errors.
>
> mdadm says the MD superblocks are okay, and comparing the two finds
> most things are the same -- only the checksum and device relationships
> differ (expected).
>
> One nice thing about simple mirrors is that you can mount the
> members read-only and examine the contents without breaking the mirror
> set. So:
>
> liberty$ sudo mount -o ro -t ext2 /dev/sda1 /mnt/sda1
> liberty$ sudo mount -o ro -t ext2 /dev/sdb1 /mnt/sdb1
> liberty$ sudo diff -r sda1 sdb1
> Binary files sda1/grub/stage2 and sdb1/grub/stage2 differ
> liberty$
>
> (You have to mount as ext2 because ext3 will replay a journal even
> if you said "read-only".)
>
> It may be normal for the GRUB stage2 to differ in this
> configuration. There may be device numbers encoded into them. GRUB
> was installed on each disk separately, by booting from floppy, so that
> would do it. Or it could be one disk has an undetected bad block and
> the boot loader on that disk is shot.
>
> No other differences detected in file data, though. So between fsck
> and diff, it looks like most of the contents are intact. Maybe all of
> them.
>
> I'm unsure as to how to proceed.
>
> The general procedure for repairing a broken mirror is to resync
> from the good member, assuming you can determine which is good. My
> problem is, I'm not sure which is the good member, or even if there
> *is* a good member: If GRUB writes different device numbers into the
> boot stage files, the two disks necessarily won't match. Which, come
> to think of it, is probably something to worry about, since a legit
> mirror resync will scrogg that.
>
> "smartctl -a" reveals something that may be relevant. sda reports
> several non-zero values in the "Error counter log" section. No
> uncorrectable errors, but ECC has been used. At the same time, sdb
> reports all zeros for those same values. Further, the counts for sda
> have increased since the disks were installed. (I saved the output of
> "smartctl -a" back then. Now you see why.) Now, ECC usage is not an
> automatic cause for alarm on a modern hard disk, but the fact that sda
> is non-zero and increasing while sdb is zero and flat suggests sdb is
> in better overall health. However, this probably has nothing to do
> with the mirror mismatch, since both disks report zero *uncorrectable*
> errors. Uncorrectable media defects would certainly cause a mirror
> mismatch, but the drives think they've been able to handle everything
> so far.
>
> There are newer kernels available; the system hasn't been rebooted
> in 251 days. But I'm somewhat loath to try rebooting with /boot in a
> suspect state.
>
> The thing I find really confusing is why "mismatch_cnt" can be
> non-zero while the rest of the in-kernel md monitoring stuff reports
> everything is good.
>
> Anyone here have suggestions, ideas, knowledge, or even wild schemes?
>
> -- Ben
> _______________________________________________
> gnhlug-discuss mailing list
> [email protected]
> http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
