Richard Freeman <r...@thefreemanclan.net> posted 4978dd5d.4040...@thefreemanclan.net, excerpted below, on Thu, 22 Jan 2009 15:55:57 -0500:
> I was running lvm2 on top of several raid-5 devices (that is, the
> raid-5 devices were the lvm2 physical volumes). [...] I was running
> ext3 on my lvs (and swap).
>
> The problem was that I was having some kind of glitch that was causing
> my computer to reset (I traced it to one of my drives), and when it
> happened the array would sometimes come up with one of the drives
> missing. If the glitch happened again while the array was degraded it
> could cause data loss (no worse than not having RAID at all).
>
> When I finally got the bad drive replaced (which generally fixed the
> resets), I rebuilt my arrays. At that point mdadm was happy with the
> state of affairs, but fsck was showing loads of errors on some of my
> filesystems. When I went ahead and let fsck do its job, I immediately
> started noticing corrupt files all over the place. The majority of
> the data volume was mpg files from mythtv and I'd find hour-long TV
> episodes where one minute of some other show would get spliced in.
> It seemed obvious that files were somehow getting cross-linked (I'm
> not intimately familiar with ext3, but I could see how this could
> happen in FAT). Oh - these errors were on a partition that WASN'T
> fsck'ed (in the command-line-utility sense of the word only, I
> suppose).
>
> I also started getting lots of errors on dmesg about attempts to seek
> past the end of the md devices. I did some googling and found that
> this had been seen by others - but it was obviously very rare.

You may well know much of this now, but it might benefit others thinking
about RAID...

I'd blame that on your choice of RAID (and ultimately on the defective
hardware, but it wouldn't have been as bad on RAID-1 or RAID-6), more
than on what was running on top of it.
As you mentioned, with RAID-5, if the array crashes while both dirty and
degraded (one drive down), there's ultimately no way to guarantee data
integrity: the parity checking normally available when the array is
intact isn't available while it's degraded, and the remaining data can't
be trusted either, because the array was dirty and could have crashed in
the middle of a write.

RAID-6 improves on this by computing two independent checksums, each
stored in its own block of every stripe. While that gives you one less
data spindle for the same total spindles, and requires at least four
spindles instead of RAID-5's three-spindle minimum, it's MUCH safer
because the redundancy is doubled. You can lose a second drive before
recovery of the first is complete without losing data integrity, and
there's much less chance of a random hardware error making it through to
the data, as well.

RAID-1, of course, doesn't have the checksumming and thus lacks the
guarantee of integrity against random hardware error, but since all
spindles are mirrors, you can drop to a single spindle without losing
data. For roughly comparable (not identical, as it's not checksummed)
security to RAID-6, you need a minimum of three spindles instead of
RAID-6's four, but you are likewise limited to the data capacity of a
single spindle, instead of the two you'd have with RAID-6's four.

What I'd guess happened is that the dirty/degraded crash happened while
the set of stripes that also held the LVM2 record was being written.
It wasn't necessarily the LVM data itself being written; it may just
have been something that happened to be in the same stripe set, so the
checksum covering it had to be rewritten as well. It's also possible the
hardware error you mentioned affected the reliability of what the
spindle returned even when it didn't cause resets. In that case, even if
the data was on a different stripe, the resulting checksum written could
end up invalid, playing havoc with a recovery.
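To make the single-parity limitation concrete, here's a minimal Python
sketch of RAID-5-style XOR parity (a toy model, not mdadm internals; the
block contents are made up). With all spindles present, parity can at
least flag a bad block, because the stripe no longer XORs to zero. Once
one spindle is missing, the parity is entirely consumed just rebuilding
it, so a silently corrupted block on a surviving spindle gets absorbed
into the rebuild with nothing left over to notice the damage:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# One stripe across a 4-spindle RAID-5: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Intact array: all blocks XOR to zero, so a flipped bit is detectable.
assert xor_blocks(data + [parity]) == b"\x00" * 4

# Degraded array: spindle 1 is gone.  The parity is spent rebuilding it,
# which works -- but only IF the surviving blocks are clean.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == b"BBBB"

# If a surviving block was silently corrupted (dirty crash, flaky
# drive), the rebuild just folds the garbage in, with no error raised.
bad_rebuild = xor_blocks([data[0], b"CCCX", parity])
assert bad_rebuild != b"BBBB"   # wrong data, reported as a success
```

RAID-6's second, independent checksum is exactly what restores the
leftover redundancy in that degraded case.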
In any case, what would ordinarily have been a simple problem with
likely one, or at most a relatively small handful of files, cascaded up
the stack. Once the LVM data was screwed, it started causing data writes
to invalid locations, and once that happens, God help you, because the
filesystem's not going to be able to!

The other thing that may play into it is what level of journaling you
were doing. data=journal may have saved your butt, while if you were
using data=writeback, you'd have had the same issues with ext3 that
early reiserfs was so infamous for, back when it was doing the same
data=writeback. data=ordered is the middle ground, and I believe it's
what ext3 has always defaulted to, and what reiserfs has defaulted to
for years.

But the thing is, if your hardware was returning invalid data, or once
LVM got screwed up if that was indeed the problem, I don't care what the
filesystem was, there was no way it could maintain a logically
consistent view of the data, and you were probably lucky to be able to
recover what you did.

> Fortunately all my most critical data is backed up weekly (only a day
> or two before the final crash), and I didn't care about the TV too
> much (I saved what I could and re-recorded anything that got truncated
> or wasn't watchable). I did find that some of my DVD backups of
> digital photos were unreadable, which has taught me a valuable lesson.
> Fortunately only some of the photos actually had errors in them, and
> most were successfully backed up.

Lucky, or more appropriately, wise of you! There aren't many folks who
back up to a normally offline external device that regularly. Honestly,
I don't.

> I'm no longer using lvm2. If I need to expand my RAID I can
> potentially reshape it (after backups where possible). I miss some of
> the flexibility, but when I need a few GB of scratch space to test out
> a filesystem upgrade or something I just use losetup - but I don't
> care about performance in these cases.

That makes sense.
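For anyone wanting to set one of those ext3 journaling modes explicitly
rather than taking the default, it's just a mount option; a sketch of
what the /etc/fstab line might look like (the device name and mount
point here are hypothetical, pick your own):

```
# example only: /dev/md3 and /video are made-up names
/dev/md3   /video   ext3   data=journal,noatime   0 2
```

The cost of data=journal, of course, is that all data gets written
twice, once to the journal and once to its final location, so it's a
throughput tradeoff, not a free lunch.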
> I would say that lvm2 is probably safe if you have more reliable
> hardware. My problem was that a failing drive not only made the drive
> inaccessible, but it took down the whole system (since hardware on a
> typical desktop isn't well-isolated). On a decent server a drive
> failure shouldn't cause errors that bring down the whole system. So, I
> didn't get the full benefit from RAID.

I believe you're correct.

Years ago, LVM was the only way to get partitions on RAID, since RAID
didn't allow its own partitioning. However, partitioned RAID has been
available since the beginning of the 2.6 kernel series, and while it
lacks a bit of LVM2's flexibility, it's arguably worth that tradeoff in
order to eliminate the LVM layer entirely, and with it, the additional
risk (however low it may be) and administrative overhead of running
major parts of a system on block devices that require userspace setup.
The same would apply to FUSE-based filesystems as well, at least to some
extent.

So... I guess that's something else I can add to my list now, for the
next time I set up a new disk set or whatever. To the
everything-portage-touches-on-root that I explained in the other
replies, and the RAID-6 that I had already chosen over RAID-5, I can now
add killing the LVM2 used in my current setup.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman