Richard Freeman <r...@thefreemanclan.net> posted
4978dd5d.4040...@thefreemanclan.net, excerpted below, on  Thu, 22 Jan 2009
15:55:57 -0500:

> I was running lvm2 on top of several raid-5 devices (that is, the raid-5
> devices were the lvm2 physical volumes). [...] I was running ext3 on
> my lvs (and swap).
> 
> The problem was that I was having some kind of glitch that was causing
> my computer to reset (I traced it to one of my drives), and when it
> happened the array would sometimes come up with one of the drives
> missing.  If the glitch happened again while the array was degraded it
> could cause data loss (no worse than not having RAID at all).
> 
> When I finally got the bad drive replaced (which generally fixed the
> resets), I rebuilt my arrays.  At that point mdadm was happy with the
> state of affairs, but fsck was showing loads of errors on some of my
> filesystems.  When I went ahead and let fsck do its job, I immediately
> started noticing corrupt files all over the place.  The majority of the
> data volume was mpg files from mythtv and I'd find hour-long TV episodes
> where one minute of some other show would get spliced in.  It seemed
> obvious that files were somehow getting cross-linked (I'm not intimately
> familiar with ext3, but I could see how this could happen in FAT).  Oh -
> these errors were on a partition that WASN'T fsck'ed (in the
> command-line-utility sense of the word only I suppose).
> 
> I also started getting lots of errors on dmesg about attempts to seek
> past the end of the md devices.  I did some googling and found that this
> had been seen by others - but it was obviously very rare.

You may well know much of this now, but it might benefit others thinking 
about RAID...

I'd blame that on your choice of RAID (and ultimately on the defective 
hardware, but it wouldn't have been as bad on RAID-1 or RAID-6), more 
than on what was running on top of it.  

As you mentioned, with RAID-5, if the array crashes while both dirty 
and degraded (one drive down), there's ultimately no way to guarantee 
data integrity: the checksumming normally available when the array is 
whole isn't there while it's degraded, and the remaining data can't be 
trusted either, because the array was dirty and could have been in the 
middle of a write when it crashed.
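
For anyone wanting to keep an eye on that, the kernel and mdadm will 
tell you whether an array is currently whole or degraded.  Something 
like this (the array name /dev/md0 is just an example, of course):

  # Quick overview of all md arrays: an underscore in the [UU_]
  # status pattern means a member is missing, i.e. degraded.
  cat /proc/mdstat

  # More detail on a single array, including whether it's clean or
  # degraded and which member dropped out.
  mdadm --detail /dev/md0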

RAID-6 improves on this by keeping two independently calculated 
checksum blocks in every stripe.  While that gives you one less data 
spindle for the same total spindles and requires at least four spindles 
instead of RAID-5's three-spindle minimum, it's MUCH safer as the 
redundancy is doubled.  You can lose a second drive before the recovery 
of the first is complete without losing data integrity, and there's 
much less chance of a random hardware error making it thru to the data, 
as well.
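
For reference, creating a four-spindle RAID-6 with mdadm is just a 
matter of something like the below (device names are only examples, 
substitute your own):

  # Four members, two checksum blocks per stripe: usable space of
  # two spindles, and any two drives can fail without data loss.
  mdadm --create /dev/md0 --level=6 --raid-devices=4 \
        /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1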

RAID-1 of course doesn't have the checksumming and thus lacks that 
guarantee of data integrity against random hardware error, but since 
all spindles are mirrors, you can drop to a single spindle without 
losing data.  For security roughly comparable to RAID-6 (not identical, 
since it's not checksummed), you need a minimum of three spindles 
instead of RAID-6's four, but you're likewise limited to the data space 
of a single spindle, instead of the two you'd get from RAID-6's four.
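
The equivalent three-spindle mirror would look something like this 
(again, example device names):

  # Three-way mirror: usable space of one spindle, and any two of
  # the three drives can die without losing the data.
  mdadm --create /dev/md1 --level=1 --raid-devices=3 \
        /dev/sda2 /dev/sdb2 /dev/sdc2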

What I'd guess happened is that the dirty/degraded crash hit while the 
set of stripes that also held the LVM2 record was being written.  It 
wasn't necessarily the LVM data itself being written; it may just have 
been something else that happened to share the same stripe set, so the 
checksum covering it had to be rewritten as well.  It's also possible 
the hardware error you mentioned was affecting the reliability of what 
the spindle returned even when it didn't cause resets.  In that case, 
even if the data was on a different stripe, the resulting checksum 
written could end up invalid, thus playing havoc with a recovery.

In any case, what would have ordinarily been a simple problem with likely 
one, or at most a relatively small handful of files, cascaded up the 
stack.  Once the LVM data was screwed, it started causing data writes to 
invalid locations, and once that happens, God help you because the 
filesystem's not going to be able to!
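
As an aside, for anyone staying with LVM2, the volume group metadata 
can be backed up and restored separately from the data, which at least 
gives you a fighting chance if it's only the LVM record that gets 
scrambled.  Roughly like this (vg0 is just an example volume group 
name):

  # Save the volume group metadata.  LVM also keeps automatic
  # copies under /etc/lvm/backup and /etc/lvm/archive.
  vgcfgbackup vg0

  # Restore metadata from a saved copy.  This only rewrites the
  # LVM record itself; it does NOT recover the data inside the LVs.
  vgcfgrestore -f /etc/lvm/backup/vg0 vg0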

The other thing that may have played into it is what level of 
journaling you were doing.  data=journal may have saved your butt, 
while with data=writeback you'd have had the same issues with ext3 that 
early reiserfs was so infamous for, back when it was doing the same 
data=writeback.  data=ordered is the middle ground, and I believe it's 
what ext3 has always defaulted to and what reiserfs has defaulted to 
for years.  But the thing is, if your hardware was returning invalid 
data, or once LVM got screwed up (if that was indeed the problem), I 
don't care what the filesystem was: there was no way it could maintain 
a logically consistent view of the data, and you were probably lucky to 
be able to recover what you did.
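
Those modes are selected per-mount, so it's just an fstab (or mount -o) 
option.  A sketch, with example devices and mountpoints:

  # /etc/fstab examples -- data=journal is slowest but safest,
  # data=ordered is the ext3 default, data=writeback is fastest but
  # can expose stale data in recently-written files after a crash.
  /dev/md2   /home   ext3   noatime,data=journal   0 2
  /dev/md3   /var    ext3   noatime,data=ordered   0 2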

> Fortunately all my most critical data is backed up weekly (only a day or
> two before the final crash), and I didn't care about the TV too much (I
> saved what I could and re-recorded anything that got truncated or wasn't
> watchable).  I did find that some of my DVD backups of digital photos
> were unreadable which has taught me a valuable lesson.  Fortunately only
> some of the photos actually had errors in them, and most were
> successfully backed up.

Lucky, or more appropriately, wise of you!  There aren't many folks 
who back up to a normally-offline external device that regularly.  
Honestly, I don't.

> I'm no longer using lvm2.  If I need to expand my RAID I can
> potentially reshape it (after backups where possible).  I miss some of
> the flexibility, but when I need a few GB of scratch space to test out a
> filesystem upgrade or something I just use losetup - but I don't care
> about performance in these cases.

That makes sense.
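
For anyone who hasn't used that trick, the loopback scratch-space 
approach looks roughly like this (the backing file location, size and 
loop device are just examples):

  # Create a 2 GB file to serve as the backing store.
  dd if=/dev/zero of=/var/tmp/scratch.img bs=1M count=2048

  # Attach it to a loop device, put a filesystem on it, mount it.
  losetup /dev/loop0 /var/tmp/scratch.img
  mke2fs -j /dev/loop0
  mkdir -p /mnt/scratch
  mount /dev/loop0 /mnt/scratch

  # ...and tear it down again when done.
  umount /mnt/scratch
  losetup -d /dev/loop0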

> I would say that lvm2 is probably safe if you have more reliable
> hardware.  My problem was that a failing drive not only made the drive
> inaccessible, but it took down the whole system (since hardware on a
> typical desktop isn't well-isolated).  On a decent server a drive
> failure shouldn't cause errors that bring down the whole system.  So, I
> didn't get the full benefit from RAID.

I believe you're correct.  Years ago, LVM was the only way to get 
partitions on RAID, since RAID didn't allow its own partitioning.  
However, partitioned RAID has been available since the beginning of the 
2.6 kernel series, and while it lacks a bit of LVM2's flexibility, it's 
arguably worth that tradeoff in order to eliminate the LVM layer 
entirely, and with it, the additional risk (however low it may be) and 
administrative overhead that come with running major parts of a system 
on block devices that require userspace setup.  The same applies to 
FUSE-based filesystems as well, at least to some extent.
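
For the record, a partitioned array is created the same way as any 
other, just as a partitionable device type, something like this 
(example devices again):

  # Create a partitionable RAID-6; the array shows up as /dev/md_d0
  # and its partitions as /dev/md_d0p1, /dev/md_d0p2, etc.
  mdadm --create /dev/md_d0 --auto=part --level=6 --raid-devices=4 \
        /dev/sda /dev/sdb /dev/sdc /dev/sdd

  # Then partition it with the usual tools.
  fdisk /dev/md_d0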

So... I guess that's something else I can add to my list now, for the 
next time I set up a new disk set or whatever.  To the everything-
portage-touches-on-root layout that I explained in the other replies, 
and the RAID-6 that I had already chosen over RAID-5, I can now add 
killing the LVM2 used in my current setup.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

