On Tue Sep 16, 2025 at 2:11 PM UTC, Crystal Kolipe wrote:

> If you had a choice of two setups, one which had a greater resilience
> against total failure but which may return bad data once in every 1^24
> bits read, and another which has less resilience against total failure,
> but only returned bad data once in every 1^32 bits read, which would
> you consider more 'trustworthy'?

To be pedantic, an error every 1^32 bits is exactly as bad as one every
1^24 bits! I understand what you're saying, though. Sometimes silent
corruption is worse than total failure; in fact, that's probably often
the case.
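If the figures we're really talking about are something like one bad bit
per 10^14 versus 10^15 bits read (the unrecoverable-read-error rates
typically quoted on datasheets -- I'm assuming that here, the numbers
below are illustrative only), the practical difference is easy to put in
concrete terms:

/* Back-of-the-envelope: expected unrecoverable bit errors when reading
 * an entire drive once, at two illustrative (not datasheet) error rates. */
#include <stdio.h>

int main(void) {
    double drive_bits = 10e12 * 8;        /* a 10 TB drive, in bits */
    double uber[] = { 1e-14, 1e-15 };     /* assumed errors per bit read */

    for (int i = 0; i < 2; i++)
        printf("UBER %.0e: ~%.2f expected bad bits per full read\n",
               uber[i], drive_bits * uber[i]);
    return 0;
}

That prints roughly 0.80 and 0.08, i.e. at the worse rate you expect
nearly one silent bad bit every time you read a 10 TB drive end to end,
so the order of magnitude really does matter.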
> So it seems that you're looking to achieve the highest possible
> availability of this machine. But the best approach to achieving that
> also depends on the workload. For example, whether it's primarily a
> read-only or read-write situation, and whether uptime is more
> important than possible data errors.
>
> There would be a big difference, for example, between the media
> library application that I described, (which is primarily read-only
> and where a single bit error might go completely un-noticed), and
> recording data from an MRI scanner in real time, (which is primarily
> writing, where errors might be catastrophic, and where uptime is
> probably fairly important, but in the case of an error the data can
> likely be re-collected), and recording data received from a weather
> satellite, or space probe, etc, etc. Each has its own balance of which
> failure modes are more acceptable than others.
>
> In some applications, it's better to have two redundant servers each
> with a single disk, (SLED), than building a fancy RAID on a single
> machine.

I agree, two machines with single disks can be preferable in a lot of
situations.

> Yes, assuming that the SMART data being collected is trustworthy,
> (I've seen some drives for which it was dubious), then it's obviously
> better to take the disk out of service sooner rather than later.

I routinely see bogus figures in SMART, but on many drives most of the
important fields seem valid. In my experience, Seagates are some of the
worst for bogus fields.

> With devices that are doing a read-modify-rewrite, if they read bad
> data as good during this process then they will write that bad data
> back and mark it as good.
>
> This can happen with flash-based devices, but also some magnetic
> drives, if they are configured to emulate 512 byte blocks on physical
> media formatted for 4K blocks. Also, an SMR drive that's shuffling
> data around in its idle time could corrupt data whilst it's in flight
> internally.

Interesting. Are you saying that hard drives with native 512 byte
sectors might be slightly less error prone? I can see how a 4K sector
drive with 512 byte emulation might rewrite a bogus sector all on its
own, even if the 512 byte sector it was handed was fine. It's a shame
that 4K native drives, or the ability to force 4K native operation, are
rare and complicated to find.
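For anyone who hasn't run into it: a 512e drive exposes 512-byte logical
sectors on 4K physical media, so any sub-4K write becomes a
read-modify-write of the whole physical sector inside the firmware. A
minimal illustration of the mapping (numbers made up):

/* Illustration only: the 512e -> 4Kn mapping that forces read-modify-write.
 * Real drives do this inside their firmware. */
#include <stdio.h>

int main(void) {
    const unsigned long LOGICAL = 512, PHYSICAL = 4096;
    unsigned long lba = 1234567;                      /* 512-byte logical block */
    unsigned long pba = (lba * LOGICAL) / PHYSICAL;   /* 4K physical sector */
    unsigned long off = (lba * LOGICAL) % PHYSICAL;   /* offset within it */

    printf("LBA %lu is byte offset %lu of physical sector %lu\n",
           lba, off, pba);

    /* Writing this one logical block means the drive must read the whole
     * 4K physical sector, patch 512 bytes, and write all 4096 back.  If
     * the read step returned bad data as good for any of the other seven
     * logical blocks, that bad data is now re-written and marked good. */
    return 0;
}

A 4K native (or true 512 native) drive never has to do that round trip
for aligned writes, which I take to be the failure mode being described.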
> (All this is another reason that I favour optical WORM media for data
> which doesn't change.)

Absolutely!

> Patches are welcome! If you have a look at the softraid chapter in the
> reckless guide to OpenBSD hosted on our research site, then you'll see
> something like this was suggested as a programming exercise by one of
> my colleagues a few years ago.

Wow, your research site is fantastic! I spent a while reading through
it and there's lots more to cover. It's a terrific resource.

It sure would be cool to have a sysctl tunable for the RAID read
method, or some other way to configure how it behaves.

> Is the read throughput benefit important in _your_ application?

Nope. It's nice to have, but I would prefer slower and more reliable
over faster reads.

> It depends on _your_ specific application.

Indeed.

> A bad block which is detected on read by the device and causes a read
> failure will cause the softraid code to read from the next drive in
> the RAID 1 array.

Right.

> A bad block detected on write should be handled transparently by the
> drive, I.E. it writes the data elsewhere on the physical medium and
> updates its internal mappings so that reads from that block return the
> data written in the new location.

I think this will happen on an SSD, and maybe an SMR hard drive. I tend
to think it doesn't happen on regular hard drives. I may be mistaken,
but I was testing a number of consumer grade SATA drives: I would see
an entry in SMART indicating a bad block, then I'd run badblocks, and
once that spot was hit, I believe on the write cycle, the drive would
time out and be dropped.

> Or are you saying that you're getting bad data returned as good off of
> these drives? (Which will almost always happen _occasionally_, and
> should be checked for at the application level.)

I haven't had that happen yet, but it's certainly possible. It
shouldn't typically happen with good firmware, though, especially on a
4K or 512 byte native hard drive. I think those are somewhat more
failure aware, given they can't corrupt themselves through internal
rewrites, and the sector checksum should read as invalid and report a
bad block (which hopefully might trigger the RAID 1 code to check
another drive).

>> Harddrive failures are pretty much guaranteed on a long enough
>> timeline. On most 24/7 systems, it's more convenient to swap a drive
>> than do a full reinstall. That's not to say that backups aren't more
>> important than RAID, but it can be a faster way to get back online
>> (or never go offline in the first place.)
>
> Sure, but once again, you increase risks _elsewhere_.
>
> With three disks connected to the same PSU instead of one disk, there
> is more risk that any one of those disks could develop a fault, (E.G.
> leaky capacitor), that shorts the power supply. Now your machine is
> off-line.

I mean, yes... but if I'm straining a 200 watt PSU with 30 watts
instead of 20, I don't think this is much to consider. And redundant
PSUs are definitely a thing.

>> I've been dabbling with 2.5" SATA harddrives which use very little
>> power. Though they do seem much less reliable than 3.5" drives,
>
> It depends a lot on the drives, whether they are designed to resist
> vibration, resist excessive heat, whatever. Comparing random 2.5"
> drives to other random 3.5" drives is a very broad comparison.

I realize this is painting with broad strokes, but I went through a
collection of old 2.5" and 3.5" SATA drives. The 2.5" drives, all
consumer/laptop grade, had a very high bad block rate; I think a third
to a half were bad, versus maybe 10-20% of the 3.5" drives, which had
much higher hours. A lot of the 2.5" drives that failed had very low
hours. The one saving grace I found is that Toshiba 2.5" HDDs seem to
be more reliable than most other brands, so I am testing them more.
Not sure how they will do in 24/7 service, so it's a gamble, but I'm
curious.
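For the curious, the surface testing above was with badblocks, but a
minimal read-only scan is simple enough to sketch. This is illustrative
only, not a badblocks replacement; point it at a raw device as root
(e.g. /dev/rsd1c on OpenBSD -- adjust to taste):

/* Minimal read-only surface scan sketch: read a raw disk in 64 KB
 * chunks and report any read errors.  No retries, no progress output,
 * and nothing like badblocks' write tests. */
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/rsdXc\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }
    static char buf[CHUNK];
    off_t off = 0;
    unsigned long errors = 0;
    ssize_t n;

    for (;;) {
        n = pread(fd, buf, CHUNK, off);
        if (n == -1) {
            printf("read error at byte offset %lld: %s\n",
                   (long long)off, strerror(errno));
            errors++;
            off += CHUNK;               /* skip past the bad region */
            continue;
        }
        if (n == 0)                     /* end of device */
            break;
        off += n;
        if (n < (ssize_t)CHUNK)         /* short read at end of device */
            break;
    }
    printf("scan complete: %lu error(s)\n", errors);
    close(fd);
    return errors ? 2 : 0;
}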
>> I could have 3 or 4 of them using the same amount of power if I was
>> paranoid about drive reliability. And two for most cases is
>> sufficient.
>
> So instead of one or two reliable, (by your own definition), 3.5"
> drives, you are suggesting using three or four less reliable, (also by
> your own definition), and hoping that the combination of these within
> a RAID 1 will give you increased reliability overall?
>
> It's possible. But adding more and more crappy drives to a RAID 1
> generally does not make it more reliable.
>
> Mathematically, if _any one_ of the drives causes problems, the array
> as a whole _might_ be worse off than it was without that drive. It
> depends on _how_ that drive fails and what it does.

Oh, you're definitely right with the current RAID 1 code. Thinking
about it more, this is not such a good idea, unless I rewrite the
RAID 1 code around a "majority wins" concept, reading from multiple
drives and comparing. Such a system could be awfully robust against
drive issues.
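To make the idea concrete, here's a rough userland sketch of what I
mean by majority-wins reads over three mirrors. This is not softraid
code, the helper names are hypothetical, and a real implementation
would live in the kernel's RAID 1 read path:

/* Hypothetical sketch of a "majority wins" RAID 1 read across three
 * mirrors: fetch the same block from each drive and return a copy that
 * at least two drives agree on.  Not real softraid code. */
#include <sys/types.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   512
#define NDRIVES 3

/* Read one block from mirror 'i'; fds[] are assumed to be open file
 * descriptors for the underlying devices. */
static int read_mirror(int *fds, int i, off_t blkno, char *buf) {
    return pread(fds[i], buf, BLKSZ, blkno * BLKSZ) == BLKSZ ? 0 : -1;
}

/* Returns 0 and fills 'out' if a majority of mirrors agree, or -1 if
 * the copies all differ or too many reads failed outright. */
int majority_read(int *fds, off_t blkno, char *out) {
    char copy[NDRIVES][BLKSZ];
    int ok[NDRIVES];

    for (int i = 0; i < NDRIVES; i++)
        ok[i] = read_mirror(fds, i, blkno, copy[i]) == 0;

    for (int i = 0; i < NDRIVES; i++) {
        if (!ok[i])
            continue;
        int votes = 1;
        for (int j = i + 1; j < NDRIVES; j++)
            if (ok[j] && memcmp(copy[i], copy[j], BLKSZ) == 0)
                votes++;
        if (votes > NDRIVES / 2) {      /* 2 of 3 is a majority */
            memcpy(out, copy[i], BLKSZ);
            return 0;
        }
    }
    return -1;  /* no quorum: surface an I/O error instead of bad data */
}

The write path is the hard part, of course: all three copies have to be
kept consistent through power loss, which is exactly the failure mode
discussed below.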
> What 'easier corruptibility'? The pure crypto discipline uses a single
> disk, so there is no risk of mismatched data coming from different
> underlying volumes.

Given that RAID 1 with a single drive could induce issues on a power
cut that the same drive without RAID 1 could not, I just find the
softraid layer itself somewhat suspect. It may well be that the crypto
discipline doesn't have this problem, and I hope that's the case.

> I manage dozens of machines that use softraid crypto volumes, and I've
> never seen a data loss issue that can be attributed to the softraid
> crypto code, _when used with correctly functioning underlying
> devices_.

That is promising; it's just speculation on my part.

> What do people expect exactly in this situation of randomly cutting
> the power to a machine that is doing a write? Why would you do that,
> or at least how would you expect it to behave?

In my case... I had a power strip placed so precariously that it was
very easy to switch it off. This happened a few times, and I was
surprised when the system wouldn't boot without intervention, and then,
with intervention, that it panicked. Crashes are another avenue for
this to happen, and they certainly do happen. I've just never had
another system have a partition become so corrupt that it was still
causing panics even after a manual fsck. This may not be the biggest
deal if it's /tmp, but if it's /var it's downright annoying.

I also don't always have physical access to machines. It's nice to know
the machine will have SSH up and running after a power cut so I can get
into it without KVM, serial, etc.

I'd expect the write to likely leave a corrupt or partial block that
needs to be rewritten (I'm not quite sure how that behaves). I am not
sure how this manifests -- whether the last block in the cache is
always written in full, or whether it can be partial. In that case, I
wonder if it's counted as a bad block in SMART on read? The real trick
is whether the block can then be written to and successfully read from
again.

> Once you add in the small detail of power being cut at random
> intervals, you need to consider how each drive is going to handle
> that, whether it is capable of writing data in its internal cache to
> the permanent medium, using its remaining residual power. Also,
> whether the host even sent the write to each drive, (which it would
> have done if the power hadn't been cut).

Yeah, this makes sense. Either way, there's a dramatic difference in
how easy it is to corrupt the system with default mount flags versus
with sync (without RAID 1). Under the right workload, there's certainly
a greater than 50% chance that a manual fsck is required and I need
some kind of physical or physical-like access to get things going
again.

>> I just thought it was very interesting and unusual that I had (as so
>> far tested) 100% reliability with sync and no RAID 1
>
> And there is your answer - you already had a good working setup for
> your needs.

I will certainly be using it more for now, though RAID 1 does still
interest me, with some work.

>> , but with RAID 1, even a single drive, and mounted sync, I could
>> reproduce issues easily.
>
> Because RAID 1 is not really suited to _your_ application, (of
> randomly cutting power to a running machine).

I get it with different drives, but with a single drive I don't
understand how the RAID 1 discipline makes corruption more likely in my
testing.

> Without decreasing _any_ metric of reliability?
>
> When a drive fails, it can do virtually _anything_. How can two drives
> ever be more reliable than one, in every single possible aspect?

You're right; this would only be the case if the majority disk approach
were implemented for RAID 1.

>> (other than drives possibly giving bad data -- which they *shouldn't*
>> do due to checksums.)
>
> Honestly, _forget_ this idea that drives can be trusted to return good
> data, and if not to always report the fault.
>
> There are enough studies and published data available on-line that
> show that this is not the case, (and my practical experience also
> backs this up).

I've had good luck with it, but I believe you that you've seen issues
first hand.

I appreciate your email. I think you're right that RAID 1 is not the
solution for me for the time being. I figured someone might have an
idea of why RAID 1 on a single drive is more corruption prone than no
RAID 1, though. With two drives I get it: there is a potential
consistency issue between the drives when the power is cut.

Certainly gives me lots to think about.

-Henrich
