On Tue, Sep 16, 2025 at 01:38:03AM +0000, H. Hartzer wrote:
> Hi Crystal,
>
> On Mon Sep 15, 2025 at 5:37 PM UTC, Crystal Kolipe wrote:
> > On Mon, Sep 15, 2025 at 03:41:14PM +0000, H. Hartzer wrote:
> >> This is a little bit frustrating as one usually uses RAID 1 to
> >> improve reliability, not decrease it!
> >
> > Define "reliability".
>
> A conceptual idea of how trustworthy something is.
Sure, but you are trying to put a single 'number' on a concept that has many
facets.  If you had a choice of two setups, one which had greater resilience
against total failure but which might return bad data once in every 10^24
bits read, and another which has less resilience against total failure, but
only returned bad data once in every 10^32 bits read, which would you
consider more 'trustworthy'?

> > In simple terms, RAID 1 offers you _some_ protection against a complete
> > disk failure, (traditionally the risk most envisaged was a head crash).
> >
> > If and only if the failure of one drive doesn't take down the whole
> > machine, (or at least other drives in the mirror, as was common on
> > multi-drop parallel SCSI), then you hopefully gain uptime.  The example
> > I usually give for this use-case would be media playout at a radio or
> > tv station - you don't want to wait for 3 hours while you restore from
> > a backup, you need that output to continue immediately.
>
> This is true. It would be ideal in my case, if AHCI hotplug was supported,
> to be able to replace a drive on the fly. But replacing from a hotswap
> tray and requiring one reboot isn't too bad, either.

So it seems that you're looking to achieve the highest possible availability
of this machine.  But the best approach to achieving that also depends on
the workload.  For example, whether it's primarily a read-only or a
read-write situation, and whether uptime is more important than possible
data errors.

There would be a big difference, for example, between the media library
application that I described, (which is primarily read-only and where a
single bit error might go completely unnoticed), recording data from an MRI
scanner in real time, (which is primarily writing, where errors might be
catastrophic, where uptime is probably fairly important, but where, in the
case of an error, the data can likely be re-collected), and recording data
received from a weather satellite, or space probe, and so on.  Each has its
own balance of which failure modes are more acceptable than others.  In some
applications, it's better to have two redundant servers each with a single
disk, (SLED), than to build a fancy RAID on a single machine.

> I've seen a lot of drives that start getting bad blocks before
> catastrophic failure. SMART tends to show these errors.

Yes, assuming that the SMART data being collected is trustworthy, (I've seen
some drives for which it was dubious), then it's obviously better to take
the disk out of service sooner rather than later.

> > However, (most implementations of), RAID 1 _increase_ your risk of
> > silent data corruption, because they read round-robin from all of the
> > disks in the array.
>
> That is true, though most disks *should* know if the block's checksum is
> invalid.

Nice theory.  The real world is _very_ different.  Especially on cheap
consumer drives, (although more expensive 'prosumer' drives are not always a
better choice).

> Now I have seen it happen before that RAID 1 can have mismatching blocks
> from one drive to another. I'm not 100% sure what this is from.

Firmware bugs, torn writes, bit flips anywhere along the chain of events
that goes from one device to another, software writing directly to one of
the raw volumes.

With devices that are doing a read-modify-rewrite, if they read bad data as
good during this process then they will write that bad data back and mark it
as good.
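If you want to see whether the two halves of a RAID 1 have actually
diverged, one rough approach is to compare the corresponding blocks on the
two underlying volumes directly.  Here is a minimal sketch, assuming
hypothetical raw partition paths and ignoring the softraid metadata area, so
treat it as illustrative only and only run anything like it against a
quiesced array:

#!/usr/bin/env python3
# Illustrative only: compare two RAID 1 legs chunk by chunk and report
# any ranges that differ.  The device paths and the metadata offset are
# assumptions - adjust for your own layout, and only do this while the
# array is not being written to.

CHUNK = 64 * 1024   # compare in 64 KB chunks (keeps raw reads aligned)
OFFSET = 0          # set this to skip the softraid metadata if required

def compare_legs(path_a, path_b):
    mismatches = []
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        a.seek(OFFSET)
        b.seek(OFFSET)
        pos = OFFSET
        while True:
            da = a.read(CHUNK)
            db = b.read(CHUNK)
            if not da and not db:
                break
            if da != db:
                mismatches.append(pos)
            pos += CHUNK
    return mismatches

if __name__ == '__main__':
    # Hypothetical raw partitions backing the RAID 1 volume.
    for off in compare_legs('/dev/rsd0d', '/dev/rsd1d'):
        print('mismatch in chunk starting at byte offset', off)

Of course, this only tells you _that_ the legs differ, not which one is
right - which is exactly the problem with a two-disk mirror.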
That read-modify-rewrite failure can happen with flash-based devices, but
also with some magnetic drives, if they are configured to emulate 512 byte
blocks on physical media formatted for 4K blocks.  An SMR drive that's
shuffling data around in its idle time could also corrupt data whilst in
flight internally.

(All this is another reason that I favour optical WORM media for data which
doesn't change.)

> I think an ideal case, for RAID 1, might be 3+ drives where a correct
> block would be solved in a "majority wins" situation.

Patches are welcome!  If you have a look at the softraid chapter in the
reckless guide to OpenBSD hosted on our research site, then you'll see that
something like this was suggested as a programming exercise by one of my
colleagues a few years ago.

> Now this would lose the read throughput benefits of RAID 1.

Is the read throughput benefit important in _your_ application?  It depends
on _your_ specific workload.

> > Why are you using RAID in the first place?  What are you trying to
> > achieve?  Do you actually have a use-case that would benefit from
> > improved anything over what you get with a simple single SSD, using a
> > normal FFS filesystem, and normal backups?
>
> In my case, I was testing with SSDs on a baremetal provider just to
> quickly simulate things without going out to the office. I've been using
> more spinning platter harddrives as of late which certainly do get the
> occasional bad block.

A bad block which is detected on read by the device and causes a read
failure will cause the softraid code to read from the next drive in the
RAID 1 array.

A bad block detected on write should be handled transparently by the drive,
i.e. it writes the data elsewhere on the physical medium and updates its
internal mappings so that reads from that block return the data written in
the new location.

Or are you saying that you're getting bad data returned as good off of these
drives?  (Which will almost always happen _occasionally_, and should be
checked for at the application level.)

> Harddrive failures are pretty much guaranteed on a long enough timeline.
> On most 24/7 systems, it's more convenient to swap a drive than do a
> full reinstall. That's not to say that backups aren't more important
> than RAID, but it can be a faster way to get back online (or never go
> offline, in the first place.)

Sure, but once again, you increase risks _elsewhere_.  With three disks
connected to the same PSU instead of one disk, there is more risk that any
one of those disks could develop a fault, (e.g. a leaky capacitor), that
shorts the power supply.  Now your machine is off-line.

> I've been dabbling with 2.5" SATA harddrives which use very little
> power. Though they do seem much less reliable than 3.5" drives,

It depends a lot on the drives, whether they are designed to resist
vibration, resist excessive heat, whatever.  Comparing random 2.5" drives to
other random 3.5" drives is a very broad comparison.

> I could have 3 or 4 of them using the same amount of power if I was
> paranoid about drive reliability. And two for most cases is sufficient.

So instead of one or two reliable, (by your own definition), 3.5" drives,
you are suggesting using three or four less reliable drives, (also by your
own definition), and hoping that the combination of these within a RAID 1
will give you increased reliability overall?

It's possible.  But adding more and more crappy drives to a RAID 1 generally
does not make it more reliable.
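To put a rough number on that, here's a back-of-the-envelope sketch.  The
per-drive figure is an assumption picked purely for illustration; the point
is only that the chance of _at least one_ drive misbehaving grows with every
drive you add:

#!/usr/bin/env python3
# Back-of-the-envelope only: 'p_single' is an assumed probability that
# any one drive misbehaves, in some way, during some period.  The shape
# of the curve is the point, not the numbers themselves.

def p_at_least_one_bad(p_single, n_drives):
    # Probability that at least one of n independent drives misbehaves.
    return 1.0 - (1.0 - p_single) ** n_drives

for n in (1, 2, 3, 4):
    print(n, 'drive(s):', round(p_at_least_one_bad(0.05, n), 3))

# With the assumed 5% per-drive figure this prints roughly
# 0.05, 0.10, 0.14 and 0.19.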
Mathematically, if _any one_ of the drives causes problems, the array as a
whole _might_ be worse off than it was without that drive.  It depends on
_how_ that drive fails and what it does.

> > I fail to understand why there is such a desire by so many people to
> > over-complicate things such as filesystems that are already, (in the
> > case of FFS), complicated enough.
> >
> > In general, unless one has a specific use-case for RAID, or they are
> > actually testing and developing the RAID code, then leave it out, (and
> > that applies to any OS, not just OpenBSD).
>
> One concern of mine, that I have not tested, is if this easier
> corruptibility also applies to the crypto discipline.

What 'easier corruptibility'?  The pure crypto discipline uses a single
disk, so there is no risk of mismatched data coming from different
underlying volumes.

I manage dozens of machines that use softraid crypto volumes, and I've never
seen a data loss issue that can be attributed to the softraid crypto code,
_when used with correctly functioning underlying devices_.

> > Furthermore, there seem to have been a lot of scare stories recently
> > about data loss on FFS, in various scenarios, but hard facts and
> > reproducible steps are much more thin on the ground.
>
> I was readily reproducing corruption to the point of panics, or at least
> requiring manual fsck. Syncing the Monero blockchain and having the
> power cut seems quite reliable.

What do people expect, exactly, in this situation of randomly cutting the
power to a machine that is doing a write?  Why would you do that, or at
least, how would you expect it to behave?

Once you add in the small detail of power being cut at random intervals, you
need to consider how each drive is going to handle that - whether it is
capable of writing the data in its internal cache to the permanent medium
using its remaining residual power.  Also, whether the host even sent the
write to each drive, (which it would have done if the power hadn't been
cut).

> I just thought it was very interesting and unusual that I had (as so far
> tested) 100% reliability with sync and no RAID 1

And there is your answer - you already had a good working setup for your
needs.

> , but with RAID 1, even a single drive, and mounted sync, I could
> reproduce issues easily.

Because RAID 1 is not really suited to _your_ application, (of randomly
cutting power to a running machine).

> You may well be right that RAID can overcomplicate things, but I feel
> like RAID 1 should be possible without decreasing any metric of
> reliability

Without decreasing _any_ metric of reliability?  When a drive fails, it can
do virtually _anything_.  How can two drives ever be more reliable than one,
in every single possible aspect?

> (other than drives possibly giving bad data -- which they *shouldn't* do
> due to checksums.)

Honestly, _forget_ the idea that drives can be trusted to return good data,
or, failing that, to always report the fault.  There are enough studies and
published data available on-line to show that this is not the case, (and my
practical experience also backs this up).
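If the data actually matters to you, verify it yourself above the drive.  A
minimal sketch of the kind of application-level check I mean, with
hypothetical paths and SHA-256 chosen purely as an example:

#!/usr/bin/env python3
# Sketch of an application-level integrity check: record a checksum for
# each file once, then re-verify against that manifest later.  The
# directory and manifest paths are hypothetical examples.

import hashlib
import os
import sys

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b''):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(top, manifest):
    with open(manifest, 'w') as out:
        for dirpath, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(dirpath, name)
                out.write(sha256_of(path) + '  ' + path + '\n')

def verify_manifest(manifest):
    ok = True
    with open(manifest) as f:
        for line in f:
            digest, path = line.rstrip('\n').split('  ', 1)
            if sha256_of(path) != digest:
                print('MISMATCH:', path)
                ok = False
    return ok

if __name__ == '__main__':
    if sys.argv[1:2] == ['build']:
        build_manifest('/data/media', '/var/db/media.sha256')
    else:
        sys.exit(0 if verify_manifest('/var/db/media.sha256') else 1)

The base sha256(1)/cksum(1) utilities with their -c option will do much the
same job; the point is simply that the integrity check happens at the
application level rather than being delegated to the drive.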
