On Tue, Sep 16, 2025 at 01:38:03AM +0000, H. Hartzer wrote:
> Hi Crystal,
> 
> On Mon Sep 15, 2025 at 5:37 PM UTC, Crystal Kolipe wrote:
> > On Mon, Sep 15, 2025 at 03:41:14PM +0000, H. Hartzer wrote:
> >> This is a little bit frustrating as one usually uses RAID 1 to
> >> improve reliability, not decrease it!
> >
> > Define "reliability".
> 
> A conceptual idea of how trustworthy something is.

Sure, but you are trying to put a single 'number' on a concept that has many
facets.

If you had a choice of two setups, one which had greater resilience against
total failure but which might return bad data once in every 10^24 bits read,
and another which had less resilience against total failure, but only returned
bad data once in every 10^32 bits read, which would you consider more
'trustworthy'?
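(Those exponents are hypothetical, but real datasheets quote comparable
figures: consumer drives are typically specified at one unrecoverable bit
error per 10^14 bits read, enterprise drives at one per 10^15.  Reading a
10 TB drive end to end is about 8 x 10^13 bits, so:

    at 1 error per 10^14 bits:  8 x 10^13 / 10^14 ~= 0.8  expected errors
    at 1 error per 10^15 bits:  8 x 10^13 / 10^15 ~= 0.08 expected errors

In other words, a single full pass over a large consumer drive has a
realistic chance of hitting at least one unrecoverable sector, which is
exactly the sort of trade-off the question above is asking about.)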

> > In simple terms, RAID 1 offers you _some_ protection against a complete disk
> > failure, (traditionally the risk most envisaged was a head crash).
> >
> > If and only if the failure of one drive doesn't take down the whole machine,
> > (or at least other drives in the mirror, as was common on multi-drop 
> > parallel
> > SCSI), then you hopefully gain uptime.  The example I usually give for this
> > use-case would be media playout at a radio or tv station - you don't want to
> > wait for 3 hours while you restore from a backup, you need that output to
> > continue immediately.
> 
> This is true. It would be ideal in my case, if AHCI hotplug was
> supported, to be able to replace a drive on the fly. But replacing from
> a hotswap tray and requiring one reboot isn't too bad, either.

So it seems that you're looking to achieve the highest possible availability
of this machine.  But the best approach to achieving that also depends on the
workload.  For example, whether it's primarily a read-only or a read-write
situation, and whether uptime is more important than possible data errors.

There would be a big difference, for example, between the media library
application that I described, (which is primarily read-only, and where a
single bit error might go completely unnoticed), and recording data from an
MRI scanner in real time, (which is primarily writing, where errors might be
catastrophic, and where uptime is probably fairly important, but where, in
the case of an error, the data can likely be re-collected), and recording
data received from a weather satellite, or space probe, and so on.  Each has
its own balance of which failure modes are more acceptable than others.

In some applications, it's better to have two redundant servers, each with a
single disk, (SLED), than to build a fancy RAID on a single machine.

> I've seen a lot of drives that started getting bad blocks before
> catastrophic failure. SMART tends to show these errors.

Yes, assuming that the SMART data being collected is trustworthy, (I've seen
some drives for which it was dubious), then it's obviously better to take the
disk out of service sooner rather than later.

> > However, (most implementations of), RAID 1 _increase_ your risk of silent 
> > data
> > corruption, because they read round-robin from all of the disks in the 
> > array.
> 
> That is true, though most disks *should* know if the block's checksum is
> invalid.

Nice theory.  The real world is _very_ different.  Especially on cheap
consumer drives, (although more expensive 'prosumer' drives are not always
a better choice).
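To put a number on the round-robin point above: with a two-disk mirror read
round-robin, a silently-corrupted copy of a block on _either_ disk is
returned to the application about half the time that block is read.  The
per-read rate is roughly:

    P(bad read) ~= (p1 + p2) / 2

where p1 and p2 are each disk's undetected error rates.  For identical disks
that's the same rate as a single disk, but the population of sectors whose
corruption can reach you has doubled, and a mismatch between the two copies
is never detected, only alternated between.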

> Now I have seen it happen before that RAID 1 can have
> mismatching blocks from one drive to another. I'm not 100% sure what
> this is from.

Firmware bugs, torn writes, bit flips anywhere along the chain of events that
goes from one device to another, software writing directly to one of the raw
volumes.

With devices that are doing a read-modify-write, if they read bad data as
good during this process, then they will write that bad data back and mark it
as good.

This can happen with flash-based devices, but also with some magnetic
drives, if they are configured to emulate 512 byte blocks on physical media
formatted for 4K blocks.  An SMR drive that's shuffling data around in its
idle time could also corrupt data whilst it's in flight internally.

(All this is another reason that I favour optical WORM media for data which
 doesn't change.)
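To make the read-modify-write failure mode concrete, here is a minimal C
sketch of what 512e firmware logically has to do, (the physical medium is
faked as an in-memory array; this illustrates the mechanism, it is not real
firmware):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PHYS_SECTOR 4096
#define LOG_SECTOR   512
#define NSECTORS       4

/* Toy in-memory "medium", standing in for the platters. */
static uint8_t medium[NSECTORS][PHYS_SECTOR];

/*
 * Write one 512-byte logical block the way a 512e drive must:
 * read the surrounding 4K physical sector, patch in the new 512
 * bytes, and write the whole 4K sector back out.
 */
static void
write_logical(uint64_t lbn, const uint8_t data[LOG_SECTOR])
{
	uint8_t buf[PHYS_SECTOR];
	uint64_t psn = lbn / 8;		/* 8 logical per physical */
	size_t off = (lbn % 8) * LOG_SECTOR;

	/*
	 * Step 1: read the physical sector.  If the drive's ECC
	 * passes corrupted data as good at this point, the damage
	 * is about to be rewritten and re-certified as valid.
	 */
	memcpy(buf, medium[psn], PHYS_SECTOR);

	/* Step 2: modify only our 512 bytes. */
	memcpy(buf + off, data, LOG_SECTOR);

	/*
	 * Step 3: write the full 4K sector back.  The other seven
	 * logical blocks now carry fresh ECC, even if they were
	 * misread in step 1 - any corruption is silently laundered.
	 */
	memcpy(medium[psn], buf, PHYS_SECTOR);
}

int
main(void)
{
	uint8_t blk[LOG_SECTOR];

	memset(blk, 0xaa, sizeof(blk));
	write_logical(3, blk);	/* rewrites all of physical sector 0 */
	printf("physical sector 0, byte 1536: 0x%02x\n", medium[0][1536]);
	return 0;
}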

> I think an ideal case, for RAID 1, might be 3+ drives where a correct
> block would be solved in a "majority wins" situation.

Patches are welcome!  If you have a look at the softraid chapter in the
reckless guide to OpenBSD hosted on our research site, then you'll see that
something like this was suggested as a programming exercise by one of my
colleagues a few years ago.
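For anybody tempted by that exercise, a minimal sketch of the voting logic
itself, (the per-chunk read is stubbed out with an in-memory array here; in
real softraid it would be an I/O request to each chunk):

#include <stdint.h>
#include <string.h>

#define BLKSZ 512
#define NDISK 3

/*
 * Stand-in for reading one block from one chunk of the mirror.
 * Returns 0 on success, -1 on a reported read error.
 */
static int
chunk_read(int disk, uint64_t blkno, uint8_t buf[BLKSZ])
{
	static uint8_t fake_disks[NDISK][BLKSZ];	/* toy medium */

	(void)blkno;
	memcpy(buf, fake_disks[disk], BLKSZ);
	return 0;
}

/*
 * Read one block with 2-of-3 voting: return the first copy that
 * agrees with at least one other successfully-read copy.
 */
static int
mirror_read_vote(uint64_t blkno, uint8_t out[BLKSZ])
{
	uint8_t copy[NDISK][BLKSZ];
	int ok[NDISK], i, j;

	for (i = 0; i < NDISK; i++)
		ok[i] = (chunk_read(i, blkno, copy[i]) == 0);

	for (i = 0; i < NDISK; i++) {
		if (!ok[i])
			continue;
		for (j = i + 1; j < NDISK; j++) {
			if (ok[j] &&
			    memcmp(copy[i], copy[j], BLKSZ) == 0) {
				memcpy(out, copy[i], BLKSZ);
				return 0;
			}
		}
	}
	return -1;	/* no two copies agree: report, don't guess */
}

int
main(void)
{
	uint8_t blk[BLKSZ];

	return mirror_read_vote(0, blk) == 0 ? 0 : 1;
}

Note that every read now touches every disk, which is exactly where the
throughput benefit mentioned below goes.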

> Now this would
> lose the read throughput benefits of RAID 1.

Is the read throughput benefit actually important in _your_ specific
application?

> > Why are you using RAID in the first place?  What are you trying to achieve?
> > Do you actually have a use-case that would benefit from improved anything
> > over what you get with a simple one SSD, using a normal FFS filesystem, and
> > normal backups?
> 
> In my case, I was testing with SSDs on a baremetal provider just to
> quickly simulate things without going out to the office. I've been using
> more spinning platter harddrives as of late which certainly do get the
> occasional bad block.

A bad block which is detected on read by the device and causes a read failure
will cause the softraid code to read from the next drive in the RAID 1 array.

A bad block detected on write should be handled transparently by the drive,
i.e. it writes the data elsewhere on the physical medium and updates its
internal mappings so that reads from that block return the data written in
the new location.

Or are you saying that you're getting bad data returned as good off of these
drives?  (Which will almost always happen _occasionally_, and should be
checked for at the application level.)
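On OpenBSD, that application-level check is cheap to add, because the SHA-2
helpers ship in libc, (see SHA256Init(3)).  A minimal sketch, assuming you
recorded a digest when the data was known to be good:

#include <sha2.h>	/* OpenBSD libc, see SHA256Init(3) */
#include <err.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char *argv[])
{
	char have[SHA256_DIGEST_STRING_LENGTH];

	if (argc != 3)
		errx(1, "usage: verify file sha256hex");

	/* Hash the file as it reads back from the disk today. */
	if (SHA256File(argv[1], have) == NULL)
		err(1, "%s", argv[1]);

	/* Compare against the digest recorded when it was written. */
	if (strcmp(have, argv[2]) == 0) {
		printf("%s: ok\n", argv[1]);
		return 0;
	}
	printf("%s: MISMATCH\n", argv[1]);
	return 2;
}

The known-good digest can be captured at write time with, for example,
sha256(1).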

> Harddrive failures are pretty much guaranteed on a long enough timeline.
> On most 24/7 systems, it's more convenient to swap a drive than do a
> full reinstall. That's not to say that backups aren't more important
> than RAID, but it can be a faster way to get back online (or never go
> offline, in the first place.)

Sure, but once again, you increase risks _elsewhere_.

With three disks connected to the same PSU instead of one disk, there is
more risk that any one of those disks could develop a fault, (e.g. a leaky
capacitor), that shorts the power supply.  Now your machine is off-line.

> I've been dabbling with 2.5" SATA harddrives which use very little
> power. Though they do seem much less reliable than 3.5" drives,

It depends a lot on the drives, whether they are designed to resist vibration,
resist excessive heat, whatever.  Comparing random 2.5" drives to other random
3.5" drives is a very broad comparison.

> I could
> have 3 or 4 of them using the same amount of power if I was paranoid
> about drive reliability. And two for most cases is sufficient.

So instead of one or two reliable, (by your own definition), 3.5" drives, you
are suggesting using three or four less reliable, (also by your own
definition), and hoping that the combination of these within a RAID 1 will
give you increased reliability overall?

It's possible.  But adding more and more crappy drives to a RAID 1 generally
does not make it more reliable.

Mathematically, if _any one_ of the drives causes problems, the array as a
whole _might_ be worse off than it was without that drive.  It depends on
_how_ that drive fails and what it does.
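For the failure modes that the mirror does _not_ mask, (shorting the PSU,
returning bad data as good), the arithmetic is straightforward.  With n
drives, each with probability p of misbehaving in a given year, (the 5%
figure below is purely illustrative):

    P(at least one misbehaves) = 1 - (1 - p)^n

    p = 0.05, n = 1:  5%
    p = 0.05, n = 2:  1 - 0.95^2 ~=  9.8%
    p = 0.05, n = 4:  1 - 0.95^4 ~= 18.5%

More drives only pay off for the failure modes that the mirror actually
masks; for the others, you have simply multiplied your exposure.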

> > I fail to understand why there is such a desire by so many people to
> > over-complicate things such as filesystems that are already, (in the case of
> > FFS), complicated enough.
> >
> > In general, unless one has a specific use-case for RAID, or they are 
> > actually
> > testing and developing the RAID code, then leave it out, (and that applies 
> > to
> > any OS, not just OpenBSD).
> 
> One concern of mine, that I have not tested, is if this easier
> corruptibility also applies to the crypto discipline.

What 'easier corruptibility'?  The pure crypto discipline uses a single disk,
so there is no risk of mismatched data coming from different underlying
volumes.

I manage dozens of machines that use softraid crypto volumes, and I've never
seen a data loss issue that can be attributed to the softraid crypto code,
_when used with correctly functioning underlying devices_.

> > Furthermore, there seem to have been a lot of scare stories recently about
> > data loss on FFS, in various scenarios, but hard facts and reproducible 
> > steps
> > are much more thin on the ground.
> 
> I was readily reproducing corruption to the point of panics, or at least
> requiring manual fsck. Syncing the Monero blockchain and having the
> power cut seems quite reliable.

What exactly do people expect in this situation of randomly cutting the
power to a machine that is doing a write?  Why would you do that, and how
would you expect it to behave?

Once you add in the small detail of power being cut at random intervals, you
need to consider how each drive is going to handle that: whether it is
capable of writing the data in its internal cache to the permanent medium
using its remaining residual power, and whether the host even sent the write
to each drive, (which it would have done if the power hadn't been cut).
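The host's side of that chain is at least visible from userland.  A minimal
sketch of what 'the write was sent' means, (note that fsync(2) only pushes
the data as far as the device; what the drive's volatile cache does when the
power dies is beyond the host's control):

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a buffer and insist that it reaches the device.  A
 * successful write(2) only means the kernel has the data;
 * a successful fsync(2) means the kernel has handed it to
 * the drive.  Beyond that, it's up to the hardware.
 */
static void
durable_write(const char *path, const void *buf, size_t len)
{
	int fd;

	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1)
		err(1, "open %s", path);
	if (write(fd, buf, len) != (ssize_t)len)
		err(1, "write %s", path);
	if (fsync(fd) == -1)
		err(1, "fsync %s", path);
	if (close(fd) == -1)
		err(1, "close %s", path);
}

int
main(void)
{
	durable_write("testfile", "hello\n", 6);
	return 0;
}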

> I just thought it was very interesting and unusual that I had (as so far
> tested) 100% reliability with sync and no RAID 1

And there is your answer - you already had a good working setup for your
needs.

> , but with RAID 1, even a
> single drive, and mounted sync, I could reproduce issues easily.

Because RAID 1 is not really suited to _your_ application, (of randomly
cutting power to a running machine).

> You may well be right that RAID can overcomplicate things, but I feel
> like RAID 1 should be possible without decreasing any metric of
> reliability

Without decreasing _any_ metric of reliability?

When a drive fails, it can do virtually _anything_.  How can two drives
ever be more reliable than one, in every single possible aspect?

> (other than drives possibly giving bad data -- which they
> *shouldn't* do due to checksums.)

Honestly, _forget_ the idea that drives can be trusted to always return good
data, or, failing that, to always report the fault.

There are enough studies and published data available on-line showing that
this is not the case, (and my practical experience backs this up).
