On 2017-01-19 13:23, Roman Mamedov wrote:
On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo" <alejan...@mosteo.com> wrote:
I was wondering, from a data-safety point of view, if there is any
difference between using dup or making a raid1 from two partitions on
the same disk. This is with an eye to having some protection against
the typical aging HDD that starts to develop bad sectors.
RAID1 will write slower compared to DUP, as any optimization to make RAID1
devices work in parallel turns into a total performance disaster for you here:
you end up trying to write to both partitions at the same time, turning
all linear writes into random ones, which are about two orders of magnitude
slower than linear writes on spinning hard drives. DUP shouldn't have this
issue, but it will still be half the speed of single, since you are writing
everything twice.
As of right now, there will actually be near-zero impact on write
performance (or at least, far less than the theoretical 50%), because
there really isn't any optimization to speak of in the multi-device
code. That will hopefully change over time, but it's not likely to
happen any time soon, since nobody appears to be working on
multi-device write performance.
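Roman's linear-versus-random point is still worth putting rough numbers on,
though, since that's the gap you'd be exposed to if the multi-device code
ever did start parallelizing writes. A back-of-the-envelope sketch in
Python; the throughput, IOPS, and I/O-size figures are assumed ballpark
values for a typical desktop drive, not measurements:

    # Why RAID1 across two partitions of one spinning disk hurts: the
    # head has to seek between the partitions, so sequential writes
    # degrade to random ones. All figures are assumed ballpark values
    # for a typical desktop HDD, not measurements.

    SEQ_MB_S    = 150.0   # assumed sequential write throughput
    RANDOM_IOPS = 100.0   # assumed seek-bound random-write IOPS
    IO_SIZE_KB  = 16.0    # assumed size of each random write

    random_mb_s = RANDOM_IOPS * IO_SIZE_KB / 1024.0   # ~1.6 MB/s
    print(f"sequential: {SEQ_MB_S:.0f} MB/s")
    print(f"random:     {random_mb_s:.1f} MB/s "
          f"(~{SEQ_MB_S / random_mb_s:.0f}x slower)")
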
You could consider DUP data for when a disk is already known to be getting bad
sectors from time to time -- but then it's a fringe exercise to try to keep
using such a disk in the first place. Yeah, with DUP data and DUP metadata you
can likely get some more life out of such a disk as throwaway storage space for
non-essential data, at half capacity, but is it worth the effort, given that
it's likely to keep failing progressively worse over time?
In all other cases the performance and storage-space penalties of DUP within a
single device are far too great (and the gained redundancy too low) compared
to a proper setup of single-profile data + backups, or a RAID5/6 system (not
Btrfs-based) + backups.
That really depends on your usage. In my case, I run DUP data on single
disks regularly. I still do backups of course, but the performance
matters far less to me (especially in the cases where I'm using NVMe
SSDs, which have performance measured in thousands of MB/s for both
reads and writes) than the ability to recover from transient data
corruption without needing to go to a backup.
As long as /home and any other write-heavy directories are on a separate
partition, I would actually advocate using DUP data on your root
filesystem if you can afford the space, simply because it's a whole lot
easier to recover other data if the root filesystem still works. Most
of the root filesystem, except some stuff under /var, follows a WORM
(write once, read many) access pattern, and even the stuff in /var that
doesn't is usually not performance-critical, so the write performance
penalty won't have anywhere near as much impact on how well the system
runs as you might think.
There's also the fact that you're writing more metadata than data most
of the time unless you're dealing with really big files, and metadata
already defaults to DUP (unless you are on an SSD), so the performance
hit isn't 50%; it's actually a bit more than half the ratio of data
writes to metadata writes.
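To make that last point concrete, here's a minimal sketch of the
arithmetic in Python; the data-to-metadata split is an assumed,
metadata-heavy example rather than a measured workload:

    # Effective cost of switching data from single to DUP when metadata
    # is already DUP. The data:metadata split below is an assumed,
    # metadata-heavy example (lots of small files); measure your own
    # workload for real numbers.

    data = 1.0   # assumed units of data written
    meta = 2.0   # assumed units of metadata written

    single_data = data + 2 * meta        # single data + DUP metadata = 5
    dup_data    = 2 * data + 2 * meta    # DUP data + DUP metadata    = 6

    print(f"extra device writes: {dup_data / single_data - 1:.0%}")        # 20%
    print(f"write throughput vs. single data: {single_data / dup_data:.0%}")  # 83%
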
On a related note, I see this caveat about dup in the manpage:
"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"
That ability is vastly overestimated in the man page. There is no miracle
content-addressable storage system working at 500 MB/sec all within a cheap
little controller on SSDs. Most of what it can likely do is compress simple
stuff, such as runs of zeroes or other repeating byte sequences.
Most SSDs that do in-line compression don't implement it in firmware;
they implement it in hardware, and even DEFLATE can hit 500 MB/second
if properly implemented in hardware. The firmware may control how the
hardware works, but it's usually the hardware doing the heavy lifting
in that case, and getting a good ASIC made that can hit the required
performance point for a reasonable compression algorithm like LZ4 or
Snappy is insanely cheap once you've gotten past the VLSI work.
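The "simple stuff compresses trivially" point is easy to demonstrate in
software, which gives some idea of how cheap it is even before an ASIC
gets involved. A small Python sketch, with zlib standing in for whatever
the drive actually implements:

    import os
    import zlib

    # Repeating byte patterns collapse to almost nothing, while random
    # (or already compressed/encrypted) data doesn't shrink at all.
    # zlib is only a stand-in here for whatever the drive's firmware or
    # ASIC actually implements.

    zeroes = b"\x00" * (1 << 20)        # 1 MiB of zeroes
    noise  = os.urandom(1 << 20)        # 1 MiB of incompressible data

    print(len(zlib.compress(zeroes)))   # on the order of 1 KB
    print(len(zlib.compress(noise)))    # slightly over 1 MiB
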
And DUP mode is still useful on SSDs: in cases where one copy gets corrupted
in flight due to a bad controller, RAM, or cable, you can then restore that
block from the copy with a good CRC.
The only window of time during which bad RAM could result in only one
copy of a block being bad is after the first copy is written but before
the second is, which is usually an insanely small amount of time. As
far as the cabling goes, the window for errors resulting in a single bad
copy of a block is pretty much the same as for RAM, and if either is
persistently bad, you're more likely to lose data for other reasons.
That said, I do still feel that DUP mode has value on SSDs. The
primary arguments against it are:
1. It wears out the SSD faster.
2. The blocks are likely to end up in the same erase block, and
therefore there will be no benefit.
The first argument is accurate, but not usually an issue for most
people. Average life expectancy for a decent SSD is well over 10 years,
which is more than twice the usual life expectancy for a consumer hard
drive. Putting it in further perspective, my 575GB SSDs have been
running essentially 24/7 for the past year and a half (13112 hours
powered on now), and have seen just short of 25.7TB of writes over that
time. This equates to roughly 2GB/hour, which is well within typical
desktop usage, and it means they've seen more than 44.5 times their
total capacity in writes. Despite this, the wear-out indicators all
show that I can still expect at least 9 more years of run-time on these.
Normalizing that, I'm likely to see between 8 and 12 years of life out
of them. Equivalent stats for the HDDs I used to use (NAS-rated
Seagate drives) gave me a roughly 3-5 year life expectancy, less than
half that of the SSDs. In both cases, however, you're talking well
beyond the typical life expectancy of anything short of a server or a
tightly embedded system, and worrying about a 4-year versus 8-year life
expectancy on your storage device is kind of pointless when you need to
upgrade the rest of the system in 3 years.
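For anyone who wants to redo that arithmetic against their own drive's
SMART data, it boils down to this; the capacity, hours, and write totals
are the figures quoted above, while the endurance rating is an assumed
example value, not a number from my drives:

    # Redoing the wear arithmetic above. Capacity, powered-on hours, and
    # total writes are the figures quoted in the text; the TBW endurance
    # rating is an assumed example value, not a spec from my drives.

    capacity_gb = 575.0
    hours_on    = 13112.0
    written_tb  = 25.7
    tbw_rating  = 150.0   # assumed total-bytes-written endurance, in TB

    print(f"{written_tb * 1000 / hours_on:.1f} GB/hour")        # ~2.0
    print(f"{written_tb * 1000 / capacity_gb:.1f}x capacity")   # ~44.7
    years_so_far = hours_on / (24 * 365)
    years_total  = years_so_far * tbw_rating / written_tb
    print(f"projected life at this rate: ~{years_total:.0f} years")  # ~9
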
As far as the second argument goes, it is partially correct, but it
ignores an important factor that many people who don't do hardware
design (and some who do) don't often consider. The close temporal
proximity of the writes for each copy is likely to mean they end up in
the same erase block on the SSD (especially if the SSD has a large write
cache). However, that doesn't mean that one copy getting corrupted due
to device failure is guaranteed to corrupt the other. The reason for
this is exactly the same reason that single-word errors in RAM are
exponentially more common than losing a whole chip or the whole memory
module: the primary error source is environmental noise (EMI, cosmic
rays, quantum interference, background radiation, etc.), not outright
hardware failure. In other words, you're far more likely to lose a
single cell (which usually holds no more than a single byte in the MLC
flash used in most modern SSDs) in the erase block than the whole erase
block. In that event, you obviously only get corruption in the
particular filesystem block that that cell was storing data for.
There's also a third argument against using DUP on SSDs, however:
the SSD already does most of the data integrity work itself.
This is only true of good SSDs, but many do have some degree of
built-in erasure coding in the firmware which can handle losing large
chunks of an erase block and still return the data safely. This is part
of the reason you almost never see nice power-of-two sizes for flash
storage despite the flash chips themselves being made that way (the
other part is the spare blocks). Depending on the degree of protection
provided by this erasure coding, it can actually cancel out my rebuttal
of argument 2 above. In all practicality though, for it to be a valid
counter-argument you have to trust the SSD manufacturer to have
implemented things properly, and most people who care enough about data
integrity to use BTRFS for that reason are not likely to trust the
storage device that much.
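For what it's worth, that "no nice power-of-two sizes" observation is
just over-provisioning arithmetic; a quick sketch with assumed example
figures for a hypothetical drive:

    # Why SSD capacities aren't powers of two: part of the raw flash is
    # held back for spare blocks and ECC / erasure-coding overhead. The
    # figures below are assumed examples for a hypothetical drive, not
    # the specs of any real product.

    raw_flash_gib = 512                      # power-of-two raw NAND
    advertised_gb = 480                      # what the label says

    raw_gb   = raw_flash_gib * 2**30 / 1e9   # ~549.8 GB
    reserved = raw_gb - advertised_gb        # ~69.8 GB
    print(f"reserved: {reserved:.0f} GB ({reserved / raw_gb:.0%} of raw flash)")
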