On Sat, Sep 16, 2017 at 1:48 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> Am Sat, 16 Sep 2017 10:05:21 -0700
> schrieb Rich Freeman <ri...@gentoo.org>:
>
>>
>> My main concern with xfs/ext4 is that neither provides on-disk
>> checksums or protection against the raid write hole.
>
> Btrfs has suffered from the same RAID5 write hole problem for
> years.  I always planned to move to RAID5 later (which is why I have
> 3 disks), but I fear this won't be fixed any time soon due to design
> decisions made too early.
>

Btrfs RAID5 simply doesn't work.  I don't think it has ever been able
to recover from a failed drive - at this point it really only exists
so that the developers can keep working on it.

>
>> I just switched motherboards a few weeks ago and either a connection
>> or a SATA port was bad because one of my drives was getting a TON of
>> checksum errors on zfs.  I moved it to an LSI card and scrubbed, and
>> while it took forever and the system degraded the array more than once
>> due to the high error rate, eventually it patched up all the errors
>> and now the array is working without issue.  I didn't suffer more than
>> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
>> headache trying to recover from that (doing who knows how much
>> troubleshooting before realizing I had to do a slow full restore from
>> backup with the system down).
>
> I found md raid not very reliable in the past, but I haven't tried
> it again in years, so this may have changed.  I remember it
> destroying a file system after an unclean shutdown more than once,
> which is not what I expect from RAID1.  Other servers with file
> systems on bare metal survived the same shutdowns just fine.
>

mdadm provides no protection against either silent corruption or the
RAID write hole.

If your system dies/panics/etc while it is in the middle of writing a
stripe, the data and parity blocks are not updated atomically, so the
parity can be left inconsistent with the data.  If a drive then
fails, reconstructing from that stale parity corrupts whatever
previously occupied the rest of the stripe - that's the write hole.
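
To make that concrete, here is a hypothetical toy model of the write
hole in Python (plain XOR parity across 3 "disks"; the block contents
and layout are invented for illustration, this is not mdadm code):

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # A 3-disk RAID5 stripe: two data blocks plus one parity block.
    d0 = b"old data on disk 0"
    d1 = b"old data on disk 1"
    parity = xor(d0, d1)          # consistent: parity = d0 ^ d1

    # Crash mid-stripe-update: the new d0 hits disk 0, but the
    # matching parity write never makes it out.
    d0 = b"NEW data on disk 0"    # written before the crash
    # parity = xor(d0, d1)        # lost in the crash - parity is stale

    # Disk 1 fails later.  RAID5 rebuilds it as d0 ^ parity, which no
    # longer equals the real d1 - the *untouched* block comes back bad.
    assert xor(d0, parity) != d1  # silent corruption of old data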

If your hard drive writes something other than what the OS told it to
write, you're also likely to lose a stripe unless you want to repair
it manually.  In theory you could reconstruct the data with each
drive excluded in turn, work out which version is correct, and then
repeat that for every damaged stripe - roughly as in the sketch
below.
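
For what it's worth, that exclusion approach would look something
like this minimal sketch (plain XOR parity, invented helper names -
real mdadm stripe layouts are more involved):

    # For each drive, compute what its block would be if all of the
    # *other* drives are assumed good.  Without checksums, nothing
    # here can tell you which candidate is right - that's up to you.
    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def candidates(blocks: list[bytes], parity: bytes) -> list[bytes]:
        out = []
        for i in range(len(blocks)):
            others = [b for j, b in enumerate(blocks) if j != i]
            out.append(reduce(xor, others, parity))
        return out

    # Inspect each candidate by hand, pick the plausible one, rewrite
    # the stripe, then repeat for every other damaged stripe.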

Sure, both failure modes are rare, but they still exist.  The fact
that you haven't personally experienced them doesn't change that.

If I had been using mdadm a few weeks ago I'd be restoring from
backups.  The software would have worked fine, but if the disk doesn't
write what it was supposed to write, and the software has no way to
recover from this, then you're up the creek.  With zfs and btrfs you
aren't dependent on the drive hardware detecting and reporting errors.
(In my case there were no errors reported by the drive at all for any
of this.  I suspect the issue was in the SATA port or something else
on the motherboard.  I haven't tried plugging in a scratch drive to
try to debug it, but I will be taking care not to use that port in the
future.)
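
The reason those filesystems can catch this is end-to-end
checksumming: every block is verified against a separately stored
checksum on read, so a bad block is detected even when the drive
reports success, and a good copy can be fetched from redundancy.  A
minimal sketch of the idea in Python (zlib.crc32 stands in for the
real algorithms - ZFS defaults to fletcher4, btrfs to crc32c - and
the layout is invented):

    import zlib

    def write_block(data: bytes):
        # Store the block together with its checksum.
        return data, zlib.crc32(data)

    def read_block(copies):
        # Return the first copy whose contents still match the stored
        # checksum; a real fs would also rewrite ("heal") bad copies.
        for data, checksum in copies:
            if zlib.crc32(data) == checksum:
                return data
        raise IOError("all copies failed checksum verification")

    good = write_block(b"important data")
    bad  = (b"importZnt data", good[1])  # rot the drive never reported
    assert read_block([bad, good]) == b"important data"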

>
> I think the reasoning for using its own caching is that block
> caching at the VFS layer just can't be done efficiently for a CoW
> file system with scrubbing and everything else.

Btrfs doesn't appear to have any issues with the regular page cache
despite being CoW, so there may or may not be truth to that.

However, I think the real reason ZFS on Linux uses its own cache is
simply that it made porting the code over wholesale easier.  The goal
of migrating the existing code with minimal changes was to reduce the
risk of regressions, which is why ZFS on Linux works as well as it
does.  Replacing the caching layer would take a long time and carry a
lot of risk of introducing errors along the way, so it just isn't as
high a priority as getting it running in the first place.  Plus ZFS
has a bunch of read and write cache features (the ARC and L2ARC on
the read side, the ZIL/SLOG on the write side) which aren't really
built into the kernel as far as I'm aware.  Sure, there is bcache and
so on, but that isn't part of the regular kernel page cache.
Rewriting ZFS to do things the Linux way would be down the road, and
it wouldn't help get it into the mainline kernel anyway due to the
licensing issue.


-- 
Rich
