On Sat, Sep 16, 2017 at 1:48 PM, Kai Krakow <hurikha...@gmail.com> wrote:
> On Sat, 16 Sep 2017 10:05:21 -0700, Rich Freeman <ri...@gentoo.org>
> wrote:
>
>> My main concern with xfs/ext4 is that neither provides on-disk
>> checksums or protection against the raid write hole.
>
> Btrfs has suffered from the same RAID5 write hole problem for years. I
> always planned to move to RAID5 later (which is why I have 3 disks),
> but I fear this won't be fixed any time soon due to design decisions
> made too early.
Btrfs RAID5 simply doesn't work. I don't think it was ever able to
recover from a failed drive - it really only exists so that they can
develop it.

>> I just switched motherboards a few weeks ago and either a connection
>> or a SATA port was bad, because one of my drives was getting a TON of
>> checksum errors on zfs. I moved it to an LSI card and scrubbed, and
>> while it took forever and the system degraded the array more than
>> once due to the high error rate, eventually it patched up all the
>> errors and now the array is working without issue. I didn't suffer
>> more than a bit of inconvenience, but with even mdadm raid1 I'd have
>> had a HUGE headache trying to recover from that (doing who knows how
>> much troubleshooting before realizing I had to do a slow full restore
>> from backup with the system down).
>
> I found md raid not very reliable in the past, but I haven't tried it
> again in years, so this may have changed. I only remember that it
> destroyed a file system after an unclean shutdown more than once,
> which is not what I expect from RAID1. Other servers with file systems
> on bare metal survived this just fine.

mdadm provides no protection against either silent corruption or the
raid write hole. If your system dies/panics/etc while it is in the
middle of writing a stripe, then whatever previously occupied the space
in that stripe is likely to be lost. If your hard drive writes something
to disk other than what the OS told it to write, you'll also be likely
to lose a stripe unless you want to try to repair it manually (in theory
you could process the data manually, excluding each of the drives in
turn, work out which version of the data is correct, and then do that
for every damaged stripe).

Sure, both failure modes are rare, but they still exist. The fact that
you haven't personally experienced them doesn't change that. If I had
been using mdadm a few weeks ago, I'd be restoring from backups.
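To make the write hole concrete, here is a toy model in Python. It is
purely illustrative (real arrays operate on disk blocks, not small
integers, and mdadm has mitigations like write-intent bitmaps and
journaling), but it shows the core problem: if a crash lands between the
data write and the parity write, the stripe is silently inconsistent,
and a later rebuild reconstructs garbage.

```python
# Toy model of the RAID5 "write hole" (illustrative sketch, not mdadm).
# A stripe = N data blocks + 1 XOR parity block.

def parity(blocks):
    """XOR parity over the data blocks of one stripe."""
    p = 0
    for b in blocks:
        p ^= b
    return p

# A healthy 3-disk stripe: two data blocks plus parity.
data = [0b1010, 0b0110]
p = parity(data)                 # parity consistent with the data
assert data[0] ^ data[1] == p

# Crash mid-update: the new data block hits the disk, but the matching
# parity update never does.
data[0] = 0b1111                 # new data written
# ... power loss here; p still describes the OLD stripe ...

# Later, the disk holding data[1] fails.  Reconstruction from parity
# quietly returns the wrong contents, and plain parity RAID has no
# checksum to notice.
reconstructed = data[0] ^ p
print(reconstructed == 0b0110)   # False: silent corruption
```

The same logic applies per-stripe on a real array; a scrub can detect
the data/parity mismatch, but without checksums it cannot tell which
side is the correct one.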
The software would have worked fine, but if the disk doesn't write what
it was supposed to write, and the software has no way to recover from
this, then you're up the creek. With zfs and btrfs you aren't dependent
on the drive hardware detecting and reporting errors. (In my case the
drive reported no errors at all for any of this. I suspect the issue was
in the SATA port or something else on the motherboard. I haven't tried
plugging in a scratch drive to debug it, but I will be taking care not
to use that port in the future.)

> I think the reasoning for using its own caching is that block caching
> at the vfs layer simply cannot be done efficiently for a COW file
> system with scrubbing and everything.

Btrfs doesn't appear to have any issues despite being COW. There might
or might not be truth to your statement. However, I think the real
reason that ZFS on Linux uses its own cache is simply that it made it
easier to port the code over wholesale. The goal of migrating the
existing code was to reduce the risk of regressions, which is why ZFS on
Linux works as well as it does. Replacing the caching layer would take a
long time and carry a lot of risk of introducing errors along the way,
so it just isn't as high a priority as getting it running in the first
place.

Plus ZFS has a bunch of both read and write cache features which aren't
really built into the kernel as far as I'm aware. Sure, there is bcache
and so on, but that isn't part of the regular kernel cache. Rewriting
ZFS to do things the Linux way would be down the road, and it wouldn't
help get it into the mainline kernel anyway due to the licensing issue.

--
Rich
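P.S. A toy sketch of why the checksums matter: because zfs/btrfs store a
checksum for each block in the parent metadata, a mirror can tell WHICH
copy is bad and repair it from the good one, even when the drive itself
reports no error. All names below are made up for illustration; this is
not the zfs API.

```python
# Sketch of checksum-based self-healing on a mirror (illustrative only).
import hashlib

def csum(block: bytes) -> bytes:
    """Block checksum, stored separately from the block itself."""
    return hashlib.sha256(block).digest()

# Two-way mirror: same block on two devices, checksum kept in metadata.
good = b"important data"
stored_csum = csum(good)
mirror = [bytearray(good), bytearray(good)]

# A flaky SATA port silently flips bits on device 0; the drive itself
# reports no error at all.
mirror[0][0] ^= 0xFF

def read_and_heal(mirror, stored_csum):
    """Return the block, rewriting any copy whose checksum is wrong."""
    for copy in mirror:
        if csum(bytes(copy)) == stored_csum:
            # Found a good copy; heal the bad ones from it.
            for other in mirror:
                if csum(bytes(other)) != stored_csum:
                    other[:] = copy
            return bytes(copy)
    raise IOError("all copies failed checksum - restore from backup")

block = read_and_heal(mirror, stored_csum)
print(block == good, mirror[0] == mirror[1])  # True True
```

mdadm raid1 has no equivalent step: on a read mismatch between mirror
halves it has no checksum to say which half is right.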