On Sun, Nov 1, 2020 at 6:51 AM John Mellor <john.mel...@gmail.com> wrote:
>
> On 2020-10-31 10:46 p.m., Tim via users wrote:
> > On Sat, 2020-10-31 at 16:11 +0000, lancelasset...@gmail.com wrote:
> >> Will NFS tell you data has been corrupted during the transfer and
> >> write process?
> >
> > Does any filing system? In general, writes to storage are assumed to
> > have worked unless something throws up an error message. Your hard
> > drive could be silently corrupting data as it writes to the drive due
> > to various reasons (defects in its media, bugs in its firmware,
> > glitches from bad power supplies). You'd never know unless your
> > filing system did a sanity check after writing. Some specialised ones
> > might do that, but the average ones don't
>
> You are correct for some very popular filesystems. EXT2/3/4, XFS, NTFS
> etc. will not detect this situation. However, newer filesystems (<10
> years old) do handle silent data glitches, bad RAM and cosmic ray hits
> correctly.
>
> BTRFS has been the default filesystem on SUSE Linux for years, and is
> now the default filesystem on Fedora-33. ZFS is an optional filesystem
> on Ubuntu-20 and all the Berkeley-derived Unixen like FreeBSD, and
> standard on Oracle Linux and Solaris. BTRFS and ZFS are both COW
> filesystems using checksumming of both data and metadata. When you push
> something to the disk(s) with some kind of RAM error or power glitch,
> the first write will be stored with the error, and then the checksummed
> metadata is simply redirected to reference the new stuff. This will
> detect the checksum errors on the data on ZFS with the reread to verify
> the checksum, but I believe that BTRFS will return a successful write
> without one of the RAID configurations set on the pool. If you are
> running one of the RAID configurations, the checksum error will be
> detected before the write completes. To guard against on-disk
> corruption (bit rot), both ZFS and BTRFS will also correct it on the
> next read of that data if you are running the filesystem in one of the
> RAID-z configurations (multiple copies stored), or upon running a
> filesystem integrity check.
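Purely as an illustration of the mechanism described in the quoted text
above (keep a checksum next to every data block and verify it on each
read, so silent corruption is detected instead of being returned as good
data), here is a minimal toy sketch in Python. It is a made-up in-memory
model, not the actual Btrfs or ZFS on-disk format, and zlib.crc32 merely
stands in for whatever checksum a real filesystem uses.

    import zlib

    BLOCK = 4096  # size of one data block in this toy model

    class ChecksummedStore:
        """Toy block store: every write records a checksum, every read verifies it."""

        def __init__(self):
            self.blocks = {}   # block number -> data bytes
            self.csums = {}    # block number -> checksum recorded at write time

        def write(self, n, data):
            assert len(data) == BLOCK
            self.blocks[n] = data
            self.csums[n] = zlib.crc32(data)

        def read(self, n):
            data = self.blocks[n]
            if zlib.crc32(data) != self.csums[n]:
                # A real filesystem would try another copy here (RAID1 / RAID-Z)
                # or hand EIO up to the application.
                raise OSError(f"checksum mismatch in block {n}")
            return data

    store = ChecksummedStore()
    store.write(0, b"\x00" * BLOCK)
    store.blocks[0] = b"\x01" + b"\x00" * (BLOCK - 1)  # simulate a silent bitflip
    try:
        store.read(0)
    except OSError as err:
        print(err)  # checksum mismatch in block 0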
Short story: When an application receives EIO (input/output error) from
the storage layer, it's up to the application how to handle it. One of
the more common ways EIO happens is when a bad sector is read and the
drive itself reports an uncorrectable read error. That propagates up
through the various layers to the application as EIO. I'm told NFS will
substitute zeros for any block the file system reports EIO on, and
continue reading subsequent blocks. And finally, ddrescue by default
reads the whole file (via the mounted file system, not pointing it at
raw sectors), but with the bad 4KiB blocks truncated out. The bad blocks
are simply missing; there is no gap filled with zeros or some other
pattern unless you ask ddrescue for that.

Long story: Btrfs by default computes and stores a 4-byte CRC-32C per
data block. Data blocks are 4KiB on x86, and metadata blocks (the fs
itself) are 16KiB by default. If a data block fails checksum
verification, EIO is reported to the requesting application, and that
application can do basically whatever it wants. For example, if I 'cat'
a log that happens to have a corrupted data block:

2020-09-20 cat: irc.freenode.#btrfs.weechatlog: Input/output error

And cat stops at that block; it does not continue reading the rest of
the file. At the time the EIO happens, I get a few kernel messages, but
I'll just list one:

[155108.915822] BTRFS warning (device sdb2): csum failed root 349 ino 7176 off 10989568 csum 0x4d3d334d expected csum 0xb210f188 mirror 1

That's a bit of a secret decoder ring, as kernel messages often are. But
it translates into "checksum failure in subvolume/snapshot ID 349, inode
7176, at logical byte 10989568, with the checksum just computed during
the read versus the one originally recorded in the csum tree, and which
copy (mirror) is affected - and by the way, this is just a warning, not
some critical problem with the file system."

The same inode number can exist multiple times on Btrfs: each subvolume
has its own pool of inodes, hence the reference to both subvolume ID and
inode number.

Two more possibilities exist:

- ignore the checksum verification and just give me all the blocks as
  they are, including corruption
- same as above, but the file is compressed (a feature of Btrfs)

The first is a reference to 'btrfs restore', which is an offline
scraping tool to get data out no matter the damage. The UI is just OK
and the UX is ugly, but coming from a long career in data corruption,
noise, and recovery - it's professional grade. There's a very good
chance of getting your data out of a Btrfs file system *if* you have the
patience. However, I strongly recommend backups, no matter the file
system, so you can avoid the pleasure of 'btrfs restore'. This is going
to get better with kernel 5.11, but I'm gonna save talking about that
feature for another time. (It's not vaporware, it's merged, but we don't
have it yet, so there's no point talking about it yet.)

The second can complicate recovery because one bitflip could mean an
entire 128KiB block of data is corrupted. 128KiB is the compression
block size, i.e. a maximum of 128KiB of uncompressed data gets
compressed at a time when compression is enabled on Btrfs. In my example
above it turns out compression is used, and the offline scrape tool does
a worse job of recovery than just copying the file out with ddrescue (on
the mounted file system, so the kernel is doing the heavy lifting and
ddrescue is merely continuing to read all subsequent blocks despite EIO
on two of them).
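To make the "it's up to the application" point concrete, here is a rough
Python sketch, for illustration only, of the behaviour described above
when reading through a mounted file system despite bad blocks: read 4KiB
at a time, and when a block comes back EIO, either drop it or pad with
zeros and keep going. The name salvage_file and its fill_zeros option
are invented for this example; this is not how ddrescue is actually
implemented.

    import errno
    import os

    BLOCK = 4096  # Btrfs data block size on x86

    def salvage_file(src_path, dst_path, fill_zeros=False):
        """Copy src to dst 4KiB at a time, skipping (or zero-filling) blocks
        that return EIO. Returns the byte offsets of the unreadable blocks."""
        bad_offsets = []
        src = os.open(src_path, os.O_RDONLY)
        try:
            size = os.fstat(src).st_size
            with open(dst_path, "wb") as dst:
                offset = 0
                while offset < size:
                    want = min(BLOCK, size - offset)
                    try:
                        data = os.pread(src, want, offset)
                    except OSError as err:
                        if err.errno != errno.EIO:
                            raise
                        bad_offsets.append(offset)
                        # Unreadable block: leave it out entirely, or pad with zeros.
                        data = b"\x00" * want if fill_zeros else b""
                    dst.write(data)
                    offset += want
        finally:
            os.close(src)
        return bad_offsets

    # Example usage (hypothetical paths):
    # bad = salvage_file("irc.freenode.#btrfs.weechatlog", "/tmp/salvaged.log")
    # print("unreadable blocks at offsets:", bad)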
But I am suffering the loss of only two 4KiB blocks: they are corrupt,
EIO is issued, and the contents of those blocks are not handed over by
the Btrfs kernel code (yet).

While I'm slightly into the weeds at this point: it's actually not so
common to have ECC memory on the desktop, and it's rare on laptops.
Memory bitflips are a real thing, and while rare, we will absolutely see
them in Fedora with Btrfs. It's just one of those things to get used to.
There are other sources of bitflips too: bad cables, drive firmware
bugs, and the memory in the drive itself.

We've got two rather detailed reports of bad-RAM-caused bitflips caught
by Btrfs since it was made the default in July (starting in Rawhide),
and one case of bad-media-caused corruption (above). For anyone who
likes reading alien autopsy reports, this most recent bad RAM one is
quite straightforward:

https://bugzilla.redhat.com/show_bug.cgi?id=1882875

What's noteworthy is that Btrfs doesn't assign blame. It just states the
facts. It's up to us to figure out the puzzle, and it is a learnable
skill.

--
Chris Murphy