On Sun, Nov 1, 2020 at 6:51 AM John Mellor <john.mel...@gmail.com> wrote:
>
> On 2020-10-31 10:46 p.m., Tim via users wrote:
> > On Sat, 2020-10-31 at 16:11 +0000, lancelasset...@gmail.com wrote:
> >> Will NFS tell you data has been corrupted during the transfer and
> >> write process?
> >
> > Does any filing system? In general, writes to storage are assumed to
> > have worked unless something throws up an error message. Your hard
> > drive could be silently corrupting data as it writes to the drive due
> > to various reasons (defects in its media, bugs in its firmware,
> > glitches from bad power supplies). You'd never know unless your
> > filing system did a sanity check after writing. Some specialised ones
> > might do that, but the average ones don't
>
> You are correct for some very popular filesystems. EXT2/3/4, XFS, NTFS
> etc. will not detect this situation. However, newer filesystems (<10
> years old) do handle silent data glitches, bad RAM and cosmic ray hits
> correctly.
>
> BTRFS has been the default filesystem on SUSE Linux for years, and is
> now the default filesystem on Fedora-33. ZFS is an optional filesystem
> on Ubuntu-20 and all the Berkeley-derived Unixen like FreeBSD, and
> standard on Oracle Linux and Solaris. BTRFS and ZFS are both COW
> filesystems using checksumming of both data and metadata. When you push
> something to the disk(s) with some kind of RAM error or power glitch,
> the first write will be stored with the error, and then the checksummed
> metadata is simply redirected to reference the new stuff. This will
> detect the checksum errors on the data on ZFS with the reread to verify
> the checksum, but I believe that BTRFS will return a successful write
> without one of the RAID configurations set on the pool. If you are
> running one of the RAID configurations, the checksum error will be
> detected before the write completes. To guard against on-disk
> corruption (bit rot), both ZFS and BTRFS will also correct it on the
> next read of that data if you are running the filesystem in one of the
> RAID-z configurations (multiple copies stored), or upon running a
> filesystem integrity check.
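Purely as an illustration of the mechanism described in the quoted text
above (keep a checksum next to every data block and verify it on each
read, so silent corruption is detected instead of being returned as good
data), here is a minimal toy sketch in Python. It is a made-up in-memory
model, not the actual Btrfs or ZFS on-disk format, and zlib.crc32 merely
stands in for whatever checksum a real filesystem uses.

    import zlib

    BLOCK = 4096  # size of one data block in this toy model

    class ChecksummedStore:
        """Toy block store: every write records a checksum, every read verifies it."""

        def __init__(self):
            self.blocks = {}   # block number -> data bytes
            self.csums = {}    # block number -> checksum recorded at write time

        def write(self, n, data):
            assert len(data) == BLOCK
            self.blocks[n] = data
            self.csums[n] = zlib.crc32(data)

        def read(self, n):
            data = self.blocks[n]
            if zlib.crc32(data) != self.csums[n]:
                # A real filesystem would try another copy here (RAID1 / RAID-Z)
                # or hand EIO up to the application.
                raise OSError(f"checksum mismatch in block {n}")
            return data

    store = ChecksummedStore()
    store.write(0, b"\x00" * BLOCK)
    store.blocks[0] = b"\x01" + b"\x00" * (BLOCK - 1)  # simulate a silent bitflip
    try:
        store.read(0)
    except OSError as err:
        print(err)  # checksum mismatch in block 0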
Short story: When an application receives EIO (input/output error) from
the storage layer, it's up to the application how to handle it. One of
the more common ways EIO happens is when a bad sector is read and the
drive itself reports an uncorrectable read error. That propagates up
through the various layers to the application as EIO. I'm told NFS will
substitute zeros for any block the file system reports EIO on, and
continue reading subsequent blocks. And finally, ddrescue by default
reads the whole file (via the mounted file system, not pointing it at
raw sectors), but with the bad 4KiB blocks truncated out. The bad blocks
are simply missing; there is no gap filled with zeros or some other
pattern unless you ask ddrescue for that.

Long story: Btrfs by default computes and stores a 4-byte CRC-32C per
data block. Data blocks are 4KiB on x86, and metadata blocks (the fs
itself) are 16KiB by default. If a data block fails checksum
verification, EIO is reported to the requesting application, and that
application can do basically whatever it wants. For example, if I 'cat'
a log that happens to have a corrupted data block:

2020-09-20 cat: irc.freenode.#btrfs.weechatlog: Input/output error

And cat stops at that block; it does not continue reading the rest of
the file. At the time the EIO happens, I get a few kernel messages, but
I'll just list one:

[155108.915822] BTRFS warning (device sdb2): csum failed root 349 ino 7176 off 10989568 csum 0x4d3d334d expected csum 0xb210f188 mirror 1

That's a bit of a secret decoder ring, as kernel messages often are. But
it translates into "checksum failure in subvolume/snapshot ID 349, inode
7176, at logical byte 10989568, with the checksum just computed during
the read versus the one originally recorded in the csum tree, and which
copy (mirror) is affected - and by the way, this is just a warning, not
some critical problem with the file system."

The same inode number can exist multiple times on Btrfs: each subvolume
has its own pool of inodes, hence the reference to both subvolume ID and
inode number.

Two more possibilities exist:

- ignore the checksum verification and just give me all the blocks as
  they are, including corruption
- same as above, but the file is compressed (a feature of Btrfs)

The first is a reference to 'btrfs restore', which is an offline
scraping tool to get data out no matter the damage. The UI is just OK
and the UX is ugly, but coming from a long career in data corruption,
noise, and recovery - it's professional grade. There's a very good
chance of getting your data out of a Btrfs file system *if* you have the
patience. However, I strongly recommend backups, no matter the file
system, so you can avoid the pleasure of 'btrfs restore'. This is going
to get better with kernel 5.11, but I'm gonna save talking about that
feature for another time. (It's not vaporware, it's merged, but we don't
have it yet, so there's no point talking about it yet.)

The second can complicate recovery because one bitflip could mean an
entire 128KiB block of data is corrupted. 128KiB is the compression
block size, i.e. a maximum of 128KiB of uncompressed data gets
compressed at a time when compression is enabled on Btrfs. In my example
above it turns out compression is used, and the offline scrape tool does
a worse job of recovery than just copying the file out with ddrescue (on
the mounted file system, so the kernel is doing the heavy lifting and
ddrescue is merely continuing to read all subsequent blocks despite EIO
on two of them).
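To make the "it's up to the application" point concrete, here is a rough
Python sketch, for illustration only, of the behaviour described above
when reading through a mounted file system despite bad blocks: read 4KiB
at a time, and when a block comes back EIO, either drop it or pad with
zeros and keep going. The name salvage_file and its fill_zeros option
are invented for this example; this is not how ddrescue is actually
implemented.

    import errno
    import os

    BLOCK = 4096  # Btrfs data block size on x86

    def salvage_file(src_path, dst_path, fill_zeros=False):
        """Copy src to dst 4KiB at a time, skipping (or zero-filling) blocks
        that return EIO. Returns the byte offsets of the unreadable blocks."""
        bad_offsets = []
        src = os.open(src_path, os.O_RDONLY)
        try:
            size = os.fstat(src).st_size
            with open(dst_path, "wb") as dst:
                offset = 0
                while offset < size:
                    want = min(BLOCK, size - offset)
                    try:
                        data = os.pread(src, want, offset)
                    except OSError as err:
                        if err.errno != errno.EIO:
                            raise
                        bad_offsets.append(offset)
                        # Unreadable block: leave it out entirely, or pad with zeros.
                        data = b"\x00" * want if fill_zeros else b""
                    dst.write(data)
                    offset += want
        finally:
            os.close(src)
        return bad_offsets

    # Example usage (hypothetical paths):
    # bad = salvage_file("irc.freenode.#btrfs.weechatlog", "/tmp/salvaged.log")
    # print("unreadable blocks at offsets:", bad)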
But I am suffering the loss of only two 4KiB blocks: they are corrupt,
EIO is issued, and the contents of those blocks are not handed over by
the Btrfs kernel code (yet).

While I'm slightly into the weeds at this point: it's actually not so
common to have ECC memory on the desktop, and it's rare on laptops.
Memory bitflips are a real thing, and while rare, we will absolutely see
them in Fedora with Btrfs. It's just one of those things to get used to.
There are other sources of bitflips too: bad cables, drive firmware
bugs, and the memory in the drive itself.

We've got two rather detailed reports of bad-RAM-caused bitflips caught
by Btrfs since it was made the default in July (starting in Rawhide),
and one case of bad-media-caused corruption (above). For anyone who
likes reading alien autopsy reports, this most recent bad RAM one is
quite straightforward:

https://bugzilla.redhat.com/show_bug.cgi?id=1882875

What's noteworthy is that Btrfs doesn't assign blame. It just states the
facts. It's up to us to figure out the puzzle, and it is a learnable
skill.

--
Chris Murphy