On Thu, Sep 12, 2019 at 2:09 PM Christoph Anton Mitterer
<cales...@scientia.net> wrote:
>
> Hi.
>
> First, thanks for finding&fixing this :-)
>
>
> On Thu, 2019-09-12 at 08:50 +0100, Filipe Manana wrote:
> > 1) either a hang when committing a transaction, reported by several
> > users recently and hit it myself too twice when running fstests (test
> > case generic/475 and generic/561) after I upgraded my development
> > branch from a 5.1.x kernel to a 5.3-rcX kernel. If this happens you
> > risk no corruption, still the hang is very inconvenient of course, as
> > you have to reboot.
>
> Okay inconvenient, but not so bad if there is no corruption risk.
>
>
> > 2) writeback for some btree nodes may never be started and we end up
> > committing a transaction without noticing that. This is really
> > serious and that will lead to the "parent transid verify failed
> > on ..." messages.
>
> As some people have already pointed out, it will be infeasible for many
> end users to downgrade (no security updates) or manually patch (well,
> end-users).

Yes, but I can't do anything about that. I'm not skilled enough to
build a time machine to go back in time :)

>
> Can you elaborate under which circumstances this problem occurs,
> whether there are any intermediate workarounds, and whether it's always
> noticed (i.e. no silent corruption)?

It can happen whenever a transaction is being committed (or when
committing the fsync log). Every fs is at risk, unless it's always
mounted read-only and with -o nologreplay.

A btree node/leaf (extent buffer) that is dirty in memory needs to be
written to disk. This always happens at transaction commit time, but it
can also happen before that, if for some reason writeback on the btree
inode happens (due to reclaim, system under memory pressure, etc).

If the writeback happens only at transaction commit time, and if one of
the node's pages is locked (not necessarily by btrfs, it can happen
anywhere in the memory management subsystem, page migration for
example), we end up skipping the writeback (starting the process of
writing what's in memory to disk) of that node. This is case 2), the
corruption with the error messages "parent transid verify failed ..."
in dmesg/syslog after mounting the filesystem again. This is very
likely (we can never rule out other bugs, be it in btrfs or some other
layer, or even hardware/firmware) what Swâmi ran into, since he never
had problems with 5.1 and older kernel versions and has been using the
same hardware for a long time.

For case 1), the hang, it happens if writeback also happened before the
transaction commit. At transaction commit we trigger writeback again
for the same node(s), and here we hang because of the previous attempt.

Two people reported the hang yesterday here on the list, plus at least
one more some weeks ago. I hit it myself once last week and once 2
evenings ago with test cases from fstests after changing my development
branch from 5.1 to 5.3-rcX.
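
As a rough illustration of case 2), here is a toy model in Python. It
is only a sketch of the sequence of events described above, not the
btrfs code and not the fix; every name in it (Page, Node,
try_start_writeback, commit_transaction) is made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # True while some other code path (e.g. page migration) holds the page lock.
    locked: bool = False


@dataclass
class Node:
    """Toy stand-in for a dirty btree node/leaf (extent buffer)."""
    name: str
    pages: list
    dirty: bool = True
    on_disk: bool = False


def try_start_writeback(node):
    """Non-blocking attempt: silently give up if any page is locked."""
    if any(page.locked for page in node.pages):
        return  # writeback skipped, and the caller is not told
    node.on_disk = True
    node.dirty = False


def commit_transaction(nodes):
    """Start writeback for every dirty node, then treat the commit as done.

    Returns the names of the nodes that never reached disk. In this toy
    model the commit itself never looks at that list, which mirrors
    "committing a transaction without noticing".
    """
    for node in nodes:
        if node.dirty:
            try_start_writeback(node)
    return [node.name for node in nodes if not node.on_disk]


if __name__ == "__main__":
    nodes = [
        Node("leaf A", [Page(), Page()]),
        Node("node B", [Page(), Page(locked=True)]),  # one page locked elsewhere
    ]
    print("never written:", commit_transaction(nodes))
```

In the model, "node B" stays dirty in memory while the commit is
treated as complete, so whatever points at it on disk would refer to
contents that were never written.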

To hit any of the problems, sure, you still need to have some bad luck,
but it's impossible to tell how likely you are to run into it. It
depends on so many things: workloads, system configuration, etc. No
matter how likely (and the likelihood will not be the same for
everyone), it's serious because if it happens you can get a corrupt
filesystem.

>
>
> Thanks,
> Chris.
>

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”