Re: Btrfs filesystem trashed after OOM scenario

Austin S. Hemmelgarn Thu, 26 Sep 2019 04:27:43 -0700

On 2019-09-25 00:25, Nick Bowler wrote:

On Tue, Sep 24, 2019, 18:34 Chris Murphy, <li...@colorremedies.com> wrote:

On Tue, Sep 24, 2019 at 4:04 PM Nick Bowler <nbow...@draconx.ca> wrote:

- Running Linux 5.2.14, I pushed this system to OOM; the oom killer
ran and killed some userspace tasks.  At this point many of the
remaining tasks were stuck in uninterruptible sleeps.  Not really
worried, I turned the machine off and on again to just get everything
back to normal.  But I guess now that everything had gone horribly
wrong already at this point...


Yeah the kernel oomkiller is pretty much only about kernel
preservation, not user space preservation.


Indeed I am not bothered at all by needing to turn it off and on again
in this situation.  But filesystems being completely trashed is
another matter...

- Upon reboot, the system boots OK but now btrfs is throwing zillions
of checksum errors.  After some time the filesystem is remounted
readonly and I lose the ability to interact with the system at all, so
it gets powered off.

- Now the filesystem is unmountable.


The transid errors look like they might be caused by the 5.2 regression

https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u

Fixed since 5.2.15 and 5.3.0.


Yikes, so my decision to update to the latest kernel two weeks ago
perhaps was a very bad one.  Should've stuck with 4.19.y I guess.

So if you're willing to blow shit up again, you can try to reproduce
with one of those.


Well I could try but it sounds like this might be hard to reproduce...

I was also doing oomkiller blow shit up tests a few weeks ago with
these same problem kernels and never hit this bug, or any others. I
also had to do a LOT of force power offs because the system just
became totally wedged in and I had no way of estimating how long it
would be for recovery so after 30 minutes I hit the power button. Many
times. Zero corruptions. That's with a single Samsung 840 EVO in a
laptop relegated to such testing.


Just a thought... the system was alive but I was able to briefly
inspect the situation and notice that tasks were blocked and
unkillable... until my shell hung too and then I was hosed.  But I
didn't hit the power button but rather rebooted with sysrq+e, sysrq+u,
sysrq+b.  Not sure if that makes a difference.

Not sure if this mattered, but as a general rule, unless you're dealing
with an issue with the disk, you should always issue sysrq+s and wait a
few seconds (or until the message that all filesystems have been synced
shows up if you're on the console and can see kernel messages) before
issuing a sysrq+u. Remounting all filesystems read-only through sysrq+u
does not reliably flush caches before forcing everything read-only.
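
For anyone who wants to script that ordering rather than type the key combinations, here is a rough sketch in Python. It writes the same command keys to /proc/sysrq-trigger, the procfs interface to SysRq, which needs root and has to be permitted by the kernel.sysrq setting. The function names and the wait times are my own assumptions and not anything the kernel prescribes. In particular, the 5 second pause after "s" is a guess: the script cannot tell when the sync has actually finished, so watch the console if you can.

```python
import time

# Standard Linux procfs interface to SysRq; writing requires root.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"

# (command key, seconds to wait afterwards). The wait times are
# illustrative assumptions, not values taken from kernel documentation.
SAFE_REBOOT_STEPS = [
    ("e", 2.0),  # send SIGTERM to all processes except init
    ("s", 5.0),  # emergency sync; give the flush time to complete
    ("u", 2.0),  # emergency remount of all filesystems read-only
    ("b", 0.0),  # immediate reboot, no further sync or unmount
]


def send_sysrq(key, trigger_path=SYSRQ_TRIGGER):
    """Write a single SysRq command key to the trigger file."""
    with open(trigger_path, "w") as trigger:
        trigger.write(key)


def safe_reboot(send=send_sysrq, sleep=time.sleep, steps=SAFE_REBOOT_STEPS):
    """Issue the SysRq keys in order, syncing before the read-only remount.

    `send` and `sleep` are injectable so the ordering can be checked
    without touching the real trigger file. Returns the keys issued.
    """
    issued = []
    for key, wait in steps:
        send(key)
        issued.append(key)
        if wait:
            sleep(wait)
    return issued
```

Calling safe_reboot() as root will really reboot the machine, so a dry run such as safe_reboot(send=print, sleep=lambda s: None) is a safer way to confirm the order first. The only point of the sketch is that "s" goes out, with a pause, before "u".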

Might be a different bug. Not sure. But also, this is with

[  347.551595] CPU: 3 PID: 1143 Comm: mount Not tainted 4.19.34-1-lts #1


So I don't know how an older kernel will report on the problem caused
by the 5.2 bug.


This is the kernel from systemrescuecd.  I can try taking a disk image
and mounting on another machine with a newer linux version.

Thanks,
   Nick
