On 2019-09-25 00:25, Nick Bowler wrote:
On Tue, Sep 24, 2019, 18:34 Chris Murphy, <li...@colorremedies.com> wrote:
On Tue, Sep 24, 2019 at 4:04 PM Nick Bowler <nbow...@draconx.ca> wrote:
- Running Linux 5.2.14, I pushed this system to OOM; the oom killer
ran and killed some userspace tasks. At this point many of the
remaining tasks were stuck in uninterruptible sleeps. Not really
worried, I turned the machine off and on again to just get everything
back to normal. But I guess now that everything had already gone
horribly wrong at this point...
Yeah the kernel oomkiller is pretty much only about kernel
preservation, not user space preservation.
Indeed I am not bothered at all by needing to turn it off and on again
in this situation. But filesystems being completely trashed is
another matter...
- Upon reboot, the system boots OK but now btrfs is throwing zillions
of checksum errors. After some time the filesystem is remounted
readonly and I lose the ability to interact with the system at all, so
it gets powered off.
- Now the filesystem is unmountable.
The transid errors look like they might be caused by the 5.2 regression
https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u
Fixed since 5.2.15 and 5.3.0.
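For anyone wanting to check a kernel version string against the fix points
mentioned above, here is a minimal sketch. It assumes (this is not confirmed
in the thread) that every 5.2.x release before 5.2.15 carries the regression
and that nothing older than 5.2 does. The function name is made up for
illustration.

```python
import re

# First stable release in the 5.2 series with the fix, per the thread.
FIXED_IN_5_2 = (5, 2, 15)


def is_in_affected_range(release: str) -> bool:
    """Return True if `release` looks like a 5.2.x kernel older than 5.2.15.

    Assumption: all of 5.2.0 .. 5.2.14 are affected; 5.3.0 and later,
    and anything before 5.2, are treated as not affected.
    """
    m = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3)) if m.group(3) is not None else 0
    # Tuple comparison is lexicographic, which matches version ordering here.
    return (5, 2, 0) <= (major, minor, patch) < FIXED_IN_5_2
```

Feeding it the output of `platform.release()` (or `uname -r`) should work for
the usual distro suffixes such as "-1-lts", since only the leading numeric
part is parsed.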
Yikes, so my decision to update to the latest kernel two weeks ago
perhaps was a very bad one. Should've stuck with 4.19.y I guess.
So if you're willing to blow shit up again, you can try to reproduce
with one of those.
Well I could try but it sounds like this might be hard to reproduce...
I was also doing oomkiller blow shit up tests a few weeks ago with
these same problem kernels and never hit this bug, or any others. I
also had to do a LOT of force power offs because the system just
became totally wedged in and I had no way of estimating how long it
would be for recovery so after 30 minutes I hit the power button. Many
times. Zero corruptions. That's with a single Samsung 840 EVO in a
laptop relegated to such testing.
Just a thought... the system was still alive and I was able to briefly
inspect the situation and notice that tasks were blocked and
unkillable... until my shell hung too and then I was hosed. But I
didn't hit the power button but rather rebooted with sysrq+e, sysrq+u,
sysrq+b. Not sure if that makes a difference.
Not sure if this mattered, but as a general rule, unless you're dealing
with an issue with the disk, you should always issue sysrq+s and wait a
few seconds (or until the message that all filesystems have been synced
shows up if you're on the console and can see kernel messages) before
issuing a sysrq+u. Remounting all filesystems read-only through sysrq+u
does not reliably flush caches before forcing everything read-only.
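If a root shell is still responsive, the same order (sync, wait, remount
read-only, reboot) can be driven by writing the key letters to
/proc/sysrq-trigger instead of using the keyboard. The following is only a
rough sketch of that ordering: the function name, the 5 second default wait
and the dry-run default are my own choices, not anything from the kernel
documentation, and a fixed sleep is a stand-in for actually watching for the
sync message on the console.

```python
import time
from pathlib import Path


def sysrq_sequence(keys=("s", "u", "b"),
                   trigger="/proc/sysrq-trigger",
                   sync_wait=5.0,
                   dry_run=True):
    """Issue SysRq keys in order, pausing after 's' so the sync can finish.

    With dry_run=True (the default) nothing is written; the function only
    reports what it would do. With dry_run=False it needs root, and 'b'
    reboots the machine immediately.
    """
    issued = []
    for key in keys:
        if dry_run:
            print(f"would write {key!r} to {trigger}")
        else:
            # One write per key: the kernel acts on each write separately.
            Path(trigger).write_text(key)
        issued.append(key)
        if key == "s":
            # Give the emergency sync time to complete before going on.
            time.sleep(sync_wait)
    return issued
```

Calling `sysrq_sequence()` with the defaults just prints the three steps;
passing `dry_run=False` as root performs them for real, so treat that as a
last resort on a wedged machine.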
Might be a different bug. Not sure. But also, this is with
[ 347.551595] CPU: 3 PID: 1143 Comm: mount Not tainted 4.19.34-1-lts #1
So I don't know how an older kernel will report on the problem caused
by the 5.2 bug.
This is the kernel from systemrescuecd. I can try taking a disk image
and mounting it on another machine with a newer Linux version.
Thanks,
Nick