On Sun, Sep 15, 2019 at 1:46 PM James Harvey <jamespharve...@gmail.com> wrote:
>
> For several weeks, I've been enormously frustrated by a filesystem
> that, under heavy I/O load, puts the processes using it into
> permanent uninterruptible sleep.  Sometimes the filesystem still
> allows reads and only locks up writing processes; sometimes it locks
> up both.
>
> I really, really want this fixed, so will be happy to perform further
> diagnostics.
>
> I've ruled out hardware.  I have two identical Xeon/ECC machines that
> have been functioning perfectly for years (aside from amdgpu crashes
> and a btrfs encryption kernel crash that Qu patched).  I can
> replicate this on both machines, using completely separate hardware.
>
> I am running into this within QEMU, and originally thought it must
> be a virtio issue, but I believe I've ruled that out.  Within the
> VM, after a filesystem lock condition starts, I have always been
> able to dd the entire block device to /dev/null, and I have always
> been able to dd part of the block device to /tmp and re-write it
> back onto itself.  Additionally, as a test, I created a new LVM
> volume and, within the VM, set up LVM and two btrfs volumes on it.
> When the heavy I/O volume locked up, I could still properly use the
> other "dummy" volume that was (from the VM's perspective) on the
> same underlying block device.
>
> I've also had a few VMs under minimal I/O load hit btrfs-related
> "blocked for" problems for several minutes, then come out of it.
>
> The VM is actually given two LVM partitions: one for the btrfs root
> filesystem, and one for the btrfs heavy I/O filesystem.  The root
> filesystem doesn't also start having trouble, so the entire VM
> doesn't lock up.  Since I saw someone else mention this, I'll note
> that no fstrim or dedupe has been involved in my case.
>
> I started to report this as a BTRFS issue about 4 days ago, but saw it
> had already been reported and a proposed patch was given for a
> "serious regression" in the 5.2 kernel.
>
> Because the heavy I/O involves mongodb, which really doesn't do well
> in a crash, and I wasn't sure whether there could be any residual
> filesystem corruption, I decided to create a new VM and rebuild the
> database from its source material.
>
> Running a custom-compiled 5.2.14 WITH Filipe Manana's "fix unwritten
> extent buffers and hangs on future writeback attempts" patch, it ran
> for about a day under heavy I/O.  Then it went into a state where
> anything reading or writing goes into uninterruptible sleep.
>
> Here is everything logged near the beginning of the lockup in the VM.
> The host has never logged a single thing related to any of these
> issues.
>
> Host and VM are up-to-date Arch Linux, running linux 5.2.14 with
> Filipe's patch, and QEMU 4.1.0.
>
> The physical drive is a Samsung 970 EVO 1TB NVMe, and a host LVM
> partition is given to QEMU.  I've used both virtio-blk and
> virtio-scsi.  I don't use QEMU drive caching, because with this drive,
> I've found it's faster not to.
>
>
> View relevant journalctl here: http://ix.io/1VeT
>
>
> You'll see the backtraces look different from those without the
> patch, so I don't actually know whether this is related to the
> original regression that several others reported.

It's a different problem.

An fsync task is trying to get an inode that is being freed while
holding a transaction open, and the VFS waits (at btrfs_iget5() ->
find_inode()) for the inode to be freed before returning.
Another task, which is freeing the inode and running eviction, is
trying to commit the transaction, but it can't, because the fsync
task is holding the transaction open.
So there's a deadlock between those two tasks.
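
To make the cycle concrete, here is a minimal userspace analogy of
the deadlock using pthreads. This is only an illustrative sketch with
made-up names, not btrfs code: a mutex stands in for "transaction
held open" and a condition variable stands in for the inode finishing
its eviction.

/*
 * Illustrative sketch only (hypothetical names, not btrfs code).
 * Build with: gcc -pthread deadlock.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t transaction = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inode_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  inode_freed = PTHREAD_COND_INITIALIZER;
static bool freeing_done = false;

/* Like the fsync task: holds the transaction open, then waits for
 * another inode to finish being freed (btrfs_iget5() -> find_inode()). */
static void *fsync_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&transaction);       /* transaction held open */
    pthread_mutex_lock(&inode_lock);
    while (!freeing_done)                   /* waits forever */
        pthread_cond_wait(&inode_freed, &inode_lock);
    pthread_mutex_unlock(&inode_lock);
    pthread_mutex_unlock(&transaction);
    return NULL;
}

/* Like the eviction task: must commit the transaction before it can
 * finish freeing the inode, but the fsync task still holds it open. */
static void *evict_task(void *arg)
{
    (void)arg;
    sleep(1);                               /* let fsync_task win the race */
    pthread_mutex_lock(&transaction);       /* blocks forever: deadlock */
    pthread_mutex_lock(&inode_lock);
    freeing_done = true;
    pthread_cond_broadcast(&inode_freed);   /* never reached */
    pthread_mutex_unlock(&inode_lock);
    pthread_mutex_unlock(&transaction);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, fsync_task, NULL);
    pthread_create(&t2, NULL, evict_task, NULL);
    pthread_join(t1, NULL);                 /* hangs, like the blocked tasks */
    pthread_join(t2, NULL);
    puts("never printed");
    return 0;
}

Running it hangs forever, and any further thread that tried to take
the transaction mutex would pile up behind the pair, which matches
all the other blocked tasks in the journal.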

All the other tasks are also trying to commit the transaction, but
they can't because of those two deadlocking tasks.

So getting an inode while fsync'ing another one turns out to be a bad
idea, as it can cause this type of deadlock, which I hadn't thought
of when I added that code in commit [1].
The problem goes all the way back to kernel 4.10: commit [1]
introduced this issue while fixing a performance regression that
could be detected with dbench (it came to my attention through the
SUSE performance team at the time).

Commit [2], while fixing data loss (actually loss of an entire file)
in fsync-after-rename scenarios, introduced that performance
regression, but only because the previous fsync behaviour was
incorrect and led to the file/data loss; that is, we were not doing
enough work to avoid losing the file.  Commit [2]'s approach was
deadlock safe, though, since it was simple and just triggered a
transaction commit instead of trying to get and log other inodes.

That was back in 2016, and a few other commits built on top of it,
adding a few more "get other inode" operations while fsync'ing a
file.  So I'll have to undo the performance optimization and just
fall back to transaction commits whenever that file-loss scenario is
detected.
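
In sketch form, the fallback amounts to something like the following
(hypothetical names and return codes of my choosing, not the actual
patch): the logging code detects the scenario and returns a sentinel
telling fsync to do a full transaction commit instead of calling iget
on other inodes.

/* Hypothetical sketch of the fallback logic, not the actual patch. */
#include <stdbool.h>
#include <stdio.h>

enum log_result {
    LOG_INODE_DONE,    /* fast fsync path: inode logged */
    LOG_COMMIT_TRANS,  /* deadlock-safe fallback */
};

/* Stand-in for detecting the fsync-after-rename / file-loss case,
 * i.e. logging this inode would require loading other inodes. */
static bool would_need_other_inodes(void)
{
    return true;
}

static enum log_result log_inode(void)
{
    if (would_need_other_inodes())
        return LOG_COMMIT_TRANS;  /* never iget other inodes while logging */
    return LOG_INODE_DONE;
}

int main(void)
{
    if (log_inode() == LOG_COMMIT_TRANS)
        printf("fsync: full transaction commit (slower, but safe)\n");
    else
        printf("fsync: fast log path\n");
    return 0;
}

The trade-off is the pre-optimization one: the rare file-loss
scenario pays for a full commit, and every other fsync keeps the fast
path.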

I'll send a fix this week for that. Thanks for the report!

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=44f714dae50a2e795d3268a6831762aa6fa54f55
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56f23fdbb600e6087db7b009775b95ce07cc3195

>
> After that, everything that was reading/writing is hung, and anything
> new that tries to do so also hangs.  The kernel doesn't report any
> more "task... blocked" messages, even for new processes attempting
> reads/writes.



--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
