Re: btrfs filesystem failing with 'No space left on device' after 4 hours

Chris Murphy Wed, 06 Mar 2019 12:39:33 -0800

On Wed, Mar 6, 2019 at 7:29 AM Michael Firth <mfi...@nevion.com> wrote:
>
> Hi,
>
> I have a BTRFS filesystem that seems to have become very ill. After 4 hours 
> of being mounted, it will fail with every write attempt saying "No space left 
> on device".


What program or process is trying to write to the volume? Does even
"touch ~/hello" fail with this message? What happens if you strace the
command? If you strace and write the output to a file, make sure the
file goes to a file system OTHER than root and the file system you've
had this problem with (redirect it to a USB stick or to /tmp). But if
the problem happens even with touch, or when writing some zeros with
dd, you should be able to strace to stdout and copy and paste the
results into a file.
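
If it helps to narrow things down alongside strace, here is a minimal
Python sketch (not a Btrfs tool, just an illustration) that does roughly
what touch plus a small dd would do and reports which step fails and
with which errno. The function name probe_write and the probe file name
are made up for this example.

```python
import errno
import os


def probe_write(directory, name="enospc-probe.tmp"):
    """Try open -> write -> fsync -> close -> unlink inside `directory`.

    Returns (step, errno_name). step is "ok" and errno_name is None when
    every step succeeded; otherwise step names the call that failed.
    """
    path = os.path.join(directory, name)  # hypothetical probe file name
    step = "open"
    fd = None
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        step = "write"
        os.write(fd, b"\0" * 4096)  # one 4 KiB block of zeros
        step = "fsync"
        os.fsync(fd)  # force the data out so a delayed error can surface
        step = "close"
        os.close(fd)
        fd = None
        step = "unlink"
        os.unlink(path)
        return ("ok", None)
    except OSError as exc:
        return (step, errno.errorcode.get(exc.errno, str(exc.errno)))
    finally:
        if fd is not None:
            try:
                os.close(fd)
            except OSError:
                pass
```

Calling probe_write() on a directory of the affected file system and
getting something like ("open", "ENOSPC") or ("write", "ENOSPC") back
would at least say which system call to look for in the strace output.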


>
> Unmounting and remounting the filesystem clears the issue for another 4 hours
>
> From every check I have done, no messages are logged at the point of the 
> failure to "dmesg" or any system log.

The lack of a message doesn't sound like the usual enospc. If the file
system runs out of space, even if it's wrong and it's a bug, Btrfs
will log a warning or info message in dmesg.



>
> The output of the three (why on earth are there three?) disk space commands 
> on the filesystem are:

The three come from different eras, and the legacy 'btrfs filesystem
df' and 'btrfs filesystem show' commands were kept around for script
support I assume. I personally find it ridiculous, but also I know
developers are busy with other important issues. I think there should
be one command for humans and when meaningful improvements are made,
the old way is flat out removed. And there should be a switch to
output machine readable raw spew for scripts and such. But whatever,
not up to me!



>
> From my understanding of the output in this, there don't seem to be any areas 
> that are even close to full. And if it was a genuine full condition, even due 
> to running out of metadata or something, then I wouldn't expect unmounting 
> and remounting to clear the issue.

Yep, it does look suspiciously kernel related. But a lot happens at
umount (you can strace umount and see some of it!), and not all of it
implicates Btrfs as a possible cause. It could be something else. The
lack of Btrfs errors strongly suggests it's not directly related to
Btrfs. The program is getting some idea that there's no space left, so
we need to track down why it thinks this. Btrfs doesn't think that,
because when it does, it reports it to dmesg.
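
To see the free-space figures an ordinary program gets from the kernel,
as opposed to what the btrfs commands print, something like the
following Python sketch could be run before and after the failure
starts. It only wraps the standard statvfs() call; the function name
space_report is made up here, and the inode count may be reported as 0
on some file systems, in which case it carries no meaning.

```python
import os


def space_report(path):
    """Return the statvfs() figures for the file system containing `path`."""
    st = os.statvfs(path)
    return {
        # All byte counts are block counts scaled by the fragment size.
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_bytes": st.f_bfree * st.f_frsize,    # free, incl. reserved
        "avail_bytes": st.f_bavail * st.f_frsize,  # free for normal users
        "free_inodes": st.f_ffree,                 # may be 0 / meaningless
    }
```

If these numbers still show plenty of room while writes fail, that is
one more hint that the error is not an ordinary full condition.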

I don't know anything about Debian and its default kernel console
message logging level, but for some distros I sometimes see that
'dmesg -n 7' needs to be issued before reproducing a problem. Maybe in
your case a hint is just not being retained by dmesg? If you're
running systemd, an alternative is to get the kernel messages for the
current boot from 'journalctl -k', or 'journalctl -k --no-pager', or
with monotonic timestamps from 'journalctl -k -o short-monotonic >
journal.txt', and so on.
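
Once the kernel messages are saved to a file, a small filter makes it
easier to spot anything relevant. This is only a sketch: the keyword
list is my guess at what is worth looking for, not an official list of
Btrfs message strings, and the function name matching_lines is made up.

```python
def matching_lines(lines, keywords=("btrfs", "enospc", "no space left")):
    """Return the lines that contain any keyword, ignoring case.

    `lines` is any iterable of strings, for example an open text file.
    The default keywords are an assumption, adjust them as needed.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(k in lowered for k in wanted):
            hits.append(line.rstrip("\n"))
    return hits
```

For example, matching_lines(open("journal.txt")) on the file written by
the journalctl command above returns only the lines that mention one of
the keywords.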


> Is there any known issue that may cause this behaviour?

This list is for upstream development. You'll find a similar notion on
the ext4 and XFS lists: distro kernels are supported by distros, not
upstream. It's almost pure luck if you get the attention of a developer
who knows something about a 2 year old kernel. And 4.9 is more than 2
years old from a Btrfs development perspective, closer to three years.
Current development is happening on kernel 5.2, while bug fixes are
going into 5.1. For practical purposes it's ordinary to be asked to use
a mainline or stable (5.0 or 4.20) kernel to see if the problem still
happens. If it does, then you've likely discovered an unfixed bug. If
it doesn't happen, you've discovered a fixed bug. For various reasons
it can be difficult to backport all bug fixes, so maybe the fix is in a
4.19 Debian built kernel; you'd have to test it. But the way to limit
the testing as much as possible is to go straight to 5.0. If it happens
there, you've almost certainly found a bug that's not yet fixed.

But even before changing kernels in your case I suggest stracing the
simplest program that reproduces the error, like even touch or cp. We
need to have some idea why the program thinks there's no more space
left while the kernel isn't reporting it.





>
> Is there any way to get more debugging from what is going on?

dmesg -n 7
and reproduce with strace + some simple command; the simpler the
reproduction, the better.



>
> My initial thought was that it might be related to snapshots, as I was 
> generating regular snapshots (for a 'previous versions' feature), and many of 
> the failures were just after a snapshot was created. However, I have now 
> disabled the snapshot creation and I am still seeing regular failures.

Could be one of the edge cases that was fixed in 4.12, but offhand I'd
guess those fixes were backported to 4.9. There have been other edge
case fixes for enospc since then, though. Note that in every kernel
merge cycle, Btrfs sees ~1000-2000 commits. That's a lot of changes to
keep track of in someone's memory when it's literally tens of thousands
of changes since kernel 4.9.


-- 
Chris Murphy
