On Wed, Mar 6, 2019 at 7:29 AM Michael Firth <mfi...@nevion.com> wrote:
>
> Hi,
>
> I have a BTRFS filesystem that seems to have become very ill. After 4 hours
> of being mounted, it will fail with every write attempt saying "No space left
> on device".

What program/process is trying to write to the volume? Does even
"touch ~/hello" fail with this message? What happens if you strace the
command? If you strace and output to a file, make sure you direct the
file to a file system OTHER than root and the file system you've had
this problem with (redirect it to a USB stick or to /tmp). But if the
problem happens even with touch, or writing some zeros with dd, you
should be able to strace straight to the terminal (strace writes its
trace to stderr by default) and copy and paste the results into a file.

>
> Unmounting and remounting the filesystem clears the issue for another 4 hours
>
> From every check I have done, no messages are logged at the point of the
> failure to "dmesg" or any system log.

The lack of a message doesn't sound like the usual enospc. If the file
system runs out of space, even if it's wrong and it's a bug, Btrfs
will log a warning or info message in dmesg.

>
> The output of the three (why on earth are there three?) disk space commands
> on the filesystem are:

The three come from different eras, and the legacy 'btrfs filesystem
df' and 'btrfs filesystem show' commands were kept around for script
support, I assume. I personally find it ridiculous, but I also know
developers are busy with other important issues. I think there should
be one command for humans, and when meaningful improvements are made,
the old way is flat out removed. And there should be a switch to
output machine readable raw spew for scripts and such. But whatever,
not up to me!

>
> From my understanding of the output in this, there don't seem to be any areas
> that are even close to full. And if it was a genuine full condition, even due
> to running out of metadata or something, then I wouldn't expect unmounting
> and remounting to clear the issue.

Yep, it's suspicious that it is kernel related. But there's a lot that
happens at umount (you can strace umount and see some of it!) that's
not just implicating Btrfs as a possible cause. It could be something
else.
The lack of Btrfs errors strongly suggests it's not directly related
to Btrfs. The program is getting some idea that there's no space left,
so we need to track down why it thinks this. Btrfs doesn't think that,
because when it does, it reports it to dmesg.

I don't know anything about Debian and its default kernel console
message logging level, but for some distros I see that 'dmesg -n 7'
needs to be issued before reproducing a problem. Maybe in your case a
hint is just not being retained by dmesg? If you're running systemd,
an alternative is to get kernel messages for the current boot from
'journalctl -k'; or also 'journalctl -k --no-pager', or output with
monotonic time, 'journalctl -k -o short-monotonic > journal.txt', and
so on.

> Is there any known issue that may cause this behaviour?

This list is upstream development. You'll find on the ext4 and XFS
lists a similar notion that distro kernels are supported by distros,
not upstream. It's almost pure luck if you get the attention of a
developer who knows something about a 2 year old kernel. And 4.9 is
more than 2 years old from a Btrfs development perspective, closer to
three years. Current development is happening on kernel 5.2, while bug
fixes are happening for 5.1.

For practical purposes it's ordinary to be asked to use a mainline or
stable (5.0 or 4.20) kernel to see if the problem still happens. If it
does, then you've likely discovered an unfixed bug. If it doesn't
happen, you've discovered a fixed bug. For various reasons it can be
difficult to backport all bug fixes, so maybe the fix is in a 4.19
Debian built kernel; you'd have to test it. But the way to limit the
testing as much as possible is to go straight to 5.0. If it happens
there, you've almost certainly found a bug that's not yet fixed.

But even before changing kernels, in your case I suggest stracing the
simplest program that reproduces the error, like even touch or cp.
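
A minimal sketch of such a reproducer, assuming Python 3 is available on the affected machine. The function name `probe_write` and the `enospc-probe-` file prefix are made up for illustration. It attempts one small write in the given directory and records the errno the kernel returned, next to what statvfs(2) reports at the same moment, so a claimed "No space left on device" can be compared with the free space the kernel is advertising.

```python
import errno
import os
import tempfile


def probe_write(directory, size=4096):
    """Try to create, write and fsync a small file in `directory`.

    Returns a dict describing the outcome; an OSError from the write
    path is captured in the dict rather than raised.
    """
    result = {"ok": True, "errno": None, "error": None}
    path = None
    try:
        fd, path = tempfile.mkstemp(prefix="enospc-probe-", dir=directory)
        try:
            os.write(fd, b"\0" * size)
            # fsync so that a failure deferred to writeback shows up here
            os.fsync(fd)
        finally:
            os.close(fd)
    except OSError as exc:
        result["ok"] = False
        # Symbolic name such as 'ENOSPC', falling back to the raw number
        result["errno"] = errno.errorcode.get(exc.errno, str(exc.errno))
        result["error"] = exc.strerror
    finally:
        # Best-effort cleanup of the probe file
        if path is not None:
            try:
                os.unlink(path)
            except OSError:
                pass

    # What the kernel advertises as free at the time of the attempt
    st = os.statvfs(directory)
    result["bytes_available"] = st.f_bavail * st.f_frsize
    result["files_free"] = st.f_ffree
    return result


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(probe_write(target))
```

Saved under a hypothetical name such as probe.py, it could be run as 'strace -f -o /tmp/probe.trace python3 probe.py /path/to/mountpoint', with the trace file on a file system other than the affected one. If the result shows ENOSPC while bytes_available is still large, that is the mismatch worth reporting along with the trace.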
We need to have some idea why the program thinks there's no more space
left while the kernel isn't reporting it.

>
> Is there any way to get more debugging from what is going on?

Run 'dmesg -n 7', then reproduce with strace plus some simple command.
The simpler the reproduction, the better.

>
> My initial thought was that it might be related to snapshots, as I was
> generating regular snapshots (for a 'previous versions' feature), and many of
> the failures were just after a snapshot was created. However, I have now
> disabled the snapshot creation and I am still seeing regular failures.

Could be one of the edge cases that was fixed in 4.12, but offhand I'd
guess those fixes went back to 4.9. But there have been other edge
case fixes for enospc since then. Note that every merge cycle for the
kernel, Btrfs sees ~1000-2000 commits. It's a lot of changes to keep
track of in someone's memory when it's literally tens of thousands of
changes since kernel 4.9.

--
Chris Murphy