On 2015-11-13 13:42, Hugo Mills wrote:
> On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
>> On 2015-11-13 12:30, Vedran Vucic wrote:
>>> Hello,
>>>
>>> Here are outputs of commands as you requested:
>>>
>>> btrfs fi df /
>>> Data, single: total=8.00GiB, used=7.71GiB
>>> System, DUP: total=32.00MiB, used=16.00KiB
>>> Metadata, DUP: total=1.12GiB, used=377.25MiB
>>> GlobalReserve, single: total=128.00MiB, used=0.00B
>>>
>>> btrfs fi show
>>> Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
>>>     Total devices 1 FS bytes used 8.08GiB
>>>     devid 1 size 18.71GiB used 10.31GiB path /dev/sda6
>>>
>>> btrfs-progs v4.0+20150429
>> Hmm, that's odd. Based on these numbers, you should have no issue at all running a balance. You might be hitting some other bug in the kernel, though I don't remember whether there were any known bugs related to ENOSPC or balance in the version you're running.
> There's one specific bug that shows up with ENOSPC exactly like this. It's in all versions of the kernel, there's no known solution, and no guaranteed mitigation strategy, I'm afraid. Various things have been tried: balancing on its own, or adding a device, balancing, and then removing the device again. Sometimes they seem to help; sometimes they just make the problem worse. We average maybe one report a week or so with this particular set of symptoms.

We should get this listed on the Gotchas page of the Wiki ASAP, especially considering that it's a pretty significant bug (not quite as bad as data corruption, but pretty darn close).
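(To spell out why those numbers look fine: the allocation comes to 8.00GiB of data, plus 2 x 1.12GiB of DUP metadata, plus 2 x 32MiB of DUP system chunks, which matches the 10.31GiB that 'btrfs fi show' reports as used. That leaves roughly 8.4GiB of the 18.71GiB device unallocated, so a balance should have plenty of space to work with.)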
> Vedran, could you try running the balance with just '-dusage=40' and then again with just '-musage=40'? If just one of those fails, it could help narrow things down significantly.
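(For anyone following along, the full commands would look something like this, assuming the filesystem is the one mounted at /:

    btrfs balance start -dusage=40 /
    btrfs balance start -musage=40 /

The usage=40 filter restricts the balance to chunks that are at most 40% full, and -d/-m select data and metadata chunks respectively, so each command exercises only one chunk type.)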
Hugo, is there anything else known about this issue? I don't recall seeing it mentioned before, and a quick web search didn't turn up much. In particular:

1. Is there any known way to reliably reproduce it? I would assume not, as that would likely have led to a mitigation strategy by now. If someone does find a reliable reproducer, please let me know; I've got some significant spare processor time and storage space I could dedicate to getting traces and filesystem images for debugging, and I already have most of the required infrastructure set up for something like this.

2. Is it contagious? That is, if I send a snapshot from an affected filesystem, does the filesystem that receives it become affected too? If we could reproduce the bug, I could answer this within a couple of minutes, using a send/receive test like the one sketched below.

3. Do we have any kind of statistics beyond the rate of reports? For example, does it happen more often on bigger filesystems, or more frequently with certain chunk profiles?
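(A rough sketch of that test, assuming the affected filesystem is mounted at /mnt/affected and a freshly created scratch filesystem at /mnt/scratch; both paths are made up for illustration:

    # Take a read-only snapshot, since send requires one
    btrfs subvolume snapshot -r /mnt/affected /mnt/affected/snap
    # Replicate it onto the scratch filesystem
    btrfs send /mnt/affected/snap | btrfs receive /mnt/scratch
    # If this balance now fails with ENOSPC in the same way,
    # the problem travels with the snapshot
    btrfs balance start /mnt/scratch

)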