On 2015-11-13 13:42, Hugo Mills wrote:
On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
On 2015-11-13 12:30, Vedran Vucic wrote:
Hello,

Here are outputs of commands as you requested:
  btrfs fi df /
Data, single: total=8.00GiB, used=7.71GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=377.25MiB
GlobalReserve, single: total=128.00MiB, used=0.00B

btrfs fi show
Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
         Total devices 1 FS bytes used 8.08GiB
         devid    1 size 18.71GiB used 10.31GiB path /dev/sda6

btrfs-progs v4.0+20150429

Hmm, that's odd. Based on these numbers, you shouldn't be having any
issue at all running a balance: the device is 18.71GiB with only
10.31GiB allocated, which leaves roughly 8.4GiB of unallocated space
for balance to work with. You might be hitting some other bug in the
kernel, but I don't remember whether there were any known bugs related
to ENOSPC or balance in the version you're running.

    There's one specific bug that shows up with ENOSPC exactly like
this. It's in all versions of the kernel, there's no known solution,
and no guaranteed mitigation strategy, I'm afraid. Various things have
been tried: balancing on its own, or adding a device, balancing, and
then removing the device again. Sometimes they seem to help; sometimes
they just make the problem worse.

    We average maybe one report a week or so with this particular
set of symptoms.
We should get this listed on the wiki's Gotchas page ASAP, especially since it's a pretty significant bug (not quite as bad as data corruption, but pretty darn close).

Vedran, could you try running the balance with just '-dusage=40' and then again with just '-musage=40'? If just one of those fails, it could help narrow things down significantly.
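Concretely, assuming the filesystem is mounted at / as in your earlier
output, that would be:

  # Relocate only data chunks that are at most 40% full
  btrfs balance start -dusage=40 /
  # Then the same for metadata chunks
  btrfs balance start -musage=40 /

Running the two filters separately tells us whether the failure comes
from the data chunks or the metadata chunks.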

Hugo, is there anything else known about this issue? I don't recall seeing it mentioned before, and a quick web search didn't turn up much. In particular:

1. Is there any known way to reliably reproduce it? I would assume not, as that would likely have led to a mitigation strategy by now. If someone does find a reliable reproducer, please let me know; I've got some significant spare processor time and storage space I could dedicate to getting traces and filesystem images for debugging, and I already have most of the required infrastructure set up for something like this.

2. Is it contagious? That is, if I send a snapshot from an affected filesystem, does the filesystem that receives the snapshot become affected too? If we could find a way to reproduce it, I could answer this within a couple of minutes of reproducing it (see the sketch after this list).

3. Do we have any statistics beyond the rate of reports? For example, does it happen more often on bigger filesystems, or more frequently with certain chunk profiles?
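For question 2, a minimal contagion test would look something like the
following, assuming an affected filesystem mounted at /mnt/affected and
a known-good one at /mnt/clean (both paths are placeholders):

  # Take a read-only snapshot on the affected filesystem
  btrfs subvolume snapshot -r /mnt/affected/subvol /mnt/affected/snap
  # Send it across to the known-good filesystem
  btrfs send /mnt/affected/snap | btrfs receive /mnt/clean
  # Then try to provoke the same ENOSPC on the receiver, e.g. via balance
  btrfs balance start /mnt/clean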
