Hi,

I have a 6TB btrfs filesystem I created last year (about 60% used). It is the main data disk for my home server, so it gets a lot of use (particularly mail). I take frequent snapshots with btrbk, so there are a lot of them (about 1500 now, although it was roughly double that until I cut back the retention times recently).

A while ago I had a "no space" problem (despite fi df, fi show and fi usage all agreeing I had over 1TB free). But this email isn't about that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on the disk. I was expecting it to have some system impact, but it was a major disaster. The balance didn't just run for a long time, it locked out all activity on the disk for hours. A simple "touch" command to create one file took over an hour.

More seriously, because of that, mail was being lost: all mail delivery timed out, and the timeout error was interpreted as a fatal delivery error, causing mail to be discarded, mailing lists to cancel subscriptions, etc. The balance never completed, of course. I eventually got it cancelled.

I have since managed to complete the "balance -dusage=20" by running it repeatedly with "limit=N" (for small N). I wrote a script to automate that process, and I rerun it every week. If anyone is interested, the script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly

Out of that experience, I have a few thoughts about how balance could be made more friendly.

1) Balance seems to (effectively) block all file (extent?) creation for long periods of time. Would it be possible for balance to make more effort to yield locks, so that other processes/threads can get in and continue to create/write files while it is running?

2) btrfs scrub has options to set the ionice class and priority. Could balance have something similar? Or would reducing the IO priority make things worse because locks would be held for longer?

3) My btrfs-balance-slowly script would work better if there was a time-based limit filter for balance, not just the current count-based filter. I would like to be able to say, for example, run balance for no more than 10 minutes (completing the operation in progress, of course) and then return.

4) My btrfs-balance-slowly script would be more reliable if there was a way to get an indication of whether there is more work to be done, instead of parsing the output for the number of relocations.

Any thoughts about these? Or are there other things I could be doing to reduce the impact on my services?

Graham
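
For reference, the repeated "limit=N" workaround described above, together with the time budget from point 3 and the output parsing from point 4, can be sketched roughly as follows. This is an illustrative sketch only, not the btrfs-balance-slowly script itself. The function names, the BTRFS variable and the default values are made up for the example, and it assumes that balance prints a summary line of the form "Done, had to relocate X out of Y chunks".

```shell
#!/bin/sh
# Illustrative sketch only; NOT the actual btrfs-balance-slowly script.
# BTRFS, relocated_count and balance_slowly are hypothetical names.
BTRFS=${BTRFS:-btrfs}

# Pull the relocated-chunk count out of a summary line such as
# "Done, had to relocate 3 out of 120 chunks" (assumed output format).
# Prints nothing if the line does not match.
relocated_count() {
    printf '%s\n' "$1" |
        sed -n 's/.*had to relocate \([0-9][0-9]*\) out of.*/\1/p'
}

# balance_slowly MOUNTPOINT [USAGE] [LIMIT] [MAX_SECONDS] [PAUSE]
# Runs small balance passes until nothing is relocated or the time
# budget is used up. Returns 0 when done, 1 on error, 2 on timeout.
balance_slowly() {
    mountpoint=$1
    usage=${2:-20}
    limit=${3:-2}
    max_seconds=${4:-600}
    pause=${5:-30}
    deadline=$(( $(date +%s) + max_seconds ))

    while [ "$(date +%s)" -lt "$deadline" ]; do
        # Each pass relocates at most $limit data chunks.
        out=$("$BTRFS" balance start "-dusage=${usage},limit=${limit}" \
            "$mountpoint" 2>&1) || return 1
        count=$(relocated_count "$out")
        # Stop if nothing was relocated or the output was not understood.
        [ -n "$count" ] && [ "$count" -gt 0 ] || return 0
        # Give other IO (e.g. mail delivery) a chance between passes.
        sleep "$pause"
    done
    return 2    # time budget exhausted; more work may remain
}
```

Something like "balance_slowly /mnt/data 20 2 600 30" would then run passes of at most 2 chunks each for roughly 10 minutes, pausing 30 seconds between passes. The time check only happens between passes, so a pass that is already running is allowed to finish.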