Hi,

I have a 6TB btrfs filesystem I created last year (about 60% used).  It
is my main data disk for my home server so it gets a lot of usage
(particularly mail). I do frequent snapshots (using btrbk) so I have a
lot of snapshots (about 1500 now, although it was about double that
until I cut back the retention times recently).

A while ago I had a "no space" problem (despite fi df, fi show and fi
usage all agreeing I had over 1TB free).  But this email isn't about that.

As part of fixing that problem, I tried to do a "balance -dusage=20" on
the disk.  I was expecting some system impact, but it was a major
disaster.  The balance didn't just run for a long time, it locked out
all activity on the disk for hours.  A simple "touch" command to create
one file took over an hour.

More seriously, because of that, mail was being lost: all mail delivery
timed out and the timeout error was interpreted as a fatal delivery
error causing mail to be discarded, mailing lists to cancel
subscriptions, etc. The balance never completed, of course.  I
eventually got it cancelled.

I have since managed to complete the "balance -dusage=20" by running it
repeatedly with "limit=N" (for small N).  I wrote a script to automate
that process, and rerun it every week.  If anyone is interested, the
script is on GitHub: https://github.com/GrahamCobb/btrfs-balance-slowly
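
For readers who only want the general idea, here is a minimal Python
sketch of the "many small batches" approach. It is not the script linked
above. The function names, the pause length and the round cap are
illustrative assumptions, and the stop condition assumes the summary line
that btrfs-progs normally prints ("Done, had to relocate X out of Y
chunks").

```python
import subprocess
import time


def balance_command(mount_point, usage=20, limit=2):
    """Build the argv for one bounded batch: balance -dusage=<usage>,limit=<limit>."""
    return ["btrfs", "balance", "start", f"-dusage={usage},limit={limit}", mount_point]


def balance_in_batches(mount_point, usage=20, limit=2, pause_seconds=60,
                       max_rounds=100, run=None, sleep=time.sleep):
    """Run small balance batches with a pause between them.

    Returns the number of batches that were run. `run` and `sleep` are
    injectable (an assumption of this sketch) so the loop can be dry-run.
    """
    if run is None:
        def run(argv):
            # Needs root; raises CalledProcessError if btrfs exits non-zero.
            result = subprocess.run(argv, capture_output=True, text=True, check=True)
            return result.stdout

    for round_number in range(1, max_rounds + 1):
        output = run(balance_command(mount_point, usage, limit))
        # Assumed summary format: "Done, had to relocate X out of Y chunks".
        if "relocate 0 out of" in output:
            return round_number
        # Give other writers (mail delivery, snapshots) time to get in.
        sleep(pause_seconds)
    return max_rounds
```

Called as balance_in_batches("/mnt/data", limit=2), where the mount point
is a placeholder, each batch stays short and the filesystem gets a
breather between batches.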

Out of that experience, I have a couple of thoughts about how to
possibly make balance more friendly.

1) The balance process seems to (effectively) block all file (extent?)
creation for long periods of time.  Would it be possible for balance to
yield its locks more often, so that other processes/threads can get in
and continue to create/write files while it is running?

2) btrfs scrub has options to set its IO priority (ionice class and
level).  Could balance have something similar?  Or would reducing the IO
priority make things worse because locks would be held for longer?

3) My btrfs-balance-slowly script would work better if there was a
time-based limit filter for balance, not just the current count-based
filter.  I would like to be able to say, for example, run balance for no
more than 10 minutes (completing the operation in progress, of course)
then return.
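
Until something like that exists in balance itself, a time budget can
only be approximated from userspace by wrapping small batches in a
deadline loop. The sketch below shows that idea. run_batch is a
hypothetical callable (for example, one "limit=N" balance invocation)
that returns True while more work remains. The function name and the
injectable clock are assumptions made for illustration.

```python
import time


def run_with_time_budget(run_batch, budget_seconds, clock=time.monotonic):
    """Call run_batch() until it reports no more work or the budget is spent.

    A batch that has started is never interrupted, so the total run time
    can exceed the budget by up to one batch. At least one batch always
    runs. Returns (batches_run, finished).
    """
    deadline = clock() + budget_seconds
    batches = 0
    while True:
        more_work = run_batch()
        batches += 1
        if not more_work:
            return batches, True    # nothing left to do
        if clock() >= deadline:
            return batches, False   # budget spent, work remains
```

The granularity is only as good as the batch size: with "limit=1" the
overshoot is at most the time it takes to relocate one chunk, which is
still unbounded if that one relocation stalls.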

4) My btrfs-balance-slowly script would be more reliable if there was a
way to get an indication of whether there was more work to be done,
instead of parsing the output for the number of relocations.
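
To make the fragility concrete, this is roughly what the output parsing
amounts to, as a small Python sketch rather than the script's actual
code. It assumes the summary line has the form "Done, had to relocate X
out of Y chunks". That wording is not a stable interface, which is the
whole problem. The function names are illustrative.

```python
import re

# Assumed summary format; a wording change in btrfs-progs would break this.
_SUMMARY = re.compile(r"had to relocate (\d+) out of (\d+) chunks")


def parse_balance_summary(output):
    """Return (relocated, considered) from balance output, or None if absent."""
    match = _SUMMARY.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def more_work_remains(output):
    """Guess whether another batch is worthwhile: True if any chunk was moved."""
    summary = parse_balance_summary(output)
    if summary is None:
        # Fail loudly rather than loop forever or stop silently.
        raise ValueError("unrecognised balance output")
    relocated, _considered = summary
    return relocated > 0
```

Note that "relocated > 0" is only a heuristic: a batch that moved some
chunks may have been the last one needed, so one extra no-op run is
required to find out.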

Any thoughts about these?  Or other things I could be doing to reduce
the impact on my services?

Graham