Marc MERLIN posted on Sat, 22 Nov 2014 17:07:42 -0800 as excerpted:

> On Sun, Nov 23, 2014 at 12:05:04AM +0000, Hugo Mills wrote:
>> > Which is correct?
>>
>> Less than or equal to 55% full.
>
> This confuses me. Does that mean that the fullest blocks do not get
> rebalanced?
Yes. =:^)

> I guess I was under the mistaken impression that the more data you had
> the more you could be out of balance.

What you were thinking is a misstatement of the situation, so yes, again, that was a mistaken impression. =:^)

>> A chunk is the part of a block group that lives on one device, so
>> in RAID-1, every block group is precisely two chunks; in RAID-0, every
>> block group is 2 or more chunks, up to the number of devices in the FS.
>> A chunk is usually 1 GiB in size for data and 250 MiB for metadata, but
>> can be smaller under some circumstances.
>
> Right. So, why would you rebalance empty chunks or near empty chunks?
> Don't you want to rebalance almost full chunks first, and work your way
> to less and less full as needed?

No. The closer to empty a chunk is, the more effect you get from rebalancing it along with others of the same fullness.

Think of it this way. One goal of a rebalance -- the goal we have when data and metadata are unbalanced and we're hitting ENOSPC as a result (as opposed to the goal of converting, or of balancing among devices when one has just been added or removed), and thus the goal that the usage filter is designed to help solve -- is this: free excess chunk-allocated but chunk-empty space back to unallocated, so it can be used by the other type, data or metadata.

More specifically, all available space has been allocated to data and metadata chunks, leaving no space available to allocate more chunks, and one of two extremes has been reached. We'll call them D and M:

D1: All data chunks are full and more need to be allocated, but they can't be, as there's no more unallocated space to allocate the new data chunks from, *AND*

D2: There's a whole bunch of excess metadata chunks allocated, using up all that unallocated space, but they're mostly empty, and need to be rebalanced to consolidate their usage into fewer but fuller metadata chunks, thus freeing the space currently taken by all those mostly empty metadata chunks.

*OR* the reverse:

M1: All metadata chunks are full and more need to be allocated, but they can't be, as there's no more unallocated space to allocate the new metadata chunks from, *AND*

M2: There's a whole bunch of excess data chunks allocated, using up all the unallocated space, but they're mostly empty, and need to be rebalanced to consolidate their usage into fewer but fuller data chunks, thus freeing the space currently taken by all those mostly empty data chunks.

In both cases, the one type is full and needs more allocation, but the other type is hogging all the space with mostly empty chunks. In both cases, then, you *DON'T* want to bother with the full type, since it's full and rewriting it won't do anything but shuffle the full chunks around -- you can't combine any because they're all full. In both cases, what you *WANT* to do is deal with the EMPTY type, the chunks that are hogging all the space but not actually using it.

This is evidently a bit counterintuitive at first glance, as you're not the first to have problems with it, but it /is/ the case, and once you understand what's actually happening and why, it /does/ make sense.

More specifically, in the D case, where all /data/ chunks are full, you want to rebalance the mostly empty /metadata/ chunks, combining for example five near-20%-full metadata chunks into a single near-100%-full metadata chunk, deallocating the other four metadata chunks (rather than rewriting them) once there's nothing in them at all. Five just became one, freeing four to unallocated space, which can now be used to allocate new data chunks.

And the reverse in the M case, where all metadata chunks are full. Here, you want to rebalance the mostly empty data chunks, again combining say five 20%-usage data chunks into a single 100%-usage data chunk, deallocating the other four data chunks once there's nothing in them at all. Again, five just became one, freeing four to unallocated space, which can now be used to allocate new, in this case metadata, chunks.

Thus the goal is to rebalance the nearly /empty/ chunks of the *OPPOSITE* type to the one you're running short on, combining multiple nearly empty chunks of the type you have too many of, thus freeing that empty space back to unallocated, so the type that you're actually short on can actually allocate chunks from the just-freed space.
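To put that in command terms: the usage filter is applied per chunk type, so you point it at the hogging type, not the one you're short on. A minimal sketch, assuming a filesystem mounted at /mnt and a 20% cutoff pulled out of the air:

    # D case: data chunks full, metadata chunks mostly empty --
    # consolidate the mostly empty *metadata* chunks:
    btrfs balance start -musage=20 /mnt

    # M case: metadata chunks full, data chunks mostly empty --
    # consolidate the mostly empty *data* chunks:
    btrfs balance start -dusage=20 /mnt

Either way the filter means "only touch chunks at or below 20% usage", which is exactly the mostly empty set we want combined and freed.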
That being the goal, working with the full chunks won't get you much. Suppose you work with the 95%-full chunks, 5% empty. You'd have to rewrite *TWENTY* of them to combine all those 5% empties and free just *ONE* chunk! And rewriting 100%-full chunks won't get you anything at all toward this goal, since they're already full and no more can be stuffed into them. Rewrite 100 chunks that are 100% full, and you still have 100 chunks 100% full! =:^(

OTOH, suppose you work with 5%-full chunks, 95% empty. Rewrite just two of them, and you've already freed one, with the one left only 10% full. Add a third and free a second, with the one you're left with still only 15% full. Continue until you've rewritten 20 of them, AND YOU FREE 19 OF THEM! =:^)

So it *CLEARLY* pays to work with the mostly empty ones. usage=N, where balance only touches chunks with usage LESS than or equal to N, lets you do exactly that: work with the mostly EMPTY ones.

*BUT* the payoff is even HIGHER than that. Consider: since only the actually used blocks in a block group need to be rewritten, an almost full chunk is going to take FAR longer to rewrite than an almost empty one. There's going to be /some/ overhead, but take that 5%-full example again. For chunks only 5% full, you're only writing 5% of the data or metadata you'd be writing for a 100%-full chunk, 1/20th as much. So in the example above, where we find and rewrite twenty 5%-usage chunks into a single 100%-usage chunk, while there will be /some/ overhead, you might well write those twenty 5%-used chunks into that single 100%-used chunk in roughly the time it'd take to rewrite just ONE 95%-usage chunk.

IOW, rewriting twenty 95%-usage chunks down to 19, freeing just one, is going to take you nearly 20 times as long as rewriting twenty 5%-usage chunks, freeing 19 of them, since in the latter case you're actually only rewriting one full chunk's worth of data or metadata. So working with 5%-usage chunks as opposed to 95%-usage chunks, you free 19 times as much space using only a bit over a twentieth as much time. Even with 100% overhead, you'd still spend a tenth as much time freeing 19 times as many chunks!

Which is why the usage= filter is such a big deal. In many cases, it gives you *HUGE* bang for the buck! While I'm pulling numbers out of the air for this example, they're well within reason. Something like usage=10 might take you half an hour and free up 70% of the space that a full balance would free, while the full balance may well take a whole 24-hour day!
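So in practice, that's what I'd reach for first. Same hedge as above: the mountpoint and cutoff are purely illustrative, not a recommendation for your particular filesystem:

    # rewrite only data chunks at or below 10% usage; near-empty chunks
    # get consolidated or simply deallocated, full ones aren't touched,
    # so this normally finishes quickly:
    btrfs balance start -dusage=10 /mnt

    # same idea on the metadata side, if that's the type hogging space:
    btrfs balance start -musage=10 /mnt

If it comes back almost immediately having relocated only a chunk or two, that's useful information too: there was little low-usage slack to reclaim, and a higher N (or more storage) is what's actually needed.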
OK, so what /is/ the effect of a fuller filesystem? Simply this: as the filesystem fills up, there's less and less fully free unallocated space available even after a full balance, meaning that free space can be used up with fewer and fewer chunk allocations, so you have to rebalance more and more often to keep what's left from getting out of balance and running into ENOSPC conditions.

Compounding the problem, as the filesystem fills up, it's less and less likely that there will be more than just one mostly free chunk available (the one that's actively being written into, with the others full or nearly so), so it'll be necessary to use higher and higher usage=N balances to get anything back, and the bonus payoff we had above starts working in reverse, as now we WILL have to rewrite twenty 95%-full chunks just to free one chunk back to unallocated. Compounding the problem even FURTHER is the fact that we have ALL THOSE GiB (TiB?) of actual data to rewrite, so it'll be a worse and worse slog for fewer and fewer freed chunks in payback.

Again, numbers out of thin air, but for illustrative purposes... When a TiB filesystem is say 10% full, up to 90% of it could be in almost-empty chunks. Not only will it take a relatively long time to get to that point with only 10% usage, but a usage=10 filter will very likely free say 80% (leaving 10% that would require a higher usage filter to recover), in only a few minutes or a half hour or whatever. And you do it once and could be good for six months or a year before you start running low on space again and need to redo it.

When it's 90% full, you're likely to need at least usage=80 to get anywhere, and you'll be rewriting a good portion of that 900+ GiB in order to get just a handful of chunks' worth of space recovered, with the balance taking say 10-12 hours, perhaps longer. What's worse, you may well find yourself having to do a rebalance like that every week, because your total deallocatable free space (even after a full balance) is approaching your weekly working set! Obviously at or before that point it's time to invest in more storage!

But beware! Just because your filesystem is say 55% full (number from your example earlier) does **NOT** mean usage=55 is the best number to use. That may well be the case, or it may not. There's simply no necessarily direct correlation in that regard, and a recommended N for usage=N cannot be determined without a LOT more use-case information than simply knowing the filesystem is at 55% capacity. The most that can be /reliably/ stated is that in general, as usage of the filesystem goes up, so will the necessary N for the usage=N balance filter -- there's a general correlation, yes, but it's nowhere NEAR safe to assume any particular ratio like 1:1 without knowing rather more about the use-case.
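What /does/ inform the choice is the gap between what's allocated to chunks and what's actually used inside them, which you can read off before committing to anything (mountpoint illustrative again):

    # per-type totals: compare total= (space allocated to chunks)
    # against used= (space actually occupied within them):
    btrfs filesystem df /mnt

    # per-device view: size vs. used (i.e. allocated to chunks); the
    # difference is what's still unallocated:
    btrfs filesystem show /mnt

A wide spread between total and used on one type is the signature of the mostly empty chunks a usage=N balance can reclaim; a narrow spread means even a full balance won't buy you much.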
To illustrate just how wide the range is: with the filesystem at 55% capacity, the extremes are:

All used chunks at 100% usage except for one (the one that's actively being written into; this is in theory the case immediately after a full balance, and even a full balance wouldn't do anything further here),

*OR* all used chunks at 56% usage but for one (in this case usage=55 would do nothing, since all those 56%-used chunks are above the 55% cutoff and the single chunk that might be rewritten has nothing to combine with, but a usage=56 or a usage=60 would be as effective as a full balance),

*OR* most chunks actually empty, with the remainder, all but one, at 100% usage (nearly the same as the first case, except that there all available space was left unallocated, while here it's all allocated to empty chunks, such that a usage=0 would be as effective as a full balance),

*OR* all used chunks but one at 54-55% usage (usage=55 would in this case just /happen/ to be the magic number that's as effective as a full balance, while usage=54 would do nothing).

Another way of looking at it is the old pick-a-number game. Say you're using two d10 (10-sided dice, with one marked as the tens digit, thus generating 01-(1)00 as the range) to generate the number, you know the dice are weighted slightly to favor 5s, and you and two friends are picking, with you picking first. So you pick 55. But your two friends, not being dummies, pick 54 and 56. Unless those d10s are HEAVILY weighted, despite the weighting your odds of being the closest with that 55 aren't very good, are they?

Given no difference in the time required, no knowledge of how long it has been since the last full balance (which would have tended to cram everything to 100% usage), and no knowledge of the usage pattern, 55 would indeed arguably be the best choice to begin with. But given the huge time advantage of lower values of N for usage=N if they /do/ happen to do what you need, and thus the chance of usage=20 either doing the job in MUCH less time, or finishing even faster because it couldn't actually do /anything/, there's a good chance I'd try something like that first, if only to then have some idea how much higher I might want to go, because it'll be done SO much faster and has a /small/ chance of doing all I need anyway!

If usage=20 wasn't enough, I might then try usage=40, hoping it would do the rest, knowing that a rerun at a higher but still-under-100 number would at most redo only a single chunk from the previous run (the one that didn't get filled all the way at the end -- all the others would either be 100% full or would have been deallocated as empty), and knowing that the higher the number, the MUCH higher the time required, in general. So the 55% filesystem capacity would probably inform my choice of jumps, say 20% at a time, but I'd still start much lower and jump by that 20% or so at a time.
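If you'd rather not babysit the stepping, it can be scripted; a rough sketch under the same assumptions (illustrative mountpoint, and step values picked to match the example above, not a one-size-fits-all recipe):

    # step the usage filter up in stages; chunks already filled or freed
    # by an earlier pass are mostly skipped by the later ones:
    for n in 0 20 40 60 80; do
        btrfs balance start -dusage=$n -musage=$n /mnt
    done

In practice I'd still check btrfs fi df between steps and stop as soon as unallocated space looks healthy again, rather than letting it run all the way up.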
Meanwhile, if the filesystem was only at say 20% capacity, I'd probably start with usage=0 and jump by 5% at a time, while if it was at say 80% capacity, I might still start at usage=0 to see if I could get lucky, but then jump to usage=60, and then usage=98 or 99, because a really high number still under 100 would still avoid rewriting all the full chunks created by the previous runs, as well as all the 100%-full chunks that would yield no benefit toward our goal, but would still recover pretty much everything it's possible to recover -- which, once you reach 80% capacity, is going to start looking pretty necessary at some point.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman