Timofey Titovets posted on Fri, 20 Apr 2018 01:32:42 +0300 as excerpted:

> 2018-04-20 1:08 GMT+03:00 Drew Bloechl <d...@cesspool.net>:
>> I've got a btrfs filesystem that I can't seem to get back to a useful
>> state. The symptom I started with is that rename() operations started
>> dying with ENOSPC, and it looks like the metadata allocation on the
>> filesystem is full:
>>
>> # btrfs fi df /broken
>> Data, RAID0: total=3.63TiB, used=67.00GiB
>> System, RAID1: total=8.00MiB, used=224.00KiB
>> Metadata, RAID1: total=3.00GiB, used=2.50GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> All of the consumable space on the backing devices also seems to be in
>> use:
>>
>> # btrfs fi show /broken
>> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>>         Total devices 4 FS bytes used 69.50GiB
>>         devid    1 size 931.51GiB used 931.51GiB path /dev/sda1
>>         devid    2 size 931.51GiB used 931.51GiB path /dev/sdb1
>>         devid    3 size 931.51GiB used 931.51GiB path /dev/sdc1
>>         devid    4 size 931.51GiB used 931.51GiB path /dev/sdd1
>>
>> Even the smallest balance operation I can start fails (this doesn't
>> change even with an extra temporary device added to the filesystem):
>>
>> # btrfs balance start -v -dusage=1 /broken
>> Dumping filters: flags 0x1, state 0x0, force is off
>>   DATA (flags 0x2): balancing, usage=1
>> ERROR: error during balancing '/broken': No space left on device
>> There may be more info in syslog - try dmesg | tail
>> # dmesg | tail -1
>> [11554.296805] BTRFS info (device sdc1): 757 enospc errors during
>> balance
>>
>> The current kernel is 4.15.0 from Debian's stretch-backports
>> (specifically linux-image-4.15.0-0.bpo.2-amd64), but it was Debian's
>> 4.9.30 when the filesystem got into this state. I upgraded it in the
>> hopes that a newer kernel would be smarter, but no dice.
>>
>> btrfs-progs is currently at v4.7.3.
>>
>> Most of what this filesystem stores is Prometheus 1.8's TSDB for its
>> metrics, which are constantly written at around 50MB/second. The
>> filesystem never really gets full as far as data goes, but there's a
>> lot of never-ending churn for what data is there.
>>
>> Question 1: Are there other steps that can be tried to rescue a
>> filesystem in this state? I still have it mounted in the same state,
>> and I'm willing to try other things or extract debugging info.
>>
>> Question 2: Is there something I could have done to prevent this from
>> happening in the first place?
> 
> Not sure why this is happening,
> but if you're stuck in that state:
>   - Reboot, to make sure no other problem is in play.
>   - Temporarily add another device to the FS, for example a zram device.
>     After you've freed a small part of the FS, remove the extra device
>     and continue balancing chunks.

He did try adding a temporary device.  Requoting from above:

>> Even the smallest balance operation I can start fails (this doesn't
>> change even with an extra temporary device added to the filesystem):

Nevertheless, that's the right general idea, but I believe the following 
additional suggestions, now addressed to the original poster, will help.

1) Try with -dusage=0.

With any luck there will be some totally empty data chunks, which this 
should free, hopefully getting you at least enough space for the -dusage=1 
to work and free additional space.
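
Concretely, using the same command form as your failing attempt, that 
would be something like:

# btrfs balance start -v -dusage=0 /broken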

The reason this can work is that unlike chunks with actual usage, 
entirely empty chunks don't require writing a fresh chunk to copy the 
used extents into... because there aren't any.  But of course it does 
require that there are some totally empty chunks available to free, 
which with your numbers is somewhat likely, but not a given, especially 
since newer kernels (for some time now) normally free entirely empty 
chunks automatically.

FWIW, 0-usage balances are near instant, as all balance has to do is 
drop the empty chunks from the chunk list.  1% usage balances, once you 
can do them, will go really fast too, and in your state may get you back 
a decent amount of unallocated space, though they probably won't do much 
for people in less extreme unbalance conditions.  10% will do more and 
take a bit longer, but it's still fast, since it only rewrites a tenth 
of a chunk's worth of data per chunk, and as long as there are enough 
chunks at that level, it still returns ten chunks to unallocated for 
every full chunk's worth it rewrites.  At 50% it'll take much longer but 
still returns two chunks for every one it writes.

Above that the net payback drops off fast: at 67% you only net one chunk 
back for every two written, and at 90% only one for every nine.  As 
such, on spinning rust it's rarely worth going above 70% or so, and 
often not worth going above 50%, unless of course the filesystem really 
is almost full and you're trying to reclaim every last bit of unused 
chunk space to unallocated, regardless of the time it takes.  FWIW, I'm 
on ssd and partition up so my filesystems are normally under 100 GiB, so 
even a full balance normally only takes a few minutes, but I still don't 
normally bother with anything over -dusage=70 (or -musage=70, for 
metadata) or so.
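
Once balances start succeeding at all, one common approach (just a 
sketch; the filter values are arbitrary examples) is to step the usage 
filter up until you've reclaimed as much as you care to:

# for u in 1 5 10 25 50; do btrfs balance start -v -dusage=$u /broken || break; done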

If starting with -dusage=0 doesn't get you anything back...

2) Unfortunately, your metadata is effectively 100% full: the global 
reserve is carved out of metadata space, and adding the reserve to the 
used metadata puts you at the full 3 GiB metadata total.  Relocating 
data chunks requires rewriting metadata, and since btrfs is copy-on-
write that metadata must be written somewhere new, but there's no 
unallocated space left from which to allocate another metadata chunk.  
And metadata is in the default raid1 mode...

So adding a single additional device will still not work, because 
there's still no space to write the raid1 second copy of the needed 
metadata chunk.  That explains the failure in that case.

BUT, adding *TWO* additional devices should work rather better, because 
that'll let btrfs create the necessary second raid1 copy of the new 
metadata chunk.  (I don't remember whether the raid0 data would also 
need a second device, as my own experience is with raid1, but it might, 
and the metadata needs the second device anyway, so...)

The raid0 data suggests data chunks are likely to be 4 GiB each (1 GiB 
striped across the four devices), so while smaller "extra" devices might 
work, I'd shoot for a pair of say 16 GiB each, minimum (bigger would be 
fine).  I'd be unsurprised if under 5 GiB each failed, with 5-16 GiB 
each possibly working, possibly not.
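
As a sketch of the two-device approach (the device names are 
placeholders for whatever spare devices you actually use, and btrfs 
device add may need -f if they carry old filesystem signatures):

# btrfs device add /dev/sdX1 /dev/sdY1 /broken
# btrfs balance start -v -dusage=1 /broken

Then, once balance has freed some space back on the original four 
devices:

# btrfs device remove /dev/sdX1 /dev/sdY1 /broken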

I don't /think/ you'll need four additional devices, but if two devices 
of 16 GiB each minimum don't help, it couldn't hurt to try four, just in 
case.  (You probably don't have 64 GiB of free RAM, or maybe you do but 
don't want to risk losing the data in a crash, but a 64 GiB or so thumb 
drive, partitioned into four 16 GiB partitions so each can be used as a 
separate device, should do it... if a bit slowly, due to thumb-drive 
flash.)
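
If you go the thumb-drive route, one possible way to carve it up and 
attach all four pieces (hypothetical device name, and assuming sgdisk 
from the gdisk package; any partitioner will do):

# sgdisk -n 1:0:+16G -n 2:0:+16G -n 3:0:+16G -n 4:0:+16G /dev/sdZ
# btrfs device add -f /dev/sdZ1 /dev/sdZ2 /dev/sdZ3 /dev/sdZ4 /broken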


Once you get things working again, to keep the same problem from 
repeating...

3) If perchance the filesystem is getting mounted with the ssd option, 
either because you put it there or because btrfs detected it 
automatically (the component devices not reporting themselves as 
rotational)...

There's a recent (4.14 IIRC, definitely after the 4.9 you were using 
previously) change to the btrfs ssd-mode extent allocator that should 
keep btrfs from being so data-chunk hungry, as it now fills in existing 
partially used chunks instead of constantly allocating new ones.  If the 
filesystem is using ssd mode, that change should help, though with your 
usage pattern it might not have been the only problem, and of course 
without ssd mode it wouldn't have been the problem at all.
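
To check whether ssd mode is actually in effect, and to override the 
autodetection for testing, something along these lines should work 
(nossd is the standard btrfs mount option for this; findmnt is just one 
way to inspect the active options):

# findmnt -no OPTIONS /broken | tr ',' '\n' | grep -x ssd
# mount -o remount,nossd /broken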

In any case, to prevent the same problem again...

4) Keep an eye on your data chunk total vs. used numbers, and more 
importantly, your unallocated space (more about how to get these in #5).  
If the spread between data total and data used gets too big, or the 
unallocated space drops too low, do a balance -dusage= accordingly.

Currently your btrfs fi df shows 3.6+ TiB total data chunks allocated, 
but only 67 GiB used.  That's ***WAY*** out of whack.  Again, your usage 
pattern is at least part of the reason, but ssd mode on older kernels 
would have certainly exacerbated the problem.

Until your filesystem fills up more, try to keep total data chunks 
allocated under say half or one TiB.  That should leave well over a TiB 
of entirely unallocated free space, even if you don't catch a runaway 
right away and it gets to 2 TiB allocated before you catch it.

As your filesystem fills up, you'll obviously need to allow more data 
allocation and drop the unallocated, but keeping at least say 16 GiB free 
(not chunk allocated at all) on each device should keep you out of 
trouble.

5) The easiest way to check usage is the btrfs fi usage command, but 
it's a relatively new command and isn't available in older btrfs-progs.  
I /think/ progs 4.7 had it, but I'd suggest upgrading to something newer 
in any case.  It doesn't have to be the newest (unless you want the best 
chance at recovery with btrfs restore, or at check and repair with btrfs 
check), but something near 4.14 or newer would be nice.
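
A quick way to see what you currently have installed is:

# btrfs version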

The older and more difficult way to get almost the same information is 
comparing both btrfs fi show and btrfs fi df.  Since that's what you 
posted, I'll use it here:

>> # btrfs fi df /broken
>> Data, RAID0: total=3.63TiB, used=67.00GiB
>> System, RAID1: total=8.00MiB, used=224.00KiB
>> Metadata, RAID1: total=3.00GiB, used=2.50GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> All of the consumable space on the backing devices also seems to be in
>> use:
>>
>> # btrfs fi show /broken
>> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>>         Total devices 4 FS bytes used 69.50GiB
>>         devid    1 size 931.51GiB used 931.51GiB path /dev/sda1
>>         devid    2 size 931.51GiB used 931.51GiB path /dev/sdb1
>>         devid    3 size 931.51GiB used 931.51GiB path /dev/sdc1
>>         devid    4 size 931.51GiB used 931.51GiB path /dev/sdd1

As you suggest, all space on all devices is used.  While fi usage breaks 
out unallocated as its own line-item, both per device and overall, with
fi show/df you have to derive it from the difference between size and 
used on each device listed in the fi show report.
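
For example, a quick way to compute that difference per device 
(assuming your btrfs-progs accepts the --raw unit option for fi show; 
the awk field numbers just match the devid lines as shown above):

# btrfs fi show --raw /broken | awk '/devid/ {printf "%s: %.2f GiB unallocated\n", $NF, ($4 - $6) / 2^30}'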

If, after getting it there with a balance, you keep the fi show 
per-device used figure under say 250 or 500 GiB, the rest of each 
device stays unallocated, as fi usage would show directly.

Meanwhile, for fi df, that data line says 3.6+ TiB of total data chunk 
allocations, but only 67 GiB used.  As I said, that's ***WAY*** out of 
whack.  For under 100 GiB actually used, getting the total back to 
something more normal, say under 250 or 500 GiB, and keeping it there, 
with the rest returned to unallocated, will drop the data total in the 
fi df report (and the per-device used in fi show), increase unallocated 
in fi usage, and keep you well out of trouble.

As for fi usage: while I use a bunch of much smaller filesystems here, 
all raid1 or dup, so the numbers themselves are of limited direct help, 
I'll post the output from one of mine just so you can see how much 
easier the fi usage report is to read:

$ sudo btrfs filesystem usage /
Overall:
    Device size:                  16.00GiB
    Device allocated:              7.02GiB
    Device unallocated:            8.98GiB
    Device missing:                  0.00B
    Used:                          4.90GiB
    Free (estimated):              5.25GiB      (min: 5.25GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               16.00MiB      (used: 0.00B)

Data,RAID1: Size:3.00GiB, Used:2.24GiB
   /dev/sda5       3.00GiB
   /dev/sdb5       3.00GiB

Metadata,RAID1: Size:512.00MiB, Used:209.59MiB
   /dev/sda5     512.00MiB
   /dev/sdb5     512.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sda5       8.00MiB
   /dev/sdb5       8.00MiB

Unallocated:
   /dev/sda5       4.49GiB
   /dev/sdb5       4.49GiB

(FWIW there's also btrfs device usage, if you want a device-focused 
report.)

This is a btrfs raid1 both data and metadata, on a pair of 8 GiB devices, 
thus 16 GiB total.

Of that 8 GiB per device, a very healthy 4.49 GiB per device, over half 
the filesystem, remains entirely chunk-level unallocated and thus free to 
allocate to data or metadata chunks as needed.

Meanwhile, data chunk allocation is 3 GiB (on each device, as it's 
raid1), of which 2.24 GiB is used.  Again, that's healthy: data chunks 
are nominally 1 GiB, so that's probably three 1 GiB chunks allocated, 
with a bit over two chunks' worth actually in use.

By contrast, your in-trouble fi usage report will show (near) zero 
unallocated and a ***HUGE*** gap between size/total and used for data.  
You should easily be able to get per-device data totals down to say 
250 GiB or so (or, with more work, down to not much above the roughly 
17 GiB per device actually used), with everything above that switching 
back to unallocated.  After that, keep it healthy by doing a balance 
with -dusage= as necessary, any time the numbers start getting out of 
line again.
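
Finally, if you want something to warn you before it gets this bad 
again, here's a minimal watchdog sketch.  It assumes the fi usage output 
format shown above, and the threshold just follows the ~16 GiB per 
device (64 GiB total for your four devices) suggestion earlier; adapt 
the mountpoint and threshold to taste:

#!/bin/sh
# Sketch: warn when chunk-level unallocated space drops too low.
# Threshold follows the ~16 GiB/device suggestion above, times four devices.
mnt=/broken
threshold=$((64 * 1024 * 1024 * 1024))
# -b asks fi usage for raw bytes; parsing assumes the "Device unallocated:"
# line format shown in the example report above.
unalloc=$(btrfs filesystem usage -b "$mnt" | awk '/Device unallocated:/ {print $3}')
if [ "$unalloc" -lt "$threshold" ]; then
    echo "btrfs $mnt: only $unalloc bytes unallocated; time for a balance -dusage=" >&2
fi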

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
