On 2018-02-15 12:19, Tomasz Chmielewski wrote:
> On 2018-02-15 10:47, Qu Wenruo wrote:
>> On 2018-02-14 22:19, Tomasz Chmielewski wrote:
>>> Just FYI, how dangerous running btrfs can be - we had a fatal,
>>> unrecoverable MySQL corruption when btrfs decided to do one of these
>>> "I have ~50 GB left, so let's do out of space (and corrupt some files
>>> at the same time, ha ha!)".
>>
>> I've recently been looking into unexpected corruption problems on
>> btrfs.
>>
>> Would you please provide some extra info about how the corruption
>> happened?
>>
>> 1) Was there any power reset?
>>    Btrfs should be bullet proof, but in fact it's not, so I'm here to
>>    get some clue.
>
> No power reset.
>
>> 2) Are the MySQL files set with nodatacow?
>>    If so, data corruption is more or less expected, but should be
>>    handled by MySQL's checkpointing.
>
> Yes, the MySQL files were using "nodatacow".
>
> I've seen many cases of "filesystem full" with ext4, but none led to
> database corruption (i.e. the database would always recover after
> releasing some space).
>
> On the other hand, I've seen a handful of "out of space" situations
> with gigabytes of free space left on btrfs, which led to light, heavy
> or unrecoverable MySQL or mongo corruption.

Is there any kernel message, like a kernel warning or backtrace?

> Can it be because of how "predictable" out of space situations are
> with btrfs and other filesystems?

It's possible.
Other filesystems don't really allocate their metadata space dynamically
(ext4/xfs just have a limited number of inodes), while btrfs strictly
splits metadata and data usage, so it's possible that we still have
gigabytes of data space left but have run out of metadata space.

> - in short, ext4 will report out of space when there are 0 bytes left
> (perhaps slightly earlier for non-root users) - the application trying
> to write data will see "out of space" at some point, and it can stay
> like this for hours (i.e. until some data is removed manually)
>
> - on the other hand, btrfs can report out of space when there are
> still 10, 50 or 100 GB left, meaning any capacity planning is close to
> impossible; also, the application trying to write data can see the fs
> transitioning between "out of space" and "data written successfully"
> many times per minute/second

>> 3) Is the filesystem metadata corrupted? (AKA, does btrfs check
>>    report errors?)
>>    If so, that should be the problem I'm looking into.
>
> I don't think so, there are no scary things in dmesg. However, I
> didn't unmount the filesystem to run btrfs check.

If there is no scary kernel warning, then it may be less serious.

One of my assumptions is that snapshots are used in your btrfs setup
(well, snapshots being more or less the main selling point of btrfs),
and even though your DB is set nodatacow, snapshots still force it to
CoW its data.
And when the DB then fails to write some critical data, maybe the
write-ahead log or something similar, due to ENOSPC, that causes the
inconsistency.

The problem here is that most DBs assume the filesystem will, by
default, do overwrites successfully, while that's not always true for
btrfs.

If that's the case, would you please remove all snapshots of your DB
subvolume? Or just put the whole DB into one subvolume and never
snapshot it?
Then nodatacow should work as expected, and it would behave more or
less like ext4/xfs (although still slower than ext4/xfs).
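As a minimal sketch of the dedicated-subvolume route (the paths here
are only placeholders for your actual DB directory, and the DB must be
stopped before moving anything):

  # btrfs subvolume create /var/lib/lxd/mysql-data
  # chattr +C /var/lib/lxd/mysql-data
  # mv /var/lib/mysql/* /var/lib/lxd/mysql-data/

chattr +C on the still-empty directory makes files created inside it
inherit nodatacow; since mv can't rename() across a subvolume boundary,
it falls back to copy+delete, so the moved files are created fresh and
pick up the attribute.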
>> 4) Metadata/data ratio?
>>    "btrfs fi usage" gives quite a good overview of it.
>>    And "btrfs fi df" also helps.
>
> Here it is - however, that's after removing some 80 GB of data, so it
> most likely doesn't reflect the state when the failure happened.
>
> # btrfs fi usage /var/lib/lxd
> Overall:
>     Device size:                 846.25GiB
>     Device allocated:            840.05GiB
>     Device unallocated:            6.20GiB

That should prevent further ENOSPC, as long as this number stays above
1GiB.

>     Device missing:                  0.00B
>     Used:                        498.26GiB
>     Free (estimated):            167.96GiB    (min: 167.96GiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB    (used: 0.00B)
>
> Data,RAID1: Size:411.00GiB, Used:246.14GiB
>    /dev/sda3     411.00GiB
>    /dev/sdb3     411.00GiB
>
> Metadata,RAID1: Size:9.00GiB, Used:2.99GiB
>    /dev/sda3       9.00GiB
>    /dev/sdb3       9.00GiB
>
> System,RAID1: Size:32.00MiB, Used:80.00KiB
>    /dev/sda3      32.00MiB
>    /dev/sdb3      32.00MiB
>
> Unallocated:
>    /dev/sda3       3.10GiB
>    /dev/sdb3       3.10GiB
>
> # btrfs fi df /var/lib/lxd
> Data, RAID1: total=411.00GiB, used=246.15GiB
> System, RAID1: total=32.00MiB, used=80.00KiB
> Metadata, RAID1: total=9.00GiB, used=2.99GiB

Not sure if the removal of those 80GB has anything to do with it, but
it looks like your metadata (along with your data) is quite scattered:
411GiB of data chunks are allocated while only 246GiB are actually
used.

It's really recommended to keep some unallocated device space, and one
way to do that is to use balance to free such scattered space from
data/metadata usage.
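For example, a filtered balance along these lines (the 50% usage cutoff
is just an illustrative starting point, not a magic number):

  # btrfs balance start -dusage=50 -musage=50 /var/lib/lxd

This rewrites only the data/metadata chunks that are at most 50% full,
compacting their contents and returning the emptied chunks to the
unallocated pool.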
And that's why a regular balance routine is recommended for btrfs.

Thanks,
Qu

> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # btrfs fi show /var/lib/lxd
> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
>         Total devices 2 FS bytes used 249.15GiB
>         devid    1 size 423.13GiB used 420.03GiB path /dev/sda3
>         devid    2 size 423.13GiB used 420.03GiB path /dev/sdb3
>
> Tomasz Chmielewski
> https://lxadm.com