On 2017-12-10 19:27, Tomasz Pala wrote:
> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
> 
>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>
>> IIRC, no.
> 
> I have found a directory - the pam_abl databases - which occupy 10 MB
> (yes, TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes)
> after defrag. After defragging, the files were not snapshotted again,
> and I've lost 3.6 GB again, so I've got this fully reproducible.
> There are 7 files, one of which makes up 99% of the space (10 MB). None
> of them has nocow set, so they're riding all-btrfs.
> 
> I could debug something before I clean this up - is there anything you
> want me to check or know about the files?

The fiemap results, along with the btrfs dump-tree -t2 output.

Neither output contains anything related to file names or directory
names, only some "meaningless" bytenrs, so it should be completely OK
to share them.

> 
> The fragmentation impact is HUGE here - a 1000:1 ratio is almost a DoS
> condition, which could be triggered by a malicious user within a few
> hours or faster

You won't want to hear this:
The biggest ratio in theory is 128M / 4K = 32768, since a data extent
can be up to 128 MiB, and a single still-referenced 4 KiB block is
enough to keep the whole extent allocated.
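
A quick back-of-the-envelope check (assuming the default 4 KiB block
size and the 128 MiB maximum data extent size):

  echo $(( 128 * 1024 * 1024 / 4096 ))   # prints 32768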

> - I've lost 3.6 GB during the night with a reasonably small
> amount of writes, I guess it might be possible to trash the entire
> filesystem within 10 minutes if doing this on purpose.

That's a little complex.
To get into such a situation, snapshots must be used, and one must know
which file extents are shared and how they're shared.

But yes, it's possible.

On the other hand, XFS, which also supports reflink, handles this
quite well, so I'm wondering if it's possible for btrfs to follow its
behavior.

> 
>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>    reclaiming space that was lost due to fragmentation, without breaking
>>>    snapshotted CoW where it would be not only pointless, but actually
>>>    harmful?
>>
>> What about using an old kernel, like v4.13?
> 
> Unfortunately (I guess you had 3.13 in mind), I need the new ones and
> will be pushing towards 4.14.

No, I really mean v4.13.

From btrfs(5):
---
       Warning
       Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
       well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12
       or ≥ 3.13.4 will break up the ref-links of CoW data (for
       example files copied with cp --reflink, snapshots or
       de-duplicated data). This may cause considerable increase of
       space usage depending on the broken up ref-links.
---

> 
>>> 4. How can I prevent this from happening again? All the files that are
>>>    written constantly (stats collector here, PostgreSQL database and
>>>    logs on other machines) are marked with nocow (+C); maybe some new
>>>    attribute to mark a file as autodefrag? +t?
>>
>> Unfortunately, nocow only works if there is no other subvolume/inode
>> referring to it.
> 
> This shouldn't be my case anymore after defrag (== breaking links).
> I guess there's no easy way to check the refcounts of the blocks?

No easy way, unfortunately.
It's either time-consuming (as done by qgroup) or complex (manually
searching the trees and doing the backref walk yourself).
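
That said, newer btrfs-progs can at least approximate it per file - a
sketch, assuming btrfs-progs v4.6 or later, where the "filesystem du"
subcommand exists:

  # Total vs. exclusive bytes per file; a large gap means the file
  # still shares extents with snapshots/reflinks.
  btrfs filesystem du -s /path/to/dir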

> 
>> But in my understanding, btrfs is not suitable for such a conflicting
>> situation, where you want to have snapshots of frequent partial updates.
>>
>> IIRC, btrfs is better for use cases where either updates are less
>> frequent, or an update replaces the whole file, not just part of it.
>>
>> So btrfs is good for a root filesystem like /etc /usr (and /bin /lib,
>> which point to /usr/bin and /usr/lib), but not for /var or /run.
> 
> That is consistent with my conclusions after 2 years on btrfs;
> however, I didn't expect a single file to eat 1000 times more space
> than it should...
> 
> 
> I wonder how many other filesystems were trashed like this - I'm short
> ~10 GB on another system; many other users might be affected by that
> (telling the Internet stories about btrfs running out of space).

Firstly, no other filesystem supports snapshots, so it's pretty hard to
get a baseline.

But as I mentioned, XFS supports reflink, which means file extents can
be shared between several inodes.

From what I heard from the XFS guys, they free any unused space of a
file extent, so XFS should handle this quite well.

But that's quite hard to achieve in btrfs; it needs years of
development at least.

> 
> It is not a problem that I need to defrag a file; the problem is I don't know:
> 1. whether I need to defrag,
> 2. *what* I should defrag,
> nor do I have a tool that would defrag smartly - only the exclusive
> data or, in general, the blocks that are worth defragging, i.e. where
> the space released from extents is greater than the space lost on
> inter-snapshot duplication.
> 
> I can't just defrag the entire filesystem, since it breaks ref-links
> with snapshots. This change was a real deal-breaker here...

Indeed, it would be better to add an option to make defrag
snapshot-aware (don't break snapshot sharing, but only defrag
exclusive data).
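
Until such an option exists, a rough user-space approximation (just a
sketch, again assuming btrfs-progs with "filesystem du"; it only
touches files whose data is already fully exclusive, so no ref-links
can be broken):

  find /path -type f -print0 |
  while IFS= read -r -d '' f; do
      # Last line of "fi du" output: total, exclusive, set_shared, name.
      read -r total excl _ <<<"$(btrfs filesystem du --raw "$f" | tail -n1)"
      # Defrag only if nothing is shared with a snapshot anymore.
      [ "$total" = "$excl" ] && btrfs filesystem defragment "$f"
  done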

Thanks,
Qu

> 
> Any way to feed the deduplication code with snapshots, maybe? There are
> directories and files in the same layout; this could be fast-tracked to
> check and deduplicate.
> 
