On 2017年12月13日 02:01, David Sterba wrote:
> On Tue, Dec 12, 2017 at 04:16:08PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 12.12.2017 09:34, Qu Wenruo wrote:
>>> [Overall]
>>> The previous rework on qgroup reservation system put a lot of effort on
>>> data, which works quite fine.
>>>
>>> But it takes less focus on metadata reservation, causing some problem
>>> like metadata reservation underflow and noisy kernel warning.
>>>
>>> This patchset will try to address the remaining problem of metadata
>>> reservation.
>>>
>>> The idea of new qgroup metadata reservation is to use 2 types of
>>> metadata reservation:
>>> 1) Per-transaction reservation
>>>    Life span will be inside a transaction. Will be freed at transaction
>>>    commit time.
>>>
>>> 2) Preallocated reservation
>>>    For case where we reserve space before starting a transaction.
>>>    Operation like dealloc and delayed-inode/item belongs to this type.
>>>
>>>    This works similar to block_rsv, its reservation can be
>>>    reserved/released at any timing caller like.
>>>
>>>    The only point to notice is, if preallocated reservation is used and
>>>    finished without problem, it should be converted to per-transaction
>>>    type instead of just freeing.
>>>    This is to co-operate with qgroup update at commit time.
>>>
>>> For preallocated type, this patch will integrate them into inode_rsv
>>> mechanism reworked by Josef, and delayed-inode/item reservation.
>>>
>>>
>>> [Problem: Over-reserve]
>>> Currently the patchset addresses metadata underflow quite well, but
>>> due to the over-reserve nature of btrfs and highly bounded to inode_rsv,
>>> qgroup metadata reservation also tends to be over-reserved.
>>>
>>> This is especially obvious for small limit.
>>> For 128M limit, it's will only be able to write about 70M before hitting
>>> quota limit.
>>> Although for larger limit, like 5G limit, it can reach 4.5G or more
>>> before hitting limit.
>>>
>>> Such over-reserved behavior can lead to some problem with existing test
>>> cases (where limit is normally less than 20M).
>>>
>>> While it's also possible to be addressed by use more accurate space other
>>> than max estimations.
>>>
>>> For example, to calculate metadata needed for delalloc, we use
>>> btrfs_calc_trans_metadata_size(), which always try to reserve space for
>>> CoW a full-height tree, and will also include csum size.
>>> Both calculate is way over-killed for qgroup metadata reservation.
>> In private chat with Chris couple of months ago we discussed making the
>> reservation a lot less pessimistic. One assumption which we could
>> exploit is the fact that upon a tree split it's unlikely we will create
>> more than 1 additional level in the tree. So we could potentially modify
>> btrfs_calc_trans_metadata_size to take a root parameter and instead of
>>
>> BTRFS_MAX_LEVEL * 2 we could change this to root_level * 2. How does
>> that sound?

My plan is more aggressive.
Since qgroup doesn't really need to care about low level space
reservation for extents, here we only needs to care about "net"
modification.

For item insert, it will takes one new tree block for split at most (if
I don't miss something).
So for qgroup, one ordered extent should only reserve one nodesize.

In fact, for per-trans reservation, we have been using this for a long time.

> 
> That sounds unsafe for low levels, because the reservations can be made
> with level N and the tree height can increase at commit time to > N,
> then we'd go ENOSPC. But BTRFS_MAX_LEVEL can be reduced to
> min(root_level + 1, 4) and even the +1 can be dropped for higher levels
> if the root node is filled up to some percent (say < 50%). The 4 is a
> guess and could be possibly 3, with some more precise calculations.

So the idea only applies to qgroup, nothing is changed for low level
space reservation.
Although this needs extra work in qgroup side.

Current patchset completely rely on current outstanding extent mechanism
to ensure its reservation is reserved and released correctly.

If qgroup is going to do its own reservation calculation, it needs extra
work to handle.

But compared to too early EDQUOT, it's still worthy anyway.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to