On Tue, Dec 12, 2017 at 04:16:08PM +0200, Nikolay Borisov wrote:
> 
> 
> On 12.12.2017 09:34, Qu Wenruo wrote:
> > [Overall]
> > The previous rework of the qgroup reservation system put a lot of
> > effort into data, which works quite well.
> > 
> > But it put less focus on metadata reservation, causing problems like
> > metadata reservation underflow and noisy kernel warnings.
> > 
> > This patchset tries to address the remaining problems of metadata
> > reservation.
> > 
> > The idea of the new qgroup metadata reservation is to use 2 types of
> > metadata reservation:
> > 1) Per-transaction reservation
> >    Its lifespan is inside a transaction; it will be freed at
> >    transaction commit time.
> > 
> > 2) Preallocated reservation
> >    For cases where we reserve space before starting a transaction.
> >    Operations like delalloc and delayed-inode/item belong to this
> >    type.
> > 
> >    This works similarly to block_rsv: the reservation can be
> >    reserved/released at any time the caller likes.
> > 
> >    The only point to note is that if a preallocated reservation is
> >    used and finishes without problem, it should be converted to the
> >    per-transaction type instead of just being freed.
> >    This is to cooperate with the qgroup update at commit time.
> > 
> > For the preallocated type, this patchset integrates it into the
> > inode_rsv mechanism reworked by Josef, and into delayed-inode/item
> > reservation.
> > 
> > 
> > [Problem: Over-reserve]
> > Currently the patchset addresses metadata underflow quite well, but
> > due to the over-reserve nature of btrfs and the tight coupling with
> > inode_rsv, qgroup metadata reservation also tends to be
> > over-reserved.
> > 
> > This is especially obvious for small limits.
> > With a 128M limit, we can only write about 70M before hitting the
> > quota limit.
> > With a larger limit, like 5G, we can reach 4.5G or more before
> > hitting the limit.
> > 
> > Such over-reserving can cause problems with existing test cases
> > (where the limit is normally less than 20M).
> > 
> > This could also be addressed by using more accurate space estimates
> > instead of maximum estimates.
> > 
> > For example, to calculate the metadata needed for delalloc, we use
> > btrfs_calc_trans_metadata_size(), which always reserves enough space
> > to CoW a full-height tree, and also includes csum size.
> > Both calculations are overkill for qgroup metadata reservation.
> 
> In a private chat with Chris a couple of months ago we discussed making
> the reservation a lot less pessimistic. One assumption we could exploit
> is that upon a tree split it's unlikely we will create more than 1
> additional level in the tree. So we could potentially modify
> btrfs_calc_trans_metadata_size() to take a root parameter and, instead
> of BTRFS_MAX_LEVEL * 2, change this to root_level * 2. How does that
> sound?
That sounds unsafe for low levels: the reservation can be made with the
tree at level N, and the tree height can increase to > N at commit time,
so we'd hit ENOSPC. But BTRFS_MAX_LEVEL could be reduced to
min(root_level + 1, 4), and even the +1 could be dropped for higher
levels if the root node is filled only up to some percentage (say
< 50%). The 4 is a guess and could possibly be 3, with some more
precise calculations.