Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
Chris Mason wrote on 2015/09/10 19:34 -0400:
> On Tue, Sep 08, 2015 at 05:08:26PM +0800, Qu Wenruo wrote:
>> Sorry for the confusing cover letter title.
>>
>> This patch is no longer RFC now.
>> It's already a working one, and we're doing stress test to ensure it's
>> completely OK, but seems quite good for now.
>>
>> To Chris,
>>
>> I know the timing I sent the patchset is quite awful, as there is only less
>> than 1 week for rc1, and the merge window will close soon.
>>
>> But I still hope there would be a small chance we can merge it into early
>> v4.3-rc. Maybe rc2 or rc3?
>> As the reserve space leaking problem is quite annoying, sometimes even
>> making qgroup limit unusable.
>
> Sorry, this is much too big for rc2 or rc3.

Completely acceptable, as I also consider it too big anyway.

>> If that's not possible, I'm completely OK with that though, as Linus won't
>> be happy about that without doubt.
>
> Let's use the rest of the 4.3 cycle to get reviews (esp from Mark) and
> work through any problems. I'd really like to focus on this and the
> subvol deletion accounting

Makes a lot of sense.

BTW, we were originally planning to submit another qgroup enhancement for
the 4.4 cycle.
(The one originally submitted by Yang Dongsheng: separate btrfs qgroup
accounting for data and metadata.)

Would it be OK to submit both at the same time for 4.4?
Or would it be better to postpone it to 4.5?
(40+ patches will surely be quite a hell to merge.)

Thanks,
Qu

> -chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
Mark Fasheh wrote on 2015/09/10 14:01 -0700:
> Hi Qu,
>
> On Tue, Sep 08, 2015 at 04:56:52PM +0800, Qu Wenruo wrote:
>> [[BUG]]
>> One of the most common case to trigger the bug is the following method:
>> 1) Enable quota
>> 2) Limit excl of qgroup 5 to 16M
>> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
>>
>> EDQUOT will be triggered at about the 8th write.
>
> Does this happen on all kernels with qgroups or is this related to your
> recent rewrite?

All kernels.

My recent rewrite only affects the accounting part (the excl/rfer numbers);
the reserve part is somewhat independent of the accounting part.

But I have to admit that my rewrite did introduce some incompatibility with
the old reserve code.
One of the most obvious cases is the hot fix introduced in late 4.2-rc, and
there are still some hidden ones.

For example, old reserved space will be freed at end_trans() time.
But with the new accounting rewrite, we shouldn't do that until
commit_trans(), as reserved space is converted into rfer/excl only at
commit_trans().
If it is freed too early, as in the old code, we may exceed the limit.

Thankfully, all of these will be addressed in the big patchset.

>> [[CAUSE]]
>> The problem is caused by the fact that qgroup will reserve space even
>> the data space is already reserved.
>>
>> In above reproducer, each time we buffered write [0,2M) qgroup will
>> reserve 2M space, but in fact, at the 1st time, we have already reserved
>> 2M and from then on, we don't need to reserve any data space as we are
>> only writing [0,2M).
>>
>> Also, the reserved space will only be freed *ONCE* when its backref is
>> run at commit_transaction() time.
>>
>> That's causing the reserved space leaking.
>>
>> [[FIX]]
>> The fix is not a simple one, as currently btrfs_qgroup_reserve() follow
>
> Indeed, this is quite a large patch series and I see no testing details from
> you. Can you please at the least provide a single reproducer in the form of
> something that can be added to xfstests?

Like Filipe mentioned, it's already submitted to fstests.
And sorry for not mentioning it in the comment message.

BTW, there will be more test cases coming for qgroup soon, as a lot of
errors were exposed during the development of the patchset.

>> the very bad btrfs space allocating principle:
>>   Allocate as much as you needed, even it's not fully used.
>>
>> So for accurate qgroup reserve, we introduce a completely new framework
>> for data and metadata.
>> 1) Per-inode data reserve map
>>    Now, each inode will have a data reserve map, recording which range
>>    of data is already reserved.
>>    If we are writing a range which is already reserved, we won't need to
>>    reserve space again.
>>
>>    Also, for the fact that qgroup is only accounted at commit_trans(),
>>    for data commit into disk and its metadata is also inserted into
>>    current tree, we should free the data reserved range, but still keep
>>    the reserved space until commit_trans().
>>
>>    So delayed_ref_head will have new members to record how much space is
>>    reserved and free them at commit_trans() time.
>>
>> 2) Per-root metadata reserve counter
>>    For metadata(tree block), it's impossible to know how much space it
>>    will use exactly in advance.
>>    And due to the new qgroup accounting framework, the old
>>    free-at-end-trans may lead to exceeding limit.
>>
>>    So we record how much metadata space is reserved for each root, and
>>    free them at commit_trans() time.
>>    This method is not perfect, but thanks to the compared small size of
>>    metadata, it should be quite good.
>>
>> More detailed info can be found in each commit message and source
>> comment.
>>
>> Qu Wenruo (19):
>>   btrfs: qgroup: New function declaration for new reserve implement
>>   btrfs: qgroup: Implement data_rsv_map init/free functions
>>   btrfs: qgroup: Introduce new function to search most left reserve
>>     range
>>   btrfs: qgroup: Introduce function to insert non-overlap reserve range
>>   btrfs: qgroup: Introduce function to reserve data range per inode
>>   btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
>>   btrfs: qgroup: Introduce function to release reserved range
>>   btrfs: qgroup: Introduce function to release/free reserved data range
>>   btrfs: delayed_ref: Add new function to record reserved space into
>>     delayed ref
>>   btrfs: delayed_ref: release and free qgroup reserved at proper timing
>>   btrfs: qgroup: Introduce new functions to reserve/free metadata
>>   btrfs: qgroup: Use new metadata reservation.
>>   btrfs: extent-tree: Add new verions of btrfs_check_data_free_space
>>   btrfs: Switch to new check_data_free_space
>>   btrfs: fallocate: Add support to accurate qgroup reserve
>>   btrfs: extent-tree: Add new version of btrfs_delalloc_reserve_space
>>   btrfs: extent-tree: Use new __btrfs_delalloc_reserve_space function
>>   btrfs: qgroup: Cleanup old inaccurate facilities
>>   btrfs: qgroup: Add handler for NOCOW and inline
>
> I took a quick look through a few of these, none of them have any trace_*
> functions, yet you're adding several new entrypoints to the qgroup code.
> Those ar
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
On Thu, Sep 10, 2015 at 10:33:02PM +0100, Filipe David Manana wrote:
> On Thu, Sep 10, 2015 at 10:01 PM, Mark Fasheh wrote:
> > Hi Qu,
> >
> > On Tue, Sep 08, 2015 at 04:56:52PM +0800, Qu Wenruo wrote:
> >> [[BUG]]
> >> One of the most common case to trigger the bug is the following method:
> >> 1) Enable quota
> >> 2) Limit excl of qgroup 5 to 16M
> >> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
> >>
> >> EDQUOT will be triggered at about the 8th write.
> >
> > Does this happen on all kernels with qgroups or is this related to your
> > recent rewrite?
> >
> >
> >> [[CAUSE]]
> >> The problem is caused by the fact that qgroup will reserve space even
> >> the data space is already reserved.
> >>
> >> In above reproducer, each time we buffered write [0,2M) qgroup will
> >> reserve 2M space, but in fact, at the 1st time, we have already reserved
> >> 2M and from then on, we don't need to reserve any data space as we are
> >> only writing [0,2M).
> >>
> >> Also, the reserved space will only be freed *ONCE* when its backref is
> >> run at commit_transaction() time.
> >>
> >> That's causing the reserved space leaking.
> >>
> >> [[FIX]]
> >> The fix is not a simple one, as currently btrfs_qgroup_reserve() follow
> >
> > Indeed, this is quite a large patch series and I see no testing details from
> > you. Can you please at the least provide a single reproducer in the form of
> > something that can be added to xfstests?
>
> https://patchwork.kernel.org/patch/7047641/
>
> Came way before this patchset :)

Ok, thanks. IMHO that sort of thing should be part of the topic e-mail so
potential reviewers don't have to go googling for a test case ;)
	--Mark

--
Mark Fasheh
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
On Tue, Sep 08, 2015 at 05:08:26PM +0800, Qu Wenruo wrote:
> Sorry for the confusing cover letter title.
>
> This patch is no longer RFC now.
> It's already a working one, and we're doing stress test to ensure it's
> completely OK, but seems quite good for now.
>
> To Chris,
>
> I know the timing I sent the patchset is quite awful, as there is only less
> than 1 week for rc1, and the merge window will close soon.
>
> But I still hope there would be a small chance we can merge it into early
> v4.3-rc. Maybe rc2 or rc3?
> As the reserve space leaking problem is quite annoying, sometimes even
> making qgroup limit unusable.

Sorry, this is much too big for rc2 or rc3.

> If that's not possible, I'm completely OK with that though, as Linus won't
> be happy about that without doubt.

Let's use the rest of the 4.3 cycle to get reviews (esp from Mark) and
work through any problems. I'd really like to focus on this and the
subvol deletion accounting

-chris
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
On Thu, Sep 10, 2015 at 10:01 PM, Mark Fasheh wrote:
> Hi Qu,
>
> On Tue, Sep 08, 2015 at 04:56:52PM +0800, Qu Wenruo wrote:
>> [[BUG]]
>> One of the most common case to trigger the bug is the following method:
>> 1) Enable quota
>> 2) Limit excl of qgroup 5 to 16M
>> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
>>
>> EDQUOT will be triggered at about the 8th write.
>
> Does this happen on all kernels with qgroups or is this related to your
> recent rewrite?
>
>
>> [[CAUSE]]
>> The problem is caused by the fact that qgroup will reserve space even
>> the data space is already reserved.
>>
>> In above reproducer, each time we buffered write [0,2M) qgroup will
>> reserve 2M space, but in fact, at the 1st time, we have already reserved
>> 2M and from then on, we don't need to reserve any data space as we are
>> only writing [0,2M).
>>
>> Also, the reserved space will only be freed *ONCE* when its backref is
>> run at commit_transaction() time.
>>
>> That's causing the reserved space leaking.
>>
>> [[FIX]]
>> The fix is not a simple one, as currently btrfs_qgroup_reserve() follow
>
> Indeed, this is quite a large patch series and I see no testing details from
> you. Can you please at the least provide a single reproducer in the form of
> something that can be added to xfstests?

https://patchwork.kernel.org/patch/7047641/

Came way before this patchset :)

>
>
>> the very bad btrfs space allocating principle:
>>   Allocate as much as you needed, even it's not fully used.
>>
>> So for accurate qgroup reserve, we introduce a completely new framework
>> for data and metadata.
>> 1) Per-inode data reserve map
>>    Now, each inode will have a data reserve map, recording which range
>>    of data is already reserved.
>>    If we are writing a range which is already reserved, we won't need to
>>    reserve space again.
>>
>>    Also, for the fact that qgroup is only accounted at commit_trans(),
>>    for data commit into disk and its metadata is also inserted into
>>    current tree, we should free the data reserved range, but still keep
>>    the reserved space until commit_trans().
>>
>>    So delayed_ref_head will have new members to record how much space is
>>    reserved and free them at commit_trans() time.
>>
>> 2) Per-root metadata reserve counter
>>    For metadata(tree block), it's impossible to know how much space it
>>    will use exactly in advance.
>>    And due to the new qgroup accounting framework, the old
>>    free-at-end-trans may lead to exceeding limit.
>>
>>    So we record how much metadata space is reserved for each root, and
>>    free them at commit_trans() time.
>>    This method is not perfect, but thanks to the compared small size of
>>    metadata, it should be quite good.
>>
>> More detailed info can be found in each commit message and source
>> comment.
>>
>> Qu Wenruo (19):
>>   btrfs: qgroup: New function declaration for new reserve implement
>>   btrfs: qgroup: Implement data_rsv_map init/free functions
>>   btrfs: qgroup: Introduce new function to search most left reserve
>>     range
>>   btrfs: qgroup: Introduce function to insert non-overlap reserve range
>>   btrfs: qgroup: Introduce function to reserve data range per inode
>>   btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
>>   btrfs: qgroup: Introduce function to release reserved range
>>   btrfs: qgroup: Introduce function to release/free reserved data range
>>   btrfs: delayed_ref: Add new function to record reserved space into
>>     delayed ref
>>   btrfs: delayed_ref: release and free qgroup reserved at proper timing
>>   btrfs: qgroup: Introduce new functions to reserve/free metadata
>>   btrfs: qgroup: Use new metadata reservation.
>>   btrfs: extent-tree: Add new verions of btrfs_check_data_free_space
>>   btrfs: Switch to new check_data_free_space
>>   btrfs: fallocate: Add support to accurate qgroup reserve
>>   btrfs: extent-tree: Add new version of btrfs_delalloc_reserve_space
>>   btrfs: extent-tree: Use new __btrfs_delalloc_reserve_space function
>>   btrfs: qgroup: Cleanup old inaccurate facilities
>>   btrfs: qgroup: Add handler for NOCOW and inline
>
> I took a quick look through a few of these, none of them have any trace_*
> functions, yet you're adding several new entrypoints to the qgroup code.
> Those are incredibly useful for debugging on live systems and in fact I've
> got a patch which reintroduces the ones you removed in your last patch
> series ;)
>
> This time around can you please provide tracepoints for at least your new
> high level entrypoint functions into the qgroup code?
>
> Thanks,
>         --Mark
>
> --
> Mark Fasheh

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
Hi Qu,

On Tue, Sep 08, 2015 at 04:56:52PM +0800, Qu Wenruo wrote:
> [[BUG]]
> One of the most common case to trigger the bug is the following method:
> 1) Enable quota
> 2) Limit excl of qgroup 5 to 16M
> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
>
> EDQUOT will be triggered at about the 8th write.

Does this happen on all kernels with qgroups or is this related to your
recent rewrite?

> [[CAUSE]]
> The problem is caused by the fact that qgroup will reserve space even
> the data space is already reserved.
>
> In above reproducer, each time we buffered write [0,2M) qgroup will
> reserve 2M space, but in fact, at the 1st time, we have already reserved
> 2M and from then on, we don't need to reserve any data space as we are
> only writing [0,2M).
>
> Also, the reserved space will only be freed *ONCE* when its backref is
> run at commit_transaction() time.
>
> That's causing the reserved space leaking.
>
> [[FIX]]
> The fix is not a simple one, as currently btrfs_qgroup_reserve() follow

Indeed, this is quite a large patch series and I see no testing details from
you. Can you please at the least provide a single reproducer in the form of
something that can be added to xfstests?

> the very bad btrfs space allocating principle:
>   Allocate as much as you needed, even it's not fully used.
>
> So for accurate qgroup reserve, we introduce a completely new framework
> for data and metadata.
> 1) Per-inode data reserve map
>    Now, each inode will have a data reserve map, recording which range
>    of data is already reserved.
>    If we are writing a range which is already reserved, we won't need to
>    reserve space again.
>
>    Also, for the fact that qgroup is only accounted at commit_trans(),
>    for data commit into disk and its metadata is also inserted into
>    current tree, we should free the data reserved range, but still keep
>    the reserved space until commit_trans().
>
>    So delayed_ref_head will have new members to record how much space is
>    reserved and free them at commit_trans() time.
>
> 2) Per-root metadata reserve counter
>    For metadata(tree block), it's impossible to know how much space it
>    will use exactly in advance.
>    And due to the new qgroup accounting framework, the old
>    free-at-end-trans may lead to exceeding limit.
>
>    So we record how much metadata space is reserved for each root, and
>    free them at commit_trans() time.
>    This method is not perfect, but thanks to the compared small size of
>    metadata, it should be quite good.
>
> More detailed info can be found in each commit message and source
> comment.
>
> Qu Wenruo (19):
>   btrfs: qgroup: New function declaration for new reserve implement
>   btrfs: qgroup: Implement data_rsv_map init/free functions
>   btrfs: qgroup: Introduce new function to search most left reserve
>     range
>   btrfs: qgroup: Introduce function to insert non-overlap reserve range
>   btrfs: qgroup: Introduce function to reserve data range per inode
>   btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
>   btrfs: qgroup: Introduce function to release reserved range
>   btrfs: qgroup: Introduce function to release/free reserved data range
>   btrfs: delayed_ref: Add new function to record reserved space into
>     delayed ref
>   btrfs: delayed_ref: release and free qgroup reserved at proper timing
>   btrfs: qgroup: Introduce new functions to reserve/free metadata
>   btrfs: qgroup: Use new metadata reservation.
>   btrfs: extent-tree: Add new verions of btrfs_check_data_free_space
>   btrfs: Switch to new check_data_free_space
>   btrfs: fallocate: Add support to accurate qgroup reserve
>   btrfs: extent-tree: Add new version of btrfs_delalloc_reserve_space
>   btrfs: extent-tree: Use new __btrfs_delalloc_reserve_space function
>   btrfs: qgroup: Cleanup old inaccurate facilities
>   btrfs: qgroup: Add handler for NOCOW and inline

I took a quick look through a few of these, none of them have any trace_*
functions, yet you're adding several new entrypoints to the qgroup code.
Those are incredibly useful for debugging on live systems and in fact I've
got a patch which reintroduces the ones you removed in your last patch
series ;)

This time around can you please provide tracepoints for at least your new
high level entrypoint functions into the qgroup code?

Thanks,
	--Mark

--
Mark Fasheh
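
The reproducer quoted in this message can be modelled with a few lines of
Python. This is only a toy sketch, not kernel code: ToyQgroup and
first_failing_write are hypothetical names, and two things are assumed
rather than taken from the thread: that the pre-patch limit check has the
form reserved + excl + num_bytes > max_excl, and that an otherwise empty
subvolume already accounts 16K of exclusive usage. Under those assumptions
the model fails on the 8th buffered write, in line with the "about the 8th
write" report; with excl set to 0 it would fail on the 9th instead.

```python
import errno

KIB = 1024
MIB = 1024 * KIB


class ToyQgroup:
    """Toy model of a qgroup with an exclusive limit (hypothetical, not kernel code)."""

    def __init__(self, max_excl, excl=0):
        self.max_excl = max_excl  # configured exclusive limit in bytes
        self.excl = excl          # already accounted exclusive bytes (assumed)
        self.reserved = 0         # bytes reserved but not yet accounted

    def reserve(self, num_bytes):
        # Assumed form of the old check: every write reserves its full
        # length, even when the same file range was reserved before.
        if self.reserved + self.excl + num_bytes > self.max_excl:
            raise OSError(errno.EDQUOT, "quota exceeded")
        self.reserved += num_bytes


def first_failing_write(qgroup, writes, length):
    """Return the 1-based index of the first write hitting EDQUOT, or None."""
    for n in range(1, writes + 1):
        try:
            qgroup.reserve(length)
        except OSError as exc:
            if exc.errno != errno.EDQUOT:
                raise
            return n
    return None


if __name__ == "__main__":
    qg = ToyQgroup(max_excl=16 * MIB, excl=16 * KIB)
    print(first_failing_write(qg, writes=10, length=2 * MIB))
```

Nothing is freed between writes because the reproducer never syncs, so no
transaction commit runs during the ten writes.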
Re: [PATCH RFC 00/14] Accurate qgroup reserve framework
Sorry for the confusing cover letter title.

This patch is no longer RFC now.
It's already a working one, and we're doing stress test to ensure it's
completely OK, but seems quite good for now.

To Chris,

I know the timing I sent the patchset is quite awful, as there is only less
than 1 week for rc1, and the merge window will close soon.

But I still hope there would be a small chance we can merge it into early
v4.3-rc. Maybe rc2 or rc3?
As the reserve space leaking problem is quite annoying, sometimes even
making qgroup limit unusable.

If that's not possible, I'm completely OK with that though, as Linus won't
be happy about that without doubt.

Thanks,
Qu

Qu Wenruo wrote on 2015/09/08 16:56 +0800:
> [[BUG]]
> One of the most common case to trigger the bug is the following method:
> 1) Enable quota
> 2) Limit excl of qgroup 5 to 16M
> 3) Write [0,2M) of a file inside subvol 5 10 times without sync
>
> EDQUOT will be triggered at about the 8th write.
>
> [[CAUSE]]
> The problem is caused by the fact that qgroup will reserve space even
> the data space is already reserved.
>
> In above reproducer, each time we buffered write [0,2M) qgroup will
> reserve 2M space, but in fact, at the 1st time, we have already reserved
> 2M and from then on, we don't need to reserve any data space as we are
> only writing [0,2M).
>
> Also, the reserved space will only be freed *ONCE* when its backref is
> run at commit_transaction() time.
>
> That's causing the reserved space leaking.
>
> [[FIX]]
> The fix is not a simple one, as currently btrfs_qgroup_reserve() follow
> the very bad btrfs space allocating principle:
>   Allocate as much as you needed, even it's not fully used.
>
> So for accurate qgroup reserve, we introduce a completely new framework
> for data and metadata.
> 1) Per-inode data reserve map
>    Now, each inode will have a data reserve map, recording which range
>    of data is already reserved.
>    If we are writing a range which is already reserved, we won't need to
>    reserve space again.
>
>    Also, for the fact that qgroup is only accounted at commit_trans(),
>    for data commit into disk and its metadata is also inserted into
>    current tree, we should free the data reserved range, but still keep
>    the reserved space until commit_trans().
>
>    So delayed_ref_head will have new members to record how much space is
>    reserved and free them at commit_trans() time.
>
> 2) Per-root metadata reserve counter
>    For metadata(tree block), it's impossible to know how much space it
>    will use exactly in advance.
>    And due to the new qgroup accounting framework, the old
>    free-at-end-trans may lead to exceeding limit.
>
>    So we record how much metadata space is reserved for each root, and
>    free them at commit_trans() time.
>    This method is not perfect, but thanks to the compared small size of
>    metadata, it should be quite good.
>
> More detailed info can be found in each commit message and source
> comment.
>
> Qu Wenruo (19):
>   btrfs: qgroup: New function declaration for new reserve implement
>   btrfs: qgroup: Implement data_rsv_map init/free functions
>   btrfs: qgroup: Introduce new function to search most left reserve
>     range
>   btrfs: qgroup: Introduce function to insert non-overlap reserve range
>   btrfs: qgroup: Introduce function to reserve data range per inode
>   btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
>   btrfs: qgroup: Introduce function to release reserved range
>   btrfs: qgroup: Introduce function to release/free reserved data range
>   btrfs: delayed_ref: Add new function to record reserved space into
>     delayed ref
>   btrfs: delayed_ref: release and free qgroup reserved at proper timing
>   btrfs: qgroup: Introduce new functions to reserve/free metadata
>   btrfs: qgroup: Use new metadata reservation.
>   btrfs: extent-tree: Add new verions of btrfs_check_data_free_space
>   btrfs: Switch to new check_data_free_space
>   btrfs: fallocate: Add support to accurate qgroup reserve
>   btrfs: extent-tree: Add new version of btrfs_delalloc_reserve_space
>   btrfs: extent-tree: Use new __btrfs_delalloc_reserve_space function
>   btrfs: qgroup: Cleanup old inaccurate facilities
>   btrfs: qgroup: Add handler for NOCOW and inline
>
>  fs/btrfs/btrfs_inode.h |   6 +
>  fs/btrfs/ctree.h       |   8 +-
>  fs/btrfs/delayed-ref.c |  29 +++
>  fs/btrfs/delayed-ref.h |  14 +
>  fs/btrfs/disk-io.c     |   1 +
>  fs/btrfs/extent-tree.c |  99 +---
>  fs/btrfs/file.c        | 169 +
>  fs/btrfs/inode-map.c   |   2 +-
>  fs/btrfs/inode.c       |  51 +++-
>  fs/btrfs/ioctl.c       |   3 +-
>  fs/btrfs/qgroup.c      | 674 -
>  fs/btrfs/qgroup.h      |  18 +-
>  fs/btrfs/transaction.c |  34 +--
>  fs/btrfs/transaction.h |   1 -
>  14 files changed, 979 insertions(+), 130 deletions(-)
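
The per-inode data reserve map described in the cover letter can be
illustrated roughly as follows. This is a minimal Python sketch of the idea
only, assuming the map is a sorted set of non-overlapping [start, end) byte
ranges; DataReserveMap and its reserve() method are hypothetical names and
do not correspond to the functions added by the patches. reserve() returns
only the bytes that were not covered before, which is the amount that would
have to be charged to the qgroup.

```python
MIB = 1024 * 1024


class DataReserveMap:
    """Toy per-inode reserve map: sorted, non-overlapping [start, end) ranges."""

    def __init__(self):
        self.ranges = []  # list of (start, end) tuples

    def reserve(self, start, length):
        """Mark [start, start + length) as reserved.

        Returns the number of bytes that were not reserved before this call.
        """
        end = start + length
        new_bytes = length
        merged_start, merged_end = start, end
        kept = []
        for s, e in self.ranges:
            if e < start or s > end:
                kept.append((s, e))  # disjoint and not adjacent: keep as-is
                continue
            # Overlapping or adjacent: discount the overlap, then merge.
            new_bytes -= max(0, min(e, end) - max(s, start))
            merged_start = min(merged_start, s)
            merged_end = max(merged_end, e)
        kept.append((merged_start, merged_end))
        kept.sort()
        self.ranges = kept
        return new_bytes


if __name__ == "__main__":
    rsv = DataReserveMap()
    # Ten buffered writes to [0, 2M) reserve 2M in total, not 20M.
    total = sum(rsv.reserve(0, 2 * MIB) for _ in range(10))
    print(total // MIB, rsv.ranges)
```

With such a map, rewriting [0,2M) ten times charges 2M once and 0 for the
nine later writes, so the 16M limit from the reproducer is never approached.
The sketch leaves out releasing ranges and the deferred free at
commit_trans() time.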