On 23/03/2021 10:57, Filipe Manana wrote:
> On Fri, Mar 19, 2021 at 10:52 AM Johannes Thumshirn
> <[email protected]> wrote:
>>
>> When a file gets deleted on a zoned file system, the space freed is not
>> returned back into the block group's free space, but is migrated to
>> zone_unusable.
>>
>> As this zone_unusable space is behind the current write pointer it is not
>> possible to use it for new allocations. In the current implementation a
>> zone is reset once all of the block group's space is accounted as zone
>> unusable.
>>
>> This behaviour can lead to premature ENOSPC errors on a busy file system.
>>
>> Instead of only reclaiming the zone once it is completely unusable,
>> kick off a reclaim job once the amount of unusable bytes exceeds a user
>> configurable threshold between 51% and 100%. It can be set per mounted
>> filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
>> per default.
>>
>> Similar to reclaiming unused block groups, these dirty block groups are
>> added to a to_reclaim list and then on a transaction commit, the reclaim
>> process is triggered but after we deleted unused block groups, which will
>> free space for the relocation process.
>>
>> Signed-off-by: Johannes Thumshirn <[email protected]>
>> ---
>>
>> AFAICT sysfs_create_files() does not have the ability to provide a is_visible
>> callback, so the bg_reclaim_threshold sysfs file is visible for non zoned
>> filesystems as well, even though only for zoned filesystems we're adding
>> block
>> groups to the reclaim list. I'm all ears for a approach that is sensible in
>> this regard.
>>
>>
>> fs/btrfs/block-group.c | 84 ++++++++++++++++++++++++++++++++++++
>> fs/btrfs/block-group.h | 2 +
>> fs/btrfs/ctree.h | 3 ++
>> fs/btrfs/disk-io.c | 11 +++++
>> fs/btrfs/free-space-cache.c | 9 +++-
>> fs/btrfs/sysfs.c | 35 +++++++++++++++
>> fs/btrfs/volumes.c | 2 +-
>> fs/btrfs/volumes.h | 1 +
>> include/trace/events/btrfs.h | 12 ++++++
>> 9 files changed, 157 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index 9ae3ac96a521..af9026795ddd 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -1485,6 +1485,80 @@ void btrfs_mark_bg_unused(struct btrfs_block_group
>> *bg)
>> spin_unlock(&fs_info->unused_bgs_lock);
>> }
>>
>> +void btrfs_reclaim_bgs(struct btrfs_fs_info *fs_info)
>> +{
>> + struct btrfs_block_group *bg;
>> + struct btrfs_space_info *space_info;
>> + int ret = 0;
>> +
>> + if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
>> + return;
>> +
>> + if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
>> + return;
>> +
>> + mutex_lock(&fs_info->reclaim_bgs_lock);
>> + while (!list_empty(&fs_info->reclaim_bgs)) {
>> + bg = list_first_entry(&fs_info->reclaim_bgs,
>> + struct btrfs_block_group,
>> + bg_list);
>> + list_del_init(&bg->bg_list);
>> +
>> + space_info = bg->space_info;
>> + mutex_unlock(&fs_info->reclaim_bgs_lock);
>> +
>> + /* Don't want to race with allocators so take the groups_sem
>> */
>> + down_write(&space_info->groups_sem);
>> +
>> + spin_lock(&bg->lock);
>> + if (bg->reserved || bg->pinned || bg->ro) {
>> + /*
>> + * We want to bail if we made new allocations or have
>> + * outstanding allocations in this block group. We
>> do
>> + * the ro check in case balance is currently acting
>> on
>> + * this block group.
>> + */
>> + spin_unlock(&bg->lock);
>> + up_write(&space_info->groups_sem);
>> + goto next;
>> + }
>> + spin_unlock(&bg->lock);
>> +
>> + ret = inc_block_group_ro(bg, 0);
>> + up_write(&space_info->groups_sem);
>> + if (ret < 0) {
>> + ret = 0;
>> + goto next;
>> + }
>> +
>> + btrfs_info(fs_info, "reclaiming chunk %llu", bg->start);
>> + trace_btrfs_reclaim_block_group(bg);
>> + ret = btrfs_relocate_chunk(fs_info, bg->start);
>
> I think you forgot to test this with lockdep enabled, it should have
> complained loudly otherwise.
>
> btrfs_relocate_chunk() calls lockdep_assert_head() to verify we are
> holding fs_info->reclaim_bgs_lock,
> but we just unlocked it.
I thought I did, but you're right. I'll need to send a v3 anyways addressing
Josef and Anand's comments.
Thanks,
Johannes