Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

Alex Lyakas Sun, 13 Dec 2015 08:17:32 -0800

Thank you, Filipe. Now it is more clear.
Fortunately, in my kernel 3.18 I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But can hit the issue that this call is meant to fix.


Thanks,
Alex.


On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval <osan...@fb.com> wrote:
>>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>>>> From: Filipe Manana <fdman...@suse.com>
>>>>
>>>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>>>> finishing block group creation"), introduced in 4.2-rc1, the following
>>>> test was failing due to exhaustion of the system array in the superblock:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   truncate -s 100T big.img
>>>>   mkfs.btrfs big.img
>>>>   mount -o loop big.img /mnt/loop
>>>>
>>>>   num=5
>>>>   sz=10T
>>>>   for ((i = 0; i < $num; i++)); do
>>>>       echo fallocate $i $sz
>>>>       fallocate -l $sz /mnt/loop/testfile$i
>>>>   done
>>>>   btrfs filesystem sync /mnt/loop
>>>>
>>>>   for ((i = 0; i < $num; i++)); do
>>>>         echo rm $i
>>>>         rm /mnt/loop/testfile$i
>>>>         btrfs filesystem sync /mnt/loop
>>>>   done
>>>>   umount /mnt/loop
>>>>
>>>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>>>> allocation of system block groups. This happened because the test creates
>>>> a large number of data block groups per transaction and when committing
>>>> the transaction we start the writeout of the block group caches for all
>>>> the new new (dirty) block groups, which results in pre-allocating space
>>>> for each block group's free space cache using the same transaction handle.
>>>> That in turn often leads to creation of more block groups, and all get
>>>> attached to the new_bgs list of the same transaction handle to the point
>>>> of getting a list with over 1500 elements, and creation of new block groups
>>>> leads to the need of reserving space in the chunk block reserve and often
>>>> creating a new system block group too.
>>>>
>>>> So that made us quickly exhaust the chunk block reserve/system space info,
>>>> because as of the commit mentioned before, we do reserve space for each
>>>> new block group in the chunk block reserve, unlike before where we would
>>>> not and would at most allocate one new system block group and therefore
>>>> would only ensure that there was enough space in the system space info to
>>>> allocate 1 new block group even if we ended up allocating thousands of
>>>> new block groups using the same transaction handle. That worked most of
>>>> the time because the computed required space at check_system_chunk() is
>>>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>>>> that all nodes/leafs in a path will be COWed and split) and since the
>>>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>>>> it is unlikely that a path needs to be COWed more than once (unless
>>>> writepages() for the btree inode is called by mm in between) and that
>>>> compensated for the need of creating any new nodes/leads in the chunk
>>>> tree.
>>>>
>>>> So fix this by ensuring we don't accumulate a too large list of new block
>>>> groups in a transaction's handles new_bgs list, inserting/updating the
>>>> chunk tree for all accumulated new block groups and releasing the unused
>>>> space from the chunk block reserve whenever the list becomes sufficiently
>>>> large. This is a generic solution even though the problem currently can
>>>> only happen when starting the writeout of the free space caches for all
>>>> dirty block groups (btrfs_start_dirty_block_groups()).
>>>>
>>>> Reported-by: Omar Sandoval <osan...@fb.com>
>>>> Signed-off-by: Filipe Manana <fdman...@suse.com>
>>>
>>> Thanks a lot for taking a look.
>>>
>>> Tested-by: Omar Sandoval <osan...@fb.com>
>>>
>>>> ---
>>>>  fs/btrfs/extent-tree.c | 18 ++++++++++++++++++
>>>>  1 file changed, 18 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 171312d..07204bf 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -4227,6 +4227,24 @@ out:
>>>>       space_info->chunk_alloc = 0;
>>>>       spin_unlock(&space_info->lock);
>>>>       mutex_unlock(&fs_info->chunk_mutex);
>>>> +     /*
>>>> +      * When we allocate a new chunk we reserve space in the chunk block
>>>> +      * reserve to make sure we can COW nodes/leafs in the chunk tree or
>>>> +      * add new nodes/leafs to it if we end up needing to do it when
>>>> +      * inserting the chunk item and updating device items as part of the
>>>> +      * second phase of chunk allocation, performed by
>>>> +      * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a
>>>> +      * large number of new block groups to create in our transaction
>>>> +      * handle's new_bgs list to avoid exhausting the chunk block reserve
>>>> +      * in extreme cases - like having a single transaction create many 
>>>> new
>>>> +      * block groups when starting to write out the free space caches of 
>>>> all
>>>> +      * the block groups that were made dirty during the lifetime of the
>>>> +      * transaction.
>>>> +      */
>>>> +     if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) {
>>>> +             btrfs_create_pending_block_groups(trans, trans->root);
>>>> +             btrfs_trans_release_chunk_metadata(trans);
>>>> +     }
>>>>       return ret;
>>>>  }
>>>>
>>>> --
>>>> 2.1.3
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majord...@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Omar
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

Reply via email to