Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Hi Filipe Manana, Can't the call to btrfs_create_pending_block_groups() cause a deadlock, like in http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this call updates the device tree, and we may be calling do_chunk_alloc() from find_free_extent() when holding a lock on the device tree root (because we want to COW a block of the device tree). My understanding from Josef's chunk allocator rework (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now when allocating a new chunk we do not immediately update the device/chunk tree. We keep the new chunk in "pending_chunks" and in "new_bgs" on a transaction handle, and we actually update the chunk/device tree only when we are done with a particular transaction handle. This way we avoid that sort of deadlocks. But this patch breaks this rule, as it may make us update the device/chunk tree in the context of chunk allocation, which is the scenario that the rework was meant to avoid. Can you please point me at what I am missing? Thanks, Alex. On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandovalwrote: > On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: >> From: Filipe Manana >> >> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when >> finishing block group creation"), introduced in 4.2-rc1, the following >> test was failing due to exhaustion of the system array in the superblock: >> >> #!/bin/bash >> >> truncate -s 100T big.img >> mkfs.btrfs big.img >> mount -o loop big.img /mnt/loop >> >> num=5 >> sz=10T >> for ((i = 0; i < $num; i++)); do >> echo fallocate $i $sz >> fallocate -l $sz /mnt/loop/testfile$i >> done >> btrfs filesystem sync /mnt/loop >> >> for ((i = 0; i < $num; i++)); do >> echo rm $i >> rm /mnt/loop/testfile$i >> btrfs filesystem sync /mnt/loop >> done >> umount /mnt/loop >> >> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive >> allocation of system block groups. This happened because the test creates >> a large number of data block groups per transaction and when committing >> the transaction we start the writeout of the block group caches for all >> the new new (dirty) block groups, which results in pre-allocating space >> for each block group's free space cache using the same transaction handle. >> That in turn often leads to creation of more block groups, and all get >> attached to the new_bgs list of the same transaction handle to the point >> of getting a list with over 1500 elements, and creation of new block groups >> leads to the need of reserving space in the chunk block reserve and often >> creating a new system block group too. >> >> So that made us quickly exhaust the chunk block reserve/system space info, >> because as of the commit mentioned before, we do reserve space for each >> new block group in the chunk block reserve, unlike before where we would >> not and would at most allocate one new system block group and therefore >> would only ensure that there was enough space in the system space info to >> allocate 1 new block group even if we ended up allocating thousands of >> new block groups using the same transaction handle. That worked most of >> the time because the computed required space at check_system_chunk() is >> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and >> that all nodes/leafs in a path will be COWed and split) and since the >> updates to the chunk tree all happen at btrfs_create_pending_block_groups >> it is unlikely that a path needs to be COWed more than once (unless >> writepages() for the btree inode is called by mm in between) and that >> compensated for the need of creating any new nodes/leads in the chunk >> tree. >> >> So fix this by ensuring we don't accumulate a too large list of new block >> groups in a transaction's handles new_bgs list, inserting/updating the >> chunk tree for all accumulated new block groups and releasing the unused >> space from the chunk block reserve whenever the list becomes sufficiently >> large. This is a generic solution even though the problem currently can >> only happen when starting the writeout of the free space caches for all >> dirty block groups (btrfs_start_dirty_block_groups()). >> >> Reported-by: Omar Sandoval >> Signed-off-by: Filipe Manana > > Thanks a lot for taking a look. > > Tested-by: Omar Sandoval > >> --- >> fs/btrfs/extent-tree.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 171312d..07204bf 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4227,6 +4227,24 @@ out: >> space_info->chunk_alloc = 0; >> spin_unlock(_info->lock); >> mutex_unlock(_info->chunk_mutex); >> + /* >> + * When we allocate a new chunk we reserve space in the chunk block >> + * reserve
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakaswrote: > Hi Filipe Manana, > > Can't the call to btrfs_create_pending_block_groups() cause a > deadlock, like in > http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this > call updates the device tree, and we may be calling do_chunk_alloc() > from find_free_extent() when holding a lock on the device tree root > (because we want to COW a block of the device tree). > > My understanding from Josef's chunk allocator rework > (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now > when allocating a new chunk we do not immediately update the > device/chunk tree. We keep the new chunk in "pending_chunks" and in > "new_bgs" on a transaction handle, and we actually update the > chunk/device tree only when we are done with a particular transaction > handle. This way we avoid that sort of deadlocks. > > But this patch breaks this rule, as it may make us update the > device/chunk tree in the context of chunk allocation, which is the > scenario that the rework was meant to avoid. > > Can you please point me at what I am missing? https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff > > Thanks, > Alex. > > > On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval wrote: >> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: >>> From: Filipe Manana >>> >>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when >>> finishing block group creation"), introduced in 4.2-rc1, the following >>> test was failing due to exhaustion of the system array in the superblock: >>> >>> #!/bin/bash >>> >>> truncate -s 100T big.img >>> mkfs.btrfs big.img >>> mount -o loop big.img /mnt/loop >>> >>> num=5 >>> sz=10T >>> for ((i = 0; i < $num; i++)); do >>> echo fallocate $i $sz >>> fallocate -l $sz /mnt/loop/testfile$i >>> done >>> btrfs filesystem sync /mnt/loop >>> >>> for ((i = 0; i < $num; i++)); do >>> echo rm $i >>> rm /mnt/loop/testfile$i >>> btrfs filesystem sync /mnt/loop >>> done >>> umount /mnt/loop >>> >>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive >>> allocation of system block groups. This happened because the test creates >>> a large number of data block groups per transaction and when committing >>> the transaction we start the writeout of the block group caches for all >>> the new new (dirty) block groups, which results in pre-allocating space >>> for each block group's free space cache using the same transaction handle. >>> That in turn often leads to creation of more block groups, and all get >>> attached to the new_bgs list of the same transaction handle to the point >>> of getting a list with over 1500 elements, and creation of new block groups >>> leads to the need of reserving space in the chunk block reserve and often >>> creating a new system block group too. >>> >>> So that made us quickly exhaust the chunk block reserve/system space info, >>> because as of the commit mentioned before, we do reserve space for each >>> new block group in the chunk block reserve, unlike before where we would >>> not and would at most allocate one new system block group and therefore >>> would only ensure that there was enough space in the system space info to >>> allocate 1 new block group even if we ended up allocating thousands of >>> new block groups using the same transaction handle. That worked most of >>> the time because the computed required space at check_system_chunk() is >>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and >>> that all nodes/leafs in a path will be COWed and split) and since the >>> updates to the chunk tree all happen at btrfs_create_pending_block_groups >>> it is unlikely that a path needs to be COWed more than once (unless >>> writepages() for the btree inode is called by mm in between) and that >>> compensated for the need of creating any new nodes/leads in the chunk >>> tree. >>> >>> So fix this by ensuring we don't accumulate a too large list of new block >>> groups in a transaction's handles new_bgs list, inserting/updating the >>> chunk tree for all accumulated new block groups and releasing the unused >>> space from the chunk block reserve whenever the list becomes sufficiently >>> large. This is a generic solution even though the problem currently can >>> only happen when starting the writeout of the free space caches for all >>> dirty block groups (btrfs_start_dirty_block_groups()). >>> >>> Reported-by: Omar Sandoval >>> Signed-off-by: Filipe Manana >> >> Thanks a lot for taking a look. >> >> Tested-by: Omar Sandoval >> >>> --- >>> fs/btrfs/extent-tree.c | 18 ++ >>> 1 file changed, 18 insertions(+) >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 171312d..07204bf
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Thank you, Filipe. Now it is more clear. Fortunately, in my kernel 3.18 I do not have do_chunk_alloc() calling btrfs_create_pending_block_groups(), so I cannot hit this deadlock. But can hit the issue that this call is meant to fix. Thanks, Alex. On Sun, Dec 13, 2015 at 5:45 PM, Filipe Mananawrote: > On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas wrote: >> Hi Filipe Manana, >> >> Can't the call to btrfs_create_pending_block_groups() cause a >> deadlock, like in >> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this >> call updates the device tree, and we may be calling do_chunk_alloc() >> from find_free_extent() when holding a lock on the device tree root >> (because we want to COW a block of the device tree). >> >> My understanding from Josef's chunk allocator rework >> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now >> when allocating a new chunk we do not immediately update the >> device/chunk tree. We keep the new chunk in "pending_chunks" and in >> "new_bgs" on a transaction handle, and we actually update the >> chunk/device tree only when we are done with a particular transaction >> handle. This way we avoid that sort of deadlocks. >> >> But this patch breaks this rule, as it may make us update the >> device/chunk tree in the context of chunk allocation, which is the >> scenario that the rework was meant to avoid. >> >> Can you please point me at what I am missing? > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff > >> >> Thanks, >> Alex. >> >> >> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval wrote: >>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: From: Filipe Manana Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when finishing block group creation"), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock: #!/bin/bash truncate -s 100T big.img mkfs.btrfs big.img mount -o loop big.img /mnt/loop num=5 sz=10T for ((i = 0; i < $num; i++)); do echo fallocate $i $sz fallocate -l $sz /mnt/loop/testfile$i done btrfs filesystem sync /mnt/loop for ((i = 0; i < $num; i++)); do echo rm $i rm /mnt/loop/testfile$i btrfs filesystem sync /mnt/loop done umount /mnt/loop This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction and when committing the transaction we start the writeout of the block group caches for all the new new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements, and creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too. So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we do reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the computed required space at check_system_chunk() is very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and that all nodes/leafs in a path will be COWed and split) and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between) and that compensated for the need of creating any new nodes/leads in the chunk tree. So fix this by ensuring we don't accumulate a too large list of new block groups in a transaction's handles new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: From: Filipe Manana fdman...@suse.com Omar reported that after commit 4fbcdf669454 (Btrfs: fix -ENOSPC when finishing block group creation), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock: #!/bin/bash truncate -s 100T big.img mkfs.btrfs big.img mount -o loop big.img /mnt/loop num=5 sz=10T for ((i = 0; i $num; i++)); do echo fallocate $i $sz fallocate -l $sz /mnt/loop/testfile$i done btrfs filesystem sync /mnt/loop for ((i = 0; i $num; i++)); do echo rm $i rm /mnt/loop/testfile$i btrfs filesystem sync /mnt/loop done umount /mnt/loop This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction and when committing the transaction we start the writeout of the block group caches for all the new new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements, and creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too. So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we do reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the computed required space at check_system_chunk() is very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and that all nodes/leafs in a path will be COWed and split) and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between) and that compensated for the need of creating any new nodes/leads in the chunk tree. So fix this by ensuring we don't accumulate a too large list of new block groups in a transaction's handles new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups (btrfs_start_dirty_block_groups()). Reported-by: Omar Sandoval osan...@fb.com Signed-off-by: Filipe Manana fdman...@suse.com Thanks a lot for taking a look. Tested-by: Omar Sandoval osan...@fb.com --- fs/btrfs/extent-tree.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 171312d..07204bf 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4227,6 +4227,24 @@ out: space_info-chunk_alloc = 0; spin_unlock(space_info-lock); mutex_unlock(fs_info-chunk_mutex); + /* + * When we allocate a new chunk we reserve space in the chunk block + * reserve to make sure we can COW nodes/leafs in the chunk tree or + * add new nodes/leafs to it if we end up needing to do it when + * inserting the chunk item and updating device items as part of the + * second phase of chunk allocation, performed by + * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a + * large number of new block groups to create in our transaction + * handle's new_bgs list to avoid exhausting the chunk block reserve + * in extreme cases - like having a single transaction create many new + * block groups when starting to write out the free space caches of all + * the block groups that were made dirty during the lifetime of the + * transaction. + */ + if (trans-chunk_bytes_reserved = (2 * 1024 * 1024ull)) { + btrfs_create_pending_block_groups(trans, trans-root); + btrfs_trans_release_chunk_metadata(trans); + } return ret; } -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Omar -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in
[PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
From: Filipe Manana fdman...@suse.com Omar reported that after commit 4fbcdf669454 (Btrfs: fix -ENOSPC when finishing block group creation), introduced in 4.2-rc1, the following test was failing due to exhaustion of the system array in the superblock: #!/bin/bash truncate -s 100T big.img mkfs.btrfs big.img mount -o loop big.img /mnt/loop num=5 sz=10T for ((i = 0; i $num; i++)); do echo fallocate $i $sz fallocate -l $sz /mnt/loop/testfile$i done btrfs filesystem sync /mnt/loop for ((i = 0; i $num; i++)); do echo rm $i rm /mnt/loop/testfile$i btrfs filesystem sync /mnt/loop done umount /mnt/loop This made btrfs_add_system_chunk() fail with -EFBIG due to excessive allocation of system block groups. This happened because the test creates a large number of data block groups per transaction and when committing the transaction we start the writeout of the block group caches for all the new new (dirty) block groups, which results in pre-allocating space for each block group's free space cache using the same transaction handle. That in turn often leads to creation of more block groups, and all get attached to the new_bgs list of the same transaction handle to the point of getting a list with over 1500 elements, and creation of new block groups leads to the need of reserving space in the chunk block reserve and often creating a new system block group too. So that made us quickly exhaust the chunk block reserve/system space info, because as of the commit mentioned before, we do reserve space for each new block group in the chunk block reserve, unlike before where we would not and would at most allocate one new system block group and therefore would only ensure that there was enough space in the system space info to allocate 1 new block group even if we ended up allocating thousands of new block groups using the same transaction handle. That worked most of the time because the computed required space at check_system_chunk() is very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and that all nodes/leafs in a path will be COWed and split) and since the updates to the chunk tree all happen at btrfs_create_pending_block_groups it is unlikely that a path needs to be COWed more than once (unless writepages() for the btree inode is called by mm in between) and that compensated for the need of creating any new nodes/leads in the chunk tree. So fix this by ensuring we don't accumulate a too large list of new block groups in a transaction's handles new_bgs list, inserting/updating the chunk tree for all accumulated new block groups and releasing the unused space from the chunk block reserve whenever the list becomes sufficiently large. This is a generic solution even though the problem currently can only happen when starting the writeout of the free space caches for all dirty block groups (btrfs_start_dirty_block_groups()). Reported-by: Omar Sandoval osan...@fb.com Signed-off-by: Filipe Manana fdman...@suse.com --- fs/btrfs/extent-tree.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 171312d..07204bf 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4227,6 +4227,24 @@ out: space_info-chunk_alloc = 0; spin_unlock(space_info-lock); mutex_unlock(fs_info-chunk_mutex); + /* +* When we allocate a new chunk we reserve space in the chunk block +* reserve to make sure we can COW nodes/leafs in the chunk tree or +* add new nodes/leafs to it if we end up needing to do it when +* inserting the chunk item and updating device items as part of the +* second phase of chunk allocation, performed by +* btrfs_finish_chunk_alloc(). So make sure we don't accumulate a +* large number of new block groups to create in our transaction +* handle's new_bgs list to avoid exhausting the chunk block reserve +* in extreme cases - like having a single transaction create many new +* block groups when starting to write out the free space caches of all +* the block groups that were made dirty during the lifetime of the +* transaction. +*/ + if (trans-chunk_bytes_reserved = (2 * 1024 * 1024ull)) { + btrfs_create_pending_block_groups(trans, trans-root); + btrfs_trans_release_chunk_metadata(trans); + } return ret; } -- 2.1.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html