Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlock.

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.

Can you please point me at what I am missing?

Thanks,
Alex.


On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>> [...]

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Filipe Manana
On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas  wrote:
> [...]
>
> Can you please point me at what I am missing?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff


Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Thank you, Filipe. Now it is clearer.
Fortunately, in my kernel 3.18 I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can still hit the issue that this call is meant to fix.

Thanks,
Alex.


On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana  wrote:
> [...]
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-07-21 Thread Omar Sandoval
On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
> [...]
Thanks a lot for taking a look.

Tested-by: Omar Sandoval osan...@fb.com


-- 
Omar

[PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-07-20 Thread fdmanana
From: Filipe Manana fdman...@suse.com

Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
finishing block group creation"), introduced in 4.2-rc1, the following
test was failing due to exhaustion of the system array in the superblock:

  #!/bin/bash

  truncate -s 100T big.img
  mkfs.btrfs big.img
  mount -o loop big.img /mnt/loop

  num=5
  sz=10T
  for ((i = 0; i < $num; i++)); do
  echo fallocate $i $sz
  fallocate -l $sz /mnt/loop/testfile$i
  done
  btrfs filesystem sync /mnt/loop

  for ((i = 0; i < $num; i++)); do
echo rm $i
rm /mnt/loop/testfile$i
btrfs filesystem sync /mnt/loop
  done
  umount /mnt/loop

This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
allocation of system block groups. This happened because the test creates
a large number of data block groups per transaction and when committing
the transaction we start the writeout of the block group caches for all
the new (dirty) block groups, which results in pre-allocating space
for each block group's free space cache using the same transaction handle.
That in turn often leads to creation of more block groups, and all get
attached to the new_bgs list of the same transaction handle to the point
of getting a list with over 1500 elements, and creation of new block groups
leads to the need of reserving space in the chunk block reserve and often
creating a new system block group too.

So that made us quickly exhaust the chunk block reserve/system space info,
because as of the commit mentioned before, we do reserve space for each
new block group in the chunk block reserve, unlike before where we would
not and would at most allocate one new system block group and therefore
would only ensure that there was enough space in the system space info to
allocate 1 new block group even if we ended up allocating thousands of
new block groups using the same transaction handle. That worked most of
the time because the computed required space at check_system_chunk() is
very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
that all nodes/leafs in a path will be COWed and split) and since the
updates to the chunk tree all happen at btrfs_create_pending_block_groups
it is unlikely that a path needs to be COWed more than once (unless
writepages() for the btree inode is called by mm in between) and that
compensated for the need of creating any new nodes/leafs in the chunk
tree.

So fix this by ensuring we don't accumulate an overly large list of new block
groups in a transaction handle's new_bgs list, inserting/updating the
chunk tree for all accumulated new block groups and releasing the unused
space from the chunk block reserve whenever the list becomes sufficiently
large. This is a generic solution even though the problem currently can
only happen when starting the writeout of the free space caches for all
dirty block groups (btrfs_start_dirty_block_groups()).

Reported-by: Omar Sandoval osan...@fb.com
Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 171312d..07204bf 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4227,6 +4227,24 @@ out:
 	space_info->chunk_alloc = 0;
 	spin_unlock(&space_info->lock);
 	mutex_unlock(&fs_info->chunk_mutex);
+	/*
+	 * When we allocate a new chunk we reserve space in the chunk block
+	 * reserve to make sure we can COW nodes/leafs in the chunk tree or
+	 * add new nodes/leafs to it if we end up needing to do it when
+	 * inserting the chunk item and updating device items as part of the
+	 * second phase of chunk allocation, performed by
+	 * btrfs_finish_chunk_alloc(). So make sure we don't accumulate a
+	 * large number of new block groups to create in our transaction
+	 * handle's new_bgs list to avoid exhausting the chunk block reserve
+	 * in extreme cases - like having a single transaction create many new
+	 * block groups when starting to write out the free space caches of all
+	 * the block groups that were made dirty during the lifetime of the
+	 * transaction.
+	 */
+	if (trans->chunk_bytes_reserved >= (2 * 1024 * 1024ull)) {
+		btrfs_create_pending_block_groups(trans, trans->root);
+		btrfs_trans_release_chunk_metadata(trans);
+	}
 	return ret;
 }
 
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html