Hello Josef,

On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik <jba...@fusionio.com> wrote:
> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>> An user reported that he has hit an annoying deadlock while playing with
>> ceph based on btrfs.
>>
>> Current updating device tree requires space from METADATA chunk,
>> so we -may- need to do a recursive chunk allocation when adding/updating
>> dev extent, that is where the deadlock comes from.
>>
>> If we use SYSTEM metadata to update device tree, we can avoid the recursive
>> stuff.
>>
>
> This is going to cause us to allocate much more system chunks than we used to
> which could land us in trouble.  Instead let's just keep us from re-entering 
> if
> we're already allocating a chunk.  We do the chunk allocation when we don't 
> have
> enough space for a cluster, but we'll likely have plenty of space to make an
> allocation.  Can you give this patch a try Jim and see if it fixes your 
> problem?
> Thanks,
>
> Josef
>
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e152809..59df5e7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans,
>         int wait_for_alloc = 0;
>         int ret = 0;
>
> +       /* Don't re-enter if we're already allocating a chunk */
> +       if (trans->allocating_chunk)
> +               return -ENOSPC;
> +
>         space_info = __find_space_info(extent_root->fs_info, flags);
>         if (!space_info) {
>                 ret = update_space_info(extent_root->fs_info, flags,
> @@ -3606,6 +3610,8 @@ again:
>                 goto again;
>         }
>
> +       trans->allocating_chunk = true;
> +
>         /*
>          * If we have mixed data/metadata chunks we want to make sure we keep
>          * allocating mixed chunks instead of individual chunks.
> @@ -3632,6 +3638,7 @@ again:
>         check_system_chunk(trans, extent_root, flags);
>
>         ret = btrfs_alloc_chunk(trans, extent_root, flags);
> +       trans->allocating_chunk = false;
>         if (ret < 0 && ret != -ENOSPC)
>                 goto out;
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e6509b9..47ad8be 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -388,6 +388,7 @@ again:
>         h->qgroup_reserved = qgroup_reserved;
>         h->delayed_ref_elem.seq = 0;
>         h->type = type;
> +       h->allocating_chunk = false;
>         INIT_LIST_HEAD(&h->qgroup_ref_list);
>         INIT_LIST_HEAD(&h->new_bgs);
>
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0e8aa1e..69700f7 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
>         struct btrfs_block_rsv *orig_rsv;
>         short aborted;
>         short adding_csums;
> +       bool allocating_chunk;
>         enum btrfs_trans_type type;
>         /*
>          * this root is only needed to validate that the root passed to

I hit this problem in a following scenario:
- a data chunk allocation is triggered, and locks chunk_mutex
- the same thread now also wants to allocate a metadata chunk, so it
recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
deadlock
- btrfs has only one metadata chunk, the one that was initially
allocated by mkfs, it has:
total_bytes=8388608
bytes_used=8130560
bytes_pinned=77824
bytes_reserved=180224
so bytes_used + bytes_pinned + bytes_reserved == total_bytes

Your patch would have returned ENOSPC and avoid the deadlock, but
there would be a failure to allocate a tree block for metadata. So the
transaction would have probably aborted.

How such situation should be handled?

Idea1:
- lock chunk mutex,
- if we are allocating a data chunk, check whether the metadata space
is below some threshold. If yes, go and allocate a metadata chunk
first and then only a data chunk.

Idea2:
- check if we are the same thread that already locked the chunk mutex.
If yes, allow recursive call but don't attempt to lock/unlock the
chunk_mutex this time

Or some other way?

Thanks!
Alex.






> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to