This sounds reasonable to me, I'll look at it more when I'm on the ground and can look at the code and see for sure. Thanks,
Josef Sent from my iPhone > On Mar 19, 2017, at 1:19 PM, Alex Lyakas <a...@zadarastorage.com> wrote: > > We have a commit_root_sem, which is a read-write semaphore that protects the > commit roots. > But it is also used to protect the list of caching block groups. > > As a result, while doing "slow" caching, the following issue is seen: > > Some of the caching threads are scanning the extent tree with commit_root_sem > acquired in shared mode, with stack like: > [<ffffffffc0ad27b2>] read_extent_buffer_pages+0x2d2/0x300 [btrfs] > [<ffffffffc0a9fbe7>] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 > [btrfs] > [<ffffffffc0aa1550>] read_tree_block+0x40/0x70 [btrfs] > [<ffffffffc0a7aa7c>] read_block_for_search.isra.33+0x12c/0x370 [btrfs] > [<ffffffffc0a7ce86>] btrfs_search_slot+0x3c6/0xb10 [btrfs] > [<ffffffffc0a975b9>] caching_thread+0x1b9/0x820 [btrfs] > [<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs] > [<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs] > > IO requests that want to allocate space are waiting in cache_block_group() > to acquire the commit_root_sem in exclusive mode. But they only want to add > the caching control structure to the list of caching block-groups: > [<ffffffff817136c9>] schedule+0x29/0x70 > [<ffffffff81716085>] rwsem_down_write_failed+0x145/0x320 > [<ffffffff813a1ae3>] call_rwsem_down_write_failed+0x13/0x20 > [<ffffffffc0a86d0b>] cache_block_group+0x25b/0x450 [btrfs] > [<ffffffffc0a94d36>] find_free_extent+0xd16/0xdb0 [btrfs] > [<ffffffffc0a94e7f>] btrfs_reserve_extent+0xaf/0x160 [btrfs] > > Other caching threads want to continue their scanning, and for that they > are waiting to acquire commit_root_sem in shared mode. But since there are > IO threads that want the exclusive lock, the caching threads are unable > to continue the scanning, because (I presume) rw_semaphore guarantees some > fairness: > [<ffffffff817136c9>] schedule+0x29/0x70 > [<ffffffff81715ee5>] rwsem_down_read_failed+0xc5/0x120 > [<ffffffff813a1ab4>] call_rwsem_down_read_failed+0x14/0x30 > [<ffffffffc0a975a1>] caching_thread+0x1a1/0x820 [btrfs] > [<ffffffffc0adfa36>] normal_work_helper+0xc6/0x340 [btrfs] > [<ffffffffc0adfd22>] btrfs_cache_helper+0x12/0x20 [btrfs] > [<ffffffff8108bd56>] process_one_work+0x146/0x410 > > This causes slowness of the IO, especially when there are many block groups > that need to be scanned for free space. In some cases it takes minutes > until a single IO thread is able to allocate free space. > > I don't see a deadlock here, because the caching threads that were able to > acquire > the commit_root_sem will call rwsem_is_contended() and should give up the > semaphore, > so that IO threads are able to acquire it in exclusive mode. > > However, introducing a separate mutex that protects only the list of caching > block groups makes things move forward much faster. > > This patch is based on kernel 3.18. > Unfortunately, I am not able to submit a patch based on one of the latest > kernels, because > here btrfs is part of the larger system, and upgrading the kernel is a > significant effort. > Hence marking the patch as RFC. > Hopefully, this patch still has some value to the community. > > Signed-off-by: Alex Lyakas <a...@zadarastorage.com> > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 42d11e7..74feacb 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -1490,6 +1490,8 @@ struct btrfs_fs_info { > struct list_head trans_list; > struct list_head dead_roots; > struct list_head caching_block_groups; > + /* protects the above list */ > + struct mutex caching_block_groups_mutex; > > spinlock_t delayed_iput_lock; > struct list_head delayed_iputs; > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 5177954..130ec58 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb, > INIT_LIST_HEAD(&fs_info->delayed_iputs); > INIT_LIST_HEAD(&fs_info->delalloc_roots); > INIT_LIST_HEAD(&fs_info->caching_block_groups); > + mutex_init(&fs_info->caching_block_groups_mutex); > spin_lock_init(&fs_info->delalloc_root_lock); > spin_lock_init(&fs_info->trans_lock); > spin_lock_init(&fs_info->fs_roots_radix_lock); > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index a067065..906fb08 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -637,10 +637,10 @@ static int cache_block_group(struct > btrfs_block_group_cache *cache, > return 0; > } > > - down_write(&fs_info->commit_root_sem); > + mutex_lock(&fs_info->caching_block_groups_mutex); > atomic_inc(&caching_ctl->count); > list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups); > - up_write(&fs_info->commit_root_sem); > + mutex_unlock(&fs_info->caching_block_groups_mutex); > > btrfs_get_block_group(cache); > > @@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct > btrfs_trans_handle *trans, > > down_write(&fs_info->commit_root_sem); > > + mutex_lock(&fs_info->caching_block_groups_mutex); > list_for_each_entry_safe(caching_ctl, next, > &fs_info->caching_block_groups, list) { > cache = caching_ctl->block_group; > @@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct > btrfs_trans_handle *trans, > cache->last_byte_to_unpin = caching_ctl->progress; > } > } > + mutex_unlock(&fs_info->caching_block_groups_mutex); > > if (fs_info->pinned_extents == &fs_info->freed_extents[0]) > fs_info->pinned_extents = &fs_info->freed_extents[1]; > @@ -8849,14 +8851,14 @@ int btrfs_free_block_groups(struct btrfs_fs_info > *info) > struct btrfs_caching_control *caching_ctl; > struct rb_node *n; > > - down_write(&info->commit_root_sem); > + mutex_lock(&info->caching_block_groups_mutex); > while (!list_empty(&info->caching_block_groups)) { > caching_ctl = list_entry(info->caching_block_groups.next, > struct btrfs_caching_control, list); > list_del(&caching_ctl->list); > put_caching_control(caching_ctl); > } > - up_write(&info->commit_root_sem); > + mutex_unlock(&info->caching_block_groups_mutex); > > spin_lock(&info->unused_bgs_lock); > while (!list_empty(&info->unused_bgs)) { > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html