Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
On Sun, 27 Mar 2011 14:30:55 +0900, Itaru Kitayama wrote:
> Chris' stress test, stress.sh -n 50 -c /mnt/linux-2.6 /mnt, gave me another
> lockdep splat (see below). I applied your V5 patches on top of the next-rc
> branch.

I got it. It is because the allocation flag of the metadata's page cache,
which is stored in the btree inode's i_mapping, was set to
GFP_HIGHUSER_MOVABLE. So if we allocate pages for the btree's page cache,
this lockdep warning will be triggered.

I think even without my patch this lockdep warning can be triggered, because
btrfs_evict_inode() does operations similar to what I do in
btrfs_destroy_inode():

	Task1				kswapd0 task
	open()
	...
	btrfs_search_slot()
	...
	btrfs_cow_block()
	...
	alloc_page()
	wait for reclaiming
					shrink_slab()
					...
					shrink_icache_memory()
					...
					btrfs_evict_inode()
					...
					btrfs_search_slot()

If the path is locked by Task1, the deadlock happens.

So the btree's page cache is different from a file's page cache: it cannot
allocate pages with the GFP_HIGHUSER_MOVABLE flag. I will make a separate
patch to fix it.

> I haven't triggered it in my actual testing, but do you think we can
> iterate over a list of block groups in a lockless manner using RCU?

Maybe we can use it, but AFAIK the write side of sleepable RCU is quite
slow. Though the operations on the block group list are few, I think we
should run some tests to check for a performance regression.
Thanks
Miao

> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2164296..f40ff4e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -740,6 +740,7 @@ struct btrfs_space_info {
>  	struct list_head block_groups[BTRFS_NR_RAID_TYPES];
>  	spinlock_t lock;
>  	struct rw_semaphore groups_sem;
> +	struct srcu_struct groups_srcu;
>  	atomic_t caching_threads;
>  };
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 9e4c9f4..22d6dbb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3003,6 +3003,7 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
>  	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
>  		INIT_LIST_HEAD(&found->block_groups[i]);
>  	init_rwsem(&found->groups_sem);
> +	init_srcu_struct(&found->groups_srcu);
>  	spin_lock_init(&found->lock);
>  	found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
>  				BTRFS_BLOCK_GROUP_SYSTEM |
> @@ -4853,6 +4854,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
>  				     int data)
>  {
>  	int ret = 0;
> +	int idx;
>  	struct btrfs_root *root = orig_root->fs_info->extent_root;
>  	struct btrfs_free_cluster *last_ptr = NULL;
>  	struct btrfs_block_group_cache *block_group = NULL;
> @@ -4929,7 +4931,7 @@ ideal_cache:
>  	if (block_group && block_group_bits(block_group, data) &&
>  	    (block_group->cached != BTRFS_CACHE_NO ||
>  	     search_start == ideal_cache_offset)) {
> -		down_read(&space_info->groups_sem);
> +		idx = srcu_read_lock(&space_info->groups_srcu);
>  		if (list_empty(&block_group->list) ||
>  		    block_group->ro) {
>  			/*
> @@ -4939,7 +4941,7 @@ ideal_cache:
>  			 * valid
>  			 */
>  			btrfs_put_block_group(block_group);
> -			up_read(&space_info->groups_sem);
> +			srcu_read_unlock(&space_info->groups_srcu, idx);
>  		} else {
>  			index = get_block_group_index(block_group);
>  			goto have_block_group;
> @@ -4949,8 +4951,8 @@ ideal_cache:
>  		}
>  	}
> search:
> -	down_read(&space_info->groups_sem);
> -	list_for_each_entry(block_group, &space_info->block_groups[index],
> +	idx = srcu_read_lock(&space_info->groups_srcu);
> +	list_for_each_entry_rcu(block_group, &space_info->block_groups[index],
>  			    list) {
>  		u64 offset;
>  		int cached;
> @@ -5197,8 +5199,8 @@ loop:
>  		BUG_ON(index != get_block_group_index(block_group));
>  		btrfs_put_block_group(block_group);
>  	}
> -	up_read(&space_info->groups_sem);
> -
> +	srcu_read_unlock(&space_info->groups_srcu, idx);
> +
>  	if (!ins->objectid && ++index < BTRFS_NR_RAID_TYPES)
>  		goto search;
>
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 2.6.36-v5+ #2
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Hi Miao,

On Sun, 27 Mar 2011 15:00:00 +0800 Miao Xie mi...@cn.fujitsu.com wrote:
> I got it. It is because the allocation flag of the metadata's page cache,
> which is stored in the btree inode's i_mapping, was set to
> GFP_HIGHUSER_MOVABLE. So if we allocate pages for the btree's page cache,
> this lockdep warning will be triggered.
>
> I think even without my patch this lockdep warning can be triggered,
> because btrfs_evict_inode() does operations similar to what I do in
> btrfs_destroy_inode():
>
> 	Task1				kswapd0 task
> 	open()
> 	...
> 	btrfs_search_slot()
> 	...
> 	btrfs_cow_block()
> 	...
> 	alloc_page()
> 	wait for reclaiming
> 					shrink_slab()
> 					...
> 					shrink_icache_memory()
> 					...
> 					btrfs_evict_inode()
> 					...
> 					btrfs_search_slot()
>
> If the path is locked by Task1, the deadlock happens.

Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL, so it's still
possible for kswapd0 to call prune_icache(), no? I still see the lockdep
warning even with your patch that clears __GFP_FS in open_ctree().

itaru
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
On Sun, 27 Mar 2011 20:09:10 +0900, Itaru Kitayama wrote:
> Hi Miao,
>
> On Sun, 27 Mar 2011 15:00:00 +0800 Miao Xie mi...@cn.fujitsu.com wrote:
>> I got it. It is because the allocation flag of the metadata's page cache,
>> which is stored in the btree inode's i_mapping, was set to
>> GFP_HIGHUSER_MOVABLE. So if we allocate pages for the btree's page cache,
>> this lockdep warning will be triggered.
>>
>> I think even without my patch this lockdep warning can be triggered,
>> because btrfs_evict_inode() does operations similar to what I do in
>> btrfs_destroy_inode(). If the path is locked by Task1, the deadlock
>> happens.
>
> Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL, so it's still
> possible for kswapd0 to call prune_icache(), no? I still see the lockdep
> warning even with your patch that clears __GFP_FS in open_ctree().

Sorry for my mistake. The above explanation is wrong; it has nothing to do
with the kswapd thread. The correct explanation is:

	Task1
	open()
	...
	btrfs_search_slot()
	...
	btrfs_cow_block()
	...
	alloc_page()
	  do_try_to_free_pages()
	    shrink_slab()
	    ...
	    shrink_icache_memory()
	    ...
	    btrfs_evict_inode()
	    ...
	    btrfs_search_slot()

If the path is locked by Task1, the deadlock happens. So it is impossible
for balance_pgdat() to trigger this lockdep warning. (My clearing-__GFP_FS
patch's changelog is also wrong.)

I see — besides the btree's page cache, the free space cache's page cache
is also special and cannot use the __GFP_FS flag.

Thanks
Miao
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Excerpts from Miao Xie's message of 2011-03-27 07:44:06 -0400:
> On Sun, 27 Mar 2011 20:09:10 +0900, Itaru Kitayama wrote:
>> Hi Miao,
>>
>> On Sun, 27 Mar 2011 15:00:00 +0800 Miao Xie mi...@cn.fujitsu.com wrote:
>>> I got it. It is because the allocation flag of the metadata's page
>>> cache, which is stored in the btree inode's i_mapping, was set to
>>> GFP_HIGHUSER_MOVABLE. So if we allocate pages for the btree's page
>>> cache, this lockdep warning will be triggered. I think even without my
>>> patch this lockdep warning can be triggered, because
>>> btrfs_evict_inode() does operations similar to what I do in
>>> btrfs_destroy_inode(). If the path is locked by Task1, the deadlock
>>> happens.
>>
>> Ok. balance_pgdat() calls shrink_slab() with GFP_KERNEL, so it's still
>> possible for kswapd0 to call prune_icache(), no? I still see the lockdep
>> warning even with your patch that clears __GFP_FS in open_ctree().
>
> Sorry for my mistake. The above explanation is wrong; it has nothing to
> do with the kswapd thread. The correct explanation is:
>
> 	Task1
> 	open()
> 	...
> 	btrfs_search_slot()
> 	...
> 	btrfs_cow_block()
> 	...
> 	alloc_page()
> 	  do_try_to_free_pages()
> 	    shrink_slab()
> 	    ...
> 	    shrink_icache_memory()
> 	    ...
> 	    btrfs_evict_inode()
> 	    ...
> 	    btrfs_search_slot()
>
> If the path is locked by Task1, the deadlock happens. So it is impossible
> for balance_pgdat() to trigger this lockdep warning. (My clearing-__GFP_FS
> patch's changelog is also wrong.)
>
> I see — besides the btree's page cache, the free space cache's page cache
> is also special and cannot use the __GFP_FS flag.

Ok, I've got your first patch already; I'll add a hunk for the free space
cache too. Most of the allocations we're doing are explicitly GFP_NOFS, so
it is just the supporting allocations and readahead that should be causing
trouble.

Thanks!
-chris
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Excerpts from Miao Xie's message of 2011-03-24 07:41:31 -0400:
> Changelog V4 -> V5:
> - Fix the race on adding the delayed node to the inode, which was spotted
>   by Chris Mason.
> - Merge Chris Mason's incremental patch into this patch.
> - Fix the deadlock between readdir() and memory fault, which was reported
>   by Itaru Kitayama.

This does do much better than V4, but I'm still hitting OOM with
stress.sh -n 50. I tried a bunch of variations but haven't been able to get
it quite right. I think I need to hold off on this one and get the 2.6.39
pull request out. I'll set up a .40 tree that has this in it and we can fix
the OOMs.

This is a great base for the work, and I'd like to add more items to the
delay, especially the initial inode insertions.

-chris
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Hi Miao,

Chris' stress test, stress.sh -n 50 -c /mnt/linux-2.6 /mnt, gave me another
lockdep splat (see below). I applied your V5 patches on top of the next-rc
branch. I haven't triggered it in my actual testing, but do you think we
can iterate over a list of block groups in a lockless manner using RCU?

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2164296..f40ff4e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -740,6 +740,7 @@ struct btrfs_space_info {
 	struct list_head block_groups[BTRFS_NR_RAID_TYPES];
 	spinlock_t lock;
 	struct rw_semaphore groups_sem;
+	struct srcu_struct groups_srcu;
 	atomic_t caching_threads;
 };

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9e4c9f4..22d6dbb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3003,6 +3003,7 @@ static int update_space_info(struct btrfs_fs_info *info, u64 flags,
 	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
 		INIT_LIST_HEAD(&found->block_groups[i]);
 	init_rwsem(&found->groups_sem);
+	init_srcu_struct(&found->groups_srcu);
 	spin_lock_init(&found->lock);
 	found->flags = flags & (BTRFS_BLOCK_GROUP_DATA |
 				BTRFS_BLOCK_GROUP_SYSTEM |
@@ -4853,6 +4854,7 @@ static noinline int find_free_extent(struct btrfs_trans_handle *trans,
 				     int data)
 {
 	int ret = 0;
+	int idx;
 	struct btrfs_root *root = orig_root->fs_info->extent_root;
 	struct btrfs_free_cluster *last_ptr = NULL;
 	struct btrfs_block_group_cache *block_group = NULL;
@@ -4929,7 +4931,7 @@ ideal_cache:
 	if (block_group && block_group_bits(block_group, data) &&
 	    (block_group->cached != BTRFS_CACHE_NO ||
 	     search_start == ideal_cache_offset)) {
-		down_read(&space_info->groups_sem);
+		idx = srcu_read_lock(&space_info->groups_srcu);
 		if (list_empty(&block_group->list) ||
 		    block_group->ro) {
 			/*
@@ -4939,7 +4941,7 @@ ideal_cache:
 			 * valid
 			 */
 			btrfs_put_block_group(block_group);
-			up_read(&space_info->groups_sem);
+			srcu_read_unlock(&space_info->groups_srcu, idx);
 		} else {
 			index = get_block_group_index(block_group);
 			goto have_block_group;
@@ -4949,8 +4951,8 @@ ideal_cache:
 		}
 	}
 search:
-	down_read(&space_info->groups_sem);
-	list_for_each_entry(block_group, &space_info->block_groups[index],
+	idx = srcu_read_lock(&space_info->groups_srcu);
+	list_for_each_entry_rcu(block_group, &space_info->block_groups[index],
 			    list) {
 		u64 offset;
 		int cached;
@@ -5197,8 +5199,8 @@ loop:
 		BUG_ON(index != get_block_group_index(block_group));
 		btrfs_put_block_group(block_group);
 	}
-	up_read(&space_info->groups_sem);
-
+	srcu_read_unlock(&space_info->groups_srcu, idx);
+
 	if (!ins->objectid && ++index < BTRFS_NR_RAID_TYPES)
 		goto search;

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.36-v5+ #2
---------------------------------------------------------
kswapd0/49 just changed the state of lock:
 (&delayed_node->mutex){+.+.-.}, at: [812131f7] btrfs_remove_delayed_node+0x3e/0xd2
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
 (&found->groups_sem){.+}

and interrupts could create inverse lock ordering between them.

other info that might help us debug this:

2 locks held by kswapd0/49:
 #0: (shrinker_rwsem){..}, at: [810e242a] shrink_slab+0x3d/0x164
 #1: (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213

the shortest dependencies between 2nd lock and 1st lock:
 -> (&found->groups_sem){.+} ops: 1334 {
    HARDIRQ-ON-W at:
      [81075ec0] __lock_acquire+0x346/0xda6
      [81076a3d] lock_acquire+0x11d/0x143
      [814c6a2a] down_write+0x55/0x9b
      [811c352a] __link_block_group+0x5a/0x83
      [811ca562] btrfs_read_block_groups+0x2fb/0x56c
      [811d4921] open_ctree+0xf78/0x14ab
      [811bafdf] btrfs_get_sb+0x236/0x467
      [8111f25e] vfs_kern_mount+0xbd/0x1a7
      [8111f3b0] do_kern_mount+0x4d/0xed
      [8113668d] do_mount+0x74e/0x7c5
      [8113678c] sys_mount+0x88/0xc2
Re: [PATCH V5 2/2] btrfs: implement delayed inode items operation
Hi,

there's one thing I want to bring up. It's not related to the delayed
functionality itself but to the git tree base of the patch: there's a merge
conflict when your patch is applied directly onto Linus' tree, but not when
it is applied onto Chris' tree.

On Thu, Mar 24, 2011 at 07:41:31PM +0800, Miao Xie wrote:
> ...
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6726,6 +6775,9 @@ void btrfs_destroy_inode(struct inode *inode)
>  	inode_tree_del(inode);
>  	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
>  free:
> +	ret = btrfs_remove_delayed_node(inode);
> +	BUG_ON(ret);
> +
>  	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
>  }

The call to kmem_cache_free has been replaced by

commit fa0d7e3de6d6fc5004ad9dea0dd6b286af8f03e9
Author: Nick Piggin npig...@kernel.dk
Date:   Fri Jan 7 17:49:49 2011 +1100

    fs: icache RCU free inodes

Relevant hunk:

--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6495,6 +6495,13 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
 	return inode;
 }
 
+static void btrfs_i_callback(struct rcu_head *head)
+{
+	struct inode *inode = container_of(head, struct inode, i_rcu);
+	INIT_LIST_HEAD(&inode->i_dentry);
+	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
 void btrfs_destroy_inode(struct inode *inode)
 {
 	struct btrfs_ordered_extent *ordered;
@@ -6564,7 +6571,7 @@ void btrfs_destroy_inode(struct inode *inode)
 	inode_tree_del(inode);
 	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-	kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }

I don't think this disqualifies all the testing already done, but maybe
it's time to rebase btrfs-unstable.git to .38. Chris?

dave