On Wed, Nov 26, 2014 at 4:15 PM, Josef Bacik <jba...@fb.com> wrote: > On 11/26/2014 10:28 AM, Filipe Manana wrote: >> >> Our fs trim operation, which is completely transactionless (doesn't start >> or joins an existing transaction) consists of visiting all block groups >> and then for each one to iterate its free space entries and perform a >> discard operation against the space range represented by the free space >> entries. However before performing a discard, the corresponding free space >> entry is removed from the free space rbtree, and when the discard >> completes >> it is added back to the free space rbtree. >> >> If a block group remove operation happens while the discard is ongoing (or >> before it starts and after a free space entry is hidden), we end up not >> waiting for the discard to complete, remove the extent map that maps >> logical address to physical addresses and the corresponding chunk metadata >> from the the chunk and device trees. After that and before the discard >> completes, the current running transaction can finish and a new one start, >> allowing for new block groups that map to the same physical addresses to >> be allocated and written to. >> >> So fix this by keeping the extent map in memory until the discard >> completes >> so that the same physical addresses aren't reused before it completes. >> >> If the physical locations that are under a discard operation end up being >> used for a new metadata block group for example, and dirty metadata >> extents >> are written before the discard finishes (the VM might call writepages() of >> our btree inode's i_mapping for example, or an fsync log commit happens) >> we >> end up overwriting metadata with zeroes, which leads to errors from fsck >> like the following: >> >> checking extents >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> read block failed check_tree_block >> owner ref check failed [833912832 16384] >> Errors found in extent allocation tree or chunk allocation >> checking free space cache >> checking fs roots >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> Check tree block failed, want=833912832, have=0 >> read block failed check_tree_block >> root 5 root dir 256 error >> root 5 inode 260 errors 2001, no inode item, link count wrong >> unresolved ref dir 256 index 0 namelen 8 name foobar_3 >> filetype 1 errors 6, no dir index, no inode ref >> root 5 inode 262 errors 2001, no inode item, link count wrong >> unresolved ref dir 256 index 0 namelen 8 name foobar_5 >> filetype 1 errors 6, no dir index, no inode ref >> root 5 inode 263 errors 2001, no inode item, link count wrong >> (...) >> >> Signed-off-by: Filipe Manana <fdman...@suse.com> >> --- >> fs/btrfs/ctree.h | 13 ++++++++++++- >> fs/btrfs/disk-io.c | 14 ++++++++++++++ >> fs/btrfs/extent-tree.c | 24 +++++++++++++++++++++++- >> fs/btrfs/free-space-cache.c | 26 +++++++++++++++++++++++++- >> fs/btrfs/volumes.c | 33 ++++++++++++++++++++++++++------- >> 5 files changed, 100 insertions(+), 10 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 7f40a65..51056c7 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1278,6 +1278,7 @@ struct btrfs_block_group_cache { >> unsigned int dirty:1; >> unsigned int iref:1; >> unsigned int has_caching_ctl:1; >> + unsigned int removed:1; >> >> int disk_cache_state; >> >> @@ -1307,6 +1308,8 @@ struct btrfs_block_group_cache { >> >> /* For delayed block group creation or deletion of empty block >> groups */ >> struct list_head bg_list; >> + >> + atomic_t trimming; >> }; >> >> /* delayed seq elem */ >> @@ -1731,6 +1734,13 @@ struct btrfs_fs_info { >> >> /* For btrfs to record security options */ >> struct security_mnt_opts security_opts; >> + >> + /* >> + * Chunks that can't be freed yet (under a trim/discard operation) >> + * and will be latter freed. >> + */ >> + rwlock_t pinned_chunks_lock; >> + struct list_head pinned_chunks; >> }; >> >> struct btrfs_subvolume_writers { >> @@ -3353,7 +3363,8 @@ int btrfs_make_block_group(struct btrfs_trans_handle >> *trans, >> u64 type, u64 chunk_objectid, u64 chunk_offset, >> u64 size); >> int btrfs_remove_block_group(struct btrfs_trans_handle *trans, >> - struct btrfs_root *root, u64 group_start); >> + struct btrfs_root *root, u64 group_start, >> + bool *remove_em); >> void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); >> void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, >> struct btrfs_root *root); >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 9d4fb0a..76012d0 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -2397,6 +2397,9 @@ int open_ctree(struct super_block *sb, >> init_waitqueue_head(&fs_info->transaction_blocked_wait); >> init_waitqueue_head(&fs_info->async_submit_wait); >> >> + rwlock_init(&fs_info->pinned_chunks_lock); >> + INIT_LIST_HEAD(&fs_info->pinned_chunks); >> + >> ret = btrfs_alloc_stripe_hash_table(fs_info); >> if (ret) { >> err = ret; >> @@ -3726,6 +3729,17 @@ void close_ctree(struct btrfs_root *root) >> >> btrfs_free_block_rsv(root, root->orphan_block_rsv); >> root->orphan_block_rsv = NULL; >> + >> + write_lock(&fs_info->pinned_chunks_lock); >> + while (!list_empty(&fs_info->pinned_chunks)) { >> + struct extent_map *em; >> + >> + em = list_first_entry(&fs_info->pinned_chunks, >> + struct extent_map, list); >> + list_del_init(&em->list); >> + free_extent_map(em); >> + } >> + write_unlock(&fs_info->pinned_chunks_lock); >> } >> >> int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid, >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 92f61f2..4bf8f02 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -9016,6 +9016,7 @@ btrfs_create_block_group_cache(struct btrfs_root >> *root, u64 start, u64 size) >> INIT_LIST_HEAD(&cache->cluster_list); >> INIT_LIST_HEAD(&cache->bg_list); >> btrfs_init_free_space_ctl(cache); >> + atomic_set(&cache->trimming, 0); >> >> return cache; >> } >> @@ -9317,7 +9318,8 @@ static void clear_avail_alloc_bits(struct >> btrfs_fs_info *fs_info, u64 flags) >> } >> >> int btrfs_remove_block_group(struct btrfs_trans_handle *trans, >> - struct btrfs_root *root, u64 group_start) >> + struct btrfs_root *root, u64 group_start, >> + bool *remove_em) >> { >> struct btrfs_path *path; >> struct btrfs_block_group_cache *block_group; >> @@ -9474,6 +9476,26 @@ int btrfs_remove_block_group(struct >> btrfs_trans_handle *trans, >> >> memcpy(&key, &block_group->key, sizeof(key)); >> >> + spin_lock(&block_group->lock); >> + block_group->removed = 1; > > > Ok you set block_group->removed here, but you don't check it in > btrfs_trim_block_group, so we can easily race in afterwards and start > trimming this block group. > > >> + /* >> + * At this point trimming can't start on this block group, because >> we >> + * removed the block group from the tree >> fs_info->block_group_cache_tree >> + * so no one can't find it anymore. >> + * >> + * And we must tell our caller to not remove the extent map from >> the >> + * fs_info->mapping_tree to prevent the same logical address range >> and >> + * physical device space ranges from being reused for a new block >> group. >> + * This is because our fs trim operation (btrfs_trim_fs(), >> + * btrfs_ioctl_fitrim()) is completely transactionless, so while >> its >> + * trimming a range the currently running transaction might finish >> and >> + * a new one start, allowing for new block groups to be created >> that can >> + * reuse the same physical device locations unless we take this >> special >> + * care. >> + */ >> + *remove_em = (atomic_read(&block_group->trimming) == 0); >> + spin_unlock(&block_group->lock); >> + >> btrfs_put_block_group(block_group); >> btrfs_put_block_group(block_group); >> >> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c >> index 3384819..16c2d39 100644 >> --- a/fs/btrfs/free-space-cache.c >> +++ b/fs/btrfs/free-space-cache.c >> @@ -3101,11 +3101,35 @@ int btrfs_trim_block_group(struct >> btrfs_block_group_cache *block_group, >> >> *trimmed = 0; >> >> + atomic_inc(&block_group->trimming); >> + > > > This needs to be something like this > > > spin_lock(&block_group->lock); > if (block_group->removed == 1) { > spin_unlock(&block_group->lock); > return 0; > } > atomic_inc(&block_group->trimming); > spin_unlock(&block_group->lock); > > To be properly safe. Thanks,
Hi Josef, it is safe. btrfs_trim_block_group() checks for ->removed after decrementing the atomic. Before it starts trimming, it increments the counter - if the block group removal already started it isn't a problem, because it removed all the free space entries from the rbtree while holding the tree's lock. > > Josef > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Filipe David Manana, "Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men." -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html