On Mon, Mar 06, 2017 at 10:55:46AM +0800, Qu Wenruo wrote: > [BUG] > When btrfs_reloc_clone_csum() reports error, it can underflow metadata > and leads to kernel assertion on outstanding extents in > run_delalloc_nocow() and cow_file_range(). > > BTRFS info (device vdb5): relocating block group 12582912 flags data > BTRFS info (device vdb5): found 1 extents > assertion failed: inode->outstanding_extents >= num_extents, file: > fs/btrfs//extent-tree.c, line: 5858 > > Currently, due to another bug blocking ordered extents, the bug is only > reproducible under certain block group layout and using error injection. > > a) Create one data block group with one 4K extent in it. > To avoid the bug that hangs btrfs due to ordered extent which never > finishes > b) Make btrfs_reloc_clone_csum() always fail > c) Relocate that block group > > [CAUSE] > run_delalloc_nocow() and cow_file_range() handles error from > btrfs_reloc_clone_csum() wrongly: > > (The ascii chart shows a more generic case of this bug other than the > bug mentioned above) > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- cleanup range --------------->| > |<----------- ----------->| > \/ > btrfs_finish_ordered_io() range > > So error handler, which calls extent_clear_unlock_delalloc() with > EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io() > will both cover OE n, and free its metadata, causing metadata under flow. > > [Fix] > The fix is to ensure after calling btrfs_add_ordered_extent(), we only > call error handler after increasing the iteration offset, so that > cleanup range won't cover any created ordered extent. > > |<------------------ delalloc range --------------------------->| > | OE 1 | OE 2 | ... | OE n | > |<----------- ----------->|<---------- cleanup range --------->| > \/ > btrfs_finish_ordered_io() range > > Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com> > --- > changelog: > v6: > New, split from v5 patch, as this is a separate bug. > --- > fs/btrfs/inode.c | 51 +++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 39 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index b2bc07aad1ae..1d83d504f2e5 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -998,15 +998,24 @@ static noinline int cow_file_range(struct inode *inode, > BTRFS_DATA_RELOC_TREE_OBJECTID) { > ret = btrfs_reloc_clone_csums(inode, start, > cur_alloc_size); > + /* > + * Only drop cache here, and process as normal. > + * > + * We must not allow extent_clear_unlock_delalloc() > + * at out_unlock label to free meta of this ordered > + * extent, as its meta should be freed by > + * btrfs_finish_ordered_io(). > + * > + * So we must continue until @start is increased to > + * skip current ordered extent. > + */ > if (ret) > - goto out_drop_extent_cache; > + btrfs_drop_extent_cache(BTRFS_I(inode), start, > + start + ram_size - 1, 0); > } > > btrfs_dec_block_group_reservations(fs_info, ins.objectid); > > - if (disk_num_bytes < cur_alloc_size) > - break; > - > /* we're not doing compressed IO, don't unlock the first > * page (which the caller expects to stay locked), don't > * clear any dirty bits and don't set any writeback bits > @@ -1022,10 +1031,21 @@ static noinline int cow_file_range(struct inode > *inode, > delalloc_end, locked_page, > EXTENT_LOCKED | EXTENT_DELALLOC, > op); > - disk_num_bytes -= cur_alloc_size; > + if (disk_num_bytes > cur_alloc_size) > + disk_num_bytes = 0; > + else > + disk_num_bytes -= cur_alloc_size;
I don't get the logic here, why do we 'break' if disk_num_bytes > cur_alloc_size? Thanks, -liubo > num_bytes -= cur_alloc_size; > alloc_hint = ins.objectid + ins.offset; > start += cur_alloc_size; > + > + /* > + * btrfs_reloc_clone_csums() error, since start is increased > + * extent_clear_unlock_delalloc() at out_unlock label won't > + * free metadata of current ordered extent, we're OK to exit. > + */ > + if (ret) > + goto out_unlock; > } > out: > return ret; > @@ -1414,15 +1434,14 @@ static noinline int run_delalloc_nocow(struct inode > *inode, > BUG_ON(ret); /* -ENOMEM */ > > if (root->root_key.objectid == > - BTRFS_DATA_RELOC_TREE_OBJECTID) { > + BTRFS_DATA_RELOC_TREE_OBJECTID) > + /* > + * Error handled later, as we must prevent > + * extent_clear_unlock_delalloc() in error handler > + * from freeing metadata of created ordered extent. > + */ > ret = btrfs_reloc_clone_csums(inode, cur_offset, > num_bytes); > - if (ret) { > - if (!nolock && nocow) > - btrfs_end_write_no_snapshoting(root); > - goto error; > - } > - } > > extent_clear_unlock_delalloc(inode, cur_offset, > cur_offset + num_bytes - 1, end, > @@ -1434,6 +1453,14 @@ static noinline int run_delalloc_nocow(struct inode > *inode, > if (!nolock && nocow) > btrfs_end_write_no_snapshoting(root); > cur_offset = extent_end; > + > + /* > + * btrfs_reloc_clone_csums() error, now we're OK to call error > + * handler, as metadata for created ordered extent will only > + * be freed by btrfs_finish_ordered_io(). > + */ > + if (ret) > + goto error; > if (cur_offset > end) > break; > } > -- > 2.12.0 > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html