On 26.10.2018 14:53, Qu Wenruo wrote:
>
>
> On 2018/10/26 下午7:41, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>> btrfs D 0 5760 5324 0x00000000
>> Call Trace:
>> ? __schedule+0x243/0x800
>> schedule+0x33/0x90
>> btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>> ? wait_woken+0xa0/0xa0
>> btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>> btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>> btrfs_relocate_chunk+0x49/0x100 [btrfs]
>> btrfs_balance+0xbeb/0x1740 [btrfs]
>> btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>> btrfs_ioctl+0x1691/0x3110 [btrfs]
>> ? lockdep_hardirqs_on+0xed/0x180
>> ? __handle_mm_fault+0x8e7/0xfb0
>> ? _raw_spin_unlock+0x24/0x30
>> ? __handle_mm_fault+0x8e7/0xfb0
>> ? do_vfs_ioctl+0xa5/0x6e0
>> ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>> do_vfs_ioctl+0xa5/0x6e0
>> ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>> ksys_ioctl+0x3a/0x70
>> __x64_sys_ioctl+0x16/0x20
>> do_syscall_64+0x60/0x1b0
>> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back.
>
> Just a nitpick, would you please split long paragraphs with newlines?
>
> It makes reviewers' eyes less painful.
>
>> This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doesn't really consider the range of the passed
>> page.
>
>
>> When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect, then the cleanup code in
>> btrfs_cleanup_ordered_extents will be called. This function wrongly
>> assumes that the caller of run_delalloc_range always properly cleans
>> the first page of the range, hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range.
>
> Yes, that's the old assumption, at least well explained by some ASCII
> art (even if it's wrong).
>
>> This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
>
> The cause sounds valid, however would you please explain more about how
> such cleaning of an unrelated delalloc range happens?
So in my case the following happened: two delalloc ranges were created -
0-1m and 1m-128m - and their respective pages were dirtied. When page 0
(0-4k) went into writepage_delalloc, find_lock_delalloc_range returned
the range 0-1m, so the call to fill_delalloc instantiated OE 0-1m and
writeback continued as expected.

Now, when the 2nd page from range 0-1m was again set for writeback and
find_lock_delalloc_range was called with delalloc_start == 4096, it
actually returned the range 1m-128m. fill_delalloc was then called with
locked_page belonging to the already instantiated range, but with the
start/end arguments pointing to 1m-128m. If an error occurred in
run_delalloc_range in this case, btrfs_cleanup_ordered_extents would be
called, which would clear the Private2 bit for the pages in the 1m-128m
range and update the OE for all but its first page (since the code
wrongly assumed locked_page always belongs to the range currently being
instantiated).
>
> Even if the fix looks solid, it's still better to explain the cause a
> little more, as the more we understand the cause, the better solution we
> may get.
>
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
>
> And the solution also looks good to me, and much more robust, without
> any (possibly wrong) assumption.
>
> Thanks,
> Qu
>
>>
>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
>> ---
>>
>> V2:
>> * Fix compilation failure due to missing parentheses
>> * Fixed the "Fixes" tag.
>>
>> fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>> 1 file changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e1f00d8c24ce..5906564ae2e9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>>   * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>>   * to be released, which we want to happen only when finishing the ordered
>> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> - * fill_delalloc() callback already does proper cleanup for the first page of
>> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> - * range of the first page.
>> + * extent (btrfs_finish_ordered_io()).
>>   */
>>  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> -						 const u64 offset,
>> -						 const u64 bytes)
>> +						 struct page *locked_page,
>> +						 u64 offset, u64 bytes)
>>  {
>>  	unsigned long index = offset >> PAGE_SHIFT;
>>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
>> +	u64 page_start = page_offset(locked_page);
>> +	u64 page_end = page_start + PAGE_SIZE - 1;
>> +
>>  	struct page *page;
>> 
>>  	while (index <= end_index) {
>> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>  		ClearPagePrivate2(page);
>>  		put_page(page);
>>  	}
>> -	return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> -					    bytes - PAGE_SIZE, false);
>> +
>> +	/*
>> +	 * In case this page belongs to the delalloc range being instantiated
>> +	 * then skip it, since the first page of a range is going to be
>> +	 * properly cleaned up by the caller of run_delalloc_range
>> +	 */
>> +	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
>> +		offset += PAGE_SIZE;
>> +		bytes -= PAGE_SIZE;
>> +	}
>> +
>> +	return __endio_write_update_ordered(inode, offset, bytes, false);
>>  }
>> 
>>  static int btrfs_dirty_inode(struct inode *inode);
>> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>>  					     write_flags);
>>  	}
>>  	if (ret)
>> -		btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>> +		btrfs_cleanup_ordered_extents(inode, locked_page, start,
>> +					      end - start + 1);
>>  	return ret;
>>  }
>>
>>
>