On 26.10.2018 14:53, Qu Wenruo wrote:
> 
> 
> On 2018/10/26 7:41 PM, Nikolay Borisov wrote:
>> Running btrfs/124 in a loop hung up on me sporadically with the
>> following call trace:
>>      btrfs           D    0  5760   5324 0x00000000
>>      Call Trace:
>>       ? __schedule+0x243/0x800
>>       schedule+0x33/0x90
>>       btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>>       ? wait_woken+0xa0/0xa0
>>       btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>>       btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>>       btrfs_relocate_chunk+0x49/0x100 [btrfs]
>>       btrfs_balance+0xbeb/0x1740 [btrfs]
>>       btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>>       btrfs_ioctl+0x1691/0x3110 [btrfs]
>>       ? lockdep_hardirqs_on+0xed/0x180
>>       ? __handle_mm_fault+0x8e7/0xfb0
>>       ? _raw_spin_unlock+0x24/0x30
>>       ? __handle_mm_fault+0x8e7/0xfb0
>>       ? do_vfs_ioctl+0xa5/0x6e0
>>       ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>>       do_vfs_ioctl+0xa5/0x6e0
>>       ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>>       ksys_ioctl+0x3a/0x70
>>       __x64_sys_ioctl+0x16/0x20
>>       do_syscall_64+0x60/0x1b0
>>       entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> Turns out during page writeback it's possible that the code in
>> writepage_delalloc can instantiate a delalloc range which doesn't
>> belong to the page currently being written back.
> 
> Just a nitpick, would you please split long paragraphs with newlines?
> 
> It makes reviewers' eyes less painful.
> 
>> This happens since
>> find_lock_delalloc_range returns up to BTRFS_MAX_EXTENT_SIZE delalloc
>> range when asked and doesn't really consider the range of the passed
>> page.
> 
> 
>> When such a foreign range is found the code proceeds to
>> run_delalloc_range and calls the appropriate function to fill the
>> delalloc and create ordered extents. If, however, a failure occurs
>> while this operation is in effect then the clean up code in
>> btrfs_cleanup_ordered_extents will be called. This function has the
>> wrong assumption that the caller of run_delalloc_range always properly
>> cleans the first page of the range, hence when it calls
>> __endio_write_update_ordered it explicitly omits the first page of
>> the delalloc range.
> 
> Yes, that's the old assumption, which was at least well explained by
> some ASCII art (even though it's wrong).
> 
>> This is wrong because this function could be
>> cleaning a delalloc range that doesn't belong to the current page. This
>> in turn means that the page cleanup code in __extent_writepage will
>> not really free the initial page for the range, leaving a hanging
>> ordered extent with bytes_left set to 4k. This bug has been present
>> ever since the original introduction of the cleanup code in 524272607e88.
> 
> The cause sounds valid, however would you please explain a bit more
> about how such cleaning of an unrelated delalloc range happens?

So in my case the following happened: two delalloc ranges were created -
0-1m and 1m-128m. Their respective pages were dirtied, so when page 0
(covering 0-4k) went into writepage_delalloc, find_lock_delalloc_range
returned the range 0-1m. The call to fill_delalloc then instantiated
OE 0-1m and writeback continued as expected.

Now, when the 2nd page of the 0-1m range was set for writeback and
find_lock_delalloc_range was called with delalloc_start == 4096, it
actually returned the range 1m-128m. fill_delalloc was then called with
locked_page belonging to the already-instantiated range but with the
start/end arguments pointing to 1m-128m. If an error occurs in
run_delalloc_range at this point, btrfs_cleanup_ordered_extents is
called: it clears the Private2 bit for the pages belonging to the
1m-128m range, but cleans up the OE for all but its first page, since
the code wrongly assumed locked_page always belongs to the range
currently being instantiated.

> 
> Even though the fix looks solid, it's still better to explain the cause
> a little more, as the more we understand the cause, the better a
> solution we may get.
> 
>>
>> Fix this by correctly checking whether the current page belongs to the
>> range being instantiated and if so correctly adjust the range parameters
>> passed for cleaning up. If it doesn't, then just clean the whole OE
>> range directly.
> 
> And the solution also looks good to me, and much more robust, without
> any (possibly wrong) assumption.
> 
> Thanks,
> Qu
> 
>>
>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>> Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
>> ---
>>
>> V2: 
>>  * Fix compilation failure due to missing parentheses
>>  * Fixed the "Fixes" tag. 
>>
>>  fs/btrfs/inode.c | 29 ++++++++++++++++++++---------
>>  1 file changed, 20 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index e1f00d8c24ce..5906564ae2e9 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -109,17 +109,17 @@ static void __endio_write_update_ordered(struct inode *inode,
>>   * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
>>   * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata
>>   * to be released, which we want to happen only when finishing the ordered
>> - * extent (btrfs_finish_ordered_io()). Also note that the caller of the
>> - * fill_delalloc() callback already does proper cleanup for the first page of
>> - * the range, that is, it invokes the callback writepage_end_io_hook() for the
>> - * range of the first page.
>> + * extent (btrfs_finish_ordered_io()).
>>   */
>>  static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>> -                                             const u64 offset,
>> -                                             const u64 bytes)
>> +                                             struct page *locked_page,
>> +                                             u64 offset, u64 bytes)
>>  {
>>      unsigned long index = offset >> PAGE_SHIFT;
>>      unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
>> +    u64 page_start = page_offset(locked_page);
>> +    u64 page_end = page_start + PAGE_SIZE - 1;
>> +
>>      struct page *page;
>>  
>>      while (index <= end_index) {
>> @@ -130,8 +130,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
>>              ClearPagePrivate2(page);
>>              put_page(page);
>>      }
>> -    return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
>> -                                        bytes - PAGE_SIZE, false);
>> +
>> +    /*
>> +     * In case this page belongs to the delalloc range being instantiated
>> +     * then skip it, since the first page of a range is going to be
>> +     * properly cleaned up by the caller of run_delalloc_range
>> +     */
>> +    if (page_start >= offset && page_end <= (offset + bytes - 1)) {
>> +            offset += PAGE_SIZE;
>> +            bytes -= PAGE_SIZE;
>> +    }
>> +
>> +    return __endio_write_update_ordered(inode, offset, bytes, false);
>>  }
>>  
>>  static int btrfs_dirty_inode(struct inode *inode);
>> @@ -1606,7 +1616,8 @@ static int run_delalloc_range(void *private_data, struct page *locked_page,
>>                                         write_flags);
>>      }
>>      if (ret)
>> -            btrfs_cleanup_ordered_extents(inode, start, end - start + 1);
>> +            btrfs_cleanup_ordered_extents(inode, locked_page, start,
>> +                                          end - start + 1);
>>      return ret;
>>  }
>>  
>>
> 
