On 09/29/2016 03:57 AM, Qu Wenruo wrote:
> Thanks for your test script.
> 
> I succeeded in pinning down the problem.
> 
> The problem is, btrfs is invalidate pages that belongs to ordered
> extent(goes through writeback)


No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.

> 
> The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
> ------
> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096

This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?

> 
> 
> ^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.
> 
> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
> reserved=5222400, op=release
> 
> ^^^^^ Here ordered_io finishes, write the whole 5M data, while
> QGROUP_RESERVED only returned 5M - 20K.
> ------
> 
> The problem is at btrfs_add_delayed_data_ref(), we use the assumption
> that, qgroup reserved space is the same as extent size(5M)
> But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
> Leaking the 20K.

Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.

> 
> Calltrace:
> btrfs_finish_ordered_io()
> |- inserted_reserved_file_extent()
>    |- btrfs_alloc_reserved_file_extent()
>    |  ^^ Here we tell delayed_ref to free 5M later
>    |- btrfs_qgroup_release_data()
>       ^^ We only released (5M - 20K), as the first 5 pages
>          are freed by btrfs_invalidatepage()
> 
> If btrfs is designed to freeing pages which belongs to ordered_extent,
> then it totally breaks the qgroup assumption.
> 
> Then we have several ways to fix it:
> 1) Fix race between ordered extent(writeback) with invalidatepage()
>    Make btrfs_invalidatepage() skips pages that are already in
>    ordered_extent, then we're OK.
> 
>    This is much like your original fix, but I'm not sure if DIRTY page
>    flag is bond to ordered_extent.
>    And more comment will help.


So, what you mean is that fix is correct, I just need to reword the
patch header.

> 
> 2) Make btrfs_qgroup_release_data() return accurate num bytes
>    And then pass it to delayed_ref head.
> 
>    This is quite anti-intuition. If we're adding a new extent, how
>    could it happen that some pages are already invalidated?
> 

I agree. It is counter-intutive and will increase complexity.

> Anyway, awesome work, exposed a quite strange race and makes us re-think
> the base of qgroup reserve space framework.

Thanks.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to