On 09/29/2016 03:57 AM, Qu Wenruo wrote:
Thanks for your test script.
I succeeded in pinning down the problem.
The problem is, btrfs is invalidate pages that belongs to ordered
extent(goes through writeback)
No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.
The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
------
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096
This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?
^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
reserved=5222400, op=release
^^^^^ Here ordered_io finishes, write the whole 5M data, while
QGROUP_RESERVED only returned 5M - 20K.
------
The problem is at btrfs_add_delayed_data_ref(), we use the assumption
that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.
Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.
Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
|- btrfs_alloc_reserved_file_extent()
| ^^ Here we tell delayed_ref to free 5M later
|- btrfs_qgroup_release_data()
^^ We only released (5M - 20K), as the first 5 pages
are freed by btrfs_invalidatepage()
If btrfs is designed to freeing pages which belongs to ordered_extent,
then it totally breaks the qgroup assumption.
Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
Make btrfs_invalidatepage() skips pages that are already in
ordered_extent, then we're OK.
This is much like your original fix, but I'm not sure if DIRTY page
flag is bond to ordered_extent.
And more comment will help.
So, what you mean is that fix is correct, I just need to reword the
patch header.
2) Make btrfs_qgroup_release_data() return accurate num bytes
And then pass it to delayed_ref head.
This is quite anti-intuition. If we're adding a new extent, how
could it happen that some pages are already invalidated?
I agree. It is counter-intutive and will increase complexity.
Anyway, awesome work, exposed a quite strange race and makes us re-think
the base of qgroup reserve space framework.
Thanks.