At 09/30/2016 10:18 AM, Qu Wenruo wrote:
Your original fix is good.

Feel free to my tag after enriching the comment in btrfs_invalidatepage().
Feel free to "add" my tag...

Reviewed-by: Qu Wenruo <quwen...@cn.fujitsu.com>

Thanks again for your founding and sorry for the extra time reviewing.
for your "finding"...

My English just sucks....

Thanks
Qu


At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote:


On 09/29/2016 03:57 AM, Qu Wenruo wrote:
Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered
extent(goes through writeback)


No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.


The ftrace(with more tracepoints to trace qgroup->reserved change)
shows:
------
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096

This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?



^^^^^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
reserved=5222400, op=release

^^^^^ Here ordered_io finishes, write the whole 5M data, while
QGROUP_RESERVED only returned 5M - 20K.
------

The problem is at btrfs_add_delayed_data_ref(), we use the assumption
that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.

Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.


Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
      ^^ We only released (5M - 20K), as the first 5 pages
         are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent,
then it totally breaks the qgroup assumption.

Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.


So, what you mean is that fix is correct, I just need to reword the
patch header.


2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?


I agree. It is counter-intutive and will increase complexity.

Anyway, awesome work, exposed a quite strange race and makes us re-think
the base of qgroup reserve space framework.

Thanks.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to