[PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-30 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

While free'ing qgroup->reserved resources, we much check if
the page has not been invalidated by a truncate operation
by checking if the page is still dirty before reducing the
qgroup resources. Resources in such a case are free'd when
the entire extent is released by delayed_ref.

This fixes a double accounting while releasing resources
in case of truncating a file, reproduced by the following testcase.

SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/mnt
mkfs.btrfs -f $SCRATCH_DEV
mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT
cd $SCRATCH_MNT
btrfs quota enable $SCRATCH_MNT
btrfs subvolume create a
btrfs qgroup limit 500m a $SCRATCH_MNT
sync
for c in {1..15}; do
dd if=/dev/zero  bs=1M count=40 of=$SCRATCH_MNT/a/file;
done

sleep 10
sync
sleep 5

touch $SCRATCH_MNT/a/newfile

echo "Removing file"
rm $SCRATCH_MNT/a/file

Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page")
Signed-off-by: Goldwyn Rodrigues 
Reviewed-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..bc1a004 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8915,9 +8915,14 @@ again:
 *So even we call qgroup_free_data(), it won't decrease reserved
 *space.
 * 2) Not written to disk
-*This means the reserved space should be freed here.
+*This means the reserved space should be freed here. However,
+*if a truncate invalidates the page (by clearing PageDirty)
+*and the page is accounted for while allocating extent
+*in btrfs_check_data_free_space() we let delayed_ref to
+*free the entire extent.
 */
-   btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
+   if (PageDirty(page))
+   btrfs_qgroup_free_data(inode, page_start, PAGE_SIZE);
if (!inode_evicting) {
clear_extent_bit(tree, page_start, page_end,
 EXTENT_LOCKED | EXTENT_DIRTY |
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-29 Thread Qu Wenruo



At 09/30/2016 10:18 AM, Qu Wenruo wrote:

Your original fix is good.

Feel free to my tag after enriching the comment in btrfs_invalidatepage().

Feel free to "add" my tag...


Reviewed-by: Qu Wenruo 

Thanks again for your founding and sorry for the extra time reviewing.

for your "finding"...

My English just sucks


Thanks
Qu


At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote:



On 09/29/2016 03:57 AM, Qu Wenruo wrote:

Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered
extent(goes through writeback)



No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.



The ftrace(with more tracepoints to trace qgroup->reserved change)
shows:
--
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096


This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?




^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
reserved=5222400, op=release

^ Here ordered_io finishes, write the whole 5M data, while
QGROUP_RESERVED only returned 5M - 20K.
--

The problem is at btrfs_add_delayed_data_ref(), we use the assumption
that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.


Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.



Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
  ^^ We only released (5M - 20K), as the first 5 pages
 are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent,
then it totally breaks the qgroup assumption.

Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.



So, what you mean is that fix is correct, I just need to reword the
patch header.



2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?



I agree. It is counter-intutive and will increase complexity.


Anyway, awesome work, exposed a quite strange race and makes us re-think
the base of qgroup reserve space framework.


Thanks.




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-29 Thread Qu Wenruo

Your original fix is good.

Feel free to my tag after enriching the comment in btrfs_invalidatepage().

Reviewed-by: Qu Wenruo 

Thanks again for your founding and sorry for the extra time reviewing.

Thanks
Qu


At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote:



On 09/29/2016 03:57 AM, Qu Wenruo wrote:

Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered
extent(goes through writeback)



No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.



The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
--
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096


This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?




^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
reserved=5222400, op=release

^ Here ordered_io finishes, write the whole 5M data, while
QGROUP_RESERVED only returned 5M - 20K.
--

The problem is at btrfs_add_delayed_data_ref(), we use the assumption
that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.


Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.



Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
  ^^ We only released (5M - 20K), as the first 5 pages
 are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent,
then it totally breaks the qgroup assumption.

Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.



So, what you mean is that fix is correct, I just need to reword the
patch header.



2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?



I agree. It is counter-intutive and will increase complexity.


Anyway, awesome work, exposed a quite strange race and makes us re-think
the base of qgroup reserve space framework.


Thanks.




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-29 Thread Qu Wenruo



At 09/29/2016 07:13 PM, Goldwyn Rodrigues wrote:



On 09/29/2016 03:57 AM, Qu Wenruo wrote:

Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered
extent(goes through writeback)



No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.



The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
--
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
reserved=4096, op=free
qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096


This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?


Sorry, I just submitted them few minutes ago.

Have CCed you.






^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
reserved=5222400, op=release

^ Here ordered_io finishes, write the whole 5M data, while
QGROUP_RESERVED only returned 5M - 20K.
--

The problem is at btrfs_add_delayed_data_ref(), we use the assumption
that, qgroup reserved space is the same as extent size(5M)
But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.


Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.



Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
  ^^ We only released (5M - 20K), as the first 5 pages
 are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent,
then it totally breaks the qgroup assumption.

Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.



So, what you mean is that fix is correct, I just need to reword the
patch header.


Not patch header, but comment before PageDirty() check.

I'm still digging the code to see if page DIRTY check is safe and if 
there is more elegant fix.


Thanks,
Qu





2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?



I agree. It is counter-intutive and will increase complexity.


Anyway, awesome work, exposed a quite strange race and makes us re-think
the base of qgroup reserve space framework.


Thanks.




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-29 Thread Goldwyn Rodrigues


On 09/29/2016 03:57 AM, Qu Wenruo wrote:
> Thanks for your test script.
> 
> I succeeded in pinning down the problem.
> 
> The problem is, btrfs is invalidate pages that belongs to ordered
> extent(goes through writeback)


No, I don't think invalidated pages are going through writeback. The
problem is that the space for extent allocation is done before the
writeback and if pages are invalidated before writeback, it is not
accounted for.

> 
> The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
> --
> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
> btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096,
> reserved=4096, op=free
> qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096

This is a little confusing. There is no qgroup_update_reserve() function
in the source. Where did you add this?

> 
> 
> ^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.
> 
> btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880,
> reserved=5222400, op=release
> 
> ^ Here ordered_io finishes, write the whole 5M data, while
> QGROUP_RESERVED only returned 5M - 20K.
> --
> 
> The problem is at btrfs_add_delayed_data_ref(), we use the assumption
> that, qgroup reserved space is the same as extent size(5M)
> But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
> Leaking the 20K.

Yes, this is more like it. However, I would put it the other way. It
releases 5M when it should have released 5M-20K, because 20k was
released when invalidating page.

> 
> Calltrace:
> btrfs_finish_ordered_io()
> |- inserted_reserved_file_extent()
>|- btrfs_alloc_reserved_file_extent()
>|  ^^ Here we tell delayed_ref to free 5M later
>|- btrfs_qgroup_release_data()
>   ^^ We only released (5M - 20K), as the first 5 pages
>  are freed by btrfs_invalidatepage()
> 
> If btrfs is designed to freeing pages which belongs to ordered_extent,
> then it totally breaks the qgroup assumption.
> 
> Then we have several ways to fix it:
> 1) Fix race between ordered extent(writeback) with invalidatepage()
>Make btrfs_invalidatepage() skips pages that are already in
>ordered_extent, then we're OK.
> 
>This is much like your original fix, but I'm not sure if DIRTY page
>flag is bond to ordered_extent.
>And more comment will help.


So, what you mean is that fix is correct, I just need to reword the
patch header.

> 
> 2) Make btrfs_qgroup_release_data() return accurate num bytes
>And then pass it to delayed_ref head.
> 
>This is quite anti-intuition. If we're adding a new extent, how
>could it happen that some pages are already invalidated?
> 

I agree. It is counter-intutive and will increase complexity.

> Anyway, awesome work, exposed a quite strange race and makes us re-think
> the base of qgroup reserve space framework.

Thanks.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-29 Thread Qu Wenruo

Thanks for your test script.

I succeeded in pinning down the problem.

The problem is, btrfs is invalidate pages that belongs to ordered 
extent(goes through writeback)


The ftrace(with more tracepoints to trace qgroup->reserved change) shows:
--
btrfs_qgroup_release_data: root=257, ino=257, start=0, len=4096, 
reserved=4096, op=free

qgroup_update_reserve: qgid = 257, cur_reserved = 16171008, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=4096, len=4096, 
reserved=4096, op=free

qgroup_update_reserve: qgid = 257, cur_reserved = 16166912, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=8192, len=4096, 
reserved=4096, op=free

qgroup_update_reserve: qgid = 257, cur_reserved = 16162816, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=12288, len=4096, 
reserved=4096, op=free

qgroup_update_reserve: qgid = 257, cur_reserved = 16158720, diff = -4096
btrfs_qgroup_release_data: root=257, ino=257, start=16384, len=4096, 
reserved=4096, op=free

qgroup_update_reserve: qgid = 257, cur_reserved = 16154624, diff = -4096


^ Btrfs invalidates 5 pages of ino 257 qg 257, range 0 ~ 20K.

btrfs_qgroup_release_data: root=257, ino=257, start=0, len=5242880, 
reserved=5222400, op=release


^ Here ordered_io finishes, write the whole 5M data, while 
QGROUP_RESERVED only returned 5M - 20K.

--

The problem is at btrfs_add_delayed_data_ref(), we use the assumption 
that, qgroup reserved space is the same as extent size(5M)

But in fact, btrfs_qgroup_release_data() only release (5M - 20K).
Leaking the 20K.

Calltrace:
btrfs_finish_ordered_io()
|- inserted_reserved_file_extent()
   |- btrfs_alloc_reserved_file_extent()
   |  ^^ Here we tell delayed_ref to free 5M later
   |- btrfs_qgroup_release_data()
  ^^ We only released (5M - 20K), as the first 5 pages
 are freed by btrfs_invalidatepage()

If btrfs is designed to freeing pages which belongs to ordered_extent, 
then it totally breaks the qgroup assumption.


Then we have several ways to fix it:
1) Fix race between ordered extent(writeback) with invalidatepage()
   Make btrfs_invalidatepage() skips pages that are already in
   ordered_extent, then we're OK.

   This is much like your original fix, but I'm not sure if DIRTY page
   flag is bond to ordered_extent.
   And more comment will help.

2) Make btrfs_qgroup_release_data() return accurate num bytes
   And then pass it to delayed_ref head.

   This is quite anti-intuition. If we're adding a new extent, how
   could it happen that some pages are already invalidated?


Anyway, awesome work, exposed a quite strange race and makes us re-think 
the base of qgroup reserve space framework.


Thanks,
Qu

At 09/28/2016 10:19 AM, Goldwyn Rodrigues wrote:



On 09/27/2016 08:44 PM, Qu Wenruo wrote:




Finally reproduced it.

Although in a low chance, about 1/10.

Under most case, the final remove gets executed without error.


Change 50m to 500m while setting the qgroup limit, the probability will
increase.




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-27 Thread Goldwyn Rodrigues


On 09/27/2016 08:44 PM, Qu Wenruo wrote:



> Finally reproduced it.
> 
> Although in a low chance, about 1/10.
> 
> Under most case, the final remove gets executed without error.

Change 50m to 500m while setting the qgroup limit, the probability will
increase.

-- 
Goldwyn
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-27 Thread Qu Wenruo



At 09/27/2016 10:04 PM, Goldwyn Rodrigues wrote:



On 09/26/2016 10:10 PM, Qu Wenruo wrote:



At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:



On 09/25/2016 09:33 PM, Qu Wenruo wrote:



At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:




[snipped]


Sorry I still don't get the point.
Would you please give a call flow of the timing dirtying page and
calling btrfs_qgroup_reserve/free/release_data()?

Like:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty


[[Timing of btrfs_invalidatepage()]]
About your commit message "once while invalidating the page, and the
next time while free'ing delalloc."
"Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().

If so, it means one extent goes through full write back, and long before
calling btrfs_qgroup_free_delayed_ref(), it will call
btrfs_qgroup_release_data() to clear QGROUP_RESERVED.

So the call will be:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty


run_delalloc_range()
|- cow_file_range()
   |- extent_clear_unlock_delalloc() <- Clear page dirty



btrfs_finish_ordered_io()
|- insert_reserved_file_extent()
   |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
 but not decrease reserved space


run_one_delyaed_refs()
|- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space


So the problem seems to be, btrfs_invalidatepage() is called after
run_delalloc_range() but before btrfs_finish_ordered_io().

Did you mean that?


This happens event before a writeback happens. So, here is what is
happening. This is in reference, and specific to the test case in the
description.


The flowchart helps a lot, thanks.

But still some flow is still strange.


Process: dd - first time
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty

Please note data writeback does _not_ happen/complete.


This part is completely fine.



Process: dd - next time, new process
sys_open O_TRUNC
.
 |-btrfs_setattr()
   |-truncate_pagecache()
 |-truncate_inode_pages_range()
|-truncate_inode_page() - Page is cleared of Dirty flag.
  |-btrfs_invalidatepage(page)
|-__btrfs_qgroup_release_data()
  |-btrfs_qgroup_free_refroot() - qgroup->reserved is
reduced by PAGESIZE.


That's OK too. No problem.
Although the __btrfs_qgroup_release_data() is called by
btrfs_qgroup_free_data().
Which cleared QGROUP_RESERVED flag.


So the call flow should be

  |-btrfs_setattr()
|-truncate_pagecache()
  |-truncate_inode_pages_range()
 |-truncate_inode_page()<- Clear page dirty
   |-btrfs_invalidatepage(page)
 |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
   and clear QGROUP_RESERVED

After btrfs_setattr(), the new dd write data.
So __btrfs_buffered_write() happens again,
dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.

So here I don't see any problem
buffered_write:
  Mark dirty
  Mark QGROUP_RESERVED
  Increase qgroup->reserved

btrfs_setattr:
  Clear dirty
  Clear QGROUP_RESERVED
  Decrease qgroup->reserved

All pairs with each other, no leak no underflow.


Yes, but the problem is
run_one_delayed_ref()
 |-btrfs_qgroup_free_delayed_ref
uses delayed_ref_head->qgroup_reserved to reduce the count. Which
function is responsible for decreasing delayed_ref_head->qgroup_reserved
when btrfs_invalidatepage() is reducing the count?


Decreasing qgroup_reserved can happen in two ways:

1) Through qgroup_release_data() + qgroup_free_delayed_ref() combination
   That's used for case like writing pages into disk.

   The work flow is:
__btrfs_buffered_write()
|- btrfs_check_free_space()
   |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED flag
 Increase qgroup->reserved.

run_dealloc_range()  <- Write data into disk
finish_ordered_io()  <- Update metadata
|- add_delayed_data_ref()   <- Create a new data_ref, record
|  how much qgroup->reserved needs to
|  free
|- btrfs_qgroup_release_data()  <- Only clearing QGROUP_RESERVED flag
   But qgroup->reserved is not decreased

commit_trans()
|- run_delayed_refs()
   |- run_one_delayed_ref()
  |- btrfs_qgroup_free_delayed_ref() <- Decrease qgroup->reserved

  So if we hit btrfs_qgroup_free_delayed_ref(), it means one extent
  hits disk.

  The complexity is because qgroup is done at commit_trans() time, so
  qgroup excl/rfer is not updated at finish_ordered_io() time.
  If we decrease qgroup->reserved here, then we can easily exceed qgroup
  

Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-27 Thread Goldwyn Rodrigues


On 09/26/2016 10:10 PM, Qu Wenruo wrote:
> 
> 
> At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:
>>
>>
>> On 09/25/2016 09:33 PM, Qu Wenruo wrote:
>>>
>>>
>>> At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:


> [snipped]
>>>
>>> Sorry I still don't get the point.
>>> Would you please give a call flow of the timing dirtying page and
>>> calling btrfs_qgroup_reserve/free/release_data()?
>>>
>>> Like:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages()<- Mark page dirty
>>>
>>>
>>> [[Timing of btrfs_invalidatepage()]]
>>> About your commit message "once while invalidating the page, and the
>>> next time while free'ing delalloc."
>>> "Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().
>>>
>>> If so, it means one extent goes through full write back, and long before
>>> calling btrfs_qgroup_free_delayed_ref(), it will call
>>> btrfs_qgroup_release_data() to clear QGROUP_RESERVED.
>>>
>>> So the call will be:
>>> __btrfs_buffered_write()
>>> |- btrfs_check_data_free_space()
>>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>>> |- btrfs_dirty_pages()<- Mark page dirty
>>>
>>> 
>>> run_delalloc_range()
>>> |- cow_file_range()
>>>|- extent_clear_unlock_delalloc() <- Clear page dirty
>>>
>>> 
>>>
>>> btrfs_finish_ordered_io()
>>> |- insert_reserved_file_extent()
>>>|- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
>>>  but not decrease reserved space
>>>
>>> 
>>> run_one_delyaed_refs()
>>> |- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space
>>>
>>>
>>> So the problem seems to be, btrfs_invalidatepage() is called after
>>> run_delalloc_range() but before btrfs_finish_ordered_io().
>>>
>>> Did you mean that?
>>
>> This happens event before a writeback happens. So, here is what is
>> happening. This is in reference, and specific to the test case in the
>> description.
> 
> The flowchart helps a lot, thanks.
> 
> But still some flow is still strange.
>>
>> Process: dd - first time
>> __btrfs_buffered_write()
>> |- btrfs_check_data_free_space()
>> |  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
>> |- btrfs_dirty_pages()<- Mark page dirty
>>
>> Please note data writeback does _not_ happen/complete.
> 
> This part is completely fine.
> 
>>
>> Process: dd - next time, new process
>> sys_open O_TRUNC
>> .
>>  |-btrfs_setattr()
>>|-truncate_pagecache()
>>  |-truncate_inode_pages_range()
>> |-truncate_inode_page() - Page is cleared of Dirty flag.
>>   |-btrfs_invalidatepage(page)
>> |-__btrfs_qgroup_release_data()
>>   |-btrfs_qgroup_free_refroot() - qgroup->reserved is
>> reduced by PAGESIZE.
> 
> That's OK too. No problem.
> Although the __btrfs_qgroup_release_data() is called by
> btrfs_qgroup_free_data().
> Which cleared QGROUP_RESERVED flag.
> 
> 
> So the call flow should be
> 
>   |-btrfs_setattr()
> |-truncate_pagecache()
>   |-truncate_inode_pages_range()
>  |-truncate_inode_page()<- Clear page dirty
>|-btrfs_invalidatepage(page)
>  |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
>and clear QGROUP_RESERVED
> 
> After btrfs_setattr(), the new dd write data.
> So __btrfs_buffered_write() happens again,
> dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.
> 
> So here I don't see any problem
> buffered_write:
>   Mark dirty
>   Mark QGROUP_RESERVED
>   Increase qgroup->reserved
> 
> btrfs_setattr:
>   Clear dirty
>   Clear QGROUP_RESERVED
>   Decrease qgroup->reserved
> 
> All pairs with each other, no leak no underflow.

Yes, but the problem is
run_one_delayed_ref()
 |-btrfs_qgroup_free_delayed_ref
uses delayed_ref_head->qgroup_reserved to reduce the count. Which
function is responsible for decreasing delayed_ref_head->qgroup_reserved
when btrfs_invalidatepage() is reducing the count?


> 
> Until the last buffered_write(), which doesn't got truncated.
> 
>>
>>
>> Process: sync
>> btrfs_sync_fs()
>> |-btrfs_commit_transaction()
>>   |-btrfs_run_delayed_refs()
>> |- qgroup_free_refroot() - Reduces reserved by the size of the
>> extent (in this case, the filesize of dd (first time)), even though some
>> of the PAGESIZEs have been reduced while performing the truncate on the
>> file.
> 
> The delayed_ref belongs to the last extent, which doesn't got truncated.
> 
> And in that case, that last extent got into disk.
> run_dealloc_range() <- Write data into disk
> finish_ordered_io() <- Update metadata
>   |- insert_reserved_file_extents()
>  |- btrfs_alloc_reserved_file_extent()
>  |  |- btrfs_add_delayed_data_ref
>  | 
>  |- btrfs_qgroup_release_data()
> 
> 
> Then goes to your btrfs_sync_fs() => qgroup_free_refroot()
> 
> 
> 
> [[I think I find the problem now]]

Re: [PATCH] qgroup: Prevent qgroup->reserved from going subzero

2016-09-26 Thread Qu Wenruo



At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:



On 09/25/2016 09:33 PM, Qu Wenruo wrote:



At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:




[snipped]


Sorry I still don't get the point.
Would you please give a call flow of the timing dirtying page and
calling btrfs_qgroup_reserve/free/release_data()?

Like:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty


[[Timing of btrfs_invalidatepage()]]
About your commit message "once while invalidating the page, and the
next time while free'ing delalloc."
"Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().

If so, it means one extent goes through full write back, and long before
calling btrfs_qgroup_free_delayed_ref(), it will call
btrfs_qgroup_release_data() to clear QGROUP_RESERVED.

So the call will be:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty


run_delalloc_range()
|- cow_file_range()
   |- extent_clear_unlock_delalloc() <- Clear page dirty



btrfs_finish_ordered_io()
|- insert_reserved_file_extent()
   |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
 but not decrease reserved space


run_one_delyaed_refs()
|- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space


So the problem seems to be, btrfs_invalidatepage() is called after
run_delalloc_range() but before btrfs_finish_ordered_io().

Did you mean that?


This happens event before a writeback happens. So, here is what is
happening. This is in reference, and specific to the test case in the
description.


The flowchart helps a lot, thanks.

But still some flow is still strange.


Process: dd - first time
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()<- Mark page dirty

Please note data writeback does _not_ happen/complete.


This part is completely fine.



Process: dd - next time, new process
sys_open O_TRUNC
.
 |-btrfs_setattr()
   |-truncate_pagecache()
 |-truncate_inode_pages_range()
|-truncate_inode_page() - Page is cleared of Dirty flag.
  |-btrfs_invalidatepage(page)
|-__btrfs_qgroup_release_data()
  |-btrfs_qgroup_free_refroot() - qgroup->reserved is
reduced by PAGESIZE.


That's OK too. No problem.
Although the __btrfs_qgroup_release_data() is called by 
btrfs_qgroup_free_data().

Which cleared QGROUP_RESERVED flag.


So the call flow should be

  |-btrfs_setattr()
|-truncate_pagecache()
  |-truncate_inode_pages_range()
 |-truncate_inode_page()<- Clear page dirty
   |-btrfs_invalidatepage(page)
 |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
   and clear QGROUP_RESERVED

After btrfs_setattr(), the new dd write data.
So __btrfs_buffered_write() happens again,
dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.

So here I don't see any problem
buffered_write:
  Mark dirty
  Mark QGROUP_RESERVED
  Increase qgroup->reserved

btrfs_setattr:
  Clear dirty
  Clear QGROUP_RESERVED
  Decrease qgroup->reserved

All pairs with each other, no leak no underflow.

Until the last buffered_write(), which doesn't got truncated.




Process: sync
btrfs_sync_fs()
|-btrfs_commit_transaction()
  |-btrfs_run_delayed_refs()
|- qgroup_free_refroot() - Reduces reserved by the size of the
extent (in this case, the filesize of dd (first time)), even though some
of the PAGESIZEs have been reduced while performing the truncate on the
file.


The delayed_ref belongs to the last extent, which doesn't got truncated.

And in that case, that last extent got into disk.
run_dealloc_range() <- Write data into disk
finish_ordered_io() <- Update metadata
  |- insert_reserved_file_extents()
 |- btrfs_alloc_reserved_file_extent()
 |  |- btrfs_add_delayed_data_ref
 | 
 |- btrfs_qgroup_release_data()


Then goes to your btrfs_sync_fs() => qgroup_free_refroot()



[[I think I find the problem now]]

Between btrfs_alloc_reserved_file_extent() and 
btrfs_qgroup_release_data(), there is a window that delayed_refs can be 
executed.

Since that d*mn delayed_refs can be run at any time asynchronously.

In that case, it will call qgroup_free_refroot() first at another 
process context, and later btrfs_qgroup_release_data() get schedualed, 
which will clear QGROUP_RESERVED again, causing the underflow.


Although it's not the original cause you reported, would you mind to try 
the following quick-dirty fix without your fix, to see if it fixes the 
problem?


---
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..70efa14 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2168,15 +2168,15 @@ static