At 09/26/2016 10:31 PM, Goldwyn Rodrigues wrote:


On 09/25/2016 09:33 PM, Qu Wenruo wrote:


At 09/23/2016 09:43 PM, Goldwyn Rodrigues wrote:


[snipped]

Sorry I still don't get the point.
Would you please give a call flow of the timing dirtying page and
calling btrfs_qgroup_reserve/free/release_data()?

Like:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()            <- Mark page dirty


[[Timing of btrfs_invalidatepage()]]
About your commit message "once while invalidating the page, and the
next time while free'ing delalloc."
"Free'ing delalloc" did you mean btrfs_qgroup_free_delayed_ref().

If so, it means one extent goes through full write back, and long before
calling btrfs_qgroup_free_delayed_ref(), it will call
btrfs_qgroup_release_data() to clear QGROUP_RESERVED.

So the call will be:
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()            <- Mark page dirty

<data write back happens>
run_delalloc_range()
|- cow_file_range()
   |- extent_clear_unlock_delalloc() <- Clear page dirty

<modifying metadata>

btrfs_finish_ordered_io()
|- insert_reserved_file_extent()
   |- btrfs_qgroup_release_data() <- Clear QGROUP_RESERVED bit
                                     but not decrease reserved space

<run delayed refs, normally happens in commit_trans>
run_one_delyaed_refs()
|- btrfs_qgroup_free_delayed_ref() <- Directly decrease reserved space


So the problem seems to be, btrfs_invalidatepage() is called after
run_delalloc_range() but before btrfs_finish_ordered_io().

Did you mean that?

This happens event before a writeback happens. So, here is what is
happening. This is in reference, and specific to the test case in the
description.

The flowchart helps a lot, thanks.

But still some flow is still strange.

Process: dd - first time
__btrfs_buffered_write()
|- btrfs_check_data_free_space()
|  |- btrfs_qgroup_reserve_data() <- Mark QGROUP_RESERVED bit
|- btrfs_dirty_pages()            <- Mark page dirty

Please note data writeback does _not_ happen/complete.

This part is completely fine.


Process: dd - next time, new process
sys_open O_TRUNC
.
 |-btrfs_setattr()
   |-truncate_pagecache()
     |-truncate_inode_pages_range()
        |-truncate_inode_page() - Page is cleared of Dirty flag.
          |-btrfs_invalidatepage(page)
            |-__btrfs_qgroup_release_data()
              |-btrfs_qgroup_free_refroot() - qgroup->reserved is
reduced by PAGESIZE.

That's OK too. No problem.
Although the __btrfs_qgroup_release_data() is called by btrfs_qgroup_free_data().
Which cleared QGROUP_RESERVED flag.


So the call flow should be

  |-btrfs_setattr()
    |-truncate_pagecache()
      |-truncate_inode_pages_range()
         |-truncate_inode_page()        <- Clear page dirty
           |-btrfs_invalidatepage(page)
             |-btrfs_qgroup_free_data() <- reduce qgroup->reserved
                                           and clear QGROUP_RESERVED

After btrfs_setattr(), the new dd write data.
So __btrfs_buffered_write() happens again,
dirtying pages and mark QGROUP_RESERVED and increase qgroup->reserved.

So here I don't see any problem
buffered_write:
  Mark dirty
  Mark QGROUP_RESERVED
  Increase qgroup->reserved

btrfs_setattr:
  Clear dirty
  Clear QGROUP_RESERVED
  Decrease qgroup->reserved

All pairs with each other, no leak no underflow.

Until the last buffered_write(), which doesn't got truncated.



Process: sync
btrfs_sync_fs()
|-btrfs_commit_transaction()
  |-btrfs_run_delayed_refs()
    |- qgroup_free_refroot() - Reduces reserved by the size of the
extent (in this case, the filesize of dd (first time)), even though some
of the PAGESIZEs have been reduced while performing the truncate on the
file.

The delayed_ref belongs to the last extent, which doesn't got truncated.

And in that case, that last extent got into disk.
run_dealloc_range() <- Write data into disk
finish_ordered_io() <- Update metadata
  |- insert_reserved_file_extents()
     |- btrfs_alloc_reserved_file_extent()
     |  |- btrfs_add_delayed_data_ref
     |     <This will cause a new delayed_data_ref>
     |- btrfs_qgroup_release_data()
        <This will clear QGROURP_RESERVED>

Then goes to your btrfs_sync_fs() => qgroup_free_refroot()



[[I think I find the problem now]]

Between btrfs_alloc_reserved_file_extent() and btrfs_qgroup_release_data(), there is a window that delayed_refs can be executed.
Since that d*mn delayed_refs can be run at any time asynchronously.

In that case, it will call qgroup_free_refroot() first at another process context, and later btrfs_qgroup_release_data() get schedualed, which will clear QGROUP_RESERVED again, causing the underflow.

Although it's not the original cause you reported, would you mind to try the following quick-dirty fix without your fix, to see if it fixes the problem?

-------
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e6811c4..70efa14 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2168,15 +2168,15 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
        ins.objectid = disk_bytenr;
        ins.offset = disk_num_bytes;
        ins.type = BTRFS_EXTENT_ITEM_KEY;
-       ret = btrfs_alloc_reserved_file_extent(trans, root,
-                                       root->root_key.objectid,
-                                       btrfs_ino(inode), file_pos,
-                                       ram_bytes, &ins);
        /*
         * Release the reserved range from inode dirty range map, as it is
         * already moved into delayed_ref_head
         */
        btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+       ret = btrfs_alloc_reserved_file_extent(trans, root,
+                                       root->root_key.objectid,
+                                       btrfs_ino(inode), file_pos,
+                                       ram_bytes, &ins);
 out:
        btrfs_free_path(path);

-------


I hope that makes it clear.


[[About test case]]
And for the test case, I can't reproduce the problem no matter if I
apply the fix or not.

Either way it just fails after 3 loops of dd, and later dd will all fail.
But I can still remove the file and write new data into the fs.


Strange? I can reproduce at every instance of running it. Even on 4.8.0-rc7

Maybe related to the memory size.

Since you're also using VM, maybe your VM RAM is too small, so dirty pages pressure triggers a commit halfway and cause the race?


Thanks,
Qu


[[Extra protect about qgroup->reserved]]
And for the underflowed qgroup reserve space, would you mind to add
warning for that case?
Just like what we did in qgroup excl/rfer values, so at least it won't
make qgroup blocking any write.


Oh yes, I wonder why that is not placed there when it is placed in all
other location where qgroup->reserved is reduced.

Also, this has nothing to do with the comment of two ways of free'ing
the qgroups, as suggested in the commit statement. Thats my bad.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to