On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote:
> On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote:
> > On 1 Nov 2018, at 6:15, David Sterba wrote:
> > 
> > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote:
> > >> From: Omar Sandoval <osan...@fb.com>
> > >>
> > >> There's a race between close_ctree() and cleaner_kthread().
> > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
> > >> sees it set, but this is racy; the cleaner might have already checked
> > >> the bit and could be cleaning stuff. In particular, if it deletes unused
> > >> block groups, it will create delayed iputs for the free space cache
> > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
> > >> longer running delayed iputs after a commit. Therefore, if the cleaner
> > >> creates more delayed iputs after delayed iputs are run in
> > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy
> > >> inode crash from the VFS.
> > >>
> > >> Fix it by parking the cleaner
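
To make "parking" concrete: this is roughly how the generic kthread parking
API behaves. It is only a sketch of the mechanism with a made-up worker and
no error handling, not the posted btrfs patch.

#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int worker_fn(void *arg)
{
	while (!kthread_should_stop()) {
		/* kthread_park() below blocks until we park here. */
		if (kthread_should_park())
			kthread_parkme();

		/* ... one unit of cleaning work ... */
		msleep(100);
	}
	return 0;
}

static void start_worker(void)
{
	worker = kthread_run(worker_fn, NULL, "example-cleaner");
}

static void stop_worker(void)
{
	/*
	 * Returns only after the worker has reached kthread_parkme(), i.e.
	 * any iteration it had already started is finished and no new one
	 * will begin.
	 */
	kthread_park(worker);

	/* ... run the final commit / pick up leftovers here ... */

	/* kthread_stop() unparks internally, so it is fine on a parked thread. */
	kthread_stop(worker);
}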
> > >
> > > Ouch, that's IMO the wrong way to fix it. The bug is on a higher level:
> > > we're missing a commit or a cleanup of some data structure. Messing with
> > > the state of a thread would be the last thing I'd try, after proving that
> > > it's not possible to fix in the logic of btrfs itself.
> > >
> > > The shutdown sequence in close_ctree() is quite tricky and we've had bugs
> > > there. The interdependencies of threads, data structures and other
> > > subsystems must not have loops, otherwise we could not find an ordering
> > > that does not leak something.
> > >
> > > It's not a big problem if some step is done more than once, like
> > > committing or cleaning up some other structure again, if we know that an
> > > earlier step could have created new work.
> > 
> > The problem is the cleaner thread needs to be told to stop doing new 
> > work, and we need to wait for the work it's already doing to be 
> > finished.  We're getting "stop doing new work" already because the 
> > cleaner thread checks to see if the FS is closing, but we don't have a 
> > way today to wait for him to finish what he's already doing.
> > 
> > kthread_park() is basically the same as adding another mutex or 
> > synchronization point.  I'm not sure how we could change close_ctree() or 
> > the final commit to pick this up more effectively?
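
For comparison, the "another mutex or synchronization point" variant would
look roughly like the sketch below. All names here are made up for
illustration (none of this is existing btrfs code); the point is only that
each cleaner pass runs under a lock which the unmount path takes after
flagging the fs as closing.

#include <linux/mutex.h>
#include <linux/bitops.h>
#include <linux/types.h>

struct example_fs {
	struct mutex cleaner_lock;	/* held across one full cleaner pass */
	unsigned long flags;		/* bit 0: fs is closing */
};

/* One cleaner pass; returns false once the fs is closing. */
static bool cleaner_one_pass(struct example_fs *fs)
{
	bool keep_going;

	mutex_lock(&fs->cleaner_lock);
	keep_going = !test_bit(0, &fs->flags);
	if (keep_going) {
		/* ... delete unused block groups, queue delayed iputs ... */
	}
	mutex_unlock(&fs->cleaner_lock);
	return keep_going;
}

static void example_close(struct example_fs *fs)
{
	/* Stop new work ... */
	set_bit(0, &fs->flags);

	/* ... and wait for a pass that may already be running to finish. */
	mutex_lock(&fs->cleaner_lock);
	mutex_unlock(&fs->cleaner_lock);

	/* Now it is safe to run the final delayed iputs and the last commit. */
}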
> 
> The idea is:
> 
> cleaner                                    close_ctree thread
> 
>                                            tell cleaner to stop
>                                            wait
> start work
> if should_stop, then exit
>                                            cleaner is stopped
> 
> [does not run: finish work]
> [does not run: loop]
>                                            pick up the work or make
>                                            sure there's nothing in
>                                            progress anymore
> 
> 
> A simplified version in code:
> 
>   set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
> 
>   wait for defrag - could be started from cleaner but next iteration will
>                   see the fs closed and will not continue
> 
>   kthread_stop(transaction_kthread)
> 
>   kthread_stop(cleaner_kthread)
> 
>   /* next, everything that could be left from cleaner should be finished */
> 
>   btrfs_delete_unused_bgs();
>   assert there are no defrags
>   assert there are no delayed iputs
>   commit if necessary
> 
> IOW the unused block groups are removed from close_ctree too early;
> moving that after the threads stop AND making sure that it does not need
> either of them should work.
> 
> The "AND" above is not currently implemented, as btrfs_delete_unused_bgs
> calls plain btrfs_end_transaction, which wakes up the transaction kthread,
> so there would need to be an argument passed to tell it to do a full commit.

Not perfect: it relies on the fact that wake_up_process(thread) on a stopped
thread is a no-op, and arguments would need to be added to skip that in
btrfs_delete_unused_bgs and btrfs_commit_super.

--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3956,11 +3956,18 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
        cancel_work_sync(&fs_info->async_reclaim_work);
 
+       if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
+           test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
+               btrfs_error_commit_super(fs_info);
+
+       kthread_stop(fs_info->transaction_kthread);
+       kthread_stop(fs_info->cleaner_kthread);
+
        if (!sb_rdonly(fs_info->sb)) {
                /*
-                * If the cleaner thread is stopped and there are
-                * block groups queued for removal, the deletion will be
-                * skipped when we quit the cleaner thread.
+                * The cleaner thread is now stopped and if there are block
+                * groups queued for removal, we have to pick up the work here
+                * so there are no delayed iputs triggered.
                 */
                btrfs_delete_unused_bgs(fs_info);
 
@@ -3969,13 +3976,6 @@ void close_ctree(struct btrfs_fs_info *fs_info)
                        btrfs_err(fs_info, "commit super ret %d", ret);
        }
 
-       if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
-           test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state))
-               btrfs_error_commit_super(fs_info);
-
-       kthread_stop(fs_info->transaction_kthread);
-       kthread_stop(fs_info->cleaner_kthread);
-
        ASSERT(list_empty(&fs_info->delayed_iputs));
        set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
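
Putting the pieces of this thread together, the intended close_ctree()
ordering would read roughly as the fragment below. This is only a
restatement of the plan above, not a tested patch; in particular the
"don't wake the stopped transaction kthread" arguments are left as TODO
comments rather than invented here.

	set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);

	/* A defrag started by the cleaner sees the bit and does not continue. */
	btrfs_cleanup_defrag_inodes(fs_info);

	kthread_stop(fs_info->transaction_kthread);
	kthread_stop(fs_info->cleaner_kthread);

	/*
	 * From here on, anything the cleaner could have left behind must be
	 * picked up without either kthread running.
	 */
	if (!sb_rdonly(fs_info->sb)) {
		/* TODO: must not wake the stopped transaction_kthread */
		btrfs_delete_unused_bgs(fs_info);

		/* TODO: same concern */
		ret = btrfs_commit_super(fs_info);
		if (ret)
			btrfs_err(fs_info, "commit super ret %d", ret);
	}

	ASSERT(list_empty(&fs_info->delayed_iputs));
	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);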
