On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > >> From: Omar Sandoval <osan...@fb.com> > > >> > > >> There's a race between close_ctree() and cleaner_kthread(). > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > >> sees it set, but this is racy; the cleaner might have already checked > > >> the bit and could be cleaning stuff. In particular, if it deletes > > >> unused > > >> block groups, it will create delayed iputs for the free space cache > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > >> longer running delayed iputs after a commit. Therefore, if the > > >> cleaner > > >> creates more delayed iputs after delayed iputs are run in > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > >> inode crash from the VFS. > > >> > > >> Fix it by parking the cleaner > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > we're missing a commit or clean up data structures. Messing with state > > > of a thread would be the last thing I'd try after proving that it's > > > not > > > possible to fix in the logic of btrfs itself. > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > there. The interdependencies of thread and data structures and other > > > subsystems cannot have loops that could not find an ordering that will > > > not leak something. > > > > > > It's not a big problem if some step is done more than once, like > > > committing or cleaning up some other structures if we know that > > > it could create new. > > > > The problem is the cleaner thread needs to be told to stop doing new > > work, and we need to wait for the work it's already doing to be > > finished. We're getting "stop doing new work" already because the > > cleaner thread checks to see if the FS is closing, but we don't have a > > way today to wait for him to finish what he's already doing. > > > > kthread_park() is basically the same as adding another mutex or > > synchronization point. I'm not sure how we could change close_tree() or > > the final commit to pick this up more effectively? > > The idea is: > > cleaner close_ctree thread > > tell cleaner to stop > wait > start work > if should_stop, then exit > cleaner is stopped > > [does not run: finish work] > [does not run: loop] > pick up the work or make > sure there's nothing in > progress anymore > > > A simplified version in code: > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > wait for defrag - could be started from cleaner but next iteration will > see the fs closed and will not continue > > kthread_stop(transaction_kthread) > > kthread_stop(cleaner_kthread) > > /* next, everything that could be left from cleaner should be finished */ > > btrfs_delete_unused_bgs(); > assert there are no defrags > assert there are no delayed iputs > commit if necessary > > IOW the unused block groups are removed from close_ctree too early, > moving that after the threads stop AND makins sure that it does not need > either of them should work. > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs > calls plain btrfs_end_transaction that wakes up transaction ktread, so > there would need to be an argument passed to tell it to do full commit.
Not perfect, relies on the fact that wake_up_process(thread) on a stopped thread is a no-op, arguments would need to be added to skip that in btrfs_delete_unused_bgs and btrfs_commit_super. --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3956,11 +3956,18 @@ void close_ctree(struct btrfs_fs_info *fs_info) cancel_work_sync(&fs_info->async_reclaim_work); + if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) || + test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state)) + btrfs_error_commit_super(fs_info); + + kthread_stop(fs_info->transaction_kthread); + kthread_stop(fs_info->cleaner_kthread); + if (!sb_rdonly(fs_info->sb)) { /* - * If the cleaner thread is stopped and there are - * block groups queued for removal, the deletion will be - * skipped when we quit the cleaner thread. + * The cleaner thread is now stopped and if there are block + * groups queued for removal, we have to pick up the work here + * so there are no delayed iputs triggered. */ btrfs_delete_unused_bgs(fs_info); @@ -3969,13 +3976,6 @@ void close_ctree(struct btrfs_fs_info *fs_info) btrfs_err(fs_info, "commit super ret %d", ret); } - if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) || - test_bit(BTRFS_FS_STATE_TRANS_ABORTED, &fs_info->fs_state)) - btrfs_error_commit_super(fs_info); - - kthread_stop(fs_info->transaction_kthread); - kthread_stop(fs_info->cleaner_kthread); - ASSERT(list_empty(&fs_info->delayed_iputs)); set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);