On Thu, Nov 01, 2018 at 08:24:25AM -0700, Omar Sandoval wrote: > On Thu, Nov 01, 2018 at 04:22:29PM +0100, David Sterba wrote: > > On Thu, Nov 01, 2018 at 04:08:32PM +0100, David Sterba wrote: > > > On Thu, Nov 01, 2018 at 01:31:18PM +0000, Chris Mason wrote: > > > > On 1 Nov 2018, at 6:15, David Sterba wrote: > > > > > > > > > On Wed, Oct 31, 2018 at 10:06:08AM -0700, Omar Sandoval wrote: > > > > >> From: Omar Sandoval <osan...@fb.com> > > > > >> > > > > >> There's a race between close_ctree() and cleaner_kthread(). > > > > >> close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it > > > > >> sees it set, but this is racy; the cleaner might have already checked > > > > >> the bit and could be cleaning stuff. In particular, if it deletes > > > > >> unused > > > > >> block groups, it will create delayed iputs for the free space cache > > > > >> inodes. As of "btrfs: don't run delayed_iputs in commit", we're no > > > > >> longer running delayed iputs after a commit. Therefore, if the > > > > >> cleaner > > > > >> creates more delayed iputs after delayed iputs are run in > > > > >> btrfs_commit_super(), we will leak inodes on unmount and get a busy > > > > >> inode crash from the VFS. > > > > >> > > > > >> Fix it by parking the cleaner > > > > > > > > > > Ouch, that's IMO wrong way to fix it. The bug is on a higher level, > > > > > we're missing a commit or clean up data structures. Messing with state > > > > > of a thread would be the last thing I'd try after proving that it's > > > > > not > > > > > possible to fix in the logic of btrfs itself. > > > > > > > > > > The shutdown sequence in close_tree is quite tricky and we've had bugs > > > > > there. The interdependencies of thread and data structures and other > > > > > subsystems cannot have loops that could not find an ordering that will > > > > > not leak something. > > > > > > > > > > It's not a big problem if some step is done more than once, like > > > > > committing or cleaning up some other structures if we know that > > > > > it could create new. > > > > > > > > The problem is the cleaner thread needs to be told to stop doing new > > > > work, and we need to wait for the work it's already doing to be > > > > finished. We're getting "stop doing new work" already because the > > > > cleaner thread checks to see if the FS is closing, but we don't have a > > > > way today to wait for him to finish what he's already doing. > > > > > > > > kthread_park() is basically the same as adding another mutex or > > > > synchronization point. I'm not sure how we could change close_tree() > > > > or > > > > the final commit to pick this up more effectively? > > > > > > The idea is: > > > > > > cleaner close_ctree thread > > > > > > tell cleaner to stop > > > wait > > > start work > > > if should_stop, then exit > > > cleaner is stopped > > > > > > [does not run: finish work] > > > [does not run: loop] > > > pick up the work or make > > > sure there's nothing in > > > progress anymore > > > > > > > > > A simplified version in code: > > > > > > set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags); > > > > > > wait for defrag - could be started from cleaner but next iteration will > > > see the fs closed and will not continue > > > > > > kthread_stop(transaction_kthread) > > > > > > kthread_stop(cleaner_kthread) > > > > > > /* next, everything that could be left from cleaner should be finished > > > */ > > > > > > btrfs_delete_unused_bgs(); > > > assert there are no defrags > > > assert there are no delayed iputs > > > commit if necessary > > > > > > IOW the unused block groups are removed from close_ctree too early, > > > moving that after the threads stop AND makins sure that it does not need > > > either of them should work. > > > > > > The "AND" above is not currently implemented as btrfs_delete_unused_bgs > > > calls plain btrfs_end_transaction that wakes up transaction ktread, so > > > there would need to be an argument passed to tell it to do full commit. > > > > Not perfect, relies on the fact that wake_up_process(thread) on a stopped > > thread is a no-op, > > How is that? kthread_stop() frees the task struct, so wake_up_process() > would be a use-after-free.
That was an assumption for the demonstration purposes, the wording was confusing sorry.