On 8 May 2019, at 3:15, Nikolay Borisov wrote: > On 7.05.19 г. 20:27 ч., Josef Bacik wrote: >> We have been seeing issues in production where a cleaner script will >> end >> up unlinking a bunch of files that have pending iputs. This means >> they >> will get their final iput's run at btrfs-cleaner time and thus are >> not >> throttled, which impacts the workload. >> >> Since we are unlinking these files we can just drop the delayed iput >> at >> unlink time. We are already holding a reference to the inode so this >> will not be the final iput and thus is completely safe to do at this >> point. Doing this means we are more likely to be doing the final >> iput >> at unlink time, and thus will get the IO charged to the caller and >> get >> throttled appropriately without affecting the main workload. >> >> Signed-off-by: Josef Bacik <jo...@toxicpanda.com> >> --- >> fs/btrfs/inode.c | 22 ++++++++++++++++++++++ >> 1 file changed, 22 insertions(+) >> >> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c >> index b6d549c993f6..e58685b5d398 100644 >> --- a/fs/btrfs/inode.c >> +++ b/fs/btrfs/inode.c >> @@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct >> btrfs_trans_handle *trans, >> ret = 0; >> else if (ret) >> btrfs_abort_transaction(trans, ret); >> + >> + /* >> + * If we have a pending delayed iput we could end up with the final >> iput >> + * being run in btrfs-cleaner context. If we have enough of these >> built >> + * up we can end up burning a lot of time in btrfs-cleaner without >> any >> + * way to throttle the unlinks. Since we're currently holding a >> ref on >> + * the inode we can run the delayed iput here without any issues as >> the >> + * final iput won't be done until after we drop the ref we're >> currently >> + * holding. >> + */ > > FWIW the caller is not really holding an explicit reference, rather > there is a reference held by the dentry which is going to be disposed > of > by vfs. Considering this I'd say this is a false claim. I.e "we" do > not > hold a reference.
It's impossible to call this function without a reference held on the inode, kind of nit-picking on "we" vs "the vfs". > >> + if (!list_empty(&inode->delayed_iput)) { >> + spin_lock(&fs_info->delayed_iput_lock); >> + if (!list_empty(&inode->delayed_iput)) { >> + list_del_init(&inode->delayed_iput); >> + spin_unlock(&fs_info->delayed_iput_lock); >> + iput(&inode->vfs_inode); >> + if (atomic_dec_and_test(&fs_info->nr_delayed_iputs)) >> + wake_up(&fs_info->delayed_iputs_wait); >> + } else { >> + spin_unlock(&fs_info->delayed_iput_lock); >> + } >> + } > > OTOH this really feels like a hack and this stems from the fact that > iput is rather rudimentary. Additionally you are essentially > opencoding > the body of btrfs_run_delayed_iputs. I was going to suggest to > introduce > a new helper factoring out the common code but it will get ugly due to > the spin lock being dropped before doing the iput. > > But then I'm really starting to question the utility of delayed iputs. > Presumably it was added to defer the expensive final iput in the > cleaner > context or avoid some deadlocks (but we don't know which exactly). > Yet, > here we are some time later where you are essentially saying "this > mechanism is suboptimal because it's dumb and instead of improving > things it's making them worse in certain cases, so let's unload it a > bit > by doing an iput here". The final iput is pretty expensive, since it potentially does the full truncate of an arbitrary sized file. There are a lot of contexts it can't be called from, so the delayed iput code saves us from some impossible situations. It originally came here: commit 24bbcf0442ee04660a5a030efdbb6d03f1c275cb Author: Yan, Zheng <zheng....@oracle.com> Date: Thu Nov 12 09:36:34 2009 +0000 Btrfs: Add delayed iput But we've expanded usage to solve a few different deadlocks. -chris