On 8 May 2019, at 3:15, Nikolay Borisov wrote:

> On 7.05.19 г. 20:27 ч., Josef Bacik wrote:
>> We have been seeing issues in production where a cleaner script will 
>> end
>> up unlinking a bunch of files that have pending iputs.  This means 
>> they
>> will get their final iput's run at btrfs-cleaner time and thus are 
>> not
>> throttled, which impacts the workload.
>>
>> Since we are unlinking these files we can just drop the delayed iput 
>> at
>> unlink time.  We are already holding a reference to the inode so this
>> will not be the final iput and thus is completely safe to do at this
>> point.  Doing this means we are more likely to be doing the final 
>> iput
>> at unlink time, and thus will get the IO charged to the caller and 
>> get
>> throttled appropriately without affecting the main workload.
>>
>> Signed-off-by: Josef Bacik <jo...@toxicpanda.com>
>> ---
>>  fs/btrfs/inode.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index b6d549c993f6..e58685b5d398 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -4009,6 +4009,28 @@ static int __btrfs_unlink_inode(struct 
>> btrfs_trans_handle *trans,
>>              ret = 0;
>>      else if (ret)
>>              btrfs_abort_transaction(trans, ret);
>> +
>> +    /*
>> +     * If we have a pending delayed iput we could end up with the final 
>> iput
>> +     * being run in btrfs-cleaner context.  If we have enough of these 
>> built
>> +     * up we can end up burning a lot of time in btrfs-cleaner without 
>> any
>> +     * way to throttle the unlinks.  Since we're currently holding a 
>> ref on
>> +     * the inode we can run the delayed iput here without any issues as 
>> the
>> +     * final iput won't be done until after we drop the ref we're 
>> currently
>> +     * holding.
>> +     */
>
> FWIW the caller is not really holding an explicit reference, rather
> there is a reference held by the dentry which is going to be disposed 
> of
> by vfs. Considering this I'd say this is a false claim. I.e "we" do 
> not
> hold a reference.

It's impossible to call this function without a reference held on the 
inode, kind of nit-picking on "we" vs "the vfs".

>
>> +    if (!list_empty(&inode->delayed_iput)) {
>> +            spin_lock(&fs_info->delayed_iput_lock);
>> +            if (!list_empty(&inode->delayed_iput)) {
>> +                    list_del_init(&inode->delayed_iput);
>> +                    spin_unlock(&fs_info->delayed_iput_lock);
>> +                    iput(&inode->vfs_inode);
>> +                    if (atomic_dec_and_test(&fs_info->nr_delayed_iputs))
>> +                            wake_up(&fs_info->delayed_iputs_wait);
>> +            } else {
>> +                    spin_unlock(&fs_info->delayed_iput_lock);
>> +            }
>> +    }
>
> OTOH this really feels like a hack and this stems from the fact that
> iput is rather rudimentary. Additionally you are essentially 
> opencoding
> the body of btrfs_run_delayed_iputs. I was going to suggest to 
> introduce
> a new helper factoring out the common code but it will get ugly due to
> the spin lock being dropped before doing the iput.
>
> But then I'm really starting to question the utility of delayed iputs.
> Presumably it was added to defer the expensive final iput in the 
> cleaner
> context or avoid some deadlocks (but we don't know which exactly). 
> Yet,
> here we are some time later where you are essentially saying "this
> mechanism is suboptimal because it's dumb and instead of improving
> things it's making them worse in certain cases, so let's unload it a 
> bit
> by doing an iput here".

The final iput is pretty expensive, since it potentially does the full 
truncate of an arbitrary sized file.  There are a lot of contexts it 
can't be called from, so the delayed iput code saves us from some 
impossible situations.  It originally came here:

commit 24bbcf0442ee04660a5a030efdbb6d03f1c275cb
Author: Yan, Zheng <zheng....@oracle.com>
Date:   Thu Nov 12 09:36:34 2009 +0000

     Btrfs: Add delayed iput

But we've expanded usage to solve a few different deadlocks.

-chris

Reply via email to