On Thu, May 24, 2018 at 11:49:04AM +0100, Filipe Manana wrote:
> On Wed, May 23, 2018 at 4:58 PM, Josef Bacik <jo...@toxicpanda.com> wrote:
> > From: Josef Bacik <jba...@fb.com>
> >
> > There's a priority inversion that exists currently with btrfs fsync.  In
> > some cases we will collect outstanding ordered extents onto a list and
> > only wait on them at the very last second.  However this "very last
> > second" falls inside of a transaction handle, so if we are in a lower
> > priority cgroup we can end up holding the transaction open for longer
> > than needed, so if a high priority cgroup is also trying to fsync()
> > it'll see latency.
> >
> > Signed-off-by: Josef Bacik <jba...@fb.com>
> > ---
> >  fs/btrfs/file.c | 56 
> > ++++----------------------------------------------------
> >  1 file changed, 4 insertions(+), 52 deletions(-)
> >
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 5772f0cbedef..2b1c36612384 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2069,53 +2069,12 @@ int btrfs_sync_file(struct file *file, loff_t 
> > start, loff_t end, int datasync)
> >         atomic_inc(&root->log_batch);
> >         full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> >                              &BTRFS_I(inode)->runtime_flags);
> > +
> >         /*
> > -        * We might have have had more pages made dirty after calling
> > -        * start_ordered_ops and before acquiring the inode's i_mutex.
> > +        * We have to do this here to avoid the priority inversion of 
> > waiting on
> > +        * IO of a lower priority task while holding a transaciton open.
> >          */
> > -       if (full_sync) {
> > -               /*
> > -                * For a full sync, we need to make sure any ordered 
> > operations
> > -                * start and finish before we start logging the inode, so 
> > that
> > -                * all extents are persisted and the respective file extent
> > -                * items are in the fs/subvol btree.
> > -                */
> > -               ret = btrfs_wait_ordered_range(inode, start, len);
> > -       } else {
> > -               /*
> > -                * Start any new ordered operations before starting to log 
> > the
> > -                * inode. We will wait for them to finish in 
> > btrfs_sync_log().
> > -                *
> > -                * Right before acquiring the inode's mutex, we might have 
> > new
> > -                * writes dirtying pages, which won't immediately start the
> > -                * respective ordered operations - that is done through the
> > -                * fill_delalloc callbacks invoked from the writepage and
> > -                * writepages address space operations. So make sure we 
> > start
> > -                * all ordered operations before starting to log our inode. 
> > Not
> > -                * doing this means that while logging the inode, writeback
> > -                * could start and invoke writepage/writepages, which would 
> > call
> > -                * the fill_delalloc callbacks (cow_file_range,
> > -                * submit_compressed_extents). These callbacks add first an
> > -                * extent map to the modified list of extents and then 
> > create
> > -                * the respective ordered operation, which means in
> > -                * tree-log.c:btrfs_log_inode() we might capture all 
> > existing
> > -                * ordered operations (with btrfs_get_logged_extents()) 
> > before
> > -                * the fill_delalloc callback adds its ordered operation, 
> > and by
> > -                * the time we visit the modified list of extent maps (with
> > -                * btrfs_log_changed_extents()), we see and process the 
> > extent
> > -                * map they created. We then use the extent map to 
> > construct a
> > -                * file extent item for logging without waiting for the
> > -                * respective ordered operation to finish - this file extent
> > -                * item points to a disk location that might not have yet 
> > been
> > -                * written to, containing random data - so after a crash a 
> > log
> > -                * replay will make our inode have file extent items that 
> > point
> > -                * to disk locations containing invalid data, as we returned
> > -                * success to userspace without waiting for the respective
> > -                * ordered operation to finish, because it wasn't captured 
> > by
> > -                * btrfs_get_logged_extents().
> > -                */
> > -               ret = start_ordered_ops(inode, start, end);
> > -       }
> > +       ret = btrfs_wait_ordered_range(inode, start, len);
> >         if (ret) {
> >                 inode_unlock(inode);
> >                 goto out;
> > @@ -2240,13 +2199,6 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> > loff_t end, int datasync)
> >                                 goto out;
> >                         }
> >                 }
> > -               if (!full_sync) {
> > -                       ret = btrfs_wait_ordered_range(inode, start, len);
> > -                       if (ret) {
> > -                               btrfs_end_transaction(trans);
> > -                               goto out;
> > -                       }
> > -               }
> >                 ret = btrfs_commit_transaction(trans);
> >         } else {
> >                 ret = btrfs_end_transaction(trans);
> 
> There's more code in this function that can go away after this.
> The logic to check if the inode is already in the log can now be
> simplified since we always for the ordered extents to complete before
> deciding whether the inode needs to be blogged. The big commet about
> it can go away too:
> 
> https://friendpaste.com/5MHqvkBmdIQgrySryhhjMy

The paste is still alive, the change looks good, can you please send is
as a proper patch? The other patches are in misc-next now. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to