On 21.02.2018 15:06, Filipe Manana wrote:
> On Wed, Feb 21, 2018 at 11:41 AM, Nikolay Borisov <nbori...@suse.com> wrote:
>> Currently the DIO read case uses a botched idea borrowed from ext4 to
>> ensure that DIO reads don't race with truncate. The idea is that if we
>> have a pending truncate we set BTRFS_INODE_READDIO_NEED_LOCK, which in
>> turn forces the DIO read case to fall back to inode locking to prevent
>> read/truncate races. Unfortunately this is subtly broken for at least
>> 2 reasons:
>>
>> 1. inode_dio_begin in btrfs_direct_IO is called outside of inode_lock
>> (for the read case). This means that there is no ordering guarantee
>> between the invocation of inode_dio_wait and the increment of
>> i_dio_count in btrfs_direct_IO in the read case.
>>
>> 2. The memory barriers used in btrfs_inode_(block|resume)_unlocked_dio
>> are not really paired with the reader side - the test_bit in
>> btrfs_direct_IO - since the latter is missing a memory barrier.
>> Furthermore, the actual sleeping condition that needs ordering to
>> prevent live-locks/missed wakeups is the modification/read of
>> i_dio_count. So in this case the waker (T2) needs to make the condition
>> false _BEFORE_ doing the test.
>>
>> The interaction between the two threads roughly looks like:
>>
>> T1 (truncate):                              T2 (btrfs_direct_IO):
>> set_bit(BTRFS_INODE_READDIO_NEED_LOCK)      if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK))
>> if (atomic_read())                          if (atomic_dec_and_test(&inode->i_dio_count))
>>   schedule()                                    wake_up_bit()
>> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK)
>>
>> Without the ordering between the test_bit in T2 and the setting of the
>> bit in T1 (due to the missing pairing barrier in T2) it's possible that
>> T1 goes to sleep in schedule() while T2 misses the bit being set,
>> resulting in a missed wakeup.
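>>
>> To make the required pairing explicit, here is a condensed sketch of the
>> two paths (simplified and written from memory, so treat it as
>> illustrative rather than the literal code). The two sides form the
>> classic store-buffering pattern - each thread stores to one location and
>> then loads the other - and only T1 has a full barrier between its store
>> and its load:
>>
>>     /* T1: btrfs_setsize()/truncate path, simplified */
>>     set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags);
>>     smp_mb();                              /* orders the set_bit before the read below */
>>     if (atomic_read(&inode->i_dio_count))  /* may see 0, so the wait is skipped and    */
>>             inode_dio_wait(inode);         /* truncate proceeds                        */
>>
>>     /* T2: btrfs_direct_IO() read path, simplified */
>>     atomic_inc(&inode->i_dio_count);       /* inode_dio_begin(), outside inode_lock    */
>>     /* no smp_mb__after_atomic() here, so nothing orders the increment
>>      * above against the test_bit below */
>>     if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>>                  &BTRFS_I(inode)->runtime_flags)) {  /* may see the bit as still clear */
>>             inode_dio_end(inode);
>>             /* fall back to locked DIO */
>>     }
>>     /* otherwise the "lockless" read runs while the truncate is in progress */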
>>
>> In any case all of this is VERY subtle. So fix it by simply making
>> the DIO READ case take inode_lock_shared. This ensures that we can have
>> DIO reads in parallel while still being protected against concurrent
>> modification of the target file.
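>>
>> For reference, the resulting scheme condenses to roughly the following
>> (a summary of the diff below, not additional code; the exclusive
>> inode_lock on the truncate side is the one the VFS already holds around
>> ->setattr, if I'm reading that path right):
>>
>>     /* DIO read, btrfs_direct_IO():  */    /* truncate, btrfs_setsize():        */
>>     inode_lock_shared(inode);              /* VFS already holds inode_lock      */
>>     inode_dio_begin(inode);                /* truncate_setsize(inode, newsize); */
>>     __blockdev_direct_IO(...);             /* inode_dio_wait(inode);            */
>>     inode_unlock_shared(inode);            /* btrfs_truncate(inode);            */
>>     inode_dio_end(inode);
>>
>> As the new comment in the btrfs_setsize() hunk notes, the
>> inode_dio_wait() there then only matters for AIO requests that are still
>> in flight after the shared lock has been dropped.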
> 
> And that prevents writes and reads against different (i.e. not
> overlapping) ranges from happening in parallel.
> That has a big impact on applications (e.g. databases) that
> operate on large files serving multiple requests.
> Now all reads are serialized against all writes and vice versa.
> 

Correct, but I'd prefer correctness over performance! And I assume other
people would as well, since the code as it stands now doesn't provide full
protection against racing reads and truncates.

> Unless I missed something, a big NAK to this change as it is.



> 
> 
>> This way we closely mimic
>> what the ext4 code does and simplify this mess.
>>
>> Multiple xfstest runs didn't show any regressions.
>>
>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>> ---
>>  fs/btrfs/btrfs_inode.h | 17 -----------------
>>  fs/btrfs/inode.c       | 34 ++++++++++++++++++++--------------
>>  2 files changed, 20 insertions(+), 31 deletions(-)
>>
>> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
>> index f527e99c9f8d..3519e49d4ef0 100644
>> --- a/fs/btrfs/btrfs_inode.h
>> +++ b/fs/btrfs/btrfs_inode.h
>> @@ -329,23 +329,6 @@ struct btrfs_dio_private {
>>                         blk_status_t);
>>  };
>>
>> -/*
>> - * Disable DIO read nolock optimization, so new dio readers will be forced
>> - * to grab i_mutex. It is used to avoid the endless truncate due to
>> - * nonlocked dio read.
>> - */
>> -static inline void btrfs_inode_block_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -       smp_mb();
>> -}
>> -
>> -static inline void btrfs_inode_resume_unlocked_dio(struct btrfs_inode *inode)
>> -{
>> -       smp_mb__before_atomic();
>> -       clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>> -}
>> -
>>  static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
>>                 u64 logical_start, u32 csum, u32 csum_expected, int mirror_num)
>>  {
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 491a7397f6fa..9c43257e6e11 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -5149,10 +5149,13 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
>>                 /* we don't support swapfiles, so vmtruncate shouldn't fail */
>>                 truncate_setsize(inode, newsize);
>>
>> -               /* Disable nonlocked read DIO to avoid the end less truncate */
>> -               btrfs_inode_block_unlocked_dio(BTRFS_I(inode));
>> +               /*
>> +                * Truncate after all in-flight DIOs are finished; new ones
>> +                * will block on inode_lock. This only matters for AIO
>> +                * requests, since DIO READ is performed under
>> +                * inode_lock_shared and writes under the exclusive lock.
>> +                */
>>                 inode_dio_wait(inode);
>> -               btrfs_inode_resume_unlocked_dio(BTRFS_I(inode));
>>
>>                 ret = btrfs_truncate(inode);
>>                 if (ret && inode->i_nlink) {
>> @@ -8669,15 +8672,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>         loff_t offset = iocb->ki_pos;
>>         size_t count = 0;
>>         int flags = 0;
>> -       bool wakeup = true;
>>         bool relock = false;
>>         ssize_t ret;
>>
>>         if (check_direct_IO(fs_info, iter, offset))
>>                 return 0;
>>
>> -       inode_dio_begin(inode);
>> -
>>         /*
>>          * The generic stuff only does filemap_write_and_wait_range, which
>>          * isn't enough if we've written compressed pages to this area, so
>> @@ -8691,6 +8691,9 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                                          offset + count - 1);
>>
>>         if (iov_iter_rw(iter) == WRITE) {
>> +
>> +               inode_dio_begin(inode);
>> +
>>                 /*
>>                  * If the write DIO is beyond the EOF, we need update
>>                  * the isize, but it is protected by i_mutex. So we can
>> @@ -8720,11 +8723,13 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                 dio_data.unsubmitted_oe_range_end = (u64)offset;
>>                 current->journal_info = &dio_data;
>>                 down_read(&BTRFS_I(inode)->dio_sem);
>> -       } else if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK,
>> -                                    &BTRFS_I(inode)->runtime_flags)) {
>> -               inode_dio_end(inode);
>> -               flags = DIO_LOCKING | DIO_SKIP_HOLES;
>> -               wakeup = false;
>> +       } else {
>> +               /*
>> +                * In the DIO READ case, locking the inode in shared mode
>> +                * ensures we are protected against parallel writes/truncates.
>> +                */
>> +               inode_lock_shared(inode);
>> +               inode_dio_begin(inode);
>>         }
>>
>>         ret = __blockdev_direct_IO(iocb, inode,
>> @@ -8755,10 +8760,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
>>                         btrfs_delalloc_release_space(inode, data_reserved,
>>                                         offset, count - (size_t)ret);
>>                 btrfs_delalloc_release_extents(BTRFS_I(inode), count);
>> -       }
>> +       } else
>> +               inode_unlock_shared(inode);
>>  out:
>> -       if (wakeup)
>> -               inode_dio_end(inode);
>> +       inode_dio_end(inode);
>> +
>>         if (relock)
>>                 inode_lock(inode);
>>
>> --
>> 2.7.4
>>
> 
> 
> 
