On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdman...@kernel.org wrote:
> From: Filipe Manana <fdman...@suse.com>
> 
> When we are mixing buffered writes with direct IO writes against the same
> file and snapshotting is happening concurrently, we can end up with a
> corrupt file content in the snapshot. Example:

The patch subject sounds like it's a corruption in generic snapshot
behaviour. But it's when mixing buffered and direct io, which is a
potential corruption scenario on itself, so snapshotting does make it
that much worse. I'd like to see that somhow reflected in the subject.

> 1) Inode/file is empty.
> 
> 2) Snapshotting starts.
> 
> 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the
>    inode to 256Kb, disk_i_size remains zero. This happens after the task
>    doing the snapshot flushes all existing delalloc.
> 
> 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent
>    completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and
>    updates the inode item in the fs tree with a size of 1Mb (which is
>    the value of disk_i_size).
> 
> 4) The dealloc for the range [0, 256Kb[ did not start yet.
> 
> 5) The transaction used in the DIO ordered extent completion, which updated
>    the inode item, is committed by the snapshotting task.
> 
> 6) Snapshot creation completes.
> 
> 7) Dealloc for the range [0, 256Kb[ is flushed.
> 
> After that when reading the file from the snapshot we always get zeroes for
> the range [0, 256Kb[, the file has a size of 1Mb and the data written by
> the direct IO write is found. From an application's point of view this is
> a corruption, since in the source subvolume it could never read a version
> of the file that included the data from the direct IO write without the
> data from the buffered write included as well. In the snapshot's tree,
> file extent items are missing for the range [0, 256Kb[.
> 
> The issue, obviously, does not happen when using the -o flushoncommit
> mount option.
> 
> Fix this by flushing delalloc for all the roots that are about to be
> snapshotted when committing a transaction. This guarantees total ordering
> when updating the disk_i_size of an inode since the flush for dealloc is
> done when a transaction is in the TRANS_STATE_COMMIT_START state and wait
> is done once no more external writers exist. This is similar to what we
> do when using the flushoncommit mount option, but we do it only if the
> transaction has snapshots to create and only for the roots of the
> subvolumes to be snapshotted. The bulk of the dealloc is flushed in the
> snapshot creation ioctl, so the flush work we do inside the transaction
> is minimized.
> 
> This issue, involving buffered and direct IO writes with snapshotting, is
> often triggered by fstest btrfs/078, and got reported by fsck when not
> using the NO_HOLES features, for example:
> 
>   $ cat results/btrfs/078.full
>   (...)
>   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>   *** fsck.btrfs output ***
>   [1/7] checking root items
>   [2/7] checking extents
>   [3/7] checking free space cache
>   [4/7] checking fs roots
>   root 258 inode 264 errors 100, file extent discount
>   Found file extent holes:
>         start: 524288, len: 65536
>   ERROR: errors found in fs roots
> 
> Signed-off-by: Filipe Manana <fdman...@suse.com>
> ---
>  fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------
>  1 file changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 4ec2b660d014..2e8f15eee2e8 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1886,8 +1886,10 @@ static void btrfs_cleanup_pending_block_groups(struct 
> btrfs_trans_handle *trans)
>         }
>  }
>  
> -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
> +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle 
> *trans)
>  {
> +     struct btrfs_fs_info *fs_info = trans->fs_info;
> +
>       /*
>        * We use writeback_inodes_sb here because if we used
>        * btrfs_start_delalloc_roots we would deadlock with fs freeze.
> @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct 
> btrfs_fs_info *fs_info)
>        * from already being in a transaction and our join_transaction doesn't
>        * have to re-take the fs freeze lock.
>        */
> -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
>               writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC);
> +     } else {
> +             struct btrfs_pending_snapshot *pending;
> +             struct list_head *head = &trans->transaction->pending_snapshots;
> +

A comment would be good here as it's not obvious why the sync is done
here (and similarly the waiting part in btrfs_wait_delalloc_flush).

> +             list_for_each_entry(pending, head, list) {
> +                     int ret;
> +
> +                     ret = btrfs_start_delalloc_snapshot(pending->root);
> +                     if (ret)
> +                             return ret;

This adds a potential failure to the middle of transaction commit. I've
checked the errors, there's EROFS (after a global fs error state) and
ENOMEM (from start_delalloc_inodes).

The EROFS could be filtered as a non-issue. ENOMEM means that the
flushing was not possible and skipping it would bring back the problem
this patch is fixing.

So, how about calling writeback_inodes_sb in that case as a fallback?
I'd really like to avoid returning failure from
btrfs_start_delalloc_flush so the extra overhead of the writeback (in a
theoretical error case anyway) should be ok.

> +             }
> +     }
>       return 0;
>  }
>  
> -static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
> +static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle 
> *trans)
>  {
> -     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT))
> +     struct btrfs_fs_info *fs_info = trans->fs_info;
> +
> +     if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) {
>               btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
> +     } else {
> +             struct btrfs_pending_snapshot *pending;
> +             struct list_head *head = &trans->transaction->pending_snapshots;
> +
> +             list_for_each_entry(pending, head, list)
> +                     btrfs_wait_ordered_extents(pending->root,
> +                                                U64_MAX, 0, U64_MAX);
> +     }
>  }
>  
>  int btrfs_commit_transaction(struct btrfs_trans_handle *trans)

The patch has been in for-next for some time, I just did not get to
writing the comments. Though the dio/buffered use is discouraged, the
errors reported by the test should be fixed. The obvious concern was the
perf penalty, but from that point I it's ok as you point out above.

Reply via email to