On Mon, Feb 18, 2019 at 5:09 PM David Sterba <dste...@suse.cz> wrote: > > On Mon, Feb 04, 2019 at 02:28:10PM +0000, fdman...@kernel.org wrote: > > From: Filipe Manana <fdman...@suse.com> > > > > When we are mixing buffered writes with direct IO writes against the same > > file and snapshotting is happening concurrently, we can end up with a > > corrupt file content in the snapshot. Example: > > The patch subject sounds like it's a corruption in generic snapshot > behaviour. But it's when mixing buffered and direct io, which is a > potential corruption scenario on itself, so snapshotting does make it > that much worse. I'd like to see that somhow reflected in the subject.
It's a kind of corruption created by snapshotting. I tried to come up with a better subject that wasn't too long to fit in the 74-75 characters limit, but couldn't come up with any. So I left the explanation and example in the remainder of the change log. If you have a better suggestion... I'm open to it. > > > 1) Inode/file is empty. > > > > 2) Snapshotting starts. > > > > 2) Buffered write at offset 0 length 256Kb. This updates the i_size of the > > inode to 256Kb, disk_i_size remains zero. This happens after the task > > doing the snapshot flushes all existing delalloc. > > > > 3) DIO write at offset 256Kb length 768Kb. Once the ordered extent > > completes it sets the inode's disk_i_size to 1Mb (256Kb + 768Kb) and > > updates the inode item in the fs tree with a size of 1Mb (which is > > the value of disk_i_size). > > > > 4) The dealloc for the range [0, 256Kb[ did not start yet. > > > > 5) The transaction used in the DIO ordered extent completion, which updated > > the inode item, is committed by the snapshotting task. > > > > 6) Snapshot creation completes. > > > > 7) Dealloc for the range [0, 256Kb[ is flushed. > > > > After that when reading the file from the snapshot we always get zeroes for > > the range [0, 256Kb[, the file has a size of 1Mb and the data written by > > the direct IO write is found. From an application's point of view this is > > a corruption, since in the source subvolume it could never read a version > > of the file that included the data from the direct IO write without the > > data from the buffered write included as well. In the snapshot's tree, > > file extent items are missing for the range [0, 256Kb[. > > > > The issue, obviously, does not happen when using the -o flushoncommit > > mount option. > > > > Fix this by flushing delalloc for all the roots that are about to be > > snapshotted when committing a transaction. This guarantees total ordering > > when updating the disk_i_size of an inode since the flush for dealloc is > > done when a transaction is in the TRANS_STATE_COMMIT_START state and wait > > is done once no more external writers exist. This is similar to what we > > do when using the flushoncommit mount option, but we do it only if the > > transaction has snapshots to create and only for the roots of the > > subvolumes to be snapshotted. The bulk of the dealloc is flushed in the > > snapshot creation ioctl, so the flush work we do inside the transaction > > is minimized. > > > > This issue, involving buffered and direct IO writes with snapshotting, is > > often triggered by fstest btrfs/078, and got reported by fsck when not > > using the NO_HOLES features, for example: > > > > $ cat results/btrfs/078.full > > (...) > > _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent > > *** fsck.btrfs output *** > > [1/7] checking root items > > [2/7] checking extents > > [3/7] checking free space cache > > [4/7] checking fs roots > > root 258 inode 264 errors 100, file extent discount > > Found file extent holes: > > start: 524288, len: 65536 > > ERROR: errors found in fs roots > > > > Signed-off-by: Filipe Manana <fdman...@suse.com> > > --- > > fs/btrfs/transaction.c | 36 ++++++++++++++++++++++++++++++------ > > 1 file changed, 30 insertions(+), 6 deletions(-) > > > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > > index 4ec2b660d014..2e8f15eee2e8 100644 > > --- a/fs/btrfs/transaction.c > > +++ b/fs/btrfs/transaction.c > > @@ -1886,8 +1886,10 @@ static void > > btrfs_cleanup_pending_block_groups(struct btrfs_trans_handle *trans) > > } > > } > > > > -static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) > > +static inline int btrfs_start_delalloc_flush(struct btrfs_trans_handle > > *trans) > > { > > + struct btrfs_fs_info *fs_info = trans->fs_info; > > + > > /* > > * We use writeback_inodes_sb here because if we used > > * btrfs_start_delalloc_roots we would deadlock with fs freeze. > > @@ -1897,15 +1899,37 @@ static inline int btrfs_start_delalloc_flush(struct > > btrfs_fs_info *fs_info) > > * from already being in a transaction and our join_transaction > > doesn't > > * have to re-take the fs freeze lock. > > */ > > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > > writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); > > + } else { > > + struct btrfs_pending_snapshot *pending; > > + struct list_head *head = > > &trans->transaction->pending_snapshots; > > + > > A comment would be good here as it's not obvious why the sync is done > here (and similarly the waiting part in btrfs_wait_delalloc_flush). Intentionally left out due to the changelog explaining it and avoiding a too long comment explaining why/how the corruption happens. > > > + list_for_each_entry(pending, head, list) { > > + int ret; > > + > > + ret = btrfs_start_delalloc_snapshot(pending->root); > > + if (ret) > > + return ret; > > This adds a potential failure to the middle of transaction commit. I've > checked the errors, there's EROFS (after a global fs error state) and > ENOMEM (from start_delalloc_inodes). > > The EROFS could be filtered as a non-issue. ENOMEM means that the > flushing was not possible and skipping it would bring back the problem > this patch is fixing. > > So, how about calling writeback_inodes_sb in that case as a fallback? Thought about it, but the reason I didn't do it is that if you fallback to writeback_unodes_sb, you'll never have the error reported to the user. It's very unlikely the user will do an fsync on the snapshot version of the file, specially if it's a RO snapshot, which would be the only way to report the error. > I'd really like to avoid returning failure from > btrfs_start_delalloc_flush so the extra overhead of the writeback (in a > theoretical error case anyway) should be ok. Me too, as long as the error is reported (a message in syslog/dmesg is very likely to be missed). > > > + } > > + } > > return 0; > > } > > > > -static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info) > > +static inline void btrfs_wait_delalloc_flush(struct btrfs_trans_handle > > *trans) > > { > > - if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) > > + struct btrfs_fs_info *fs_info = trans->fs_info; > > + > > + if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) { > > btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); > > + } else { > > + struct btrfs_pending_snapshot *pending; > > + struct list_head *head = > > &trans->transaction->pending_snapshots; > > + > > + list_for_each_entry(pending, head, list) > > + btrfs_wait_ordered_extents(pending->root, > > + U64_MAX, 0, U64_MAX); > > + } > > } > > > > int btrfs_commit_transaction(struct btrfs_trans_handle *trans) > > The patch has been in for-next for some time, I just did not get to > writing the comments. Though the dio/buffered use is discouraged, the > errors reported by the test should be fixed. The obvious concern was the > perf penalty, but from that point I it's ok as you point out above.