On Wed, 29 Oct 2014 08:21:12 +0000, Filipe Manana wrote: > If right after starting the snapshot creation ioctl we perform a write > against a > file followed by a truncate, with both operations increasing the file's size, > we > can get a snapshot tree that reflects a state of the source subvolume's tree > where > the file truncation happened but the write operation didn't. This leaves a gap > between 2 file extent items of the inode, which makes btrfs' fsck complain > about it. > > For example, if we perform the following file operations: > > $ mkfs.btrfs -f /dev/vdd > $ mount /dev/vdd /mnt > $ xfs_io -f \ > -c "pwrite -S 0xaa -b 32K 0 32K" \ > -c "fsync" \ > -c "pwrite -S 0xbb -b 32770 16K 32770" \ > -c "truncate 90123" \ > /mnt/foobar > > and the snapshot creation ioctl was just called before the second write, we > often > can get the following inode items in the snapshot's btree: > > item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160 > inode generation 146 transid 7 size 90123 block group 0 mode > 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0 > item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20 > inode ref index 282 namelen 10 name: foobar > item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53 > extent data disk byte 1104855040 nr 32768 > extent data offset 0 nr 32768 ram 32768 > extent compression 0 > item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53 > extent data disk byte 0 nr 0 > extent data offset 0 nr 40960 ram 40960 > extent compression 0 > > There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, > 4096)[ > for which there's no file extent item covering it. This is because the file > write > and file truncate operations happened both right after the snapshot creation > ioctl > called btrfs_start_delalloc_inodes(), which means we didn't start and wait > for the > ordered extent that matches the write and, in btrfs_setsize(), we were able > to call > btrfs_cont_expand() before being able to commit the current transaction in the > snapshot creation ioctl. So this made it possibe to insert the hole file > extent > item in the source subvolume (which represents the region added by the > truncate) > right before the transaction commit from the snapshot creation ioctl. > > Btrfs' fsck tool complains about such cases with a message like the following: > > "root 331 inode 257 errors 100, file extent discount" > >>From a user perspective, the expectation when a snapshot is created while >>those > file operations are being performed is that the snapshot will have a file that > either: > > 1) is empty > 2) only the first write was captured > 3) only the 2 writes were captured > 4) both writes and the truncation were captured > > But never capture a state where only the first write and the truncation were > captured (since the second write was performed before the truncation). > > A test case for xfstests follows. > > Signed-off-by: Filipe Manana <fdman...@suse.com> > --- > > V2: Use different approach to solve the problem. Don't start and wait for all > dellaloc to finish after every expanding truncate, instead add an > additional > flush at transaction commit time if we're doing a transaction commit that > creates snapshots.
This method will make the transaction commit spend more time, why not use i_disk_size to expand the file size in btrfs_setsize()? Or we might rename btrfs_{start, end}_nocow_write(), and use them in btrfs_setsize()? Thanks Miao > > V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc(). > > fs/btrfs/transaction.c | 59 > ++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 59 insertions(+) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 396ae8b..5e7f004 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct > btrfs_fs_info *fs_info) > btrfs_wait_ordered_roots(fs_info, -1); > } > > +static int > +start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans, > + struct list_head *splice) > +{ > + struct btrfs_pending_snapshot *pending_snapshot; > + int ret = 0; > + > + if (btrfs_test_opt(trans->root, FLUSHONCOMMIT)) > + return 0; > + > + spin_lock(&trans->root->fs_info->trans_lock); > + list_splice_init(&trans->transaction->pending_snapshots, splice); > + spin_unlock(&trans->root->fs_info->trans_lock); > + > + /* > + * Start again delalloc for the roots our pending snapshots are made > + * from. We did it before starting/joining a transaction and we do it > + * here again because new inode operations might have happened since > + * then and we want to make sure the snapshot captures a fully > + * consistent state of the source root tree. For example, if after the > + * first delalloc flush a write is made against an inode followed by > + * an expanding truncate, we want to make sure the snapshot captured > + * both the write and the truncation, and not just the truncation. > + * Here we shouldn't have much delalloc work to do, as the bulk of it > + * was done before and outside the transaction. > + */ > + list_for_each_entry(pending_snapshot, splice, list) { > + ret = btrfs_start_delalloc_inodes(pending_snapshot->root, 1); > + if (ret) > + break; > + } > + > + return ret; > +} > + > +static void > +wait_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans, > + struct list_head *splice) > +{ > + struct btrfs_pending_snapshot *pending_snapshot; > + > + if (list_empty(splice)) > + return; > + > + list_for_each_entry(pending_snapshot, splice, list) > + btrfs_wait_ordered_extents(pending_snapshot->root, -1); > + > + spin_lock(&trans->root->fs_info->trans_lock); > + list_splice_init(splice, &trans->transaction->pending_snapshots); > + spin_unlock(&trans->root->fs_info->trans_lock); > +} > + > int btrfs_commit_transaction(struct btrfs_trans_handle *trans, > struct btrfs_root *root) > { > struct btrfs_transaction *cur_trans = trans->transaction; > struct btrfs_transaction *prev_trans = NULL; > struct btrfs_inode *btree_ino = BTRFS_I(root->fs_info->btree_inode); > + LIST_HEAD(pending_snapshots); > int ret; > > /* Stop the commit early if ->aborted is set */ > @@ -1802,6 +1855,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans, > if (ret) > goto cleanup_transaction; > > + ret = start_pending_snapshot_roots_delalloc(trans, &pending_snapshots); > + if (ret) > + goto cleanup_transaction; > + > ret = btrfs_run_delayed_items(trans, root); > if (ret) > goto cleanup_transaction; > @@ -1816,6 +1873,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans, > > btrfs_wait_delalloc_flush(root->fs_info); > > + wait_pending_snapshot_roots_delalloc(trans, &pending_snapshots); > + > btrfs_scrub_pause(root); > /* > * Ok now we need to make sure to block out any other joins while we > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html