On Fri, Jun 26, 2015 at 09:48:05AM +0100, Filipe Manana wrote: > On Fri, Jun 26, 2015 at 4:21 AM, Liu Bo <bo.li....@oracle.com> wrote: > > On Thu, Jun 25, 2015 at 03:23:59PM +0100, Filipe Manana wrote: > >> On Thu, Jun 25, 2015 at 3:20 PM, Liu Bo <bo.li....@oracle.com> wrote: > >> > On Thu, Jun 25, 2015 at 04:17:46AM +0100, fdman...@kernel.org wrote: > >> >> From: Filipe Manana <fdman...@suse.com> > >> >> > >> >> When we have the no_holes feature enabled, if a we truncate a file to a > >> >> smaller size, truncate it again but to a size greater than or equals to > >> >> its original size and fsync it, the log tree will not have any > >> >> information > >> >> about the hole covering the range [truncate_1_offset, new_file_size[. > >> >> Which means if the fsync log is replayed, the file will remain with the > >> >> state it had before both truncate operations. > >> > > >> > Does the fs/subvol tree get updated to the right information at this > >> > time? > >> > >> No, and that's the problem. Because no file extent items are stored in > >> the log tree. > >> The inode item is updated with the new i_size however (as expected). > > > > Yeap, that's right and the patch looks right. > > > > I do appreciate your great work on fixing these corner cases, > > but as of my understanding, they really can be taken by a force commit > > transaction, do they deserve these complex stuff? > > All the "complexity" is not about avoiding a transaction commit and do > something more efficient instead, but rather about detecting such > case. The simpler way of for example always force a commit if the file > has the full sync flag (or a new flag set on truncate) and was created > in a past generation, would be quite ironic given all the > optimizations effort put into fsync, no-holes feature, etc.
I agree. > > > > > After all, like punch_hole, remove xattr, they're rare cases. > > Maybe not as much as one would think. > Having worked on databases before, hole punching, truncations and > other "rare" cases aren't that rare and correct fsync behaviour is a > must or the very least desirable. That makes sense. Forgot to give Reviewed-by: Liu Bo <bo.li....@oracle.com> Thanks, -liubo > > thanks > > > > > Thanks, > > > > -liubo > > > >> > >> thanks > >> > >> > > >> > Thanks, > >> > > >> > -liubo > >> >> > >> >> Without the no_holes feature this does not happen, since when the inode > >> >> is logged (full sync flag is set) it will find in the fs/subvol tree a > >> >> leaf with a generation matching the current transaction id that has an > >> >> explicit extent item representing the hole. > >> >> > >> >> Fix this by adding an explicit extent item representing a hole between > >> >> the last extent and the inode's i_size if we are doing a full sync. > >> >> > >> >> The issue is easy to reproduce with the following test case for fstests: > >> >> > >> >> . ./common/rc > >> >> . ./common/filter > >> >> . ./common/dmflakey > >> >> > >> >> _need_to_be_root > >> >> _supported_fs generic > >> >> _supported_os Linux > >> >> _require_scratch > >> >> _require_dm_flakey > >> >> > >> >> # This test was motivated by an issue found in btrfs when the btrfs > >> >> # no-holes feature is enabled (introduced in kernel 3.14). So enable > >> >> # the feature if the fs being tested is btrfs. > >> >> if [ $FSTYP == "btrfs" ]; then > >> >> _require_btrfs_fs_feature "no_holes" > >> >> _require_btrfs_mkfs_feature "no-holes" > >> >> MKFS_OPTIONS="$MKFS_OPTIONS -O no-holes" > >> >> fi > >> >> > >> >> rm -f $seqres.full > >> >> > >> >> _scratch_mkfs >>$seqres.full 2>&1 > >> >> _init_flakey > >> >> _mount_flakey > >> >> > >> >> # Create our test files and make sure everything is durably persisted. > >> >> $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 64K" \ > >> >> -c "pwrite -S 0xbb 64K 61K" \ > >> >> $SCRATCH_MNT/foo | _filter_xfs_io > >> >> $XFS_IO_PROG -f -c "pwrite -S 0xee 0 64K" \ > >> >> -c "pwrite -S 0xff 64K 61K" \ > >> >> $SCRATCH_MNT/bar | _filter_xfs_io > >> >> sync > >> >> > >> >> # Now truncate our file foo to a smaller size (64Kb) and then truncate > >> >> # it to the size it had before the shrinking truncate (125Kb). Then > >> >> # fsync our file. If a power failure happens after the fsync, we > >> >> expect > >> >> # our file to have a size of 125Kb, with the first 64Kb of data having > >> >> # the value 0xaa and the second 61Kb of data having the value 0x00. > >> >> $XFS_IO_PROG -c "truncate 64K" \ > >> >> -c "truncate 125K" \ > >> >> -c "fsync" \ > >> >> $SCRATCH_MNT/foo > >> >> > >> >> # Do something similar to our file bar, but the first truncation sets > >> >> # the file size to 0 and the second truncation expands the size to the > >> >> # double of what it was initially. > >> >> $XFS_IO_PROG -c "truncate 0" \ > >> >> -c "truncate 253K" \ > >> >> -c "fsync" \ > >> >> $SCRATCH_MNT/bar > >> >> > >> >> _load_flakey_table $FLAKEY_DROP_WRITES > >> >> _unmount_flakey > >> >> > >> >> # Allow writes again, mount to trigger log replay and validate file > >> >> # contents. > >> >> _load_flakey_table $FLAKEY_ALLOW_WRITES > >> >> _mount_flakey > >> >> > >> >> # We expect foo to have a size of 125Kb, the first 64Kb of data all > >> >> # having the value 0xaa and the remaining 61Kb to be a hole (all bytes > >> >> # with value 0x00). > >> >> echo "File foo content after log replay:" > >> >> od -t x1 $SCRATCH_MNT/foo > >> >> > >> >> # We expect bar to have a size of 253Kb and no extents (any byte read > >> >> # from bar has the value 0x00). > >> >> echo "File bar content after log replay:" > >> >> od -t x1 $SCRATCH_MNT/bar > >> >> > >> >> status=0 > >> >> exit > >> >> > >> >> The expected file contents in the golden output are: > >> >> > >> >> File foo content after log replay: > >> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > >> >> * > >> >> 0200000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >> >> * > >> >> 0372000 > >> >> File bar content after log replay: > >> >> 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >> >> * > >> >> 0772000 > >> >> > >> >> Without this fix, their contents are: > >> >> > >> >> File foo content after log replay: > >> >> 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa > >> >> * > >> >> 0200000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb > >> >> * > >> >> 0372000 > >> >> File bar content after log replay: > >> >> 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee > >> >> * > >> >> 0200000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > >> >> * > >> >> 0372000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > >> >> * > >> >> 0772000 > >> >> > >> >> A test case submission for fstests follows soon. > >> >> > >> >> Signed-off-by: Filipe Manana <fdman...@suse.com> > >> >> --- > >> >> fs/btrfs/tree-log.c | 108 > >> >> ++++++++++++++++++++++++++++++++++++++++++++++++++++ > >> >> 1 file changed, 108 insertions(+) > >> >> > >> >> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c > >> >> index 7ac45cf..ac90336 100644 > >> >> --- a/fs/btrfs/tree-log.c > >> >> +++ b/fs/btrfs/tree-log.c > >> >> @@ -4203,6 +4203,107 @@ static int btrfs_log_all_xattrs(struct > >> >> btrfs_trans_handle *trans, > >> >> return 0; > >> >> } > >> >> > >> >> +/* > >> >> + * If the no holes feature is enabled we need to make sure any hole > >> >> between the > >> >> + * last extent and the i_size of our inode is explicitly marked in the > >> >> log. This > >> >> + * is to make sure that doing something like: > >> >> + * > >> >> + * 1) create file with 128Kb of data > >> >> + * 2) truncate file to 64Kb > >> >> + * 3) truncate file to 256Kb > >> >> + * 4) fsync file > >> >> + * 5) <crash/power failure> > >> >> + * 6) mount fs and trigger log replay > >> >> + * > >> >> + * Will give us a file with a size of 256Kb, the first 64Kb of data > >> >> match what > >> >> + * the file had in its first 64Kb of data at step 1 and the last 192Kb > >> >> of the > >> >> + * file correspond to a hole. The presence of explicit holes in a log > >> >> tree is > >> >> + * what guarantees that log replay will remove/adjust file extent > >> >> items in the > >> >> + * fs/subvol tree. > >> >> + * > >> >> + * Here we do not need to care about holes between extents, that is > >> >> already done > >> >> + * by copy_items(). We also only need to do this in the full sync > >> >> path, where we > >> >> + * lookup for extents from the fs/subvol tree only. In the fast path > >> >> case, we > >> >> + * lookup the list of modified extent maps and if any represents a > >> >> hole, we > >> >> + * insert a corresponding extent representing a hole in the log tree. > >> >> + */ > >> >> +static int btrfs_log_trailing_hole(struct btrfs_trans_handle *trans, > >> >> + struct btrfs_root *root, > >> >> + struct inode *inode, > >> >> + struct btrfs_path *path) > >> >> +{ > >> >> + int ret; > >> >> + struct btrfs_key key; > >> >> + u64 hole_start; > >> >> + u64 hole_size; > >> >> + struct extent_buffer *leaf; > >> >> + struct btrfs_root *log = root->log_root; > >> >> + const u64 ino = btrfs_ino(inode); > >> >> + const u64 i_size = i_size_read(inode); > >> >> + > >> >> + if (!btrfs_fs_incompat(root->fs_info, NO_HOLES)) > >> >> + return 0; > >> >> + > >> >> + key.objectid = ino; > >> >> + key.type = BTRFS_EXTENT_DATA_KEY; > >> >> + key.offset = (u64)-1; > >> >> + > >> >> + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); > >> >> + ASSERT(ret != 0); > >> >> + if (ret < 0) > >> >> + return ret; > >> >> + > >> >> + ASSERT(path->slots[0] > 0); > >> >> + path->slots[0]--; > >> >> + leaf = path->nodes[0]; > >> >> + btrfs_item_key_to_cpu(leaf, &key, path->slots[0]); > >> >> + > >> >> + if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) { > >> >> + /* inode does not have any extents */ > >> >> + hole_start = 0; > >> >> + hole_size = i_size; > >> >> + } else { > >> >> + struct btrfs_file_extent_item *extent; > >> >> + u64 len; > >> >> + > >> >> + /* > >> >> + * If there's an extent beyond i_size, an explicit hole > >> >> was > >> >> + * already inserted by copy_items(). > >> >> + */ > >> >> + if (key.offset >= i_size) > >> >> + return 0; > >> >> + > >> >> + extent = btrfs_item_ptr(leaf, path->slots[0], > >> >> + struct btrfs_file_extent_item); > >> >> + > >> >> + if (btrfs_file_extent_type(leaf, extent) == > >> >> + BTRFS_FILE_EXTENT_INLINE) { > >> >> + len = btrfs_file_extent_inline_len(leaf, > >> >> + path->slots[0], > >> >> + extent); > >> >> + ASSERT(len == i_size); > >> >> + return 0; > >> >> + } > >> >> + > >> >> + len = btrfs_file_extent_num_bytes(leaf, extent); > >> >> + /* Last extent goes beyond i_size, no need to log a hole. > >> >> */ > >> >> + if (key.offset + len > i_size) > >> >> + return 0; > >> >> + hole_start = key.offset + len; > >> >> + hole_size = i_size - hole_start; > >> >> + } > >> >> + btrfs_release_path(path); > >> >> + > >> >> + /* Last extent ends at i_size. */ > >> >> + if (hole_size == 0) > >> >> + return 0; > >> >> + > >> >> + hole_size = ALIGN(hole_size, root->sectorsize); > >> >> + ret = btrfs_insert_file_extent(trans, log, ino, hole_start, 0, 0, > >> >> + hole_size, 0, hole_size, 0, 0, 0); > >> >> + return ret; > >> >> +} > >> >> + > >> >> /* log a single inode in the tree log. > >> >> * At least one parent directory for this inode must exist in the tree > >> >> * or be logged already. > >> >> @@ -4466,6 +4567,13 @@ next_slot: > >> >> err = btrfs_log_all_xattrs(trans, root, inode, path, dst_path); > >> >> if (err) > >> >> goto out_unlock; > >> >> + if (max_key.type >= BTRFS_EXTENT_DATA_KEY && !fast_search) { > >> >> + btrfs_release_path(path); > >> >> + btrfs_release_path(dst_path); > >> >> + err = btrfs_log_trailing_hole(trans, root, inode, path); > >> >> + if (err) > >> >> + goto out_unlock; > >> >> + } > >> >> log_extents: > >> >> btrfs_release_path(path); > >> >> btrfs_release_path(dst_path); > >> >> -- > >> >> 2.1.3 > >> >> > >> >> -- > >> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" > >> >> in > >> >> the body of a message to majord...@vger.kernel.org > >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html