Re: [PATCH] Btrfs: fix data loss in the fast fsync path

Liu Bo Sat, 28 Feb 2015 07:04:50 -0800

On Thu, Feb 26, 2015 at 07:52:57PM +0000, Filipe Manana wrote:
> When using the fast file fsync code path we can miss the fact that
> there new writes happened since the last file fsync and therefore
> return without waiting for the IO to finish and write the new extents
> to the fsync log.
> 
> The comment this change adds to the fsync handler explains how this
> can happen and why, and therefore it's included here:
> 
>   "If the last transaction that changed this file was before the current
>    transaction and we have the full sync flag set in our inode, we can
>    bail out now without any syncing.
> 
>    Note that we can't bail out if the full sync flag isn't set. This is
>    because when the full sync flag is set we start all ordered extents
>    and wait for them to fully complete - when they complete they update
>    the inode's last_trans field through:
> 
>        btrfs_finish_ordered_io() ->
>            btrfs_update_inode_fallback() ->
>                btrfs_update_inode() ->
>                    btrfs_set_inode_last_trans()
> 
>    So we are sure that last_trans is up to date and can do this check to
>    bail out safely. For the fast path, when the full sync flag is not
>    set in our inode, we can not do it because we start only our ordered
>    extents and don't wait for them to complete (that is when
>    btrfs_finish_ordered_io runs) - if we rely on the speculative value
>    for inode->last_trans set by btrfs_file_write_iter we lose data in
>    the following scenario:
> 
>    1. fs_info->last_trans_committed == N - 1 and current transaction is
>       transaction N (fs_info->generation == N);
> 
>    2. do a buffered write;
> 
>    3. fsync our inode, this clears our inode's full sync flag, starts
>       an ordered extent and waits for it to complete - when it completes
>       at btrfs_finish_ordered_io(), the inode's last_trans is set to
>       the value N;
> 
>    4. transaction N is committed, so fs_info->last_trans_committed is
>       now set to the value N and fs_info->generation remains with the
>       value N;
> 
>    5. do another buffered write, when this happens btrfs_file_write_iter
>       sets our inode's last_trans to the value N + 1;
> 
>    6. transaction N + 1 is started and fs_info->generation now has the
>       value N + 1;
> 
>    7. transaction N + 1 is committed, so fs_info->last_trans_committed
>       is set to the value N + 1;
> 
>    8. fsync our inode - because it doesn't have the full sync flag set,
>       we only start the ordered extent, we don't wait for it to complete
>       (only in a later phase) therefore its last_trans field has the
>       value N + 1 set previously by btrfs_file_write_iter(), and so we
>       have:
> 
>           inode->last_trans <= fs_info->last_trans_committed
>               (N + 1)              (N + 1)
> 
>       Which would make us not log the last buffered write, resulting in
>       data loss after a crash."
> 
> This is actually deterministic and not hard to trigger. The following excerpt
> from a testcase I made for xfstests triggers the issue. It moves a dummy file
> across directories and then fsyncs the old parent directory - this is just to
> trigger a transaction commit, so moving files around isn't directly related
> to the issue but it was chosen because running 'sync' for example does more
> than just committing the current transaction, as it flushes/waits for all
> file data to be persisted.
> The body of the test is:
> 
>   _scratch_mkfs >> $seqres.full 2>&1
>   _init_flakey
>   _mount_flakey
> 
>   # Create our main test file 'foo', the one we check for data loss.
>   # By doing an fsync against our file, it makes btrfs clear the 
> 'needs_full_sync'
>   # bit from its flags (btrfs inode specific flags).
>   $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
>                   -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Now create one other file and 2 directories. We will move this second file
>   # from one directory to the other later because it forces btrfs to commit 
> its
>   # currently open transaction if we fsync the old parent directory. This is
>   # necessary to trigger the data loss bug that affected btrfs.
>   mkdir $SCRATCH_MNT/testdir_1
>   touch $SCRATCH_MNT/testdir_1/bar
>   mkdir $SCRATCH_MNT/testdir_2
> 
>   # Make sure everything is durably persisted.
>   sync
> 
>   # Write more 8Kb of data to our file.
>   $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
> 
>   # Move our 'bar' file into a new directory.
>   mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
> 
>   # Fsync our first directory. Because it had a file moved into some other
>   # directory, this made btrfs commit the currently open transaction. This is
>   # a condition necessary to trigger the data loss bug.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
> 
>   # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
>   # data we wrote previously to be persisted and available if a crash happens.
>   # This did not happen with btrfs, because of the transaction commit that
>   # happened when we fsynced the parent directory.
>   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
> 
>   # Simulate a crash/power loss.
>   _load_flakey_table $FLAKEY_DROP_WRITES
>   _unmount_flakey
> 
>   _load_flakey_table $FLAKEY_ALLOW_WRITES
>   _mount_flakey
> 
>   # Now check that all data we wrote before are available.
>   echo "File content after log replay:"
>   od -t x1 $SCRATCH_MNT/foo
> 
>   status=0
>   exit
> 
> The expected golden output for the test, which is what we get with this
> fix applied (or when running against ext3/4 and xfs), is:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
>   *
>   0040000
> 
> Without this fix applied, the output shows the test file is empty:
> 
>   wrote 8192/8192 bytes at offset 0
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   wrote 8192/8192 bytes at offset 8192
>   XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>   File content after log replay:
>   0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
>   *
>   0020000
> 
> A test case for xfstests will be sent soon.
> 
> Signed-off-by: Filipe Manana <fdman...@suse.com>
> ---
>  fs/btrfs/file.c | 63 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 58 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 2bd72cd..91122b6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1971,14 +1971,67 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> loff_t end, int datasync)
>       }
>  
>       /*
> -      * if the last transaction that changed this file was before
> -      * the current transaction, we can bail out now without any
> -      * syncing
> +      * If the last transaction that changed this file was before the current
> +      * transaction and we have the full sync flag set in our inode, we can
> +      * bail out now without any syncing.
> +      *
> +      * Note that we can't bail out if the full sync flag isn't set. This is
> +      * because when the full sync flag is set we start all ordered extents
> +      * and wait for them to fully complete - when they complete they update
> +      * the inode's last_trans field through:
> +      *
> +      *     btrfs_finish_ordered_io() ->
> +      *         btrfs_update_inode_fallback() ->
> +      *             btrfs_update_inode() ->
> +      *                 btrfs_set_inode_last_trans()
> +      *
> +      * So we are sure that last_trans is up to date and can do this check to
> +      * bail out safely. For the fast path, when the full sync flag is not
> +      * set in our inode, we can not do it because we start only our ordered
> +      * extents and don't wait for them to complete (that is when
> +      * btrfs_finish_ordered_io runs) - if we rely on the speculative value
> +      * for inode->last_trans set by btrfs_file_write_iter we lose data in
> +      * the following scenario:
> +      *
> +      * 1. fs_info->last_trans_committed == N - 1 and current transaction is
> +      *    transaction N (fs_info->generation == N);
> +      *
> +      * 2. do a buffered write;
> +      *
> +      * 3. fsync our inode, this clears our inode's full sync flag, starts
> +      *    an ordered extent and waits for it to complete - when it completes
> +      *    at btrfs_finish_ordered_io(), the inode's last_trans is set to
> +      *    the value N;
> +      *
> +      * 4. transaction N is committed, so fs_info->last_trans_committed is
> +      *    now set to the value N and fs_info->generation remains with the
> +      *    value N;
> +      *
> +      * 5. do another buffered write, when this happens btrfs_file_write_iter
> +      *    sets our inode's last_trans to the value N + 1;


This problem proves that in step 5 it doesn't work any more by setting 
last_trans to "root->fs_info->generation + 1", we can remove that setting.

Others look good.

Reviewed-by: Liu Bo <bo.li....@oracle.com>

Thanks,

-liubo
> +      *
> +      * 6. transaction N + 1 is started and fs_info->generation now has the
> +      *    value N + 1;
> +      *
> +      * 7. transaction N + 1 is committed, so fs_info->last_trans_committed
> +      *    is set to the value N + 1;
> +      *
> +      * 8. fsync our inode - because it doesn't have the full sync flag set,
> +      *    we only start the ordered extent, we don't wait for it to complete
> +      *    (only in a later phase) therefore its last_trans field has the
> +      *    value N + 1 set previously by btrfs_file_write_iter(), and so we
> +      *    have:
> +      *
> +      *        inode->last_trans <= fs_info->last_trans_committed
> +      *            (N + 1)              (N + 1)
> +      *
> +      *    Which would make us not log the last buffered write, resulting in
> +      *    data loss after a crash.
>        */
>       smp_mb();
>       if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
> -         BTRFS_I(inode)->last_trans <=
> -         root->fs_info->last_trans_committed) {
> +         (full_sync && BTRFS_I(inode)->last_trans <=
> +          root->fs_info->last_trans_committed)) {
>               BTRFS_I(inode)->last_trans = 0;
>  
>               /*
> -- 
> 2.1.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix data loss in the fast fsync path

Reply via email to