On 1/11/21 6:41 AM, fdman...@kernel.org wrote:
From: Filipe Manana <fdman...@suse.com>

When an incremental send finds an extent that is shared, it checks which
file extent items in the range refer to that extent, and for those it
emits clone operations, while for others it emits regular write operations
to avoid corruption at the destination (as described and fixed by commit
d906d49fc5f4 ("Btrfs: send, fix file corruption due to incorrect cloning
operations")).

However when the root we are cloning from is the send root, we are cloning
from the inode currently being processed and the source file range has
several extent items that partially point to the desired extent, with an
offset smaller than the offset in the file extent item for the range we
want to clone into, it can cause the algorithm to issue a clone operation
that starts at the current eof of the file being processed in the receiver
side, in which case the receiver will fail, with -EINVAL, when attempting
to execute the clone operation.

Example reproducer:

   $ cat test-send-clone.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   mkfs.btrfs -f $DEV >/dev/null
   mount $DEV $MNT

   # Create our test file with a single and large extent (1M) and with
   # different content for different file ranges that will be reflinked
   # later.
   xfs_io -f \
          -c "pwrite -S 0xab 0 128K" \
          -c "pwrite -S 0xcd 128K 128K" \
          -c "pwrite -S 0xef 256K 256K" \
          -c "pwrite -S 0x1a 512K 512K" \
          $MNT/foobar

   btrfs subvolume snapshot -r $MNT $MNT/snap1
   btrfs send -f /tmp/snap1.send $MNT/snap1

   # Now do a series of changes to our file such that we end up with
   # different parts of the extent reflinked into different file offsets
   # and we overwrite a large part of the extent too, so no file extent
   # items refer to that part that was overwritten. This used to confure
   # the algorithm used by the kernel to figure out which file ranges to
   # clone, making it attempt to clone from a source range starting at
   # the current eof of the file, resulting in the receiver to fail since
   # it is an invalid clone operation.
   #
   xfs_io -c "reflink $MNT/foobar 64K 1M 960K" \
          -c "reflink $MNT/foobar 0K 512K 256K" \
          -c "reflink $MNT/foobar 512K 128K 256K" \
          -c "pwrite -S 0x73 384K 640K" \
          $MNT/foobar

   btrfs subvolume snapshot -r $MNT $MNT/snap2
   btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

   echo -e "\nFile digest in the original filesystem:"
   md5sum $MNT/snap2/foobar

   # Now unmount the filesystem, create a new one, mount it and try to
   # apply both send streams to recreate both snapshots.
   umount $DEV

   mkfs.btrfs -f $DEV >/dev/null
   mount $DEV $MNT

   btrfs receive -f /tmp/snap1.send $MNT
   btrfs receive -f /tmp/snap2.send $MNT

   # Must match what we got in the original filesystem of course.
   echo -e "\nFile digest in the new filesystem:"
   md5sum $MNT/snap2/foobar

   umount $MNT

When running the reproducer, the incremental send operation fails due to
an invalid clone operation:

   $ ./test-send-clone.sh
   wrote 131072/131072 bytes at offset 0
   128 KiB, 32 ops; 0.0015 sec (80.906 MiB/sec and 20711.9741 ops/sec)
   wrote 131072/131072 bytes at offset 131072
   128 KiB, 32 ops; 0.0013 sec (90.514 MiB/sec and 23171.6148 ops/sec)
   wrote 262144/262144 bytes at offset 262144
   256 KiB, 64 ops; 0.0025 sec (98.270 MiB/sec and 25157.2327 ops/sec)
   wrote 524288/524288 bytes at offset 524288
   512 KiB, 128 ops; 0.0052 sec (95.730 MiB/sec and 24506.9883 ops/sec)
   Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
   At subvol /mnt/sdi/snap1
   linked 983040/983040 bytes at offset 1048576
   960 KiB, 1 ops; 0.0006 sec (1.419 GiB/sec and 1550.3876 ops/sec)
   linked 262144/262144 bytes at offset 524288
   256 KiB, 1 ops; 0.0020 sec (120.192 MiB/sec and 480.7692 ops/sec)
   linked 262144/262144 bytes at offset 131072
   256 KiB, 1 ops; 0.0018 sec (133.833 MiB/sec and 535.3319 ops/sec)
   wrote 655360/655360 bytes at offset 393216
   640 KiB, 160 ops; 0.0093 sec (66.781 MiB/sec and 17095.8436 ops/sec)
   Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
   At subvol /mnt/sdi/snap2

   File digest in the original filesystem:
   9c13c61cb0b9f5abf45344375cb04dfa  /mnt/sdi/snap2/foobar
   At subvol snap1
   At snapshot snap2
   ERROR: failed to clone extents to foobar: Invalid argument

   File digest in the new filesystem:
   132f0396da8f48d2e667196bff882cfc  /mnt/sdi/snap2/foobar

The clone operation is invalid because its source range starts at the
current eof of the file in the receiver, causing the receiver to get
an -EINVAL error from the clone operation when attempting it.

For the example above, what happens is the following:

1) When processing the extent at file offset 1M, the algorithm checks that
    the extent is shared and can be (fully or partially) found at file
    offset 0.

    At this point the file has a size (and eof) of 1M at the receiver;

2) It finds that our extent item at file offset 1M has a data offset of
    64K and, since the file extent item at file offset 0 has a data offset
    of 0, it issues a clone operation, from the same file and root, that
    has a source range offset of 64K, destination offset of 1M and a length
    of 64K, since the extent item at file offset 0 refers only to the first
    128K of the shared extent.

    After this clone operation, the file size (and eof) at the receiver is
    increased from 1M to 1088K (1M + 64K);

3) Now there's still 896K (960K - 64K) of data left to clone or write, so
    it checks for the next file extent item, which starts at file offset
    128K. This file extent item has a data offset of 0 and a length of
    256K, so a clone operation with a source range offset of 256K, a
    destination offset of 1088K (1M + 64K) and length of 128K is issued.

    After this operation the file size (and eof) at the receiver increases
    from 1088K to 1216K (1088K + 128K);

4) Now there's still 768K (896K - 128K) of data left to clone or write, so
    it checks for the next file extent item, located at file offset 384K.
    This file extent item points to a different extent, not the one we want
    to clone, with a length of 640K. So we issue a write operation into the
    file range 1216K (1088K + 128K, end of the last clone operation), with
    a length of 640K and with a data matching the one we can find for that
    range in send root.

    After this operation, the file size (and eof) at the receiver increases
    from 1216K to 1856K (1216K + 640K);

5) Now there's still 128K (768K - 640K) of data left to clone or write, so
    we look into the file extent item, which is for file offset 1M and it
    points to the extent we want to clone, with a data offset of 64K and a
    length of 960K.

    However this matches the file offset we started with, the start of the
    range to clone into. So we can't for sure find any file extent item
    from here onwards with the rest of the data we want to clone, yet we
    proceed and since the file extent item points to the shared extent,
    with a data offset of 64K, we issue a clone operation with a source
    range starting at file offset 1856K, which matches the file extent
    item's offset, 1M, plus the amount of data cloned and written so far,
    which is 64K (step 2) + 128K (step 3) + 640K (step 4). This clone
    operation is invalid since the source range offset matches the current
    eof of the file in the receiver. We should have stopped looking for
    extents to clone at this point and instead fallback to write, which
    would simply the contain the data in the file range from 1856K to
    1856K + 128K.

So fix this by stopping the loop that looks for file ranges to clone at
clone_range() when we reach the current eof of the file being processed,
if we are cloning from the same file and using the send root as the clone
root. This ensures any data not yet cloned will be sent to the receiver
through a write operation.

A test case for fstests will follow soon.

Reported-by: Massimo B. <massim...@gmx.net>
Link: 
https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.ca...@gmx.net/
Fixes: 11f2069c113e ("Btrfs: send, allow clone operations within the same file")
CC: sta...@vger.kernel.org # 5.5+
Signed-off-by: Filipe Manana <fdman...@suse.com>

Read the commit log like 9 times, then read the comment and understood what was happening.

Reviewed-by: Josef Bacik <jo...@toxicpanda.com>

Thanks,

Josef

Reply via email to