On 22.04.19 г. 18:44 ч., fdman...@kernel.org wrote:
> From: Filipe Manana <fdman...@suse.com>
>
> Send always operates on read-only trees and always expected that while it
> is in progress, nothing changes in those trees. Due to that expectation
> and the fact that send is a read-only operation, it operates on commit
> roots and does not hold transaction handles. However relocation can COW
> nodes and leafs from read-only trees, which can cause unexpected failures
> and crashes (hitting BUG_ONs). while send using a node/leaf, it gets
> COWed, the transaction used to COW it is committed, a new transaction
> starts, the extent previously used for that node/leaf gets allocated,
> possibly for another tree, and the respective extent buffer' content
> changes while send is still using it. When this happens send normally
> fails with EIO being returned to user space and messages like the
> following are found in dmesg/syslog:
>
> [ 3408.699121] BTRFS error (device sdc): parent transid verify failed on
> 58703872 wanted 250 found 253
> [ 3441.523123] BTRFS error (device sdc): did not find backref in send_root.
> inode=63211, offset=0, disk_byte=5222825984 found extent=5222825984
>
> Other times, less often, we hit a BUG_ON() because an extent buffer that
> send is using used to be a node, and while send is still using it, it
> got COWed and got reused as a leaf while send is still using, producing
> the following trace:
>
> [ 3478.466280] ------------[ cut here ]------------
> [ 3478.466282] kernel BUG at fs/btrfs/ctree.c:1806!
> [ 3478.466965] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> [ 3478.467635] CPU: 0 PID: 2165 Comm: btrfs Not tainted 5.0.0-btrfs-next-46
> #1
> [ 3478.468311] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
> [ 3478.469681] RIP: 0010:read_node_slot+0x122/0x130 [btrfs]
> (...)
> [ 3478.471758] RSP: 0018:ffffa437826bfaa0 EFLAGS: 00010246
> [ 3478.472457] RAX: ffff961416ed7000 RBX: 000000000000003d RCX:
> 0000000000000002
> [ 3478.473151] RDX: 000000000000003d RSI: ffff96141e387408 RDI:
> ffff961599b30000
> [ 3478.473837] RBP: ffffa437826bfb8e R08: 0000000000000001 R09:
> ffffa437826bfb8e
> [ 3478.474515] R10: ffffa437826bfa70 R11: 0000000000000000 R12:
> ffff9614385c8708
> [ 3478.475186] R13: 0000000000000000 R14: 0000000000000000 R15:
> 0000000000000000
> [ 3478.475840] FS: 00007f8e0e9cc8c0(0000) GS:ffff9615b6a00000(0000)
> knlGS:0000000000000000
> [ 3478.476489] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 3478.477127] CR2: 00007f98b67a056e CR3: 0000000005df6005 CR4:
> 00000000003606f0
> [ 3478.477762] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 3478.478385] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 3478.479003] Call Trace:
> [ 3478.479600] ? do_raw_spin_unlock+0x49/0xc0
> [ 3478.480202] tree_advance+0x173/0x1d0 [btrfs]
> [ 3478.480810] btrfs_compare_trees+0x30c/0x690 [btrfs]
> [ 3478.481388] ? process_extent+0x1280/0x1280 [btrfs]
> [ 3478.481954] btrfs_ioctl_send+0x1037/0x1270 [btrfs]
> [ 3478.482510] _btrfs_ioctl_send+0x80/0x110 [btrfs]
> [ 3478.483062] btrfs_ioctl+0x13fe/0x3120 [btrfs]
> [ 3478.483581] ? rq_clock_task+0x2e/0x60
> [ 3478.484086] ? wake_up_new_task+0x1f3/0x370
> [ 3478.484582] ? do_vfs_ioctl+0xa2/0x6f0
> [ 3478.485075] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
> [ 3478.485552] do_vfs_ioctl+0xa2/0x6f0
> [ 3478.486016] ? __fget+0x113/0x200
> [ 3478.486467] ksys_ioctl+0x70/0x80
> [ 3478.486911] __x64_sys_ioctl+0x16/0x20
> [ 3478.487337] do_syscall_64+0x60/0x1b0
> [ 3478.487751] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 3478.488159] RIP: 0033:0x7f8e0d7d4dd7
> (...)
> [ 3478.489349] RSP: 002b:00007ffcf6fb4908 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000010
> [ 3478.489742] RAX: ffffffffffffffda RBX: 0000000000000105 RCX:
> 00007f8e0d7d4dd7
> [ 3478.490142] RDX: 00007ffcf6fb4990 RSI: 0000000040489426 RDI:
> 0000000000000005
> [ 3478.490548] RBP: 0000000000000005 R08: 00007f8e0d6f3700 R09:
> 00007f8e0d6f3700
> [ 3478.490953] R10: 00007f8e0d6f39d0 R11: 0000000000000202 R12:
> 0000000000000005
> [ 3478.491343] R13: 00005624e0780020 R14: 0000000000000000 R15:
> 0000000000000001
> (...)
> [ 3478.493352] ---[ end trace d5f537302be4f8c8 ]---
>
> Another possibility, much less likely to happen, is that send will not
> fail but the contents of the stream it produces may not be correct.
>
> To avoid this, do not allow send and relocation (balance) to run in
> parallel. In the long term the goal is to allow for both to be able to
> run concurrently without any problems, but that will take a significant
> effort in development and testing.
>
> Signed-off-by: Filipe Manana <fdman...@suse.com>
> ---
> fs/btrfs/ctree.h | 7 +++++++
> fs/btrfs/disk-io.c | 2 ++
> fs/btrfs/send.c | 14 ++++++++++++++
> fs/btrfs/volumes.c | 8 ++++++++
> 4 files changed, 31 insertions(+)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 753ff68a8e8f..e6284e353dee 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -785,6 +785,7 @@ enum {
> /*
> * Indicate that balance has been set up from the ioctl and is in the
> * main phase. The fs_info::balance_ctl is initialized.
> + * Set and cleared while holding fs_info::balance_mutex.
> */
> BTRFS_FS_BALANCE_RUNNING,
>
> @@ -1167,6 +1168,12 @@ struct btrfs_fs_info {
> spinlock_t swapfile_pins_lock;
> struct rb_root swapfile_pins;
>
> + /*
> + * Number of send operations in progress.
> + * Updated while holding fs_info::balance_mutex.
> + */
> + int send_in_progress;
> +
Rather than introducing yet another variable and more state why not
piggy back on BTRFS_FS_EXCL_OP, this will prevent send running while
balance is running as well as while devices are being removed/added.
> #ifdef CONFIG_BTRFS_FS_REF_VERIFY
> spinlock_t ref_verify_lock;
> struct rb_root block_tree;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5216e7b3f9ad..4cde9a9654fe 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2785,6 +2785,8 @@ int open_ctree(struct super_block *sb,
> spin_lock_init(&fs_info->swapfile_pins_lock);
> fs_info->swapfile_pins = RB_ROOT;
>
> + fs_info->send_in_progress = 0;
> +
> ret = btrfs_alloc_stripe_hash_table(fs_info);
> if (ret) {
> err = ret;
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index f91f474f14f6..5c32ac661519 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -6869,9 +6869,23 @@ long btrfs_ioctl_send(struct file *mnt_file, struct
> btrfs_ioctl_send_args *arg)
> if (ret)
> goto out;
>
> + mutex_lock(&fs_info->balance_mutex);
> + if (test_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags)) {
> + mutex_unlock(&fs_info->balance_mutex);
> + btrfs_warn_rl(fs_info,
> + "Can not run send because a balance operation is in progress");
> + ret = -EAGAIN;
> + goto out;
> + }
> + fs_info->send_in_progress++;
> + mutex_unlock(&fs_info->balance_mutex);
> +
> current->journal_info = BTRFS_SEND_TRANS_STUB;
> ret = send_subvol(sctx);
> current->journal_info = NULL;
> + mutex_lock(&fs_info->balance_mutex);
> + fs_info->send_in_progress--;
> + mutex_unlock(&fs_info->balance_mutex);
> if (ret < 0)
> goto out;
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index db934ceae9c1..8145b62e3912 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4203,6 +4203,14 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
> get_raid_name(meta_index),
> get_raid_name(data_index));
> }
>
> + if (fs_info->send_in_progress) {
> + btrfs_warn_rl(fs_info,
> +"Can not run balance while send operations are in progress (%d in progress)",
> + fs_info->send_in_progress);
> + ret = -EAGAIN;
> + goto out;
> + }
> +
> ret = insert_balance_item(fs_info, bctl);
> if (ret && ret != -EEXIST)
> goto out;
>