On 27.03.19 г. 9:25 ч., Qu Wenruo wrote:
> [BUG]
> There are tons of reports of btrfs-progs screwing up the fs, the most
> recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
> and then leaving the fs with transid mismatch problem.
>
> [CAUSE]
> In kernel, we have block layer handing the flush work, even on devices
> without FUA support (like most SATA device using default libata
> settings), kernel handles FUA write by flushing the device, then normal
> write, and finish it with another flush.
>
> The pre-flush, write, post-flush works pretty well to implement FUA
> write.
>
> However in btrfs-progs we just use pwrite(), there is nothing keeping
> the write order.
This could be expanded to include the following information:
1. The device fs is opened with only O_RD or O_RDWR i.e we don't pass
any special arguments when opening the fd.
2. We do not use fsync anywhere
>
> So even for basic v1 free space cache clearing, we have different vision
> on the write sequence from kernel bio layer (by dm-log-writes) and user
> space pwrite() calls.
>
> In btrfs-progs, with extra debug output in write_tree_block() and
> write_dev_supers(), we can see btrfs-progs follows the right write
> sequence:
>
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/log
> UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
> write tree block start=1708130304 gen=39
> write tree block start=1708146688 gen=39
> write tree block start=1708163072 gen=39
> write super devid=1 gen=39
> write tree block start=1708179456 gen=40
> write tree block start=1708195840 gen=40
> write super devid=1 gen=40
> write tree block start=1708130304 gen=41
> write tree block start=1708146688 gen=41
> write tree block start=1708228608 gen=41
> write super devid=1 gen=41
> write tree block start=1708163072 gen=42
> write tree block start=1708179456 gen=42
> write super devid=1 gen=42
> write tree block start=1708130304 gen=43
> write tree block start=1708146688 gen=43
> write super devid=1 gen=43
> Free space cache cleared
>
> But from dm-log-writes, the bio sequence is a different story:
>
> replaying 1742: sector 131072, size 4096, flags 0(NONE)
> replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
> replaying 1744: sector 2828480, size 4096, flags 0(NONE)
> replaying 1745: sector 2828488, size 4096, flags 0(NONE)
> replaying 1746: sector 2828496, size 4096, flags 0(NONE)
> replaying 1787: sector 2304120, size 4096, flags 0(NONE)
> ......
> replaying 1790: sector 2304144, size 4096, flags 0(NONE)
> replaying 1791: sector 2304152, size 4096, flags 0(NONE)
> replaying 1792: sector 0, size 0, flags 8(MARK)
>
> During the free space cache clearing, we committed 3 transaction but
> dm-log-write only caught one super block write.
>
> This means all the 3 writes were merged into the last super block write.
> And the super block write was the 2nd write, before all tree block
> writes, completely screwing up the metadata CoW protection.
>
> No wonder crashed btrfs-progs can make things worse.
>
> [FIX]
> Fix this super serious problem by implementing pre and post flush for
> the primary super block in btrfs-progs.
>
> Signed-off-by: Qu Wenruo <w...@suse.com>
Reviewed-by: Nikolay Borisov <nbori...@suse.com>
> ---
> disk-io.c | 28 ++++++++++++++++++++++++++++
> 1 file changed, 28 insertions(+)
>
> diff --git a/disk-io.c b/disk-io.c
> index 238b1821be14..275ccff965e2 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -1585,6 +1585,16 @@ static int write_dev_supers(struct btrfs_fs_info
> *fs_info,
> u32 crc;
> int i, ret;
>
> + /*
> + * We need to write super block after all metadata written.
> + * This is the equivalent of kernel pre-flush for FUA.
> + */
> + ret = fsync(device->fd);
> + if (ret < 0) {
> + error("failed to flush: %m");
> + ret = -errno;
> + goto write_err;
> + }
> if (fs_info->super_bytenr != BTRFS_SUPER_INFO_OFFSET) {
> btrfs_set_super_bytenr(sb, fs_info->super_bytenr);
> crc = ~(u32)0;
> @@ -1605,6 +1615,12 @@ static int write_dev_supers(struct btrfs_fs_info
> *fs_info,
> error("failed to write super block: %m");
> goto write_err;
> }
> + ret = fsync(device->fd);
> + if (ret < 0) {
> + error("failed to flush: %m");
> + ret = -errno;
> + goto write_err;
> + }
> return 0;
> }
>
> @@ -1632,6 +1648,18 @@ static int write_dev_supers(struct btrfs_fs_info
> *fs_info,
> error("failed to write super block %i: %m", i);
> goto write_err;
> }
> + /*
> + * Flush after the primary sb write, this is the equivalent of
> + * kernel post-flush for FUA write.
> + */
> + if (i == 0) {
> + ret = fsync(device->fd);
> + if (ret < 0) {
> + error("failed to flush: %m");
> + ret = -errno;
> + goto write_err;
> + }
> + }
> }
>
> return 0;
>