On 27.03.19 г. 9:25 ч., Qu Wenruo wrote:
> [BUG]
> There are tons of reports of btrfs-progs screwing up the fs, the most
> recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
> and then leaving the fs with transid mismatch problem.
> 
> [CAUSE]
> In kernel, we have block layer handing the flush work, even on devices
> without FUA support (like most SATA device using default libata
> settings), kernel handles FUA write by flushing the device, then normal
> write, and finish it with another flush.
> 
> The pre-flush, write, post-flush works pretty well to implement FUA
> write.
> 
> However in btrfs-progs we just use pwrite(), there is nothing keeping
> the write order.

This could be expanded to include the following information:

1. The device fs is opened with only O_RD or O_RDWR i.e we don't pass
any special arguments when opening the fd.

2. We do not use fsync anywhere

> 
> So even for basic v1 free space cache clearing, we have different vision
> on the write sequence from kernel bio layer (by dm-log-writes) and user
> space pwrite() calls.
> 
> In btrfs-progs, with extra debug output in write_tree_block() and
> write_dev_supers(), we can see btrfs-progs follows the right write
> sequence:
> 
>   Opening filesystem to check...
>   Checking filesystem on /dev/mapper/log
>   UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
>   write tree block start=1708130304 gen=39
>   write tree block start=1708146688 gen=39
>   write tree block start=1708163072 gen=39
>   write super devid=1 gen=39
>   write tree block start=1708179456 gen=40
>   write tree block start=1708195840 gen=40
>   write super devid=1 gen=40
>   write tree block start=1708130304 gen=41
>   write tree block start=1708146688 gen=41
>   write tree block start=1708228608 gen=41
>   write super devid=1 gen=41
>   write tree block start=1708163072 gen=42
>   write tree block start=1708179456 gen=42
>   write super devid=1 gen=42
>   write tree block start=1708130304 gen=43
>   write tree block start=1708146688 gen=43
>   write super devid=1 gen=43
>   Free space cache cleared
> 
> But from dm-log-writes, the bio sequence is a different story:
> 
>   replaying 1742: sector 131072, size 4096, flags 0(NONE)
>   replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
>   replaying 1744: sector 2828480, size 4096, flags 0(NONE)
>   replaying 1745: sector 2828488, size 4096, flags 0(NONE)
>   replaying 1746: sector 2828496, size 4096, flags 0(NONE)
>   replaying 1787: sector 2304120, size 4096, flags 0(NONE)
>   ......
>   replaying 1790: sector 2304144, size 4096, flags 0(NONE)
>   replaying 1791: sector 2304152, size 4096, flags 0(NONE)
>   replaying 1792: sector 0, size 0, flags 8(MARK)
> 
> During the free space cache clearing, we committed 3 transaction but
> dm-log-write only caught one super block write.
> 
> This means all the 3 writes were merged into the last super block write.
> And the super block write was the 2nd write, before all tree block
> writes, completely screwing up the metadata CoW protection.
> 
> No wonder crashed btrfs-progs can make things worse.
> 
> [FIX]
> Fix this super serious problem by implementing pre and post flush for
> the primary super block in btrfs-progs.
> 
> Signed-off-by: Qu Wenruo <w...@suse.com>

Reviewed-by: Nikolay Borisov <nbori...@suse.com>

> ---
>  disk-io.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/disk-io.c b/disk-io.c
> index 238b1821be14..275ccff965e2 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -1585,6 +1585,16 @@ static int write_dev_supers(struct btrfs_fs_info 
> *fs_info,
>       u32 crc;
>       int i, ret;
>  
> +     /*
> +      * We need to write super block after all metadata written.
> +      * This is the equivalent of kernel pre-flush for FUA.
> +      */
> +     ret = fsync(device->fd);
> +     if (ret < 0) {
> +             error("failed to flush: %m");
> +             ret = -errno;
> +             goto write_err;
> +     }
>       if (fs_info->super_bytenr != BTRFS_SUPER_INFO_OFFSET) {
>               btrfs_set_super_bytenr(sb, fs_info->super_bytenr);
>               crc = ~(u32)0;
> @@ -1605,6 +1615,12 @@ static int write_dev_supers(struct btrfs_fs_info 
> *fs_info,
>                       error("failed to write super block: %m");
>                       goto write_err;
>               }
> +             ret = fsync(device->fd);
> +             if (ret < 0) {
> +                     error("failed to flush: %m");
> +                     ret = -errno;
> +                     goto write_err;
> +             }
>               return 0;
>       }
>  
> @@ -1632,6 +1648,18 @@ static int write_dev_supers(struct btrfs_fs_info 
> *fs_info,
>                       error("failed to write super block %i: %m", i);
>                       goto write_err;
>               }
> +             /*
> +              * Flush after the primary sb write, this is the equivalent of
> +              * kernel post-flush for FUA write.
> +              */
> +             if (i == 0) {
> +                     ret = fsync(device->fd);
> +                     if (ret < 0) {
> +                             error("failed to flush: %m");
> +                             ret = -errno;
> +                             goto write_err;
> +                     }
> +             }
>       }
>  
>       return 0;
> 

Reply via email to