On 2019/3/27 下午4:37, Nikolay Borisov wrote:
>
>
> On 27.03.19 г. 9:25 ч., Qu Wenruo wrote:
>> [BUG]
>> There are tons of reports of btrfs-progs screwing up the fs, the most
>> recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
>> and then leaving the fs with transid mismatch problem.
>>
>> [CAUSE]
>> In kernel, we have block layer handing the flush work, even on devices
>> without FUA support (like most SATA device using default libata
>> settings), kernel handles FUA write by flushing the device, then normal
>> write, and finish it with another flush.
>>
>> The pre-flush, write, post-flush works pretty well to implement FUA
>> write.
>>
>> However in btrfs-progs we just use pwrite(), there is nothing keeping
>> the write order.
>
> This could be expanded to include the following information:
>
> 1. The device fs is opened with only O_RD or O_RDWR i.e we don't pass
> any special arguments when opening the fd.

In fact you can pass O_SYNC to solve the problem, but with possible
performance penalty.

In fact, the fsync() call here is completely compatible with any open
flag except O_RDONLY.

So I'm not sure why extra comment on this is needed.

>
> 2. We do not use fsync anywhere

Even we use, it won't cause any problem.

Here we only need to ensure one barrier for super block, if another part
of btrfs-progs wants an extra barrier, it's completely fine to do so.

So I don't see much point to explain both points.

>
>>
>> So even for basic v1 free space cache clearing, we have different vision
>> on the write sequence from kernel bio layer (by dm-log-writes) and user
>> space pwrite() calls.
>>
>> In btrfs-progs, with extra debug output in write_tree_block() and
>> write_dev_supers(), we can see btrfs-progs follows the right write
>> sequence:
>>
>>   Opening filesystem to check...
>>   Checking filesystem on /dev/mapper/log
>>   UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
>>   write tree block start=1708130304 gen=39
>>   write tree block start=1708146688 gen=39
>>   write tree block start=1708163072 gen=39
>>   write super devid=1 gen=39
>>   write tree block start=1708179456 gen=40
>>   write tree block start=1708195840 gen=40
>>   write super devid=1 gen=40
>>   write tree block start=1708130304 gen=41
>>   write tree block start=1708146688 gen=41
>>   write tree block start=1708228608 gen=41
>>   write super devid=1 gen=41
>>   write tree block start=1708163072 gen=42
>>   write tree block start=1708179456 gen=42
>>   write super devid=1 gen=42
>>   write tree block start=1708130304 gen=43
>>   write tree block start=1708146688 gen=43
>>   write super devid=1 gen=43
>>   Free space cache cleared
>>
>> But from dm-log-writes, the bio sequence is a different story:
>>
>>   replaying 1742: sector 131072, size 4096, flags 0(NONE)
>>   replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
>>   replaying 1744: sector 2828480, size 4096, flags 0(NONE)
>>   replaying 1745: sector 2828488, size 4096, flags 0(NONE)
>>   replaying 1746: sector 2828496, size 4096, flags 0(NONE)
>>   replaying 1787: sector 2304120, size 4096, flags 0(NONE)
>>   ......
>>   replaying 1790: sector 2304144, size 4096, flags 0(NONE)
>>   replaying 1791: sector 2304152, size 4096, flags 0(NONE)
>>   replaying 1792: sector 0, size 0, flags 8(MARK)
>>
>> During the free space cache clearing, we committed 3 transaction but
>> dm-log-write only caught one super block write.
>>
>> This means all the 3 writes were merged into the last super block write.
>> And the super block write was the 2nd write, before all tree block
>> writes, completely screwing up the metadata CoW protection.
>>
>> No wonder crashed btrfs-progs can make things worse.
>>
>> [FIX]
>> Fix this super serious problem by implementing pre and post flush for
>> the primary super block in btrfs-progs.
>>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>
> Reviewed-by: Nikolay Borisov <nbori...@suse.com>
>
>> ---
>>  disk-io.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/disk-io.c b/disk-io.c
>> index 238b1821be14..275ccff965e2 100644
>> --- a/disk-io.c
>> +++ b/disk-io.c
>> @@ -1585,6 +1585,16 @@ static int write_dev_supers(struct btrfs_fs_info 
>> *fs_info,
>>      u32 crc;
>>      int i, ret;
>>
>> +    /*
>> +     * We need to write super block after all metadata written.
>> +     * This is the equivalent of kernel pre-flush for FUA.
>> +     */
>> +    ret = fsync(device->fd);
>> +    if (ret < 0) {
>> +            error("failed to flush: %m");
>> +            ret = -errno;
>> +            goto write_err;
>> +    }
>>      if (fs_info->super_bytenr != BTRFS_SUPER_INFO_OFFSET) {
>>              btrfs_set_super_bytenr(sb, fs_info->super_bytenr);
>>              crc = ~(u32)0;
>> @@ -1605,6 +1615,12 @@ static int write_dev_supers(struct btrfs_fs_info 
>> *fs_info,
>>                      error("failed to write super block: %m");
>>                      goto write_err;
>>              }
>> +            ret = fsync(device->fd);
>> +            if (ret < 0) {
>> +                    error("failed to flush: %m");
>> +                    ret = -errno;
>> +                    goto write_err;
>> +            }
>>              return 0;
>>      }
>>
>> @@ -1632,6 +1648,18 @@ static int write_dev_supers(struct btrfs_fs_info 
>> *fs_info,
>>                      error("failed to write super block %i: %m", i);
>>                      goto write_err;
>>              }
>> +            /*
>> +             * Flush after the primary sb write, this is the equivalent of
>> +             * kernel post-flush for FUA write.
>> +             */
>> +            if (i == 0) {
>> +                    ret = fsync(device->fd);
>> +                    if (ret < 0) {
>> +                            error("failed to flush: %m");
>> +                            ret = -errno;
>> +                            goto write_err;
>> +                    }
>> +            }
>>      }
>>
>>      return 0;
>>

Reply via email to