On 2019/3/27 下午4:37, Nikolay Borisov wrote:
>
>
> On 27.03.19 г. 9:25 ч., Qu Wenruo wrote:
>> [BUG]
>> There are tons of reports of btrfs-progs screwing up the fs, the most
>> recent one is "btrfs check --clear-space-cache v1" triggered BUG_ON()
>> and then leaving the fs with transid mismatch problem.
>>
>> [CAUSE]
>> In kernel, we have block layer handing the flush work, even on devices
>> without FUA support (like most SATA device using default libata
>> settings), kernel handles FUA write by flushing the device, then normal
>> write, and finish it with another flush.
>>
>> The pre-flush, write, post-flush works pretty well to implement FUA
>> write.
>>
>> However in btrfs-progs we just use pwrite(), there is nothing keeping
>> the write order.
>
> This could be expanded to include the following information:
>
> 1. The device fs is opened with only O_RD or O_RDWR i.e we don't pass
> any special arguments when opening the fd.
In fact you can pass O_SYNC to solve the problem, but with possible
performance penalty.
In fact, the fsync() call here is completely compatible with any open
flag except O_RDONLY.
So I'm not sure why extra comment on this is needed.
>
> 2. We do not use fsync anywhere
Even we use, it won't cause any problem.
Here we only need to ensure one barrier for super block, if another part
of btrfs-progs wants an extra barrier, it's completely fine to do so.
So I don't see much point to explain both points.
>
>>
>> So even for basic v1 free space cache clearing, we have different vision
>> on the write sequence from kernel bio layer (by dm-log-writes) and user
>> space pwrite() calls.
>>
>> In btrfs-progs, with extra debug output in write_tree_block() and
>> write_dev_supers(), we can see btrfs-progs follows the right write
>> sequence:
>>
>> Opening filesystem to check...
>> Checking filesystem on /dev/mapper/log
>> UUID: 3feb3c8b-4eb3-42f3-8e9c-0af22dd58ecf
>> write tree block start=1708130304 gen=39
>> write tree block start=1708146688 gen=39
>> write tree block start=1708163072 gen=39
>> write super devid=1 gen=39
>> write tree block start=1708179456 gen=40
>> write tree block start=1708195840 gen=40
>> write super devid=1 gen=40
>> write tree block start=1708130304 gen=41
>> write tree block start=1708146688 gen=41
>> write tree block start=1708228608 gen=41
>> write super devid=1 gen=41
>> write tree block start=1708163072 gen=42
>> write tree block start=1708179456 gen=42
>> write super devid=1 gen=42
>> write tree block start=1708130304 gen=43
>> write tree block start=1708146688 gen=43
>> write super devid=1 gen=43
>> Free space cache cleared
>>
>> But from dm-log-writes, the bio sequence is a different story:
>>
>> replaying 1742: sector 131072, size 4096, flags 0(NONE)
>> replaying 1743: sector 128, size 4096, flags 0(NONE) <<< Only one sb write
>> replaying 1744: sector 2828480, size 4096, flags 0(NONE)
>> replaying 1745: sector 2828488, size 4096, flags 0(NONE)
>> replaying 1746: sector 2828496, size 4096, flags 0(NONE)
>> replaying 1787: sector 2304120, size 4096, flags 0(NONE)
>> ......
>> replaying 1790: sector 2304144, size 4096, flags 0(NONE)
>> replaying 1791: sector 2304152, size 4096, flags 0(NONE)
>> replaying 1792: sector 0, size 0, flags 8(MARK)
>>
>> During the free space cache clearing, we committed 3 transaction but
>> dm-log-write only caught one super block write.
>>
>> This means all the 3 writes were merged into the last super block write.
>> And the super block write was the 2nd write, before all tree block
>> writes, completely screwing up the metadata CoW protection.
>>
>> No wonder crashed btrfs-progs can make things worse.
>>
>> [FIX]
>> Fix this super serious problem by implementing pre and post flush for
>> the primary super block in btrfs-progs.
>>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>
> Reviewed-by: Nikolay Borisov <nbori...@suse.com>
>
>> ---
>> disk-io.c | 28 ++++++++++++++++++++++++++++
>> 1 file changed, 28 insertions(+)
>>
>> diff --git a/disk-io.c b/disk-io.c
>> index 238b1821be14..275ccff965e2 100644
>> --- a/disk-io.c
>> +++ b/disk-io.c
>> @@ -1585,6 +1585,16 @@ static int write_dev_supers(struct btrfs_fs_info
>> *fs_info,
>> u32 crc;
>> int i, ret;
>>
>> + /*
>> + * We need to write super block after all metadata written.
>> + * This is the equivalent of kernel pre-flush for FUA.
>> + */
>> + ret = fsync(device->fd);
>> + if (ret < 0) {
>> + error("failed to flush: %m");
>> + ret = -errno;
>> + goto write_err;
>> + }
>> if (fs_info->super_bytenr != BTRFS_SUPER_INFO_OFFSET) {
>> btrfs_set_super_bytenr(sb, fs_info->super_bytenr);
>> crc = ~(u32)0;
>> @@ -1605,6 +1615,12 @@ static int write_dev_supers(struct btrfs_fs_info
>> *fs_info,
>> error("failed to write super block: %m");
>> goto write_err;
>> }
>> + ret = fsync(device->fd);
>> + if (ret < 0) {
>> + error("failed to flush: %m");
>> + ret = -errno;
>> + goto write_err;
>> + }
>> return 0;
>> }
>>
>> @@ -1632,6 +1648,18 @@ static int write_dev_supers(struct btrfs_fs_info
>> *fs_info,
>> error("failed to write super block %i: %m", i);
>> goto write_err;
>> }
>> + /*
>> + * Flush after the primary sb write, this is the equivalent of
>> + * kernel post-flush for FUA write.
>> + */
>> + if (i == 0) {
>> + ret = fsync(device->fd);
>> + if (ret < 0) {
>> + error("failed to flush: %m");
>> + ret = -errno;
>> + goto write_err;
>> + }
>> + }
>> }
>>
>> return 0;
>>