On 2018-06-28 22:36, Adam Borowski wrote:
> On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
>> A reporter considers btrfs raid1 to have a major design flaw, namely
>> that it can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means that if one device goes missing and re-appears, even if its
>> generation no longer matches the rest of the device pool, btrfs does
>> nothing about it, but treats it as a normal, good device.
>>
>> At least let's detect such a generation mismatch and avoid mounting
>> the fs.
> 
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case.
> The vast majority of data is fine, and extents that were written while
> a device was missing will either be placed elsewhere (if the filesystem
> knew it was degraded), or the stale copy will fail its checksum on read
> and be recovered automatically (if the device was still falsely
> believed to be good at write time).

Yes, for a fs without any nodatasum usage, the behavior is indeed
overkill. But such an overkill sanity check can still be important, as
long as nodatasum is a provided feature.
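
Roughly what I have in mind for the detection part. This is a
simplified userspace sketch; the structures and messages are made up
for illustration, not the actual btrfs mount code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct device_info {
	const char *path;
	uint64_t generation;	/* generation in this device's superblock */
};

/*
 * Refuse the mount when any device's superblock generation lags
 * behind the highest generation seen in the pool.
 */
static bool generations_match(const struct device_info *devs, int ndevs)
{
	uint64_t latest = 0;
	int i;

	for (i = 0; i < ndevs; i++)
		if (devs[i].generation > latest)
			latest = devs[i].generation;

	for (i = 0; i < ndevs; i++) {
		if (devs[i].generation < latest) {
			fprintf(stderr,
				"device %s: generation %llu, expected %llu, refusing to mount\n",
				devs[i].path,
				(unsigned long long)devs[i].generation,
				(unsigned long long)latest);
			return false;
		}
	}
	return true;
}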

> 
> We don't have selective scrub yet, so resyncing such single-copy
> extents is costly, but 1. all will be fine if the data is read, and
> 2. it's possible to add such a smart resync in the future, far better
> than a write-intent bitmap can do.

Well, auto scrub for a device doesn't look that bad to me.
Since scrub is normally scheduled as routine maintenance work, it
shouldn't be super expensive.

We only need to teach btrfs to treat such a device as a kind of
degraded device. Then we can reuse most of the scrub routine to fix it.
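
Something like the following flow. Again just a sketch: the state
flags and scrub_device() helper are hypothetical, standing in for
whatever the real scrub code would provide:

#include <stdint.h>

#define DEV_STATE_STALE    (1 << 0)	/* generation behind the pool */
#define DEV_STATE_DEGRADED (1 << 1)	/* reads served from good copies */

struct device {
	uint64_t generation;
	unsigned int state;
};

/*
 * Hypothetical scrub entry point: rewrites every extent on @dev from
 * a good mirror, verifying csums along the way.
 */
extern int scrub_device(struct device *dev);

static int handle_stale_device(struct device *dev, uint64_t pool_gen)
{
	if (dev->generation >= pool_gen)
		return 0;	/* device is up to date, nothing to do */

	/*
	 * Until the scrub finishes, serve reads from the other mirrors,
	 * exactly as if the device had been missing.
	 */
	dev->state |= DEV_STATE_STALE | DEV_STATE_DEGRADED;

	return scrub_device(dev);
}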

> 
> To do the latter, we can note the last generation the filesystem was known
> to be fully coherent (i.e., all devices were successfully flushed with no
> mysterious write failures), then run selective scrub (perhaps even
> automatically) when the filesystem is no longer degraded.  There's some
> extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
> a single number would help even there.
> 
> But even currently, without the above not-yet-written recovery, it's
> reasonably safe to continue without scrub -- it's a case of running
> partially degraded when the bad copy is already known to be suspicious.
> 
> For no-nodatacow data and metadata, that is.
> 
>> Currently there is no automatic rebuild, which means that if users see
>> the device generation mismatch error message, they can only mount the
>> fs using the "device" and "degraded" mount options (if possible), then
>> replace the offending device to manually "rebuild" the fs.
> 
> As nodatacow already means "I don't care about this data, or have another
> way of recovering it", I don't quite get why we would drop existing
> auto-recovery for a common transient failure case.

Yep, exactly my understanding of nodatasum behavior.

However, in the real world, btrfs is the only *linux* fs that supports
data csum; the most widely used filesystems like ext4 and xfs don't.

As the discussion about the behavior goes on, I find that LVM/mdraid +
ext4/xfs can do better missing-device management than btrfs with
nodatasum, which means we should at least provide what LVM/mdraid can
provide.

> 
> If you're paranoid, perhaps some bit "this filesystem has some nodatacow
> data on it" could warrant such a block, but it would still need to be
> overridable _without_ a need for replace.  There's also the problem that
> systemd marks its journal nodatacow (despite it having infamously bad
> handling of failures!), and too many distributions infect their default
> installs with systemd, meaning such a bit would be on in most cases.
> 
> But why would I put all my other data at risk, just because there's a
> nodatacow file?  There's a big difference between scrubbing when only a few
> transactions worth of data is suspicious and completely throwing away a
> mostly-good replica to replace it from the now fully degraded copy.
> 
>> I totally understand that a generation-based solution can't handle the
>> split-brain case (where 2 RAID1 devices get mounted degraded
>> separately) at all, but at least let's handle what we can.
> 
> Generation can do well, at least unless both devices were mounted
> elsewhere and got the exact same number of transactions; the problem is
> that nodatacow doesn't bump the generation number.

Generation is never a problem, as any metadata change will still bump
the generation.
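
(You can watch this directly with something like:

  # btrfs inspect-internal dump-super /dev/sdb | grep -w generation

where the superblock generation climbs after every committed
transaction; /dev/sdb is just a placeholder device here.)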

> 
>> The best way to solve the problem is to make btrfs treat such
>> lower-generation devices as a kind of missing device, and queue an
>> automatic scrub for that device.
>> But that's a lot of extra work; at least let's start by detecting the
>> problem first.
> 
> I wonder if there's some way to treat problematic nodatacow files as
> degraded only?
> 
> Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
> need to pretty much copy all of md's logic, with a write-intent bitmap or an
> equivalent.

At least, let's look at what LVM/md is doing; trying to learn something
from it is never a bad idea.
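
For reference, the core trick md uses for exactly this case is the
write-intent bitmap, which boils down to roughly the following. This is
a minimal userspace model; the chunk size and layout are invented for
illustration and look nothing like md's real on-disk format:

#include <stdint.h>

#define CHUNK_SHIFT	20	/* track dirtiness per 1 MiB chunk */
#define NCHUNKS		1024	/* covers a 1 GiB toy device */

static uint8_t bitmap[NCHUNKS / 8];

/*
 * Write path: persist the dirty bit *before* touching the data, and
 * clear it only after all mirrors have acknowledged the write.
 */
static void set_dirty(uint64_t offset)
{
	uint64_t chunk = offset >> CHUNK_SHIFT;

	bitmap[chunk / 8] |= 1 << (chunk % 8);
}

static void clear_dirty(uint64_t offset)
{
	uint64_t chunk = offset >> CHUNK_SHIFT;

	bitmap[chunk / 8] &= ~(1 << (chunk % 8));
}

/*
 * After a crash or a re-appearing device, only chunks still marked
 * dirty need a resync, instead of the whole device.
 */
static int chunk_is_dirty(uint64_t offset)
{
	uint64_t chunk = offset >> CHUNK_SHIFT;

	return bitmap[chunk / 8] & (1 << (chunk % 8));
}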

Thanks,
Qu

> 
> 
> Meow!
> 
