On 2018-06-28 22:36, Adam Borowski wrote:
> On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
>> There is a reporter who considers btrfs raid1 to have a major design
>> flaw, in that it can't handle nodatasum files.
>>
>> Despite his incorrect expectation, btrfs indeed doesn't handle device
>> generation mismatch well.
>>
>> This means that if one device goes missing and re-appears, even if
>> its generation no longer matches the rest of the device pool, btrfs
>> does nothing about it, but treats it as a normal good device.
>>
>> At least let's detect such generation mismatch and avoid mounting the
>> fs.
>
> Uhm, that'd be a nasty regression for the regular (no-nodatacow) case.
> The vast majority of data is fine, and extents that have been written
> to while a device is missing will either be placed elsewhere (if the
> filesystem knew it was degraded) or have one of the copies read, the
> wrong checksum noticed, and the data automatically recovered (if the
> device was still falsely believed to be good at write time).
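
For context, the check in the patch boils down to comparing the
generation field stored in each device's super block.  A rough
user-space sketch of the same comparison (illustration only -- the real
check lives in the kernel's device scanning code; the only on-disk
facts used are that the primary super block sits at 64KiB, with the
magic "_BHRfS_M" at offset 64 and the le64 generation at offset 72):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BTRFS_SUPER_OFFSET	65536
#define BTRFS_MAGIC		"_BHRfS_M"

static int read_sb_generation(const char *dev, uint64_t *gen)
{
	uint8_t sb[4096];
	int fd = open(dev, O_RDONLY);

	if (fd < 0)
		return -1;
	if (pread(fd, sb, sizeof(sb), BTRFS_SUPER_OFFSET) != sizeof(sb)) {
		close(fd);
		return -1;
	}
	close(fd);
	if (memcmp(sb + 64, BTRFS_MAGIC, 8))
		return -1;		/* not a btrfs device */
	memcpy(gen, sb + 72, 8);	/* le64; assumes little-endian host */
	return 0;
}

int main(int argc, char **argv)
{
	uint64_t gen, max_gen = 0;
	int i;

	/* First pass: find the newest generation among all devices. */
	for (i = 1; i < argc; i++) {
		if (read_sb_generation(argv[i], &gen)) {
			fprintf(stderr, "%s: cannot read super block\n",
				argv[i]);
			continue;
		}
		if (gen > max_gen)
			max_gen = gen;
	}
	/* Second pass: anything older than that missed transactions. */
	for (i = 1; i < argc; i++)
		if (!read_sb_generation(argv[i], &gen) && gen < max_gen)
			printf("%s: stale (gen %llu < %llu)\n", argv[i],
			       (unsigned long long)gen,
			       (unsigned long long)max_gen);
	return 0;
}
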
Yes, for a fs without any nodatasum usage, this behavior is indeed
overkill.
But sometimes such an overkill sanity check is really important, as
long as nodatasum is a provided feature.

> We currently don't have selective scrub yet, so resyncing such
> single-copy extents is costly, but 1. all will be fine if the data is
> read, 2. it's possible to add such a smart resync in the future, far
> better than a write-intent bitmap can do.

Well, an automatic scrub of such a device doesn't look that bad to me.
Since scrub is normally scheduled as routine maintenance work, it
should not be super expensive.

We only need to teach btrfs to treat such a device as a kind of
degraded device; then we can reuse most of the scrub routine to fix it.

> To do the latter, we can note the last generation the filesystem was
> known to be fully coherent (ie, all devices were successfully flushed
> with no mysterious write failures), then run selective scrub (perhaps
> even automatically) when the filesystem is no longer degraded.
> There's some extra complexity with 3- or 4-way RAID (multiple levels
> of degradation) but a single number would help even there.
>
> But even currently, without the above not-yet-written recovery, it's
> reasonably safe to continue without scrub -- it's a case of running
> partially degraded when the bad copy is already known to be
> suspicious.
>
> For no-nodatacow data and metadata, that is.
>
>> Currently there is no automatic rebuild yet, which means that if
>> users see a device generation mismatch error message, they can only
>> mount the fs using the "device" and "degraded" mount options (if
>> possible), then replace the offending device to manually "rebuild"
>> the fs.
>
> As nodatacow already means "I don't care about this data, or have
> another way of recovering it", I don't quite get why we would drop
> existing auto-recovery for a common transient failure case.

Yep, exactly my understanding of the nodatasum behavior.

However, in the real world btrfs is the only *linux* fs that supports
data csum, while the most widely used filesystems like ext4 and xfs
don't support data csum at all.

As the discussion about the behavior goes on, I find that LVM/mdraid +
ext4/xfs can do better missing-device management than btrfs with
nodatasum, which means we need to provide at least what LVM/mdraid
provides.

> If you're paranoid, perhaps some bit "this filesystem has some
> nodatacow data on it" could warrant such a block, but it would still
> need to be overridable _without_ a need for replace.  There's also the
> problem that systemd marks its journal nodatacow (despite it having
> infamously bad handling of failures!), and too many distributions
> infect their default installs with systemd, meaning such a bit would
> be on in most cases.
>
> But why would I put all my other data at risk, just because there's a
> nodatacow file?  There's a big difference between scrubbing when only
> a few transactions' worth of data is suspicious and completely
> throwing away a mostly-good replica to replace it from the now fully
> degraded copy.
>
>> I totally understand that a generation based solution can't handle
>> the split-brain case (where 2 RAID1 devices get mounted degraded
>> separately) at all, but at least let's handle what we can.
>
> Generation can do well at least unless both devices were mounted
> elsewhere and got the exact same number of transactions; the problem
> is that nodatacow doesn't bump the generation number.

Generation is never a problem here, as any metadata change will still
bump the generation.
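
To make Adam's "last fully coherent generation" idea above a bit more
concrete, here is a toy model of the mount-time decision (every name in
it is invented for illustration; none of this is existing btrfs code):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct toy_device {
	const char *name;
	uint64_t gen;		/* generation in this device's super block */
	bool suspect;		/* needs a (selective) scrub */
};

/*
 * On mount: any device whose generation is behind the newest one
 * missed transactions while absent.  Instead of refusing the mount,
 * mark it suspect and remember from which generation a selective
 * scrub would have to start.
 */
static uint64_t check_coherency(struct toy_device *devs, int ndevs)
{
	uint64_t newest = 0, scrub_from = UINT64_MAX;
	int i;

	for (i = 0; i < ndevs; i++)
		if (devs[i].gen > newest)
			newest = devs[i].gen;

	for (i = 0; i < ndevs; i++) {
		if (devs[i].gen < newest) {
			devs[i].suspect = true;
			if (devs[i].gen < scrub_from)
				scrub_from = devs[i].gen;
			printf("%s: stale (gen %llu < %llu), queue scrub\n",
			       devs[i].name,
			       (unsigned long long)devs[i].gen,
			       (unsigned long long)newest);
		}
	}
	/* A selective scrub would only visit extents newer than this. */
	return scrub_from == UINT64_MAX ? newest : scrub_from;
}

int main(void)
{
	struct toy_device devs[] = {
		{ "/dev/sda", 1024, false },
		{ "/dev/sdb",  987, false },	/* was missing for a while */
	};

	printf("selective scrub from generation %llu\n",
	       (unsigned long long)check_coherency(devs, 2));
	return 0;
}
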
>
>> The best way to solve the problem is to make btrfs treat such
>> lower-generation devices as some kind of missing device, and queue an
>> automatic scrub for that device.
>> But that's a lot of extra work; at least let's start from detecting
>> the problem first.
>
> I wonder if there's some way to treat problematic nodatacow files as
> degraded only?
>
> Nodatacow misses most of btrfs' mechanisms, thus to get it done right
> you'd need to pretty much copy all of md's logic, with a write-intent
> bitmap or an equivalent.

At least, taking a look at what LVM/md is doing and trying to learn
something from it is never a bad idea (see the sketch of the md-style
bitmap logic at the end of this mail).

Thanks,
Qu

>
> Meow!
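
P.S. Here is the promised sketch of the md-style write-intent bitmap
logic, again just a toy model and not md's actual implementation: set a
persistent dirty bit before writing to a chunk, clear it once the write
has reached all mirrors, and after a crash or a re-appearing device
resync only the dirty chunks.

#include <stdio.h>
#include <stdint.h>

#define CHUNK_SHIFT	16			/* one bit per 64KiB chunk */
#define NCHUNKS		1024
static uint8_t bitmap[NCHUNKS / 8];		/* persisted on disk in md */

static void bit_set(uint64_t c)   { bitmap[c / 8] |=  1 << (c % 8); }
static void bit_clear(uint64_t c) { bitmap[c / 8] &= ~(1 << (c % 8)); }
static int  bit_test(uint64_t c)  { return bitmap[c / 8] & (1 << (c % 8)); }

/* Before issuing a write: persist the dirty bit first. */
static void write_begin(uint64_t off)
{
	bit_set(off >> CHUNK_SHIFT);
}

/* After the write hit *all* mirrors and was flushed: clear the bit. */
static void write_done(uint64_t off)
{
	bit_clear(off >> CHUNK_SHIFT);
}

/* After a crash or a re-appearing device: resync only dirty chunks. */
static void resync(void)
{
	uint64_t c;

	for (c = 0; c < NCHUNKS; c++)
		if (bit_test(c))
			printf("resync chunk %llu (bytes %llu..%llu)\n",
			       (unsigned long long)c,
			       (unsigned long long)(c << CHUNK_SHIFT),
			       (unsigned long long)(((c + 1) << CHUNK_SHIFT) - 1));
}

int main(void)
{
	write_begin(0x30000);	/* write in flight when a device dropped */
	write_begin(0x90000);
	write_done(0x90000);	/* this one completed on all mirrors */
	resync();		/* only the chunk around 0x30000 is resynced */
	return 0;
}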