It would be useful to have the ability to scrub only the metadata. In many
cases the data is so large that a full scrub is not feasible: on my "little"
34 TB test system a full scrub takes many hours, and the IOPS saturate the
disks to the point that the volume is unusable due to the high latencies.
Ideally there would also be a way to rate-limit the scrub I/O so that it can
run in the background without impacting the normal workload.
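
In the meantime, the closest workaround I know of is to start the scrub in
the idle I/O priority class via the documented -B and -c options of
"btrfs scrub start". A minimal sketch of what I mean (Python; the mount
point is made up, and whether the kernel-side scrub workers fully honour the
ioprio hint may depend on the kernel version, so treat this as an
illustration rather than a guarantee):

    #!/usr/bin/env python3
    # Sketch: kick off a btrfs scrub in the idle I/O priority class so it
    # yields to the normal workload. Uses the documented -B (stay in the
    # foreground, print statistics at the end) and -c (ioprio class,
    # 3 = idle) options of "btrfs scrub start". The mount point below is
    # hypothetical.
    import subprocess
    import sys

    MOUNTPOINT = "/mnt/bigvolume"   # hypothetical 34 TB volume

    result = subprocess.run(
        ["btrfs", "scrub", "start", "-B", "-c", "3", MOUNTPOINT],
        capture_output=True, text=True)

    print(result.stdout, end="")
    if result.returncode != 0:
        print(result.stderr, file=sys.stderr, end="")
        sys.exit(result.returncode)

That only deprioritises the scrub, though; it does nothing about scrubbing
only metadata, or only part of the filesystem, which is really what I'm
asking for.
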
On Fri, 18 Oct 2019 at 21:38, Chris Murphy <li...@colorremedies.com> wrote:
>
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonl...@gmail.com>
> wrote:
> >
> > It would be interesting to know the pros and cons of this setup that
> > you are suggesting vs zfs.
> > +zfs detects and corrects bitrot (
> > http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> > +zfs has working raid56
> > -modules out of kernel for license incompatibilities (a big minus)
> >
> > BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> > to find any conclusive doc about it right now)
>
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.
>
> > I'm one of those that is waiting for the write hole bug to be fixed in
> > order to use raid5 on my home setup. It's a shame it's taking so long.
>
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
>
> Further, the write hole means a) parity is corrupt or stale compared
> to the data stripe elements, which is caused by a crash or powerloss
> during writes, and b) subsequently there is a missing device or bad
> sector in the same stripe as the corrupt/stale parity stripe element.
> The effect of b) is that reconstruction from parity is necessary, and
> the effect of a) is that it's reconstructed incorrectly, thus
> corruption. But Btrfs detects this corruption, whether it's metadata
> or data. The corruption isn't propagated in any case. But it makes the
> filesystem fragile if this happens with metadata. Any parity stripe
> element staleness likely results in significantly bad reconstruction
> in this case, and just can't be worked around; even btrfs check
> probably can't fix it. If the write hole problem happens with a data
> block group, then EIO. But the good news is that this isn't going to
> result in silent data or filesystem metadata corruption. For sure
> you'll know about it.
>
> This is why a scrub after a crash or powerloss with raid56 is
> important, while the array is still whole (not degraded). The two
> problems with that are:
>
> a) the scrub isn't initiated automatically, nor is it obvious to the
> user that it's necessary
> b) the scrub can take a long time; Btrfs has no partial scrubbing.
>
> Whereas mdadm arrays offer a write intent bitmap to know what blocks
> to partially scrub, and to trigger it automatically following a crash
> or powerloss.
>
> It seems Btrfs already has enough on-disk metadata to infer a
> functional equivalent to the write intent bitmap, via transid. Just
> scrub the last ~50 generations the next time it's mounted. Either do
> this every time a Btrfs raid56 is mounted, or create some flag that
> allows Btrfs to know the filesystem was not cleanly shut down. It's
> possible 50 generations could be a lot of data, but since it's an
> online scrub triggered after mount, it wouldn't add much to mount
> times. I'm also picking 50 generations arbitrarily; there's no basis
> for that number.
>
> The above doesn't cover the case where a partial stripe write (which
> leads to the write hole problem) coincides with a crash or powerloss
> and, at the same time, one or more device failures. In that case
> there's no time for a partial scrub to fix the problem leading to the
> write hole. So even if the corruption is detected, it's too late to
> fix it. But at least an automatic partial scrub, even degraded, would
> mean the user is flagged about the uncorrectable problem before they
> get too far along.
>
>
> --
> Chris Murphy
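
The transid idea above is appealing. There is no interface today to scrub
only the last N generations, but the bookkeeping half is easy to sketch.
Purely as an illustration of that proposal, not an existing feature: the
current filesystem generation can be read with
"btrfs inspect-internal dump-super" (which does exist), while the device
path and state file below are made up, and the "scrub these generations"
step is exactly the piece the kernel would have to provide:

    #!/usr/bin/env python3
    # Illustration of the "write intent bitmap via transid" idea above.
    # There is no kernel interface to scrub a generation range today; this
    # only shows the bookkeeping such a feature would need.
    import re
    import subprocess

    DEVICE = "/dev/sdb1"                          # hypothetical raid56 member
    STATE_FILE = "/var/lib/btrfs-last-clean-gen"  # hypothetical state file
    WINDOW = 50                                   # "~50 generations", arbitrary

    def current_generation(device):
        # Read the superblock generation via "btrfs inspect-internal dump-super".
        out = subprocess.run(
            ["btrfs", "inspect-internal", "dump-super", device],
            capture_output=True, text=True, check=True).stdout
        match = re.search(r"^generation\s+(\d+)", out, re.MULTILINE)
        if match is None:
            raise RuntimeError("could not parse generation from dump-super")
        return int(match.group(1))

    def record_clean_generation(device):
        # At clean unmount/shutdown: remember the last known-good generation.
        with open(STATE_FILE, "w") as f:
            f.write(str(current_generation(device)))

    def generations_to_scrub(device):
        # After an unclean shutdown: the range a partial scrub would cover.
        try:
            with open(STATE_FILE) as f:
                last_clean = int(f.read().strip())
        except FileNotFoundError:
            last_clean = current_generation(device) - WINDOW  # fall back to window
        return range(last_clean + 1, current_generation(device) + 1)

    if __name__ == "__main__":
        print(list(generations_to_scrub(DEVICE)))

Mapping those generations back to the block groups and stripes that actually
need scrubbing is of course the hard part, and that's the piece that would
have to live in the kernel.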