On 2019-09-25T15:05:44, Chris Murphy wrote:
> On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <m...@pallissard.net> wrote:
> > On 2019-09-25T13:08:34, Chris Murphy wrote:
> > > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <m...@pallissard.net> wrote:
> > > >
> > > > Version:
> > > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> > >
> > > You need to upgrade to arch kernel 5.2.14 or newer (they backported the
> > > fix first appearing in stable 5.2.15). Or you need to downgrade to 5.1
> > > series.
> > > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u
> > >
> > > That's a nasty bug. I don't offhand see evidence that you've hit this
> > > bug. But I'm not certain. So first thing should be to use a different
> > > kernel.
> >
> > Interesting, I'll go ahead with a kernel upgrade as that easy enough.
> > However, that looks like it's related to a stacktrace regarding a hung
> > process. Which is not the original problem I had.
> > Based on the output in my previous email, I've been working under the
> > assumption that there is a problem on-disk. Is that not correct?
>
> That bug does cause filesystem corruption that is not repairable.
> Whether you have that problem or a different problem, I'm not sure.
> But it's best to avoid combining problems.
>
> The file system mounts rw now? Or still only mounts ro?

It mounts RW, but I have yet to attempt an actual write.

> I think most of the errors reported by btrfs check, if they still exist after
> doing a scrub, should be repaired by 'btrfs check --repair' but I don't
> advise that until later. I'm not a developer, maybe Qu can offer some advice
> on those errors.
>
> > > Next, anytime there is a crash or power failure with Btrfs raid56, you need
> > > to do a complete scrub of the volume. Obviously will take time but that's
> > > what needs to be done first.
> >
> > I'm using raid 10, not 5 or 6.
>
> Same advice, but it's not as important to raid10 because it doesn't have the
> write hole problem.
>
> > > OK actually, before the scrub you need to confirm that each drive's SCT
> > > ERC time is *less* than the kernel's SCSI command timer. e.g.
> >
> > I gather that I should probably do this before any scrub, be it raid 5, 6,
> > or 10. But, is a scrub the operation I should attempt on this raid 10
> > array to repair the specific errors mentioned in my previous email?
>
> Definitely deal with the timing issue first. If by chance there are bad
> sectors on any of the drives, they must be properly reported by the drive
> with a discrete read error in order for Btrfs to do a proper fixup. If the
> times are mismatched, then Linux can get tired waiting, and do a link reset
> on the drive before the read error happens. And now the whole command queue
> is lost and the problem isn't fixed.

Good to know, that seems like a critical piece of information. A few searches
turned up this page, https://wiki.debian.org/Btrfs#FAQ. Should this be noted
on the 'gotchas' or 'getting started' page as well? I'd be happy to make edits
should the powers that be allow it.

> There are myriad errors and the advice I'm giving to scrub is a safe first
> step to make sure the storage stack is sane - or at least we know where the
> simpler problems are. And then move to the less simple ones that have higher
> risk. It also changes the volume the least.
> Everything else, like balance and chunk recover and btrfs check --repair -
> all make substantial changes to the file system and have higher risk of
> making things worse.

This sounds sensible.

> In theory if the storage stack does exactly what Btrfs says, then at worst
> you should lose some data, but the file system itself should be consistent.
> And that includes power failures. The fact there's problems reported suggests
> a bug somewhere - it could be Btrfs, it could be device mapper, it could be
> controller or drive firmware.

I'll go ahead with the kernel upgrade and make sure the timing issues are
squared away. Then I'll kick off a scrub.

I'll report back when the scrub is complete or something interesting happens,
whichever comes first.

Thanks again.

Matt Pallissard
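
For anyone following along, the timing check discussed above (each drive's SCT
ERC time must be less than the kernel's SCSI command timer) can be sketched
roughly as below. This is only an illustration, not something posted in the
thread. It assumes that `smartctl -l scterc` prints "Read:" and "Write:" lines
with values in tenths of a second, and that the kernel's per-device timeout,
in seconds, is readable from /sys/block/<dev>/device/timeout. The function
names are hypothetical.

```python
import re
import subprocess
from pathlib import Path


def parse_scterc(text):
    """Extract SCT ERC read/write values (deciseconds) from smartctl text.

    Assumes lines such as 'Read:     70 (7.0 seconds)' or 'Read: Disabled'.
    A value of None means ERC is disabled or the line was not found.
    """
    result = {}
    for key in ("Read", "Write"):
        match = re.search(rf"^\s*{key}:\s+(\d+|Disabled)", text, re.MULTILINE)
        if match is None or match.group(1) == "Disabled":
            result[key.lower()] = None
        else:
            result[key.lower()] = int(match.group(1))
    return result


def erc_below_timeout(erc_deciseconds, timeout_seconds):
    """True only if ERC is enabled and strictly shorter than the kernel timer."""
    if erc_deciseconds is None:
        # With ERC disabled the drive may retry for an unbounded time,
        # so treat it as not safely below the kernel timeout.
        return False
    return erc_deciseconds / 10 < timeout_seconds


def check_device(name):
    """Compare one drive (e.g. 'sda') against its kernel command timer.

    Needs root and smartmontools; both paths here are assumptions.
    """
    timeout = int(Path(f"/sys/block/{name}/device/timeout").read_text())
    output = subprocess.run(
        ["smartctl", "-l", "scterc", f"/dev/{name}"],
        capture_output=True, text=True,
    ).stdout
    erc = parse_scterc(output)
    return all(erc_below_timeout(value, timeout) for value in erc.values())
```

Calling check_device("sda") for each member of the array before starting the
scrub would flag any drive whose ERC is disabled or not shorter than the
kernel timer.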