On 2019-09-27T17:01:27, Pallissard, Matthew wrote:
> On 2019-09-25T14:32:31, Pallissard, Matthew wrote:
> > On 2019-09-25T15:05:44, Chris Murphy wrote:
> > > On Wed, Sep 25, 2019 at 1:34 PM Pallissard, Matthew <m...@pallissard.net> wrote:
> > > > On 2019-09-25T13:08:34, Chris Murphy wrote:
> > > > > On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <m...@pallissard.net> wrote:
> > > > > >
> > > > > > Version:
> > > > > > Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 GNU/Linux
> > > > >
> > > > > You need to upgrade to Arch kernel 5.2.14 or newer (they backported
> > > > > the fix first appearing in stable 5.2.15). Or you need to downgrade
> > > > > to the 5.1 series.
> > > > > https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u
> > > > >
> > > > > That's a nasty bug. I don't offhand see evidence that you've hit
> > > > > this bug, but I'm not certain. So the first thing should be to use
> > > > > a different kernel.
> > > >
> > > > Interesting, I'll go ahead with a kernel upgrade as that's easy
> > > > enough. However, that looks like it's related to a stack trace for a
> > > > hung process, which is not the original problem I had.
> > > >
> > > > Based on the output in my previous email, I've been working under
> > > > the assumption that there is a problem on-disk. Is that not correct?
> > >
> > > That bug does cause filesystem corruption that is not repairable.
> > > Whether you have that problem or a different problem, I'm not sure.
> > > But it's best to avoid combining problems.
> > >
> > > The file system mounts rw now? Or still only mounts ro?
> >
> > It mounts RW, but I have yet to attempt an actual write.
> >
> > > I think most of the errors reported by btrfs check, if they still
> > > exist after doing a scrub, should be repaired by 'btrfs check
> > > --repair', but I don't advise that until later. I'm not a developer;
> > > maybe Qu can offer some advice on those errors.
> > >
> > > > > Next, anytime there is a crash or power failure with Btrfs
> > > > > raid56, you need to do a complete scrub of the volume. It will
> > > > > obviously take time, but that's what needs to be done first.
> > > >
> > > > I'm using raid 10, not 5 or 6.
> > >
> > > Same advice, but it's not as important for raid10 because it doesn't
> > > have the write hole problem.
> > >
> > > > > OK actually, before the scrub you need to confirm that each
> > > > > drive's SCT ERC time is *less* than the kernel's SCSI command
> > > > > timer. e.g.
> > > >
> > > > I gather that I should probably do this before any scrub, be it
> > > > raid 5, 6, or 10. But is a scrub the operation I should attempt on
> > > > this raid 10 array to repair the specific errors mentioned in my
> > > > previous email?
> > >
> > > Definitely deal with the timing issue first. If by chance there are
> > > bad sectors on any of the drives, they must be properly reported by
> > > the drive with a discrete read error in order for Btrfs to do a
> > > proper fixup. If the times are mismatched, then Linux can get tired
> > > of waiting and do a link reset on the drive before the read error
> > > happens. And then the whole command queue is lost and the problem
> > > isn't fixed.
> >
> > Good to know; that seems like a critical piece of information. A few
> > searches turned up this page: https://wiki.debian.org/Btrfs#FAQ
> >
> > Should this be noted on the 'gotchas' or 'getting started' page as
> > well? I'd be happy to make edits should the powers that be allow it.
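> >
> > For anyone who finds this thread later, checking and aligning those
> > timers looks roughly like this (sdX is a placeholder for each member
> > drive; smartctl takes deciseconds, the kernel timer is in seconds):
> >
> >     # Per-drive SCT ERC: read the current setting, then cap
> >     # read/write error recovery at 7.0 seconds
> >     smartctl -l scterc /dev/sdX
> >     smartctl -l scterc,70,70 /dev/sdX
> >
> >     # Kernel SCSI command timer for the same drive, 30s by default;
> >     # if the drive doesn't support SCT ERC, raise this well above the
> >     # drive's worst-case recovery time instead
> >     cat /sys/block/sdX/device/timeout
> >     echo 180 > /sys/block/sdX/device/timeout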
> > > There are myriad errors, and the advice I'm giving to scrub is a
> > > safe first step to make sure the storage stack is sane - or at least
> > > so we know where the simpler problems are. Then move to the less
> > > simple ones that have higher risk. It also changes the volume the
> > > least. Everything else, like balance, chunk recover, and btrfs check
> > > --repair, makes substantial changes to the file system and has a
> > > higher risk of making things worse.
> >
> > This sounds sensible.
> >
> > > In theory, if the storage stack does exactly what Btrfs says, then
> > > at worst you should lose some data, but the file system itself
> > > should be consistent. And that includes power failures. The fact
> > > that there are problems reported suggests a bug somewhere - it could
> > > be Btrfs, it could be device mapper, it could be controller or drive
> > > firmware.
> >
> > I'll go ahead with the kernel upgrade and make sure the timing issues
> > are squared away. Then I'll kick off a scrub.
> >
> > I'll report back when the scrub is complete or something interesting
> > happens, whichever comes first.
>
> As a followup:
>
> 1. I took care of the timing issues.
> 2. Ran a scrub.
> 3. Ran a balance; it kept failing with about 20% left.
>    - stack traces in dmesg showed spinlock stuff
> 4. Got I/O errors on one file during my final backup.
>    - post-backup hashsums of everything else checked out
>    - the errors during the copy were csum mismatches, should anyone care
> 5. Ran a bunch of potentially disruptive btrfs check commands in
>    alphabetical order because "why not at this point?"
>    - they had zero effect as far as I can tell; all the same files were
>      readable and the btrfs check errors looked identical (admittedly I
>      didn't put them side by side)
> 6. Re-provisioned the array and restored from backups.
>
> As I thought about it, it may not have been an issue with the original
> power outage. I only ran a check after the power outage; my array could
> have had an issue due to a previous bug. I was on a 5.2.x kernel for
> several weeks under high load. Anyway, there are enough unknowns to make
> a root cause analysis not worth my time.
>
> Marking this as unresolved for folks in the future who may be looking
> for answers.
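>
> For posterity, the recovery attempt was along these lines (the mount
> point and device are examples, not my exact invocations):
>
>     # scrub the mounted array; -B waits for completion, -d reports
>     # per-device statistics
>     btrfs scrub start -Bd /mnt/array
>
>     # the balance that kept dying with about 20% left
>     btrfs balance start /mnt/array
>
>     # read-only check against an unmounted member device
>     btrfs check --readonly /dev/sdX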
Man, I should have read that over one more time for typos. Oh well.

Matt Pallissard