Thanks for the extensive answer, Chris!

> a. Check with the manufacturer of the hardware raid for firmware
> updates for all the controllers. Also check if the new version is
> backward compatible with an array made with the version you have, and
> if not, if downgrade is possible. That way you have the option of
> pulling the drives you have, putting them on a shelf, buying new
> drives, and creating a new array with new hardware raid controller
> firmware without having to blow away this broken Btrfs file system
> just yet.

Yes, this is what I am preparing to do, and I absolutely agree: the HW controllers 
are part of the issue, for sure. Analysing the logs shows that right at the 
time when the system became unavailable from the user's point of view, the 
following happened:

* HW bus reset by the kernel module due to an IO timeout towards the controller
* unresponsive SCSI devices logged by the kernel
* btrfs kernel module logging that in-flight IO was being dropped

A few minutes (!) later, the HW controller dropped a disk from one of the 
arrays and logged IO timeouts towards this (probably faulty) disk. The raid 
itself always stayed "consistent" but "degraded". There has still not been any 
issue with the array, and the rebuild went fine.

I have extensive experience with exactly this controller / disk / firmware / 
kernel combo (on other filesystems) and unfortunately have to say that part of 
the issue seems to be how BTRFS has (not?) handled the IO timeouts / drops from 
the lower layer.

Happy to provide insight into the timeline and the messages from these events 
if that is of interest to anyone.

> b. If you go with Btrfs again, I suggest using metadata raid1. It's
> speculation whether that would help recovery in this failure case. But
> it probably wouldn't have made it any worse, and wouldn't meaningfully
> impact performance.

Point taken!
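
For my own notes, next time I would expect the mkfs to look roughly like the 
following, assuming Btrfs again sits on top of two (or more) HW raid arrays 
exposed as separate block devices (the device names are placeholders):

  # metadata mirrored across the arrays, data left as single/linear
  mkfs.btrfs -m raid1 -d single /dev/mapper/arrayA /dev/mapper/arrayB

That way the metadata at least has a second copy to fall back on if one of 
the arrays misbehaves again.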

> For first mount after mkfs, use mount option
> 'space_cache=v2' to create the free space tree, it's soon to be the
> default anyway, and for large file systems it offers improved
> performance and the same reliability.

The system was already on v2.
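
(For anyone reading along: this can be checked from the superblock with 
something like

  btrfs inspect-internal dump-super /dev/mapper/arrayA | grep -i free_space

which should show the FREE_SPACE_TREE flags in compat_ro_flags; the device 
name is a placeholder.)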

> If you don't care about
> performance you could just always use `nospace_cache` mount option in
> addition to `noatime,notreelog` and optionally a compression option
> like `compress=zstd`. I would not use the nodatacow or nodatasum
> options. If you're considering those mount options you should just
> consider using ext4 or XFS at the next go around.

The reason for these mount options was to prevent fragmentation of large files 
(due to COW); nodatasum was chosen because of (supposedly existing) stability 
issues with the checksumming. The advantage of BTRFS over XFS for me was the 
multi-device capability (without LVM), snapshots, and directory (subvolume) 
based quotas.
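
For completeness, the subvolume-based quota I mean is roughly the following 
(the paths are made up, just to illustrate the idea):

  btrfs subvolume create /data/projects
  btrfs quota enable /data
  btrfs qgroup limit 2T /data/projects

That kind of per-directory limit, together with snapshots and adding/removing 
devices on the fly, is what made BTRFS attractive here in the first place.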

>
> c. If it turns out the current Btrfs can be repaired, of course update
> backups ASAP. But then I'd personally consider the file system still
> suspect for anything other than short term use, and you'll want to
> rebuild it from scratch eventually anyway, which lands you back at a.)
> and b.) above.

Yes, the FS will in any case only be used to recover the data and will then be 
rebuilt. Not sure if it will be on BTRFS, to be honest.
But then, I know people who have lost data due to XFS not handling power 
failures or lower-layer HW issues well. So maybe it is just what it is, and an 
up-to-date backup is required in any case (what if a user accidentally deletes 
everything... the best FS won't save you from that).
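
In case it is useful for the archives: the plan is to pull the data off 
read-only before experimenting with any repair, roughly along these lines 
(devices and paths are placeholders, and rescue=all needs a reasonably recent 
kernel):

  mount -o ro,rescue=all /dev/mapper/arrayA /mnt/recover
  # or, if it refuses to mount at all:
  btrfs restore -v /dev/mapper/arrayA /backup/staging/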


> The most recent ext4 and XFS upstream work enables
> metadata checksumming so you'd be in the same boat as you were with
> Btrfs using nodatacow; there are still some older tools that create
> those file systems without metadata checksumming, so I'd watch out for
> that. And I'd say it's a coin toss which one to pick; I'm not really
> sure off hand which one has a greater chance of surviving a hard reset
> with inflight data.

True.
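
If I end up going that route, I will make sure the metadata checksums are 
actually enabled at mkfs time rather than trusting older tool defaults, i.e. 
something like (the device name is a placeholder):

  mkfs.xfs -m crc=1 /dev/mapper/arrayA
  mkfs.ext4 -O metadata_csum /dev/mapper/arrayA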

> d. Back to the hardware raid6 controller: you should make sure it's
> really configured per manufacturer's expectations with respect to
> drive write caching. Something got lost in the hard reset. Should the
> individual drive write caches be disabled? Possible that the hardware
> raid vendor expects this, if they're doing controller caching, and
> they ensure the proper flushing to disk in the order expected by the
> file system, where individual drive write caches can thwart that
> ordering.

As described above, it is now clear that the issue was in fact triggered when 
the controller timed out and in-flight data "had to" be dropped by the upper 
layers.

> If the controller has a write cache, is it battery backed?

Yes!


> If not, does the manufacturer recommend disabling write caching?
> Something didn't work, and these are just some of the questions to try
> and find out the optimal settings to avoid this happening in the
> future, because even with a backup, restoring this much data is a
> PITA.
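
Noted. I will also check the write cache setting of the individual disks 
behind the controller against the vendor's recommendation, e.g. with something 
like the following (assuming the controller exposes the drives far enough for 
this to work; the device name is a placeholder):

  hdparm -W  /dev/sdX   # query the drive's write cache state
  hdparm -W0 /dev/sdX   # disable it, if that is what the vendor recommends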


Thank you very much again for taking the time to reply. I really appreciate it!

Kind regards,

Marcel

>
> Chris Murphy

