Re: Status of RAID5/6

Austin S. Hemmelgarn Thu, 22 Mar 2018 05:02:11 -0700

On 2018-03-21 16:02, Christoph Anton Mitterer wrote:
On the note of maintenance specifically:

- Maintenance tools
   - How to get the status of the RAID? (Querying kernel logs is IMO
     rather a bad way for this)
     This includes:
     - Is the raid degraded or not?

Check for the 'degraded' flag in the mount options. Assuming you're doing things sensibly and not specifying it on mount, it gets added when the array goes degraded.
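
A minimal sketch of that check in Python, assuming the mount options are read from /proc/mounts on Linux. The function name `degraded_btrfs_mounts` is made up for illustration and is not part of any btrfs tool.

```python
def degraded_btrfs_mounts(mounts_text):
    """Return the mount points of btrfs filesystems whose options include 'degraded'.

    mounts_text is the content of /proc/mounts (or text in the same format).
    """
    degraded = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "btrfs":
            continue
        if "degraded" in fields[3].split(","):
            degraded.append(fields[1])
    return degraded
```

Passing it the result of `open("/proc/mounts").read()` gives the list of degraded mount points. Note that /proc/mounts escapes spaces in paths as `\040`, which this sketch does not decode.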

     - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
       they? (Reshape would be: if the raid level is changed or the raid
       grown/shrinked: has all data been replicated enough to be
       "complete" for the desired raid lvl/number of devices/size?

A bit trickier, but still not hard, just check the output of `btrfs scrub status`, `btrfs balance status`, and `btrfs replace status` for the volume. It won't check automatic spot-repairs (that is, repairing individual blocks that fail checksums), but most people really don't care.
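
As a rough sketch, the three status commands can be collected from one Python function. The helper names `status_commands` and `collect_status` are invented here, the output is returned as raw text rather than parsed, and exit statuses are deliberately not interpreted. The `-1` option makes `btrfs replace status` print once instead of polling until the replace finishes.

```python
import subprocess


def status_commands(mountpoint):
    """Build the three status commands for one mounted btrfs volume."""
    return [
        ["btrfs", "scrub", "status", mountpoint],
        ["btrfs", "balance", "status", mountpoint],
        # -1: print the replace status once instead of continuously
        ["btrfs", "replace", "status", "-1", mountpoint],
    ]


def collect_status(mountpoint, run=subprocess.run):
    """Run each status command and return its raw stdout keyed by subcommand.

    `run` is injectable so the function can be exercised without btrfs-progs.
    """
    reports = {}
    for cmd in status_commands(mountpoint):
        result = run(cmd, capture_output=True, text=True)
        reports[" ".join(cmd[1:3])] = result.stdout.strip()
    return reports
```

Calling `collect_status("/mnt/data")` (the path is only an example) typically needs root and returns a dict with the keys "scrub status", "balance status", and "replace status".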

    - What should one regularly do? scrubs? balance? How often?
      Do we get any automatic (but configurable) tools for this?

There aren't any such tools that I know of currently. storaged might have some, but I've never really looked at it so I can't comment (I'm kind of averse to having hundreds of background services running to do stuff that can just as easily be done in a polling manner from cron without compromising their utility). Right now though, it's _trivial_ to automate things with cron, or systemd timers, or even third-party tools like monit (which has the bonus that if the maintenance fails, you get an e-mail about it).
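
One possible shape for such a cron job, sketched in Python: run a scrub in the foreground with `btrfs scrub start -B` and only produce output when it fails, so that cron's usual mail-on-output behaviour does the notifying. The function name `scrub` is an assumption for illustration, not an existing tool.

```python
import subprocess
import sys


def scrub(mountpoint, run=subprocess.run):
    """Run a foreground scrub and report only on failure.

    Returns the exit status of `btrfs scrub start -B`. `run` is injectable
    so the function can be tested without touching a real filesystem.
    """
    # -B keeps the scrub in the foreground so its exit status reflects the result
    result = run(
        ["btrfs", "scrub", "start", "-B", mountpoint],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # cron mails whatever the job prints, so stay silent on success
        print(f"scrub of {mountpoint} failed (exit {result.returncode})", file=sys.stderr)
        print(result.stdout + result.stderr, file=sys.stderr)
    return result.returncode
```

Wrapped in a small script (say a hypothetical /usr/local/sbin/btrfs-scrub.py that calls `sys.exit(scrub(sys.argv[1]))`), this could be run from a crontab entry or a systemd timer. How often to run it is a local policy decision that the sketch does not address.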

    - There should be support in commonly used tools, e.g. Icinga/Nagios
      check_raid

Agreed. I think there might already be a Nagios plugin for the basic checks, not sure about anything else though.

Netdata has had basic monitoring support for a while now, but it only looks at allocations, not error counters, so while it will help catch impending ENOSPC issues, it can't really help much with data corruption issues.

    - Ideally there should also be some desktop notification tool, which
      tells about raid (and btrfs errors in general) as small
      installations with raids typically run no Icinga/Nagios but rely
      on e.g. email or gui notifications.

Desktop notifications would be nice, but are out of scope for the main btrfs-progs. Not even LVM, MDADM, or ZFS ship desktop notification support from upstream. You don't need Icinga or Nagios for monitoring either. Netdata works pretty well for covering the allocation checks (and I'm planning to have something soon), and it's trivial to set up e-mail notifications with cron or systemd timers or even tools like monit.

On the note of generic monitoring though, I've been working on a Python 3 script (with no dependencies beyond the Python standard library) to do the same checks that Netdata does regarding allocations, as well as checking device error counters and mount options, that should be reasonable as a simple warning tool run from cron or a systemd timer. I'm hoping to get it included in the upstream btrfs-progs, but I don't have it in a state yet that it's ready to be posted (the checks are working, but I'm still having issues reliably mapping between mount points and filesystem UUIDs).
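
The script itself is not posted here, so the following is only an illustrative sketch of what the error-counter part of such a check might look like, not the actual implementation. It assumes the plain-text output of `btrfs device stats <mountpoint>`, where each line has the form `[/dev/sdX].counter_name   value`. The function names are invented.

```python
import re

# One line of `btrfs device stats` output, e.g. "[/dev/sda].write_io_errs   0"
STAT_LINE = re.compile(r"^\[(?P<dev>.+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(output):
    """Parse `btrfs device stats` text into {device: {counter: value}}."""
    stats = {}
    for line in output.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            stats.setdefault(m["dev"], {})[m["counter"]] = int(m["value"])
    return stats


def devices_with_errors(stats):
    """Return a sorted list of devices that have any non-zero error counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))
```

A warning tool would feed the command's stdout to `parse_device_stats` and print something (so cron sends mail) whenever `devices_with_errors` returns a non-empty list.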

I think especially for such tools it's important that these are
maintained by upstream (and yes I know you guys are rather fs
developers not)... but since these tools are so vital, having them done
3rd party can easily lead to the situation where something changes in
btrfs, the tools don't notice and errors remain undetected.

It depends on what they look at. All the stuff under /sys/fs/btrfs should never change (new things might get added, but none of the old stuff is likely to ever change because /sys is classified as part of the userspace ABI, and any changes would get shot down by Linus), so anything that just uses those will likely have no issues (Netdata falls into this category for example). Same goes for anything using ioctls directly, as those are also userspace ABI.
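
As an example of a check that relies only on /sys/fs/btrfs, here is a small sketch that reads the per-type allocation counters for one filesystem. It assumes the `allocation/<type>/total_bytes` and `allocation/<type>/bytes_used` files under the filesystem's UUID directory. The function name and the `sysfs_root` parameter are made up for illustration (the latter mainly so the code can be tested against a fake tree).

```python
from pathlib import Path


def read_allocation(fs_uuid, sysfs_root="/sys/fs/btrfs"):
    """Return {block group type: (bytes_used, total_bytes)} for one filesystem."""
    usage = {}
    base = Path(sysfs_root) / fs_uuid / "allocation"
    for kind in ("data", "metadata", "system"):
        # Each file holds a single decimal number followed by a newline
        total = int((base / kind / "total_bytes").read_text())
        used = int((base / kind / "bytes_used").read_text())
        usage[kind] = (used, total)
    return usage
```

Comparing `bytes_used` against `total_bytes` per type is one way to spot chunks filling up. What threshold counts as "impending ENOSPC" is left to the caller.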

