On 2022-02-01 11:16, Lentes, Bernd wrote:
Hi,

we just experienced two power outages in a few days.
This showed me that our UPS configuration and the handling of resources on the cluster are insufficient.
We have a two-node cluster with SLES 12 SP5 and a Smart-UPS SRT 3000 from APC with a Network Management Card.
The UPS is able to buffer the two nodes and some hardware (SAN, monitor) for about one hour.
Our resources are virtual domains, about 20 of various flavors and versions.

Our primary goal is not to ride out a power outage for as long as possible, but to shut down all domains cleanly after a set time.

I'm currently thinking of waiting for a set time (maybe 15 minutes) and then running "crm resource stop VirtualDomains" from a script.
I would give the cluster some time for the shutdown (5-10 minutes) and afterwards shut down the nodes (via script).
I have to keep track of whether both nodes are running or only one of them.
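
A minimal sketch of the timed plan described above, as a dry run that only returns the commands that would be due. The 15-minute wait and 10-minute grace period come from the numbers above; the node names, the use of ssh, and "systemctl poweroff" are assumptions for illustration, not a tested procedure.

```python
def plan_shutdown(seconds_on_battery, nodes_online,
                  wait_s=15 * 60, grace_s=10 * 60):
    """Return the commands due at this point of the outage (dry run only).

    seconds_on_battery: how long the UPS has been on battery.
    nodes_online: names of the cluster nodes that are currently up
                  (hypothetical names; one or two entries).
    """
    commands = []
    if seconds_on_battery >= wait_s:
        # Ask the cluster to stop the virtual domains cleanly.
        commands.append(["crm", "resource", "stop", "VirtualDomains"])
    if seconds_on_battery >= wait_s + grace_s:
        # Assumption: nodes are powered off over ssh once the grace period ends.
        for node in nodes_online:
            commands.append(["ssh", node, "systemctl", "poweroff"])
    return commands


if __name__ == "__main__":
    # 25 minutes into an outage with both nodes up.
    for cmd in plan_shutdown(25 * 60, ["node1", "node2"]):
        print(" ".join(cmd))
```

Returning the commands instead of running them keeps the timing logic testable; a real script would also confirm that the domains have actually stopped before powering off.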

What is your approach?

Bernd

I don't know if this will be a useful answer for you, but I haven't seen anyone else reply.

In the Anvil!, we use SNMP to collect data on the APC UPSes powering a given cluster. The OIDs we read are at the head of this file, and the logic to read and collect the data starts here:

https://github.com/ClusterLabs/anvil/blob/main/scancore-agents/scan-apc-ups/scan-apc-ups#L3026
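
For a rough idea of what reading one value over SNMP can look like outside the Anvil! (this is not the agent's code), here is a minimal Python sketch. It assumes the net-snmp "snmpget" tool is installed. The OID is assumed to be upsAdvBatteryRunTimeRemaining from APC's PowerNet-MIB and should be checked against the MIB for your card; the host and community string are placeholders.

```python
import re
import subprocess

# Assumption: upsAdvBatteryRunTimeRemaining in APC's PowerNet-MIB.
# Verify against the MIB shipped for your Network Management Card.
RUNTIME_OID = ".1.3.6.1.4.1.318.1.1.1.2.2.3.0"


def parse_timeticks(line):
    """Return whole seconds from net-snmp output such as
    '... = Timeticks: (360000) 1:00:00.00' (one tick is 1/100 s)."""
    match = re.search(r"Timeticks:\s*\((\d+)\)", line)
    if match is None:
        raise ValueError(f"no Timeticks value in: {line!r}")
    return int(match.group(1)) // 100


def read_runtime_seconds(host, community="public"):
    """Query the UPS management card; host and community are placeholders."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, host, RUNTIME_OID],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_timeticks(result.stdout)
```

Keeping the parsing separate from the SNMP call makes it possible to test the parser against captured output without a UPS on the network.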

Some processing happens in-agent, but mainly the collected data is written to a generic "power" table (as we support any UPS we can collect data from). When we're done scanning, we analyze the data in the 'power' table to decide if we need to shed load (withdraw and power off nodes to extend runtime), do a complete graceful shutdown (if the batteries are about to die), or reboot the nodes after power is restored.
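
The three outcomes described above (shed load, full graceful shutdown, power back on) can be sketched as a small decision function. This is only an illustration and not the Anvil! logic: the function name, the inputs, and every threshold are hypothetical and would need tuning for a real site.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    LOAD_SHED = "load_shed"
    SHUTDOWN = "shutdown"
    POWER_ON = "power_on"


def decide(on_battery, runtime_s, charge_pct, nodes_running,
           shed_below_s=1800, shutdown_below_s=600, restart_above_pct=50.0):
    """Pick an action from aggregated UPS data (all thresholds are assumptions)."""
    if not nodes_running:
        # Only power nodes back on once mains has returned and the
        # batteries have recovered enough to survive another outage.
        if not on_battery and charge_pct >= restart_above_pct:
            return Action.POWER_ON
        return Action.NONE
    if not on_battery:
        return Action.NONE
    if runtime_s < shutdown_below_s:
        # Batteries are nearly drained: shut everything down gracefully.
        return Action.SHUTDOWN
    if runtime_s < shed_below_s:
        # Power off part of the load to stretch the remaining runtime.
        return Action.LOAD_SHED
    return Action.NONE
```

The restart threshold is checked separately from the shutdown threshold so that nodes are not powered on the moment mains returns with nearly empty batteries.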

This logic is handled mainly here. First, we figure out which UPS powers which nodes/clusters, then we pull the data on those specific UPSes to return a general "power state".

https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L607
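
The idea of first mapping UPSes to nodes and then reducing their readings to one general state can be sketched roughly as follows. The data structures, state names, and function name are hypothetical and do not mirror the ScanCore.pm code.

```python
def power_state(node, feeds, on_battery):
    """Summarise the power state of one node.

    feeds: dict mapping node name -> list of UPS names that power it.
    on_battery: dict mapping UPS name -> True if that UPS is on battery.
    Returns one of "unknown", "on_mains", "degraded", "on_battery"
    (state names are assumptions for this sketch).
    """
    upses = feeds.get(node, [])
    if not upses or any(ups not in on_battery for ups in upses):
        # No mapping, or at least one UPS has no reading yet.
        return "unknown"
    states = [on_battery[ups] for ups in upses]
    if all(states):
        return "on_battery"
    if any(states):
        # At least one feed still has mains power.
        return "degraded"
    return "on_mains"


if __name__ == "__main__":
    feeds = {"node1": ["ups1", "ups2"], "node2": ["ups2"]}
    readings = {"ups1": False, "ups2": True}
    print(power_state("node1", feeds, readings))
    print(power_state("node2", feeds, readings))
```

Treating a missing reading as "unknown" rather than "on_mains" avoids assuming power is fine when a UPS has simply stopped answering.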

The power state then tells the main daemon what actions to take, if any (load shed, shut down, restart). That's here:

https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L1541

This is super high level, and much of the specifics are related to the Anvil! cluster, but it hopefully gives you a starting point on how to approach the problem. We've been doing it this way for many years with really good effect.

Cheers


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould