On Tue, 2019-01-22 at 20:35 +0300, Andrei Borzenkov wrote:
> 22.01.2019 20:00, Ken Gaillot wrote:
> > On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
> > > Hi,
> > > 
> > > we have a new UPS which has enough charge to keep our 2-node
> > > cluster and its periphery (SAN, switches ...) running for a
> > > reasonable time.
> > > I'm currently thinking about the shutdown and restart procedure
> > > for the complete cluster when the power is lost and does not
> > > come back soon. The cluster is then fed by the UPS, but that
> > > does not last indefinitely, so I have to shut down the complete
> > > cluster.
> > > I have the possibility to run scripts on each node which are
> > > triggered by the UPS.
> > > 
> > > My shutdown procedure is:
> > > crm -w node standby node1
> > >     resources are migrated to node2
> > > systemctl stop pacemaker
> > >     also stops corosync
> > >     node is not fenced! (because of standby?)
> > 
> > Clean shutdowns don't get fenced. As long as the exiting node can
> > tell the rest of the cluster that it's leaving, everything can be
> > coordinated gracefully.
> > 
> > > systemctl poweroff
> > >     clean shutdown of node1
> > > 
> > > crm -w node standby node2
> > >     clean stop of resources
> > > systemctl stop pacemaker
> > > systemctl poweroff
> > > 
> > > The scripts would be executed from node2, via ssh for node1.
> > > What do you think about it?
> > 
> > Good plan, though perhaps there should be some allowance for the
> > case in which only node1 is running when the power dies.
> > 
> > > Now the restart, which gives me trouble.
> > > Currently I want to restart the cluster manually, because I'm
> > > not completely familiar with pacemaker and a bit afraid of
> > > ending up in constellations, due to the automation, that I
> > > didn't think of beforehand.
> > > 
> > > I can do that from anywhere because both nodes have ILO cards.
> > > 
> > > I start e.g. node1 with the power button.
> > > 
> > > systemctl start corosync
> > > systemctl start pacemaker
> > > corosync and pacemaker don't start automatically; I have read
> > > that recommendation several times.
> > > Now my first problem. Let's assume the other node is broken,
> > > but I still want to get resources running. My no-quorum-policy
> > > is ignore, so that should be fine. But with this setup I don't
> > > get the resources running automatically.
> > 
> > I'm guessing you have corosync 2's wait_for_all set (probably
> > implicitly by two_node). This is a safeguard for the situation
> > where both nodes are booted up but can't see each other.
> > 
> > If you're sure the other node is down, you can disable
> > wait_for_all before starting the node. (I'm not sure if this can
> > be changed while corosync is already running.)
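
For reference, a two-node votequorum configuration typically looks
something like the sketch below (the exact contents of your
corosync.conf are an assumption on my part, not something shown in
this thread). With two_node: 1, wait_for_all is enabled implicitly
unless it is explicitly set to 0:

    # /etc/corosync/corosync.conf -- quorum section only, sketch
    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implicitly enables wait_for_all;
        # setting it to 0 lets a lone node start resources after boot,
        # at the cost of losing the split-start protection
        wait_for_all: 0
    }

Keep in mind that leaving wait_for_all at 0 permanently removes the
safeguard described above; an alternative is to set it to 0 only for
the one boot where you know the other node is down, and revert it
afterwards.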
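
And on the shutdown sequence quoted at the top of the thread: a
minimal sketch of what a UPS-triggered script run on node2 might look
like, assuming the host names node1/node2 and passwordless root ssh
from node2 to node1 (both assumptions on my part):

    #!/bin/bash
    # Sketch only: evacuate and power off node1 first, then node2.
    # Does not handle the case where node1 is the only node left running.

    crm -w node standby node1               # resources migrate to node2
    ssh node1 'systemctl stop pacemaker'    # also stops corosync, as described above
    ssh node1 'systemctl poweroff' || true  # ssh may drop as node1 goes down

    crm -w node standby node2               # clean stop of the remaining resources
    systemctl stop pacemaker
    systemctl poweroff

The caveat above about only node1 being up when the power dies would
still need to be handled separately.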
> > 
> > > crm_mon says:
> > > =====================================================================
> > > Stack: corosync
> > > Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
> > > Last updated: Tue Jan 22 15:34:19 2019
> > > Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
> > > 
> > > 2 nodes configured
> > > 13 resources configured
> > > 
> > > Node ha-idg-1: online
> > > Node ha-idg-2: UNCLEAN (offline)
> > > 
> > > Inactive resources:
> > > 
> > > fence_ha-idg-2 (stonith:fence_ilo2): Stopped
> > > fence_ha-idg-1 (stonith:fence_ilo4): Stopped
> > > Clone Set: cl_share [gr_share]
> > >     Stopped: [ ha-idg-1 ha-idg-2 ]
> > > vm_mausdb (ocf::heartbeat:VirtualDomain): Stopped
> > > vm_sim (ocf::heartbeat:VirtualDomain): Stopped
> > > vm_geneious (ocf::heartbeat:VirtualDomain): Stopped
> > > Clone Set: cl_SNMP [SNMP]
> > >     Stopped: [ ha-idg-1 ha-idg-2 ]
> > > 
> > > Node Attributes:
> > > * Node ha-idg-1:
> > >     + maintenance : off
> > > 
> > > Migration Summary:
> > > * Node ha-idg-1:
> > > 
> > > Failed Fencing Actions:
> > > * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
> > >   last-failed='Tue Jan 22 15:34:17 2019'
> > > 
> 
> This is another problem - if the cluster requires stonith, it won't
> start resources while the other node is UNCLEAN and the fencing
> attempt has apparently failed.
Good point, I missed that. If you're sure the target node is down, you
can tell the cluster that with "stonith_admin --confirm <node>", and it
will treat the node as successfully fenced.

> > > Negative Location Constraints:
> > >     loc_fence_ha-idg-1 prevents fence_ha-idg-1 from running on ha-idg-1
> > >     loc_fence_ha-idg-2 prevents fence_ha-idg-2 from running on ha-idg-2
> > > =====================================================================
> > > The cluster does not have quorum, but that shouldn't be a
> > > problem. corosync and pacemaker are started.
> > > Why don't the resources start automatically? All target-roles
> > > are set to "started".
> > > Is it because the fencing didn't succeed? The status of
> > > ha-idg-2 isn't clear to the cluster?
> > > If yes, what can I do?
> > > 
> > > Bernd

_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
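
P.S. With the node names from the crm_mon output quoted above, the
manual confirmation mentioned in the reply would look like the
following; use it only when you are certain ha-idg-2 really is powered
off:

    # tell the cluster ha-idg-2 is known to be down,
    # so it is treated as successfully fenced
    stonith_admin --confirm ha-idg-2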