On March 10, 2022 2:48 pm, [email protected] wrote:
> That was actually a really BAD ADVICE…. as when node initiate maintenance
> mode it will try to migrate hosted vms … and eventually ends up in the same
> Lock loop..
> what you really need is to remove started vms from ha-manager, so when the
> node initiate shutdown it will do firstly do regular shutdown vm per vm.
>
> So, do something like below as first command in your NUT command sequence:
>
> for a in `ha-manager status | grep started|awk '{print $2}'|sed 's/vm://g'`;
> do ha-manager remove $a;done

what you should do is just change the shutdown policy to freeze or failover
before triggering the shutdown, and once power comes back up and your cluster
has booted, switch it back to migrate. that way, the shutdown will just stop
and freeze the resources, similar to what happens when rebooting using the
default conditional policy.

note that editing datacenter.cfg (where the shutdown_policy is configured) is
currently not exposed in any CLI tool, but you can update it using pvesh or
the API.

there is still one issue though - if the whole cluster is shut down at the
same time, at some point during the shutdown a non-quorate partition will be
all that's left, and at that point certain actions won't work anymore and the
node will probably get fenced. fixing this effectively would require some
sort of conditional delay at the right point in the shutdown sequence that
waits for all guests on all nodes(!) to stop before proceeding with stopping
the PVE services and corosync (nodes still might get fenced if they take too
long shutting down after the last guest has exited, but that shouldn't cause
many issues other than noise).

one way to do this would be for your NUT script to set a flag file in
/etc/pve, and some systemd service with the right Wants/After settings that
blocks the shutdown if the flag file exists and any guests are still running.
probably requires some tinkering, but can be safely tested in a virtual
cluster before moving to production ;)

this last problem is not related to HA though (other than HA introducing
another source of trouble courtesy of fencing being active) - you will also
potentially hit it with your approach. the 'stop all guests on node' logic
that PVE has on shutdown is for shutting down one node without affecting
quorum, it doesn't work reliably for full-cluster shutdowns (you might not
see problems if timing works out, but it's based on chance).
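
a rough sketch of the two building blocks mentioned above, not a tested
recipe: switching the policy via pvesh, and counting guests that are still
running cluster-wide (the kind of check a shutdown-blocking service could
poll). assumptions on my side: the `ha` option of /cluster/options takes
`shutdown_policy=<value>` as a property string, and `pvesh get
/cluster/resources --type vm --output-format json` reports a "status" field
per guest. the function names and the PVESH override are made up for this
example, so please verify against your PVE version in a test cluster first.

```shell
#!/bin/sh
# Sketch only. PVESH can be overridden (e.g. for dry runs or testing);
# the function names below are hypothetical helpers, not PVE tooling.
PVESH="${PVESH:-pvesh}"

set_shutdown_policy() {
    # $1: one of freeze, failover, conditional, migrate
    case "$1" in
        freeze|failover|conditional|migrate) ;;
        *) echo "unknown shutdown policy: $1" >&2; return 1 ;;
    esac
    # assumption: 'ha' is a property string holding shutdown_policy
    "$PVESH" set /cluster/options --ha "shutdown_policy=$1"
}

count_running_guests() {
    # Counts guests (VMs and containers) reported as running cluster-wide.
    # Crude text match on the JSON output; a real script may prefer jq.
    "$PVESH" get /cluster/resources --type vm --output-format json \
        | grep -o '"status" *: *"running"' \
        | wc -l \
        | tr -d ' '
}

# in the NUT shutdown command, before powering off:
#   set_shutdown_policy freeze
# once power is back and the cluster is quorate again:
#   set_shutdown_policy migrate
```

the block only defines functions, so it can be sourced from the NUT script
and from whatever restores the policy after boot. note that the guest count
needs a quorate cluster to be meaningful, which is exactly the window the
delay is supposed to protect.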

an alternative approach would be to request all HA resources to be stopped or
disabled (`ha-manager set .. --state ..`), wait for that to be done
cluster-wide (e.g. by polling /cluster/resources API path), and then trigger
the shutdown. disadvantage of that is you have to remember the pre-shutdown
state and restore that afterwards for each resource.

https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_node_maintenance

>> On Mar 10, 2022, at 2:48 PM, [email protected] wrote:
>>
>> I don’t remember, search into pvecm and pve[tab][tab] related commands man
>> pages
>>
>>> On Mar 10, 2022, at 2:19 PM, Stefan Radman <[email protected]> wrote:
>>>
>>> Hi Sto
>>>
>>> Thanks for the suggestions.
>>>
>>> The second option is what I was looking for.
>>>
>>> How do I initiate “pve node maintenance mode”?
>>>
>>> The “Node Maintenance” paragraph in the HA documentation is quite brief and
>>> does not refer to any command or GUI component.
>>>
>>> Thank you
>>>
>>> Stefan
>>>
>>>
>>>> On Mar 10, 2022, at 14:50, [email protected] wrote:
>>>>
>>>> Hi,
>>>>
>>>> here are two ideas: shutdown sequence -and- command sequence
>>>> 1: shutdown sequence you may achieve when you set NUT’s on each node to
>>>> only monitor the UPS power, then configure each node to shutdown itself on
>>>> a different ups power levels, ex: node1 on 15% battery, node2 on 10%
>>>> battery and so on
>>>> 2: you can set a cmd sequence to firstly execute pve node maintenance
>>>> mode , and then execute shutdown -> this way HA will not try to migrate vm
>>>> to node in maintenance, and the chance all nodes to goes into maintenance
>>>> in exactly same second seems to be not a risk at all.
>>>>
>>>> hope thats helpful.
>>>>
>>>> Regards,
>>>> Sto.
>>>>
>>>>> On Mar 10, 2022, at 1:10 PM, Stefan Radman via pve-user
>>>>> <[email protected]> wrote:
>>>>>
>>>>>
>>>>> From: Stefan Radman <[email protected]>
>>>>> Subject: Locking HA during UPS shutdown
>>>>> Date: March 10, 2022 at 1:10:09 PM GMT+2
>>>>> To: PVE User List <[email protected]>
>>>>>
>>>>>
>>>>> Hi
>>>>>
>>>>> I am configuring a 3 node PVE cluster with integrated Ceph storage.
>>>>>
>>>>> It is powered by 2 UPS that are monitored by NUT (Network UPS Tools).
>>>>>
>>>>> HA is configured with 3 groups:
>>>>> group pve1 nodes pve1:1,pve2,pve3
>>>>> group pve2 nodes pve1,pve2:1,pve3
>>>>> group pve3 nodes pve1,pve2,pve3:1
>>>>>
>>>>> That will normally place the VMs in each group on the corresponding node,
>>>>> unless that node fails.
>>>>>
>>>>> The cluster is configured to migrate VMs away from a node before shutting
>>>>> it down (Cluster=>Options=>HA Settings: shutdown_policy=migrate).
>>>>>
>>>>> NUT is configured to shut down the serves once the last of the two UPS is
>>>>> running low on battery.
>>>>>
>>>>> My problem:
>>>>> When NUT starts shutting down the 3 nodes, HA will first try to
>>>>> live-migrate them to another node.
>>>>> That live migration process gets stuck because all the nodes are shutting
>>>>> down simultaneously.
>>>>> It seems that the whole process runs into a timeout, finally “powers off”
>>>>> all the VMs and shuts down the nodes.
>>>>>
>>>>> My question:
>>>>> Is there a way to “lock” or temporarily de-activate HA before shutting
>>>>> down a node to avoid that deadlock?
>>>>>
>>>>> Thank you
>>>>>
>>>>> Stefan
>>>>>
>>>>> _______________________________________________
>>>>> pve-user mailing list
>>>>> [email protected]
>>>>> https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-user
>>>>
>>>> Best Regards,
>>>>
>>>> Stoyan Stoyanov Sto | Solutions Manager
>>>> | Telehouse.Solutions | ICT Department
>>>> | phone/viber: +359 894774934
>>>> | telegram: @prostoSto
>>>> | skype: prosto.sto
>>>> | email: [email protected]
>>>> | website: www.telehouse.solutions
>>>> | address: Telepoint #2, Sofia, Bulgaria
