Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
> But when the timeout has run out the RA tries to kill the machine with a
> "virsh destroy". And if that does not work (which is occasionally my
> problem) because the domain is in uninterruptible sleep (D state), the RA
> returns $OCF_ERR_GENERIC, which causes pacemaker to fence the lazy node.
> Or am I wrong?

What does the log look like when this happens?

--
Valentin

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
Re: [ClusterLabs] Adding a node to an active cluster
Hi,

Thanks for your reply. I have tried reloading corosync.conf, but it seems
making changes in corosync.conf and reloading doesn't modify the CIB config,
and the crm status output remains unchanged as well.

Thanks,
Jiaqi Tian

----- Original message -----
From: Tomas Jelinek
Sent by: "Users"
To: users@clusterlabs.org
Subject: [EXTERNAL] Re: [ClusterLabs] Adding a node to an active cluster
Date: Thu, Oct 22, 2020 4:35 AM

Hi,

Have you reloaded the corosync configuration after modifying corosync.conf?
I don't see it in your procedure. Modifying the config file has no effect
until you reload it with 'corosync-cfgtool -R'.

It should not be needed to modify the CIB and add the new node manually. If
everything is done correctly and the new node joins corosync membership,
pacemaker should figure out there is a new node in the cluster by itself.
Anyway, using pcs or crmsh is highly recommended, as others already said.

Regards,
Tomas

On 21. 10. 20 at 16:03, Jiaqi Tian1 wrote:
> Hi,
> I'm trying to add a new node into an active pacemaker cluster with
> resources up and running. After these steps:
> 1. update the corosync.conf files on all hosts in the cluster, including
>    the new node
> 2. copy the corosync auth file to the new node
> 3. enable corosync and pacemaker on the new node
> 4. add the new node to the list of nodes in /var/lib/pacemaker/cib/cib.xml
> I run crm status, and the new node is displayed as offline. It will not
> become online unless we restart corosync and pacemaker on all nodes in
> the cluster. But this is not what we want, since we want to keep existing
> nodes and resources up and running. Also, in this case crm_node -l
> doesn't list the new node.
> So my questions are:
> 1. Is there another approach to make the existing nodes aware of the new
>    node and have crm status show the node as online, while keeping the
>    other nodes and resources up and running?
> 2. Which config file does the crm_node command read?
> Thanks,
> Jiaqi Tian
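To make the reload step Tomas describes concrete, here is a minimal sketch of the procedure. The command names come from corosync/crmsh as referenced in this thread; treat the exact sequence as an assumption to verify against your versions, not a tested recipe:

```shell
# After editing /etc/corosync/corosync.conf on every node (including the
# new one), reload the running corosync daemons instead of restarting them:
corosync-cfgtool -R    # asks corosync to re-read corosync.conf cluster-wide

# Pacemaker should then notice the new corosync member by itself:
crm_node -l            # lists the nodes pacemaker currently knows about
crm status             # the new node should appear, eventually online
```

The key point is that no CIB edit is needed: once corosync membership changes, pacemaker picks up the new node on its own.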
Re: [ClusterLabs] Adding a node to an active cluster
Hi,

Thank you for your suggestion. The case I have is: I have host1 and host2 in
a cluster that has resources running, then I try to join host3 to the
cluster by running "crm cluster join -c host1 -y". But I get this
"Configuring csync2...ERROR: cluster.join: Can't invoke crm cluster init
init csync2_remote on host3" issue. Are there any other requirements for
running this command?

Thanks,
Jiaqi Tian

----- Original message -----
From: Xin Liang
Sent by: "Users"
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: [EXTERNAL] Re: [ClusterLabs] Adding a node to an active cluster
Date: Wed, Oct 21, 2020 9:44 PM

Hi Jiaqi,

Assuming you already have node1 running resources, you can try to run this
command on node2: "crm cluster join -c node1 -y"

From: Users on behalf of Jiaqi Tian1
Sent: Wednesday, October 21, 2020 10:03 PM
To: users@clusterlabs.org
Subject: [ClusterLabs] Adding a node to an active cluster

Hi,

I'm trying to add a new node into an active pacemaker cluster with resources
up and running. After these steps:
1. update the corosync.conf files on all hosts in the cluster, including the
   new node
2. copy the corosync auth file to the new node
3. enable corosync and pacemaker on the new node
4. add the new node to the list of nodes in /var/lib/pacemaker/cib/cib.xml
I run crm status, and the new node is displayed as offline. It will not
become online unless we restart corosync and pacemaker on all nodes in the
cluster. But this is not what we want, since we want to keep existing nodes
and resources up and running. Also, in this case crm_node -l doesn't list
the new node.
So my questions are:
1. Is there another approach to make the existing nodes aware of the new
   node and have crm status show the node as online, while keeping the other
   nodes and resources up and running?
2. Which config file does the crm_node command read?

Thanks,
Jiaqi Tian
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
----- On Oct 23, 2020, at 11:11 PM, arvidjaar arvidj...@gmail.com wrote:

>> I need something like that which waits for some time (maybe 30s) in case
>> the domain nevertheless stops although "virsh destroy" gave an error back,
>> because the SIGKILL is delivered once the process wakes up from D state.
>
> So why not ignore the virsh error and just wait always? You probably still
> need to retain the "domain not found" exit condition.

That's my plan.

Bernd

Helmholtz Zentrum München
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
----- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvi...@valentin-vidic.from.hr wrote:

> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>> But when the timeout has run out the RA tries to kill the machine with a
>> "virsh destroy". And if that does not work (which is occasionally my
>> problem) because the domain is in uninterruptible sleep (D state), the RA
>> returns $OCF_ERR_GENERIC, which causes pacemaker to fence the lazy node.
>> Or am I wrong?
>
> What does the log look like when this happens?

/var/log/cluster/corosync.log:

VirtualDomain(vm_amok)[8998]: 2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.
VirtualDomain(vm_amok)[8998]: 2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: child_timeout_callback: vm_amok_stop_0 process (PID 8998) timed out
Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: operation_finished: vm_amok_stop_0:8998 - timed out after 18ms

The stop timeout of the domain is 180 sec.

/var/log/libvirt/libvirtd.log (time is UTC):

2020-09-27 20:37:21.489+: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy
2020-09-27 20:37:21.505+: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
2020-09-27 20:37:31.962+: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor

SIGKILL didn't work. Nevertheless the process finished 20 seconds after the
destroy, surely because it woke up from D state and received the signal.

/var/log/cluster/corosync.log on the DC:

Sep 27 22:37:11 [3580] ha-idg-1 crmd: warning: status_from_rc: Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
Stop (also sigkill) failed
Sep 27 22:37:11 [3579] ha-idg-1 pengine: notice: native_stop_constraints: Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced

The cluster decides to fence the node, although the resource is stopped 10
seconds later.

atop log:

14037 - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...

PID of the domain is 14037.

14037 - E 0% worker (at 22:37:31)

The domain has stopped.

Bernd
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
On 23.10.2020 21:08, Lentes, Bernd wrote:
>
> Surprisingly, if the virsh destroy is successful the RA waits until the
> domain isn't running anymore:
> ...
>
> I need something like that which waits for some time (maybe 30s) in case
> the domain nevertheless stops although "virsh destroy" gave an error back,
> because the SIGKILL is delivered once the process wakes up from D state.

So why not ignore the virsh error and just wait always? You probably still
need to retain the "domain not found" exit condition.
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
Why don't you work with something like this: 'op stop interval=300
timeout=600'? The stop operation will time out at your requirements without
modifying the script.

Best Regards,
Strahil Nikolov

On Thursday, 22 October 2020, 23:30:08 GMT+3, Lentes, Bernd wrote:

Hi guys,

occasionally stopping a VirtualDomain resource via "crm resource stop" does
not work, and in the end the node is fenced, which is ugly. I had a look at
the RA to see what it does. After trying to stop the domain via "virsh
shutdown ..." within a configurable time, it switches to "virsh destroy".
I assume "virsh destroy" sends a SIGKILL to the respective process. But when
the host is doing heavy IO it's possible that the process is in "D" state
(uninterruptible sleep), in which it can't be killed with a SIGKILL. The
node the domain is running on is then fenced because of that. I dug deeper
and found out that the signal is often delivered a bit later (just a few
seconds) and the process is killed, but pacemaker has already decided to
fence the node. It's all about this excerpt from the RA:

force_stop()
{
    local out ex translate
    local status=0

    ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
    out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
    ex=$?
    translate=$(echo $out|tr 'A-Z' 'a-z')
    echo >&2 "$translate"
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            : ;; # unexpected path to the intended outcome, all is well
        [!0]*)
            ocf_exit_reason "forced stop failed"
            return $OCF_ERR_GENERIC ;;
        0*)
            while [ $status != $OCF_NOT_RUNNING ]; do
                VirtualDomain_status
                status=$?
            done ;;
    esac
    return $OCF_SUCCESS
}

I'm thinking about the following: how about letting the script wait a bit
after "virsh destroy"? I saw that it usually takes just a few seconds until
"virsh destroy" is successful. I'm thinking about this change:

    ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
    out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
    ex=$?
    sleep 10    # <----- (or maybe configurable)
    translate=$(echo $out|tr 'A-Z' 'a-z')

What do you think?

Bernd

--
Bernd Lentes
Systemadministration
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd

stay healthy
Re: [ClusterLabs] Upgrading/downgrading cluster configuration
Usually I prefer to use "crm configure show" and later "crm configure edit"
and replace the config. I am not sure if this will work with such a
downgrade scenario, but it shouldn't be a problem.

Best Regards,
Strahil Nikolov

On Thursday, 22 October 2020, 21:30:55 GMT+3, Vitaly Zolotusky wrote:

Thanks for the reply. I do see a backup config command in pcs, but not in
crmsh. What would that be in crmsh? Would something like this work after
corosync started in the state with all resources inactive? I'll try this:

crm configure save
crm configure load

Thank you!
_Vitaly Zolotusky

> On October 22, 2020 1:54 PM Strahil Nikolov wrote:
>
> Have you tried to back up the config via crmsh/pcs and, when you
> downgrade, to restore from it?
>
> Best Regards,
> Strahil Nikolov
>
> On Thursday, 22 October 2020, 15:40:43 GMT+3, Vitaly Zolotusky wrote:
>
> Hello,
> We are trying to upgrade our product from Corosync 2.X to Corosync 3.X.
> Our procedure includes an upgrade where we stop the cluster, replace the
> RPMs and restart the cluster. The upgrade works fine, but we also need to
> implement a rollback in case something goes wrong.
> When we roll back and reload the old RPMs, the cluster says that there
> are no active resources. It looks like there is a problem with the
> cluster configuration version.
> Here is the output of crm_mon:
>
> d21-22-left.lab.archivas.com /opt/rhino/sil/bin # crm_mon -A1
> Stack: corosync
> Current DC: NONE
> Last updated: Thu Oct 22 12:39:37 2020
> Last change: Thu Oct 22 12:04:49 2020 by root via crm_attribute on d21-22-left.lab.archivas.com
>
> 2 nodes configured
> 15 resources configured
>
> Node d21-22-left.lab.archivas.com: UNCLEAN (offline)
> Node d21-22-right.lab.archivas.com: UNCLEAN (offline)
>
> No active resources
>
> Node Attributes:
>
> ***
> What would be the best way to implement a downgrade of the configuration?
> Should we just change the crm feature set, or do we need to rebuild the
> whole config?
> Thanks!
> _Vitaly Zolotusky
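A minimal sketch of the backup/restore cycle in crmsh. The file name is a placeholder, and the pcs equivalents ("pcs config backup" / "pcs config restore") are mentioned as assumptions to verify against your installed versions:

```shell
# Before the upgrade, dump the full CLI configuration to a file:
crm configure save /root/cib-before-upgrade.crm

# ... upgrade; if something goes wrong, roll back the RPMs ...

# After rolling back, replace the running configuration with the saved one:
crm configure load replace /root/cib-before-upgrade.crm
```

Whether a configuration saved under a newer crm feature set loads cleanly on downgraded packages is exactly the open question of this thread, so test the restore path before relying on it.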
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
----- On Oct 23, 2020, at 5:06 PM, Strahil Nikolov hunter86...@yahoo.com wrote:

> Why don't you work with something like this: 'op stop interval=300
> timeout=600'? The stop operation will time out at your requirements
> without modifying the script.
>
> Best Regards,
> Strahil Nikolov

But when the timeout has run out, the RA tries to kill the machine with a
"virsh destroy". And if that does not work (which is occasionally my
problem) because the domain is in uninterruptible sleep (D state), the RA
returns $OCF_ERR_GENERIC, which causes pacemaker to fence the lazy node. Or
am I wrong? Where is the benefit of the shorter interval?

The return value of the "virsh destroy" operation is set immediately, and it
is -ne 0 when the "virsh destroy" didn't succeed. No matter if the domain
stops 20 sec. later, the return value is not changed and is sent to the LRM,
so the cluster wants to stonith that node.

Surprisingly, if the virsh destroy is successful the RA waits until the
domain isn't running anymore:

force_stop()
{
...
        0*)
            while [ $status != $OCF_NOT_RUNNING ]; do
                VirtualDomain_status
                status=$?
            done ;;

I need something like that which waits for some time (maybe 30s) in case the
domain nevertheless stops although "virsh destroy" gave an error back,
because the SIGKILL is delivered once the process wakes up from D state. For
this amount of time the RA has to wait and ensure that the return value is
zero if the domain stopped, or -ne 0 if the waiting also didn't help.

Bernd
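The waiting behaviour described above can be sketched as a small shell function. This is not the RA's code: the status check is passed in as a parameter (in the real RA it would be VirtualDomain_status), and the 30-second default is an assumption, so the logic can be shown and tested standalone:

```shell
# Poll a status command until it reports "not running" or a deadline passes.
# $1 is a command that returns 0 while the domain is still running;
# $2 is the grace period in seconds (default 30).
wait_for_stop() {
    local check_cmd=$1 deadline=${2:-30} waited=0
    while [ "$waited" -lt "$deadline" ]; do
        if ! $check_cmd; then
            return 0    # domain gone: report success to the caller
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1            # still running after the grace period: error
}
```

Calling this unconditionally after "virsh destroy" (instead of only on a zero exit code) would give exactly the bounded wait Bernd asks for: success if the domain dies within the grace period, a nonzero return otherwise.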
Re: [ClusterLabs] VirtualDomain does not stop via "crm resource stop" - modify RA ?
----- On Oct 23, 2020, at 7:11 AM, Andrei Borzenkov arvidj...@gmail.com wrote:

>> ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>> out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>> ex=$?
>> sleep 10    # <----- (or maybe configurable)
>> translate=$(echo $out|tr 'A-Z' 'a-z')
>>
>> What do you think?
>
> It makes no difference. You wait 10 seconds before parsing the output of
> "virsh destroy", that's all. It does not change the output itself, so if
> the output indicates that "virsh destroy" failed, it will still indicate
> that after 10 seconds.
>
> Either you need to repeat "virsh destroy" in a loop, or virsh itself
> should be more robust.

Hi Andrei,

yes, you are right. I saw it already after sending the e-mail. I will change
that.

Bernd
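Andrei's "repeat virsh destroy in a loop" suggestion could look roughly like the sketch below. This is hypothetical: the kill command is a parameter (in the RA it would be the actual virsh destroy invocation), and the attempt count and pause are made-up defaults:

```shell
# Run a kill command until it succeeds or the attempts are exhausted.
# $1 is the command (e.g. the virsh destroy call), $2 the max attempts.
retry_destroy() {
    local kill_cmd=$1 attempts=${2:-3} i=1
    while [ "$i" -le "$attempts" ]; do
        if $kill_cmd; then
            return 0    # destroy finally succeeded
        fi
        sleep 2         # give a D-state process time to wake up
        i=$((i + 1))
    done
    return 1            # all attempts failed
}
```

Unlike the plain sleep, each iteration re-runs the command and re-checks its exit status, so a destroy that only succeeds once the process leaves D state is actually observed.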
[ClusterLabs] Antw: [EXT] VirtualDomain does not stop via "crm resource stop" - modify RA ?
>>> "Lentes, Bernd" wrote on 22.10.2020 at 22:29 in message
<684655755.160569.1603398597074.javamail.zim...@helmholtz-muenchen.de>:
> Hi guys,
>
> occasionally stopping a VirtualDomain resource via "crm resource stop"
> does not work, and in the end the node is fenced, which is ugly.
> I had a look at the RA to see what it does. After trying to stop the
> domain via "virsh shutdown ..." within a configurable time, it switches
> to "virsh destroy".
> I assume "virsh destroy" sends a SIGKILL to the respective process. But when

In "good old Xen xm", a "destroy" "tears down" the VM, just throwing it out
of memory; from the perspective of the VM it is very much like powering it
off (no processes or buffered writes complete). So you typically want to set
a proper timeout to avoid that. The thing to worry about is the shutdown
inside the VM, not outside.

> the host is doing heavy IO it's possible that the process is in "D" state
> (uninterruptible sleep), in which it can't be killed with a SIGKILL. The
> node the domain is running on is then fenced because of that.
> I dug deeper and found out that the signal is often delivered a bit later
> (just a few seconds) and the process is killed, but pacemaker has already
> decided to fence the node.
> It's all about this excerpt from the RA:
>
> force_stop()
> {
>     local out ex translate
>     local status=0
>
>     ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>     out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>     ex=$?
>     translate=$(echo $out|tr 'A-Z' 'a-z')
>     echo >&2 "$translate"
>     case $ex$translate in
>         *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
>         *"error:"*"failed to get domain"*)
>             : ;; # unexpected path to the intended outcome, all is well
>         [!0]*)
>             ocf_exit_reason "forced stop failed"
>             return $OCF_ERR_GENERIC ;;
>         0*)
>             while [ $status != $OCF_NOT_RUNNING ]; do
>                 VirtualDomain_status
>                 status=$?
>             done ;;
>     esac
>     return $OCF_SUCCESS
> }
>
> I'm thinking about the following:
> How about letting the script wait a bit after "virsh destroy"? I saw that
> it usually takes just a few seconds until "virsh destroy" is successful.
> I'm thinking about this change:
>
>     ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
>     out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)
>     ex=$?
>     sleep 10    # <----- (or maybe configurable)
>     translate=$(echo $out|tr 'A-Z' 'a-z')
>
> What do you think?

Destroy will need a little time, so it may be worth trying that.

Regards,
Ulrich

> Bernd