Re: [ClusterLabs] Pacemaker resources are not scheduled
On Mon, 2018-04-16 at 23:52 +0800, lkxjtu wrote:
> [...]
>
> My environment is a virtual machine environment. There is no headshot
> device. Can I configure fencing? How to do it?

If you have access to the physical host, see fence_virtd and the
fence_xvm fence agent. They implement fencing by having the hypervisor
kill the VM.

There are some limitations to that approach. If the host itself is
dead, then the fencing will fail. So it makes the most sense when all
the VMs are on a single host -- but that introduces a single point of
failure.
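For a rough idea of the cluster side of that approach, a fence_xvm
stonith resource might be configured along these lines (crm syntax;
the host-to-VM name map is a placeholder -- fence_virtd must already
be running on the physical host, and the shared key distributed to
the guests):

    # default key location is /etc/cluster/fence_xvm.key
    crm configure primitive fence_all stonith:fence_xvm \
        params pcmk_host_map="122.0.1.9:vm9;122.0.1.10:vm10" \
        op monitor interval=60s
    crm configure property stonith-enabled=true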
If you don't have access to the physical host, see if you have access
to some sort of VM management API. If so, you can write a fence agent
that calls the API to kill the VM (fence agents already exist for some
public cloud providers).
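If you go that route, note that a fence agent is essentially an
executable that reads name=value options on standard input and
performs the requested action. A bare-bones sketch, assuming a
hypothetical vmctl command wrapping the management API (all names
here are placeholders):

    #!/bin/sh
    # fence_vmapi (hypothetical): fence a VM through a management API
    while read line; do
        case "$line" in
            action=*) action=${line#action=} ;;
            port=*)   port=${line#port=}   ;;  # name of the VM to fence
        esac
    done
    case "$action" in
        off)            vmctl stop  "$port" ;;
        on)             vmctl start "$port" ;;
        reboot)         vmctl stop "$port" && vmctl start "$port" ;;
        monitor|status) vmctl status "$port" >/dev/null 2>&1 ;;
        metadata)       cat /usr/share/fence_vmapi/metadata.xml ;;
        *)              exit 1 ;;
    esac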
Re: [ClusterLabs] Pacemaker resources are not scheduled
> Lkxjtu,
>
> On 14/04/18 00:16 +0800, lkxjtu wrote:
>> My cluster version:
>> Corosync 2.4.0
>> Pacemaker 1.1.16
>>
>> [...]
>>
>> $ cat /etc/corosync/corosync.conf
>> [co]mpatibility: whitetank
>>
>> [...]
>
> You are apparently mixing configuration directives for older major
> version(s) of corosync than you claim to be using.
> See corosync_conf(5) + votequorum(5) man pages for what you are
> supposed to configure with the actual version.

Thank you for your detailed answer!
corosync.conf is part of our ansible scripts, but corosync and
pacemaker are updated from the yum repository; that is what caused the
current gap. I will carefully compare the new and old configuration
formats.
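For comparison, a minimal corosync 2.x configuration assembled from
corosync_conf(5) and votequorum(5) might look roughly like this (the
cluster name is a placeholder and the node addresses simply reuse the
IPs seen in this thread):

    totem {
        version: 2
        cluster_name: paascluster
        transport: udpu
    }

    nodelist {
        node {
            ring0_addr: 122.0.1.9
            nodeid: 1
        }
        node {
            ring0_addr: 122.0.1.10
            nodeid: 2
        }
    }

    quorum {
        provider: corosync_votequorum
        # two_node: 1 would be appropriate for a two-node cluster
    }

    logging {
        to_logfile: yes
        logfile: /root/info/logs/pacemaker_cluster/corosync.log
        to_syslog: yes
    }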
> Regarding your pacemaker configuration:
>
> [...]
>
> You are urged to configure fencing, otherwise asking for sane
> cluster's behaviour (which you do) is out of question, unless
> you precisely know why you are not configuring it.

My environment is a virtual machine environment. There is no headshot
device. Can I configure fencing? How to do it?

> [...]
Re: [ClusterLabs] Pacemaker resources are not scheduled
Lkxjtu,

On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
>
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want a stable and highly available resource
> with infinite recovery for everyone. Is my resource configure
> correct?

see below

> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
>
> [...]
>
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     logfile: /root/info/logs/pacemaker_cluster/corosync.log
>     to_syslog: yes
>     syslog_facility: daemon
>     syslog_priority: info
>     debug: off
>     function_name: on
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     }
> }
>
> amf {
>     mode: disabled
> }
>
> aisexec {
>     user: root
>     group: root
> }

You are apparently mixing configuration directives for older major
version(s) of corosync than you claim to be using.
See corosync_conf(5) + votequorum(5) man pages for what you are
supposed to configure with the actual version.

Regarding your pacemaker configuration:

> $ crm configure show
>
> [... reordered ...]
>
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.16-12.el7-94ff4df \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     start-failure-is-fatal=false \
>     load-threshold="3200%"

You are urged to configure fencing, otherwise asking for sane
cluster's behaviour (which you do) is out of question, unless
you precisely know why you are not configuring it.

> [... reordered ...]

Furthermore you are using custom resource agents of undisclosed
quality and compatibility with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

Since your resources come in isolated groups, I would go one
by one, trying to figure out why the group won't run as expected.

For instance:

> primitive inetmanager inetmanager \
>     op monitor interval=10s timeout=160 \
>     op stop interval=0 timeout=60s on-fail=restart \
>     op start interval=0 timeout=60s on-fail=restart \
>     meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
>     params ip=122.0.1.201 cidr_netmask=24 \
>     op start interval=0 timeout=20 \
>     op stop interval=0 timeout=20 \
>     op monitor timeout=20s interval=10s depth=0 \
>     meta migration-threshold=3 failure-timeout=60s
>
> [...]
>
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>
> [...]
>
> $ crm status
> [...]
> Full list of resources:
> [...]
> inetmanager_vip    (ocf::heartbeat:IPaddr2):        Stopped
> inetmanager        (ocf::heartbeat:inetmanager):    Stopped
>
> [...]
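As an aside, a colocation constraint plus a mandatory order between
two resources is exactly what a resource group expresses more
compactly. A possible equivalent in crm syntax, reusing the resource
names above in place of the two constraints:

    # starts inetmanager first, keeps inetmanager_vip on the same node
    crm configure group inetmanager_grp inetmanager inetmanager_vip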
> corosync.log of node 122.0.1.10
>
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning:
>     status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9
>     failed (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
>     abort_transition_graph: Transition aborted by operation
>     inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed |
>     magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be
>     cib=0.124.2400 source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
>     match_graph_event: Action inetmanager_monitor_0 (24) confirmed on
>     122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
>     process_graph_event: Detected action (360.24)
>     inetmanager_monitor_0.2152=unknown error: failed
> [... the same four messages repeat ...]
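The "target: 7 vs. rc: 1" above is the key symptom: the initial probe
(inetmanager_monitor_0) expects OCF_NOT_RUNNING (7) on a node where
the resource is stopped, but the agent returns OCF_ERR_GENERIC (1),
so a failure is recorded before the resource has ever been started --
which fits the "deadlock" behaviour described. The agent's monitor
action has to distinguish these cases; a minimal sketch, with the
pidfile path as a placeholder:

    inetmanager_monitor() {
        # Probes run on every node, so a cleanly stopped resource
        # must report OCF_NOT_RUNNING (7), not a generic error (1).
        pid=$(cat /var/run/inetmanager.pid 2>/dev/null)
        if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
            return 0    # OCF_SUCCESS: running
        fi
        return 7        # OCF_NOT_RUNNING: stopped
    }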