Lkxjtu,

On 14/04/18 00:16 +0800, lkxjtu wrote:
> My cluster version:
> Corosync 2.4.0
> Pacemaker 1.1.16
>
> There are many resource anomalies. Some resources are only monitored
> and not recovered. Some resources are not monitored or recovered.
> Only one resource of vnm is scheduled normally, but this resource
> cannot be started because other resources in the cluster are
> abnormal. Just like a deadlock. I have been plagued by this problem
> for a long time. I just want a stable and highly available resource
> with infinite recovery for everyone. Is my resource configure
> correct?
see below

> $ cat /etc/corosync/corosync.conf
> [co]mpatibility: whitetank
>
> [...]
>
> logging {
>     fileline: off
>     to_stderr: no
>     to_logfile: yes
>     logfile: /root/info/logs/pacemaker_cluster/corosync.log
>     to_syslog: yes
>     syslog_facility: daemon
>     syslog_priority: info
>     debug: off
>     function_name: on
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>         tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>     }
> }
>
> amf {
>     mode: disabled
> }
>
> aisexec {
>     user: root
>     group: root
> }

You are apparently mixing configuration directives meant for older major
versions of corosync than the one you claim to be using. See the
corosync_conf(5) and votequorum(5) man pages for what you are supposed
to configure with the actual version.

Regarding your pacemaker configuration:

> $ crm configure show
>
> [... reordered ... ]
>
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.16-12.el7-94ff4df \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     start-failure-is-fatal=false \
>     load-threshold="3200%"

You are urged to configure fencing; otherwise asking for sane cluster
behaviour (which you do) is out of the question, unless you know
precisely why you are not configuring it.

> [... reordered ... ]

Furthermore, you are using custom resource agents of undisclosed
quality and compatibility with the requirements:
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

Since your resources come in isolated groups, I would go one by one,
trying to figure out why a given group won't run as expected.
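(An aside to the fencing remark above: a placeholder-only sketch, in crm
shell syntax, of what enabling fencing could look like. The stonith
agent, host name, BMC address and credentials below are all assumptions;
substitute whatever matches your actual hardware.)

```
# placeholders only -- pick a stonith agent matching your environment;
# fence_ipmilan is just one common choice for IPMI-capable servers
primitive fence-122-0-1-9 stonith:fence_ipmilan \
    params pcmk_host_list=122.0.1.9 ipaddr=<bmc-address> \
        login=<user> passwd=<secret> \
    op monitor interval=60s
property stonith-enabled=true
```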
For instance:

> primitive inetmanager inetmanager \
>     op monitor interval=10s timeout=160 \
>     op stop interval=0 timeout=60s on-fail=restart \
>     op start interval=0 timeout=60s on-fail=restart \
>     meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
> primitive inetmanager_vip IPaddr2 \
>     params ip=122.0.1.201 cidr_netmask=24 \
>     op start interval=0 timeout=20 \
>     op stop interval=0 timeout=20 \
>     op monitor timeout=20s interval=10s depth=0 \
>     meta migration-threshold=3 failure-timeout=60s
> [...]
> colocation inetmanager_col +inf: inetmanager_vip inetmanager
> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>
> [...]
>
> $ crm status
> [...]
> Full list of resources:
> [...]
>  inetmanager_vip  (ocf::heartbeat:IPaddr2):      Stopped
>  inetmanager      (ocf::heartbeat:inetmanager):  Stopped
>
> [...]
>
> corosync.log of node 122.0.1.10
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning:
> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed
> (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> abort_transition_graph: Transition aborted by operation inetmanager_monitor_0
> 'modify' on 122.0.1.9: Event failed |
> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400
> source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> match_graph_event: Action inetmanager_monitor_0 (24) confirmed on
> 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> process_graph_event: Detected action (360.24)
> inetmanager_monitor_0.2152=unknown error: failed
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: warning:
> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed
> (target: 7 vs. rc: 1): Error
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> abort_transition_graph: Transition aborted by operation inetmanager_monitor_0
> 'modify' on 122.0.1.9: Event failed |
> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400
> source=match_graph_event:310 complete=false
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> match_graph_event: Action inetmanager_monitor_0 (24) confirmed on
> 122.0.1.9 (rc=1)
> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10 crmd: info:
> process_graph_event: Detected action (360.24)
> inetmanager_monitor_0.2152=unknown error: failed

What causes the inetmanager agent to return 1 (OCF_ERR_GENERIC) when 7
(OCF_NOT_RUNNING) is expected? It may be a trivial issue in the
implementation of the agent, and it makes the whole group, together
with the "inetmanager_vip" resource, fail (due to the respective
constraints). The situation may be similar with the other isolated
sets of resources.

You may find the ocf-tester (ocft) tool from the resource-agents
project useful for checking the basic sanity of the custom agents:
https://github.com/ClusterLabs/resource-agents/tree/master/tools/ocft

Hope this helps

-- 
Poki
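P.S. To illustrate the return-code point: below is a minimal,
hypothetical monitor skeleton. Pidfile-based liveness is an assumption
about how the inetmanager service works, and a real agent would take its
parameters from OCF_RESKEY_* variables rather than a function argument;
the point is only that a probe on a node where the service is simply not
running must yield OCF_NOT_RUNNING (7), not OCF_ERR_GENERIC (1).

```shell
#!/bin/sh
# Standard OCF exit codes (see the OCF spec referenced above).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

inetmanager_monitor() {
    pidfile="$1"   # hypothetical; a real agent reads OCF_RESKEY_* params

    # No pidfile: cleanly stopped. A probe (monitor_0) must see 7 here,
    # not 1 -- returning 1 is exactly what makes pacemaker log
    # "target: 7 vs. rc: 1" and treat the resource as failed.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING

    # Pidfile present: check the process is actually alive.
    if kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        return $OCF_SUCCESS
    fi

    # Stale pidfile: process gone, still just "not running".
    return $OCF_NOT_RUNNING
}
```

ocf-tester exercises exactly this kind of contract against a full
agent; see the link above.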
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org