[ClusterLabs] “pcs cluster stop --all” hangs
Hi,

When I run "pcs cluster stop --all", it sometimes hangs with no response at all. The log is below. Can we tell from the log why it hangs, and how can we make the cluster stop immediately?

[root@node2 pg_log]# pcs status
Cluster name: hgpurog
Stack: corosync
Current DC: sds1 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Fri May 11 01:11:26 2018
Last change: Fri May 11 01:09:24 2018 by hacluster via crmd on sds1

2 nodes configured
3 resources configured

Online: [ sds1 sds2 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Stopped: [ sds1 sds2 ]
 Resource Group: mastergroup
     master-vip (ocf::heartbeat:IPaddr2): Started sds1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@node2 pg_log]# pcs cluster stop --all

The /var/log/messages is as below:

May 11 01:07:50 node2 crmd[5365]: notice: State transition S_PENDING -> S_NOT_DC
May 11 01:07:50 node2 crmd[5365]: notice: State transition S_NOT_DC -> S_PENDING
May 11 01:07:50 node2 crmd[5365]: notice: State transition S_PENDING -> S_NOT_DC
May 11 01:07:51 node2 pgsqlms(pgsqld)[5371]: INFO: Execute action monitor and the result 7
May 11 01:07:51 node2 pgsqlms(undef)[5408]: INFO: Execute action meta-data and the result 0
May 11 01:07:51 node2 crmd[5365]: notice: Result of probe operation for pgsqld on sds2: 7 (not running)
May 11 01:07:51 node2 crmd[5365]: notice: sds2-pgsqld_monitor_0:6 [ /tmp:5866 - no response\n ]
May 11 01:07:51 node2 crmd[5365]: notice: Result of probe operation for master-vip on sds2: 7 (not running)
May 11 01:10:02 node2 systemd: Started Session 16 of user root.
May 11 01:10:02 node2 systemd: Starting Session 16 of user root.
May 11 01:11:33 node2 pacemakerd[5357]: notice: Caught 'Terminated' signal
May 11 01:11:33 node2 systemd: Stopping Pacemaker High Availability Cluster Manager...
May 11 01:11:33 node2 pacemakerd[5357]: notice: Shutting down Pacemaker
May 11 01:11:33 node2 pacemakerd[5357]: notice: Stopping crmd
May 11 01:11:33 node2 crmd[5365]: notice: Caught 'Terminated' signal
May 11 01:11:33 node2 crmd[5365]: notice: Shutting down cluster resource manager
May 11 01:12:49 node2 systemd: Started Session 17 of user root.
May 11 01:12:49 node2 systemd-logind: New session 17 of user root.
May 11 01:12:49 node2 gdm-launch-environment]: AccountsService: ActUserManager: user (null) has no username (object path: /org/freedesktop/Accounts/User0, uid: 0)
May 11 01:12:49 node2 journal: ActUserManager: user (null) has no username (object path: /org/freedesktop/Accounts/User0, uid: 0)
May 11 01:12:49 node2 systemd: Starting Session 17 of user root.
May 11 01:12:49 node2 dbus[648]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus[648]: [system] Successfully activated service 'org.freedesktop.problems'
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Successfully activated service 'org.freedesktop.problems'
May 11 01:12:49 node2 journal: g_dbus_interface_skeleton_unexport: assertion 'interface_->priv->connections != NULL' failed

Here is the log on the peer node:

May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: No secondary connected to the master
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: "sds2" is not connected to the primary
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: INFO: Execute action monitor and the result 8
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: No secondary connected to the master
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: "sds2" is not connected to the primary
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: INFO: Execute action monitor and the result 8
May 11 01:09:24 node1 crmd[]: notice: sds1-pgsqld_monitor_1:19 [ /tmp:5866 - accepting connections\n ]
May 11 01:09:24 node1 crmd[]: notice: Transition aborted by deletion of lrm_resource[@id='pgsqld']: Resource state removal
May 11 01:10:02 node1 systemd: Started Session 17 of user root.
May 11 01:10:02 node1 systemd: Starting Session 17 of user root.
May 11 01:11:33 node1 pacemakerd[1042]: notice: Caught 'Terminated' signal
May 11 01:11:33 node1 systemd: Stopping Pacemaker High Availability Cluster Manager...
May 11 01:11:33 node1 pacemakerd[1042]: notice: Shutting down Pacemaker
May 11 01:11:33 node1 pacemakerd[1042]: notice: Stopping crmd
May 11 01:11:33 node1 crmd[]: notice: Caught 'Terminated' signal
May 11 01:11:33 node1 crmd[]: notice: Shutting down cluster resource manager
May 11 01:11:33 node1 crmd[]: warning: Input I_SHUTDOWN received in state S_TRANSITION_ENGINE from crm_shutdown
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
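For a hang like this, one pragmatic approach (a sketch, not from the thread itself; node names are the ones from the status output above) is to first check what Pacemaker is still waiting on, then fall back to stopping nodes individually, and only kill the daemons as a last resort:

```shell
# Check what the cluster is still trying to do before forcing anything.
crm_mon -1                      # one-shot status; look for pending or failed actions
journalctl -u pacemaker -n 50   # recent shutdown progress on this node

# If "pcs cluster stop --all" stays stuck, try stopping nodes one at a time
# so you can see which node's resource shutdown is blocking.
pcs cluster stop sds2
pcs cluster stop sds1

# Last resort: kill the cluster daemons on the local node.
# This skips orderly resource shutdown, so resources may be left running.
pcs cluster kill
```

Note that `pcs cluster kill` bypasses the resource stop actions entirely, so check afterwards (e.g. that the VIP is really gone) before restarting the cluster.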
Re: [ClusterLabs] What is the mechanism for pacemaker to recover resources
On Thu, 2018-05-10 at 22:02 +0800, lkxjtu wrote:
> Great! These two parameters (batch-limit & node-action-limit) solve
> my problem. Thank you very much!
>
> By the way, is there any way to know the number of parallel actions
> on a node and across the cluster?

If you set PCMK_debug=crmd (or pacemaker-controld in the soon-to-be-released 2.0.0), then the detail log on each node will have messages like:

  debug: Current load is 0.57 across 1 core(s)

and

  debug: Host rhel7-1 supports a maximum of 2 jobs and throttle mode . New job limit is 2

Of course your logs will grow faster with debug turned on ...

Otherwise there's no simple way to know. It might be nice to have a command-line option to query the current values.

> At 2018-05-10 20:56:27, "lkxjtu" wrote:
> On Tue, 2018-05-08 at 23:52 +0800, lkxjtu wrote:
> > I have a three node cluster of about 50 resources. When I reboot
> > three nodes at the same time, I observe the resources with "crm
> > status". I found that pacemaker starts 3-5 resources at a time,
> > from top to bottom, rather than starting them all at the same time.
> > Is there any parameter controlling this?
> > It seems acceptable, but if a resource cannot start because of an
> > exception, recovery of the later resources becomes very slow. I
> > don't know the principles behind how Pacemaker recovers resources,
> > in particular order and priority. Are there any suggestions? Thank
> > you very much!
>
> There are a few things affecting start-up order. First (obviously) is
> your constraints. If you have any ordering constraints, they will
> enforce the configured order.
>
> Second is internal constraints. Pacemaker has certain built-in
> constraints for safety. This includes obvious logical requirements
> such as starting a resource before promoting it. Pacemaker will do a
> probe (one-time monitor) of each resource on each node to find its
> initial state; everything is ordered after those probes. A clone
> won't be promoted until all pending starts complete.
>
> Last is throttling. By default Pacemaker computes a maximum number of
> jobs that can be executed at once across the entire cluster, and for
> each node. The number is based on observed CPU load on the nodes (and
> thus depends partly on the number of CPU cores). Usually it is best
> to allow Pacemaker to calculate the throttling, but you can force
> particular values by setting:
>
> - node-action-limit: a cluster-wide property specifying the maximum
>   number of actions that can be executed at once on any one node.
> - PCMK_node_action_limit: an environment variable specifying the same
>   thing, but it can be configured differently per node.
> - batch-limit: a cluster-wide property specifying the maximum number
>   of actions that can be executed at once across the entire cluster.
>
> The purpose of throttling is to keep Pacemaker from overloading the
> nodes such that actions might start timing out, causing unnecessary
> recovery.
>
> lkxjtu
> Email: lkx...@163.com
> Signature customized by NetEase Mail Master

-- 
Ken Gaillot
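The settings Ken describes can be applied like this (a sketch; the property and variable names are the ones from his reply, the file path and restart step assume a RHEL-style systemd install, and the numeric values are only placeholders):

```shell
# Turn on controller debug logging so the throttle messages appear in
# the detail log; takes effect after a pacemaker restart.
echo 'PCMK_debug=crmd' >> /etc/sysconfig/pacemaker
systemctl restart pacemaker

# Force explicit limits instead of the load-based calculation:
pcs property set node-action-limit=4   # max concurrent actions per node
pcs property set batch-limit=20        # max concurrent actions cluster-wide

# Per-node override of the node limit, set on that node only:
echo 'PCMK_node_action_limit=2' >> /etc/sysconfig/pacemaker
```

Restarting pacemaker is disruptive, so on a production cluster you would normally set the environment variables on one node at a time, or stick to the cluster properties, which take effect without a restart.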
Re: [ClusterLabs] What is the mechanism for pacemaker to recover resources
Great! These two parameters (batch-limit & node-action-limit) solve my problem. Thank you very much!

By the way, is there any way to know the number of parallel actions on a node and across the cluster?

At 2018-05-10 20:56:27, "lkxjtu" wrote:

On Tue, 2018-05-08 at 23:52 +0800, lkxjtu wrote:
> I have a three node cluster of about 50 resources. When I reboot
> three nodes at the same time, I observe the resources with "crm
> status". I found that pacemaker starts 3-5 resources at a time, from
> top to bottom, rather than starting them all at the same time. Is
> there any parameter controlling this?
> It seems acceptable, but if a resource cannot start because of an
> exception, recovery of the later resources becomes very slow. I don't
> know the principles behind how Pacemaker recovers resources, in
> particular order and priority. Are there any suggestions? Thank you
> very much!

There are a few things affecting start-up order. First (obviously) is your constraints. If you have any ordering constraints, they will enforce the configured order.

Second is internal constraints. Pacemaker has certain built-in constraints for safety. This includes obvious logical requirements such as starting a resource before promoting it. Pacemaker will do a probe (one-time monitor) of each resource on each node to find its initial state; everything is ordered after those probes. A clone won't be promoted until all pending starts complete.

Last is throttling. By default Pacemaker computes a maximum number of jobs that can be executed at once across the entire cluster, and for each node. The number is based on observed CPU load on the nodes (and thus depends partly on the number of CPU cores). Usually it is best to allow Pacemaker to calculate the throttling, but you can force particular values by setting:

- node-action-limit: a cluster-wide property specifying the maximum
  number of actions that can be executed at once on any one node.
- PCMK_node_action_limit: an environment variable specifying the same
  thing, but it can be configured differently per node.
- batch-limit: a cluster-wide property specifying the maximum number
  of actions that can be executed at once across the entire cluster.

The purpose of throttling is to keep Pacemaker from overloading the nodes such that actions might start timing out, causing unnecessary recovery.

lkxjtu
Email: lkx...@163.com
Signature customized by NetEase Mail Master
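Since there is no dedicated query command, one way (an assumption, not from the thread; the log path is the RHEL 7 default) to see what is configured and what the throttler decided is:

```shell
# Show the configured limits; --all also prints properties still at
# their default values.
pcs property list --all | grep -E 'batch-limit|node-action-limit'

# With PCMK_debug=crmd enabled, watch the computed per-node job limit
# in the detail log.
grep -E 'Current load|job limit' /var/log/pacemaker.log | tail
```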