[ClusterLabs] “pcs cluster stop -all” hangs and

2018-05-10 Thread 范国腾
Hi,

When I run "pcs cluster stop --all", it sometimes hangs with no response at 
all. The log is below. Could you tell from the log why it hangs, and how can I 
make the cluster stop right away?

[root@node2 pg_log]# pcs status
Cluster name: hgpurog
Stack: corosync
Current DC: sds1 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Fri May 11 01:11:26 2018
Last change: Fri May 11 01:09:24 2018 by hacluster via crmd on sds1

2 nodes configured
3 resources configured

Online: [ sds1 sds2 ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Stopped: [ sds1 sds2 ]
 Resource Group: mastergroup
     master-vip (ocf::heartbeat:IPaddr2):   Started sds1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[root@node2 pg_log]# pcs cluster stop --all


The /var/log/messages is as below:
May 11 01:07:50 node2 crmd[5365]:  notice: State transition S_PENDING -> 
S_NOT_DC
May 11 01:07:50 node2 crmd[5365]:  notice: State transition S_NOT_DC -> 
S_PENDING
May 11 01:07:50 node2 crmd[5365]:  notice: State transition S_PENDING -> 
S_NOT_DC
May 11 01:07:51 node2 pgsqlms(pgsqld)[5371]: INFO: Execute action monitor and 
the result 7
May 11 01:07:51 node2 pgsqlms(undef)[5408]: INFO: Execute action meta-data and 
the result 0
May 11 01:07:51 node2 crmd[5365]:  notice: Result of probe operation for pgsqld 
on sds2: 7 (not running)
May 11 01:07:51 node2 crmd[5365]:  notice: sds2-pgsqld_monitor_0:6 [ /tmp:5866 
- no response\n ]
May 11 01:07:51 node2 crmd[5365]:  notice: Result of probe operation for 
master-vip on sds2: 7 (not running)
May 11 01:10:02 node2 systemd: Started Session 16 of user root.
May 11 01:10:02 node2 systemd: Starting Session 16 of user root.
May 11 01:11:33 node2 pacemakerd[5357]:  notice: Caught 'Terminated' signal
May 11 01:11:33 node2 systemd: Stopping Pacemaker High Availability Cluster 
Manager...
May 11 01:11:33 node2 pacemakerd[5357]:  notice: Shutting down Pacemaker
May 11 01:11:33 node2 pacemakerd[5357]:  notice: Stopping crmd
May 11 01:11:33 node2 crmd[5365]:  notice: Caught 'Terminated' signal
May 11 01:11:33 node2 crmd[5365]:  notice: Shutting down cluster resource 
manager
May 11 01:12:49 node2 systemd: Started Session 17 of user root.
May 11 01:12:49 node2 systemd-logind: New session 17 of user root.
May 11 01:12:49 node2 gdm-launch-environment]: AccountsService: ActUserManager: 
user (null) has no username (object path: /org/freedesktop/Accounts/User0, uid: 
0)
May 11 01:12:49 node2 journal: ActUserManager: user (null) has no username 
(object path: /org/freedesktop/Accounts/User0, uid: 0)
May 11 01:12:49 node2 systemd: Starting Session 17 of user root.
May 11 01:12:49 node2 dbus[648]: [system] Activating service 
name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Activating service 
name='org.freedesktop.problems' (using servicehelper)
May 11 01:12:49 node2 dbus[648]: [system] Successfully activated service 
'org.freedesktop.problems'
May 11 01:12:49 node2 dbus-daemon: dbus[648]: [system] Successfully activated 
service 'org.freedesktop.problems'
May 11 01:12:49 node2 journal: g_dbus_interface_skeleton_unexport: assertion 
'interface_->priv->connections != NULL' failed

Here is the log on the peer node:
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: No secondary connected 
to the master
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: WARNING: "sds2" is not connected 
to the primary
May 11 01:09:08 node1 pgsqlms(pgsqld)[28599]: INFO: Execute action monitor and 
the result 8
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: No secondary connected 
to the master
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: WARNING: "sds2" is not connected 
to the primary
May 11 01:09:18 node1 pgsqlms(pgsqld)[28679]: INFO: Execute action monitor and 
the result 8
May 11 01:09:24 node1 crmd[]:  notice: sds1-pgsqld_monitor_1:19 [ 
/tmp:5866 - accepting connections\n ]
May 11 01:09:24 node1 crmd[]:  notice: Transition aborted by deletion of 
lrm_resource[@id='pgsqld']: Resource state removal
May 11 01:10:02 node1 systemd: Started Session 17 of user root.
May 11 01:10:02 node1 systemd: Starting Session 17 of user root.
May 11 01:11:33 node1 pacemakerd[1042]:  notice: Caught 'Terminated' signal
May 11 01:11:33 node1 systemd: Stopping Pacemaker High Availability Cluster 
Manager...
May 11 01:11:33 node1 pacemakerd[1042]:  notice: Shutting down Pacemaker
May 11 01:11:33 node1 pacemakerd[1042]:  notice: Stopping crmd
May 11 01:11:33 node1 crmd[]:  notice: Caught 'Terminated' signal
May 11 01:11:33 node1 crmd[]:  notice: Shutting down cluster resource 
manager
May 11 01:11:33 node1 crmd[]: warning: Input I_SHUTDOWN received in state 
S_TRANSITION_ENGINE from crm_shutdown


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] What is the mechanism for pacemaker to recovery resources

2018-05-10 Thread Ken Gaillot
On Thu, 2018-05-10 at 22:02 +0800, lkxjtu wrote:
> 
> Great! These two parameters (batch-limit & node-action-limit) solve
> my problem. Thank you very much!
> 
> By the way, is there any way to know the number of parallel actions
> on a node and in the cluster?

If you set PCMK_debug=crmd (or pacemaker-controld in the soon-to-be-
released 2.0.0), then the detail log on each node will have messages
like:

debug: Current load is 0.57 across 1 core(s)

and

debug: Host rhel7-1 supports a maximum of 2 jobs and throttle mode
.  New job limit is 2

Of course your logs will grow faster with debug turned on ...

Otherwise there's no simple way to know. It might be nice to have a
command-line option to query the current values.
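In the meantime, the current job limit can at least be scraped from the 
detail log. A small sketch using a fabricated sample of the debug lines 
quoted above (the file path and exact message format here are assumptions 
based on this thread, not a documented interface):

```shell
#!/bin/sh
# Fabricated sample of the throttle debug messages described above
cat > /tmp/pcmk_detail_sample.log <<'EOF'
May 10 12:00:01 node1 crmd[1234]:   debug: Current load is 0.57 across 1 core(s)
May 10 12:00:01 node1 crmd[1234]:   debug: Host rhel7-1 supports a maximum of 2 jobs and throttle mode 0000.  New job limit is 2
EOF

# Extract the most recently reported per-node job limit
sed -n 's/.*New job limit is \([0-9]*\).*/\1/p' /tmp/pcmk_detail_sample.log | tail -n 1
```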

> At 2018-05-10 20:56:27, "lkxjtu"  wrote:
> On Tue, 2018-05-08 at 23:52 +0800, lkxjtu wrote:
> > I have a three node cluster of about 50 resources. When I reboot
> > three nodes at the same time, I observe the resources with "crm
> > status". I found that pacemaker starts 3-5 resources at a time, from
> > top to bottom, rather than starting all at the same time. Is there
> > any parameter controlling this?
> > It seems to be acceptable. But if there is a resource that cannot
> > start up because of an exception, recovery of the later resources
> > becomes very slow. I don't know the principle by which Pacemaker
> > recovers resources, in particular their order and priority. Are
> > there any suggestions? Thank you very much!
>
> There are a few things affecting start-up order.
>
> First (obviously) is your constraints. If you have any ordering
> constraints, they will enforce the configured order.
>
> Second is internal constraints. Pacemaker has certain built-in
> constraints for safety. This includes obvious logical requirements
> such as starting a resource before promoting it. Pacemaker will do a
> probe (one-time monitor) of each resource on each node to find its
> initial state; everything is ordered after those probes. A clone won't
> be promoted until all pending starts complete.
>
> Last is throttling. By default Pacemaker computes a maximum number of
> jobs that can be executed at once across the entire cluster, and for
> each node. The number is based on observed CPU load on the nodes (and
> thus depends partly on the number of CPU cores). Usually it is best to
> allow Pacemaker to calculate the throttling, but you can force
> particular values by setting:
> - node-action-limit: a cluster-wide property specifying the maximum
>   number of actions that can be executed at once on any one node.
> - PCMK_node_action_limit: an environment variable specifying the same
>   thing, but which can be configured differently per node.
> - batch-limit: a cluster-wide property specifying the maximum number
>   of actions that can be executed at once across the entire cluster.
> The purpose of throttling is to keep Pacemaker from overloading the
> nodes such that actions might start timing out, causing unnecessary
> recovery.
>
> lkxjtu
> Email: lkx...@163.com
> Signature customized by NetEase Mail Master
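As a concrete illustration of the first point, an explicit ordering 
constraint looks like this in pcs (the resource names are borrowed from the 
first thread above purely as an example):

```shell
# Illustrative: start the promotable database before its VIP group
pcs constraint order start pgsql-ha then start mastergroup
```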
-- 
Ken Gaillot 


Re: [ClusterLabs] What is the mechanism for pacemaker to recovery resources

2018-05-10 Thread lkxjtu

Great! These two parameters (batch-limit & node-action-limit) solve my problem. 
Thank you very much!

By the way, is there any way to know the number of parallel actions on a node 
and in the cluster?




At 2018-05-10 20:56:27, "lkxjtu"  wrote:

On Tue, 2018-05-08 at 23:52 +0800, lkxjtu wrote:
> I have a three node cluster of about 50 resources. When I reboot
> three nodes at the same time, I observe the resources with "crm
> status". I found that pacemaker starts 3-5 resources at a time, from
> top to bottom, rather than starting all at the same time. Is there
> any parameter controlling this?
> It seems to be acceptable. But if there is a resource that cannot
> start up because of an exception, recovery of the later resources
> becomes very slow. I don't know the principle by which Pacemaker
> recovers resources, in particular their order and priority. Are there
> any suggestions? Thank you very much!

There are a few things affecting start-up order.

First (obviously) is your constraints. If you have any ordering constraints, 
they will enforce the configured order.

Second is internal constraints. Pacemaker has certain built-in constraints 
for safety. This includes obvious logical requirements such as starting a 
resource before promoting it. Pacemaker will do a probe (one-time monitor) of 
each resource on each node to find its initial state; everything is ordered 
after those probes. A clone won't be promoted until all pending starts 
complete.

Last is throttling. By default Pacemaker computes a maximum number of jobs 
that can be executed at once across the entire cluster, and for each node. 
The number is based on observed CPU load on the nodes (and thus depends 
partly on the number of CPU cores). Usually it is best to allow Pacemaker to 
calculate the throttling, but you can force particular values by setting:
- node-action-limit: a cluster-wide property specifying the maximum number of
  actions that can be executed at once on any one node.
- PCMK_node_action_limit: an environment variable specifying the same thing,
  but which can be configured differently per node.
- batch-limit: a cluster-wide property specifying the maximum number of
  actions that can be executed at once across the entire cluster.
The purpose of throttling is to keep Pacemaker from overloading the nodes 
such that actions might start timing out, causing unnecessary recovery.
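Assuming a pcs-managed cluster, the two cluster-wide properties above are set 
like any other property. The values below are only illustrative; as noted, it 
is usually better to let Pacemaker calculate them:

```shell
# Illustrative values only; "pcs property unset" returns to automatic throttling
pcs property set node-action-limit=4
pcs property set batch-limit=30

# Per-node override: set in /etc/sysconfig/pacemaker on the node itself
# PCMK_node_action_limit=2
```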


lkxjtu
Email: lkx...@163.com
Signature customized by NetEase Mail Master