[ClusterLabs] Pacemaker crashed and produced a coredump file

2020-07-29 Thread lkxjtu
Hi Reid Wahl,


There is more log information below. The cause appears to be that communication with 
D-Bus timed out. Any suggestions?


1672712 Jul 24 21:20:17 [3945305] B0610011   lrmd: info: 
pcmk_dbus_timeout_dispatch:Timeout 0x147bbd0 expired
1672713 Jul 24 21:20:17 [3945305] B0610011   lrmd: info: 
pcmk_dbus_find_error:  LoadUnit error 'org.freedesktop.DBus.Error.NoReply': Did 
not receive a reply. Possible causes include: the remote application 
did not send a reply, the message bus security policy blocked the reply, the 
reply timeout expired, or the network connection was broken.
1672714 Jul 24 21:20:17 [3945305] B0610011   lrmd:error: 
systemd_loadunit_result:   Unexepcted DBus type, expected o in 's' instead 
of s
1672715 Jul 24 21:20:17 [3945305] B0610011   lrmd:error: crm_abort: 
systemd_unit_exec_with_unit: Triggered fatal assert at systemd.c:514 : unit
1672716 2020-07-24T21:20:17.701484+08:00 B0610011 lrmd[3945305]:error: 
systemd_loadunit_result: Unexepcted DBus type, expected o in 's' instead of s
1672717 2020-07-24T21:20:17.701517+08:00 B0610011 lrmd[3945305]:error: 
crm_abort: systemd_unit_exec_with_unit: Triggered fatal assert at systemd.c:514 
: unit
1672718 Jul 24 21:20:17 [3945306] B0610011   crmd:error: crm_ipc_read:  
Connection to lrmd failed



> Hi,
>
> It looks like this is a bug that was fixed in later releases. The `path`
> variable was a null pointer when it was passed to
> `systemd_unit_exec_with_unit` as the `unit` argument. Commit 62a0d26a
> 
> adds a null check to the `path` variable before using it to call 
> `systemd_unit_exec_with_unit`.
>
> I believe pacemaker-1.1.15-11.el7 is the first RHEL pacemaker release that
> contains the fix. Can you upgrade and see if the issue is resolved?
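
For reference, a minimal sketch of checking the installed build against that fix and 
pulling the update, assuming the stock RHEL 7 repositories (package names and commands 
may differ elsewhere):

```
# Check which pacemaker build is installed (the fix is reported in 1.1.15-11.el7 and later)
rpm -q pacemaker

# Pull a newer build from the configured repositories, then restart the stack on this node
yum update pacemaker pacemaker-libs pacemaker-cluster-libs
systemctl restart pacemaker
```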


[ClusterLabs] Pacemaker crashed and produced a coredump file

2020-07-29 Thread lkxjtu
RPM Version Information:
corosync-2.3.4-7.el7_2.1.x86_64
pacemaker-1.1.12-22.el7.x86_64




Coredump file backtrace:

```
warning: .dynamic section for "/lib64/libk5crypto.so.3" is not at the expected 
address (wrong library or version mismatch?)
Missing separate debuginfo for
Try: yum --enablerepo='*debug*' install 
/usr/lib/debug/.build-id/91/375124d864f2692ced1c4a5f090826b7074dc0
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/pacemaker/lrmd'.
Program terminated with signal 6, Aborted.
#0  0x7f6ee9ed85f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install 
audit-libs-2.4.1-5.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 
corosynclib-2.3.4-7.el7_2.1.x86_64 dbus-libs-1.6.12-13.el7.x86_64 
glib2-2.42.2-5.el7.x86_64 glibc-2.17-106.el7_2.4.x86_64 
gmp-6.0.0-12.el7_1.x86_64 gnutls-3.3.8-14.el7_2.x86_64 
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 
libcom_err-1.42.9-7.el7.x86_64 libffi-3.0.13-16.el7.x86_64 
libqb-0.17.1-2.el7.1.x86_64 libselinux-2.5-12.el7.x86_64 
libtasn1-3.8-2.el7.x86_64 libtool-ltdl-2.4.2-21.el7_2.x86_64 
libuuid-2.23.2-26.el7_2.2.x86_64 libxml2-2.9.1-6.el7_2.2.x86_64 
libxslt-1.1.28-5.el7.x86_64 nettle-2.7.1-4.el7.x86_64 
openssl-libs-1.0.1e-51.el7_2.4.x86_64 p11-kit-0.20.7-3.el7.x86_64 
pam-1.1.8-12.el7_1.1.x86_64 pcre-8.32-15.el7.x86_64 
trousers-0.3.13-1.el7.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 
zlib-1.2.7-15.el7.x86_64
(gdb) bt
#0  0x7f6ee9ed85f7 in raise () from /lib64/libc.so.6
#1  0x7f6ee9ed9ce8 in abort () from /lib64/libc.so.6
#2  0x7f6eeb7e1f67 in crm_abort (file=file@entry=0x7f6eeb5c863e 
"systemd.c", function=function@entry=0x7f6eeb5c8ec0 <__FUNCTION__.33949> 
"systemd_unit_exec_with_unit", line=line@entry=514,
assert_condition=assert_condition@entry=0x7f6eeb5c8713 "unit", 
do_core=do_core@entry=1, do_fork=<optimized out>, do_fork@entry=0) at 
utils.c:1197
#3  0x7f6eeb5c5cef in systemd_unit_exec_with_unit (op=op@entry=0x13adce0, 
unit=0x0) at systemd.c:514
#4  0x7f6eeb5c5e81 in systemd_loadunit_result (reply=reply@entry=0x139f2a0, 
op=op@entry=0x13adce0) at systemd.c:175
#5  0x7f6eeb5c6181 in systemd_loadunit_cb (pending=0x13aa380, 
user_data=0x13adce0) at systemd.c:197
#6  0x7f6eeb16f862 in complete_pending_call_and_unlock () from 
/lib64/libdbus-1.so.3
#7  0x7f6eeb172b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
#8  0x7f6eeb5c1e40 in pcmk_dbus_connection_dispatch (connection=0x13a4cb0, 
new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:388
#9  0x7f6eeb171260 in _dbus_connection_update_dispatch_status_and_unlock () 
from /lib64/libdbus-1.so.3
#10 0x7f6eeb172a93 in reply_handler_timeout () from /lib64/libdbus-1.so.3
#11 0x7f6eeb5c1daf in pcmk_dbus_timeout_dispatch (data=0x13aa660) at 
dbus.c:491
#12 0x7f6ee97a21c3 in g_timeout_dispatch () from /lib64/libglib-2.0.so.0
#13 0x7f6ee97a17aa in g_main_context_dispatch () from 
/lib64/libglib-2.0.so.0
#14 0x7f6ee97a1af8 in g_main_context_iterate.isra.24 () from 
/lib64/libglib-2.0.so.0
#15 0x7f6ee97a1dca in g_main_loop_run () from /lib64/libglib-2.0.so.0
#16 0x00402824 in main (argc=<optimized out>, argv=0x7ffce752b258) at 
main.c:344
(gdb) up
#1  0x7f6ee9ed9ce8 in abort () from /lib64/libc.so.6
(gdb) up
#2  0x7f6eeb7e1f67 in crm_abort (file=file@entry=0x7f6eeb5c863e 
"systemd.c", function=function@entry=0x7f6eeb5c8ec0 <__FUNCTION__.33949> 
"systemd_unit_exec_with_unit", line=line@entry=514,
assert_condition=assert_condition@entry=0x7f6eeb5c8713 "unit", 
do_core=do_core@entry=1, do_fork=<optimized out>, do_fork@entry=0) at 
utils.c:1197
1197        abort();
(gdb) up
#3  0x7f6eeb5c5cef in systemd_unit_exec_with_unit (op=op@entry=0x13adce0, 
unit=0x0) at systemd.c:514
514 CRM_ASSERT(unit);
(gdb) up
#4  0x7f6eeb5c5e81 in systemd_loadunit_result (reply=reply@entry=0x139f2a0, 
op=op@entry=0x13adce0) at systemd.c:175
175 systemd_unit_exec_with_unit(op, path);
(gdb) up
#5  0x7f6eeb5c6181 in systemd_loadunit_cb (pending=0x13aa380, 
user_data=0x13adce0) at systemd.c:197
197 systemd_loadunit_result(reply, user_data);
(gdb) up
#6  0x7f6eeb16f862 in complete_pending_call_and_unlock () from 
/lib64/libdbus-1.so.3
(gdb) up
#7  0x7f6eeb172b51 in dbus_connection_dispatch () from /lib64/libdbus-1.so.3
(gdb) up
#8  0x7f6eeb5c1e40 in pcmk_dbus_connection_dispatch (connection=0x13a4cb0, 
new_status=DBUS_DISPATCH_DATA_REMAINS, data=0x0) at dbus.c:388
388 dbus_connection_dispatch(connection);
(gdb) up
#9  0x7f6eeb171260 in _dbus_connection_update_dispatch_status_and_unlock () 
from /lib64/libdbus-1.so.3
(gdb) up
#10 0x7f6eeb172a93 in reply_handler_timeout () from /lib64/libdbus-1.so.3
(gdb) up
#11 0x7f6eeb5c1daf in pcmk_dbus_timeout_dispatch (data=0x13aa660) at 
dbus.c:491
491 dbus_timeout_handle(data);
(gdb) up
#12 0x7f6ee97a21c3 in g_timeout_dispatch () from /lib64/libglib-2.0.so.0
```

[ClusterLabs] Keep printing "Sent 0 CPG messages" in corosync.log

2018-09-29 Thread lkxjtu


corosync.log has kept printing the following messages for several days. What is wrong 
with the corosync cluster? The CPU load is not high now.

Cluster version information:
[root@paas-controller-172-167-40-24:~]$ rpm -q corosync
corosync-2.4.0-9.el7_4.2.x86_64
[root@paas-controller-172-167-40-24:~]$ rpm -q pacemaker
pacemaker-1.1.16-12.el7_4.2.x86_64



Sep 30 01:23:27 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:28 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:28 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:28 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:28 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:28 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:28 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:28 [127667] paas-controller-172-21-0-2 corosync warning [MAIN  ] 
timer_function_scheduler_timeout Corosync main process was not scheduled for 
10470.3652 ms (threshold is 2400. ms). Consider token timeout increase.
Sep 30 01:23:29 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:29 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:29 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:29 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:29 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:29 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:29 [127667] paas-controller-172-21-0-2 corosync notice  [TOTEM ] 
pause_flush Process pause detected for 8760 ms, flushing membership messages.
Sep 30 01:23:30 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:30 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:30 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:30 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:30 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:30 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:31 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
Sep 30 01:23:31 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)
Sep 30 01:23:31 [128234] paas-controller-172-21-0-2  attrd: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=14519): Try again (6)
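
The "Try again (6)" is corosync's CS_ERR_TRY_AGAIN, and together with the "main process 
was not scheduled" warning above it points at corosync being starved of CPU rather than a 
network fault. As the log itself suggests, one mitigation is a larger token timeout; a 
sketch of the change, with a purely illustrative value that must be tuned per environment:

```
# Illustrative only: in /etc/corosync/corosync.conf, raise the totem token timeout (ms):
#
#   totem {
#       version: 2
#       token: 10000
#       ...
#   }
#
# then apply it on each node in turn, e.g.:
systemctl stop pacemaker && systemctl restart corosync && systemctl start pacemaker
```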


[ClusterLabs] The crmd process exited: Generic Pacemaker error (201)

2018-09-29 Thread lkxjtu


Version information
[root@paas-controller-172-167-40-24:~]$ rpm -q corosync
corosync-2.4.0-9.el7_4.2.x86_64
[root@paas-controller-172-167-40-24:~]$ rpm -q pacemaker
pacemaker-1.1.16-12.el7_4.2.x86_64

The crmd process exited with error code 201. The pacemakerd process tried to respawn it 
100 times, exceeding the threshold, after which crmd stayed down for good. Below is the 
log from the last attempt to fork the crmd process.

I have two questions. First, why does the crmd process exit? Second, can I set the 
threshold for the number of retries? Thank you very much!



Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:error: 
pcmk_child_exit:The crmd process (83749) exited: Generic Pacemaker error 
(201)
Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:   notice: 
pcmk_process_exit:  Respawning failed child process: crmd
Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd: info: 
start_child:Using uid=189 and group=189 for process crmd
Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd: info: 
start_child:Forked child 88033 for process crmd
Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd: info: 
mcp_cpg_deliver:Ignoring process list sent by peer for local node
Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd: info: 
mcp_cpg_deliver:Ignoring process list sent by peer for local node
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
crm_log_init:   Changed active directory to /var/lib/pacemaker/cores
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
main:   CRM Git Version: 1.1.16-12.el7_4.2 (94ff4df)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_log: Input I_STARTUP received in state S_STARTING from crmd_init
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
get_cluster_type:   Verifying cluster type: 'corosync'
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
get_cluster_type:   Assuming an active 'corosync' cluster
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_cib_control: CIB connection established
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:   notice: 
crm_cluster_connect:Connecting to cluster infrastructure: corosync
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
crm_get_peer:   Created entry 
ebd1fc7d-5c48-4c81-85ec-bad8a3f6fcb1/0x7fe04dec49a0 for node 
172.167.40.24/167040024 (1 total)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
crm_get_peer:   Node 167040024 is now known as 172.167.40.24
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
peer_update_callback:   172.167.40.24 is now in unknown state
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
crm_get_peer:   Node 167040024 has uuid 167040024
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
crm_update_peer_proc:   cluster_connect_cpg: Node 172.167.40.24[167040024] 
- corosync-cpg is now online
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
peer_update_callback:   Client 172.167.40.24/peer now has status [online] 
(DC=, changed=400)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
init_cs_connection_once:Connection to 'corosync': established
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:   notice: 
cluster_connect_quorum: Quorum acquired
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_ha_control:  Connected to the cluster
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
lrmd_ipc_connect:   Connecting to lrmd
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_lrm_control: LRM connection established
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_started: Delaying start, no membership data (0010)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_started: Delaying start, no membership data (0010)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
pcmk_quorum_notification:   Quorum retained  membership=4 members=1
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:   notice: 
crm_update_peer_state_iter: Node 172.167.40.24 state is now member  
nodeid=167040024 previous=unknown source=pcmk_quorum_notification
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
peer_update_callback:   Cluster node 172.167.40.24 is now member (was in 
unknown state)
Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd: info: 
do_started: Delaying start, Config not read 

[ClusterLabs] How does failure-timeout work? Will the resource not be scheduled when it is set too short?

2018-05-19 Thread lkxjtu

I have two Pacemaker resources; call them A and B. For environmental reasons, their start 
and monitor methods always return failure (OCF_ERR_GENERIC). Their configurations are 
below (the cluster property start-failure-is-fatal is false):

primitive A A \
op monitor interval=20 timeout=120 \
op stop interval=0 timeout=120 on-fail=restart \
op start interval=0 timeout=240 on-fail=restart \
meta failure-timeout=60s
primitive B B \
op monitor interval=20 timeout=120 \
op stop interval=0 timeout=120 on-fail=restart \
op start interval=0 timeout=240 on-fail=restart \
meta failure-timeout=60s
clone A_cl A
clone B_cl B

The time their methods take differs:
A:
start = 60s    monitor < 1s    stop = 80s
B:
start < 1s     monitor < 1s    stop < 1s

Resource A is scheduled normally and keeps starting and stopping. For resource B, 
however, there are only repeated monitor failures, with no start or stop, and no 
fail-count for B shows up in "crm status -f".
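
(For reference, fail counts can also be inspected and cleared from the command line, 
assuming the standard Pacemaker tools are available; a quick sketch:)

```
# One-shot status including fail counts (same data as "crm status -f")
crm_mon -1 -f

# Clear B's failure history once the underlying problem is fixed
crm_resource --cleanup --resource B
```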

Either of two changes makes B get scheduled again:
1. Set B's failure-timeout from 60s to 600s.
2. Modify A's OCF agent so that the stop method returns as soon as possible.

I tested this several times with the same results. Why is the resource not scheduled when 
failure-timeout is set too short? And what does it have to do with how long another 
resource's stop takes? Is this a bug?

My pacemaker version is 1.1.16. Any suggestions are welcome. Thank you!


James
2018-05-20


Re: [ClusterLabs] What is the mechanism for pacemaker to recover resources

2018-05-10 Thread lkxjtu

Great! These two parameters (batch-limit & node-action-limit) solve my problem. 
Thank you very much!

By the way, is there any way to know the number of parallel actions on a node and across 
the cluster?




At 2018-05-10 20:56:27, "lkxjtu" <lkx...@163.com> wrote:

On Tue, 2018-05-08 at 23:52 +0800, lkxjtu wrote:
> I have a three node cluster of about 50 resources. When I reboot
> three nodes at the same time, I observe the resource by "crm status".
> I found that pacemaker starts 3-5 resources at a time, from top to
> bottom, rather than start all at the same time. Is there any
> parameter control?
> It seems to be acceptable. But if there is a resource that can not
> start up because of a exception, the latter resources recovery will
> become very slow.I don't know the principle of pacemaker recovery
> resources.In particular, order and priority.Is there any
> suggestions?Thank you very much!
There are a few things affecting start-up order.

First (obviously) is your constraints. If you have any ordering constraints, they will 
enforce the configured order.

Second is internal constraints. Pacemaker has certain built-in constraints for safety. 
This includes obvious logical requirements such as starting a resource before promoting 
it. Pacemaker will do a probe (one-time monitor) of each resource on each node to find 
its initial state; everything is ordered after those probes. A clone won't be promoted 
until all pending starts complete.

Last is throttling. By default Pacemaker computes a maximum number of jobs that can be 
executed at once across the entire cluster, and for each node. The number is based on 
observed CPU load on the nodes (and thus depends partly on the number of CPU cores). 
Usually it is best to allow Pacemaker to calculate the throttling, but you can force 
particular values by setting:

- node-action-limit: a cluster-wide property specifying the maximum number of actions 
  that can be executed at once on any one node.
- PCMK_node_action_limit: an environment variable specifying the same thing, but which 
  can be configured differently per node.
- batch-limit: a cluster-wide property specifying the maximum number of actions that can 
  be executed at once across the entire cluster.

The purpose of throttling is to keep Pacemaker from overloading the nodes such that 
actions might start timing out, causing unnecessary recovery.
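
A sketch of how these limits could be set explicitly with the crm shell used elsewhere in 
these threads (the values are illustrative only; leaving them unset lets Pacemaker 
calculate them from CPU load):

```
# Cap concurrent actions per node and across the whole cluster (example values only)
crm configure property node-action-limit=2
crm configure property batch-limit=10

# Per-node override via the environment instead, e.g. in /etc/sysconfig/pacemaker:
#   PCMK_node_action_limit=2
```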




[ClusterLabs] What is the mechanism for pacemaker to recover resources

2018-05-08 Thread lkxjtu
I have a three-node cluster of about 50 resources. When I reboot all three nodes at the 
same time, I watch the resources with "crm status". I found that pacemaker starts 3-5 
resources at a time, from top to bottom, rather than starting them all at once. Is there 
any parameter controlling this?
That seems acceptable. But if one resource cannot start up because of an exception, 
recovery of the later resources becomes very slow. I don't know the principles pacemaker 
uses to recover resources, in particular ordering and priority. Are there any 
suggestions? Thank you very much!


Re: [ClusterLabs] Pacemaker resources are not scheduled

2018-04-16 Thread lkxjtu
> Lkxjtu,

> On 14/04/18 00:16 +0800, lkxjtu wrote:
>> My cluster version:
>> Corosync 2.4.0
>> Pacemaker 1.1.16
>>>> There are many resource anomalies. Some resources are only monitored
>> and not recovered. Some resources are not monitored or recovered.
>> Only one resource of vnm is scheduled normally, but this resource
>> cannot be started because other resources in the cluster are
>> abnormal. Just like a deadlock. I have been plagued by this problem
>> for a long time. I just want a stable and highly available resource
>> with infinite recovery for everyone. Is my resource configure
>> correct?

> see below

>> $ cat /etc/corosync/corosync.conf
>> [co]mpatibility: whitetank
>>>> [...]
>>>> logging {
>>   fileline:off
>>   to_stderr:   no
>>   to_logfile:  yes
>>   logfile: /root/info/logs/pacemaker_cluster/corosync.log
>>   to_syslog:   yes
>>   syslog_facility: daemon
>>   syslog_priority: info
>>   debug:   off
>>   function_name:   on
>>   timestamp:   on
>>   logger_subsys {
>> subsys: AMF
>> debug:  off
>> tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
>>   }
>> }
>>>> amf {
>>   mode: disabled
>> }
>>>> aisexec {
>>   user:  root
>>   group: root
>> }

> You are apparently mixing configuration directives for older major
> version(s) of corosync than you claim to be using.
> See corosync_conf(5) + votequorum(5) man pages for what you are
> supposed to configure with the actual version.


Thank you for your detailed answer!
corosync.conf comes from our Ansible scripts, but corosync and pacemaker are updated from 
the yum repository, which created the current mismatch. I will carefully compare the 
differences between the old and new versions.


> Regarding your pacemaker configuration:

>> $ crm configure show
>>>> [... reordered ... ]
>>>> property cib-bootstrap-options: \
>> have-watchdog=false \
>> dc-version=1.1.16-12.el7-94ff4df \
>> cluster-infrastructure=corosync \
>> stonith-enabled=false \
>> start-failure-is-fatal=false \
>> load-threshold="3200%"

> You are urged to configure fencing, otherwise asking for sane
> cluster's behaviour (which you do) is out of question, unless
> you precisely know why you are not configuring it.


My environment is a virtual machine environment, and there is no dedicated fencing 
(power) device. Can I configure fencing? How do I do it?
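
(A rough sketch of one option for VM guests, assuming the guests run under libvirt/KVM 
and the fence_virsh agent from fence-agents is available; the agent choice and all 
parameter values below are assumptions, and sbd or fence_xvm are common alternatives:)

```
# Example only: one fencing resource per guest, pointed at its hypervisor host
crm configure primitive fence_node1 stonith:fence_virsh \
    params ipaddr=hypervisor1.example.com login=root passwd=secret \
           port=guest-node1 pcmk_host_list=122.0.1.8 \
    op monitor interval=60s
crm configure property stonith-enabled=true
```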


>>>> [... reordered ... ]
>>
> Furthermore you are using custom resource agents of undisclosed
> quality and compatibility with the requirements:
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#ap-ocf>
>  
> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc

> Since your resources come in isolated groups, I would go one
> by one, trying to figure out why the group won't run as expected.

> For instance:

>> primitive inetmanager inetmanager \
>> op monitor interval=10s timeout=160 \
>> op stop interval=0 timeout=60s on-fail=restart \
>> op start interval=0 timeout=60s on-fail=restart \
>> meta migration-threshold=2 failure-timeout=60s 
>> resource-stickiness=100
>> primitive inetmanager_vip IPaddr2 \
>> params ip=122.0.1.201 cidr_netmask=24 \
>> op start interval=0 timeout=20 \
>> op stop interval=0 timeout=20 \
>> op monitor timeout=20s interval=10s depth=0 \
>> meta migration-threshold=3 failure-timeout=60s
>> [...]
>> colocation inetmanager_col +inf: inetmanager_vip inetmanager
>> order inetmanager_order Mandatory: inetmanager inetmanager_vip
>>>> [...]
>>>> $ crm status
>> [...]
>> Full list of resources:
>> [...]
>>  inetmanager_vip(ocf::heartbeat:IPaddr2):   Stopped
>>  inetmanager(ocf::heartbeat:inetmanager):   Stopped
>>>> [...]
>>>> corosync.log of node 122.0.1.10
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd:  warning: 
>> status_from_rc: Action 24 (inetmanager_monitor_0) on 122.0.1.9 failed 
>> (target: 7 vs. rc: 1): Error
>> Apr 13 23:49:56 [6137] paas-controller-122-0-1-10   crmd: info: 
>> abort_transition_graph: Transition aborted by operation 
>> inetmanager_monitor_0 'modify' on 122.0.1.9: Event failed | 
>> magic=0:1;24:360:7:a7901eb1-462f-4259-a613-e0023ce8a6be cib=0.124.2400 
>> source=match_graph_even

[ClusterLabs] Pacemaker resources are not scheduled

2018-04-13 Thread lkxjtu
My cluster version:
Corosync 2.4.0
Pacemaker 1.1.16

There are many resource anomalies. Some resources are only monitored and not recovered. 
Some resources are neither monitored nor recovered. Only one resource, vnm, is scheduled 
normally, but it cannot start because other resources in the cluster are abnormal, just 
like a deadlock. I have been plagued by this problem for a long time. I just want every 
resource to be stable and highly available, with unlimited recovery. Is my resource 
configuration correct?



$ cat /etc/corosync/corosync.conf
compatibility: whitetank

quorum {
  provider: corosync_votequorum
  two_node: 0
}

nodelist {
  node {
ring0_addr: 122.0.1.8
name: 122.0.1.8
nodeid: 01008
  }
  node {
ring0_addr: 122.0.1.9
name: 122.0.1.9
nodeid: 01009
  }
  node {
ring0_addr: 122.0.1.10
name: 122.0.1.10
nodeid: 01010
  }
}

totem {
  version: 2
  token:   3000
  token_retransmits_before_loss_const: 10
  join:60
  consensus:   3600
  vsftype: none
  max_messages:20
  clear_node_high_bit: yes
  rrp_mode:none
  secauth: off
  threads: 2
  transport:   udpu
  interface {
ringnumber:  0
bindnetaddr: 122.0.1.8
mcastport:   5405
  }
}

logging {
  fileline:off
  to_stderr:   no
  to_logfile:  yes
  logfile: /root/info/logs/pacemaker_cluster/corosync.log
  to_syslog:   yes
  syslog_facility: daemon
  syslog_priority: info
  debug:   off
  function_name:   on
  timestamp:   on
  logger_subsys {
subsys: AMF
debug:  off
tags:   enter|leave|trace1|trace2|trace3|trace4|trace6
  }
}

amf {
  mode: disabled
}

aisexec {
  user:  root
  group: root
}





$ crm configure show
node 1008: 122.0.1.8
node 1009: 122.0.1.9
node 1010: 122.0.1.10
primitive apigateway apigateway \
op monitor interval=20s timeout=220 \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta failure-timeout=60s target-role=Started
primitive apigateway_vip IPaddr2 \
params ip=122.0.1.203 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor timeout=20s interval=10s depth=0 \
meta migration-threshold=3 failure-timeout=60s
primitive inetmanager inetmanager \
op monitor interval=10s timeout=160 \
op stop interval=0 timeout=60s on-fail=restart \
op start interval=0 timeout=60s on-fail=restart \
meta migration-threshold=2 failure-timeout=60s resource-stickiness=100
primitive inetmanager_vip IPaddr2 \
params ip=122.0.1.201 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor timeout=20s interval=10s depth=0 \
meta migration-threshold=3 failure-timeout=60s
primitive logserver logserver \
op monitor interval=20s timeout=220 \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta failure-timeout=60s target-role=Started
primitive mariadb_vip IPaddr2 \
params ip=122.0.1.204 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor timeout=20s interval=10s depth=0 \
meta migration-threshold=3 failure-timeout=60s
primitive mysql mysql \
op monitor interval=20s timeout=220 \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta failure-timeout=60s target-role=Started
primitive p_rdbserver hardb_docker \
op start timeout=3600 interval=0 \
op stop timeout=1260 interval=0 \
op promote timeout=3600 interval=0 \
op monitor role=Master interval=1 timeout=30 \
op monitor interval=15 timeout=7200 \
meta migration-threshold=3 failure-timeout=3600s
primitive p_rdbvip IPaddr2 \
params ip=100.0.1.203 cidr_netmask=24 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor timeout=20s interval=10s depth=0 \
meta migration-threshold=3 failure-timeout=60s resource-stickiness=100
primitive rabbitmq rabbitmq \
op monitor interval=20s timeout=220 \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta failure-timeout=60s target-role=Started
primitive rabbitmq_vip IPaddr2 \
params ip=122.0.1.200 \
op start interval=0 timeout=20 \
op stop interval=0 timeout=20 \
op monitor timeout=20s interval=10s depth=0 \
meta 

[ClusterLabs] What do these logs mean in corosync.log

2018-02-12 Thread lkxjtu
These logs are all printed when the system is abnormal, and I am very confused about what 
they mean. Does anyone know? Thank you very much.
corosync version   2.4.0
pacemaker version  1.1.16

1)
Feb 01 10:57:58 [18927] paas-controller-192-167-0-2   crmd:  warning: 
find_xml_node:Could not find parameters in resource-agent.

2)
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:10 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1bb
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf
Feb 08 00:00:25 [32899] paas-controller-22-0-2-10 corosync notice  [TOTEM ] 
orf_token_rtr Retransmit List: 19f1cf

3)
Feb 11 22:57:17 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11533 (ratio 20:1) in 51ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 53ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11537 (ratio 20:1) in 45ms
Feb 11 22:57:21 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11514 (ratio 20:1) in 47ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 50ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11551 (ratio 20:1) in 51ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11524 (ratio 20:1) in 54ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11545 (ratio 20:1) in 60ms
Feb 11 22:57:22 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11536 (ratio 20:1) in 54ms
Feb 11 22:57:25 [5206] paas-controller-192-20-20-6cib: info: 
crm_compress_string:   Compressed 233922 bytes into 11522 (ratio 20:1) in 61ms


Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-09 Thread lkxjtu


>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
I tried adding "start-delay=60s" to the monitor operation. The first monitor really was 
delayed by 60s, but during those 60s it blocked the other resources too! The result is 
the same as sleeping in the monitor.
So I think the best method for me is to use a timestamp in the monitor function to decide 
whether it needs to return success.
Thank you very much!
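
A rough sketch of that timestamp idea for a shell-based OCF agent; the function names, 
the attribute name, and the real_health_check placeholder are made up for illustration, 
and the real agent's health check goes where indicated:

```
#!/bin/sh
# Illustrative OCF agent fragment: start records a private node attribute with a
# timestamp; monitor treats failures within a grace period after start as success.
: ${OCF_ROOT:=/usr/lib/ocf}
. ${OCF_ROOT}/lib/heartbeat/ocf-shellfuncs

GRACE=600                      # seconds the service needs before its health check matters
ATTR="myservice-last-start"    # private attribute name (illustrative)

myservice_start() {
    # ... actually start the service here ...
    attrd_updater -p -n "$ATTR" -v "$(date +%s)"   # -p keeps the attribute out of the CIB
    return $OCF_SUCCESS
}

myservice_monitor() {
    if real_health_check; then                     # placeholder for the real check
        attrd_updater -p -n "$ATTR" -D             # healthy: drop the start marker
        return $OCF_SUCCESS
    fi
    started=$(attrd_updater -p -n "$ATTR" -Q 2>/dev/null | sed -n 's/.*value="\([0-9]*\)".*/\1/p')
    now=$(date +%s)
    if [ -n "$started" ] && [ $((now - started)) -lt $GRACE ]; then
        return $OCF_SUCCESS                        # still inside the start-up grace period
    fi
    return $OCF_ERR_GENERIC
}
```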








At 2017-11-06 21:53:53, "Ken Gaillot" <kgail...@redhat.com> wrote:
>On Sat, 2017-11-04 at 22:46 +0800, lkxjtu wrote:
>> 
>> 
>> >Another possibility would be to have the start return immediately,
>> and
>> >make the monitor artificially return success for the first 10
>> minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable.
>> I tried to put the sleep into the monitor function,( I add a “sleep
>> 60” at the monitor entry for debug),  the start function returns
>> immediately. I found an interesting thing that is, at the first time
>> of monitor after start, it will block other resource too, but from
>> the second time, it won't block other resources! Is this normal?
>
>Yes, the first result is for an unknown status, but after that, the
>cluster assumes the resource is OK unless/until the monitor says
>otherwise.
>
>However, I wasn't suggesting putting a sleep inside the monitor -- I
>was just thinking of having the monitor check the time, and if it's
>within 10 minutes of start, return success.
>
>> >My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> do
>> >its usual check, and if it succeeds, remove that node attribute, but
>> if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>> This means that if it is in the desired delay, monitor should return
>> success even if healthcheck failed?
>> I think this can solve my problem except "crm status" show
>
>Yes, that's what I had in mind. The status would show "running", which
>may or may not be what you want in this case.
>
>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
>
>> At 2017-11-01 21:20:50, "Ken Gaillot" <kgail...@redhat.com> wrote:
>> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>> >> 
>> >> Thank you for your response! This means that there shoudn't be
>> long
>> >> "sleep" in ocf script.
>> >> If my service takes 10 minite from service starting to healthcheck
>> >> normally, then what shoud I do?
>> >
>> >That is a tough situation with no great answer.
>> >
>> >You can leave it as it is, and live with the delay. Note that it
>> only
>> >happens if a resource fails after the slow resource has already
>> begun
>> >starting ... if they fail at the same time (as with a node failure),
>> >the cluster will schedule recovery for both at the same time.
>> >
>> >Another possibility would be to have the start return immediately,
>> and
>> >make the monitor artificially return success for the first 10
>> minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable. My first thought on how to implement
>> this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> do
>> >its usual check, and if it succeeds, remove that node attribute, but
>> if
>> >it fails, check the node attribute to see whether it's within the
>> >desired delay.
>> >
>> >> Thank you very much!
>> >>  
>> >> > Hi,
>> >> > If I remember correctly, any pending actions from a previous
>> >> transition
>> >> > must be completed before a new transition can be calculated.
>> >> Otherwise,
>> >> > there's the possibility that the pending action could change the
>> >> state
>> >> > in a way that makes the second transition's decisions harmful.
>> >> > Theoretically (and ideally), pacemaker could figure out whether

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-04 Thread lkxjtu



>Another possibility would be to have the start return immediately, and
>make the monitor artificially return success for the first 10 minutes
>after starting. It's hacky, and it depends on your situation whether
>the behavior is acceptable.
I tried putting the sleep into the monitor function (I added a "sleep 60" at the start of 
the monitor for debugging), and the start function returns immediately. I found something 
interesting: the first monitor after start blocks the other resources too, but from the 
second monitor onward it no longer blocks them! Is this normal?



>My first thought on how to implement this
>would be to have the start action set a private node attribute
>(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>its usual check, and if it succeeds, remove that node attribute, but if
>it fails, check the node attribute to see whether it's within the
>desired delay.
This means that if it is within the desired delay, the monitor should return success even 
if the health check failed?
I think this can solve my problem, except for what "crm status" would show.





At 2017-11-01 21:20:50, "Ken Gaillot" <kgail...@redhat.com> wrote:
>On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>> 
>> Thank you for your response! This means that there shoudn't be long
>> "sleep" in ocf script.
>> If my service takes 10 minite from service starting to healthcheck
>> normally, then what shoud I do?
>
>That is a tough situation with no great answer.
>
>You can leave it as it is, and live with the delay. Note that it only
>happens if a resource fails after the slow resource has already begun
>starting ... if they fail at the same time (as with a node failure),
>the cluster will schedule recovery for both at the same time.
>
>Another possibility would be to have the start return immediately, and
>make the monitor artificially return success for the first 10 minutes
>after starting. It's hacky, and it depends on your situation whether
>the behavior is acceptable. My first thought on how to implement this
>would be to have the start action set a private node attribute
>(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>its usual check, and if it succeeds, remove that node attribute, but if
>it fails, check the node attribute to see whether it's within the
>desired delay.
>
>> Thank you very much!
>>  
>> > Hi,
>> > If I remember correctly, any pending actions from a previous
>> transition
>> > must be completed before a new transition can be calculated.
>> Otherwise,
>> > there's the possibility that the pending action could change the
>> state
>> > in a way that makes the second transition's decisions harmful.
>> > Theoretically (and ideally), pacemaker could figure out whether
>> some of
>> > the actions in the second transition would be needed regardless of
>> > whether the pending actions succeeded or failed, but in practice,
>> that
>> > would be difficult to implement (and possibly take more time to
>> > calculate than is desirable in a recovery situation).
>>  
>> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>> 
>> > I have two clone resources in my corosync/pacemaker cluster. They
>> are
>> > fm_mgt and logserver. Both of their RA is ocf. fm_mgt takes 1
>> minute
>> > to start the
>> > service(calling ocf start function for 1 minite). Configured as
>> > below:
>> > # crm configure show
>> > node 168002177: 192.168.2.177
>> > node 168002178: 192.168.2.178
>> > node 168002179: 192.168.2.179
>> > primitive fm_mgt fm_mgt \
>> > op monitor interval=20s timeout=120s \
>> > op stop interval=0 timeout=120s on-fail=restart \
>> > op start interval=0 timeout=120s on-fail=restart \
>> > meta target-role=Started
>> > primitive logserver logserver \
>> > op monitor interval=20s timeout=120s \
>> > op stop interval=0 timeout=120s on-fail=restart \
>> > op start interval=0 timeout=120s on-fail=restart \
>> > meta target-role=Started
>> > clone fm_mgt_replica fm_mgt
>> > clone logserver_replica logserver
>> > property cib-bootstrap-options: \
>> > have-watchdog=false \
>> > dc-version=1.1.13-10.el7-44eb2dd \
>> > cluster-infrastructure=corosync \
>> > stonith-enabled=false \
>> > start-failure-is-fatal=false
>> > When I kill fm_mgt service on one node,pacemaker will immediately
>> > recover it after monitor failed. This loo

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-10-27 Thread lkxjtu

Thank you for your response! This means that there shouldn't be a long "sleep" in the OCF 
script.
If my service takes 10 minutes from starting until its health check passes, what should I 
do?
Thank you very much!
 
> Hi,
> If I remember correctly, any pending actions from a previous transition
> must be completed before a new transition can be calculated. Otherwise,
> there's the possibility that the pending action could change the state
> in a way that makes the second transition's decisions harmful.
> Theoretically (and ideally), pacemaker could figure out whether some of
> the actions in the second transition would be needed regardless of
> whether the pending actions succeeded or failed, but in practice, that
> would be difficult to implement (and possibly take more time to
> calculate than is desirable in a recovery situation).
 
> On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:

> I have two clone resources in my corosync/pacemaker cluster. They are
> fm_mgt and logserver. Both of their RA is ocf. fm_mgt takes 1 minute
> to start the
> service(calling ocf start function for 1 minite). Configured as
> below:
> # crm configure show
> node 168002177: 192.168.2.177
> node 168002178: 192.168.2.178
> node 168002179: 192.168.2.179
> primitive fm_mgt fm_mgt \
> op monitor interval=20s timeout=120s \
> op stop interval=0 timeout=120s on-fail=restart \
> op start interval=0 timeout=120s on-fail=restart \
> meta target-role=Started
> primitive logserver logserver \
> op monitor interval=20s timeout=120s \
> op stop interval=0 timeout=120s on-fail=restart \
> op start interval=0 timeout=120s on-fail=restart \
> meta target-role=Started
> clone fm_mgt_replica fm_mgt
> clone logserver_replica logserver
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.13-10.el7-44eb2dd \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> start-failure-is-fatal=false
> When I kill fm_mgt service on one node,pacemaker will immediately
> recover it after monitor failed. This looks perfectly normal. But in
> this 1 minite
> of fm_mgt starting, if I kill logserver service on any node, the
> monitor will catch the fail normally too,but pacemaker will not
> restart it
> immediately but waiting for fm_mgt starting finished. After fm_mgt
> starting finished, pacemaker begin restarting logserver. It seems
> that there are
> some dependency between pacemaker resource.
> # crm status
> Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
> 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> Stack: corosync
> Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition
> with quorum
> 3 nodes and 6 resources configured
> Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> Full list of resources:
>  Clone Set: logserver_replica [logserver]
>  logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Clone Set: fm_mgt_replica [fm_mgt]
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Stopped: [ 192.168.2.177 ]
> I am confusing very much. Is there something wrong configure?Thank
> you very much!
> James
> best regards

 






[ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-10-27 Thread lkxjtu
I have two clone resources in my corosync/pacemaker cluster, fm_mgt and logserver. Both 
of their resource agents are OCF. fm_mgt takes 1 minute to start its service (the OCF 
start function runs for 1 minute). They are configured as below:

# crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
primitive logserver logserver \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.13-10.el7-44eb2dd \
cluster-infrastructure=corosync \
stonith-enabled=false \
start-failure-is-fatal=false

When I kill the fm_mgt service on one node, pacemaker immediately recovers it after the 
monitor fails. This looks perfectly normal. But during fm_mgt's one-minute start, if I 
kill the logserver service on any node, the monitor catches that failure normally too, 
yet pacemaker does not restart it immediately; it waits until fm_mgt has finished 
starting. Only after fm_mgt finishes starting does pacemaker begin restarting logserver. 
It seems there is some dependency between pacemaker resources.

# crm status
Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct 26 
06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with 
quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
 Clone Set: logserver_replica [logserver]
 logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
 Started: [ 192.168.2.178 192.168.2.179 ]
 Clone Set: fm_mgt_replica [fm_mgt]
 Started: [ 192.168.2.178 192.168.2.179 ]
 Stopped: [ 192.168.2.177 ]

I am very confused. Is something wrong with the configuration? Thank you very much!

James

best regards




Re: [ClusterLabs] dependency of pacemaker resources starting

2017-10-27 Thread lkxjtu
Does anyone have an answer to this question?

best regards


Sent from NetEase Mail for mobile


On 2017-10-25 at 23:10, lkxjtu wrote:
My problem is about pacemaker. For example, the cluster has two resources, both with OCF 
resource agents. While one resource is starting (its OCF start function is running, say 
for 1 minute), if the other resource's monitor fails during that minute, pacemaker does 
not immediately call the stop/start methods to restart it; it waits for the first 
resource to finish starting. Only after the first resource has started completely does 
the second resource begin restarting.
 
My cluster version: corosync 2.3.4 pacemaker 1.1.13
 
Configure as:
# crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
primitive logserver logserver \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.13-10.el7-44eb2dd \
cluster-infrastructure=corosync \
stonith-enabled=false \
start-failure-is-fatal=false
 
When I kill the fm_mgt service on node 177, pacemaker stops and starts it immediately 
after the first monitor failure. But if I then kill the logserver service, pacemaker does 
not restart it after its monitor fails; it waits until fm_mgt has recovered completely. 
There were 48 seconds between logserver's monitor failure and its stop/start actions.
 
# crm status
Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct 26 
06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with 
quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
 Clone Set: logserver_replica [logserver]
 logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
 Started: [ 192.168.2.178 192.168.2.179 ]
 Clone Set: fm_mgt_replica [fm_mgt]
 Started: [ 192.168.2.178 192.168.2.179 ]
 Stopped: [ 192.168.2.177 ]
 
I am very confused. Is something wrong with the configuration? Thank you very much!





[ClusterLabs] dependency of pacemaker resources starting

2017-10-25 Thread lkxjtu
My problem is about pacemaker. For example, the cluster has two resources, both with OCF 
resource agents. While one resource is starting (its OCF start function is running, say 
for 1 minute), if the other resource's monitor fails during that minute, pacemaker does 
not immediately call the stop/start methods to restart it; it waits for the first 
resource to finish starting. Only after the first resource has started completely does 
the second resource begin restarting.
 
My cluster version: corosync 2.3.4 pacemaker 1.1.13
 
Configure as:
# crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
primitive logserver logserver \
op monitor interval=20s timeout=120s \
op stop interval=0 timeout=120s on-fail=restart \
op start interval=0 timeout=120s on-fail=restart \
meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.13-10.el7-44eb2dd \
cluster-infrastructure=corosync \
stonith-enabled=false \
start-failure-is-fatal=false
 
When I kill the fm_mgt service on node 177, pacemaker stops and starts it immediately 
after the first monitor failure. But if I then kill the logserver service, pacemaker does 
not restart it after its monitor fails; it waits until fm_mgt has recovered completely. 
There were 48 seconds between logserver's monitor failure and its stop/start actions.
 
# crm status
Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct 26 
06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with 
quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
 Clone Set: logserver_replica [logserver]
 logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
 Started: [ 192.168.2.178 192.168.2.179 ]
 Clone Set: fm_mgt_replica [fm_mgt]
 Started: [ 192.168.2.178 192.168.2.179 ]
 Stopped: [ 192.168.2.177 ]
 
I am very confused. Is something wrong with the configuration? Thank you very much!