> On 2018-06-20, at 3:59 PM, Casey & Gina <caseyandg...@icloud.com> wrote:
> 
>> Get the cluster healthy, tail the system logs from both nodes, trigger a
>> fault and wait for things to settle. Then share the logs please.
> 
> What do you mean by "system logs"?  Do you mean the corosync.log?  Triggering 
> a fault is powering off a node, so I can't get a tailed log file from that 
> host.  Is there another mechanism I should try?

Sorry, I did a little more research.  I guess you mean the syslog, and I 
realized I could `killall -9 corosync` to trigger a failure.  Let me know if 
there is a better way, or whether this is okay...
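
For reference, here is roughly what I did on each node (the syslog path 
assumes the Ubuntu default, so adjust if your setup differs):

   # On both nodes, follow the syslog while the test runs:
   tail -f /var/log/syslog

   # On the standby node (d-gp2-dbpg64-1), simulate a corosync failure:
   killall -9 corosync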

Here are the logs:

Node that was "master" to start with, that I did not kill corosync on:

Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Operation 
postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-2, call=36, rc=0, 
cib-update=0, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 5 (Complete=12, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete
Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[15918]: INFO: Update 
score of "d-gp2-dbpg64-1" from -1000 to 1000 because of a change in the 
replication lag (0).
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: 
Ignore
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 6 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 
6: /var/lib/pacemaker/pengine/pe-input-59.bz2
Jun 20 21:58:13 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' 
insert (-1)
Jun 20 22:01:13 d-gp2-dbpg64-2 snmpd[1468]: message repeated 6 times: [ error 
on subcontainer 'ia_addr' insert (-1)]
Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A processor 
failed, forming new configuration.
Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A processor failed, 
forming new configuration.
Jun 20 22:01:43 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' 
insert (-1)
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A new 
membership (10.124.164.249:260) was formed. Members left: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] Failed to 
receive the leave message. failed: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A new membership 
(10.124.164.249:260) was formed. Members left: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] Failed to receive the 
leave message. failed: 1
Jun 20 22:01:43 d-gp2-dbpg64-2 pacemakerd[6716]:   notice: 
crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: crm_reap_unseen_nodes: 
Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: crm_update_peer_proc: 
Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: 
crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Removing d-gp2-dbpg64-1/1 
from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Purged 1 peers with id=1 
and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Removing 
d-gp2-dbpg64-1/1 from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Purged 1 peers with 
id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: crm_update_peer_proc: Node 
d-gp2-dbpg64-1[1] - state is now lost (was member)
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Removing d-gp2-dbpg64-1/1 
from the membership list
Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Purged 1 peers with id=1 
and/or uname=d-gp2-dbpg64-1 from the membership cache
Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM Quorum: 
Ignore
Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 7 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 
7: /var/lib/pacemaker/pengine/pe-input-60.bz2
Jun 20 22:01:57 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[17381]: INFO: 
Ignoring unknown application_name/node "d-gp2-dbpg64-1"

Node that was a standby, which I kill -9'd corosync on:

Jun 20 21:57:52 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: On loss of CCM 
Quorum: Ignore
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: State transition S_PENDING 
-> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
origin=do_cl_join_finalize_respond ]
Jun 20 21:57:54 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: Versions did not 
change in patch 0.81.8
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
postgresql-master-vip_monitor_0: not running (node=d-gp2-dbpg64-1, call=5, 
rc=7, cib-update=12, confirmed=true)
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
postgresql-10-main_monitor_0: not running (node=d-gp2-dbpg64-1, call=10, rc=7, 
cib-update=13, confirmed=true)
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: 
d-gp2-dbpg64-1-postgresql-10-main_monitor_0:10 [ /var/run/postgresql:5432 - no 
response\npg_ctl: no server running\n ]
Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
vfencing_monitor_0: not running (node=d-gp2-dbpg64-1, call=14, rc=7, 
cib-update=14, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-1 pgsqlms(postgresql-10-main)[2155]: INFO: 
Instance "postgresql-10-main" started
Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
postgresql-10-main_start_0: ok (node=d-gp2-dbpg64-1, call=15, rc=0, 
cib-update=15, confirmed=true)
Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-1, call=16, rc=0, 
cib-update=0, confirmed=true)
Jun 20 22:01:32 d-gp2-dbpg64-1 systemd[1]: Started Session 2 of user cshobe.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Main process 
exited, code=killed, status=9/KILL
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Unit entered 
failed state.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Failed with result 
'signal'.
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:  warning: new_event_notification 
(2036-2039-8): Bad file descriptor (9)
Jun 20 22:01:41 d-gp2-dbpg64-1 cib[2034]:    error: Connection to the CPG API 
failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2035]:    error: Connection to the 
CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:    error: Connection to the CPG API 
failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:   notice: Disconnecting client 
0x559e6c2c8810, pid=2039...
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng 
failed
Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to 
stonith-ng[0x55852f94ff10] closed (I/O condition=17)
Jun 20 22:01:41 d-gp2-dbpg64-1 pacemakerd[2030]:    error: Connection to the 
CPG API failed: Library error (2)
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Additional logging 
available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Additional logging 
available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Connecting to cluster 
infrastructure: corosync
Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:    error: Could not connect to the 
Cluster Process Group API: 2
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Connecting to 
cluster infrastructure: corosync
Jun 20 22:01:41 d-gp2-dbpg64-1 crmd[2648]:   notice: Additional logging 
available in /var/log/corosync/corosync.log
Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:    error: Could not connect to 
the Cluster Process Group API: 2
Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367015] show_signal_msg: 15 
callbacks suppressed
Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367020] attrd[2647]: segfault at 
1b8 ip 00007f8a4813a870 sp 00007ffc7a76f398 error 4 in 
libqb.so.0.17.2[7f8a4812d000+21000]
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Main process 
exited, code=exited, status=107/n/a
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Unit entered 
failed state.
Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Failed with 
result 'exit-code'.

