Silly question: did you actually enable stonith? Can you share your config?
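(For anyone following along: `stonith-enabled` lives in the CIB's cluster options. A minimal sketch of checking it — the XML fragment below is illustrative only, not taken from this cluster; on a live node you would query the real CIB with `crm_attribute` or `pcs` instead:)

```shell
# Illustrative CIB fragment (NOT from this cluster). On a live node you
# would query the real CIB, e.g.:
#   crm_attribute --type crm_config --name stonith-enabled --query
#   pcs property show stonith-enabled
cat > /tmp/cib-sample.xml <<'EOF'
<crm_config>
  <cluster_property_set id="cib-bootstrap-options">
    <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
  </cluster_property_set>
</crm_config>
EOF

# Pull out the current setting; value="false" means fencing is disabled,
# so a lost node is never fenced and recovery cannot proceed safely.
grep -o 'name="stonith-enabled" value="[^"]*"' /tmp/cib-sample.xml
```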
digimer

On 2018-06-20 06:04 PM, Casey & Gina wrote:
>> On 2018-06-20, at 3:59 PM, Casey & Gina <caseyandg...@icloud.com> wrote:
>>
>>> Get the cluster healthy, tail the system logs from both nodes, trigger a
>>> fault and wait for things to settle. Then share the logs please.
>>
>> What do you mean by "system logs"? Do you mean the corosync.log?
>> Triggering a fault is powering off a node, so I can't get a tailed log
>> file from that host. Is there another mechanism I should try?
>
> Sorry, I did a little more research. I guess you mean the syslog, and
> realized I could `killall -9 corosync` to trigger a failure. Let me know
> if there is a better way or this is okay...
>
> Here are the logs:
>
> Node that was "master" to start with, that I did not kill corosync on:
>
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]: notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-2, call=36, rc=0, cib-update=0, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]: notice: Transition 5 (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[15918]: INFO: Update score of "d-gp2-dbpg64-1" from -1000 to 1000 because of a change in the replication lag (0).
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]: notice: On loss of CCM Quorum: Ignore
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]: notice: Transition 6 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]: notice: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-59.bz2
> Jun 20 21:58:13 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
> Jun 20 22:01:13 d-gp2-dbpg64-2 snmpd[1468]: message repeated 6 times: [ error on subcontainer 'ia_addr' insert (-1)]
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: notice [TOTEM ] A processor failed, forming new configuration.
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: [TOTEM ] A processor failed, forming new configuration.
> Jun 20 22:01:43 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' insert (-1)
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: [TOTEM ] A new membership (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: [TOTEM ] Failed to receive the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 pacemakerd[6716]: notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]: notice: crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]: notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]: notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]: notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]: notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]: notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]: notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]: notice: crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]: notice: Removing d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]: notice: Purged 1 peers with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]: notice: On loss of CCM Quorum: Ignore
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]: notice: Transition 7 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]: notice: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-60.bz2
> Jun 20 22:01:57 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[17381]: INFO: Ignoring unknown application_name/node "d-gp2-dbpg64-1"
>
> Node that was a standby, which I kill -9'd corosync on:
>
> Jun 20 21:57:52 d-gp2-dbpg64-1 stonith-ng[2035]: notice: On loss of CCM Quorum: Ignore
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]: notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 stonith-ng[2035]: notice: Versions did not change in patch 0.81.8
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]: notice: Operation postgresql-master-vip_monitor_0: not running (node=d-gp2-dbpg64-1, call=5, rc=7, cib-update=12, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]: notice: Operation postgresql-10-main_monitor_0: not running (node=d-gp2-dbpg64-1, call=10, rc=7, cib-update=13, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]: notice: d-gp2-dbpg64-1-postgresql-10-main_monitor_0:10 [ /var/run/postgresql:5432 - no response\npg_ctl: no server running\n ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]: notice: Operation vfencing_monitor_0: not running (node=d-gp2-dbpg64-1, call=14, rc=7, cib-update=14, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 pgsqlms(postgresql-10-main)[2155]: INFO: Instance "postgresql-10-main" started
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]: notice: Operation postgresql-10-main_start_0: ok (node=d-gp2-dbpg64-1, call=15, rc=0, cib-update=15, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]: notice: Operation postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-1, call=16, rc=0, cib-update=0, confirmed=true)
> Jun 20 22:01:32 d-gp2-dbpg64-1 systemd[1]: Started Session 2 of user cshobe.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Main process exited, code=killed, status=9/KILL
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Unit entered failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Failed with result 'signal'.
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]: warning: new_event_notification (2036-2039-8): Bad file descriptor (9)
> Jun 20 22:01:41 d-gp2-dbpg64-1 cib[2034]: error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2035]: error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]: error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]: notice: Disconnecting client 0x559e6c2c8810, pid=2039...
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]: error: Connection to stonith-ng failed
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]: error: Connection to stonith-ng[0x55852f94ff10] closed (I/O condition=17)
> Jun 20 22:01:41 d-gp2-dbpg64-1 pacemakerd[2030]: error: Connection to the CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]: notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]: notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]: notice: Connecting to cluster infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]: error: Could not connect to the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]: notice: Connecting to cluster infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 crmd[2648]: notice: Additional logging available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]: error: Could not connect to the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [ 393.367015] show_signal_msg: 15 callbacks suppressed
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [ 393.367020] attrd[2647]: segfault at 1b8 ip 00007f8a4813a870 sp 00007ffc7a76f398 error 4 in libqb.so.0.17.2[7f8a4812d000+21000]
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Main process exited, code=exited, status=107/n/a
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Unit entered failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Failed with result 'exit-code'.
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould