Silly question: did you actually enable stonith? Can you share your config?
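
The reason I ask: in the surviving node's log, the transition calculated right after node 1 was lost (Transition 7) finished with every counter at zero, which as far as I can tell means nothing was scheduled, fencing included. Below is a rough Python sketch for pulling those summaries out of a saved or quoted log. It assumes the crmd summary format shown in your logs; the function names are made up for illustration and are not part of Pacemaker.

```python
import re

# Matches the crmd transition summary as it appears in the logs quoted in
# this thread (assumption: other Pacemaker versions may format it differently).
SUMMARY = re.compile(
    r"Transition\s+(\d+)\s+\(Complete=(\d+),\s*Pending=(\d+),\s*"
    r"Fired=(\d+),\s*Skipped=(\d+),\s*Incomplete=(\d+)"
)
FIELDS = ("complete", "pending", "fired", "skipped", "incomplete")


def transition_summaries(text):
    """Return {transition_number: {field: count}} for each summary in text."""
    # Drop mail quote markers ("> ") and re-join lines the mail client wrapped.
    unquoted = (re.sub(r"^(?:>\s?)+", "", line) for line in text.splitlines())
    flat = " ".join(unquoted)
    result = {}
    for match in SUMMARY.finditer(flat):
        number, *counts = (int(group) for group in match.groups())
        result[number] = dict(zip(FIELDS, counts))
    return result


def empty_transitions(text):
    """Return the transition numbers whose counters are all zero."""
    summaries = transition_summaries(text)
    return sorted(n for n, counts in summaries.items() if not any(counts.values()))
```

Feeding it the contents of a saved syslog (or this mail) via `empty_transitions(open(path).read())` lists the transitions where the policy engine decided there was nothing to do.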

digimer

On 2018-06-20 06:04 PM, Casey & Gina wrote:
>> On 2018-06-20, at 3:59 PM, Casey & Gina <caseyandg...@icloud.com> wrote:
>>
>>> Get the cluster healthy, tail the system logs from both nodes, trigger a
>>> fault and wait for things to settle. Then share the logs please.
>>
>> What do you mean by "system logs"?  Do you mean the corosync.log?  
>> Triggering a fault is powering off a node, so I can't get a tailed log file 
>> from that host.  Is there another mechanism I should try?
> 
> Sorry, I did a little more research.  I guess you mean the syslog, and I
> realized I could `killall -9 corosync` to trigger a failure.  Let me know if
> there is a better way or if this is okay.
> 
> Here are the logs:
> 
> Node that was "master" to start with, that I did not kill corosync on:
> 
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Operation 
> postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-2, call=36, rc=0, 
> cib-update=0, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 5 
> (Complete=12, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-58.bz2): Complete
> Jun 20 21:57:55 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
> origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[15918]: INFO: 
> Update score of "d-gp2-dbpg64-1" from -1000 to 1000 because of a change in 
> the replication lag (0).
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE 
> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM 
> Quorum: Ignore
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 6 
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-59.bz2): Complete
> Jun 20 21:58:10 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
> origin=notify_crmd ]
> Jun 20 21:58:10 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 
> 6: /var/lib/pacemaker/pengine/pe-input-59.bz2
> Jun 20 21:58:13 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' 
> insert (-1)
> Jun 20 22:01:13 d-gp2-dbpg64-2 snmpd[1468]: message repeated 6 times: [ error 
> on subcontainer 'ia_addr' insert (-1)]
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A processor 
> failed, forming new configuration.
> Jun 20 22:01:42 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A processor failed, 
> forming new configuration.
> Jun 20 22:01:43 d-gp2-dbpg64-2 snmpd[1468]: error on subcontainer 'ia_addr' 
> insert (-1)
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] A new 
> membership (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]: notice  [TOTEM ] Failed to 
> receive the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] A new membership 
> (10.124.164.249:260) was formed. Members left: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 corosync[6683]:  [TOTEM ] Failed to receive 
> the leave message. failed: 1
> Jun 20 22:01:43 d-gp2-dbpg64-2 pacemakerd[6716]:   notice: 
> crm_reap_unseen_nodes: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: crm_reap_unseen_nodes: 
> Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: crm_update_peer_proc: 
> Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: 
> crm_update_peer_proc: Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Removing 
> d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 attrd[6720]:   notice: Purged 1 peers with 
> id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Removing 
> d-gp2-dbpg64-1/1 from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 stonith-ng[6719]:   notice: Purged 1 peers 
> with id=1 and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: crm_update_peer_proc: 
> Node d-gp2-dbpg64-1[1] - state is now lost (was member)
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Removing d-gp2-dbpg64-1/1 
> from the membership list
> Jun 20 22:01:43 d-gp2-dbpg64-2 cib[6718]:   notice: Purged 1 peers with id=1 
> and/or uname=d-gp2-dbpg64-1 from the membership cache
> Jun 20 22:01:43 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition S_IDLE 
> -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: On loss of CCM 
> Quorum: Ignore
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: Transition 7 
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> Jun 20 22:01:44 d-gp2-dbpg64-2 crmd[6721]:   notice: State transition 
> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
> origin=notify_crmd ]
> Jun 20 22:01:44 d-gp2-dbpg64-2 pengine[2499]:   notice: Calculated Transition 
> 7: /var/lib/pacemaker/pengine/pe-input-60.bz2
> Jun 20 22:01:57 d-gp2-dbpg64-2 pgsqlms(postgresql-10-main)[17381]: INFO: 
> Ignoring unknown application_name/node "d-gp2-dbpg64-1"
> 
> Node that was a standby, which I kill -9'd corosync on:
> 
> Jun 20 21:57:52 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: On loss of CCM 
> Quorum: Ignore
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: State transition 
> S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
> origin=do_cl_join_finalize_respond ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 stonith-ng[2035]:   notice: Versions did not 
> change in patch 0.81.8
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
> postgresql-master-vip_monitor_0: not running (node=d-gp2-dbpg64-1, call=5, 
> rc=7, cib-update=12, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
> postgresql-10-main_monitor_0: not running (node=d-gp2-dbpg64-1, call=10, 
> rc=7, cib-update=13, confirmed=true)
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: 
> d-gp2-dbpg64-1-postgresql-10-main_monitor_0:10 [ /var/run/postgresql:5432 - 
> no response\npg_ctl: no server running\n ]
> Jun 20 21:57:54 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
> vfencing_monitor_0: not running (node=d-gp2-dbpg64-1, call=14, rc=7, 
> cib-update=14, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 pgsqlms(postgresql-10-main)[2155]: INFO: 
> Instance "postgresql-10-main" started
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
> postgresql-10-main_start_0: ok (node=d-gp2-dbpg64-1, call=15, rc=0, 
> cib-update=15, confirmed=true)
> Jun 20 21:57:55 d-gp2-dbpg64-1 crmd[2039]:   notice: Operation 
> postgresql-10-main_notify_0: ok (node=d-gp2-dbpg64-1, call=16, rc=0, 
> cib-update=0, confirmed=true)
> Jun 20 22:01:32 d-gp2-dbpg64-1 systemd[1]: Started Session 2 of user cshobe.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Main process 
> exited, code=killed, status=9/KILL
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Unit entered 
> failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: corosync.service: Failed with 
> result 'signal'.
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:  warning: new_event_notification 
> (2036-2039-8): Bad file descriptor (9)
> Jun 20 22:01:41 d-gp2-dbpg64-1 cib[2034]:    error: Connection to the CPG API 
> failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2035]:    error: Connection to the 
> CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:    error: Connection to the CPG 
> API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2037]:   notice: Disconnecting client 
> 0x559e6c2c8810, pid=2039...
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to stonith-ng 
> failed
> Jun 20 22:01:41 d-gp2-dbpg64-1 lrmd[2036]:    error: Connection to 
> stonith-ng[0x55852f94ff10] closed (I/O condition=17)
> Jun 20 22:01:41 d-gp2-dbpg64-1 pacemakerd[2030]:    error: Connection to the 
> CPG API failed: Library error (2)
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Additional logging 
> available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Additional logging 
> available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:   notice: Connecting to cluster 
> infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 attrd[2647]:    error: Could not connect to 
> the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:   notice: Connecting to 
> cluster infrastructure: corosync
> Jun 20 22:01:41 d-gp2-dbpg64-1 crmd[2648]:   notice: Additional logging 
> available in /var/log/corosync/corosync.log
> Jun 20 22:01:41 d-gp2-dbpg64-1 stonith-ng[2646]:    error: Could not connect 
> to the Cluster Process Group API: 2
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367015] show_signal_msg: 15 
> callbacks suppressed
> Jun 20 22:01:41 d-gp2-dbpg64-1 kernel: [  393.367020] attrd[2647]: segfault 
> at 1b8 ip 00007f8a4813a870 sp 00007ffc7a76f398 error 4 in 
> libqb.so.0.17.2[7f8a4812d000+21000]
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Main process 
> exited, code=exited, status=107/n/a
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Unit entered 
> failed state.
> Jun 20 22:01:41 d-gp2-dbpg64-1 systemd[1]: pacemaker.service: Failed with 
> result 'exit-code'.
> 
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
