[ClusterLabs] Stonith two-node cluster shot each other

Daniel Ragle Tue, 04 Dec 2018 09:49:08 -0800

I *think* the two nodes of my cluster shot each other in the head this weekend and I can't figure out why.


Looking at corosync.log on node1 I see this:

[143747] node1.mydomain.com corosync notice [TOTEM ] A processor failed, forming new configuration.
[143747] node1.mydomain.com corosync notice [TOTEM ] A new membership (192.168.10.25:236) was formed. Members joined: 2 left: 2
[143747] node1.mydomain.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 2

[143747] node1.mydomain.com corosync notice  [TOTEM ] Retransmit List: 1

Dec 01 07:03:50 [143768] node1.mydomain.com crmd: info: pcmk_cpg_membership: Node 2 left group crmd (peer=node2.mydomain.com, counter=1.0)
Dec 01 07:03:50 [143766] node1.mydomain.com attrd: info: pcmk_cpg_membership: Node 2 left group attrd (peer=node2.mydomain.com, counter=1.0)
Dec 01 07:03:50 [143764] node1.mydomain.com stonith-ng: info: pcmk_cpg_membership: Node 2 left group stonith-ng (peer=node2.vselect.com, counter=1.0)
Dec 01 07:03:50 [143762] node1.mydomain.com pacemakerd: info: pcmk_cpg_membership: Node 2 left group pacemakerd (peer=node2.vselect.com, counter=1.0)

This is followed by a whole slew of messages generally saying node2 was dead/could not be reached, culminating in:

Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: process_remote_stonith_query: Query result 1 of 2 from node1.mydomain.com for node2.mydomain.com/reboot (1 devices) a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: call_remote_stonith: Total timeout set to 300 for peer's fencing of node2.mydomain.com for stonith-api.139901|id=a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: call_remote_stonith: Requesting that 'node1.mydomain.com' perform op 'node2.mydomain.com reboot' for stonith-api.139901 (360s, 0s)
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: process_remote_stonith_query: Query result 2 of 2 from node2.mydomain.com for node2.mydomain.com/reboot (1 devices) a041d1df-e857-4815-91db-00f448106a33
Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'node2.mydomain.com'
Dec 01 07:04:21 [143768] node1.mydomain.com crmd: info: crm_update_peer_expected: handle_request: Node node2.mydomain.com[2] - expected state is now down (was member)
Dec 01 07:04:21 [143766] node1.mydomain.com attrd: info: attrd_peer_update: Setting shutdown[node2.mydomain.com]: (null) -> 1543665861 from node2.mydomain.com
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: Diff: --- 0.188.66 2
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: Diff: +++ 0.188.67 (null)
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: + /cib: @num_updates=67
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: ++ /cib/status/node_state[@id='2']: <transient_attributes id="2"/>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: ++ <instance_attributes id="status-2">
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: ++ <nvpair id="status-2-shutdown" name="shutdown" value="1543665861"/>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: ++ </instance_attributes>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_perform_op: ++ </transient_attributes>
Dec 01 07:04:21 [143763] node1.mydomain.com cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node2.mydomain.com/attrd/6, version=0.188.67)


And on node2 I see this:

[50215] node2.mydomain.com corosync notice [TOTEM ] A new membership (192.168.10.25:228) was formed. Members
[50215] node2.mydomain.com corosync notice [TOTEM ] A new membership (192.168.10.25:236) was formed. Members joined: 1 left: 1
[50215] node2.mydomain.com corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Dec 01 07:03:50 [50224] node2.mydomain.com cib: info: pcmk_cpg_membership: Node 1 left group cib (peer=node1.mydomain.com, counter=2.0)
Dec 01 07:03:50 [50224] node2.mydomain.com cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] - corosync-cpg is now offline
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info: pcmk_cpg_membership: Node 1 left group crmd (peer=node1.mydomain.com, counter=2.0)
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node1.mydomain.com[1] - corosync-cpg is now offline
Dec 01 07:03:50 [50229] node2.mydomain.com crmd: info: peer_update_callback: Client node1.mydomain.com/peer now has status [offline] (DC=true, changed=4000000)


and then, 30 seconds later:

Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice: handle_request: Client stonith-api.170881.b598a6f3 wants to fence (reboot) '1' with device '(any)'
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice: initiate_remote_stonith_op: Requesting peer fencing (reboot) of node1.mydomain.com | id=2b08eff2-1555-46fa-8a88-fe500f3fca87 state=0
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info: process_remote_stonith_query: Query result 1 of 2 from node1.mydomain.com for node1.mydomain.com/reboot (1 devices) 2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info: process_remote_stonith_query: Query result 2 of 2 from node2.mydomain.com for node1.mydomain.com/reboot (1 devices) 2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info: call_remote_stonith: Total timeout set to 300 for peer's fencing of node1.mydomain.com for stonith-api.170881|id=2b08eff2-1555-46fa-8a88-fe500f3fca87
Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: info: call_remote_stonith: Requesting that 'node2.mydomain.com' perform op 'node1.mydomain.com reboot' for stonith-api.170881 (360s, 0s)
Dec 01 07:04:21 [50225] node2.mydomain.com stonith-ng: info: stonith_fence_get_devices_cb: Found 1 matching devices for 'node1.mydomain.com'
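
To line the two excerpts up, a small script along the following lines could pull out the "Requesting peer fencing" entries and compare their timestamps. This is only a sketch: the regular expression and the helper name `fence_requests` are assumptions based on the stonith-ng lines quoted above, not anything that ships with Pacemaker, and the sample strings are abridged from those excerpts.

```python
import re

# Pattern assumed from the stonith-ng "Requesting peer fencing" lines quoted
# above; it tolerates missing spaces after "notice:" and "of".
FENCE_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) "
    r"\[\d+\] (?P<host>\S+) +stonith-ng: +notice: *initiate_remote_stonith_op: "
    r"Requesting peer fencing \((?P<action>\w+)\) of *(?P<target>\S+)"
)


def fence_requests(log_text):
    """Return (timestamp, requesting host, action, target) per fencing request."""
    return [
        (m["stamp"], m["host"], m["action"], m["target"])
        for m in FENCE_RE.finditer(log_text)
    ]


# Sample input abridged from the node1 and node2 excerpts above.
node1_log = (
    "Dec 01 07:04:20 [143764] node1.mydomain.com stonith-ng: notice: "
    "initiate_remote_stonith_op: Requesting peer fencing (reboot) of "
    "node2.mydomain.com | id=a041d1df-e857-4815-91db-00f448106a33 state=0"
)
node2_log = (
    "Dec 01 07:04:20 [50225] node2.mydomain.com stonith-ng: notice: "
    "initiate_remote_stonith_op: Requesting peer fencing (reboot) of "
    "node1.mydomain.com | id=2b08eff2-1555-46fa-8a88-fe500f3fca87 state=0"
)

requests = fence_requests(node1_log) + fence_requests(node2_log)
for stamp, host, action, target in requests:
    print(f"{stamp}  {host} requested {action} of {target}")
```

Run against the excerpts above, both requests carry the timestamp 07:04:20, i.e. each node asked for the other to be rebooted within the same second.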

What is wrong with my config that they would want to kill each other? Shouldn't one always survive?


# pcs stonith show --full
 Resource: FenceNode2 (class=stonith type=fence_ipmilan)
  Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.10.29 lanplus=1 login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
  Operations: monitor interval=60s (FenceNode2-monitor-interval-60s)
 Resource: FenceNode1 (class=stonith type=fence_ipmilan)
  Attributes: hexadecimal_kg=<KEY> ipaddr=192.168.100.28 lanplus=1 login=ipmiUser method=onoff passwd=<BLAH> power_timeout=30 power_wait=4
  Operations: monitor interval=60s (FenceNode1-monitor-interval-60s)

The corresponding constraints:

  Resource: FenceNode1
    Disabled on: node1.mydomain.com (score:-INFINITY)
  Resource: FenceNode2
    Disabled on: node2.mydomain.com (score:-INFINITY)

And corosync.conf:

# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: MyCluster
    secauth: off
    transport: udp

    interface {
        ringnumber: 0
        bindnetaddr: 192.168.10.0
        broadcast: no
        mcastport: 5405
        ttl: 1
    }
}

nodelist {
    node {
        ring0_addr: node1.vselect.com
        nodeid: 1
    }

    node {
        ring0_addr: node2.vselect.com
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

TIA,

Dan
_______________________________________________
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
