On 10/03/2023 22:29, Reid Wahl wrote:
On Fri, Mar 10, 2023 at 10:49 AM Lentes, Bernd
<bernd.len...@helmholtz-muenchen.de> wrote:

Hi,

I can't get my cluster running. I had problems with an OCFS2 volume, and both
nodes were fenced.
When I now run "systemctl start pacemaker.service", crm_mon shows both nodes
as UNCLEAN for a few seconds, then Pacemaker stops.
I tried to confirm the fencing with "stonith_admin -C", but it doesn't work.
Maybe the time is too short; Pacemaker only runs for a few seconds.

Here is the log:

Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster
Engine ('2.3.6'): started and ready to provide service.
Mar 10 19:36:24 [31037] ha-idg-1 corosync info    [MAIN  ] Corosync built-in
features: debug testagents augeas systemd pie relro bindnow
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing
transport (UDP/IP Multicast).
Mar 10 19:36:24 [31037] ha-idg-1 corosync notice  [TOTEM ] Initializing
transmit/receive security (NSS) crypto: aes256 hash: sha1
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] The network
interface [192.168.100.10] is now up.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync configuration map access [0]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cmap
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync configuration service [1]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cfg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync cluster closed process group service v1.01 [2]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name: cpg
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync profile loading service [4]
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Using quorum
provider corosync_votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] This node is
within the primary component and will provide service.
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[0]:
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync vote quorum service v1.0 [5]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name:
votequorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
loaded: corosync cluster quorum service v0.1 [3]
Mar 10 19:36:25 [31037] ha-idg-1 corosync info    [QB    ] server name:
quorum
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [TOTEM ] A new membership
(192.168.100.10:2340) was formed. Members joined: 1084777482

Is this really the corosync node ID of one of your nodes? If not,
what's your corosync version? Is the number the same every time the
issue happens? The number is so large and seemingly random that I
wonder if there's some kind of memory corruption.

It's an autogenerated nodeid (just the IPv4 address). Nodeid was not required for Corosync < 3 (we made it required mostly for knet).
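
For illustration, the logged id looks like the ring0 IPv4 address packed into a
32-bit integer with the most significant bit cleared (that's my assumption about
how the autogeneration works; the arithmetic below only shows that 192.168.100.10
maps to the id seen in the log):

    # 192.168.100.10 as a 32-bit integer, top bit cleared
    $ echo $(( ((192<<24) | (168<<16) | (100<<8) | 10) & 0x7fffffff ))
    1084777482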



Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [QUORUM] Members[1]:
1084777482
Mar 10 19:36:25 [31037] ha-idg-1 corosync notice  [MAIN  ] Completed service
synchronization, ready to provide service.
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: main:    Starting
Pacemaker 1.1.24+20210811.f5abda0ee-3.27.1 | build=1.1.24+20210811.f5abda0ee
features: generated-manpages agent-manpages ncurses libqb-logging libqb-ipc
lha-fencing systemd nagios
corosync-native atomic-attrd snmp libesmtp acls cibsecrets
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: main:    Maximum core
file size is: 18446744073709551615
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: qb_ipcs_us_publish:
server name: pacemakerd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to lrmd IPC:
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to cib_ro IPC:
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to crmd IPC:
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to attrd IPC:
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to pengine IPC:
Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk__ipc_is_authentic_process_active:   Could not connect to stonith-ng
IPC: Connection refused
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: get_node_name:
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:
Created entry 3c2499de-58a8-44f7-bf1e-03ff1fbec774/0x1456550 for node
(null)/1084777482 (1 total)
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:    Node
1084777482 has uuid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_update_peer_proc:
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice:
cluster_connect_quorum:  Quorum acquired
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: crm_get_peer:    Node
1084777482 is now known as ha-idg-1
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Using uid=90 and group=90 for process cib
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31045 for process cib
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31046 for process stonith-ng
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31047 for process lrmd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Using uid=90 and group=90 for process attrd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31048 for process attrd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Using uid=90 and group=90 for process pengine
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31049 for process pengine
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Using uid=90 and group=90 for process crmd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: start_child:
Forked child 31050 for process crmd
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: main:    Starting
mainloop
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info:
pcmk_quorum_notification:        Quorum retained | membership=2340 members=1
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:   notice:
crm_update_peer_state_iter:      Node ha-idg-1 state is now member |
nodeid=1084777482 previous=unknown source=pcmk_quorum_notification
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: pcmk_cpg_membership:
Group pacemakerd event 0: node 1084777482 pid 31044 joined via cpg_join
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: pcmk_cpg_membership:
Group pacemakerd event 0: ha-idg-1 (node 1084777482 pid 31044) is member
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: qb_ipcs_us_publish:
server name: pengine
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: get_cluster_type:
Verifying cluster type: 'corosync'
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: get_cluster_type:
Assuming an active 'corosync' cluster
Mar 10 19:36:25 [31049] ha-idg-1    pengine:     info: main:    Starting
pengine
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: main:    Starting up
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: retrieveCib:
Reading cluster configuration file /var/lib/pacemaker/cib/cib.xml (digest:
/var/lib/pacemaker/cib/cib.xml.sig)
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: get_cluster_type:
Verifying cluster type: 'corosync'
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: get_cluster_type:
Assuming an active 'corosync' cluster
Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: corosync
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: get_cluster_type:
Verifying cluster type: 'corosync'
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: get_cluster_type:
Assuming an active 'corosync' cluster
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: corosync
Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: qb_ipcs_us_publish:
server name: lrmd
Mar 10 19:36:25 [31047] ha-idg-1       lrmd:     info: main:    Starting
Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: crm_log_init:
Changed active directory to /var/lib/pacemaker/cores
Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: main:    CRM Git
Version: 1.1.24+20210811.f5abda0ee-3.27.1 (1.1.24+20210811.f5abda0ee)
Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: get_cluster_type:
Verifying cluster type: 'corosync'
Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: get_cluster_type:
Assuming an active 'corosync' cluster
Mar 10 19:36:25 [31050] ha-idg-1       crmd:  warning:
log_deprecation_warnings:        Compile-time support for crm_mon SNMP
options is deprecated and will be removed in a future release (configure
alerts instead)
Mar 10 19:36:25 [31050] ha-idg-1       crmd:  warning:
log_deprecation_warnings:        Compile-time support for crm_mon SMTP
options is deprecated and will be removed in a future release (configure
alerts instead)
Mar 10 19:36:25 [31050] ha-idg-1       crmd:     info: do_log:  Input
I_STARTUP received in state S_STARTING from crmd_init
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
validate_with_relaxng:   Creating RNG parser context
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482       ⇐========= this happens
quite often

corosync.conf is missing the node names - just add them like:

nodelist {
    node {
        name: HOSTNAME
        nodeid: nodeid
        ring0_addr: IPADDR
    }

    node {
        name: HOSTNAME2
        nodeid: nodeid2
        ring0_addr: IPADDR2
    }

    ...
}

But it shouldn't be a problem.

Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: get_node_name:
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:
Created entry c1bd522c-34da-49b3-97cb-22fd4580959b/0x109e210 for node
(null)/1084777482 (1 total)
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:    Node
1084777482 has uuid 1084777482
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_update_peer_proc:
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice:
crm_update_peer_state_iter:      Node (null) state is now member |
nodeid=1084777482 previous=unknown source=crm_update_peer_proc
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info:
init_cs_connection_once: Connection to 'corosync': established
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:
Created entry 1d232d33-d274-415d-be94-765dc1b4e1e4/0x9478d0 for node
(null)/1084777482 (1 total)
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:    Node
1084777482 has uuid 1084777482
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_update_peer_proc:
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice:
crm_update_peer_state_iter:      Node (null) state is now member |
nodeid=1084777482 previous=unknown source=crm_update_peer_proc
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: startCib:        CIB
Initialization completed successfully
Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: corosync
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31048] ha-idg-1      attrd:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: crm_get_peer:    Node
1084777482 is now known as ha-idg-1
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info:
init_cs_connection_once: Connection to 'corosync': established
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: get_node_name:
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:25 [31048] ha-idg-1      attrd:     info: main:    Cluster
connection active
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:
Created entry 7c2b1d3d-0ab6-4fa6-887c-5d01e5927a67/0x147af10 for node
(null)/1084777482 (1 total)
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:    Node
1084777482 has uuid 1084777482
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_update_peer_proc:
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice:
crm_update_peer_state_iter:      Node (null) state is now member |
nodeid=1084777482 previous=unknown source=crm_update_peer_proc
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
init_cs_connection_once: Connection to 'corosync': established
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:25 [31046] ha-idg-1 stonith-ng:     info: crm_get_peer:    Node
1084777482 is now known as ha-idg-1
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:25 [31045] ha-idg-1        cib:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: crm_get_peer:    Node
1084777482 is now known as ha-idg-1
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
server name: cib_ro
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
server name: cib_rw
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: qb_ipcs_us_publish:
server name: cib_shm
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: cib_init:
Starting cib mainloop
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: pcmk_cpg_membership:
Group cib event 0: node 1084777482 pid 31045 joined via cpg_join
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: pcmk_cpg_membership:
Group cib event 0: ha-idg-1 (node 1084777482 pid 31045) is member
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info: cib_file_backup:
Archived previous version as /var/lib/pacemaker/cib/cib-34.raw
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
cib_file_write_with_digest:      Wrote version 7.29548.0 of the CIB to disk
(digest: 03b4ec65319cef255d43fc1ec9d285a5)
Mar 10 19:36:25 [31045] ha-idg-1        cib:     info:
cib_file_write_with_digest:      Reading cluster configuration file
/var/lib/pacemaker/cib/cib.MBy2v0 (digest:
/var/lib/pacemaker/cib/cib.nDn0X9)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_cib_control:  CIB
connection established
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: corosync
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
Could not obtain a node name for corosync nodeid 1084777482
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:
Created entry 873262c1-ede0-4ba7-97e6-53ead0a6d7b0/0x1613910 for node
(null)/1084777482 (1 total)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:    Node
1084777482 has uuid 1084777482
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_update_peer_proc:
cluster_connect_cpg: Node (null)[1084777482] - corosync-cpg is now online
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info:
init_cs_connection_once: Connection to 'corosync': established
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: crm_get_peer:    Node
1084777482 is now known as ha-idg-1
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: peer_update_callback:
Cluster node ha-idg-1 is now in unknown state      ⇐===== is that the
problem?

Probably a normal part of the startup process but I haven't tested it yet.

Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: attrd_erase_attrs:
Clearing transient attributes from CIB |
xpath=//node_state[@uname='ha-idg-1']/transient_attributes
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info:
attrd_start_election_if_needed:  Starting an election to determine the
writer
Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
Forwarding cib_delete operation for section
//node_state[@uname='ha-idg-1']/transient_attributes to all
(origin=local/attrd/2)
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:26 [31048] ha-idg-1      attrd:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: main:    CIB
connection active
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: qb_ipcs_us_publish:
server name: attrd
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: main:    Accepting
attribute updates
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: pcmk_cpg_membership:
Group attrd event 0: node 1084777482 pid 31048 joined via cpg_join
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: pcmk_cpg_membership:
Group attrd event 0: ha-idg-1 (node 1084777482 pid 31048) is member
Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: corosync_node_name:
Unable to get node name for nodeid 1084777482
Mar 10 19:36:26 [31045] ha-idg-1        cib:   notice: get_node_name:
Defaulting to uname -n for the local corosync node name
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: election_check:
election-attrd won by local node
Mar 10 19:36:26 [31048] ha-idg-1      attrd:   notice: attrd_declare_winner:
Recorded local node as attribute writer (was unset)
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: attrd_peer_update:
Setting #attrd-protocol[ha-idg-1]: (null) -> 2 from ha-idg-1
Mar 10 19:36:26 [31048] ha-idg-1      attrd:     info: write_attribute:
Processed 1 private change for #attrd-protocol, id=n/a, set=n/a
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: setup_cib:
Watching for stonith topology changes
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: qb_ipcs_us_publish:
server name: stonith-ng
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: main:    Starting
stonith-ng mainloop
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: pcmk_cpg_membership:
Group stonith-ng event 0: node 1084777482 pid 31046 joined via cpg_join
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: pcmk_cpg_membership:
Group stonith-ng event 0: ha-idg-1 (node 1084777482 pid 31046) is member
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice:
cluster_connect_quorum:  Quorum acquired
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: init_cib_cache_cb:
Updating device list from the cib: init
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: cib_devices_update:
Updating devices to version 7.29548.0
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:   notice: unpack_config:   On
loss of CCM Quorum: Ignore
Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
Completed cib_delete operation for section
//node_state[@uname='ha-idg-1']/transient_attributes: OK (rc=0,
origin=ha-idg-1/attrd/2, version=7.29548.0)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_ha_control:
Connected to the cluster
Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
Forwarding cib_modify operation for section nodes to all
(origin=local/crmd/3)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: lrmd_ipc_connect:
Connecting to lrmd
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_lrm_control:  LRM
connection established
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
Delaying start, no membership data (0000000000100000)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info:
pcmk_quorum_notification:        Quorum retained | membership=2340 members=1
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice:
crm_update_peer_state_iter:      Node ha-idg-1 state is now member |
nodeid=1084777482 previous=unknown source=pcmk_quorum_notification
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: peer_update_callback:
Cluster node ha-idg-1 is now member (was in unknown state)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
Delaying start, Config not read (0000000000000040)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: pcmk_cpg_membership:
Group crmd event 0: node 1084777482 pid 31050 joined via cpg_join
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: pcmk_cpg_membership:
Group crmd event 0: ha-idg-1 (node 1084777482 pid 31050) is member
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
Delaying start, Config not read (0000000000000040)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_started:
Delaying start, Config not read (0000000000000040)
Mar 10 19:36:26 [31045] ha-idg-1        cib:     info: cib_process_request:
Completed cib_modify operation for section nodes: OK (rc=0,
origin=ha-idg-1/crmd/3, version=7.29548.0)
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: qb_ipcs_us_publish:
server name: crmd
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: do_started:      The
local CRM is operational    ⇐============================ looks pretty good
Mar 10 19:36:26 [31050] ha-idg-1       crmd:     info: do_log:  Input
I_PENDING received in state S_STARTING from do_started
Mar 10 19:36:26 [31050] ha-idg-1       crmd:   notice: do_state_transition:
State transition S_STARTING -> S_PENDING | input=I_PENDING
cause=C_FSA_INTERNAL origin=do_started
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: action_synced_wait:
Managed fence_ilo2_metadata_1 process 31052 exited with rc=0
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info:
stonith_device_register: Added 'fence_ilo_ha-idg-2' to the device list (1
active devices)
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info: action_synced_wait:
Managed fence_ilo4_metadata_1 process 31054 exited with rc=0
Mar 10 19:36:26 [31046] ha-idg-1 stonith-ng:     info:
stonith_device_register: Added 'fence_ilo_ha-idg-1' to the device list (2
active devices)
Mar 10 19:36:28 [31050] ha-idg-1       crmd:     info:
te_trigger_stonith_history_sync: Fence history will be synchronized
cluster-wide within 30 seconds
Mar 10 19:36:28 [31050] ha-idg-1       crmd:   notice: te_connect_stonith:
Fencer successfully connected
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: handle_request:
Received manual confirmation that ha-idg-1 is fenced
<===================== seems to be my "stonith_admin -C"

Yes

Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice:
initiate_remote_stonith_op:      Initiating manual confirmation for
ha-idg-1: 23926653-7baa-44b8-ade3-5ee8468f3db6
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: stonith_manual_ack:
Injecting manual confirmation that ha-idg-1 is safely off/down
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: remote_op_done:
Operation 'off' targeting ha-idg-1 on a human for
stonith_admin.31555@ha-idg-1.23926653: OK
Mar 10 19:36:34 [31050] ha-idg-1       crmd:     info: exec_alert_list:
Sending fencing alert via smtp_alert to informatic....@helmholtz-muenchen.de
Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info:
process_lrmd_alert_exec: Executing alert smtp_alert for
6bb5a831-e90c-4b0b-8783-0092a26a1e6c
Mar 10 19:36:34 [31050] ha-idg-1       crmd:     crit:
tengine_stonith_notify:  We were allegedly just fenced by a human for
ha-idg-1!      <=====================  what does that mean? I didn't fence it

It means you ran `stonith_admin -C`

https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-1.1.24/fencing/remote.c#L945-L961

Mar 10 19:36:34 [31050] ha-idg-1       crmd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:  warning: pcmk_child_exit:
Shutting cluster down because crmd[31050] had fatal failure
<=======================  ???

Pacemaker is shutting down on the local node because it just received
confirmation that it was fenced (because you ran `stonith_admin -C`).
This is expected behavior.
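
For reference, the confirmation is per target node - the log above shows the
target was ha-idg-1, i.e. the node you ran the command on. A sketch of what is
usually intended here (run on the node you are bringing up, against the peer
node that really went down; the node name is taken from the fence device names
in the log, adjust as needed):

    # tell the cluster that the *other* node is safely down
    stonith_admin -C ha-idg-2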

Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
Shutting down Pacemaker
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
Stopping pengine | sent signal 15 to process 31049
Mar 10 19:36:34 [31049] ha-idg-1    pengine:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Mar 10 19:36:34 [31049] ha-idg-1    pengine:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31049] ha-idg-1    pengine:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
pengine[31049] exited with status 0 (OK)
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
Stopping attrd | sent signal 15 to process 31048
Mar 10 19:36:34 [31048] ha-idg-1      attrd:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: main:    Shutting
down attribute manager
Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: attrd_cib_destroy_cb:
Connection disconnection complete
Mar 10 19:36:34 [31048] ha-idg-1      attrd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
attrd[31048] exited with status 0 (OK)
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
Stopping lrmd | sent signal 15 to process 31047
Mar 10 19:36:34 [31047] ha-idg-1       lrmd:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: lrmd_exit:
Terminating with 0 clients
Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:34 [31047] ha-idg-1       lrmd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
lrmd[31047] exited with status 0 (OK)
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
Stopping stonith-ng | sent signal 15 to process 31046
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: stonith_shutdown:
Terminating with 3 clients
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info:
cib_connection_destroy:  Connection to the CIB closed.
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31046] ha-idg-1 stonith-ng:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
stonith-ng[31046] exited with status 0 (OK)
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: stop_child:
Stopping cib | sent signal 15 to process 31045
Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice: crm_signal_dispatch:
Caught 'Terminated' signal | 15 (invoking handler)
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: cib_shutdown:
Disconnected 0 clients
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: cib_shutdown:    All
clients disconnected (0)
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: terminate_cib:
initiate_exit: Exiting from mainloop...
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
crm_cluster_disconnect:  Disconnecting from cluster infrastructure: corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
terminate_cs_connection: Disconnecting from Corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
terminate_cs_connection: No Quorum connection
Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice:
terminate_cs_connection: Disconnected from Corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
crm_cluster_disconnect:  Disconnected from corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
crm_cluster_disconnect:  Disconnecting from cluster infrastructure: corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
terminate_cs_connection: Disconnecting from Corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
cluster_disconnect_cpg:  No CPG connection
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
terminate_cs_connection: No Quorum connection
Mar 10 19:36:34 [31045] ha-idg-1        cib:   notice:
terminate_cs_connection: Disconnected from Corosync
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info:
crm_cluster_disconnect:  Disconnected from corosync
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: mcp_cpg_deliver:
Ignoring process list sent by peer for local node
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: qb_ipcs_us_withdraw:
withdrawing server sockets
Mar 10 19:36:34 [31045] ha-idg-1        cib:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: pcmk_child_exit:
cib[31045] exited with status 0 (OK)
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
Shutdown complete
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:   notice: pcmk_shutdown_worker:
Attempting to inhibit respawning after fatal error
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info:
pcmk_exit_with_cluster:  Asking Corosync to shut down
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [CFG   ] Node 1084777482
was shut down by sysadmin
Mar 10 19:36:34 [31044] ha-idg-1 pacemakerd:     info: crm_xml_cleanup:
Cleaning up memory from libxml2
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Unloading all
Corosync service engines.
Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
server sockets
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync vote quorum service v1.0
Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
server sockets
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync configuration map access
Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
server sockets
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync configuration service
Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
server sockets
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync cluster closed process group service v1.01
Mar 10 19:36:34 [31037] ha-idg-1 corosync info    [QB    ] withdrawing
server sockets
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync cluster quorum service v0.1
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [SERV  ] Service engine
unloaded: corosync profile loading service
Mar 10 19:36:34 [31037] ha-idg-1 corosync notice  [MAIN  ] Corosync Cluster
Engine exiting normally

Bernd

Can you help me understand the issue here? You started the cluster on
this node at 19:36:24. 10 seconds later, you ran `stonith_admin -C`,
and the local node shut down Pacemaker, as expected. It doesn't look
like Pacemaker stopped until you ran that command.

The dc-deadtime property is set to 20 seconds by default. You can
expect nodes to be in UNCLEAN state until then.
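
If you want to inspect or raise that property, something like this should work
(a sketch using the standard Pacemaker CLI; verify the option names against
your 1.1 tools):

    # query the dc-deadtime cluster property
    crm_attribute --type crm_config --name dc-deadtime --query

    # example: raise it to 60 seconds
    crm_attribute --type crm_config --name dc-deadtime --update 60s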


--
Bernd Lentes
System Administrator
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
        +49 89 3187 49123
fax:   +49 89 3187 2294
https://www.helmholtz-munich.de/en/mcd





_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/



