I regularly end up with rgmanager stuck, unresponsive and unkillable after a fence attempt has failed. fence_ack_manual rarely brings it back into shape, so I have to resort to rebooting the node(s). Googling the error messages usually gets me nowhere; for example, the one below (fenced_domain_info error -1) yields exactly one hit: the source code. I feel special.

Can anyone point me in the right direction on how to debug this further?

This is RHEL 6.3 with a simple 2-node cluster, a quorum disk, and shared SAN storage with HA-LVM.

Here's a recent log snippet if it helps:
Jun 06 10:07:23 dlm_controld cluster node 1 removed seq 100812
Jun 06 10:07:23 dlm_controld del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1"
Jun 06 10:07:23 dlm_controld dlm:controld conf 1 0 1 memb 2 join left 1
Jun 06 10:07:23 dlm_controld dlm:ls:rgmanager conf 1 0 1 memb 2 join left 1
Jun 06 10:07:23 dlm_controld rgmanager add_change cg 4 remove nodeid 1 reason 3
Jun 06 10:07:23 dlm_controld rgmanager add_change cg 4 counts member 1 joined 0 remove 1 failed 1
Jun 06 10:07:23 dlm_controld rgmanager stop_kernel cg 4
Jun 06 10:07:23 dlm_controld write "0" to "/sys/kernel/dlm/rgmanager/control"
Jun 06 10:07:23 dlm_controld rgmanager check_fencing 1 wait add 1370252270 fail 1370506043 last 1370252214
Jun 06 10:07:24 dlm_controld cluster node 1 added seq 100816
Jun 06 10:07:24 dlm_controld set_configfs_node 1 10.112.32.22 local 0
Jun 06 10:07:24 dlm_controld dlm:controld conf 2 1 0 memb 1 2 join 1 left
Jun 06 10:07:24 dlm_controld cpg_mcast_joined retried 2 protocol
Jun 06 10:07:24 dlm_controld dlm:ls:rgmanager conf 2 1 0 memb 1 2 join 1 left
Jun 06 10:07:24 dlm_controld rgmanager add_change cg 5 joined nodeid 1
Jun 06 10:07:24 dlm_controld rgmanager add_change cg 5 counts member 2 joined 1 remove 0 failed 0
Jun 06 10:07:49 dlm_controld cluster node 1 removed seq 100820
Jun 06 10:07:49 dlm_controld del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1"
Jun 06 10:07:49 dlm_controld dlm:controld conf 1 0 1 memb 2 join left 1
Jun 06 10:07:49 dlm_controld dlm:ls:rgmanager conf 1 0 1 memb 2 join left 1
Jun 06 10:07:49 dlm_controld rgmanager add_change cg 6 remove nodeid 1 reason 3
Jun 06 10:07:49 dlm_controld rgmanager add_change cg 6 counts member 1 joined 0 remove 1 failed 1
Jun 06 10:07:49 dlm_controld rgmanager check_fencing 1 wait add 1370252270 fail 1370506069 last 1370252214
Jun 06 11:55:03 dlm_controld cluster node 1 added seq 100828
Jun 06 11:55:03 dlm_controld set_configfs_node 1 10.112.32.22 local 0
Jun 06 11:55:10 dlm_controld cluster node 1 removed seq 100832
Jun 06 11:55:10 dlm_controld del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1"
Jun 06 12:19:10 dlm_controld dlm_controld 3.0.12.1 started
Jun 06 12:19:11 dlm_controld found /dev/misc/dlm-control minor 54
Jun 06 12:19:11 dlm_controld found /dev/misc/dlm-monitor minor 53
Jun 06 12:19:11 dlm_controld found /dev/misc/dlm_plock minor 52
Jun 06 12:19:11 dlm_controld /dev/misc/dlm-monitor fd 12
Jun 06 12:19:11 dlm_controld /sys/kernel/config/dlm/cluster/comms: opendir failed: 2
Jun 06 12:19:11 dlm_controld /sys/kernel/config/dlm/cluster/spaces: opendir failed: 2
Jun 06 12:19:11 dlm_controld cluster node 2 added seq 100836
Jun 06 12:19:11 dlm_controld set_configfs_node 2 10.113.32.25 local 1
Jun 06 12:19:11 dlm_controld totem/rrp_mode = 'none'
Jun 06 12:19:11 dlm_controld set protocol 0
Jun 06 12:19:11 dlm_controld group_mode 3 compat 0
Jun 06 12:19:11 dlm_controld setup_cpg_daemon 15
Jun 06 12:19:11 dlm_controld dlm:controld conf 1 1 0 memb 2 join 2 left
Jun 06 12:19:11 dlm_controld set_protocol member_count 1 propose daemon 1.1.1 kernel 1.1.1
Jun 06 12:19:11 dlm_controld run protocol from nodeid 2
Jun 06 12:19:11 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 06 12:19:11 dlm_controld plocks 17
Jun 06 12:19:11 dlm_controld plock cpg message size: 104 bytes
Jun 06 12:19:12 dlm_controld client connection 5 fd 18
Jun 06 12:21:15 dlm_controld uevent: add@/kernel/dlm/rgmanager
Jun 06 12:21:15 dlm_controld kernel: add@ rgmanager
Jun 06 12:21:15 dlm_controld uevent: online@/kernel/dlm/rgmanager
Jun 06 12:21:15 dlm_controld kernel: online@ rgmanager
Jun 06 12:21:15 dlm_controld dlm:ls:rgmanager conf 1 1 0 memb 2 join 2 left
Jun 06 12:21:15 dlm_controld rgmanager add_change cg 1 joined nodeid 2
Jun 06 12:21:15 dlm_controld rgmanager add_change cg 1 we joined
Jun 06 12:21:15 dlm_controld rgmanager add_change cg 1 counts member 1 joined 1 remove 0 failed 0
Jun 06 12:28:40 dlm_controld uevent: remove@/kernel/dlm/rgmanager
Jun 06 12:28:40 dlm_controld kernel: remove@ rgmanager
Jun 06 12:28:57 dlm_controld uevent: add@/kernel/dlm/rgmanager
Jun 06 12:28:57 dlm_controld kernel: add@ rgmanager
Jun 06 12:28:57 dlm_controld uevent: online@/kernel/dlm/rgmanager
Jun 06 12:28:57 dlm_controld kernel: online@ rgmanager
Jun 06 12:28:57 dlm_controld process_uevent online@ error -17 errno 2
Jun 06 12:32:06 dlm_controld uevent: remove@/kernel/dlm/rgmanager
Jun 06 12:32:06 dlm_controld kernel: remove@ rgmanager
Jun 06 12:32:24 dlm_controld connection 5 read error -1
Jun 06 12:33:24 dlm_controld dlm_controld 3.0.12.1 started
Jun 06 12:33:24 dlm_controld found /dev/misc/dlm-control minor 54
Jun 06 12:33:24 dlm_controld found /dev/misc/dlm-monitor minor 53
Jun 06 12:33:24 dlm_controld found /dev/misc/dlm_plock minor 52
Jun 06 12:33:24 dlm_controld /dev/misc/dlm-monitor fd 13
Jun 06 12:33:24 dlm_controld clear_configfs_nodes rmdir "/sys/kernel/config/dlm/cluster/comms/2"
Jun 06 12:33:24 dlm_controld cluster node 2 added seq 100836
Jun 06 12:33:24 dlm_controld set_configfs_node 2 10.113.32.25 local 1
Jun 06 12:33:24 dlm_controld totem/rrp_mode = 'none'
Jun 06 12:33:24 dlm_controld set protocol 0
Jun 06 12:33:24 dlm_controld group_mode 3 compat 0
Jun 06 12:33:24 dlm_controld setup_cpg_daemon 15
Jun 06 12:33:24 dlm_controld dlm:controld conf 1 1 0 memb 2 join 2 left
Jun 06 12:33:24 dlm_controld set_protocol member_count 1 propose daemon 1.1.1 kernel 1.1.1
Jun 06 12:33:24 dlm_controld run protocol from nodeid 2
Jun 06 12:33:24 dlm_controld daemon run 1.1.1 max 1.1.1 kernel run 1.1.1 max 1.1.1
Jun 06 12:33:24 dlm_controld plocks 17
Jun 06 12:33:24 dlm_controld plock cpg message size: 104 bytes
Jun 06 12:33:25 dlm_controld client connection 5 fd 18
Jun 06 12:33:46 dlm_controld uevent: add@/kernel/dlm/rgmanager
Jun 06 12:33:46 dlm_controld kernel: add@ rgmanager
Jun 06 12:33:46 dlm_controld uevent: online@/kernel/dlm/rgmanager
Jun 06 12:33:46 dlm_controld kernel: online@ rgmanager
Jun 06 12:33:46 dlm_controld dlm:ls:rgmanager conf 1 1 0 memb 2 join 2 left
Jun 06 12:33:46 dlm_controld rgmanager add_change cg 1 joined nodeid 2
Jun 06 12:33:46 dlm_controld rgmanager add_change cg 1 we joined
Jun 06 12:33:46 dlm_controld rgmanager add_change cg 1 counts member 1 joined 1 remove 0 failed 0
Jun 06 12:55:46 dlm_controld fenced_domain_info error -1
<same message repeats every second>
Jun 06 12:58:08 dlm_controld cluster node 1 added seq 100840
Jun 06 12:58:08 dlm_controld set_configfs_node 1 10.112.32.22 local 0
Jun 06 12:58:08 dlm_controld fenced_domain_info error -1
<same message repeats every second>
Jun 06 12:58:25 dlm_controld cluster node 1 removed seq 100844
Jun 06 12:58:25 dlm_controld del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/1"
Jun 06 12:58:25 dlm_controld fenced_domain_info error -1
<same message repeats every second>
Jun 06 12:58:54 dlm_controld uevent: remove@/kernel/dlm/rgmanager
Jun 06 12:58:54 dlm_controld kernel: remove@ rgmanager
Jun 06 12:58:54 dlm_controld fenced_domain_info error -1
<same message repeats every second>
Jun 06 12:59:14 dlm_controld uevent: add@/kernel/dlm/rgmanager
Jun 06 12:59:14 dlm_controld kernel: add@ rgmanager
Jun 06 12:59:14 dlm_controld fenced_domain_info error -1
Jun 06 12:59:14 dlm_controld uevent: online@/kernel/dlm/rgmanager
Jun 06 12:59:14 dlm_controld kernel: online@ rgmanager
Jun 06 12:59:14 dlm_controld process_uevent online@ error -17 errno 111
Jun 06 12:59:14 dlm_controld fenced_domain_info error -1
<same message repeats every second>
Jun 06 13:00:30 dlm_controld uevent: remove@/kernel/dlm/rgmanager
Jun 06 13:00:30 dlm_controld kernel: remove@ rgmanager
Jun 06 13:00:30 dlm_controld fenced_domain_info error -1
<same message repeats every second>
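
In case it helps anyone skim a log like this one, here is a minimal Python sketch that counts the membership changes and error lines and notes when the fenced_domain_info errors first appear. It is not part of any cluster tooling: the function name `summarize` and the counter keys are made up for illustration, and it assumes one log entry per line in the "Mon DD HH:MM:SS dlm_controld message" shape seen above.

```python
import re
from collections import Counter

# Matches entries such as
# "Jun 06 12:55:46 dlm_controld fenced_domain_info error -1".
# Assumption: one entry per line, in the format shown in the log above.
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) dlm_controld (.*)$")


def summarize(lines):
    """Count selected dlm_controld events (hypothetical helper).

    Returns (counts, first_fenced_error), where first_fenced_error is the
    timestamp string of the first fenced_domain_info error, or None.
    """
    counts = Counter()
    first_fenced_error = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            # Skips blanks and placeholders such as
            # "<same message repeats every second>".
            continue
        stamp, msg = m.groups()
        if re.match(r"cluster node \d+ added", msg):
            counts["node_added"] += 1
        elif re.match(r"cluster node \d+ removed", msg):
            counts["node_removed"] += 1
        elif msg.startswith("fenced_domain_info error"):
            counts["fenced_domain_info_error"] += 1
            if first_fenced_error is None:
                first_fenced_error = stamp
        elif msg.startswith("process_uevent") and "error" in msg:
            counts["process_uevent_error"] += 1
    return counts, first_fenced_error
```

Passing an open file object (or any list of lines) to `summarize` returns the counters plus the first error timestamp. Errors hidden behind a "repeats every second" placeholder are not counted, so the error total is a lower bound.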

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster