On Wed, Oct 6, 2010 at 6:51 AM, Lon Hohberger <[email protected]> wrote:
> On 10/01/2010 02:11 AM, Joel Heenan wrote:
>>
>> So just further to this, I found a Red Hat bug about this exact issue:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=570373
>>
>> And for me it works perfectly if the dom0 is fenced using fence_node on
>> the command line. However, if the host becomes unavailable then it is
>> not fenced; from reading the fenced man page it seems this is because
>> there isn't a shared resource like clvm or gfs, so the cluster doesn't
>> see a need to fence the host. This means subsequent fence_xvm commands
>> fail.
>>
>> I guess I need to find a way to force fenced to operate without clvm and
>> fence dom0s?
>>
>> Joel
>
> fence_xvm/fence_xvmd is designed to handle two primary cases:
>
> 1) Kill the misbehaving VM, or
> 2) Wait for the last-known owner of the misbehaving VM to be dead.
>
> Effectively, (2) occurs when the host cluster node dies and the host is
> subsequently fenced.
>
> According to 570373, (2) stopped working at some point, but I haven't
> gotten enough information to adequately debug the problem.
>
> If you have a cluster which exhibits this behavior, please contact me on
> FreeNode in #linux-cluster.

Hi Lon,

I was able to re-create this issue and capture the logs as per the bug; I
will send them to your email address.

This is what it looks like from the guest:

"""
2010-10-06T23:26:31.902493+00:00 c013otin01-test fenced[1891]: c013otin07-test not a cluster member after 0 sec post_fail_delay
2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:26:36.858569+00:00 c013otin01-test clurgmgrd[3519]: <info> Waiting for node #7 to be fenced
2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:04.440548+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035370).
2010-10-06T23:27:04.440595+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:27:41.703023+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:41.703146+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:46.703283+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:19.365666+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:19.365967+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:28:24.365843+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:56.643939+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:56.644226+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:01.644127+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:29:34.171420+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:29:34.171507+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035970).
2010-10-06T23:29:34.171524+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:29:34.171578+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:39.170656+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:30:01.418667+00:00 c013otin01-test rsync_policy_files: receiving file list ... done
2010-10-06T23:30:01.418699+00:00 c013otin01-test rsync_policy_files:
2010-10-06T23:30:01.418708+00:00 c013otin01-test rsync_policy_files: sent 30 bytes received 12 bytes 84.00 bytes/sec
2010-10-06T23:30:01.418716+00:00 c013otin01-test rsync_policy_files: total size is 0 speedup is 0.00
2010-10-06T23:30:08.760903+00:00 c013otin01-test kernel: INFO: task clurgmgrd:25022 blocked for more than 120 seconds.
2010-10-06T23:30:08.760918+00:00 c013otin01-test kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2010-10-06T23:30:08.760923+00:00 c013otin01-test kernel: clurgmgrd D ffff880001064b60 0 25022 3518 25023 25019 (NOTLB)
2010-10-06T23:30:08.760926+00:00 c013otin01-test kernel:  ffff88016437fdb8 0000000000000286 0000000000000000 00000000ee8f8108
2010-10-06T23:30:08.760928+00:00 c013otin01-test kernel:  0000000000000008 ffff88098ed37080 ffff88097c9207a0 00000000000087b3
2010-10-06T23:30:08.760930+00:00 c013otin01-test kernel:  ffff88098ed37268 ffffffff8029ed82
2010-10-06T23:30:08.760933+00:00 c013otin01-test kernel: Call Trace:
2010-10-06T23:30:08.760937+00:00 c013otin01-test kernel:  [<ffffffff8029ed82>] futex_wake+0x50/0xd4
2010-10-06T23:30:08.760940+00:00 c013otin01-test kernel:  [<ffffffff8023fe9c>] do_futex+0x2c2/0xcfb
2010-10-06T23:30:08.760942+00:00 c013otin01-test kernel:  [<ffffffff802644cb>] __down_read+0x82/0x9a
2010-10-06T23:30:08.760945+00:00 c013otin01-test kernel:  [<ffffffff8830b468>] :dlm:dlm_user_request+0x2d/0x175
"""

Here is what the fence_xvmd log shows on one dom0:

"""
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
Domain                UUID                                 Owner State
------                ----                                 ----- -----
c013operations01-test 9654e57b-7bb6-019e-937b-dc009f734a13 00001 00001
c013otin01-test       6fc9063b-5e9f-ef86-5ae2-8faa5fcde84a 00001 00001
c013summary01-test    10432e54-673f-8c61-d08d-591c42adce6e 00001 00002
Domain-0              00000000-0000-0000-0000-000000000000 00001 00001
Storing c013operations01-test
Storing c013otin01-test
Storing c013summary01-test
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
"""

I did notice that the group_tool state looks a bit borked:

"""
[r...@dom0-01 ~]# group_tool
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""

Is the JOIN_STOP_WAIT state in that group_tool output the problem here? If
so, do you know how to fix it without rebooting all the nodes? I tried
"fence_tool leave" and "fence_tool join" on all the dom0s, but that didn't
resolve the problem.

Thanks

Joel
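
P.S. For the archives, here is a minimal sketch of how a stuck fence
domain can be inspected from a dom0 before resorting to a rolling reboot.
It assumes the RHEL 5 cman/groupd toolset (cman_tool, group_tool,
fence_xvm); exact commands and output formats may differ on other
releases.

"""
# Check that every dom0 agrees on cluster membership; a node that is
# stuck joining can hold the fence group in JOIN_STOP_WAIT.
cman_tool nodes

# Show the same group state that group_tool prints, from cman's side.
cman_tool services

# Dump groupd's and fenced's internal debug buffers to see which node
# or event the fence group is actually blocked on.
group_tool dump
group_tool dump fence

# Once the groups are back to a clean state, re-test guest fencing by
# hand. Note this really fences the guest; c013otin07-test is the guest
# domain name from the logs above.
fence_xvm -H c013otin07-test
"""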
