I recently changed the host name of a cluster. It may or may not be related, but afterwards I noticed the following: I can cleanly start gfs2 when the node boots, but if the node is withdrawn and I then try to rejoin it without a reboot, it hangs with this in syslog:
====
Jan 14 12:21:34 kp-a10n01 kernel: Pid: 22580, comm: kslowd000 Not tainted 2.6.32-696.18.7.el6.x86_64 #1
Jan 14 12:21:34 kp-a10n01 kernel: Call Trace:
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa0714308>] ? gfs2_lm_withdraw+0x128/0x160 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa071451d>] ? gfs2_consist_inode_i+0x5d/0x60 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b466>] ? find_good_lh+0x76/0x90 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b509>] ? gfs2_find_jhead+0x89/0x170 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8107e0ee>] ? vprintk_default+0xe/0x10
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b6ee>] ? gfs2_recover_work+0xfe/0x790 [gfs2]
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8106b73e>] ? perf_event_task_sched_out+0x2e/0x70
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8100968b>] ? __switch_to+0x6b/0x320
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8154b728>] ? schedule+0x458/0xc50
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81063883>] ? __wake_up+0x53/0x70
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121be3>] ? slow_work_execute+0x233/0x310
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121e17>] ? slow_work_thread+0x157/0x360
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a71a0>] ? autoremove_wake_function+0x0/0x40
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121cc0>] ? slow_work_thread+0x0/0x360
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557afa>] ? child_rip+0xa/0x20
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557af0>] ? child_rip+0x0/0x20
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: jid=0: Failed
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: jid=1: Trying to acquire journal lock...
Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't read in statfs inode: -5
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not found shared
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
====

I have to fence the node to get the system back up. It happens on either node, and it happens regardless of whether the peer node is connected.

The setup is GFS2 on top of clvmd on an RHCS cluster on RHEL 6.

Would configs help?

digimer

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould