On a fresh boot, fsck.gfs2 found no errors on either node.

On 2018-01-15 01:03 AM, Ulrich Windl wrote:
> I'd deal with "fatal: filesystem consistency error" first.
>
>
>>>> Digimer <li...@alteeve.ca> wrote on 14.01.2018 at 21:48 in message
> <6a036895-8964-ca76-3774-4b7e9bcf5...@alteeve.ca>:
>> On 2018-01-14 12:29 PM, Digimer wrote:
>>> I recently changed the host name of a cluster. It may or may not be
>>> related, but after I noticed that I can cleanly start gfs2 when the node
>>> boots. However, if the node is withdrawn and then I try to rejoin it
>>> without a reboot, it hangs with this in syslog;
>>>
>>> ====
>>> Jan 14 12:21:34 kp-a10n01 kernel: Pid: 22580, comm: kslowd000 Not
>>> tainted 2.6.32-696.18.7.el6.x86_64 #1
>>> Jan 14 12:21:34 kp-a10n01 kernel: Call Trace:
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa0714308>] ?
>>> gfs2_lm_withdraw+0x128/0x160 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa071451d>] ?
>>> gfs2_consist_inode_i+0x5d/0x60 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b466>] ?
>>> find_good_lh+0x76/0x90 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b509>] ?
>>> gfs2_find_jhead+0x89/0x170 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8107e0ee>] ?
>>> vprintk_default+0xe/0x10
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffffa070b6ee>] ?
>>> gfs2_recover_work+0xfe/0x790 [gfs2]
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8106b73e>] ?
>>> perf_event_task_sched_out+0x2e/0x70
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8100968b>] ?
>>> __switch_to+0x6b/0x320
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff8154b728>] ?
>>> schedule+0x458/0xc50
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81063883>] ?
>>> __wake_up+0x53/0x70
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121be3>] ?
>>> slow_work_execute+0x233/0x310
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121e17>] ?
>>> slow_work_thread+0x157/0x360
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a71a0>] ?
>>> autoremove_wake_function+0x0/0x40
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81121cc0>] ?
>>> slow_work_thread+0x0/0x360
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557afa>] ?
>>> child_rip+0xa/0x20
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
>>> Jan 14 12:21:34 kp-a10n01 kernel: [<ffffffff81557af0>] ?
>>> child_rip+0x0/0x20
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>>> jid=0: Failed
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>>> jid=1: Trying to acquire journal lock...
>>> Jan 14 12:21:34 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't
>>> read in statfs inode: -5
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 gfs_controld[20749]: recovery_uevent mg not
>>> found shared
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
>>> Jan 14 12:21:34 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
>>> ====
>>>
>>> I have to fence the node to get the system back up. It happens on either
>>> node, and it happens regardless of the peer node being connected.
>>>
>>> GFS2 on top of clvmd on an rhcs cluster on RHEL 6. Would configs help?
>>>
>>> digimer
>>>
>>
>> Happened again (well, many times, but here's the log output from another
>> hang);
>>
>> ====
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2 (built Jan 4 2018 17:32:36)
>> installed
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=: Trying to join cluster
>> "lock_dlm", "kp-anvil-10:shared"
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> Joined cluster. Now mounting FS...
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0, already locked for use
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0: Looking at journal...
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> fatal: filesystem consistency error
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> inode = 4 25
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> function = find_good_lh, file = fs/gfs2/recovery.c, line = 205
>> Jan 14 12:46:41 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: about
>> to withdraw this file system
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> telling LM to unmount
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> withdrawn
>> Jan 14 12:46:42 kp-a10n01 kernel: Pid: 11668, comm: kslowd000 Not
>> tainted 2.6.32-696.18.7.el6.x86_64 #1
>> Jan 14 12:46:42 kp-a10n01 kernel: Call Trace:
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06fb308>] ?
>> gfs2_lm_withdraw+0x128/0x160 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06fb51d>] ?
>> gfs2_consist_inode_i+0x5d/0x60 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f2466>] ?
>> find_good_lh+0x76/0x90 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f2509>] ?
>> gfs2_find_jhead+0x89/0x170 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8107e0ee>] ?
>> vprintk_default+0xe/0x10
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffffa06f26ee>] ?
>> gfs2_recover_work+0xfe/0x790 [gfs2]
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8106b73e>] ?
>> perf_event_task_sched_out+0x2e/0x70
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81074a83>] ?
>> dequeue_entity+0x113/0x2e0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8100968b>] ?
>> __switch_to+0x6b/0x320
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8154b728>] ?
>> schedule+0x458/0xc50
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff8107543b>] ?
>> enqueue_task_fair+0xfb/0x100
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121be3>] ?
>> slow_work_execute+0x233/0x310
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121e17>] ?
>> slow_work_thread+0x157/0x360
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a71a0>] ?
>> autoremove_wake_function+0x0/0x40
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81121cc0>] ?
>> slow_work_thread+0x0/0x360
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a6d0e>] ? kthread+0x9e/0xc0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81557afa>] ? child_rip+0xa/0x20
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff810a6c70>] ? kthread+0x0/0xc0
>> Jan 14 12:46:42 kp-a10n01 kernel: [<ffffffff81557af0>] ? child_rip+0x0/0x20
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=0: Failed
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0:
>> jid=1: Trying to acquire journal lock...
>> Jan 14 12:46:42 kp-a10n01 kernel: GFS2: fsid=kp-anvil-10:shared.0: can't
>> read in statfs inode: -5
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 gfs_controld[5371]: recovery_uevent mg not
>> found shared
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,19 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,19 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,805b err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,18 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,17 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 5,16 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 1,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 4,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,17 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 2,18 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 9,0 err=-22
>> Jan 14 12:46:42 kp-a10n01 kernel: gdlm_unlock 1,1 err=-22
>> ====
>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com/w/
>> "I am, somehow, less interested in the weight and convolutions of
>> Einstein’s brain than in the near certainty that people of equal talent
>> have lived and died in cotton fields and sweatshops."
>> - Stephen Jay Gould
>>
>> _______________________________________________
>> Users mailing list: Users@clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould