On 8/27/2010 at 08:50 AM, Michael Smith <msm...@cbnco.com> wrote:

> Xinwei Hu <hxin...@...> writes:
> >
> > That sounds worrying actually.
> > I think this is logged as bug 585419 on SLES' bugzilla.
> > If you can reproduce this issue, it's worth reopening, I think.
>
> I've got a pair of fully patched SLES11 SP1 nodes and they're showing
> what I guess is the same behaviour: if I hard-poweroff node2, operations
> like "vgdisplay -v" hang on node1 for quite some time. Sometimes a
> minute, sometimes two, sometimes forever. They get stuck here:
>
> Aug 26 18:31:42 xen-test1 clvmd[8906]: doing PRE command LOCK_VG 'V_vm_store' at 1 (client=0x7f2714000b40)
> Aug 26 18:31:42 xen-test1 clvmd[8906]: lock_resource 'V_vm_store', flags=0, mode=3
>
> After a few seconds, corosync & dlm notice the node is gone, but
> vgdisplay and friends still hang while trying to lock the VG.
>
> Aug 26 18:31:44 xen-test1 corosync[8476]: [TOTEM ] A processor failed, forming new configuration.
> Aug 26 18:31:50 xen-test1 cluster-dlm[8870]: update_cluster: Processing membership 1260
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Skipped active node 219878572: born-on=1256, last-seen=1260, this-event=1260, last-event=1256
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: del_configfs_node: del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/236655788"
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Removed inactive node 236655788: born-on=1252, last-seen=1256, this-event=1260, last-event=1256
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:controld conf 1 0 1 memb 219878572 join left 236655788
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:ls:clvmd conf 1 0 1 memb 219878572 join left 236655788
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 remove nodeid 236655788 reason 3
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: stop_kernel: clvmd stop_kernel cg 3
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: do_sysfs: write "0" to "/sys/kernel/dlm/clvmd/control"
> Aug 26 18:31:51 xen-test1 kernel: [  365.267802] dlm: closing connection to node 236655788
> Aug 26 18:31:51 xen-test1 clvmd[8906]: confchg callback. 0 joined, 1 left, 1 members
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: fence_node_time: Node 236655788/xen-test2 has not been shot yet
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: check_fencing_done: clvmd check_fencing 23665578 not fenced add 1282861615 fence 0
> Aug 26 18:31:51 xen-test1 crmd: [8489]: info: ais_dispatch: Membership 1260: quorum still lost
> Aug 26 18:31:51 xen-test1 cluster-dlm: [8870]: info: ais_dispatch: Membership 1260: quorum still lost
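Those last cluster-dlm lines are the tell: the clvmd lockspace is stuck in
recovery waiting for fencing. A quick way to confirm that from a shell on the
surviving node (a sketch; it assumes the dlm_tool and crm_mon utilities from
the SLES11 HA stack are installed):

    # List DLM lockspaces and their state; a lockspace waiting on
    # fencing will not finish recovery
    dlm_tool ls

    # One-shot view of cluster membership and resource/fencing status
    crm_mon -1

    # DLM logs its complaint explicitly while it waits
    grep "not fenced" /var/log/messages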
Do you have STONITH configured?

Note that it says "xen-test2 has not been shot yet" and "clvmd ... not
fenced". It's just going to sit there until the down node is successfully
fenced - this is intentional, as it's not safe to keep running until you
*know* the dead node is dead.

Regards,

Tim

--
Tim Serong <tser...@novell.com>
Senior Clustering Engineer, OPS Engineering, Novell Inc.
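To make fencing actually happen, the cluster needs stonith-enabled set and a
working fencing device for each node. A minimal sketch using the crm shell -
the external/ipmi plugin and all of the parameter values here are stand-ins
for whatever management hardware the nodes really have:

    # DLM/clvmd cannot recover safely without fencing, so turn it on
    crm configure property stonith-enabled="true"

    # Hypothetical IPMI-based fencing device for xen-test2; hostname,
    # ipaddr, userid and passwd must match the node's real BMC
    crm configure primitive fence-xen-test2 stonith:external/ipmi \
        params hostname="xen-test2" ipaddr="192.168.1.12" \
               userid="admin" passwd="secret" interface="lan"

    # Don't let a node run the device that is supposed to shoot it
    crm configure location l-fence-xen-test2 fence-xen-test2 -inf: xen-test2

With something like that in place, the "has not been shot yet" wait should
resolve as soon as stonithd reports the node fenced, and the hung vgdisplay
will return.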