Problem: I/O on one machine in the cluster (not always the same machine) hangs, and all processes accessing clustered LVs block. The other machines follow suit shortly thereafter and stay hung until the machine that first exhibited the problem is rebooted (manually, via fence_drac). There are no messages in dmesg, syslog, etc. The filesystems were recently fsck'd.

Hardware:
Dell 1950s (similar except memory -- 3x 16GB RAM, 1x 8GB RAM). Running RHEL4 ES U7.
  Four machines
  Onboard gigabit NICs (machines use little bandwidth, and all network traffic, including DLM, shares the NICs)
  QLogic 2462 PCI-Express dual channel FC HBAs
  QLogic SANBox 5200 FC switch
  Apple XRAID which presents as two LUNs (~4.5TB raw aggregate)
  Cisco Catalyst switch

Simple four machine RHEL4 U7 cluster running kernel 2.6.9-78.0.1.ELsmp x86_64 with the following packages:
  ccs-1.0.12-1
  cman-1.0.24-1
  cman-kernel-smp-2.6.9-55.13.el4_7.1
  cman-kernheaders-2.6.9-55.13.el4_7.1
  dlm-kernel-smp-2.6.9-54.11.el4_7.1
  dlm-kernheaders-2.6.9-54.11.el4_7.1
  fence-1.32.63-1.el4_7.1
  GFS-6.1.18-1
  GFS-kernel-smp-2.6.9-80.9.el4_7.1

One clustered VG, striped across two physical volumes, which correspond to each side of an Apple XRAID.

Clustered volume group info:
  --- Volume group ---
  VG Name               hq-san
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  50
  VG Access             read/write
  VG Status             resizable
  Clustered             yes
  Shared                no
  MAX LV                0
  Cur LV                3
  Open LV               3
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               4.55 TB
  PE Size               4.00 MB
  Total PE              1192334
  Alloc PE / Size       905216 / 3.45 TB
  Free  PE / Size       287118 / 1.10 TB
  VG UUID               hfeIhf-fzEq-clCf-b26M-cMy3-pphm-B6wmLv

Logical volumes contained within hq-san VG:
  cam_development  hq-san  -wi-ao  500.00G
  qa               hq-san  -wi-ao    1.07T
  svn_users        hq-san  -wi-ao    1.89T

All four machines mount svn_users, two machines mount qa, and one mounts cam_development.
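
As a sanity check on the volume group figures above, the extent counts can be converted back into sizes. This is only a sketch: it assumes LVM2 is reporting binary units (1 TB = 1024 * 1024 MB) under the "TB" label, and the names pe_to_tb, PE_SIZE_MB, and so on are made up for illustration. The numbers themselves are copied from the output above.

```python
# Figures copied from the volume group output above.
PE_SIZE_MB = 4.00      # PE Size
TOTAL_PE = 1192334     # Total PE
ALLOC_PE = 905216      # Alloc PE
FREE_PE = 287118       # Free PE

# Assumption: "TB" in the LVM2 output means 1024 * 1024 MB.
MB_PER_TB = 1024 * 1024


def pe_to_tb(extents, pe_size_mb=PE_SIZE_MB):
    """Convert a count of physical extents to TB (binary units assumed)."""
    return extents * pe_size_mb / MB_PER_TB


# Allocated and free extents should account for every extent in the VG.
assert ALLOC_PE + FREE_PE == TOTAL_PE

for label, extents in (("VG Size", TOTAL_PE),
                       ("Alloc", ALLOC_PE),
                       ("Free", FREE_PE)):
    print(f"{label}: {pe_to_tb(extents):.2f} TB")
```

Under that assumption the three results round to the 4.55 TB, 3.45 TB, and 1.10 TB reported above, so the extent accounting is at least internally consistent.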

/etc/cluster/cluster.conf:

<?xml version="1.0"?>
<cluster alias="tungsten" config_version="31" name="qualia">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="odin" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="odin-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hugin" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="hugin-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="munin" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="munin-drac"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="zeus" votes="1">
      <fence>
        <method name="1">
          <device modulename="" name="zeus-drac"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="0"/>
  <fencedevices>
    <resources/>
    <fencedevice name="odin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
    <fencedevice name="hugin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
    <fencedevice name="munin-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
    <fencedevice name="zeus-drac" agent="fence_drac" ipaddr="redacted" login="root" passwd="redacted"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
    <resources/>
  </rm>
</cluster>

--
Shawn Hood
910.670.1819 m

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster