Hi,

We have an 8-node cluster running SASgrid. The core SAS components are
under RHCS (rgmanager) control, but user/client jobs are also started
manually and by cron outside of RHCS. A few times now we have hit an issue
where, when the gfs init script runs on a node to unmount all the file
systems and kills off the processes using them, GFS on the other nodes
locks up and hangs. The node leaving the cluster via a reboot appears to
have left cleanly (cman_tool services doesn't show any *WAIT* states), but
everything else is hung and it takes a complete reboot of the cluster to
get things going again. We are wondering whether the way the gfs init
script kills processes, using fuser to ask them to exit gracefully but then
following up with a -9, could be leaving stale locks in DLM that cause this
hang.
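
If I'm reading the script right, the stop path does roughly the following
for each gfs mount (the /sasdata mount point and exact flags below are just
illustrative, not copied from the script):

    fuser -TERM -kvm /sasdata    # ask processes using the mount to exit
    sleep 2
    fuser -9 -kvm /sasdata       # anything still holding it gets SIGKILL
    umount /sasdata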

Is this possible? I would have thought that once a node has properly and
cleanly left the cluster, any locks it held would be released. Is there a
way to display locks that may still exist for a node that is down? And
lastly, is there a way to force the release of those locks without
rebooting the whole cluster? I've been searching the linux-cluster archives
with little success.
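
For example, is poking at the dlm debugfs lock dumps or gfs_tool lockdump
output the right direction, something like the following (lockspace name
and mount point are just placeholders for our setup):

    mount -t debugfs none /sys/kernel/debug
    cat /sys/kernel/debug/dlm/<lockspace>_locks
    gfs_tool lockdump /sasdata

or is there a better way to see locks still attributed to a node that has
already gone down?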

RHEL 5.6
cman-2.0.115-68.el5_6.3
gfs-utils-0.1.20-8.el5
kmod-gfs-0.1.34-12.el5


Thanks
Jeremy