On 03/23/2010 01:18 PM, Greg Woods wrote:
>
>>> On one node, i can get all services to start(and they work fine), but
>>> whenever fail over occurs, there's nfs related handles left open thus
>>> inhibiting/hanging the fail over. more specifically, the file systems fails
>>> to unmount.
>
> If you are referring to file systems on the server that are made
> available for NFS mounting that hang on unmount (it's not clear from the
> above if your cluster nodes are NFS servers or clients), then you need
> to unexport the file systems first, then you can umount them. I handled
> this by writing my own nfs-exports RA that basically just does an
> "exportfs -u" with the appropriate parameters, and used an "order" line
> in crm shell to make sure that the Filesystem resource is ordered before
> the nfs-exports resource. The nfs-exports resource will export the file
> system on start, and unexport it on stop.
>
> --Greg
>
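For reference, the export-on-start / unexport-on-stop idea described above could be sketched roughly like this. It is only a minimal sketch, not Greg's actual RA: the function names, the variables, and their defaults (/data, all clients, rw,sync) are assumptions made for illustration, and a real OCF resource agent would also need monitor, meta-data and validate actions.

```shell
#!/bin/sh
# Hypothetical sketch of the start/stop half of an "nfs-exports" style helper.
# All names and defaults below are illustrative assumptions.

EXPORTFS="${EXPORTFS:-exportfs}"        # override for dry runs or testing
EXPORT_DIR="${EXPORT_DIR:-/data}"       # shared path from this thread
EXPORT_CLIENT="${EXPORT_CLIENT:-*}"     # assumed: export to all clients
EXPORT_OPTS="${EXPORT_OPTS:-rw,sync}"   # assumed export options

# start: export the file system (it must already be mounted).
nfs_export_start() {
    # $EXPORTFS is left unquoted on purpose so it may carry a wrapper command.
    $EXPORTFS -o "$EXPORT_OPTS" "$EXPORT_CLIENT:$EXPORT_DIR"
}

# stop: unexport the file system so that a later umount is not blocked
# by the kernel NFS server still holding the export.
nfs_export_stop() {
    $EXPORTFS -u "$EXPORT_CLIENT:$EXPORT_DIR"
}
```

With something like this wrapped in a resource, an order constraint in crm shell along the lines of "order fs_before_exports inf: fileserver_fs0 fileserver_exports" (both IDs on the right are hypothetical here) should make the cluster start the export after the mount and stop it before the unmount.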
This is what I am seeing for NFS-related open files on the cluster node that
is trying to perform the unmount. As you can see, there are no open files in
the shared path (/data). The PIDs referenced are NFS kernel processes.

r...@vanessa:~# lsof | grep /data
r...@vanessa:~# lsof | grep nfs
nfsiod  15479  root  cwd  DIR      104,3  4096  2  /
nfsiod  15479  root  rtd  DIR      104,3  4096  2  /
nfsiod  15479  root  txt  unknown                  /proc/15479/exe
nfsd4   15511  root  cwd  DIR      104,3  4096  2  /
nfsd4   15511  root  rtd  DIR      104,3  4096  2  /
nfsd4   15511  root  txt  unknown                  /proc/15511/exe
nfsd    15512  root  cwd  DIR      104,3  4096  2  /
nfsd    15512  root  rtd  DIR      104,3  4096  2  /
nfsd    15512  root  txt  unknown                  /proc/15512/exe
nfsd    15513  root  cwd  DIR      104,3  4096  2  /
nfsd    15513  root  rtd  DIR      104,3  4096  2  /
nfsd    15513  root  txt  unknown                  /proc/15513/exe
nfsd    15514  root  cwd  DIR      104,3  4096  2  /
nfsd    15514  root  rtd  DIR      104,3  4096  2  /
nfsd    15514  root  txt  unknown                  /proc/15514/exe
nfsd    15515  root  cwd  DIR      104,3  4096  2  /
nfsd    15515  root  rtd  DIR      104,3  4096  2  /
nfsd    15515  root  txt  unknown                  /proc/15515/exe
nfsd    15516  root  cwd  DIR      104,3  4096  2  /
nfsd    15516  root  rtd  DIR      104,3  4096  2  /
nfsd    15516  root  txt  unknown                  /proc/15516/exe
nfsd    15517  root  cwd  DIR      104,3  4096  2  /
nfsd    15517  root  rtd  DIR      104,3  4096  2  /
nfsd    15517  root  txt  unknown                  /proc/15517/exe
nfsd    15518  root  cwd  DIR      104,3  4096  2  /
nfsd    15518  root  rtd  DIR      104,3  4096  2  /
nfsd    15518  root  txt  unknown                  /proc/15518/exe
nfsd    15519  root  cwd  DIR      104,3  4096  2  /
nfsd    15519  root  rtd  DIR      104,3  4096  2  /
nfsd    15519  root  txt  unknown                  /proc/15519/exe

It looks like everything but the Filesystem resource stopped correctly.

r...@valerie:/# crm_resource -L
 Master/Slave Set: master-drbd1
     Masters: [ vanessa ]
     Stopped: [ drbd1:1 ]
 Resource Group: fileserver_cluster_group
     fileserver_fs0            (ocf::heartbeat:Filesystem)  Started
     fileserver_vip0           (ocf::heartbeat:IPaddr)      Stopped
     fileserver_nfs-common     (lsb:nfs-common)             Stopped
     fileserver_nfs            (lsb:nfs-kernel-server)      Stopped
     fileserver_notify_admin   (ocf::heartbeat:MailTo)      Stopped

At this point, if I reboot the hung node with `echo b > /proc/sysrq-trigger`,
the resources/resource group fail over to the good node just fine. Once the
node reboots, all is good once again, that is, until I try again.

Note: Clients seem to handle the node failover quite well, even if the
failover takes a little while because I have to intervene.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems