On 03/23/2010 01:18 PM, Greg Woods wrote:
> 
>>> On one node, I can get all services to start (and they work fine), but
>>> whenever failover occurs, there are NFS-related handles left open, thus
>>> inhibiting/hanging the failover. More specifically, the file system fails
>>> to unmount.
> 
> If you are referring to file systems on the server that are made
> available for NFS mounting that hang on unmount (it's not clear from the
> above if your cluster nodes are NFS servers or clients), then you need
> to unexport the file systems first, then you can umount them. I handled
> this by writing my own nfs-exports RA that basically just does an
> "exportfs -u" with the appropriate parameters, and used an "order" line
> in crm shell to make sure that the Filesystem resource is ordered before
> the nfs-exports resource. The nfs-exports resource will export the file
> system on start, and unexport it on stop.
> 
> --Greg
> 
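For reference, a minimal sketch of the approach Greg describes might look like the following. The export options, client network, and the resource ID "nfs_exports" are hypothetical examples, not taken from his actual RA:

```shell
# What the custom nfs-exports RA would essentially do.
# On start: export the shared path to the clients
#   exportfs -o rw 192.168.1.0/24:/data
# On stop: unexport it, so the Filesystem resource can then umount cleanly
#   exportfs -u 192.168.1.0/24:/data

# Order constraint in the crm shell: Filesystem starts before nfs-exports,
# which means nfs-exports stops (unexports) before Filesystem tries to umount.
crm configure order fs_before_exports inf: fileserver_fs0 nfs_exports
```

Since Pacemaker stops ordered resources in the reverse of their start order, ordering the Filesystem resource first is what guarantees the unexport happens before the unmount.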

This is what I am seeing for NFS-related open files on the cluster node that is
trying to perform the unmount. As you can see,
there are no open files in the shared path (/data). The PIDs referenced are NFS
kernel processes.

r...@vanessa:~# lsof | grep /data

r...@vanessa:~# lsof | grep nfs
nfsiod    15479      root  cwd       DIR      104,3     4096          2 /
nfsiod    15479      root  rtd       DIR      104,3     4096          2 /
nfsiod    15479      root  txt   unknown                                
/proc/15479/exe
nfsd4     15511      root  cwd       DIR      104,3     4096          2 /
nfsd4     15511      root  rtd       DIR      104,3     4096          2 /
nfsd4     15511      root  txt   unknown                                
/proc/15511/exe
nfsd      15512      root  cwd       DIR      104,3     4096          2 /
nfsd      15512      root  rtd       DIR      104,3     4096          2 /
nfsd      15512      root  txt   unknown                                
/proc/15512/exe
nfsd      15513      root  cwd       DIR      104,3     4096          2 /
nfsd      15513      root  rtd       DIR      104,3     4096          2 /
nfsd      15513      root  txt   unknown                                
/proc/15513/exe
nfsd      15514      root  cwd       DIR      104,3     4096          2 /
nfsd      15514      root  rtd       DIR      104,3     4096          2 /
nfsd      15514      root  txt   unknown                                
/proc/15514/exe
nfsd      15515      root  cwd       DIR      104,3     4096          2 /
nfsd      15515      root  rtd       DIR      104,3     4096          2 /
nfsd      15515      root  txt   unknown                                
/proc/15515/exe
nfsd      15516      root  cwd       DIR      104,3     4096          2 /
nfsd      15516      root  rtd       DIR      104,3     4096          2 /
nfsd      15516      root  txt   unknown                                
/proc/15516/exe
nfsd      15517      root  cwd       DIR      104,3     4096          2 /
nfsd      15517      root  rtd       DIR      104,3     4096          2 /
nfsd      15517      root  txt   unknown                                
/proc/15517/exe
nfsd      15518      root  cwd       DIR      104,3     4096          2 /
nfsd      15518      root  rtd       DIR      104,3     4096          2 /
nfsd      15518      root  txt   unknown                                
/proc/15518/exe
nfsd      15519      root  cwd       DIR      104,3     4096          2 /
nfsd      15519      root  rtd       DIR      104,3     4096          2 /
nfsd      15519      root  txt   unknown                                
/proc/15519/exe

Looks like everything but the filesystem resource stopped correctly.


r...@valerie:/# crm_resource -L
 Master/Slave Set: master-drbd1
     Masters: [ vanessa ]
     Stopped: [ drbd1:1 ]
 Resource Group: fileserver_cluster_group
     fileserver_fs0     (ocf::heartbeat:Filesystem) Started
     fileserver_vip0    (ocf::heartbeat:IPaddr) Stopped
     fileserver_nfs-common      (lsb:nfs-common) Stopped
     fileserver_nfs     (lsb:nfs-kernel-server) Stopped
     fileserver_notify_admin    (ocf::heartbeat:MailTo) Stopped

At this point, if I reboot the hung node with `echo b > /proc/sysrq-trigger`,
the resources/resource group fail over to the
good node just fine. Once the node reboots, all is good once again, that is,
until I try another failover.

Note:
Clients seem to handle the node failover quite well, even if the failover
takes a little while because I have to intervene.

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
