I have SoGE running on CentOS 6.5 on Dell520 blades.
I have a master and several shadow master hosts.
My master is sick: the local root filesystem is behaving very strangely, and the only thing left to try is a reboot. The NFS filesystems work fine, and the messages file is clean. Any command that looks at the root filesystem hangs, for example ls / or df -h. I am able to log in (my home directory is on NFS), but other users are having trouble logging in. The root user works fine even though its home directory is on the root filesystem.


The reason I am writing is that I want to manually fail over to my shadow master, and I was unable to find any best practices for doing so. I know that just taking down the master should fail over to the first shadow master on the list, but I would rather do it cleanly. Also, how do I fail back to my original master, or does that happen automatically once it is back online?
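
To make "first on the list" concrete, here is a minimal sketch of how the current master and the remaining listed hosts could be read from the cell's common directory. It assumes the usual Grid Engine layout, where $SGE_ROOT/$SGE_CELL/common holds an act_qmaster file (the host currently acting as master) and a shadow_masters file (one host per line). The function names read_hosts and failover_candidates are hypothetical helpers for illustration only. The script only reads the two files and does not trigger any failover.

```python
import os
from pathlib import Path


def read_hosts(path):
    """Return the non-empty, non-comment lines of a host file, in file order."""
    hosts = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts


def failover_candidates(common_dir):
    """Return (current_master, other_hosts) for a cell's common directory.

    other_hosts are the entries of shadow_masters, in file order, that are
    not the host named in act_qmaster (compared case-insensitively).
    """
    common = Path(common_dir)
    current = read_hosts(common / "act_qmaster")[0]
    listed = read_hosts(common / "shadow_masters")
    others = [h for h in listed if h.lower() != current.lower()]
    return current, others


if __name__ == "__main__":
    # Assumption: SGE_ROOT is set and SGE_CELL falls back to "default".
    sge_root = os.environ.get("SGE_ROOT")
    if sge_root:
        cell = os.environ.get("SGE_CELL", "default")
        common_dir = Path(sge_root) / cell / "common"
        if (common_dir / "act_qmaster").exists():
            master, candidates = failover_candidates(common_dir)
            print("current master:", master)
            print("other listed hosts, in order:", ", ".join(candidates))
```

Running this before and after a failover would show whether act_qmaster now names a different host.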


If you have any ideas about the precipitating problem, that would be awesome. Remember: NFS works great, I could find nothing in the error logs, and commands that live on the root filesystem (in /bin and /sbin) work as long as they are not examining root. We have run Dell's OM software. I have checked the logs, the filesystem usage (almost empty), and the size of the log files (most were small). top indicated the CPU is no more than 1% utilized and memory is 1/3 utilized, and I cannot find any network errors.

Thanks
Dan
