[gridengine users] need to take down master due to root filesystem issues

Dan Hyatt Wed, 17 Sep 2014 22:06:07 -0700


I have SoGE running on CentOS 6.5 on Dell520 blades.
I have a master and several shadow master hosts.

My master is sick: the local root filesystem is behaving very strangely, and the only thing left to try is a reboot. The NFS filesystems work fine. The messages file is clean. Any command that looks at the root filesystem, such as ls / or df -h, hangs. I am able to log in (my home directory is on NFS), but other users are having trouble logging in. The root user works fine even though its home directory is on the root filesystem.

The reason I am writing is that I want to manually fail over to my shadow master. I was unable to find any best practices. I know that just taking down the master should fail over to the first shadow master on the list, but I would rather do it cleanly. Also, how do I fail back to my original master, or does that happen automatically once it is back online?

If you have any ideas on the precipitating problem, that would be awesome. Remember, NFS works great, I could find nothing in the error logs, and commands on the root filesystem (/bin and /sbin) work as long as they are not examining root. We have run Dell's OM software, and I have checked the logs, the filesystem usage (almost empty), and the size of the log files (most were small). top indicated the CPU is no more than 1% utilized and memory is 1/3 utilized, and I cannot find any network errors.


Thanks
Dan
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
