I have SoGE running on CentOS 6.5 on Dell520 blades.
I have a master and several shadow master hosts.
My master is sick: the local root filesystem is behaving very strangely,
and the only thing left to try is a reboot. The NFS filesystems work fine.
The messages file is clean.
When I use any command that looks at the root filesystem, such as
ls / or df -h, it hangs.
I am able to log in (my home directory is on NFS), but other users are
having trouble logging in. The root user works fine even though its home
directory is on the root filesystem.
The reason I am writing is that I want to manually fail over to my shadow
master. I was unable to find any best practices. I know that just taking
down the master should fail over to the first shadow master on the list,
but I would rather do it cleanly. Also, how do I fail back to my original
master, or does that happen automatically once it is back online?
If you have any ideas on the precipitating problem, that would be
awesome. Remember, NFS works great, there is nothing in the error logs
that I could find, and commands on the root filesystem (/bin and /sbin)
work as long as they are not examining root.
We have run Dell's OM software. I have checked the logs, checked the
filesystem (almost empty), and checked the size of the log files (most
were small). top indicates the CPU is no more than 1% utilized and memory
is 1/3 utilized, and I cannot find any network errors.
Thanks
Dan