The NFS server is separate -- our Isilon storage. We are working with EMC to determine whether there are issues there. But in the meantime, I am trying to figure out how independent we can get from NFS to limit vulnerability to that sort of problem.
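
One rough way to audit how much of the SGE install still depends on NFS is to check each path against the node's mount table. The sketch below is only an illustration under stated assumptions: it expects Linux `/proc/mounts`-style text, the helper names (`parse_mounts`, `fstype_for`, `on_nfs`) are made up for this example, and the sample mount table (including the `isilon:/ifs/cm_shared` export name) is hypothetical. Only the two SGE paths come from the thread. It does not resolve symlinks or handle mount points containing escaped spaces.

```python
import os

# Hypothetical mount table in /proc/mounts format; the export name is invented.
SAMPLE_MOUNTS = """\
rootfs / rootfs rw 0 0
/dev/sda1 / ext4 rw 0 0
isilon:/ifs/cm_shared /cm/shared nfs rw,hard,intr 0 0
"""


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fstype) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fstype_for(path, mounts):
    """Return the fstype of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best, best_type = "", None
    for mount_point, fstype in mounts:
        mp = mount_point.rstrip("/") or "/"
        prefix = "" if mp == "/" else mp
        if path == mp or path.startswith(prefix + "/"):
            # ">=" lets a later entry for the same mount point win,
            # since later mounts shadow earlier ones.
            if len(mp) >= len(best):
                best, best_type = mp, fstype
    return best_type


def on_nfs(path, mounts):
    """True if path sits on a filesystem whose type starts with 'nfs'."""
    fstype = fstype_for(path, mounts)
    return fstype is not None and fstype.startswith("nfs")


mounts = parse_mounts(SAMPLE_MOUNTS)
for p in ("/cm/shared/apps/sge/2011.11", "/cm/local/apps/sge/var/spool"):
    print(p, "-> NFS" if on_nfs(p, mounts) else "-> local")
```

On a real node, the sample text would be replaced with the contents of `/proc/mounts`, and the path list extended to whatever `$SGE_ROOT` subdirectories are of interest.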

On Nov 12, 2014, at 11:53 AM, Feng Zhang wrote:

> Bright sets Spool to be local on each node, while the config and
> executables are on NFS if you have a HA configuration on your head servers.
> I think in theory, if the active head fails, you can bring it offline
> and make the passive head active manually, and your jobs will not be
> lost.
>
> From the error message, it looks like the NFS server has failed too, so
> the node cannot mount it. Is the NFS server installed on the failed
> head server? I remember that Bright recommends using a separate NFS
> server.
>
> On Wed, Nov 12, 2014 at 11:33 AM, Skylar Thompson
> <[email protected]> wrote:
>> Hi Eric,
>>
>> We produce our own RPMs using FPM, just so we don't have to have the
>> executables on NFS. When the NFS storage is busy, it can make GE unusable
>> and sometimes unstable (if you hit protocol timeouts) if the executables
>> and/or job spool are on NFS.
>>
>> On Wed, Nov 12, 2014 at 04:26:51PM +0000, Peskin, Eric wrote:
>>> All,
>>>
>>> Does SGE have to use NFS or can it work locally on each node?
>>> If parts of it have to be on NFS, what is the minimal subset?
>>> How much of this changes if you want redundant masters?
>>>
>>> We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE
>>> 2011.11. Specifically, SGE is provided by a Bright package:
>>> sge-2011.11-360_cm6.0.x86_64
>>>
>>> Twice, we have lost all the running SGE jobs when the cluster failed over
>>> from one head node to the other. =( Not supposed to happen.
>>> Since then, we have also had many individual jobs get lost. The latter
>>> situation correlates with messages in the system logs saying
>>>
>>>> abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd'
>>>> seems to be deleted
>>>
>>> That file lives on an NFS mount on our Isilon storage.
>>> Surely, the executables don't have to be on NFS?
>>> Interestingly, we are using local spooling; the spool directory on each node
>>> is /cm/local/apps/sge/var/spool, which is, indeed, local.
>>> But the $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
>>> Does any of it need to?
>>> Maybe just the var part would need to: /cm/shared/apps/sge/var ?
>>>
>>> Thanks,
>>> Eric
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>
>> --
>> -- Skylar Thompson ([email protected])
>> -- Genome Sciences Department, System Administrator
>> -- Foege Building S046, (206)-685-7354
>> -- University of Washington School of Medicine
