Bright sets the spool directory to be local on each node, while the
configuration and executables are on NFS if you have an HA configuration
on your head servers. In theory, I think, if the active head fails you
can take it offline and make the passive head active manually, and your
jobs will not be lost.

From the error message, it looks like the NFS server has failed too, so
the node cannot mount it. Is the NFS server installed on the failed
head server? I remember that Bright recommends using a separate NFS
server.
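
If it helps, here is a rough Python sketch for checking on a node which
of the SGE paths from Eric's message actually sit on an NFS mount. It
assumes a Linux node that exposes /proc/mounts; the function names
(parse_mounts, mount_for, on_nfs, report) are made up for illustration,
and it does not handle mount points containing escaped spaces.

```python
import os

# Filesystem types treated as NFS in this sketch (an assumption; extend as needed).
NFS_TYPES = {"nfs", "nfs4"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fstype) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def mount_for(path, entries):
    """Return (mount_point, fstype) for the longest mount point containing path."""
    path = os.path.normpath(path)
    best = None
    for mount_point, fstype in entries:
        mp = mount_point.rstrip("/") or "/"
        if path == mp or mp == "/" or path.startswith(mp + "/"):
            # ">=" lets a later entry for the same mount point win.
            if best is None or len(mp) >= len(best[0]):
                best = (mp, fstype)
    return best


def on_nfs(path, entries):
    """True if the mount containing path has an NFS filesystem type."""
    found = mount_for(path, entries)
    return found is not None and found[1] in NFS_TYPES


def report(paths, mounts_file="/proc/mounts"):
    """Print, for each path, whether it appears to live on NFS."""
    with open(mounts_file) as handle:
        entries = parse_mounts(handle.read())
    for path in paths:
        label = "NFS" if on_nfs(path, entries) else "not NFS"
        print(f"{path}: {label}")
```

Running report(["/cm/shared/apps/sge/2011.11",
"/cm/local/apps/sge/var/spool"]) on a compute node should show whether
$SGE_ROOT and the spool directory are on NFS or on local disk.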

On Wed, Nov 12, 2014 at 11:33 AM, Skylar Thompson
<[email protected]> wrote:
> Hi Eric,
>
> We produce our own RPMs using FPM, just so we don't have to have the
> executables on NFS. When the NFS storage is busy, it can make GE unusable
> and sometimes unstable (if you hit protocol timeouts) if the executables
> and/or job spool are on NFS.
>
> On Wed, Nov 12, 2014 at 04:26:51PM +0000, Peskin, Eric wrote:
>> All,
>>
>> Does SGE have to use NFS or can it work locally on each node?
>> If parts of it have to be on NFS, what is the minimal subset?
>> How much of this changes if you want redundant masters?
>>
>> We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE 
>> 2011.11.  Specifically, SGE is provided by a Bright package: 
>> sge-2011.11-360_cm6.0.x86_64
>>
>> Twice, we have lost all the running SGE jobs when the cluster failed over 
>> from one head node to the other.  =( Not supposed to happen.
>> Since then, we have also had many individual jobs get lost.  The latter
>> situation correlates with messages in the system logs saying
>>
>> > abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd' 
>> > seems to be deleted
>>
>> That file lives on an NFS mount on our Isilon storage.
>> Surely, the executables don't have to be on NFS?
>> Interestingly, we are using local spooling; the spool directory on each node
>> is /cm/local/apps/sge/var/spool, which is indeed local.
>> But $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
>> Does any of it need to?
>> Maybe just the var part would need to: /cm/shared/apps/sge/var?
>>
>> Thanks,
>> Eric
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
> --
> -- Skylar Thompson ([email protected])
> -- Genome Sciences Department, System Administrator
> -- Foege Building S046, (206)-685-7354
> -- University of Washington School of Medicine
