Re: [gridengine users] SGE and NFS

Peskin, Eric Wed, 12 Nov 2014 09:36:58 -0800

The NFS server is separate -- our Isilon storage.  We are working with EMC to 
determine whether there are issues there.  But in the meantime, I am trying to 
figure out how independent we can get from NFS to limit vulnerability to that 
sort of problem.
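
As a starting point for that audit, here is a rough sketch (not anything shipped with SGE or Bright) that reports which filesystem type backs a given path by reading /proc/mounts on a node. The helper names are made up for illustration, the two paths are the ones mentioned in this thread, and it assumes mount points contain no spaces.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fstype) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fstype_for(path, mounts):
    """Return the fstype of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best_point, best_type = "", None
    for point, fstype in mounts:
        # Compare on directory boundaries so /cm/shared does not match /cm/sharedx.
        prefix = point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(point) >= len(best_point):
            best_point, best_type = point, fstype
    return best_type


def on_nfs(path, mounts):
    """True if path is backed by an NFS mount (fstype nfs, nfs4, ...)."""
    fstype = fstype_for(path, mounts)
    return fstype is not None and fstype.startswith("nfs")


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        mounts = parse_mounts(f.read())
    # Paths taken from this thread; adjust for your own layout.
    for p in ("/cm/shared/apps/sge/2011.11", "/cm/local/apps/sge/var/spool"):
        print(p, fstype_for(p, mounts), "NFS" if on_nfs(p, mounts) else "local")
```

Running it on an exec node should show at a glance whether $SGE_ROOT and the spool directory depend on the NFS mount.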




On Nov 12, 2014, at 11:53 AM, Feng Zhang wrote:

> Bright sets the spool to be local on each node, while the config and
> executables are on NFS if you have an HA configuration on your head servers.
> I think, in theory, if the active head fails, you can take it offline
> and make the passive head active manually, and your jobs will not be
> lost.
> 
> From the error message, it looks like the NFS server has failed too, so
> the node cannot mount it. Is the NFS server installed on the failed
> head server? I remember that Bright recommends using a separate NFS
> server.
> 
> On Wed, Nov 12, 2014 at 11:33 AM, Skylar Thompson
> <[email protected]> wrote:
>> Hi Eric,
>> 
>> We produce our own RPMs using FPM, just so we don't have to have the
>> executables on NFS. When the NFS storage is busy, it can make GE unusable
>> and sometimes unstable (if you hit protocol timeouts) if the executables
>> and/or job spool are on NFS.
>> 
>> On Wed, Nov 12, 2014 at 04:26:51PM +0000, Peskin, Eric wrote:
>>> All,
>>> 
>>> Does SGE have to use NFS or can it work locally on each node?
>>> If parts of it have to be on NFS, what is the minimal subset?
>>> How much of this changes if you want redundant masters?
>>> 
>>> We have a cluster running CentOS 6.3, Bright Cluster Manager 6.0, and SGE 
>>> 2011.11.  Specifically, SGE is provided by a Bright package: 
>>> sge-2011.11-360_cm6.0.x86_64
>>> 
>>> Twice, we have lost all the running SGE jobs when the cluster failed over 
>>> from one head node to the other.  =( Not supposed to happen.
>>> Since then, we have also had many individual jobs get lost.  The latter
>>> situation correlates with messages in the system logs saying
>>> 
>>>> abrt[9007]: File '/cm/shared/apps/sge/2011.11/bin/linux-x64/sge_execd' 
>>>> seems to be deleted
>>> 
>>> That file lives on an NFS mount on our Isilon storage.
>>> Surely, the executables don't have to be on NFS?
>>> Interestingly, we are using local spooling; the spool directory on each node
>>> is /cm/local/apps/sge/var/spool, which is indeed local.
>>> But the $SGE_ROOT, /cm/shared/apps/sge/2011.11, lives on NFS.
>>> Does any of it need to?
>>> Maybe just the var part would need to:  /cm/shared/apps/sge/var ?
>>> 
>>> Thanks,
>>> Eric
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>> 
>> --
>> -- Skylar Thompson ([email protected])
>> -- Genome Sciences Department, System Administrator
>> -- Foege Building S046, (206)-685-7354
>> -- University of Washington School of Medicine


