Hi All,

Apologies for cross-posting -- not sure which list is the most active these 
days…?

I'm currently having a real issue with our shared SGE_ROOT directory, which 
also contains the spool directories.  It is XFS-formatted on the server, which 
also hosts the sgemaster daemon, and is shared via NFSv4.

The cluster has 108 processors, spread over 11 execution nodes, wired up with 
1GbE.  Under heavy fast scheduling (i.e. *large* task arrays of very short jobs) 
we are experiencing server crashes: spinning rpciod and nfsd processes, both on 
the clients and on the server, cause very high loadavg, alarm states, sgeexecd 
going into uninterruptible sleep, machines falling over, etc.

I would have thought that the NFSv4 shared directory would cope with this load, 
since the cluster is not massive.  However, we have our scheduling delay set to 
0, so I'm wondering if this is causing the issue.  I'd like to check your 
collective experience on this one, before changing the cluster config to use 
local spool dirs.

Many thanks,

Chris
--
Dr Chris Jewell
Department of Statistics
University of Warwick
Coventry
CV4 7AL
UK
Tel: +44 (0)24 7615 0778

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
