Check if the jobs have LD_LIBRARY_PATH added by SGE in the job environment. We have in the past received reports of heavy NFS loads due to this.
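A rough sketch of such a check, in case it helps: the script below prints LD_LIBRARY_PATH as the job sees it, then lists only the entries that sit under the SGE directory. It assumes a POSIX sh and that SGE_ROOT is set in the job environment. The /opt/sge fallback is just a placeholder, and sge_lib_entries is a made-up helper name, not anything shipped with SGE.

```shell
#!/bin/sh
# Sketch only: list LD_LIBRARY_PATH entries that live under the SGE directory.
# sge_lib_entries is a hypothetical helper name, not part of SGE.
sge_lib_entries() {
    # $1 = SGE root directory, $2 = colon-separated library path
    root=${1%/}            # drop one trailing slash, if any
    old_ifs=$IFS
    IFS=:                  # split the path list on ':'
    set -f                 # no filename globbing while splitting
    for dir in $2; do
        case $dir in
            "$root"|"$root"/*) printf '%s\n' "$dir" ;;
        esac
    done
    set +f
    IFS=$old_ifs
}

# Show the raw value first, then only the SGE-related entries.
# /opt/sge is a placeholder used when SGE_ROOT is not set.
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
sge_lib_entries "${SGE_ROOT:-/opt/sge}" "${LD_LIBRARY_PATH:-}"
```

If the second part prints nothing, no entry under the SGE directory is in the library path for that job.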
Basically, run this simple job:

#!/bin/sh
env

And see if LD_LIBRARY_PATH is set, or has anything under the SGE directory added to it (i.e., assuming SGE is on your NFS share).

Rayson

On Sun, May 6, 2012 at 2:14 PM, Chris Jewell <[email protected]> wrote:
> Hi All,
>
> Apologies for cross-posting -- not sure which list is the most active these
> days…?
>
> I'm currently having a real issue with our shared SGE_ROOT directory, which
> also contains spool directories. It is XFS-formatted on the server, which
> also hosts the sgemaster daemon, and shared via NFSv4.
>
> The cluster has 108 processors, spread over 11 execution nodes, wired up with
> 1GE. Under heavy fast scheduling (i.e. *large* task arrays of very short jobs)
> we are experiencing server crashes: spinning rpciod and nfsd processes both
> on clients and on the server cause very high loadavg, alarm states, sgeexecd
> to go into uninterruptible sleep states, machines falling over etc etc.
>
> I would have thought that the NFSv4 shared directory would cope with this
> load, since the cluster is not massive. However, we have our scheduling
> delay set to 0, so I'm wondering if this is causing the issue. I'd like to
> check your collective experience on this one, before changing the cluster
> config to use local spool dirs.
>
> Many thanks,
>
> Chris
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
