Re: [gridengine users] NFS spool dirs -- crash under heavy scheduling load

Rayson Ho Sun, 06 May 2012 11:23:48 -0700

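Check whether the jobs have LD_LIBRARY_PATH added by SGE in the job
environment. We have in the past received reports of heavy NFS loads
due to this: the dynamic linker searches every LD_LIBRARY_PATH directory
for each shared library a process loads, so with SGE on NFS every short
job generates extra NFS lookups.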


Basically, run this simple job:

#!/bin/sh

env

Then see whether LD_LIBRARY_PATH is set, or has anything from the SGE
dir added to it (assuming SGE is on your NFS share).
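
If you would rather not read through the whole env output, a variant of
the job along these lines could do the comparison for you. This is only a
rough sketch: the helper name path_list_under_dir is made up for
illustration, and it assumes SGE_ROOT is exported in the job environment
and points at the NFS-shared install.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): succeeds if any entry of the
# colon-separated path list in $1 is the directory $2 or lies beneath it.
path_list_under_dir() {
    list=$1
    dir=${2%/}                      # drop one trailing slash, if any
    [ -n "$list" ] && [ -n "$dir" ] || return 1
    old_ifs=$IFS
    IFS=:                           # split the list on colons
    for entry in $list; do
        case "$entry" in
            "$dir"|"$dir"/*) IFS=$old_ifs; return 0 ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Assumption: SGE_ROOT is set in the job environment.
if path_list_under_dir "${LD_LIBRARY_PATH:-}" "${SGE_ROOT:-}"; then
    echo "LD_LIBRARY_PATH includes an entry under SGE_ROOT: $LD_LIBRARY_PATH"
else
    echo "LD_LIBRARY_PATH has no entry under SGE_ROOT"
fi
```

Submit it with qsub like any other job script and look at the job's
output file. It only compares path prefixes, so it will not notice an
entry that reaches the NFS share through a symlink or a different mount
point.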

Rayson



On Sun, May 6, 2012 at 2:14 PM, Chris Jewell <[email protected]> wrote:
> Hi All,
>
> Apologies for cross-posting -- not sure which list is the most active these 
> days…?
>
> I'm currently having a real issue with our shared SGE_ROOT directory, which 
> also contains spool directories.  It is XFS-formatted on the server, which 
> also hosts the sgemaster daemon, and shared via NFSv4.
>
> The cluster has 108 processors, spread over 11 execution nodes, wired up with 
> 1GE.  Under heavy fast scheduling (i.e. *large* task arrays of very short jobs) 
> we are experiencing server crashes: spinning rpciod and nfsd processes, both 
> on clients and on the server, cause very high loadavg, alarm states, sgeexecd 
> going into uninterruptible sleep, machines falling over, etc.
>
> I would have thought that the NFSv4 shared directory would cope with this 
> load, since the cluster is not massive.  However, we have our scheduling 
> delay set to 0, so I'm wondering if this is causing the issue.  I'd like to 
> check your collective experience on this one, before changing the cluster 
> config to use local spool dirs.
>
> Many thanks,
>
> Chris
> --
> Dr Chris Jewell
> Department of Statistics
> University of Warwick
> Coventry
> CV4 7AL
> UK
> Tel: +44 (0)24 7615 0778
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

