Also, this may have been fixed already: I'm not seeing this on our Slurm 17.x test cluster, but it does appear on our cluster running 15.x.
> On Jul 5, 2017, at 10:37 AM, Craig Yoshioka <yoshi...@ohsu.edu> wrote:
>
> Hi,
>
> I posted this a while back but didn't get any responses. I prefer using
> `srun` to invoke commands on our cluster because it is far more convenient
> than writing sbatch wrapper scripts for single-process jobs (no multiple
> steps). The problem is that if I submit too many srun jobs, the head node
> starts running out of socket resources (or something similar) and I start
> getting timeouts, and some of the srun processes start using 100% CPU.
>
> I've tried redirecting all I/O to avoid the use of sockets, etc., but I
> still see this problem. Can anyone suggest an alternative approach or fix?
> Something that doesn't require writing shell wrappers, but also doesn't
> keep a process running on the head node?
>
> Thanks,
> -Craig
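
For what it's worth, one possible workaround, assuming the jobs really are
single-step and you don't need srun's step semantics: sbatch's --wrap option
generates the trivial wrapper script for you and returns as soon as the job
is queued, so no client process lingers on the submit host. A minimal sketch
(the command and output filenames below are placeholders, not anything from
the original post):

  # sbatch builds a one-line wrapper around the given command and exits
  # immediately after queuing the job -- nothing stays resident on the
  # head node, unlike srun, which holds a connection open for the job's
  # lifetime
  sbatch --wrap="my_command input.dat" \
         --output=job_%j.out --error=job_%j.err

That trades srun's interactive behavior for fire-and-forget submission, which
may or may not fit your workflow, but it avoids both hand-written wrapper
scripts and the per-srun resources held on the head node.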