Re: [slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-13 Thread Kilian Cavalotti
Hi Jürgen, I would take a look at the various *KmemSpace options in cgroup.conf; they can certainly help with this. Cheers, -- Kilian
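For reference, a minimal cgroup.conf sketch along those lines (the exact *KmemSpace options available depend on the Slurm version, and the values are illustrative rather than recommendations):

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes        # enforce the job's memory request via the memory cgroup
ConstrainSwapSpace=yes
ConstrainKmemSpace=yes       # also bound kernel memory, which covers slab/page-cache-related usage
MaxKmemPercent=100           # cap kernel memory at 100% of the allocated RAM
MinKmemSpace=30              # floor (in MB) so small allocations keep some kernel-memory headroom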

[slurm-users] ConstrainRAMSpace=yes and page cache?

2019-06-13 Thread Juergen Salk
Dear all, I'm just starting to get used to Slurm and play around with it in a small test environment within our old cluster. For our next system we will probably have to abandon our current exclusive user node access policy in favor of a shared user policy, i.e. jobs from different users will

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher W. Harrop
> ... >> One way I'm using to work around this is to inject a long random string >> into the --comment option. Then, if I see the socket timeout, I use squeue >> to look for that job and retrieve its ID. It's not ideal, but it can work. > > I would have expected a different approach: use a
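A rough shell sketch of that workaround, assuming a hypothetical job script job.sh (squeue's %i and %k format specifiers are the job id and comment fields):

TAG=$(uuidgen)                       # unique string to tag this submission
JOBID=$(sbatch --parsable --comment="$TAG" job.sh)
if [ -z "$JOBID" ]; then
    # sbatch timed out on the reply, but the job may still have been created;
    # look it up by the comment we injected.
    JOBID=$(squeue -u "$USER" -h -o "%i %k" | awk -v t="$TAG" '$2 == t {print $1}')
fi
echo "submitted job: $JOBID"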

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread John Hearns
I agree with Christopher Coffey - look at the sssd caching. I have had experience with sssd and can help a bit. Also, if you are seeing long waits, could you have nested groups? sssd is notorious for not handling these well, and there are settings in the configuration file which you can experiment
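Not from the thread itself, but the kind of sssd.conf settings being alluded to look roughly like this (the domain name is a placeholder; option availability depends on the sssd version and identity provider):

[domain/example.com]
enumerate = False                 # don't try to cache the entire directory
ignore_group_members = True       # skip expanding full member lists of large groups
ldap_group_nesting_level = 1      # limit nested-group resolution (LDAP provider only)
entry_cache_timeout = 5400        # seconds to serve cached entries before re-querying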

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Mark Hahn
On Thu, 13 Jun 2019, Christopher Harrop - NOAA Affiliate wrote: ... One way I'm using to work around this is to inject a long random string into the --comment option. Then, if I see the socket timeout, I use squeue to look for that job and retrieve its ID. It's not ideal, but it can work. I

Re: [slurm-users] increasing timelimit on array jobs no longer supported?

2019-06-13 Thread Bill Wichser
Thanks. Had no problem setting the individual element of the array. Just thought that it worked differently in the past! Memory apparently isn't what it used to be! Thanks again, Bill

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Jeffrey Frey
The error message cited is associated with SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT, which is only ever raised by slurm_send_timeout() and slurm_recv_timeout(). Those functions raise that error when a generic socket-based send/receive operation exceeds an arbitrary time limit imposed by the caller.
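If the caller-imposed limit in question is the one derived from MessageTimeout in slurm.conf (default 10 seconds), which it is for a number of client RPC paths, raising it can buy some headroom while the underlying slowness is investigated. A minimal slurm.conf excerpt, value illustrative:

MessageTimeout=30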

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Christopher Harrop - NOAA Affiliate
Hi, My group is struggling with this also. The worst part of this, which no one has brought up yet, is that the sbatch command does not necessarily fail to submit the job in this situation. In fact, most of the time (for us), it succeeds. There appears to be some sort of race condition or

Re: [slurm-users] increasing timelimit on array jobs no longer supported?

2019-06-13 Thread Jacob Jenson
Bill, You can always set the time limit on a job array to a specific value: # scontrol update jobid=123 timelimit=45 You can also increment the time limit on a job array that is still in a single job record. Separate job records are split off as needed, say when a task starts or an attempt is
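In scontrol terms that looks something like the following (the array job id is taken from the original post; the task index and time value are illustrative):

# set an absolute limit on the whole array record
scontrol update jobid=3136818 timelimit=60-00:00:00

# or target a single array task that already has its own job record
scontrol update jobid=3136818_17 timelimit=60-00:00:00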

[slurm-users] increasing timelimit on array jobs no longer supported?

2019-06-13 Thread Bill Wichser
# scontrol update jobid=3136818 timelimit+=30-00:00:00 scontrol: error: TimeLimit increment/decrement not supported for job arrays. This is new to 18.08.7, it appears. Am I just missing something here? Bill