Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Paul, Thanks a lot for your suggestion. The cluster I'm using has thousands of users, so I'm doubtful the admins will change this setting just for me. But I'll mention it to the support team I'm working with. I was hoping more for something that can be done on the user end. Is there some way

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Ole Holm Nielsen
Hi Guillaume, The performance of the slurmctld server depends strongly on the server hardware on which it is running! This should be taken into account when considering your question. SchedMD recommends that the slurmctld server should have only a few, but very fast CPU cores, in order to e

Re: [slurm-users] Dependencies with singleton and after

2019-08-27 Thread Jarno van der Kolk
Hi all, I'm still puzzled by the expected behaviour of the following: $ sbatch --hold fakejob.sh Submitted batch job 25909273 $ sbatch --hold fakejob.sh Submitted batch job 25909274 $ sbatch --hold fakejob.sh Submitted batch job 25909275 $ scontrol update jobid=25909273 Dependency=singleton $ scon
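A minimal sketch of the hold/singleton pattern being exercised above (the job ID 100 is illustrative; since all three submissions use the same script, they share the default job name, which is what singleton keys on):

$ sbatch --hold fakejob.sh        # submit held so the dependency can be set first
Submitted batch job 100
$ scontrol update jobid=100 Dependency=singleton
$ scontrol release 100            # singleton now gates the start: at most one job
                                  # with this name may run per user at a time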

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Paul Edmon
At least for our cluster we generally recommend that if you are submitting large numbers of jobs you either use a job array or a simple for loop over the jobs you want to submit. A fork bomb is definitely not recommended. For highest throughput submission a job array is your best bet as in on
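For reference, the job-array route also has a built-in throttle: the %N suffix on --array caps how many array tasks run simultaneously (the script name and counts below are illustrative):

$ sbatch --array=1-1000%50 myjob.sh   # 1000 tasks, at most 50 running at once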

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Thanks, Ole, for giving so much thought to my question. I'll pass along these suggestions. Unfortunately, as a user there's not a whole lot I can do about the choice of hardware. Thanks for the link to the guide; I'll have a look at it. Even as a user it's helpful to be well informed on the admin

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Paul, Your comment confirms my worst fear, that I should either implement job arrays or stick to a sequential for loop. My problem with job arrays is that, as far as I understand them, they cannot be used with singleton to set a max job limit. I use singleton to limit the number of jobs a use
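For what it's worth, a sketch of the singleton-based limit being described, assuming one submission script per job (the job_*.sh scripts and the slot count are hypothetical): rotating submissions across N job names caps concurrency at N, since singleton allows only one running job per name and user.

N=8
i=0
for script in job_*.sh; do
    # at most one job per name runs at a time, so at most N jobs run concurrently
    sbatch --job-name="slot$((i % N))" --dependency=singleton "$script"
    i=$((i + 1))
done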

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus
Just a couple comments from experience in general: 1) If you can, either use xargs or parallel to do the forking so you can limit the number of simultaneous submissions 2) I have yet to see where it is a good idea to have many separate jobs when using an array can work. If you can prep
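A sketch of the xargs approach from point 1, assuming one script per job under scripts/ (the directory is hypothetical); -P caps the number of concurrent sbatch invocations:

$ ls scripts/*.sh | xargs -n 1 -P 4 sbatch   # at most 4 simultaneous submissions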

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Brian Andrus
Here is where you may want to look into slurmdbd and sacct. Then you can create a QOS that has MaxJobsPerUser to limit the total number running on a per-user basis: https://slurm.schedmd.com/resource_limits.html Brian Andrus On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote: Hi Paul
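A sketch of the QOS limit mentioned here, assuming accounting is enabled through slurmdbd; the QOS and user names are illustrative, and these are admin-side commands:

$ sacctmgr add qos throttle
$ sacctmgr modify qos throttle set MaxJobsPerUser=50    # cap running jobs per user
$ sacctmgr modify user name=someuser set qos+=throttle  # make the QOS available to the user

Users would then submit with sbatch --qos=throttle.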

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Guillaume Perrault Archambault
Hi Brian, Thanks a lot for your recommendations. I'll do my best to address your three points inline. I hope I've understood you correctly; please correct me if I've misunderstood parts. "1) If you can, either use xargs or parallel to do the forking so you can limit the number of simultaneous s
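The GNU parallel equivalent of the xargs sketch above, under the same assumptions (scripts/ is hypothetical):

$ parallel -j 4 sbatch ::: scripts/*.sh   # at most 4 concurrent submissions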

[slurm-users] Node resource is under-allocated

2019-08-27 Thread Christopher Benjamin Coffey
Hi, Can someone help me understand what this error is? select/cons_res: node cn95 memory is under-allocated (125000-135000) for JobId=23544043 We get a lot of these from time to time and I don't understand what it's about. Looking at the code, it doesn't make sense for this to be happening on ru