Re: [slurm-users] Node OverSubscribe even if set to no

2018-04-17 Thread Stéphane Larose
Hi all, I found a way to avoid oversubscribing. I had to comment out this configuration: PreemptMode=Suspend,Gang PreemptType=preempt/partition_prio In my current configuration, all the partitions are at the same priority. At times, I increase the priority of a partition and jobs in other partit

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Mahmood Naderan
Great. Thank you very much. It passed the problematic point. On Tue, Apr 17, 2018, 19:24 Ole Holm Nielsen wrote: > On 04/17/2018 04:38 PM, Mahmood Naderan wrote: > > That parameter is used in slurm.conf. Should I modify that only on the > > head node? Or all nodes? Then should I restart slurm

[slurm-users] What can cause a job to get killed?

2018-04-17 Thread Andy Riebs
I had a job running last night, with a 30 minute timeout. (It's a well-tested script that runs multiple times daily.) On one run, in the middle of a set of runs for this job, I got this on the console after about 8 minutes: srun: forcing job termination srun: Job step aborted: Waiting up to 32

[slurm-users] configure --htmldir

2018-04-17 Thread Jason Bacon
FYI, I just discovered that doc/man/man1/Makefile does not respect configure's --htmldir flag: [root@centosdev slurm]# fgrep '$ ./configure' work/slurm-17.11.5/config.log   $ ./configure --bindir=/usr/pkg/bin --htmldir=/usr/pkg/share/doc/slurm-wlm/html --with-munge=/usr/pkg --with-hwloc=/usr

[slurm-users] SLURM Operator Role (to cancel SLURM Jobs)

2018-04-17 Thread Buckley, Ronan
Hi, I have given 4 users the operator role and they are all part of the coordinator accounts. However, when I su to the users in question, they get a permission denied error when trying to cancel a job. What am I missing? Ronan

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 04:38 PM, Mahmood Naderan wrote: That parameter is used in slurm.conf. Should I modify that only on the head node? Or all nodes? Then should I restart slurm processes? Yes, definitely! I collected the detailed instructions here: https://wiki.fysik.dtu.dk/niflheim/Slurm_configurat

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Mahmood Naderan
That parameter is used in slurm.conf. Should I modify that only on the head node? Or all nodes? Then should I restart slurm processes? Regards, Mahmood On Tue, Apr 17, 2018 at 4:18 PM, Chris Samuel wrote: > On Tuesday, 17 April 2018 7:23:40 PM AEST Mahmood Naderan wrote: > >> [hamid@rocks7 ca

[slurm-users] Recurring error

2018-04-17 Thread Valerio Bellizzomi
Hello, I have a recurring error in the log of slurmctld: [2018-04-10T19:32:40.145] error: _unpack_ret_list: message type 24949, record 0 of 56214 [2018-04-10T19:32:40.145] error: invalid type trying to be freed 24949 [2018-04-10T19:32:40.145] error: unpacking header [2018-04-10T19:32:40.145] erro

Re: [slurm-users] Node OverSubscribe even if set to no

2018-04-17 Thread Stéphane Larose
Hi Chris, > You might want to double check the config is acting as expected with: > > scontrol show part | fgrep OverSubscribe PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO Priorit
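
When a cluster has many partitions, the filtered `scontrol show part` output can be checked programmatically instead of by eye. The following is a minimal sketch, assuming the output consists of whitespace-separated `Key=Value` tokens as in the lines quoted above; the `sample` text and the helper name `oversubscribe_values` are illustrative only, not part of Slurm.

```python
# Sample text copied from the shape of the output quoted in the thread.
sample = """\
   PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
   PriorityJobFactor=10 PriorityTier=10 RootOnly=NO ReqResv=NO OverSubscribe=NO
"""

def oversubscribe_values(text):
    """Return the OverSubscribe value found on each line of the given text."""
    values = []
    for line in text.splitlines():
        # Split each line into Key=Value tokens and build a lookup table.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "OverSubscribe" in fields:
            values.append(fields["OverSubscribe"])
    return values

print(oversubscribe_values(sample))
```

In practice the text would come from running the command itself, for example via `subprocess.run(["scontrol", "show", "part"], capture_output=True, text=True).stdout`.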

Re: [slurm-users] Way MaxRSS should be interpreted

2018-04-17 Thread E.S. Rosenberg
Hi Gareth, Your assessment is also what I would have thought: MaxRSS should be the maximum of the sum of all RSS in a sample. Swap and shared memory do complicate things, but I think most people expect jobs to only be killed if their RSS exceeds their memory request. That being said, as far as I un

Re: [slurm-users] "allocated+" status

2018-04-17 Thread Andy Riebs
Hmmm... the man page says of "reduce_completing_frag," "By default if a job is found completing then no jobs are scheduled. If this parameter is used the node in a completing job are taken out of consideration." This feels like it's missing a word or two. The first sentence says that, by def

Re: [slurm-users] Way MaxRSS should be interpreted

2018-04-17 Thread Gareth.Williams
I think the situation is likely to be a little different. Let’s consider a Fortran program that statically or dynamically defines large arrays. This defines a virtual memory size – like declaring that this is the maximum amount of memory you might use if you fill the arrays. That amount of real

Re: [slurm-users] Way MaxRSS should be interpreted

2018-04-17 Thread E.S. Rosenberg
Hi Loris, Thanks for your explanation! I would have interpreted it as max(sum()). Is there a way to get max(sum()) or at least some form of sum()? The assumption that all processes are peaking at the same value is not a valid one unless all threads have essentially the same workload... Thanks again! E
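
The difference between the interpretations discussed in this thread can be made concrete with a small sketch. The numbers below are made up, and the code does not claim to reproduce how Slurm's accounting actually computes MaxRSS; it only shows how far "max of the per-sample sum", "max of any single process", and "sum of per-process peaks" can diverge when processes peak at different times.

```python
# Hypothetical samples: each inner list holds the RSS (in MB) of every
# process in a job at one sampling instant. The numbers are invented.
samples = [
    [100, 400],
    [300, 350],
    [500, 100],
]

def max_of_sums(samples):
    """Peak of the job's total RSS: max over samples of the per-sample sum."""
    return max(sum(sample) for sample in samples)

def max_single_process(samples):
    """Largest RSS reached by any one process in any sample."""
    return max(max(sample) for sample in samples)

def sum_of_per_process_peaks(samples):
    """Sum of each process's own peak, even if the peaks occur at different times."""
    return sum(max(column) for column in zip(*samples))

print(max_of_sums(samples))
print(max_single_process(samples))
print(sum_of_per_process_peaks(samples))
```

With these invented samples the three quantities are all different, which is why the interpretation matters when comparing a reported value against the requested memory.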

Re: [slurm-users] SLURM's reservations

2018-04-17 Thread Chris Samuel
On Tuesday, 17 April 2018 7:41:18 PM AEST De Giorgi Jean-Claude wrote: > Thanks a lot for your help. > Yes, I misunderstood the "format" part. > > Thank you for your example. My pleasure, glad it was useful! We have a newer version of Slurm which has different (& more) format options to you so

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Chris Samuel
On Tuesday, 17 April 2018 7:23:40 PM AEST Mahmood Naderan wrote: > [hamid@rocks7 case1_source2]$ scontrol show config | fgrep VSizeFactor > VSizeFactor = 110 percent Great, I think that's the cause of the limit you are seeing.. VSizeFactor Memory specifications
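
As a rough illustration of what a VSizeFactor of 110 percent implies, here is a minimal sketch. It assumes the slurm.conf description that VSizeFactor expresses a job's virtual memory limit as a percentage of its real memory limit; the 4000 MB request and the helper name `vsize_limit_mb` are hypothetical and do not come from the thread.

```python
def vsize_limit_mb(real_limit_mb, vsize_factor_percent):
    """Virtual memory limit implied by a real memory limit and VSizeFactor."""
    return real_limit_mb * vsize_factor_percent / 100

# Hypothetical job requesting 4000 MB of real memory with VSizeFactor=110.
print(vsize_limit_mb(4000, 110))
```

Under that assumption, a program that reserves much more virtual memory than it ever touches can hit the virtual limit long before its resident set approaches the requested memory.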

Re: [slurm-users] Way MaxRSS should be interpreted

2018-04-17 Thread Loris Bennett
Hi Eli, "E.S. Rosenberg" writes: > Hi fellow slurm users, > We have been struggling for a while with understanding how MaxRSS is reported. > > This is because jobs often die with MaxRSS not even approaching 10% of the > requested memory. > > I just found the following document: > https:/

Re: [slurm-users] slurm jobs are pending but resources are available

2018-04-17 Thread Benjamin Redling
Hello, Am 16.04.2018 um 18:50 schrieb Michael Di Domenico: > On Mon, Apr 16, 2018 at 6:35 AM, wrote: > perhaps i missed something in the email, but it sounds like you have > 56 cores, you have two running jobs that consume 52 cores, leaving you > four free. No. From the original mail: <--- %

[slurm-users] Way MaxRSS should be interpreted

2018-04-17 Thread E.S. Rosenberg
Hi fellow slurm users, We have been struggling for a while with understanding how MaxRSS is reported. This is because jobs often die with MaxRSS not even approaching 10% of the requested memory. I just found the following document: https://research.csc.fi/-/a It says: "*maxrss *= maximum

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread John Hearns
Loris, Ole, thank you so much. That is the Python script I was thinking of. On 17 April 2018 at 11:15, Ole Holm Nielsen wrote: > On 04/17/2018 10:56 AM, John Hearns wrote: > >> Please can some kind soul remind me what the Python code for mangling >> Slurm and PBS machinefiles is called please?

Re: [slurm-users] SLURM's reservations

2018-04-17 Thread De Giorgi Jean-Claude
Hi Chris, Thanks a lot for your help. Yes, I misunderstood the "format" part. Thank you for your example. Regards, Jean-Claude On 17.04.18, 05:43, "slurm-users on behalf of Chris Samuel" wrote: On Tuesday, 17 April 2018 12:52:04 AM AEST De Giorgi Jean-Claude wrote: > Accordin

Re: [slurm-users] SLURM's reservations

2018-04-17 Thread De Giorgi Jean-Claude
Hello Daniel, Thank you for your information. That’s very helpful. Regards, Jean-Claude From: slurm-users on behalf of Daniel Grimwood Reply-To: Slurm User Community List Date: Tuesday, 17 April 2018 at 05:10 To: 'Slurm User Community List' Subject: Re: [slurm-users] SLURM's reservations

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Mahmood Naderan
See [hamid@rocks7 case1_source2]$ scontrol show config | fgrep VSizeFactor VSizeFactor = 110 percent Regards, Mahmood On Tue, Apr 17, 2018 at 12:51 PM, Chris Samuel wrote: > On Tuesday, 17 April 2018 5:08:09 PM AEST Mahmood Naderan wrote: > >> So, UsePAM has not been set. So, slu

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 10:56 AM, John Hearns wrote: Please can some kind soul remind me what the Python code for mangling Slurm and PBS machinefiles is called please? We discussed it here about a year ago, in the context of running Ansys. I have a Cunning Plan (TM) to recode it in Julia, for no real re

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread Loris Bennett
Loris Bennett writes: > Hi John, > > John Hearns writes: > >> Please can some kind soul remind me what the Python code for mangling >> Slurm and PBS machinefiles is called please? We discussed it here >> about a year ago, in the context of running Ansys. >> >> I have a Cunning Plan (TM) to recod

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread Loris Bennett
Hi John, John Hearns writes: > Please can some kind soul remind me what the Python code for mangling > Slurm and PBS machinefiles is called please? We discussed it here > about a year ago, in the context of running Ansys. > > I have a Cunning Plan (TM) to recode it in Julia, for no real reason >

Re: [slurm-users] What version I should install?

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 09:14 AM, David Rodríguez wrote: Thanks Chris! Thanks Ole! In fact, I followed your wiki. But I had many doubts about whether to use version 17.11 or 17.02 because I don't know the differences between them. Finally, I installed the latest one. Always install the latest and greatest ver

[slurm-users] Python code for munging hostfiles

2018-04-17 Thread John Hearns
Please can some kind soul remind me what the Python code for mangling Slurm and PBS machinefiles is called please? We discussed it here about a year ago, in the context of running Ansys. I have a Cunning Plan (TM) to recode it in Julia, for no real reason other than curiosity.

Re: [slurm-users] Node OverSubscribe even if set to no

2018-04-17 Thread Chris Samuel
On Tuesday, 17 April 2018 5:26:26 AM AEST Stéphane Larose wrote: > So some jobs are now sharing the same cores but I don’t understand why since > OverSubscribe is set to no. You might want to double check the config is acting as expected with: scontrol show part | fgrep OverSubscribe Also what

Re: [slurm-users] Limit to "-N1" and "-n1" with job_submit.lua

2018-04-17 Thread Chris Samuel
On Monday, 16 April 2018 11:00:59 PM AEST Sysadmin CAOS wrote: > However, this script is not working as I desire because logged value for > job_desc.max_cpus return a value too big. I get "4294967294" for > job_desc.max_cpus and "1" for job_desc.max_nodes... That large number means the limit has
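
The value 4294967294 is easier to recognise in hexadecimal. The sketch below assumes the sentinel definitions from Slurm's `slurm.h` header (`NO_VAL` as `0xfffffffe` and `INFINITE` as `0xffffffff`); the helper `is_set` is hypothetical and only illustrates the kind of check a submit filter might perform before comparing a field against a real limit.

```python
# 32-bit sentinel values, assumed to match the definitions in slurm.h.
NO_VAL = 0xFFFFFFFE
INFINITE = 0xFFFFFFFF

def is_set(value):
    """Return True if the field carries a real value rather than a sentinel."""
    return value not in (NO_VAL, INFINITE)

print(4294967294 == NO_VAL)
print(is_set(4294967294), is_set(1))
```

In an actual `job_submit.lua` the same comparison would be written in Lua against whatever constant the installed Slurm version exposes, which should be checked in that version's documentation.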

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Chris Samuel
On Tuesday, 17 April 2018 5:08:09 PM AEST Mahmood Naderan wrote: > So, UsePAM has not been set. So, slurm shouldn't limit anything. Is > that correct? however, I see that slurm limits the virtual memory size What does this say? scontrol show config | fgrep VSizeFactor -- Chris Samuel : htt

Re: [slurm-users] What version I should install?

2018-04-17 Thread Chris Samuel
On Tuesday, 17 April 2018 5:14:39 PM AEST David Rodríguez wrote: > There are two steps that change a little from actual version. For example, > in slurm-17.11.5-1 does not appear "slurm-plugins-$VER*rpm" Yes, that's a change documented in the RELEASE_NOTES for Slurm 17.11.x. NOTE: The slurm.spec

Re: [slurm-users] What version I should install?

2018-04-17 Thread David Rodríguez
Thanks Chris! Thanks Ole! In fact, I followed your wiki. But I had many doubts about whether to use version 17.11 or 17.02 because I don't know the differences between them. Finally, I installed the latest one. There are two steps that change a little in the current version. For example, in slurm-17.11.5-1

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Mahmood Naderan
Hi Bill, Sorry for the late reply. Grepping for pam_limits.so, I see [root@rocks7 ~]# grep -r pam_limits.so /etc/pam.d/ /etc/pam.d/sudo:session required pam_limits.so /etc/pam.d/runuser:session required pam_limits.so /etc/pam.d/sudo-i:session required pam_limit

[slurm-users] How to have an array job name include the array task ID

2018-04-17 Thread Alex Reynolds
Hello all, I am submitting a job to a SLURM scheduler, which contains an array of small jobs. For example, here's a script that simply prints out the date and hostname of the compute node from within a heredoc: --- #!/bin/bash ...(variables)... sbatch --parsable --partition=${job