[slurm-users] What is the hide option of scontrol for?

2019-10-24 Thread Uemoto, Tomoki
Hi, all

"man scontrol" says:
Do not display information about hidden partitions, their jobs and job steps.
By default, neither partitions that are configured as hidden nor those
partitions unavailable to the user's group will be displayed (i.e. this is
the default behavior).

The 'hide' option seems unnecessary, since information about hidden
partitions is already not displayed by default.
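For illustration, a minimal command-line sketch (the flags below are documented scontrol options; output omitted):

scontrol show partition          # default: hidden/unavailable partitions are omitted
scontrol -a show partition       # --all: also show hidden and unavailable partitions
scontrol --hide show partition   # explicitly requests the default behavior

So on the command line 'hide' mostly restates the default; in scontrol's interactive mode it appears intended to switch the display back after 'all' has been set.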

When should I use the hide option?

Best Regards,
Tomo



Re: [slurm-users] OverMemoryKill Not Working?

2019-10-24 Thread mercan

Hi;

You should set

SelectType=select/cons_res

plus one of these:

SelectTypeParameters=CR_Memory
SelectTypeParameters=CR_Core_Memory
SelectTypeParameters=CR_CPU_Memory
SelectTypeParameters=CR_Socket_Memory

to enable memory allocation tracking, according to the documentation:

https://slurm.schedmd.com/cons_res_share.html
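
For example, a minimal slurm.conf sketch (picking CR_Core_Memory here; any of the *_Memory variants above would do, and the OverMemoryKill line is the one from Mike's mail):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=OverMemoryKill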

Also, the line:

#SBATCH --mem=1GBB

contains "1GBB". Is this same at job script?


Regards;

Ahmet M.


On 24.10.2019 23:00, Mike Mosley wrote:

Hello,

We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent to migrate
to it from Torque/Moab in the near future.


One of the things our users are used to is that when their jobs exceed
the amount of memory they requested, the job is terminated by the
scheduler. We realize that Slurm prefers to use cgroups to contain
rather than kill the jobs, but initially we need to have the kill
option in place to transition our users.


So, looking at the documentation, it appears that in 19.05, the 
following needs to be set to accomplish this:


JobAcctGatherParams = OverMemoryKill


Other possibly relevant settings we made:

JobAcctGatherType = jobacct_gather/linux

ProctrackType = proctrack/linuxproc


We have avoided configuring any cgroup parameters for the time being.

Unfortunately, when we submit a job with the following:

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --mem=1GBB


We see the RSS of the job steadily increase beyond the 1GB limit and it is
never killed. Interestingly enough, the proc information shows the
ulimit (hard and soft) for the process set to around 1GB.


We have tried various settings without any success.   Can anyone point 
out what we are doing wrong?


Thanks,

Mike

--
J. Michael Mosley
University Research Computing
The University of North Carolina at Charlotte
9201 University City Blvd
Charlotte, NC  28223
704.687.7065  jmmos...@uncc.edu




Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Brian Andrus
IIRC, the big difference is whether you want to use cgroups on the nodes:
if you do, you must use the cgroup plugin.
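
For what it's worth, a minimal sketch of the kind of cgroup-based pairing this implies; the parameter choices are illustrative only, not a recommendation:

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes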


Brian Andrus

On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote:

Hi Juergen,

From what I see so far, there is nothing missing from the jobacct_gather/linux
plugin vs the cgroup version. In fact, the extern step now has data, whereas it
is empty when using the cgroup version.

Anyone know the differences?

Best,
Chris
  




Re: [slurm-users] jobacct_gather/linux vs jobacct_gather/cgroup

2019-10-24 Thread Christopher Benjamin Coffey
Hi Juergen,

From what I see so far, there is nothing missing from the jobacct_gather/linux
plugin vs the cgroup version. In fact, the extern step now has data, whereas it
is empty when using the cgroup version.
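
(For anyone comparing the two, a quick way to look at the gathered per-step accounting data, with JOBID as a placeholder:

sacct -j JOBID -o JobID,JobName,MaxRSS,TotalCPU

The batch and extern steps show up as JOBID.batch and JOBID.extern.)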

Anyone know the differences?

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 10/22/19, 10:52 AM, "slurm-users on behalf of Juergen Salk" wrote:

Dear Chris,

I could not find this warning in the slurm.conf man page. So I googled
it and found a reference in the Slurm developers documentation: 


https://slurm.schedmd.com/jobacct_gatherplugins.html

However, this web page says in its footer: "Last modified 27 March 2015".
So maybe (read: hopefully) this caveat is somewhat outdated today.

I also have JobAcctGatherType=jobacct_gather/cgroup in my slurm.conf,
but for no deeper reason than that we also use cgroups for
process tracking (i.e. ProctrackType=proctrack/cgroup) and to limit
resources used by users. So it just felt more consistent to me to
use cgroups for the jobacct_gather plugin as well, even though SchedMD
recommends jobacct_gather/linux (according to the slurm.conf man page).

That said, I'd also be interested in the pros and cons of jobacct_gather/cgroup
versus jobacct_gather/linux, and also why jobacct_gather/linux is the
recommended one.

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471





* Christopher Benjamin Coffey  [191022 16:26]:
> Hi,
> 
> We've been using jobacct_gather/cgroup for quite some time and haven't
> had any issues (I think). We do see some lengthy job cleanup times when there
> are lots of small jobs completing at once, maybe that is due to the cgroup
> plugin. At SLUG19 a slurm dev presented information that the
> jobacct_gather/cgroup plugin has quite the performance hit and that
> jobacct_gather/linux should be set instead.
> 
> Can someone help me with the difference between these two gather plugins?
> If one were to switch to jobacct_gather/linux, what are the cons? Do you lose
> some job resource usage information?
> 
> Checking out the docs again on the SchedMD site regarding the jobacct_gather
> plugins I see:
> 
> cgroup — Gathers information from Linux cgroup infrastructure and adds
> this information to the standard rusage information also gathered for each job.
> (Experimental, not to be used in production.)
> 
> I don't believe I saw that before: "Experimental"! Hah.
> 
> Thanks!
> 
> Best,
> Chris
>  
> -- 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
>  
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A





[slurm-users] OverMemoryKill Not Working?

2019-10-24 Thread Mike Mosley
Hello,

We are testing Slurm 19.05 on Linux RHEL 7.5+ with the intent to migrate to it
from Torque/Moab in the near future.

One of the things our users are used to is that when their jobs exceed the
amount of memory they requested, the job is terminated by the scheduler.
We realize that Slurm prefers to use cgroups to contain rather than kill
the jobs, but initially we need to have the kill option in place to
transition our users.

So, looking at the documentation, it appears that in 19.05, the following
needs to be set to accomplish this:

JobAcctGatherParams = OverMemoryKill


Other possibly relevant settings we made:

JobAcctGatherType   = jobacct_gather/linux

ProctrackType   = proctrack/linuxproc

We have avoided configuring any cgroup parameters for the time being.
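
As a sanity check (a hypothetical verification step, not part of the original configuration), the values the running daemons actually picked up can be listed with something like:

scontrol show config | grep -E 'JobAcctGather|ProctrackType'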

Unfortunately, when we submit a job with the following:

#SBATCH --nodes=1

#SBATCH --ntasks-per-node=1

#SBATCH --mem=1GBB

We see the RSS of the job steadily increase beyond the 1GB limit and it is
never killed. Interestingly enough, the proc information shows the
ulimit (hard and soft) for the process set to around 1GB.
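
(For illustration, the limits referred to here can be inspected directly, with PID as a placeholder for the job step's process ID:

cat /proc/PID/limits
)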

We have tried various settings without any success.   Can anyone point out
what we are doing wrong?

Thanks,

Mike

-- 
J. Michael Mosley
University Research Computing
The University of North Carolina at Charlotte
9201 University City Blvd
Charlotte, NC  28223
704.687.7065  jmmos...@uncc.edu

