John,

We use /etc/security/limits.conf to set CPU-time limits on processes (the cpu item is in minutes, so this caps each process at one hour):

<code bash>
*       hard    cpu     60
root    hard    cpu     unlimited
</code>

It works pretty well, but long-running file transfers can get killed. We have a script that periodically looks for whitelisted programs and removes the limit from them. Users haven't reported problems with this approach (to us, at least). Threaded programs get killed more quickly than multi-process programs since the limit is per process.
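The real script lives in the repo linked below; purely as a sketch of the idea (the whitelist entries here are made-up examples, not our actual list), a root cron job could do something like this with prlimit from util-linux:

<code bash>
#!/bin/bash
# Sketch only: lift the CPU-time rlimit from whitelisted programs.
# Program names are illustrative, not our actual whitelist.
WHITELIST="rsync scp sftp-server"

for prog in $WHITELIST; do
    for pid in $(pgrep -x "$prog"); do
        # Raise both the soft and hard RLIMIT_CPU to unlimited
        prlimit --pid "$pid" --cpu=unlimited
    done
done
</code>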

Additionally, we use cgroups for limits in a similar way to Sean, but with an older approach than pam_cgroup. We use the cpu cgroup rather than cpuset: because it is shares-based, it doesn't pin users to particular CPUs and doesn't throttle them when no one else is running. We also have an OOM notifier daemon that writes to a user's tty so they know when they ran out of memory; "Killed" on its own isn't an error message most users understand.
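The real daemon is oom_notifierd in the repo linked below, which uses the cgroup OOM event interface; purely to illustrate the idea, a crude polling version might look like this (the mount point and per-UID cgroup layout are assumptions here, not our actual setup):

<code bash>
#!/bin/bash
# Crude illustration: watch each user memory cgroup's failcnt (which
# increments whenever the limit is hit) and write a note to the user's
# ttys. Assumes per-UID groups under /cgroup/memory/users/.
declare -A last

while sleep 15; do
    for dir in /cgroup/memory/users/*/; do
        uid=$(basename "$dir")
        count=$(cat "$dir/memory.failcnt" 2>/dev/null) || continue
        if (( count > ${last[$uid]:-0} )); then
            user=$(getent passwd "$uid" | cut -d: -f1)
            for tty in $(who | awk -v u="$user" '$1 == u {print $2}'); do
                echo "You hit your memory limit; processes may have been killed." \
                    > "/dev/$tty" 2>/dev/null
            done
        fi
        last[$uid]=$count
    done
done
</code>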

We have this in a GitHub repo: https://github.com/BYUHPC/uft. Directories that may be useful include cputime_controls, oom_notifierd, and loginlimits (which lets users see their cgroup limits, with some explanations).

Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:
Hi,

We use cgroups to limit usage to 3 cores and 4G of memory on the head nodes. I
didn't set it up myself, but I'll copy and paste our documentation below.

Those limits, 3 cores and 4G, are global to all non-root users I think, as they
apply to a group. We obviously don't do this on the compute nodes.

We also monitor system utilisation with nagios and will intervene if needed.
Before we had cgroups in place I very occasionally had to do a pkill -u baduser
and lock them out temporarily until the situation was explained to them.

Any questions please let me know.

Sean



===== How to configure Cgroups locally on a system =====

This is a step-by-step guide to configure Cgroups locally on a system.

==== 1. Install the libraries to control Cgroups and to enforce it via PAM ====

<code bash>$ yum install libcgroup libcgroup-pam</code>

==== 2. Load the Cgroups module on PAM ====

<code bash>
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/login
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/password-auth-ac
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/system-auth-ac
</code>

==== 3. Set the Cgroup limits and associate them to a user group ====

add to /etc/cgconfig.conf:
<code bash>
# cpuset.mems may be different in different architectures, e.g. in Parsons there
# is only "0".
group users {
   memory {
     memory.limit_in_bytes="4G";
     memory.memsw.limit_in_bytes="6G";
   }
   cpuset {
     cpuset.mems="0-1";
     cpuset.cpus="0-2";
   }
}
</code>

Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of the
''memory.limit_in_bytes'' limit. So in the above example, the limit is 4 GB of
RAM followed by a further 2 GB of swap. See:

[[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]

Set no limit for root and set limits for every other individual user:

<code bash>
$ echo "root    *      /">>/etc/cgrules.conf
$ echo "*   cpuset,memory    users">>/etc/cgrules.conf
</code>

Note also that the ''users'' cgroup defined above covers **all** users (the *
wildcard). So it is not a 4 GB RAM limit per user; it is a 4 GB RAM limit
shared in total by every non-root user.
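If you wanted per-user rather than collective limits, cgrules.conf supports template parameters such as ''%u'' in the destination field (see man cgrules.conf). A sketch, untested here, and assuming the per-user subgroups get created:

<code bash>
# Hypothetical per-user variant: route each user to their own subgroup
# instead of the shared "users" group. The users/%u subgroups must
# exist (or be created, e.g. via cgconfig templates) for this to apply.
$ echo "*   cpuset,memory    users/%u" >> /etc/cgrules.conf
</code>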

==== 4. Start the daemon that manages Cgroups configuration and set it to start on boot ====

<code bash>
$ /etc/init.d/cgconfig start
$ chkconfig cgconfig on
</code>
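
These commands aren't in the original docs, but as a quick sanity check that the group exists and the limits took effect, the libcgroup tools can query it:

<code bash>
$ lscgroup | grep users
$ cgget -r memory.limit_in_bytes -r cpuset.cpus users
</code>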





On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:

Does anyone have a good suggestion for this problem?

On a cluster I am implementing I noticed a user is running a code on 16 cores, 
on one of the login nodes, outside the batch system.
What are the accepted techniques to combat this? Other than applying a LART, if 
you all know what this means.

On one system I set up a year or so ago I was asked to implement a shell 
timeout, so if the user was idle for 30 minutes they would be logged out.
This actually is quite easy to set up as I recall.
I guess in this case as the user is connected to a running process then they 
are not 'idle'.
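
For the shell-idle case, one common mechanism is bash's TMOUT variable; a minimal sketch, assuming bash login shells:

<code bash>
# Log out interactive bash sessions after 30 idle minutes; readonly so
# users can't unset or change it.
$ echo "readonly TMOUT=1800" >> /etc/profile.d/tmout.sh
</code>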



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
