John,
We use /etc/security/limits.conf to set CPU-time limits on processes:
* hard cpu 60
root hard cpu unlimited
It works pretty well, but long-running file transfers can get killed. We
have a script that periodically removes the limit from whitelisted
programs. We haven't had problems with this approach among users (none
reported to us, at least). Note that threaded programs get killed more
quickly than multi-process programs, since the limit is per process.
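The periodic whitelist pass could be sketched roughly as follows. This is not our actual script (that lives in the repo linked below); the program names and the use of util-linux's prlimit to clear RLIMIT_CPU on live processes are illustrative assumptions:

```shell
#!/bin/bash
# Hypothetical sketch: periodically lift the CPU-time rlimit from
# long-running whitelisted transfer programs. Run from cron.

WHITELIST="scp rsync sftp-server"   # example names, not our real list

# Return success if the given command name is on the whitelist.
whitelisted() {
    local name
    for name in $WHITELIST; do
        [ "$1" = "$name" ] && return 0
    done
    return 1
}

# Scan all processes and clear the limit on whitelisted ones.
ps -eo pid=,comm= | while read -r pid comm; do
    if whitelisted "$comm"; then
        # prlimit (util-linux) rewrites the rlimits of a running process;
        # soft:hard = unlimited clears RLIMIT_CPU entirely.
        prlimit --pid "$pid" --cpu=unlimited:unlimited 2>/dev/null
    fi
done
```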
Additionally, we use cgroups for limits in a similar way to Sean, but
with an older approach than pam_cgroup. We also use the cpu cgroup
rather than cpuset, because it doesn't pin users to particular CPUs and
doesn't limit them when no one else is running (it's share-based). We
also have an OOM notifier daemon that writes to a user's tty so they
know when they have run out of memory; "Killed" on its own isn't an
error message most users understand.
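The tty-notification idea can be sketched as below. This is not the actual oom_notifierd code (that is in the repo); it only illustrates finding a user's ttys via `who` and writing a readable message to them:

```shell
#!/bin/bash
# Hypothetical sketch of "tell the user, not just 'Killed'": write a
# human-readable message to every tty a user has open.

oom_message() {
    # $1 = username; build the message we would send to their terminals
    printf '%s: a process of yours was killed because it ran out of memory\n' "$1"
}

notify_user() {
    local user=$1 tty
    # 'who' lists login sessions as "<user> <tty> ..."; write to each tty
    who | awk -v u="$user" '$1 == u {print $2}' | while read -r tty; do
        oom_message "$user" > "/dev/$tty" 2>/dev/null
    done
}
```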
We have this in a GitHub repo: https://github.com/BYUHPC/uft.
Directories that may be useful include cputime_controls, oom_notifierd,
and loginlimits (which lets users see their cgroup limits, with some
explanations).
Ryan
On 02/09/2017 07:18 AM, Sean McGrath wrote:
Hi,
We use cgroups to limit usage to 3 cores and 4G of memory on the head
nodes. I didn't set it up myself, but I'll paste our documentation below.
Those limits, 3 cores and 4G, are global to all non-root users, I think,
as they apply to a group. We obviously don't do this on the compute nodes.
We also monitor system utilisation with Nagios and intervene if needed.
Before we had cgroups in place, I very occasionally had to run a
pkill -u baduser and lock the user out temporarily until the situation
was explained to them.
Any questions please let me know.
Sean
===== How to configure Cgroups locally on a system =====
This is a step-by-step guide to configure Cgroups locally on a system.
==== 1. Install the libraries to control Cgroups and to enforce it via PAM ====
<code bash>$ yum install libcgroup libcgroup-pam</code>
==== 2. Load the Cgroups module on PAM ====
<code bash>
$ echo "session required pam_cgroup.so" >> /etc/pam.d/login
$ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
$ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac
</code>
==== 3. Set the Cgroup limits and associate them to a user group ====
Add the following to ''/etc/cgconfig.conf'':
<code bash>
# cpuset.mems may be different in different architectures, e.g. in Parsons there
# is only "0".
group users {
    memory {
        memory.limit_in_bytes = "4G";
        memory.memsw.limit_in_bytes = "6G";
    }
    cpuset {
        cpuset.mems = "0-1";
        cpuset.cpus = "0-2";
    }
}
</code>
Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of the
''memory.limit_in_bytes'' limit. So in the above example, the limit is 4 GB of
RAM followed by a further 2 GB of swap. See:
[[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]
Set no limit for root and set limits for every other individual user:
<code bash>
$ echo "root * /" >> /etc/cgrules.conf
$ echo "* cpuset,memory users" >> /etc/cgrules.conf
</code>
Note also that the ''users'' cgroup defined above covers **all** users
(the * wildcard). So it is not a 4 GB RAM limit per user; it is a 4 GB RAM
limit in total for every non-root user combined.
==== 4. Start the daemon that manages Cgroups configuration and set it to start on boot ====
<code bash>
$ /etc/init.d/cgconfig start
$ chkconfig cgconfig on
</code>
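Once the daemon is running, the placement and limits can be sanity-checked. The following sketch assumes the ''users'' group from the configuration above; ''cgget'' is libcgroup's query tool, and each line of ''/proc/self/cgroup'' has the form ''hierarchy-id:controllers:path'':

<code bash>
#!/bin/bash
# Extract the cgroup path for a given controller from /proc/<pid>/cgroup.
cgroup_path() {
    # $1 = controller name; stdin = contents of /proc/self/cgroup
    awk -F: -v c="$1" '$2 ~ c {print $3; exit}'
}

# For a limited user this should print "/users"; for root, "/".
[ -r /proc/self/cgroup ] && cgroup_path memory < /proc/self/cgroup

# cgget (libcgroup) shows the values actually applied, if installed.
command -v cgget >/dev/null 2>&1 \
    && cgget -r memory.limit_in_bytes users 2>/dev/null
true
</code>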
On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
Does anyone have a good suggestion for this problem?
On a cluster I am implementing, I noticed a user running a code on 16 cores
on one of the login nodes, outside the batch system.
What are the accepted techniques to combat this? Other than applying a LART,
if you all know what that means.
On one system I set up a year or so ago, I was asked to implement a shell
timeout, so that if a user was idle for 30 minutes they would be logged out.
This is actually quite easy to set up, as I recall.
I guess in this case, as the user is connected to a running process, they
are not 'idle'.
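For reference, the idle-timeout John describes is commonly done with the shell's TMOUT variable via a profile snippet; a sketch (the filename is an example, and as noted it will not catch a user attached to a running process):

```shell
# /etc/profile.d/tmout.sh -- log out idle interactive shells after 30 min.
# TMOUT is a bash/ksh feature: the shell exits if no input arrives at the
# prompt within $TMOUT seconds. "readonly" stops users from unsetting it.
TMOUT=1800
readonly TMOUT
export TMOUT
```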
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University