John,

We use /etc/security/limits.conf to set CPU-time limits on processes (the cpu item is in minutes, so this caps each process at one hour):

<code bash>
*       hard    cpu     60
root    hard    cpu     unlimited
</code>

It works pretty well, but long-running file transfers can get killed. We have a script that periodically looks for whitelisted programs and removes the limit from them. Users haven't reported problems with this approach (to us, at least). Threaded programs get killed more quickly than multi-process programs since the limit is per process.
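The real script lives in the repo linked below; purely as a sketch of the idea (the whitelist entries here are made-up examples, not our actual list), a root cron job could do something like this with prlimit from util-linux:

<code bash>
#!/bin/bash
# Sketch only: lift the CPU-time rlimit from whitelisted programs.
# Program names are illustrative, not our actual whitelist.
WHITELIST="rsync scp sftp-server"

for prog in $WHITELIST; do
    for pid in $(pgrep -x "$prog"); do
        # Raise both the soft and hard RLIMIT_CPU to unlimited
        prlimit --pid "$pid" --cpu=unlimited
    done
done
</code>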

Additionally, we use cgroups for limits in a similar way to Sean, but with an older approach than pam_cgroup. We use the cpu cgroup rather than cpuset: because it is shares-based, it doesn't pin users to particular CPUs and doesn't throttle them when no one else is running. We also have an OOM notifier daemon that writes to a user's tty so they know when they ran out of memory; "Killed" on its own isn't an error message most users understand.
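The real daemon is oom_notifierd in the repo linked below, which uses the cgroup OOM event interface; purely to illustrate the idea, a crude polling version might look like this (the mount point and per-UID cgroup layout are assumptions here, not our actual setup):

<code bash>
#!/bin/bash
# Crude illustration: watch each user memory cgroup's failcnt (which
# increments whenever the limit is hit) and write a note to the user's
# ttys. Assumes per-UID groups under /cgroup/memory/users/.
declare -A last

while sleep 15; do
    for dir in /cgroup/memory/users/*/; do
        uid=$(basename "$dir")
        count=$(cat "$dir/memory.failcnt" 2>/dev/null) || continue
        if (( count > ${last[$uid]:-0} )); then
            user=$(getent passwd "$uid" | cut -d: -f1)
            for tty in $(who | awk -v u="$user" '$1 == u {print $2}'); do
                echo "You hit your memory limit; processes may have been killed." \
                    > "/dev/$tty" 2>/dev/null
            done
        fi
        last[$uid]=$count
    done
done
</code>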

We have this in a GitHub repo: https://github.com/BYUHPC/uft. Directories that may be useful include cputime_controls, oom_notifierd, and loginlimits (which lets users see their cgroup limits, with some explanations).

Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:
Hi,

We use cgroups to limit usage to 3 cores and 4G of memory on the head nodes. I
didn't set it up myself, but I'll copy and paste our documentation below.

Those limits, 3 cores and 4G, are global to all non-root users I think, as they
apply to a group. We obviously don't do this on the compute nodes.

We also monitor system utilisation with nagios and will intervene if needed.
Before we had cgroups in place I very occasionally had to do a pkill -u baduser
and lock them out temporarily until the situation was explained to them.

Any questions please let me know.

Sean



===== How to configure Cgroups locally on a system =====

This is a step-by-step guide to configure Cgroups locally on a system.

==== 1. Install the libraries to control Cgroups and to enforce it via PAM ====

<code bash>$ yum install libcgroup libcgroup-pam</code>

==== 2. Load the Cgroups module on PAM ====

<code bash>
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/login
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/password-auth-ac
$ echo "session    required    pam_cgroup.so" >> /etc/pam.d/system-auth-ac
</code>

==== 3. Set the Cgroup limits and associate them to a user group ====

add to /etc/cgconfig.conf:
<code bash>
# cpuset.mems may be different in different architectures, e.g. in Parsons there
# is only "0".
group users {
   memory {
     memory.limit_in_bytes="4G";
     memory.memsw.limit_in_bytes="6G";
   }
   cpuset {
     cpuset.mems="0-1";
     cpuset.cpus="0-2";
   }
}
</code>

Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of the
''memory.limit_in_bytes'' limit. So in the above example, the limit is 4 GB of
RAM followed by a further 2 GB of swap. See:

[[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]

Set no limit for root and set limits for every other individual user:

<code bash>
$ echo "root    *      /">>/etc/cgrules.conf
$ echo "*   cpuset,memory    users">>/etc/cgrules.conf
</code>

Note also that the ''users'' cgroup defined above covers **all** users (the *
wildcard). So it is not a 4 GB RAM limit per user; it is a 4 GB RAM limit
shared in total by every non-root user.
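If you wanted per-user rather than collective limits, cgrules.conf supports template parameters such as ''%u'' in the destination field (see man cgrules.conf). A sketch, untested here, and assuming the per-user subgroups get created:

<code bash>
# Hypothetical per-user variant: route each user to their own subgroup
# instead of the shared "users" group. The users/%u subgroups must
# exist (or be created, e.g. via cgconfig templates) for this to apply.
$ echo "*   cpuset,memory    users/%u" >> /etc/cgrules.conf
</code>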

==== 4. Start the daemon that manages Cgroups configuration and set it to start on boot ====

<code bash>
$ /etc/init.d/cgconfig start
$ chkconfig cgconfig on
</code>
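
These commands aren't in the original docs, but as a quick sanity check that the group exists and the limits took effect, the libcgroup tools can query it:

<code bash>
$ lscgroup | grep users
$ cgget -r memory.limit_in_bytes -r cpuset.cpus users
</code>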





On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:

Does anyone have a good suggestion for this problem?

On a cluster I am implementing I noticed a user is running a code on 16 cores, 
on one of the login nodes, outside the batch system.
What are the accepted techniques to combat this? Other than applying a LART, if 
you all know what this means.

On one system I set up a year or so ago I was asked to implement a shell 
timeout, so if the user was idle for 30 minutes they would be logged out.
This actually is quite easy to set up as I recall.
I guess in this case as the user is connected to a running process then they 
are not 'idle'.
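
For the shell-idle case, one common mechanism is bash's TMOUT variable; a minimal sketch, assuming bash login shells:

<code bash>
# Log out interactive bash sessions after 30 idle minutes; readonly so
# users can't unset or change it.
$ echo "readonly TMOUT=1800" >> /etc/profile.d/tmout.sh
</code>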



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
