Thanks to Ryan, Sarlo and Sean.

> "Killed" isn't usually a helpful error message that they understand.
Au contraire, I usually find that is a message they understand. Pour 
encourager les autres ("to encourage the others"), you understand.

-----Original Message-----
From: Ryan Cox [mailto:ryan_...@byu.edu]
Sent: 09 February 2017 15:31
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Stopping compute usage on login nodes


John,

We use /etc/security/limits.conf to set CPU-time limits (the cpu item is in minutes) on processes:
* hard cpu 60
root hard cpu unlimited

It works pretty well, but long-running file transfers can get killed, so we have a 
script that periodically removes the limit from whitelisted programs. We haven't 
experienced problems with this approach (none that users have reported to us, at 
least). Threaded programs get killed more quickly than multi-process programs, 
since the limit is per process.
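
A minimal sketch of such a periodic whitelist pass, assuming util-linux's prlimit 
is available on the login nodes; the program names are placeholders, not our 
real list:

#!/bin/bash
# Hypothetical cron job: lift the RLIMIT_CPU limit from whitelisted commands
# so the limits.conf "hard cpu 60" no longer applies to them.
# Run from root's crontab; raising hard limits requires root.
WHITELIST="rsync scp sftp-server"

for prog in $WHITELIST; do
    for pid in $(pgrep -x "$prog"); do
        # Raise both the soft and hard CPU-time limits to unlimited.
        prlimit --pid "$pid" --cpu=unlimited:unlimited
    done
done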

Additionally, we use cgroups for limits in a similar way to Sean, but with an 
older approach than pam_cgroup. We use the cpu cgroup rather than cpuset 
because it doesn't pin users to particular CPUs and doesn't throttle them when 
no one else is running (it's shares-based). We also have an OOM notifier daemon 
that writes to a user's tty so they know if they ran out of memory.  "Killed" 
isn't usually a helpful error message that they understand.
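
For comparison, a shares-based cpu group in /etc/cgconfig.conf could look 
something like the sketch below; the group name and share value are 
illustrative, not our actual configuration:

# Hypothetical cgconfig.conf fragment: relative CPU weighting instead of
# pinning users to specific cores. Users are only throttled when the CPUs
# are contended; an otherwise idle login node stays fully usable.
group interactive {
    cpu {
        cpu.shares = 256;
    }
}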

We have this in a GitHub repo: https://github.com/BYUHPC/uft.
Directories that may be useful include cputime_controls, oom_notifierd, and 
loginlimits (which lets users see their cgroup limits, with some explanations).

Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:
> Hi,
>
> We use cgroups to limit usage to 3 cores and 4G of memory on the head
> nodes. I didn't set it up myself, but I will copy and paste our documentation below.
>
> Those limits (3 cores and 4G) are global to all non-root users, I think,
> as they apply to a group. We obviously don't do this on the compute nodes.
>
> We also monitor system utilisation with Nagios and will intervene if needed.
> Before we had cgroups in place, I very occasionally had to do a pkill
> -u baduser and lock the user out temporarily until the situation was
> explained to them.
>
> Any questions please let me know.
>
> Sean
>
>
>
> ===== How to configure Cgroups locally on a system =====
>
> This is a step-by-step guide to configuring Cgroups locally on a system.
>
> ==== 1. Install the libraries to control Cgroups and to enforce them via PAM ====
>
> <code bash>$ yum install libcgroup libcgroup-pam</code>
>
> ==== 2. Load the Cgroups module on PAM ====
>
> <code bash>
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/login
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> $ echo "session    required    pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> </code>
>
> ==== 3. Set the Cgroup limits and associate them to a user group ====
>
> Add to ''/etc/cgconfig.conf'':
> <code bash>
> # cpuset.mems may be different in different architectures, e.g. in
> # Parsons there is only "0".
> group users {
>    memory {
>      memory.limit_in_bytes="4G";
>      memory.memsw.limit_in_bytes="6G";
>    }
>    cpuset {
>      cpuset.mems="0-1";
>      cpuset.cpus="0-2";
>    }
> }
> </code>
>
> Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive//
> of the ''memory.limit_in_bytes'' limit. So in the above example, the
> limit is 4 GB of RAM followed by a further 2 GB of swap. See:
>
> [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem]]
>
> Set no limit for root, and apply the ''users'' cgroup limits to every other user:
>
> <code bash>
> $ echo "root    *      /" >> /etc/cgrules.conf
> $ echo "*   cpuset,memory    users" >> /etc/cgrules.conf
> </code>
>
> Note also that the ''users'' cgroup defined above covers **all**
> non-root users (the * wildcard). So it is not a 4GB RAM limit per
> user; it is a 4GB RAM limit in total, shared by every non-root user.
>
> ==== 4. Start the daemon that manages Cgroups configuration and set it to start on boot ====
>
> <code bash>
> $ /etc/init.d/cgconfig start
> $ chkconfig cgconfig on
> </code>
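>
> As a quick sanity check (not part of the original guide; it assumes the
> default cgroup mount points that cgconfig creates on RHEL 6):
>
> <code bash>
> # Confirm the limits were applied to the users cgroup
> $ cgget -r memory.limit_in_bytes -r memory.memsw.limit_in_bytes users
> $ cgget -r cpuset.cpus -r cpuset.mems users
> # As a non-root user, confirm the login session landed in it
> $ cat /proc/self/cgroup
> </code>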
>
> On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
>
>> Does anyone have a good suggestion for this problem?
>>
>> On a cluster I am implementing I noticed a user is running a code on 16 
>> cores, on one of the login nodes, outside the batch system.
>> What are the accepted techniques to combat this? Other than applying a LART, 
>> if you all know what this means.
>>
>> On one system I set up a year or so ago I was asked to implement a shell 
>> timeout, so if the user was idle for 30 minutes they would be logged out.
>> This actually is quite easy to set up as I recall.
>> I guess in this case, as the user is connected to a running process, they 
>> are not 'idle'.
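>>
>> For reference, the idle-timeout part can be done with a readonly TMOUT in a
>> profile script. A minimal sketch, assuming bash login shells (the file name
>> is a placeholder):
>>
>> # /etc/profile.d/idle-timeout.sh  (hypothetical path)
>> # Log interactive bash sessions out after 30 minutes of inactivity at the
>> # prompt. A foreground process keeps the shell busy, which is exactly the
>> # caveat above.
>> readonly TMOUT=1800
>> export TMOUT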

--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University
