[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-10 Thread Marcin Stolarek
On the cluster I've been managing we had a solution with pam_script that
chose two random cores for each user and bound his session to those (a
second session reuses the same cores). I think it's quite a good solution,
since
1) A user is not able to take all of the server's resources
2) The probability that two users are bound to the same resources is
reduced (so one user will not affect others). It can be tuned by changing
the two cores to whatever number suits the login node's resources and the
number of users logged in.
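
The core of it can go in a pam_script/pam_exec session-open hook. This is
only a minimal sketch, not the production script: the v1 cpuset hierarchy at
/sys/fs/cgroup/cpuset, the hash-based core choice and the use of $PPID are
assumptions.

#!/bin/bash
# Hypothetical session-open hook: pin the incoming login session to two
# cores derived from the username, so repeat logins reuse the same pair.
NCORES=$(nproc)
CG=/sys/fs/cgroup/cpuset/login_${PAM_USER}

if [ ! -d "$CG" ]; then
    mkdir "$CG"
    SEED=$(printf '%s' "$PAM_USER" | cksum | cut -d' ' -f1)
    C1=$(( SEED % NCORES ))
    C2=$(( (SEED / 3) % NCORES ))
    [ "$C1" -eq "$C2" ] && C2=$(( (C2 + 1) % NCORES ))
    [ "$C1" -gt "$C2" ] && { T=$C1; C1=$C2; C2=$T; }
    # child cpusets need mems populated before cpus/tasks can be used
    cat /sys/fs/cgroup/cpuset/cpuset.mems > "$CG/cpuset.mems"
    echo "${C1},${C2}" > "$CG/cpuset.cpus"
fi

# $PPID is the sshd process handling this session; its children inherit
# the cpuset.
echo "$PPID" > "$CG/tasks"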

In addition to this we had a simple cron job to notify admins when a user
process's CPU time exceeds 2 minutes and the difference between wall-clock
time and CPU time is small.
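
A minimal version of that check could look like the following (the
thresholds, the mail recipient and ps -o cputimes, which needs a reasonably
recent procps-ng, are assumptions):

#!/bin/bash
# Hypothetical cron check: flag non-root processes that have used more
# than 2 minutes of CPU and are CPU-bound (CPU time close to elapsed time).
REPORT=$(ps -eo user,pid,etimes,cputimes,comm --no-headers | \
    awk '$1 != "root" && $4 > 120 && $4 > 0.8 * $3 {
        printf "user=%s pid=%s comm=%s cpu=%ss elapsed=%ss\n", $1, $2, $5, $4, $3 }')
[ -n "$REPORT" ] && echo "$REPORT" | mail -s "Compute usage on $(hostname)" root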

cheers,
Marcin

2017-02-09 20:01 GMT+01:00 Ryan Novosielski :

> I have used ulimits in the past to limit users to 768MB of RAM per
> process. This seemed to be enough to run anything they were actually
> supposed to be running. I would use cgroups on a more modern system (this
> was RHEL5).
>
> A related question: we used cgroups on a CentOS 6 system, but then
> switched our accounts to private user groups as opposed to a more general
> "hpcusers" group. It doesn't seem like there is a way to apply cgroups via
> a secondary group, or any other easy way to do this. The setup was that the
> main user group was limited to "most" of the machine and users were limited
> to some percentage of that. With users not sharing any group, this
> stopped working. Anyone know of an alternative? (I guess doing it based on
> excluding system users and applying limits to everyone else, but this seems
> ham-fisted.)
>
> --
> 
> || \UTGERS,    |---*O*---
> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
> || \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \of NJ | Office of Advanced Research Computing - MSB C630, Newark
> `'
>
> On Feb 9, 2017, at 13:05, Ole Holm Nielsen wrote:
>
> We limit the cpu times in /etc/security/limits.conf so that user processes
> have a maximum of 10 minutes. It doesn't eliminate the problem completely,
> but it's fairly effective on users who misunderstood the role of login
> nodes.
>
>
>
> On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon" wrote:
>
> We simply make it impossible to run computational software on the head
>> nodes.
>>
>> 1. No scientific software packages are installed on the local disk.
>> 2. Our NFS-mounted application directory is mounted with noexec.
>>
>> Regards,
>>
>>  Jason
>>
>> On 02/09/17 07:09, John Hearns wrote:
>> >
>> > Does anyone have a good suggestion for this problem?
>> >
>> > On a cluster I am implementing I noticed a user is running a code on
>> > 16 cores, on one of the login nodes, outside the batch system.
>> >
>> > What are the accepted techniques to combat this? Other than applying a
>> > LART, if you all know what this means.
>> >
>> > On one system I set up a year or so ago I was asked to implement a
>> > shell timeout, so if the user was idle for 30 minutes they would be
>> > logged out.
>> >
>> > This actually is quite easy to set up as I recall.
>> >
>> > I guess in this case as the user is connected to a running process
>> > then they are not ‘idle’.
>> >
>>
>>
>> --
>> Earth is a beta site.
>>
>>


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Novosielski
I have used ulimits in the past to limit users to 768MB of RAM per process.
This seemed to be enough to run anything they were actually supposed to be
running. I would use cgroups on a more modern system (this was RHEL5).

A related question: we used cgroups on a CentOS 6 system, but then switched our
accounts to private user groups as opposed to a more general "hpcusers" group.
It doesn't seem like there is a way to apply cgroups via a secondary group, or
any other easy way to do this. The setup was that the main user group was
limited to "most" of the machine and users were limited to some percentage of
that. With users not sharing any group, this stopped working. Anyone know of an
alternative? (I guess doing it based on excluding system users and applying
limits to everyone else, but this seems ham-fisted.)
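
For what it's worth, the "exempt system users, limit everyone else" variant
can be written directly in /etc/cgrules.conf, since cgrulesengd applies the
first matching rule. A sketch (the exempted accounts and the "users"
destination group here are placeholders, not a tested config):

# Hypothetical /etc/cgrules.conf: exempt system/admin accounts first
# (first match wins), then put every remaining user in a limited cgroup.
root            *               /
@wheel          *               /
nagios          *               /
*               cpuset,memory   users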

--

|| \UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - novos...@rutgers.edu
|| \ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Feb 9, 2017, at 13:05, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


We limit the cpu times in /etc/security/limits.conf so that user processes have 
a maximum of 10 minutes. It doesn't eliminate the problem completely, but it's 
fairly effective on users who misunderstood the role of login nodes.



On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon" <bacon4...@gmail.com> wrote:


We simply make it impossible to run computational software on the head
nodes.

1. No scientific software packages are installed on the local disk.
2. Our NFS-mounted application directory is mounted with noexec.

Regards,

 Jason

On 02/09/17 07:09, John Hearns wrote:
>
> Does anyone have a good suggestion for this problem?
>
> On a cluster I am implementing I noticed a user is running a code on
> 16 cores, on one of the login nodes, outside the batch system.
>
> What are the accepted techniques to combat this? Other than applying a
> LART, if you all know what this means.
>
> On one system I set up a year or so ago I was asked to implement a
> shell timeout, so if the user was idle for 30 minutes they would be
> logged out.
>
> This actually is quite easy to set up as I recall.
>
> I guess in this case as the user is connected to a running process
> then they are not ‘idle’.
>


--
Earth is a beta site.



[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox


If you're interested in the programmatic method I mentioned to increase 
limits for file transfers, 
https://github.com/BYUHPC/uft/tree/master/cputime_controls might be 
worth looking at.  It works well for us, though a user will occasionally 
start using a new file transfer program that you might want to centrally 
install and whitelist.


We used to use LVS for load balancing and it worked pretty well.  We 
finally scrapped it in favor of DNS round robin since it gets expensive 
to have a load balancer that's capable of moving that much bandwidth.  
We have a script that can drop some of the login nodes from the DNS 
round robin based on CPU and memory usage (with sanity checks to not 
drop all of them at the same time, of course :) ). There may be a better 
way of doing this but it has worked so far.
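
A stripped-down version of that kind of health check might look like this
(the thresholds, the zone name and the use of nsupdate are all assumptions,
and the "never drop the last node" sanity check is omitted):

#!/bin/bash
# Hypothetical per-node check: pull this login node's A record out of the
# round robin when it looks overloaded.
ADDR=$(hostname -i)
LOAD=$(awk '{print int($1)}' /proc/loadavg)
MEMAVAIL_KB=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)

if [ "$LOAD" -gt 32 ] || [ "$MEMAVAIL_KB" -lt 4194304 ]; then
    printf 'update delete login.example.org. A %s\nsend\n' "$ADDR" | \
        nsupdate -k /etc/login-rr.key
fi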


Ryan

On 02/09/2017 11:15 AM, Nicholas McCollum wrote:

While this isn't a SLURM issue, it's something we all face.  Due to my
system being primarily students, it's something I face a lot.

I second the use of ulimits, although this can kill off long running
file transfers.  What you can do to help out users is set a low soft
limit and a somewhat larger hard limit.  Encourage users that want to
do a file transfer to increase their limit (they won't be able to go
over the hard limit).

A method that I am testing to employ is having each login node as a KVM
virtual machine, and then limiting the amount of CPU that the virtual
machine can use.  Each login-VM will be identical minus the MAC and the
IP address, then using iptables on the VM-host to push the connections
out to the VM that responds first.  The idea is that a loaded down VM
would have a delay in responding and provide a user with a login node
that doesn't have any users on it.

I'm sure someone has already blazed this trail before, but this is how
I am going about it.




--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Jason Bacon



That reminds me, we also don't allow file transfers through the head node:

chmod 750 /usr/bin/sftp /usr/bin/scp /usr/bin/rsync

All file transfer operations must go through one of the file servers.

On 02/09/17 12:13, Nicholas McCollum wrote:

While this isn't a SLURM issue, it's something we all face.  Due to my
system being primarily students, it's something I face a lot.

I second the use of ulimits, although this can kill off long running
file transfers.  What you can do to help out users is set a low soft
limit and a somewhat larger hard limit.  Encourage users that want to
do a file transfer to increase their limit (they won't be able to go
over the hard limit).

A method that I am testing to employ is having each login node as a KVM
virtual machine, and then limiting the amount of CPU that the virtual
machine can use.  Each login-VM will be identical minus the MAC and the
IP address, then using iptables on the VM-host to push the connections
out to the VM that responds first.  The idea is that a loaded down VM
would have a delay in responding and provide a user with a login node
that doesn't have any users on it.

I'm sure someone has already blazed this trail before, but this is how
I am going about it.





--
Earth is a beta site.


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Nicholas McCollum
While this isn't a SLURM issue, it's something we all face.  Since my
system's users are primarily students, it's something I face a lot.

I second the use of ulimits, although this can kill off long-running
file transfers.  What you can do to help out users is set a low soft
limit and a somewhat larger hard limit.  Encourage users who want to
do a file transfer to raise their soft limit (they won't be able to go
over the hard limit).
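
Concretely, that could look something like this (the 30/240-minute values
are only examples; limits.conf takes minutes, while bash's ulimit -t takes
seconds):

# Hypothetical /etc/security/limits.conf soft/hard split:
*       soft    cpu     30
*       hard    cpu     240
root    -       cpu     unlimited

# A user about to start a long transfer raises the soft limit for that
# shell, up to the hard limit (14400 s = 240 min):
$ ulimit -S -t 14400
$ rsync -a bigdata/ archive-host:/archive/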

A method that I am testing is running each login node as a KVM virtual
machine and limiting the amount of CPU that the virtual machine can use.
Each login VM will be identical apart from the MAC and the IP address,
with iptables on the VM host pushing connections out to the VM that
responds first.  The idea is that a heavily loaded VM would be slower to
respond, so a user would land on a login node that doesn't have any
users on it.
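
With libvirt-managed KVM guests, one possible way to do the CPU capping is
sketched below (assuming a domain named "login1"; vcpu_period/vcpu_quota are
in microseconds and apply per vCPU):

# Hypothetical libvirt tuning: give the login VM 4 vCPUs and cap each
# vCPU at roughly half of a host core.
$ virsh setvcpus login1 4 --config --maximum
$ virsh setvcpus login1 4 --config
$ virsh schedinfo login1 --config --set vcpu_period=100000
$ virsh schedinfo login1 --config --set vcpu_quota=50000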

I'm sure someone has already blazed this trail before, but this is how
I am going about it.


-- 
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Thu, 2017-02-09 at 07:32 -0800, Ryan Cox wrote:
> John,
> 
> We use /etc/security/limits.conf to set cputime limits on processes:
> * hard cpu 60
> root hard cpu unlimited
> 
> It works pretty well but long running file transfers can get
> killed.  We 
> have a script that looks for whitelisted programs to remove the
> limit 
> from on a periodic basis.  We haven't experienced problems with this 
> approach in users (that anyone has reported to us, at
> least).  Threaded 
> programs get killed more quickly than multi-process programs since
> the 
> limit is per process.
> 
> Additionally, we use cgroups for limits in a similar way to Sean but 
> with an older approach than pam_cgroup.  We also use the cpu cgroup 
> rather than cpuset because it doesn't limit them to particular CPUs
> and 
> doesn't limit them when no one else is running (it's shares-
> based).  We 
> also have an OOM notifier daemon that writes to a user's tty so they 
> know if they ran out of memory.  "Killed" isn't usually a helpful
> error 
> message that they understand.
> 
> We have this in a github repo: https://github.com/BYUHPC/uft. 
> Directories that may be useful include cputime_controls,
> oom_notifierd, 
> loginlimits (lets users see their cgroup limits with some
> explanations).
> 
> Ryan
> 
> On 02/09/2017 07:18 AM, Sean McGrath wrote:
> > Hi,
> > 
> > We use cgroups to limit usage to 3 cores and 4G of memory on the
> > head nodes. I
> > didn't do it but will copy and paste in our documentation below.
> > 
> > Those limits, 3 cores and 4G, are global to all non-root users I
> > think, as they apply to a group. We obviously don't do this on the nodes.
> > 
> > We also monitor system utilisation with nagios and will intervene
> > if needed.
> > Before we had cgroups in place I very occasionally had to do a
> > pkill -u baduser
> > and lock them out temporarily until the situation was explained to
> > them.
> > 
> > Any questions please let me know.
> > 
> > Sean
> > 
> > 
> > 
> > = How to configure Cgroups locally on a system =
> > 
> > This is a step-by-step guide to configuring Cgroups locally on a
> > system.
> > 
> >  1. Install the libraries to control Cgroups and to enforce it
> > via PAM 
> > 
> > $ yum install libcgroup libcgroup-pam
> > 
> >  2. Load the Cgroups module on PAM 
> > 
> > 
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/login
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> > $ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> > 
> > 
> >  3. Set the Cgroup limits and associate them to a user group
> > 
> > 
> > add to /etc/cgconfig.conf:
> > 
> > # cpuset.mems may be different in different architectures, e.g. in
> > Parsons there
> > # is only "0".
> > group users {
> >    memory {
> >  memory.limit_in_bytes="4G";
> >  memory.memsw.limit_in_bytes="6G";
> >    }
> >    cpuset {
> >  cpuset.mems="0-1";
> >  cpuset.cpus="0-2";
> >    }
> > }
> > 
> > 
> > Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of
> > the ''memory.limit_in_bytes'' limit. So in the above example, the limit
> > is 4GB of RAM followed by a further 2 GB of swap. See:
> > 
> > [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_
> > Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-
> > use_case.html#proc-cpu_and_mem
> > ]]
> > 
> > Set no limit for root and set limits for every other individual
> > user:
> > 
> > 
> > $ echo "root  *              /"     >> /etc/cgrules.conf
> > $ echo "*     cpuset,memory  users" >> /etc/cgrules.conf
> > 
> > 
> > Note also that the ''users'' cgroup defined above is inclusive of
> > **all** users
> > (the * wildcard). So it is not a 4GB RAM limit for one user, it is
> > a 4GB RAM
> > limit in total for every non-root user.
> > 
> >  4. Start the daemon that manages Cgroups configuration and set it to start on boot

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ole Holm Nielsen
We limit CPU time in /etc/security/limits.conf so that user processes get a
maximum of 10 minutes. It doesn't eliminate the problem completely, but it's
fairly effective on users who misunderstood the role of login nodes.
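
For reference, that policy is only a couple of lines in limits.conf (a
sketch; the "cpu" item is in minutes, and the root exemption is an
assumption, not necessarily the exact config in use):

*       hard    cpu     10
root    hard    cpu     unlimited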



On Thu, Feb 9, 2017 at 6:38 PM +0100, "Jason Bacon" <bacon4...@gmail.com> wrote:




We simply make it impossible to run computational software on the head
nodes.

1. No scientific software packages are installed on the local disk.
2. Our NFS-mounted application directory is mounted with noexec.

Regards,

 Jason

On 02/09/17 07:09, John Hearns wrote:
>
> Does anyone have a good suggestion for this problem?
>
> On a cluster I am implementing I noticed a user is running a code on
> 16 cores, on one of the login nodes, outside the batch system.
>
> What are the accepted techniques to combat this? Other than applying a
> LART, if you all know what this means.
>
> On one system I set up a year or so ago I was asked to implement a
> shell timeout, so if the user was idle for 30 minutes they would be
> logged out.
>
> This actually is quite easy to set up as I recall.
>
> I guess in this case as the user is connected to a running process
> then they are not ‘idle’.
>


--
Earth is a beta site.



[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Jason Bacon



We simply make it impossible to run computational software on the head 
nodes.


1. No scientific software packages are installed on the local disk.
2. Our NFS-mounted application directory is mounted with noexec.
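
For example, the login node's /etc/fstab entry could look like this (a
sketch; the server name and paths are placeholders, and the compute nodes
would mount without noexec):

# Hypothetical login-node mount of the shared application tree:
apps-server:/export/apps  /apps  nfs  defaults,noexec  0 0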

Regards,

Jason

On 02/09/17 07:09, John Hearns wrote:


Does anyone have a good suggestion for this problem?

On a cluster I am implementing I noticed a user is running a code on 
16 cores, on one of the login nodes, outside the batch system.


What are the accepted techniques to combat this? Other than applying a 
LART, if you all know what this means.


On one system I set up a year or so ago I was asked to implement a 
shell timeout, so if the user was idle for 30 minutes they would be 
logged out.


This actually is quite easy to set up as I recall.

I guess in this case as the user is connected to a running process 
then they are not ‘idle’.





--
Earth is a beta site.


[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Thanks to Ryan, Sarlo and Sean.

> "Killed" isn't usually a helpful error message that they understand.
Au contraire, I usually find that is a message they understand. Pour
encourager les autres, you understand.





-Original Message-
From: Ryan Cox [mailto:ryan_...@byu.edu]
Sent: 09 February 2017 15:31
To: slurm-dev 
Subject: [slurm-dev] Re: Stopping compute usage on login nodes


John,

We use /etc/security/limits.conf to set cputime limits on processes:
* hard cpu 60
root hard cpu unlimited

It works pretty well but long running file transfers can get killed.  We have a 
script that looks for whitelisted programs to remove the limit from on a 
periodic basis.  We haven't experienced problems with this approach in users 
(that anyone has reported to us, at least).  Threaded programs get killed more 
quickly than multi-process programs since the limit is per process.

Additionally, we use cgroups for limits in a similar way to Sean but with an 
older approach than pam_cgroup.  We also use the cpu cgroup rather than cpuset 
because it doesn't limit them to particular CPUs and doesn't limit them when no 
one else is running (it's shares-based).  We also have an OOM notifier daemon 
that writes to a user's tty so they know if they ran out of memory.  "Killed" 
isn't usually a helpful error message that they understand.

We have this in a github repo: https://github.com/BYUHPC/uft.
Directories that may be useful include cputime_controls, oom_notifierd, 
loginlimits (lets users see their cgroup limits with some explanations).

Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:
> Hi,
>
> We use cgroups to limit usage to 3 cores and 4G of memory on the head
> nodes. I didn't do it but will copy and paste in our documentation below.
>
> Those limits, 3 cores and 4G, are global to all non-root users I think,
> as they apply to a group. We obviously don't do this on the nodes.
>
> We also monitor system utilisation with nagios and will intervene if needed.
> Before we had cgroups in place I very occasionally had to do a pkill
> -u baduser and lock them out temporarily until the situation was explained to 
> them.
>
> Any questions please let me know.
>
> Sean
>
>
>
> = How to configure Cgroups locally on a system =
>
> This is a step-by-step guide to configuring Cgroups locally on a system.
>
>  1. Install the libraries to control Cgroups and to enforce it via
> PAM 
>
> $ yum install libcgroup libcgroup-pam
>
>  2. Load the Cgroups module on PAM 
>
> 
> $ echo "session required pam_cgroup.so" >> /etc/pam.d/login
> $ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
> $ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac
> 
>
>  3. Set the Cgroup limits and associate them to a user group 
>
> add to /etc/cgconfig.conf:
> 
> # cpuset.mems may be different in different architectures, e.g. in
> Parsons there # is only "0".
> group users {
>memory {
>  memory.limit_in_bytes="4G";
>  memory.memsw.limit_in_bytes="6G";
>}
>cpuset {
>  cpuset.mems="0-1";
>  cpuset.cpus="0-2";
>}
> }
> 
>
> Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive//
> of the ''memory.limit_in_bytes'' limit. So in the above example, the
> limit is 4GB of RAM followed by a further 2 GB of swap. See:
>
> [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Lin
> ux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#p
> roc-cpu_and_mem
> ]]
>
> Set no limit for root and set limits for every other individual user:
>
> 
> $ echo "root  *              /"     >> /etc/cgrules.conf
> $ echo "*     cpuset,memory  users" >> /etc/cgrules.conf
> 
>
> Note also that the ''users'' cgroup defined above is inclusive of
> **all** users (the * wildcard). So it is not a 4GB RAM limit for one
> user, it is a 4GB RAM limit in total for every non-root user.
>
>  4. Start the daemon that manages Cgroups configuration and set it
> to start on boot 
>
> 
> $ /etc/init.d/cgconfig start
> $ chkconfig cgconfig on
> 
>
>
>
>
>
> On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:
>
>> Does anyone have a good suggestion for this problem?
>>
>> On a cluster I am implementing I noticed a user is running a code on 16 
>> cores, on one of the login nodes, outside the batch system.
>> What are the accepted techniques to combat this? Other than applying a LART, 
>> if you all know what this means.
>>

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Ryan Cox


John,

We use /etc/security/limits.conf to set cputime limits on processes:
* hard cpu 60
root hard cpu unlimited

It works pretty well, but long-running file transfers can get killed.  We
have a script that periodically looks for whitelisted programs and removes
the limit from them.  We haven't seen problems with this approach among
users (none that anyone has reported to us, at least).  Threaded programs
get killed more quickly than multi-process programs since the limit is per
process.
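
The gist of such a sweep, greatly simplified (this is not the actual
cputime_controls script; the program names and the use of prlimit are
assumptions):

#!/bin/bash
# Hypothetical periodic sweep: lift the CPU-time limit from running
# processes of whitelisted file-transfer programs.
for pid in $(pgrep -x 'rsync|scp|sftp-server'); do
    prlimit --pid "$pid" --cpu=unlimited
done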


Additionally, we use cgroups for limits in a similar way to Sean but 
with an older approach than pam_cgroup.  We also use the cpu cgroup 
rather than cpuset because it doesn't limit them to particular CPUs and 
doesn't limit them when no one else is running (it's shares-based).  We 
also have an OOM notifier daemon that writes to a user's tty so they 
know if they ran out of memory.  "Killed" isn't usually a helpful error 
message that they understand.


We have this in a github repo: https://github.com/BYUHPC/uft. 
Directories that may be useful include cputime_controls, oom_notifierd, 
loginlimits (lets users see their cgroup limits with some explanations).


Ryan

On 02/09/2017 07:18 AM, Sean McGrath wrote:

Hi,

We use cgroups to limit usage to 3 cores and 4G of memory on the head nodes. I
didn't do it but will copy and paste in our documentation below.

Those limits, 3 cores and 4G, are global to all non-root users I think, as they
apply to a group. We obviously don't do this on the nodes.

We also monitor system utilisation with nagios and will intervene if needed.
Before we had cgroups in place I very occasionally had to do a pkill -u baduser
and lock them out temporarily until the situation was explained to them.

Any questions please let me know.

Sean



= How to configure Cgroups locally on a system =

This is a step-by-step guide to configuring Cgroups locally on a system.

 1. Install the libraries to control Cgroups and to enforce it via PAM 

$ yum install libcgroup libcgroup-pam

 2. Load the Cgroups module on PAM 


$ echo "session required pam_cgroup.so" >> /etc/pam.d/login
$ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
$ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac


 3. Set the Cgroup limits and associate them to a user group 

add to /etc/cgconfig.conf:

# cpuset.mems may be different in different architectures, e.g. in Parsons there
# is only "0".
group users {
   memory {
 memory.limit_in_bytes="4G";
 memory.memsw.limit_in_bytes="6G";
   }
   cpuset {
 cpuset.mems="0-1";
 cpuset.cpus="0-2";
   }
}


Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of the
''memory.limit_in_bytes'' limit. So in the above example, the limit is 4GB of
RAM followed by a further 2 GB of swap. See:

[[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem
]]

Set no limit for root and set limits for every other individual user:


$ echo "root  *              /"     >> /etc/cgrules.conf
$ echo "*     cpuset,memory  users" >> /etc/cgrules.conf


Note also that the ''users'' cgroup defined above is inclusive of **all** users
(the * wildcard). So it is not a 4GB RAM limit for one user, it is a 4GB RAM
limit in total for every non-root user.

 4. Start the daemon that manages Cgroups configuration and set it to start
on boot 


$ /etc/init.d/cgconfig start
$ chkconfig cgconfig on






On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:


Does anyone have a good suggestion for this problem?

On a cluster I am implementing I noticed a user is running a code on 16 cores, 
on one of the login nodes, outside the batch system.
What are the accepted techniques to combat this? Other than applying a LART, if 
you all know what this means.

On one system I set up a year or so ago I was asked to implement a shell 
timeout, so if the user was idle for 30 minutes they would be logged out.
This actually is quite easy to set up as I recall.
I guess in this case as the user is connected to a running process then they 
are not 'idle'.





[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread Sean McGrath

Hi,

We use cgroups to limit usage to 3 cores and 4G of memory on the head nodes. I
didn't set it up myself, but will copy and paste our documentation below.

Those limits, 3 cores and 4G, are global to all non-root users I think, as they
apply to a group. We obviously don't do this on the nodes.

We also monitor system utilisation with nagios and will intervene if needed.
Before we had cgroups in place I very occasionally had to do a pkill -u baduser
and lock them out temporarily until the situation was explained to them.

Any questions please let me know.

Sean



= How to configure Cgroups locally on a system =

This is a step-by-step guide to configuring Cgroups locally on a system.

 1. Install the libraries to control Cgroups and to enforce it via PAM 

$ yum install libcgroup libcgroup-pam

 2. Load the Cgroups module on PAM 


$ echo "session required pam_cgroup.so" >> /etc/pam.d/login
$ echo "session required pam_cgroup.so" >> /etc/pam.d/password-auth-ac
$ echo "session required pam_cgroup.so" >> /etc/pam.d/system-auth-ac


 3. Set the Cgroup limits and associate them to a user group 

add to /etc/cgconfig.conf:

# cpuset.mems may be different in different architectures, e.g. in Parsons there
# is only "0".
group users {
  memory {
memory.limit_in_bytes="4G";
memory.memsw.limit_in_bytes="6G";
  }
  cpuset {
cpuset.mems="0-1";
cpuset.cpus="0-2";
  }
}


Note that the ''memory.memsw.limit_in_bytes'' limit is //inclusive// of the
''memory.limit_in_bytes'' limit. So in the above example, the limit is 4GB of
RAM followed by a further 2 GB of swap. See:

[[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html#proc-cpu_and_mem
]]

Set no limit for root and set limits for every other individual user:


$ echo "root  *              /"     >> /etc/cgrules.conf
$ echo "*     cpuset,memory  users" >> /etc/cgrules.conf


Note also that the ''users'' cgroup defined above is inclusive of **all** users
(the * wildcard). So it is not a 4GB RAM limit for one user, it is a 4GB RAM
limit in total for every non-root user.

 4. Start the daemon that manages Cgroups configuration and set it to start
on boot 


$ /etc/init.d/cgconfig start
$ chkconfig cgconfig on
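
After a fresh login you can sanity-check that the session landed in the
limited group (cgget ships with libcgroup; "users" is the group defined
above):

$ grep -E 'cpuset|memory' /proc/self/cgroup
$ cgget -r cpuset.cpus -r memory.limit_in_bytes users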






On Thu, Feb 09, 2017 at 05:12:12AM -0800, John Hearns wrote:

> Does anyone have a good suggestion for this problem?
> 
> On a cluster I am implementing I noticed a user is running a code on 16 
> cores, on one of the login nodes, outside the batch system.
> What are the accepted techniques to combat this? Other than applying a LART, 
> if you all know what this means.
> 
> On one system I set up a year or so ago I was asked to implement a shell 
> timeout, so if the user was idle for 30 minutes they would be logged out.
> This actually is quite easy to set up as I recall.
> I guess in this case as the user is connected to a running process then they 
> are not 'idle'.
> 
> 
> Any views or opinions presented in this email are solely those of the author 
> and do not necessarily represent those of the company. Employees of XMA Ltd 
> are expressly required not to make defamatory statements and not to infringe 
> or authorise any infringement of copyright or any other legal right by email 
> communications. Any such communication is contrary to company policy and 
> outside the scope of the employment of the individual concerned. The company 
> will not accept any liability in respect of such communication, and the 
> employee responsible will be personally liable for any damages or other 
> liability arising. XMA Limited is registered in England and Wales (registered 
> no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, 
> Wilford, Nottingham, NG11 7EP

-- 
Sean McGrath M.Sc

Systems Administrator
Trinity Centre for High Performance and Research Computing
Trinity College Dublin

sean.mcgr...@tchpc.tcd.ie

https://www.tcd.ie/
https://www.tchpc.tcd.ie/

+353 (0) 1 896 3725