[gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Gowtham


In some of the computing clusters across our campus, we have 
noticed many users running their jobs outside of the SGE 
queuing system. While we have plans to continue tutoring 
them about the benefits of using a queuing system, not 
everyone seems to be getting the message - as such, these
violating users' jobs are hampering those who have been
using SGE.

On all our Rocks-based clusters, we keep the list of the
cluster's users in a flat text file, one user per line.

Is there a way by which I (as root) can kill all those
jobs submitted outside of SGE on compute nodes by these
normal users?

Thanks,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University

(906) 487/3593

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users


Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Reuti
Hi,

Am 19.08.2011 um 18:30 schrieb Gowtham:

 [snip]

How were they able to run something thereon?

I set up my clusters without rsh, and with ssh allowed only for admin
staff. With a tight integration of parallel jobs, they will still run.

If users want to check something on the nodes, they have to use `qrsh` and get
an interactive queue with an h_cpu limit of 60 set.
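
Such a queue could be sketched along these lines (the queue name is an assumption, and the "60" is read here as 60 CPU-seconds; `h_cpu` accepts either plain seconds or h:m:s):

```
# Excerpt of `qconf -sq interactive.q` -- interactive jobs only,
# hard-killed after one minute of CPU time:
qtype     INTERACTIVE
h_cpu     0:1:0
```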

-- Reuti




Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Chris Dagdigian

I think I learned this trick from Reuti:

 - Any legit job running under Grid Engine will be a child process of 
an sge_execd daemon.


A nice little trick is a cron job that does a `kill -9` on any user 
process that is not a child of sge_execd -- that will quickly send a 
message to the people bypassing the resource-scheduling layer.
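
That sweep could be sketched as a small shell script (a dry run: it only prints candidates rather than killing them; the UID cutoff of 500 and the exact ps options are assumptions to verify on your distribution before arming it):

```shell
#!/bin/sh
# Dry-run sketch of the "not a descendant of sge_execd" sweep.
# Swap `echo` for `kill -9` only after you trust the output.

is_sge_descendant() {    # succeed if process $1 has sge_execd as an ancestor
    pid=$1
    while [ "$pid" -gt 1 ] 2>/dev/null; do
        comm=$(ps -o comm= -p "$pid" 2>/dev/null) || return 1
        [ "$comm" = "sge_execd" ] && return 0
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
        [ -z "$pid" ] && return 1
    done
    return 1
}

# Every process owned by an ordinary user (UID >= 500 assumed) that is
# not running under sge_execd is a candidate rogue job.
ps -e -o pid= -o uid= -o comm= | while read -r pid uid comm; do
    [ "$uid" -ge 500 ] 2>/dev/null || continue
    is_sge_descendant "$pid" || echo "rogue candidate: $pid ($comm)"
done
```

Run from cron every few minutes, with the `echo` swapped for `kill -9`, this gives the nudge Chris describes; root and system daemons are spared by the UID cutoff.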


That said, however, I've been in this position in a number of 
environments and I can tell you that you will NEVER win the battle with 
users trying to game the system. The motivated user will always have 
more time and more incentive than an overworked cluster administrator.


While simple technical measures like that `kill -9` trick or Reuti's 
more sensible suggestion of blocking interactive SSH access to nodes 
outside of SGE should be pursued, I'd suggest you don't spend much 
more time than that developing technical countermeasures.


The real way this gets solved in a multi-user cluster environment is by 
treating acceptable cluster usage as a human resources policy. You'll 
never win a technical battle with a motivated power user.


Acceptable cluster use should be governed by a published policy, and when 
the policy is avoided or gamed, the response should involve mentors, 
managers, or the HR department, not technology or scripts.


In a corporate setting this comes down to:

1. The first time you bypass SGE, the admins send you a warning

2. The second time you get caught, your manager gets notified

3. Third time? Your account is disabled and you are reported to the HR 
department for repeatedly violating company policy


Sorry for being long-winded, but most long-time cluster admins might 
share my opinion that cluster-use policies can't be treated as a 
technical war between admins and users -- it's far easier and better to 
treat this as a workplace behavior thing.


-Chris






Reuti wrote:
 [snip]


Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Gowtham


On Fri, 19 Aug 2011, Reuti wrote:

| [snip]

Thank you for your response. So far, the users who use SGE 
submit via qsub (most of our programs are compiled with 
MPICH2). Those who didn't use SGE made a list of 
'hot nodes', put them in a machinefile, and then submitted 
their jobs via mpirun.

The cluster has the following programs installed on it:

  # Crystal 2003 | 2006 | 2009
  # DMol3
  # Gaussian 1998 | 2003 | 2009
  # NAMD 2.8
  # Quantum Espresso 4.2.1
  # SIESTA 1.3-f1p | 2.0.1
  # SMEAGOL 1.0b
  # VASP 4.6.28 | 4.6.31 | 5.2.2 

and they have been behaving well with SGE so far.
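
For contrast with the machinefile workaround, a tightly integrated MPICH2 job would be submitted through SGE along the lines of the sketch below (the parallel environment name `mpich2` and the binary name are assumptions; under tight integration, mpirun learns its hosts and slot count from SGE itself, so no hand-built machinefile is involved):

```
#!/bin/sh
# Hypothetical SGE job script -- submit with `qsub job.sh`.
#$ -N my_mpi_job
#$ -cwd
#$ -pe mpich2 16                   # request 16 slots from the assumed "mpich2" PE
mpirun -np $NSLOTS ./my_program    # host list comes from SGE, not a machinefile
```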

How would I tighten up the SSH screws so that their jobs 
will still run, but they won't be able to log into the compute
nodes? Is it via /etc/ssh/sshd_config or some other such file?

Please let me know.

Thanks,
g




Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Gowtham

We have similar computing policies (third strike and you're out) in 
place starting with the Fall 2011 semester, but I would love to know the
technique of killing any/all user processes that are not 
children of sge_execd. It gives me something to learn about and 
use later, if the need arises.

But I do agree with your other findings - even the most 
extensive manuals/user guides we have written have mostly
gone in vain. We are starting to employ a polite version of
the RTFM policy as well - at least with those groups to whom the
documentation and demonstration were given.

Thanks for your time :)

Best,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University

(906) 487/3593


On Fri, 19 Aug 2011, Chris Dagdigian wrote:

| [snip]


Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Gowtham


On Fri, 19 Aug 2011, Reuti wrote:

| Am 19.08.2011 um 19:43 schrieb Gowtham:
| 
|  snip 
|  | If users want to check something on the nodes, they have to use `qrsh` 
and get an interactive queue with a set h_cpu 60 limit.
|  | 
|  | -- Reuti
|  [snip]
|  
|  How would I tighten up the SSH screws so that their jobs 
|  will run but won't be able to log into compute nodes? Is
|  it via /etc/ssh/sshd_config or some other such file?
| 
| Yes, it's a line like:
| 
| AllowGroups root operator
| 
| on the nodes and put the admin staff in this additional group.
| 
| MPICH2 since 1.3 has a tight integration into SGE by default. For Gaussian
| it's necessary to adjust Linda_rsh to call a plain rsh instead of /usr/bin/rsh,
| so that the rsh-wrapper will catch it and route it to SGE's qrsh (in case you
| use Linda).
| 
| -- Reuti


Thank you! And these edits to /etc/ssh/sshd_config go on 
all the compute nodes, right? If yes, I could edit 
extend-compute.xml and have them in place for the next 
re-install.

Thank you,
g




Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Gary_Smith
I agree with Chris.  You will never find a technical solution to 
policing your cluster.  I have found the number one most effective tool 
to be shame.  I publish a "largest disk hogs and rogue jobs" list to the 
entire community under the guise of a "How is the cluster doing?" report. 
Nothing like peer pressure to put a stop to this kind of activity.

--Gary



From:   Chris Dagdigian d...@sonsorol.org
To: Reuti re...@staff.uni-marburg.de, Gowtham sgowt...@mtu.edu, 
Sun Grid Engine Discussion List users@gridengine.org, NPACI Rocks 
Discussion List npaci-rocks-discuss...@sdsc.edu
Date:   08/19/2011 12:52 PM
Subject:Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs 
on Compute Nodes by Normal Users
Sent by:users-boun...@gridengine.org



[snip]


Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users

2011-08-19 Thread Reuti
Am 19.08.2011 um 19:54 schrieb Gowtham:

 [snip]
 Thank you! And these edits in /etc/ssh/sshd_config go on 
 all the compute nodes, right? If yes, I could edit the
 extend-compute.xml and have them in place for next 
 re-install.

Yep, just be sure that the groups exist, to avoid locking yourself out.
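
For reference, the compute-node fragment might look like this (the group names follow the earlier `root operator` example; checking them with something like `getent group operator` before restarting sshd avoids the lockout mentioned above):

```
# /etc/ssh/sshd_config on each compute node:
# interactive SSH logins only for these groups; SGE's qrsh is unaffected
# because by default it goes through sge_execd rather than sshd.
AllowGroups root operator
```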

-- Reuti