[gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
In some of the computing clusters across our campus, we have noticed many users running their jobs outside of the SGE queuing system. While we plan to continue tutoring them about the benefits of using a queuing system, not everyone seems to be getting the message - as such, these users' jobs are hampering those who have been using SGE.

On all our Rocks-based clusters, we keep the list of the cluster's users in a flat text file, one user per line.

Is there a way by which I (as root) can kill all those jobs submitted outside of SGE on compute nodes by these normal users?

Thanks,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University
(906) 487/3593

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
Hi,

Am 19.08.2011 um 18:30 schrieb Gowtham:

| In some of the computing clusters across our campus, we have noticed many users running their jobs outside of the SGE queuing system. [snip]
|
| Is there a way by which I (as root) can kill all those jobs submitted outside of SGE on compute nodes by these normal users?

How were they able to run something thereon? I set up my clusters without rsh, and with ssh allowed only for admin staff. When you have a tight integration of parallel jobs, they will still run. If users want to check something on the nodes, they have to use `qrsh` and get an interactive queue with an h_cpu limit of 60 set.

-- Reuti

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
I think I learned this trick from Reuti:

- Any legit job running under Grid Engine will be a child process of an sge_execd daemon.

A nice little trick is a cronjob that does a kill -9 on any user process that is not a child of sge_execd -- that will quickly send a message to the people bypassing the resource scheduling layer.

That said, however, I've been in this position in a number of environments and I can tell you that you will NEVER win the battle with users trying to game the system. The motivated user will always have more time and more incentive than an overworked cluster administrator.

While simple technical measures like that kill -9 trick, or Reuti's more sensible suggestion of blocking interactive SSH access to nodes outside of SGE, should be pursued, I'd suggest that you don't spend much more time than that developing technical countermeasures.

The real way this gets solved in a multi-user cluster environment is by treating acceptable cluster usage as a human resources policy. You'll never win a technical battle with a motivated power user. Acceptable cluster use should be governed by a published policy, and when the policy is avoided or gamed, the response should involve mentors, managers or the HR department, not technology or scripts.

In a corporate setting this comes down to:

1. First time you bypass SGE, the admins send you a warning.

2. Second time you get caught, your manager gets notified.

3. Third time? Account is disabled and you are reported to the HR department for repeatedly violating company policy.

Sorry for being long-winded, but most long-time cluster admins might share my opinion that cluster use policies can't be treated as a technical war between admins and users -- it's far easier and better to treat this as a workplace behavior thing.

-Chris

Reuti wrote:
| Hi,
|
| Am 19.08.2011 um 18:30 schrieb Gowtham:
|
| In some of the computing clusters across our campus, we have noticed many users running their jobs outside of the SGE queuing system. [snip]

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
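[Editor's note: the kill -9 trick described above can be sketched roughly as below. The `find_rogues` helper, the report-only default, and the `MIN_UID` cutoff of 500 (a common RHEL 5 convention) are assumptions, not part of the original suggestion -- dry-run the report before wiring any kill into cron.]

```shell
#!/bin/sh
# find_rogues: read "PID PPID UID COMM" lines on stdin and print every
# process owned by a regular user whose ancestry does NOT contain sge_execd.
find_rogues() {
    awk -v min_uid="${MIN_UID:-500}" '
    { parent[$1] = $2; uid[$1] = $3; comm[$1] = $4; pids[$1] = 1 }
    END {
        for (p in pids) {
            if (uid[p] + 0 < min_uid + 0) continue   # skip system accounts
            found = 0
            # walk up the process tree looking for sge_execd
            for (a = p; (a in parent) && a != 1; a = parent[a])
                if (comm[a] == "sge_execd") { found = 1; break }
            if (!found)
                printf "rogue pid=%s uid=%s comm=%s\n", p, uid[p], comm[p]
        }
    }'
}

# Report-only usage from cron on a compute node:
#   ps -e -o pid= -o ppid= -o uid= -o comm= | find_rogues
# Only once you trust the report, extract the PIDs and feed them to kill -9.
```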
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
On Fri, 19 Aug 2011, Reuti wrote:
| How were they able to run something thereon?
|
| I set up my clusters without rsh, and ssh only allowed for admin staff. When you have a tight integration of parallel jobs, they will still run.
|
| If users want to check something on the nodes, they have to use `qrsh` and get an interactive queue with an h_cpu limit of 60 set.
|
| -- Reuti

Thank you for your response. So far, the users who use SGE submit via qsub (most of our programs are compiled with MPICH2). Those who didn't use SGE made a list of 'hot nodes', put them in a machinefile, and then submitted their jobs via mpirun.

The cluster has the following programs installed on it:

# Crystal 2003 | 2006 | 2009
# DMol3
# Gaussian 1998 | 2003 | 2009
# NAMD 2.8
# Quantum Espresso 4.2.1
# SIESTA 1.3-f1p | 2.0.1
# SMEAGOL 1.0b
# VASP 4.6.28 | 4.6.31 | 5.2.2

and they have been behaving well with SGE so far.

How would I tighten up the SSH screws so that their jobs will run but they won't be able to log into compute nodes? Is it via /etc/ssh/sshd_config or some other such file? Please let me know.

Thanks,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University
(906) 487/3593

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
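[Editor's note: for contrast with the hand-built machinefile approach above, a scheduler-managed MPI submission might look like the sketch below. The PE name "mpich2", the slot count, and the program name are assumptions -- check the site's parallel environments with `qconf -spl`.]

```sh
#!/bin/sh
# Hypothetical SGE submit script relying on a tight MPICH2 (>= 1.3)
# integration; submit with: qsub this_script.sh
#$ -N my_mpi_job
#$ -cwd
#$ -pe mpich2 16
# With a tight integration, mpiexec learns the granted hosts from SGE
# itself; no hand-built machinefile is needed (or wanted).
mpiexec -n $NSLOTS ./my_program
```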
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
We have similar computing policies (3rd strike and out) in place starting Fall 2011 semester, but I would love to know the technique of killing any/all user processes that are not children of sge_execd. It gives me something to learn about and use later, if the need arises.

But I do agree with your other findings - even the most extensive manuals/user guides we have written have mostly gone in vain. We are starting to employ a polite version of the RTFM policy as well - at least with those groups to whom the documentation demonstrations were given.

Thanks for your time :)

Best,
g

--
Gowtham
Advanced IT Research Support
Michigan Technological University
(906) 487/3593

On Fri, 19 Aug 2011, Chris Dagdigian wrote:
| I think I learned this trick from Reuti:
|
| - Any legit job running under Grid Engine will be a child process of an sge_execd daemon.
|
| A nice little trick is a cronjob that does a kill -9 on any user process that is not a child of sge_execd -- that will quickly send a message to the people bypassing the resource scheduling layer.
|
| [snip]

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
On Fri, 19 Aug 2011, Reuti wrote:
| Am 19.08.2011 um 19:43 schrieb Gowtham:
|
| | [snip]
| | How would I tighten up the SSH screws so that their jobs will run but they won't be able to log into compute nodes? Is it via /etc/ssh/sshd_config or some other such file?
|
| Yes, it's a line like:
|
| AllowGroups root operator
|
| on the nodes, and put the admin staff in this additional group.
|
| MPICH2 since 1.3 has a tight integration into SGE by default. For Gaussian it's necessary to adjust Linda_rsh to call a plain rsh instead of /usr/bin/rsh, so that the rsh-wrapper will catch it and route it to SGE's qrsh (in case you use Linda).
|
| -- Reuti

Thank you! And these edits in /etc/ssh/sshd_config go on all the compute nodes, right? If yes, I could edit extend-compute.xml and have them in place for the next re-install.

Thank you,
g

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
I agree with Chris. You will never find a technical solution to policing your cluster. I have found the number one most effective tool to be shame. I publish a "largest disk hog" and "rogue job" list to the entire community under the guise of a "How is the cluster doing" report. Nothing like peer pressure to put a stop to this kind of activity.

--Gary

From: Chris Dagdigian d...@sonsorol.org
To: Reuti re...@staff.uni-marburg.de, Gowtham sgowt...@mtu.edu, Sun Grid Engine Discussion List users@gridengine.org, NPACI Rocks Discussion List npaci-rocks-discuss...@sdsc.edu
Date: 08/19/2011 12:52 PM
Subject: Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
Sent by: users-boun...@gridengine.org

I think I learned this trick from Reuti:

- Any legit job running under Grid Engine will be a child process of an sge_execd daemon.

A nice little trick is a cronjob that does a kill -9 on any user process that is not a child of sge_execd -- that will quickly send a message to the people bypassing the resource scheduling layer.

[snip]

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
Re: [gridengine users] Rocks 5.4: Terminate Non-SGE Jobs on Compute Nodes by Normal Users
Am 19.08.2011 um 19:54 schrieb Gowtham:

| [snip]
|
| Thank you! And these edits in /etc/ssh/sshd_config go on all the compute nodes, right? If yes, I could edit extend-compute.xml and have them in place for the next re-install.

Yep, just be sure that the groups exist, to avoid locking yourself out.

-- Reuti

___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
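[Editor's note: a minimal sketch of the lockdown discussed above, assuming an admin group named "operator" as in Reuti's example and a hypothetical admin account name. Run on each compute node, in this order, so the groups exist before sshd enforces the restriction:]

```sh
# Make sure the admin group exists and holds the right people first
groupadd operator 2>/dev/null || true
usermod -a -G operator some_admin_user      # hypothetical admin account

# Restrict interactive SSH logins on the node to admin groups only
echo "AllowGroups root operator" >> /etc/ssh/sshd_config
service sshd restart
```

To persist this across Rocks re-installs, the same commands would go into a post-install section of extend-compute.xml. Tightly integrated parallel jobs and `qrsh` sessions are unaffected, provided SGE is not configured to start remote tasks via ssh.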