All,

We are running SGE 2011.11 on a CentOS 6.3 cluster. Twice now we've had the following experience:

-- Any new jobs submitted sit in the qw state, even though there are plenty of nodes available that could satisfy the requirements of the jobs.
-- top reveals that sge_qmaster is eating far too much memory: > 50 GiB in one case, > 128 GiB in another.
-- We restarted the sgemaster. That fixes it, but...
-- Many (though not all) jobs were lost during the master restart. :(
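So that we catch this earlier next time, we are considering running a trivial watcher on the qmaster host. A minimal sketch, assuming only ps, awk, and mailx are available; the 8 GiB threshold is arbitrary and this is untested on our setup:

#!/bin/bash
# Warn when sge_qmaster resident memory crosses a threshold (8 GiB here).
while sleep 300; do
    rss_kib=$(ps -C sge_qmaster -o rss= | awk '{s += $1} END {print s + 0}')
    if [ "$rss_kib" -gt $((8 * 1024 * 1024)) ]; then
        echo "sge_qmaster RSS is ${rss_kib} KiB on $(hostname)" \
            | mail -s "qmaster memory warning" root
    fi
done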
We have a suspect (though we are not sure) for what jobs are triggering it, but we do not know why, or what to do about it. Both times this happened, someone was running a script that automatically generates and submits multiple jobs. But it wasn't submitting that many jobs -- only 18. We have users who do similar things with over 1,000 jobs without causing this. The generated scripts themselves look like reasonable job scripts. The only twist is that they use our threaded parallel environment and ask for a range of slots. An example job is:

#!/bin/bash
#$ -S /bin/bash
#$ -N c-di-GMP-I.cm
#$ -cwd

module load infernal/1.1
cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm

The scripts are submitted from a perl script with:

system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);

Our threaded parallel environment is:

pe_name            threaded
slots              5000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Any ideas on the following would be appreciated:

1) What is causing this?
2) How can we prevent it?
3) Is it normal that restarting the qmaster kills some jobs?
4) Is there a safer way to get out of the bad state once we are in it?
5) Is there a safe way to debug this problem, given that any given experiment might put us back in the bad state? (The P.S. at the end sketches the read-only probes we think might be safe.)

Some background: Our cluster uses the Bright cluster management system. We have 64 regular nodes with 32 slots each (each node has 16 real cores, but hyper-threading is turned on). 62 of the regular nodes are in one queue; the other 2 are in a special queue that most users cannot access. A high-memory node (with 64 slots) is in its own queue. Every node, including the head node (and the redundant head node), has 128 GiB of RAM, except for the high-memory node, which has 1 TiB. We have memory over-committing turned off:

vm.overcommit_memory = 2

[root@phoenix1 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@phoenix1 ~]# uname -a
Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@phoenix1 ~]# rpm -qa sge\*
sge-client-2011.11-323_cm6.0.x86_64
sge-2011.11-323_cm6.0.x86_64

Any ideas would be greatly appreciated.
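P.S. On question 5: the only experiments we can think of that seem unlikely to make things worse are read-only probes. A rough sketch of what we have in mind (assuming the stock 2011.11 client tools, and that schedd_job_info is enabled in the scheduler config so that qstat -j reports scheduling reasons; we have not yet run these while the qmaster was in the bad state):

#!/bin/bash
# Read-only(ish) checks while jobs are stuck in qw.

# Why does the scheduler say a particular pending job cannot be dispatched?
qstat -j <jobid>

# Show the threaded PE and the scheduler configuration as the qmaster sees them.
qconf -sp threaded
qconf -ssconf

# Ask the scheduler to log why the next dispatch run skips jobs
# (written to $SGE_ROOT/$SGE_CELL/common/schedd_runlog).
qconf -tsm

# Track qmaster memory over time.
ps -C sge_qmaster -o pid,rss,vsz,cmd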
