I saw the same symptoms at one of my clients just yesterday: a programmatic script submitting a small number of jobs through qsub, all requesting a threaded PE or similar. That cluster routinely runs much larger workloads without trouble.
Our sge_qmaster ran the master node out of memory and was killed hard by the
kernel's OOM killer. We also disabled schedd_job_info and have not seen the
issue since, although it has only been about 24 hours. I'm actually a huge fan
of that parameter, as it's one of the few troubleshooting tools available to
regular non-admin users. The qalter '-w v' workaround is a great catch, though!
(A quick sketch of both commands is below, after Eric's message.)

Regards,
Chris

Peskin, Eric wrote:
> All,
>
> We are running SGE 2011.11 on a CentOS 6.3 cluster.
> Twice now we've had the following experience:
> -- Any new jobs submitted sit in the qw state, even though there are plenty
>    of nodes available that could satisfy the requirements of the jobs.
> -- top reveals that sge_qmaster is eating way too much memory: >50 GiB in
>    one case, >128 GiB in another.
> -- We restarted the sgemaster. That fixes it, but...
> -- Many (though not all) jobs were lost during the master restart. :(
>
> We have a suspicion (but are not sure) about which jobs are triggering it,
> but we do not know why or what to do about it. Both times this happened,
> someone was running a script that automatically generates and submits
> multiple jobs. But it wasn't submitting that many jobs -- only 18. We have
> users who do similar things with over 1,000 jobs without causing this.
>
> The generated scripts themselves look like reasonable job scripts. The only
> twist is using our threaded parallel environment and asking for a range of
> slots. An example job is:
>
> #!/bin/bash
> #$ -S /bin/bash
> #$ -N c-di-GMP-I.cm
> #$ -cwd
>
> module load infernal/1.1
> cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm
>
> The scripts are submitted from a perl script with:
> system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);
>
> Our threaded parallel environment is:
> pe_name             threaded
> slots               5000
> user_lists          NONE
> xuser_lists         NONE
> start_proc_args     /bin/true
> stop_proc_args      /bin/true
> allocation_rule     $pe_slots
> control_slaves      FALSE
> job_is_first_task   TRUE
> urgency_slots       min
> accounting_summary  FALSE
>
> Any ideas on the following would be appreciated:
> 1) What is causing this?
> 2) How can we prevent it?
> 3) Is it normal that restarting the qmaster kills some jobs?
> 4) Is there a safer way to get out of the bad state once we are in it?
> 5) Is there a safe way to debug this problem, given that any given
>    experiment might put us back in the bad state?
>
> Some background:
> Our cluster uses the Bright cluster management system.
>
> We have 64 regular nodes with 32 slots each. (Each node has 16 physical
> cores, but hyper-threading is turned on.)
> 62 of the regular nodes are in one queue.
> 2 of the regular nodes are in a special queue to which most users do not
> have access.
> A high-memory node (with 64 slots) is in its own queue.
>
> Each node, including the head node (and the redundant head node), has
> 128 GiB of RAM, except for the one high-memory node with 1 TiB of RAM. We
> have memory over-committing turned off: vm.overcommit_memory = 2
>
> [root@phoenix1 ~]# cat /etc/redhat-release
> CentOS release 6.3 (Final)
> [root@phoenix1 ~]# uname -a
> Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013
> x86_64 x86_64 x86_64 GNU/Linux
> [root@phoenix1 ~]# rpm -qa sge\*
> sge-client-2011.11-323_cm6.0.x86_64
> sge-2011.11-323_cm6.0.x86_64
>
> Any ideas would be greatly appreciated.
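As promised above, here is a rough sketch of what we did, in case it helps
anyone else. This is from memory against our 2011.11 install, so details may
differ on your build, and <jobid> is just a placeholder for a pending job's id:

    # Check the current setting (true, false, or job_list ...):
    qconf -ssconf | grep schedd_job_info

    # Disable it: this opens the scheduler configuration in $EDITOR;
    # change the schedd_job_info line to:
    #   schedd_job_info false
    qconf -msconf

    # With schedd_job_info off, "qstat -j <jobid>" no longer shows the
    # "scheduling info" section, but a regular user can still ask the
    # qmaster to verify a pending job and print a report without
    # modifying it:
    qalter -w v <jobid>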
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
