All,

Thank you for the replies.  I set schedd_job_info to false (via qconf -msconf), 
and now I can run the same jobs (even through the script that automatically 
submits them) without issue.  Thanks!
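
For the record, the same change can also be made non-interactively, which is
handy when scripting it.  A rough sketch (the /tmp/sconf path is just an
example):

  # Check the current value:
  qconf -ssconf | grep schedd_job_info

  # Dump the scheduler configuration, flip the flag, and load it back:
  qconf -ssconf > /tmp/sconf
  sed -i 's/^schedd_job_info .*/schedd_job_info                   false/' /tmp/sconf
  qconf -Msconf /tmp/sconf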

Eric


On Apr 2, 2014, at 2:39 PM, Chris Dagdigian wrote:

> 
> Same symptoms seen at one of my clients just yesterday: programmatic
> scripts that submit a small number of jobs via qsub, all of which use a
> threaded PE or something similar. The cluster routinely runs much larger
> workloads.
> 
> Our sge_qmaster ran the master node out of memory and was killed hard by
> the kernel's OOM killer.
> 
> We also disabled schedd_job_info and have not seen the issue since,
> although it's only been about 24 hours.  I'm actually a huge fan of this
> parameter, as it's one of the few troubleshooting tools available to
> regular non-admin users.  The qalter '-w v' workaround is a great catch,
> though!
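> 
> For anyone searching the archives later, a minimal sketch of that
> workaround (job ID 12345 and job.sh are just placeholders):
> 
>   # Ask the scheduler to re-validate a pending job and report why it
>   # cannot currently be dispatched, without needing schedd_job_info:
>   qalter -w v 12345
> 
>   # The same validation can be requested at submit time; with -w v the
>   # job is verified but not actually submitted:
>   qsub -w v -pe threaded 1-32 job.sh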
> 
> Regards,
> Chris
> 
> 
> 
> Peskin, Eric wrote:
>> All,
>> 
>> We are running SGE 2011.11 on a CentOS 6.3 cluster.
>> Twice now we've had the following experience:
>> -- Any new jobs submitted sit in the qw state, even though there are plenty 
>> of nodes available that could satisfy the requirements of the jobs.
>> -- top reveals that sge_qmaster is eating way too much memory:  > 50 GiB in 
>> one case, > 128 GiB in another.
>> -- We restarted the sgemaster.  That fixed it, but...
>> -- Many (though not all) jobs were lost during the master restart.  :(
>> 
>> We have a suspicion (though we are not sure) about which jobs are triggering 
>> it, but we do not know why, or what to do about it.  Both times this happened, 
>> someone was running a script that automatically generates and submits 
>> multiple jobs.  But it wasn't submitting that many jobs -- only 18.  We have 
>> users who do similar things with over 1,000 jobs without causing the problem.
>> 
>> The generated scripts themselves look like reasonable job scripts.  The only 
>> twist is using our threaded parallel environment and asking for a range of 
>> slots.  An example job is:
>> 
>> #!/bin/bash
>> #$ -S /bin/bash
>> #$ -N c-di-GMP-I.cm
>> #$ -cwd
>> 
>> module load infernal/1.1
>> cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm
>> 
>> 
>> The scripts are submitted from a perl script with:
>> system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);
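>> 
>> In other words, each generated job is equivalent to a manual submission 
>> along the lines of (filename illustrative):
>> 
>>   qsub -pe threaded 1-32 c-di-GMP-I.cm.sh
>> 
>> The scheduler picks a slot count somewhere in the 1-32 range and exports 
>> it to the job as NSLOTS.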
>> 
>> Our threaded parallel environment is:
>> pe_name            threaded
>> slots              5000
>> user_lists         NONE
>> xuser_lists        NONE
>> start_proc_args    /bin/true
>> stop_proc_args     /bin/true
>> allocation_rule    $pe_slots
>> control_slaves     FALSE
>> job_is_first_task  TRUE
>> urgency_slots      min
>> accounting_summary FALSE
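>> 
>> (That listing was captured with the usual qconf PE commands, e.g.:
>> 
>>   qconf -spl           # list all parallel environments
>>   qconf -sp threaded   # show the threaded PE, as pasted above
>>   qconf -mp threaded   # edit it interactively
>> )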
>> 
>> Any ideas on the following would be appreciated:
>> 1)  What is causing this?
>> 2)  How can we prevent it?
>> 3)  Is it normal that restarting the qmaster kills some jobs?
>> 4)  Is there a safer way to get out of the bad state once we are in it?
>> 5)  Is there a safe way to debug this problem, given that any given 
>> experiment might put us back in the bad state?
>> 
>> Some background:
>> Our cluster uses the Bright cluster management system.
>> 
>> We have 64 regular nodes, with 32 slots each.  (Each node has 16 real cores, 
>> but hyper-threading is turned on.)
>> 62 of the regular nodes are in one queue.  
>> 2 of the regular nodes are in a special queue to which most users do not 
>> have access.
>> A high-memory node (with 64 slots) is in its own queue.
>> 
>> Each node, including the head node (and the redundant head node), has 128 GiB 
>> of RAM, except for the high-memory node, which has 1 TiB of RAM.  We have memory 
>> over-committing turned off:  vm.overcommit_memory = 2
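>> 
>> (For completeness, that setting is plain sysctl, e.g.:
>> 
>>   sysctl vm.overcommit_memory            # prints vm.overcommit_memory = 2
>>   cat /proc/sys/vm/overcommit_memory     # same value via procfs
>> 
>> and is typically persisted in /etc/sysctl.conf.)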
>> 
>> [root@phoenix1 ~]# cat /etc/redhat-release 
>> CentOS release 6.3 (Final)
>> [root@phoenix1 ~]# uname -a
>> Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 
>> x86_64 x86_64 x86_64 GNU/Linux
>> [root@phoenix1 ~]# rpm -qa sge\*
>> sge-client-2011.11-323_cm6.0.x86_64
>> sge-2011.11-323_cm6.0.x86_64
>> 
>> Any ideas would be greatly appreciated.
>> 




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
