We encountered a similar problem with GE 6.2u5, and it turned out to be a
bug in schedd_job_info in sched_conf. Disabling it made our problems go
away. We don't depend heavily on schedd_job_info anyway; most of the time,
running "-w v" with qalter or qsub is enough to find out why a job isn't
being scheduled.

On Wed, Apr 02, 2014 at 06:27:07PM +0000, Peskin, Eric wrote:
> All,
> 
> We are running SGE 2011.11 on a CentOS 6.3 cluster.
> Twice now we've had the following experience:
> -- Any new jobs submitted sit in the qw state, even though there are plenty 
> of nodes available that could satisfy the requirements of the jobs.
> -- top reveals that sge_qmaster is eating way too much memory:  > 50 GiB in 
> one case, > 128 GiB in another.
> -- We restarted the sgemaster.  That fixed it, but...
> -- Many (though not all) jobs were lost during the master restart.  :(
> 
> We suspect (but are not sure) which jobs are triggering it, and we do not 
> know why, or what to do about it.  Both times this has happened, 
> someone was running a script that automatically generates and submits 
> multiple jobs.  But it wasn't submitting that many jobs -- only 18.  We have 
> users who do similar things with over 1,000 jobs without causing this.
> 
> The generated scripts themselves look like reasonable job scripts.  The only 
> twist is using our threaded parallel environment and asking for a range of 
> slots.  An example job is:
> 
> #!/bin/bash
> #$ -S /bin/bash
> #$ -N c-di-GMP-I.cm
> #$ -cwd
> 
> module load infernal/1.1
> cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm
> 
> 
> The scripts are submitted from a perl script with:
> system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);
> 
> Our threaded parallel environment is:
> pe_name            threaded
> slots              5000
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    $pe_slots
> control_slaves     FALSE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
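> 
> For reference, the PE above can be dumped with "qconf -sp threaded", and each 
> generated submission is equivalent to running by hand something like (the 
> script name is a placeholder for one of the generated files):
> 
> qsub -pe threaded 1-32 some-generated-job.sh
> 
> Grid Engine sets NSLOTS in the job's environment to the number of slots it 
> actually grants from that 1-32 range, which is what $((NSLOTS-1)) in the 
> script refers to.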
> 
> Any ideas on the following would be appreciated:
> 1)  What is causing this?
> 2)  How can we prevent it?
> 3)  Is it normal that restarting the qmaster kills some jobs?
> 4)  Is there a safer way to get out of the bad state once we are in it?
> 5)  Is there a safe way to debug this problem, given that any given 
> experiment might put us back in the bad state?
> 
> Some background:
> Our cluster uses the Bright cluster management system.
> 
> We have 64 regular nodes, with 32 slots each.  (Each node has 16 physical 
> cores with hyper-threading turned on, giving 32 hardware threads.)
> 62 of the regular nodes are in one queue.  
> 2 of the regular nodes are in a special queue to which most users do not have 
> access.
> A high-memory node (with 64 slots) is in its own queue.
> 
> Each node, including the head node (and the redundant head node), has 128 GiB 
> of RAM, except for the high-memory node, which has 1 TiB.  We have memory 
> overcommitting turned off:  vm.overcommit_memory = 2
> 
> [root@phoenix1 ~]# cat /etc/redhat-release 
> CentOS release 6.3 (Final)
> [root@phoenix1 ~]# uname -a
> Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 
> x86_64 x86_64 x86_64 GNU/Linux
> [root@phoenix1 ~]# rpm -qa sge\*
> sge-client-2011.11-323_cm6.0.x86_64
> sge-2011.11-323_cm6.0.x86_64
> 
> Any ideas would be greatly appreciated.
> 
> 

-- 
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
