All,

We are running SGE 2011.11 on a CentOS 6.3 cluster. Twice now we've had the following experience:

-- Any new jobs submitted sit in the qw state, even though there are plenty of nodes available that could satisfy the requirements of the jobs.
-- top reveals that sge_qmaster is eating far too much memory: > 50 GiB in one case, > 128 GiB in another.
-- We restarted the sgemaster. That fixes it, but...
-- Many (though not all) jobs were lost during the master restart. :(
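So that we catch this earlier next time, we are considering running a trivial watcher on the qmaster host. A minimal sketch, assuming only ps, awk, and mailx are available; the 8 GiB threshold is arbitrary and this is untested on our setup:

#!/bin/bash
# Warn when sge_qmaster resident memory crosses a threshold (8 GiB here).
while sleep 300; do
    rss_kib=$(ps -C sge_qmaster -o rss= | awk '{s += $1} END {print s + 0}')
    if [ "$rss_kib" -gt $((8 * 1024 * 1024)) ]; then
        echo "sge_qmaster RSS is ${rss_kib} KiB on $(hostname)" \
            | mail -s "qmaster memory warning" root
    fi
done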
We have a suspect (though we are not sure) for what jobs are triggering it, but we do not know why, or what to do about it. Both times this happened, someone was running a script that automatically generates and submits multiple jobs. But it wasn't submitting that many jobs -- only 18. We have users who do similar things with over 1,000 jobs without causing this. The generated scripts themselves look like reasonable job scripts. The only twist is that they use our threaded parallel environment and ask for a range of slots. An example job is:

#!/bin/bash
#$ -S /bin/bash
#$ -N c-di-GMP-I.cm
#$ -cwd

module load infernal/1.1
cmcalibrate --cpu $((NSLOTS-1)) c-di-GMP-I.cm

The scripts are submitted from a perl script with:

system(qq!qsub -pe threaded 1-32 $dir/$filename.sh!);

Our threaded parallel environment is:

pe_name            threaded
slots              5000
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

Any ideas on the following would be appreciated:

1) What is causing this?
2) How can we prevent it?
3) Is it normal that restarting the qmaster kills some jobs?
4) Is there a safer way to get out of the bad state once we are in it?
5) Is there a safe way to debug this problem, given that any given experiment might put us back in the bad state? (The P.S. at the end sketches the read-only probes we think might be safe.)

Some background: Our cluster uses the Bright cluster management system. We have 64 regular nodes with 32 slots each (each node has 16 real cores, but hyper-threading is turned on). 62 of the regular nodes are in one queue; the other 2 are in a special queue that most users cannot access. A high-memory node (with 64 slots) is in its own queue. Every node, including the head node (and the redundant head node), has 128 GiB of RAM, except for the high-memory node, which has 1 TiB. We have memory over-committing turned off:

vm.overcommit_memory = 2

[root@phoenix1 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@phoenix1 ~]# uname -a
Linux phoenix1 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@phoenix1 ~]# rpm -qa sge\*
sge-client-2011.11-323_cm6.0.x86_64
sge-2011.11-323_cm6.0.x86_64

Any ideas would be greatly appreciated.
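P.S. On question 5: the only experiments we can think of that seem unlikely to make things worse are read-only probes. A rough sketch of what we have in mind (assuming the stock 2011.11 client tools, and that schedd_job_info is enabled in the scheduler config so that qstat -j reports scheduling reasons; we have not yet run these while the qmaster was in the bad state):

#!/bin/bash
# Read-only(ish) checks while jobs are stuck in qw.

# Why does the scheduler say a particular pending job cannot be dispatched?
qstat -j <jobid>

# Show the threaded PE and the scheduler configuration as the qmaster sees them.
qconf -sp threaded
qconf -ssconf

# Ask the scheduler to log why the next dispatch run skips jobs
# (written to $SGE_ROOT/$SGE_CELL/common/schedd_runlog).
qconf -tsm

# Track qmaster memory over time.
ps -C sge_qmaster -o pid,rss,vsz,cmd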
