Hi all,

In our cluster we use virtual_free and h_vmem as consumable resources
per job:

# qconf -sc|egrep 'virtual_free|h_vmem|^#'
#name               shortcut     type        relop requestable consumable default  urgency
#------------------------------------------------------------------------------------------
h_vmem              h_vmem       MEMORY      <=    YES         JOB        0        0
virtual_free        vf           MEMORY      <=    YES         JOB        0        0


Yesterday I found a parallel job that requested 64GB of h_vmem but was
using more than 100GB of memory, and SGE did not kill it:

# qstat -j 2098938|grep vmem
hard resource_list:         virtual_free=64G,h_vmem=64G,h_rt=172800
usage    1:                 cpu=18:26:24, mem=111455.48587 GBs, io=1735.61545, vmem=196.038G, maxvmem=197.132G

The node ran out of memory and killed some processes, and finally we
killed the job ourselves (qdel):

# grep 2098938 messages
10/22/2013 18:20:49|worker|ant-master2|W|job 2098938.1 failed on host YY assumedly after job because: job 2098938.1 died through signal KILL (9)


# qacct -j 2098938 -f joao 
==============================================================
qname        rg-el6              
hostname     YY
group        XX
owner        jcurado             
project      NONE                
department   defaultdepartment   
jobname      ZZ           
jobnumber    2098938             
taskid       undefined
account      sge                 
priority     0                   
qsub_time    Tue Oct 22 12:55:58 2013
start_time   Tue Oct 22 12:59:01 2013
end_time     Tue Oct 22 18:20:48 2013
granted_pe   smp                 
slots        8                   
failed       100 : assumedly after job
exit_status  137                 
ru_wallclock 19307        
ru_utime     0.058        
ru_stime     1.662        
ru_maxrss    5412                
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    14819               
ru_majflt    2                   
ru_nswap     0                   
ru_inblock   967416              
ru_oublock   1298344             
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     2324                
ru_nivcsw    15165               
cpu          67178.120    
mem          125116.602        
io           1745.077          
iow          0.000             
maxvmem      197.184G
arid         undefined

I'm looking for more information on node YY, but I find nothing in its
messages file.
That node did kill other jobs because they used more memory than
requested in h_vmem:

main|YY|W|job 1993603 exceeds job hard limit "h_vmem" of queue "rg-el6@YY" (53771632640.00000 > limit:53687091200.00000) - sending SIGKILL

So why did it not kill that job? How can I start debugging the problem? (I'm
submitting the exact same job.)
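
To make the comparison concrete, here is a minimal sketch of the check I would expect to be applied to both jobs. It assumes the "G" suffix means 1024^3 bytes (which matches the 53687091200-byte limit in the log line above being exactly 50G); to_bytes and exceeds are hypothetical helper names of mine, not part of SGE.

```python
# Sketch only: compares reported usage against an h_vmem limit.
# Assumption: SGE's upper-case K/M/G suffixes are binary multiples.
MULTIPLIERS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def to_bytes(value):
    """Convert a string such as '64G' or '53687091200' to bytes."""
    suffix = value[-1]
    if suffix in MULTIPLIERS:
        return float(value[:-1]) * MULTIPLIERS[suffix]
    return float(value)


def exceeds(usage, limit):
    """True if usage is strictly above the limit (hypothetical check)."""
    return to_bytes(usage) > to_bytes(limit)


# Job 1993603, which was killed: values taken from the execd log line.
killed_usage = 53771632640.0
killed_limit = 53687091200.0
print(killed_limit / 1024**3)                   # 50.0 (limit in G)
print((killed_usage - killed_limit) / 1024**2)  # 80.625 (M over the limit)

# Job 2098938, which was not killed: maxvmem vs. requested h_vmem.
print(exceeds("197.132G", "64G"))               # True
```

So job 1993603 was killed for being about 80M over its limit, while job 2098938 was more than 130G over its request and survived.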


TIA,
Arnau
