Hi all,

In our cluster we use virtual_free and h_vmem as consumable resources per job:
# qconf -sc | egrep 'virtual_free|h_vmem|^#'
#name           shortcut   type     relop   requestable   consumable   default   urgency
#------------------------------------------------------------------------------------------
h_vmem          h_vmem     MEMORY   <=      YES           JOB          0         0
virtual_free    vf         MEMORY   <=      YES           JOB          0         0

Yesterday I found a parallel job that had asked for 64 GB of h_vmem but was using more than 100 GB of memory, and SGE did not kill it:

# qstat -j 2098938 | grep vmem
hard resource_list:    virtual_free=64G,h_vmem=64G,h_rt=172800
usage    1:            cpu=18:26:24, mem=111455.48587 GBs, io=1735.61545, vmem=196.038G, maxvmem=197.132G

The node ran out of memory and killed some processes, and finally we killed (qdel) the job ourselves:

# grep 2098938 messages
10/22/2013 18:20:49|worker|ant-master2|W|job 2098938.1 failed on host YY assumedly after job because: job 2098938.1 died through signal KILL (9)

# qacct -j 2098938 -f joao
==============================================================
qname         rg-el6
hostname      YY
group         XX
owner         jcurado
project       NONE
department    defaultdepartment
jobname       ZZ
jobnumber     2098938
taskid        undefined
account       sge
priority      0
qsub_time     Tue Oct 22 12:55:58 2013
start_time    Tue Oct 22 12:59:01 2013
end_time      Tue Oct 22 18:20:48 2013
granted_pe    smp
slots         8
failed        100 : assumedly after job
exit_status   137
ru_wallclock  19307
ru_utime      0.058
ru_stime      1.662
ru_maxrss     5412
ru_ixrss      0
ru_ismrss     0
ru_idrss      0
ru_isrss      0
ru_minflt     14819
ru_majflt     2
ru_nswap      0
ru_inblock    967416
ru_oublock    1298344
ru_msgsnd     0
ru_msgrcv     0
ru_nsignals   0
ru_nvcsw      2324
ru_nivcsw     15165
cpu           67178.120
mem           125116.602
io            1745.077
iow           0.000
maxvmem       197.184G
arid          undefined

I'm looking for extra information on node YY, but I find nothing in its messages file. That node has killed other jobs because they used more memory than requested via h_vmem:

main|YY|W|job 1993603 exceeds job hard limit "h_vmem" of queue "rg-el6@YY" (53771632640.00000 > limit:53687091200.00000) - sending SIGKILL

So, why did it not kill this job, and how can I start debugging the problem? (I'm submitting the exact same job again.)

TIA,
Arnau
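PS: before resubmitting the real job, this is roughly what I plan to check first (just a sketch; the queue name, PE and host name are the ones from the output above, and the exact spool path depends on the local setup):

# is h_vmem overridden at queue or exec-host level?
qconf -sq rg-el6 | grep -i vmem
qconf -se YY

# any relevant execd_params in the global or host-specific configuration?
qconf -sconf
qconf -sconf YY

and then a tiny job that requests 1G of h_vmem but tries to allocate about 2G, to see whether enforcement works at all on that node (either the allocation should fail because of the rlimit, or execd should SIGKILL the job and log an "exceeds job hard limit" line in the node's messages file):

qsub -q rg-el6 -l hostname=YY -pe smp 2 -l h_vmem=1G,virtual_free=1G -b y \
    perl -e '$x = "x" x (2 * 1024 ** 3); sleep 300'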
