On 23.10.2013 at 10:29, Arnau Bria wrote:

> On Wed, 23 Oct 2013 10:06:12 +0200
> Reuti wrote:
>
>> Hi,
> Hi Reuti,
>
>>> # qconf -sc|egrep 'virtual_free|h_vmem|^#'
>>> #name         shortcut  type    relop  requestable  consumable  default  urgency
>>> #------------------------------------------------------------------------------------------
>>> h_vmem        h_vmem    MEMORY  <=     YES          JOB         0        0
>>> virtual_free  vf        MEMORY  <=     YES          JOB         0        0
>>>
>>>
>>> Yesterday I found a parallel job that asked for 64GB of h_vmem and
>>> was using more than 100GB of memory, but SGE did not kill it:
>>
>> More than 100G in total or per slot (as the limit is multiplied)?
> ??
>
> from sge_complex:
>
>   A consumable defined by 'y' is a per slot consumable, which means
>   the limit is multiplied by the number of slots being used by the job
>   before being applied. In case of 'j' the consumable is a per job
>   consumable.
>
> Doesn't "JOB" mean per job total?

Hehe - I missed that "JOB" setting. There was already a discussion about this symptom for 6.2u5, and I don't know whether these problems have been fixed yet:

http://gridengine.org/pipermail/users/2013-January/005419.html

There are some flaws with this setting: sometimes it works, sometimes it doesn't:

$ qconf -sc
#name    shortcut  type    relop  requestable  consumable  default  urgency
...
h_vmem   h_vmem    MEMORY  <=     YES          JOB         128M     0

reuti@pc15370:~> qsub -pe openmpi 2 -l h_vmem=2M,h=pc15370 test.sh
Your job 10091 ("test.sh") has been submitted
reuti@pc15370:~> qsub -pe openmpi 2 -l h_vmem=2M,h=pc15370 test.sh
Your job 10092 ("test.sh") has been submitted
reuti@pc15370:~> qstat
job-ID  prior    name     user   state  submit/start at      queue          slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
10091   1.05000  test.sh  reuti  r      10/23/2013 16:29:28  all.q@pc15370  2
10092   1.05000  test.sh  reuti  dr     10/23/2013 16:29:28  all.q@pc15370  2

10/23/2013 16:29:29| main|pc15370|W|job 10092 exceeds job hard limit "h_vmem" of queue "all.q@pc15370" (6164480.00000 > limit:4194304.00000) - sending SIGKILL
10/23/2013 16:29:29| main|pc15370|I|SIGNAL jid: 10092 jatask: 1 signal: KILL

But the other job, 10091, survived and ended properly. And why is the limit 4194304 and not 2M, not to mention the ulimit:

reuti@pc15370:~> grep virtual test.sh.o10091 test.sh.o10092
test.sh.o10091:virtual memory (kbytes, -v) 10240
test.sh.o10092:virtual memory (kbytes, -v) 10240

(If you don't request h_vmem, it is fine: it is set to the default, but multiplied(!) by the requested slot count, despite the "JOB" setting.)

-- Reuti

>>> # qstat -j 2098938|grep vmem
>>> hard resource_list: virtual_free=64G,h_vmem=64G,h_rt=172800
>>> usage 1: cpu=18:26:24, mem=111455.48587 GBs,
>>> io=1735.61545, vmem=196.038G, maxvmem=197.132G
>>
>> Can you please `grep` the messages file for the executing node for
>> other entries of job "2098938".
>
> # ls
> active_jobs  job_scripts  messages-20130630.gz  messages-20130721.gz
> messages-20130811.gz  messages-20130901.gz  messages-20130922.gz
> messages-20131013.gz
> execd.pid  messages  messages-20130707.gz  messages-20130728.gz
> messages-20130818.gz  messages-20130908.gz  messages-20130929.gz
> messages-20131020.gz
> jobs  messages-20130623.gz  messages-20130714.gz  messages-20130804.gz
> messages-20130825.gz  messages-20130915.gz  messages-20131006.gz
> # zgrep 2098938 messages*
> #
>
> there are no entries for that job....
>
>> -- Reuti
> Thanks,
> Arnau
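
A note on the question above of why the log reports a limit of 4194304 rather than 2M: the number is exactly the 2M request multiplied by the 2 requested slots, which would match the per-slot multiplication being applied despite the "JOB" setting. This is only an arithmetic check, not a statement about what execd does internally. The helper name scaled_limit is made up for illustration, and reading "2M" as 2 * 1024 * 1024 bytes is an assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): multiply a byte limit by a slot count.
scaled_limit() {
    request_bytes=$1
    slots=$2
    echo $(( request_bytes * slots ))
}

# Assumption: "2M" means 2 * 1024 * 1024 bytes.
request=$(( 2 * 1024 * 1024 ))

# 2 slots were requested with "-pe openmpi 2".
scaled_limit "$request" 2
```

The script prints 4194304, the limit value shown in the messages file. Job 10092 was reported at 6164480 bytes, which is above that value.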
