Re: [gridengine users] job not killed after reaching h_vmem

Reuti Wed, 23 Oct 2013 08:06:12 -0700

On 23.10.2013 at 10:29, Arnau Bria wrote:

> On Wed, 23 Oct 2013 10:06:12 +0200
> Reuti wrote:
> 
>> Hi,
> Hi Reuti,
> 
>>> # qconf -sc|egrep 'virtual_free|h_vmem|^#'
>>> #name               shortcut     type        relop requestable consumable default  urgency
>>> #------------------------------------------------------------------------------------------
>>> h_vmem              h_vmem       MEMORY      <=    YES         JOB        0        0
>>> virtual_free        vf           MEMORY      <=    YES         JOB        0        0
>>> 
>>> 
>>> Yesterday I found a parallel job that asked for 64GB of h_vmem and
>>> was using more than 100GB of memory, but SGE did not kill it:
>> 
>> More than 100G in total or per slot (as the limit is multiplied)?
> ?? 
> 
> from sge_complex:
> 
> A  consumable  defined  by ’y’ is a per slot consumables which means
> the limit is multiplied by the number of slots being used by the job
> before being applied.  In case of ’j’ the consumable is a per job
> consumable.
> 
> doesn't "JOB" mean per job total?
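
Read literally, the sge_complex excerpt above describes two ways of turning a request into the enforced limit. A minimal Python sketch of that documented rule follows. The function name `applied_limit` and the slot count of 4 are made up for illustration (the thread does not give the slot count of the job in question), and this is not SGE code.

```python
def applied_limit(request_bytes, slots, consumable):
    """Limit implied by the sge_complex text quoted above.

    consumable is the value shown in the consumable column of
    `qconf -sc`: "YES" (per slot) or "JOB" (per job).
    """
    if consumable == "YES":
        # Per-slot consumable: multiplied by the slots the job uses.
        return request_bytes * slots
    if consumable == "JOB":
        # Per-job consumable: applied once, whatever the slot count.
        return request_bytes
    raise ValueError("unexpected consumable setting: %r" % consumable)


GIB = 1024 ** 3
# h_vmem=64G as requested by the job; 4 slots is a hypothetical figure.
print(applied_limit(64 * GIB, 4, "JOB") // GIB)  # 64
print(applied_limit(64 * GIB, 4, "YES") // GIB)  # 256
```

Under the documented "JOB" semantics the slot count should not matter, which is why usage well above 64G without a kill looks wrong.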


Hehe - I missed that "JOB" setting. This symptom was already discussed for 
6.2u5, and I don't know whether it has been fixed since:

http://gridengine.org/pipermail/users/2013-January/005419.html

This setting has some flaws; sometimes it works and sometimes it doesn't:

$ qconf -sc
#name               shortcut   type        relop   requestable consumable default  urgency
...
h_vmem              h_vmem     MEMORY      <=      YES         JOB        128M     0


reuti@pc15370:~> qsub -pe openmpi 2 -l h_vmem=2M,h=pc15370 test.sh
Your job 10091 ("test.sh") has been submitted
reuti@pc15370:~> qsub -pe openmpi 2 -l h_vmem=2M,h=pc15370 test.sh
Your job 10092 ("test.sh") has been submitted

reuti@pc15370:~> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  10091 1.05000 test.sh    reuti        r     10/23/2013 16:29:28 all.q@pc15370                      2
  10092 1.05000 test.sh    reuti        dr    10/23/2013 16:29:28 all.q@pc15370                      2


10/23/2013 16:29:29|  main|pc15370|W|job 10092 exceeds job hard limit "h_vmem" of queue "all.q@pc15370" (6164480.00000 > limit:4194304.00000) - sending SIGKILL
10/23/2013 16:29:29|  main|pc15370|I|SIGNAL jid: 10092 jatask: 1 signal: KILL
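
When scanning a messages file for such kills, it can help to pull the job ID, measured usage and applied limit out of the warning line. The following is a rough Python sketch. The regular expression is written only against the one line shown above and is an assumption about the message format, not something taken from SGE documentation; `parse_limit_kill` is a hypothetical helper name.

```python
import re

# Sample line copied from the messages output above, on a single line.
LINE = ('10/23/2013 16:29:29|  main|pc15370|W|job 10092 exceeds job hard limit '
        '"h_vmem" of queue "all.q@pc15370" '
        '(6164480.00000 > limit:4194304.00000) - sending SIGKILL')

# Pattern derived from this one sample only (assumption).
PATTERN = re.compile(
    r'job (?P<job>\d+) exceeds job hard limit "(?P<resource>[^"]+)" '
    r'of queue "(?P<queue>[^"]+)" '
    r'\((?P<usage>[\d.]+) > limit:(?P<limit>[\d.]+)\)'
)


def parse_limit_kill(line):
    """Return the fields of a limit-exceeded warning, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "job": int(m.group("job")),
        "resource": m.group("resource"),
        "queue": m.group("queue"),
        "usage": float(m.group("usage")),   # bytes, as logged
        "limit": float(m.group("limit")),   # bytes, as logged
    }


info = parse_limit_kill(LINE)
print(info["job"], info["resource"], info["usage"], info["limit"])
```

Lines that do not match (for example the SIGNAL line that follows) simply yield `None`.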


But the other job, 10091, survived and ended properly. And why is the limit 
4194304 and not 2M, not to mention the ulimit:

reuti@pc15370:~> grep virtual test.sh.o10091 test.sh.o10092
test.sh.o10091:virtual memory          (kbytes, -v) 10240
test.sh.o10092:virtual memory          (kbytes, -v) 10240

(If you don't request h_vmem, the limit is set from the default, but 
multiplied (!) by the requested slot count, despite the "JOB" setting.)
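
The numbers above can be checked with a little arithmetic. This Python sketch assumes the multiplier letters described in sge_types(1), where uppercase letters are binary and lowercase ones decimal, and that `ulimit -v` reports units of 1024 bytes. `to_bytes` is a hypothetical helper and only handles plain decimal values.

```python
# Multiplier letters as described in sge_types(1) (assumed here):
# lowercase is decimal, uppercase is binary.
MULTIPLIERS = {
    "k": 1000, "K": 1024,
    "m": 1000 ** 2, "M": 1024 ** 2,
    "g": 1000 ** 3, "G": 1024 ** 3,
}


def to_bytes(spec):
    """Convert a value such as '2M' or '4194304' to bytes."""
    if spec[-1] in MULTIPLIERS:
        return int(spec[:-1]) * MULTIPLIERS[spec[-1]]
    return int(spec)


requested = to_bytes("2M")      # -l h_vmem=2M
slots = 2                       # -pe openmpi 2
logged_limit = 4194304          # limit reported in the messages file
ulimit_v = 10240 * 1024         # `ulimit -v` value from the job output

print(requested)                            # 2097152
print(logged_limit == requested * slots)    # True
print(ulimit_v // MULTIPLIERS["M"])         # 10
```

So the logged limit is exactly the request times the two slots, as a per-slot consumable would give, while the ulimit of 10M matches neither 2M nor 4M.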

-- Reuti


>>> # qstat -j 2098938|grep vmem
>>> hard resource_list:         virtual_free=64G,h_vmem=64G,h_rt=172800
>>> usage    1:                 cpu=18:26:24, mem=111455.48587 GBs, io=1735.61545, vmem=196.038G, maxvmem=197.132G
>> 
>> Can you please `grep` the messages file on the executing node for
>> other entries of job "2098938"?
> 
> # ls
> active_jobs  job_scripts           messages-20130630.gz  messages-20130721.gz  messages-20130811.gz  messages-20130901.gz  messages-20130922.gz  messages-20131013.gz
> execd.pid    messages              messages-20130707.gz  messages-20130728.gz  messages-20130818.gz  messages-20130908.gz  messages-20130929.gz  messages-20131020.gz
> jobs         messages-20130623.gz  messages-20130714.gz  messages-20130804.gz  messages-20130825.gz  messages-20130915.gz  messages-20131006.gz
> # zgrep 2098938 messages*
> # 
> 
> there are no entries for that job....
> 
>> -- Reuti
> Thanks,
> Arnau
> 


