Hi,

Am 30.07.2014 um 03:29 schrieb Derrick Lin:

> *No* initial value per queue instance. I force the users to specify both
> h_vmem and mem_requested by defining default values inside the sge_default file.
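> (For illustration, a default request file in SGE simply holds qsub options,
> one per line; the 4G figures below are placeholders, not our real values:
> 
>     -l h_vmem=4G
>     -l mem_requested=4G
> 
> Users can still override these on the command line.)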
> 
> No h_vmem on the exechost level either, because we want to use mem_requested
> instead; it's already set up across all exechosts.
> 
> My original issue was that with MONITOR=1 set in params, jobs failed to start.
> 
> Now that I have removed MONITOR=1, all jobs start and run fine. Any idea?

They still shouldn't start. As you defined "h_vmem" as being consumable, the
question is: consume from what?

Nevertheless, you can set an arbitrarily high value on the global exechost
(`qconf -me global`) under "complex_values".
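A minimal sketch (the 2000G is an arbitrary placeholder; pick a figure
comfortably above the memory of your largest host, or effectively unlimited):

    $ qconf -me global
    ...
    complex_values        h_vmem=2000G

With that pool defined, the scheduler has something to decrement each job's
h_vmem request from.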

-- Reuti


> D
> 
> 
> On Tue, Jul 29, 2014 at 7:43 PM, Reuti <[email protected]> wrote:
> Hi,
> 
> Am 29.07.2014 um 06:07 schrieb Derrick Lin:
> 
> > This is the qhost output of one of our compute nodes:
> >
> > pwbcad@gamma01:~$ qhost -F -h omega-0-9
> > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > -------------------------------------------------------------------------------
> > global                  -               -     -       -       -       -       -
> > omega-0-9               lx26-amd64     64 12.34  504.9G  273.6G  256.0G   14.6G
> >    hl:arch=lx26-amd64
> >    hl:num_proc=64.000000
> >    hl:mem_total=504.890G
> >    hl:swap_total=256.000G
> >    hl:virtual_total=760.890G
> >    hl:load_avg=12.340000
> >    hl:load_short=9.720000
> >    hl:load_medium=12.340000
> >    hl:load_long=18.900000
> >    hl:mem_free=231.308G
> >    hl:swap_free=241.356G
> >    hl:virtual_free=472.663G
> >    hl:mem_used=273.582G
> >    hl:swap_used=14.644G
> >    hl:virtual_used=288.226G
> >    hl:cpu=15.400000
> >    hl:m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
> >    hl:m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
> >    hl:m_socket=4.000000
> >    hl:m_core=32.000000
> >    hl:np_load_avg=0.192812
> >    hl:np_load_short=0.151875
> >    hl:np_load_medium=0.192812
> >    hl:np_load_long=0.295312
> >    hc:mem_requested=502.890G
> 
> So, there is no h_vmem on the exechost level.
> 
> 
> > We do not set h_vmem at the queue instance level; that's intended, because
> > we just need h_vmem in a per-user quota like:
> 
> Is that a typo, and you mean the exechost level?
> 
> 
> > {
> >         name    default_per_user
> >         enabled true
> >         description     "Each user is entitled to resources equivalent to two nodes"
> >         limit   users {*} queues {all.q} to slots=16,h_vmem=16G
> > }
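> > (Consumption against such a rule can be inspected with qquota, e.g.
> > `qquota -u '*'` for all users.)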
> 
> RQS limits are not enforced on the job itself. The user then has to request
> h_vmem by hand with the -l option to `qsub`.
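> For instance (the 1G is only an example size):
> 
>     qsub -l h_vmem=1G job.sh
> 
> or such a -l line goes into a default request file so that it is attached to
> every submission automatically.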
> 
> Is "h_vmem" then in "complex_values" in the queue definition with an initial 
> value per queue instance?
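> A quick way to check (substitute your queue name as needed):
> 
>     qconf -sq intel.q | grep complex_values
> 
> If h_vmem appears there with a value, each queue instance provides that pool.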
> 
> -- Reuti
> 
> 
> > At the queue instance level, we use mem_requested as a "per host quota"
> > instead. It's a custom complex attribute we set up for our specific applications.
> >
> > Cheers,
> > D
> >
> >
> > On Tue, Jul 29, 2014 at 1:02 AM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > Am 04.07.2014 um 06:04 schrieb Derrick Lin:
> >
> > > Interestingly, I have a small test cluster with basically the same SGE
> > > setup that does *not* have this problem. h_vmem in the complex is exactly
> > > the same, and the test queue instance looks almost the same (except the
> > > CPU layout etc.):
> > >
> > >  qstat -F -q all.q@eva00
> > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > ---------------------------------------------------------------------------------
> > > [email protected]              BP    0/0/8          0.00     lx26-amd64
> > >        ...
> > >         hc:mem_requested=7.814G
> > >         qf:qname=all.q
> > >         qf:hostname=eva00.local
> > >         qc:slots=8
> > >         qf:tmpdir=/tmp
> > >         qf:seq_no=0
> > >         qf:rerun=0.000000
> > >         qf:calendar=NONE
> > >         qf:s_rt=infinity
> > >         qf:h_rt=infinity
> > >         qf:s_cpu=infinity
> > >         qf:h_cpu=infinity
> > >         qf:s_fsize=infinity
> > >         qf:h_fsize=infinity
> > >         qf:s_data=infinity
> > >         qf:h_data=infinity
> > >         qf:s_stack=infinity
> > >         qf:h_stack=infinity
> > >         qf:s_core=infinity
> > >         qf:h_core=infinity
> > >         qf:s_rss=infinity
> > >         qf:h_rss=infinity
> > >         qf:s_vmem=infinity
> > >         qf:h_vmem=infinity
> > >         qf:min_cpu_interval=00:05:00
> > >
> > > Neither cluster has h_vmem defined at the exechost level.
> >
> > What is the output of:
> >
> > `qhost -F`
> >
> > Below you write that it's also defined at the queue instance level, hence
> > in both places (as "complex_values")?
> >
> > -- Reuti
> >
> >
> > > Derrick
> > >
> > >
> > > On Fri, Jul 4, 2014 at 1:58 PM, Derrick Lin <[email protected]> wrote:
> > > Hi all,
> > >
> > > We started using h_vmem to control jobs by their memory usage. However,
> > > jobs couldn't start whenever -l h_vmem was requested. The reason given is:
> > >
> > > (-l h_vmem=1G) cannot run in queue "[email protected]" because job 
> > > requests unknown resource (h_vmem)
> > >
> > > However, h_vmem is definitely on the queue instance:
> > >
> > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > ---------------------------------------------------------------------------------
> > > [email protected]        BIP   0/0/64         6.27     lx26-amd64
> > >         ....
> > >         hl:np_load_long=0.091563
> > >         hc:mem_requested=504.903G
> > >         qf:qname=intel.q
> > >         qf:hostname=delta-5-1.local
> > >         qc:slots=64
> > >         qf:tmpdir=/tmp
> > >         qf:seq_no=0
> > >         qf:rerun=0.000000
> > >         qf:calendar=NONE
> > >         qf:s_rt=infinity
> > >         qf:h_rt=infinity
> > >         qf:s_cpu=infinity
> > >         qf:h_cpu=infinity
> > >         qf:s_fsize=infinity
> > >         qf:h_fsize=infinity
> > >         qf:s_data=infinity
> > >         qf:h_data=infinity
> > >         qf:s_stack=infinity
> > >         qf:h_stack=infinity
> > >         qf:s_core=infinity
> > >         qf:h_core=infinity
> > >         qf:s_rss=infinity
> > >         qf:h_rss=infinity
> > >         qf:s_vmem=infinity
> > >         qf:h_vmem=infinity
> > >         qf:min_cpu_interval=00:05:00
> > >
> > > I tried specifying other attributes such as h_rt; those jobs started and
> > > finished successfully.
> > >
> > > qconf -sc
> > >
> > > #name               shortcut   type        relop requestable consumable default  urgency
> > > #----------------------------------------------------------------------------------------
> > > h_vmem              h_vmem     MEMORY      <=    YES         YES        0        0
> > > #
> > > Can anyone shed light on this?
> > >
> > > Cheers,
> > > Derrick
> > >
> >
> >
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
