Hi Reuti,

So I removed the RQS, and jobs still started and ran.

derlin@nerv-geofront:~$ qconf -srqs
No resource quota set found
derlin@nerv-geofront:~$ qconf -sc | grep h_vmem
h_vmem              h_vmem     MEMORY      <=    YES         YES        0        0

derlin@nerv-geofront:~$ qhost -F h_vmem
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
eva00                   lx26-amd64      8  0.00    7.8G  313.2M   32.0G     0.0
eva01                   lx26-amd64      8  0.00    7.8G  335.2M   32.0G     0.0
eva02                   lx26-amd64      8  0.00    7.8G  332.0M   32.0G     0.0
shito00                 lx26-amd64      8  0.01    7.8G  369.3M   32.0G     0.0
derlin@nerv-geofront:~$ qstat -F h_vmem
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
[email protected]              BP    0/0/4          0.00     lx26-amd64
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]              BP    0/0/8          0.00     lx26-amd64
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]              BP    0/0/8          0.00     lx26-amd64
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]            BP    0/0/8          0.01     lx26-amd64    d
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]              IP    0/0/4          0.00     lx26-amd64
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]          BIP   0/0/8          0.01     lx26-amd64
        qf:h_vmem=infinity
---------------------------------------------------------------------------------
[email protected]            BIP   0/0/8          0.01     lx26-amd64
        qf:h_vmem=infinity

h_vmem appears in the qstat -F output but not in qhost -F. Does that make
sense to you? I am not so sure about the difference between these two
commands.
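
For reference, this is how I have been comparing the two views (host and
queue names as above; a sketch only):

derlin@nerv-geofront:~$ qhost -F h_vmem -h eva00        # host-level view only
derlin@nerv-geofront:~$ qstat -F h_vmem -q all.q@eva00  # also shows the queue-level qf:h_vmem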

Cheers,
D


On Fri, Aug 1, 2014 at 7:35 PM, Reuti <[email protected]> wrote:

> On 01.08.2014 at 01:39, Derrick Lin wrote:
>
> > Do you have
> >
> > params                            MONITOR=1??
>
> No:
>
> $ qconf -ssconf
> ...
> params                            none
>
>
> > This is what gave me the same error.
>
> IMO it's not an error. If there is nothing to consume from, then the job
> shouldn't be scheduled.
>
> Is the job also running for you when you:
>
> - remove the RQS
> - set h_vmem to consumable
> - set no initial value for h_vmem anywhere (exechost, queue, or global)?
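>
> (A quick way to double-check all three; host and queue names illustrative:)
>
> qconf -srqs                             # expect: no resource quota set found
> qconf -se global | grep complex_values  # no h_vmem here...
> qconf -se eva00 | grep complex_values   # ...nor here...
> qconf -sq all.q | grep complex_values   # ...nor here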
>
> -- Reuti
>
>
> > I am running GE 6.2u5 as well
> >
> > D
> >
> >
> > On Thu, Jul 31, 2014 at 8:53 PM, Reuti <[email protected]> wrote:
> > On 31.07.2014 at 03:06, Derrick Lin wrote:
> >
> > > Hi Reuti,
> > >
> > > That's interesting, but it works without any hack:
> > >
> > > {
> > >         name    default_per_user
> > >         enabled true
> > >         description     "Each user is entitled to resources equivalent to three nodes"
> > >         limit   users {*} queues {all.q} to slots=192,h_vmem=1536G
> >
> > Not for me in 6.2u5; it shows "because job requests unknown resource
> > (h_vmem)" as expected until I add a decent value to "complex_values"
> > somewhere.
> >
> > -- Reuti
> >
> >
> > > }
> > >
> > > Then it consumes from the user's quota:
> > >
> > > $ qquota -u "*"
> > > resource quota rule limit                filter
> > > --------------------------------------------------------------------------------
> > > default_per_user/1 slots=166/192        users b****** queues all.q
> > > default_per_user/1 h_vmem=400.000G/1536 users b****** queues all.q
> > >
> > > Is it illegal to set h_vmem in a per-user quota in the first place?
> > >
> > > Cheers,
> > > D
> > >
> > >
> > > On Wed, Jul 30, 2014 at 4:37 PM, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > On 30.07.2014 at 03:29, Derrick Lin wrote:
> > >
> > > > **No** initial value per queue instance; I force the users to
> > > > specify both h_vmem and mem_requested by defining default values
> > > > inside the sge_default file.
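> > > >
> > > > (Roughly like this in the cluster-wide default request file, commonly
> > > > $SGE_ROOT/$SGE_CELL/common/sge_request; the values are illustrative:)
> > > >
> > > > -l h_vmem=2G -l mem_requested=2G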
> > > >
> > > > No h_vmem at the exechost level either, because we want to use
> > > > mem_requested instead, since it's already set up across all exechosts.
> > > >
> > > > My original issue was that with params MONITOR=1 set, jobs failed to
> > > > start.
> > > >
> > > > Now that I have MONITOR=1 removed, all jobs start and run fine. Any idea?
> > >
> > > They still shouldn't start. As you defined "h_vmem" as consumable,
> > > the question is: consume from what?
> > >
> > > Nevertheless, you can set an arbitrarily high value in the global
> > > exechost (`qconf -me global`) under "complex_values".
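> > >
> > > For example (the value is arbitrary, just large enough never to bind):
> > >
> > > $ qconf -me global
> > > ...
> > > complex_values        h_vmem=9999G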
> > >
> > > -- Reuti
> > >
> > >
> > > > D
> > > >
> > > >
> > > > On Tue, Jul 29, 2014 at 7:43 PM, Reuti <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > On 29.07.2014 at 06:07, Derrick Lin wrote:
> > > >
> > > > > This is qhost of one of our compute nodes:
> > > > >
> > > > > pwbcad@gamma01:~$ qhost -F -h omega-0-9
> > > > > HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> > > > > -------------------------------------------------------------------------------
> > > > > global                  -               -     -       -       -       -       -
> > > > > omega-0-9               lx26-amd64     64 12.34  504.9G  273.6G  256.0G   14.6G
> > > > >    hl:arch=lx26-amd64
> > > > >    hl:num_proc=64.000000
> > > > >    hl:mem_total=504.890G
> > > > >    hl:swap_total=256.000G
> > > > >    hl:virtual_total=760.890G
> > > > >    hl:load_avg=12.340000
> > > > >    hl:load_short=9.720000
> > > > >    hl:load_medium=12.340000
> > > > >    hl:load_long=18.900000
> > > > >    hl:mem_free=231.308G
> > > > >    hl:swap_free=241.356G
> > > > >    hl:virtual_free=472.663G
> > > > >    hl:mem_used=273.582G
> > > > >    hl:swap_used=14.644G
> > > > >    hl:virtual_used=288.226G
> > > > >    hl:cpu=15.400000
> > > > >
> > > > >    hl:m_topology=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
> > > > >    hl:m_topology_inuse=SCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTTSCTTCTTCTTCTTCTTCTTCTTCTT
> > > > >    hl:m_socket=4.000000
> > > > >    hl:m_core=32.000000
> > > > >    hl:np_load_avg=0.192812
> > > > >    hl:np_load_short=0.151875
> > > > >    hl:np_load_medium=0.192812
> > > > >    hl:np_load_long=0.295312
> > > > >    hc:mem_requested=502.890G
> > > >
> > > > So, there is no h_vmem at the exechost level.
> > > >
> > > >
> > > > > We do not set h_vmem at the queue instance level; that's intended,
> > > > > because we just need h_vmem in a per-user quota like:
> > > >
> > > > A typo, and you mean the exechost level?
> > > >
> > > >
> > > > > {
> > > > >         name    default_per_user
> > > > >         enabled true
> > > > >         description     "Each user is entitled to resources equivalent to two nodes"
> > > > >         limit   users {*} queues {all.q} to slots=16,h_vmem=16G
> > > > > }
> > > >
> > > > RQS limits are not enforced. The user then has to specify it by hand
> > > > with the -l option to `qsub`.
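> > > >
> > > > For example (script name and value illustrative):
> > > >
> > > > $ qsub -l h_vmem=4G job.sh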
> > > >
> > > > Is "h_vmem" then in "complex_values" in the queue definition, with an
> > > > initial value per queue instance?
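> > > >
> > > > That is, something like this in `qconf -mq all.q` (value illustrative):
> > > >
> > > > complex_values        h_vmem=7.8G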
> > > >
> > > > -- Reuti
> > > >
> > > >
> > > > > At the queue instance level, we use mem_requested as a "per-host
> > > > > quota" instead. It's a custom complex attribute we set up for our
> > > > > specific applications.
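> > > > >
> > > > > (It is defined roughly like this via `qconf -mc`, with the per-host
> > > > > capacity then set in each exechost's complex_values; the shortcut is
> > > > > illustrative:)
> > > > >
> > > > > mem_requested       mem_req    MEMORY      <=    YES         YES        0        0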
> > > > >
> > > > > Cheers,
> > > > > D
> > > > >
> > > > >
> > > > > On Tue, Jul 29, 2014 at 1:02 AM, Reuti <[email protected]> wrote:
> > > > > Hi,
> > > > >
> > > > > On 04.07.2014 at 06:04, Derrick Lin wrote:
> > > > >
> > > > > > Interestingly, I have a small test cluster with basically the
> > > > > > same SGE setup that does *not* have this problem. h_vmem in the
> > > > > > complex configuration is exactly the same, and the test queue
> > > > > > instance looks almost the same (except for the CPU layout etc.).
> > > > > >
> > > > > >  qstat -F -q all.q@eva00
> > > > > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > > > > ---------------------------------------------------------------------------------
> > > > > > [email protected]              BP    0/0/8          0.00
> lx26-amd64
> > > > > >        ...
> > > > > >         hc:mem_requested=7.814G
> > > > > >         qf:qname=all.q
> > > > > >         qf:hostname=eva00.local
> > > > > >         qc:slots=8
> > > > > >         qf:tmpdir=/tmp
> > > > > >         qf:seq_no=0
> > > > > >         qf:rerun=0.000000
> > > > > >         qf:calendar=NONE
> > > > > >         qf:s_rt=infinity
> > > > > >         qf:h_rt=infinity
> > > > > >         qf:s_cpu=infinity
> > > > > >         qf:h_cpu=infinity
> > > > > >         qf:s_fsize=infinity
> > > > > >         qf:h_fsize=infinity
> > > > > >         qf:s_data=infinity
> > > > > >         qf:h_data=infinity
> > > > > >         qf:s_stack=infinity
> > > > > >         qf:h_stack=infinity
> > > > > >         qf:s_core=infinity
> > > > > >         qf:h_core=infinity
> > > > > >         qf:s_rss=infinity
> > > > > >         qf:h_rss=infinity
> > > > > >         qf:s_vmem=infinity
> > > > > >         qf:h_vmem=infinity
> > > > > >         qf:min_cpu_interval=00:05:00
> > > > > >
> > > > > > Neither cluster has h_vmem defined at the exechost level.
> > > > >
> > > > > What is the output of:
> > > > >
> > > > > `qhost -F`
> > > > >
> > > > > Below you write that it's also defined at the queue instance level,
> > > > > hence in both places (as "complex_values")?
> > > > >
> > > > > -- Reuti
> > > > >
> > > > >
> > > > > > Derrick
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 4, 2014 at 1:58 PM, Derrick Lin <[email protected]> wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We started using h_vmem to control jobs by their memory usage.
> > > > > > However, jobs couldn't start when -l h_vmem was requested. The
> > > > > > reason is:
> > > > > >
> > > > > > (-l h_vmem=1G) cannot run in queue "[email protected]"
> > > > > > because job requests unknown resource (h_vmem)
> > > > > >
> > > > > > However, h_vmem is definitely on the queue instance:
> > > > > >
> > > > > > queuename                      qtype resv/used/tot. load_avg arch          states
> > > > > > ---------------------------------------------------------------------------------
> > > > > > [email protected]        BIP   0/0/64         6.27     lx26-amd64
> > > > > >         ....
> > > > > >         hl:np_load_long=0.091563
> > > > > >         hc:mem_requested=504.903G
> > > > > >         qf:qname=intel.q
> > > > > >         qf:hostname=delta-5-1.local
> > > > > >         qc:slots=64
> > > > > >         qf:tmpdir=/tmp
> > > > > >         qf:seq_no=0
> > > > > >         qf:rerun=0.000000
> > > > > >         qf:calendar=NONE
> > > > > >         qf:s_rt=infinity
> > > > > >         qf:h_rt=infinity
> > > > > >         qf:s_cpu=infinity
> > > > > >         qf:h_cpu=infinity
> > > > > >         qf:s_fsize=infinity
> > > > > >         qf:h_fsize=infinity
> > > > > >         qf:s_data=infinity
> > > > > >         qf:h_data=infinity
> > > > > >         qf:s_stack=infinity
> > > > > >         qf:h_stack=infinity
> > > > > >         qf:s_core=infinity
> > > > > >         qf:h_core=infinity
> > > > > >         qf:s_rss=infinity
> > > > > >         qf:h_rss=infinity
> > > > > >         qf:s_vmem=infinity
> > > > > >         qf:h_vmem=infinity
> > > > > >         qf:min_cpu_interval=00:05:00
> > > > > >
> > > > > > I tried specifying other attributes such as h_rt; those jobs
> > > > > > started and finished successfully.
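> > > > > >
> > > > > > (The test was along these lines; the script name is illustrative:)
> > > > > >
> > > > > > $ qsub -l h_rt=00:10:00 test.sh   # starts and finishes fine
> > > > > > $ qsub -l h_vmem=1G test.sh       # stays pending with the error above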
> > > > > >
> > > > > > qconf -sc
> > > > > >
> > > > > > #name               shortcut   type        relop requestable consumable default  urgency
> > > > > > #----------------------------------------------------------------------------------------
> > > > > > h_vmem              h_vmem     MEMORY      <=    YES         YES        0        0
> > > > > > #
> > > > > >
> > > > > > Can anyone shed light on this?
> > > > > >
> > > > > > Cheers,
> > > > > > Derrick
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
