Hi Chris,

On 08.03.2016 at 18:29, Christopher Black wrote:
> Thanks for the reply Reuti!
>
> Sounds like some of the suggestions are moving limits out of RQS and into
> complexes and consumable resources.

Yep.

> How do we make that happen without requiring users to add -l bits to
> their qsubs?

You can use a JSV (job submission verifier): each time a queue is requested,
the JSV requests a consumable complex for this type of queue in addition.
It could also overwrite any value the user specified for this consumable
complex (otherwise users could request zero of them - AFAIK only UGE also
introduced a lower limit for the consumption which could be requested). A
rough sketch of such a JSV is at the end of this message. Do your users need
the ability to specify more than one queue per submission?

> On 3/8/16, 7:32 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>
>> I saw cases where an RQS blocks further scheduling and shows up in
>> `qstat -j` with a cryptic message. Although this was in 6.2u5, I don't
>> know whether there was any work in this area to fix it.
>>
>> Often you can spot in the scheduling output that an RQS was violated,
>> although the rule is in fact not violated. For me it kicked in when I
>> requested a complex with a load value in the submission command.
>>
>> cannot run because it exceeds limit "////node20/" in rule "general/slots"
>
> I've also seen cryptic qstat -j messages about slots not available that
> ended up requiring changing values in a pe, but that was settled a while
> ago.
>
> Within queue definitions we have:
> slots    1,[@16core=16],[@20core=20],[@28core=28]
>
> Those are per-node limits; unsure how to change this to total slots for a
> queue across all eligible nodes.

Not at all. These can stay as they are. Only the overall usage per queue
(which is in the RQS) would be rephrased. There is no need to change
anything on the queue-instance level.

> We have 20+ queues and use RQS, host groups and disabling/enabling queue
> instances to manage balancing nodes and load between queues.
>
> When I turn schedd_job_info back on and look at qstat -j, I sometimes (but
> not often) see those "exceeds limit" entries, but before that there are
> MANY entries like:
>
> queue instance "de...@pnode073.nygenome.org" dropped because it is
> temporarily not available
> queue instance "cus...@pnode077.nygenome.org" dropped because it is
> disabled
>
> And these are for queues other than the one specified in -q
> hard_queue_list. I am wondering if qmaster is giving up checking eligible
> matching queue instances after checking all of these disabled instances
> for other queues. Perhaps utilizing hostgroups in queue definitions would
> be more efficient than disabling queue instances.
>
>> AFAICS:
>
>>> Some config snippets showing non-default and potentially-relevant
>>> values, I can put full output to a pastebin if it is useful:
>>>
>>> qconf -srqs:
>>> {
>>>    name         slots_per_host
>>>    description  Limit slots per host
>>>    enabled      TRUE
>>>    limit        hosts {@16core} to slots=16
>>>    limit        hosts {@20core} to slots=20
>>>    limit        hosts {@28core} to slots=28
>>>    limit        hosts {!@physicalNodes} to slots=2
>>> }
>>
>> The above RQS could be put into individual complex_values per exechost.
>> Yes - the above is handy, I know.
>
> Is the idea to define a new complex (qconf -mc) and then set it per
> exechost (qconf -me), or re-use "slots" somehow?

It would mean one complex per type of queue: dev_slots, io_slots,
pipeline_slots.
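Roughly - untested, and the shortcuts below are only examples; the 250/300/4000
limits mirror the io/dev/pipeline RQS quoted further down. Check the exact
qconf column layout and options against your Grid Engine version:

# qconf -mc: one consumable per type of queue
#name            shortcut  type  relop  requestable  consumable  default  urgency
dev_slots        devs      INT   <=     YES          YES         0        0
io_slots         ios       INT   <=     YES          YES         0        0
pipeline_slots   pls       INT   <=     YES          YES         0        0

# overall capacity per queue goes onto the global host
# (use -aattr instead of -mattr if the entries don't exist there yet)
qconf -mattr exechost complex_values \
    dev_slots=250,io_slots=300,pipeline_slots=4000 global

# the per-host ceiling from the slots_per_host RQS becomes a one-time
# complex_values setting per exechost, e.g. for the @16core hosts:
for h in $(qconf -shgrp_resolved @16core); do
   qconf -mattr exechost complex_values slots=16 "$h"
done

As the complexes are consumable YES, a request of 1 is counted per slot, so a
parallel job debits the complex by its slot count - the same way the RQS
counts slots today.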
> Current qconf -mc |grep slot:
> slots    s    INT    <=    YES    YES    1    1000
>
> Putting slots per exechost wouldn't be too awful to script (qconf -mattr)
> but we'd need it to work without asking users to change qsub commands.
> Would this be more efficient than having it in the RQS stanza above?

Usually the number of slots across all queues per machine is fixed, so it
should be a one-time setting for the above-mentioned 16, 20, 28 and 2. Only
the values on the global host, which reflect the allowed usage per queue,
will change over time. This can also be done with `qconf -mattr exechost
complex_values limit_dev=99 global`.

>>> {
>>>    name         io
>>>    description  Limit max concurrent io.q slots
>>>    enabled      TRUE
>>>    limit        queues io.q to slots=300
>>> }
>>> {
>>>    name         dev
>>>    description  Limit max concurrent dev.q slots
>>>    enabled      TRUE
>>>    limit        queues dev.q to slots=250
>>> }
>>> {
>>>    name         pipeline
>>>    description  Limit max concurrent pipeline.q slots
>>>    enabled      TRUE
>>>    limit        queues pipeline.q to slots=4000
>>> }
>>> ...other queues...
>>
>> Here one could use a global complex for each type of queue, as long as
>> the users specify the particular queue. One loses the ability that a job
>> may potentially be scheduled to different types of queues, as long as its
>> resource requests are met.
>
> Sounds like we could switch from RQS for queue core limits to complex per
> queue, but I'm not sure how to make all slots for qsub -q foo.q
> automatically debit against a consumable resource like foo_slots. What
> would we need to do beyond defining the global complex, setting value and
> marking it consumable?

Yes.

--
Reuti

> This would be a big change to the way we currently limit load across
> queues and nodes but we are willing to try to get past our issues.
>
>> I can't predict whether this would improve anything in the situation you
>> face.
>
> Understood!
>
> Thanks!
> Chris
>
>>> qconf -sc|grep mem (note default mem per job is 8GB and this is
>>> consumable):
>>> h_vmem    mem    MEMORY    <=    YES    JOB    8G    0
>>>
>>> A typical exechost qconf -se:
>>> complex_values    h_vmem=240G,exclusive=true
>>>
>>> qconf -sconf:
>>> shell_start_mode             unix_behavior
>>> reporting_params             accounting=true reporting=false \
>>>                              flush_time=00:00:15 joblog=true sharelog=00:00:00
>>> finished_jobs                100
>>> gid_range                    20000-20100
>>> max_aj_instances             3000
>>> max_aj_tasks                 75000
>>> max_u_jobs                   0
>>> max_jobs                     0
>>> max_advance_reservations     50
>>>
>>> qconf -msconf:
>>> schedule_interval                 0:0:45
>>> maxujobs                          0
>>> queue_sort_method                 load
>>> schedd_job_info                   false   (this used to be true, as
>>>                                   qstat -j on a stuck job can be useful)
>>> params                            monitor=false
>>> max_functional_jobs_to_schedule   1000
>>> max_pending_tasks_per_job         50
>>> max_reservation                   0   (used to be 50 to allow large jobs
>>>                                   with -R y to have a better chance to run)
>>> default_duration                  4320:0:0
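To pick up the JSV idea from the top of this message: a minimal, untested
sketch of a server-side shell JSV that adds the matching consumable whenever
one of these queues is requested. It assumes the dev_slots/io_slots/
pipeline_slots consumables from the earlier sketch exist and uses the helper
functions shipped in $SGE_ROOT/util/resources/jsv/jsv_include.sh; the
parameter and function names (q_hard, jsv_sub_add_param, ...) are from the
JSV shell interface, so double-check them against your version:

#!/bin/sh
# minimal JSV sketch - untested
. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh

jsv_on_start()
{
   # nothing needed at start, we only look at submission parameters
   return
}

jsv_on_verify()
{
   changed=0
   # hard queue request as given with "qsub -q ..." (may be empty; a job
   # requesting several queues would need more handling here)
   q=`jsv_get_param q_hard`

   case "$q" in
      *dev.q*)      jsv_sub_add_param l_hard dev_slots 1;      changed=1 ;;
      *io.q*)       jsv_sub_add_param l_hard io_slots 1;       changed=1 ;;
      *pipeline.q*) jsv_sub_add_param l_hard pipeline_slots 1; changed=1 ;;
   esac

   if [ $changed -eq 1 ]; then
      jsv_correct "consumable for the requested queue was added"
   else
      jsv_accept "no queue-specific consumable needed"
   fi
   return
}

jsv_main

Because the JSV sets the l_hard entry itself, it also overrides users who try
to request 0 of the consumable on their own. Attach it via jsv_url in the
cluster configuration (qconf -mconf) so it runs on the qmaster for every
submission.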
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users