Hi,
On 12.01.2012 at 22:07, Brendan Moloney wrote:
> Hello,
>
>>> {
>>> name shortlimit
>>> description NONE
>>> enabled TRUE
>>> limit queues short.q hosts * to slots=32
>
>> I think you can leave the "hosts *" out here and the other RQS below. It
>> means "used slots across all machines" limited to 32 in this queue. The same
>> can be achieved by specifying only the queue.
>
> Yes, I ended up making some things overly explicit while trying to debug the
> issue.
>
>>> }
>>> {
>>> name longlimit
>>> description NONE
>>> enabled TRUE
>>> limit queues long.q hosts * to slots=16
>>> }
>>> {
>>> name verylonglimit
>>> description NONE
>>> enabled TRUE
>>> limit queues verylong.q hosts * to slots=4
>>> }
>>> {
>>> name urgentlimit
>>> description NONE
>>> enabled TRUE
>>> limit users {*} queues urgent.q hosts * to slots=1
>>> }
>>> {
>>> name debuglimit
>>> description NONE
>>> enabled TRUE
>>> limit users {*} queues debug.q hosts {*} to slots=1
>>> }
>
>> As the above 5 limits are disjoint, they can also be put in one and the same
>> RQS. You can give each rule a name so that it is listed by name instead of by
>> rule number, which is always 1 right now.
>
> I originally had these as one RQS, but again tried to make things more
> explicit (or at least easier for me to understand) while debugging.
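
For reference, a combined set could look roughly like the sketch below. It
assumes the optional "name" keyword of a rule as described in
sge_resource_quota(5). The set name "queuelimits" and the rule names are made
up for illustration. The redundant "hosts *" is dropped, while "hosts {*}" is
kept in the debug rule because there it makes the limit apply per host:

```
{
   name         queuelimits
   description  NONE
   enabled      TRUE
   limit        name short    queues short.q to slots=32
   limit        name long     queues long.q to slots=16
   limit        name verylong queues verylong.q to slots=4
   limit        name urgent   users {*} queues urgent.q to slots=1
   limit        name debug    users {*} queues debug.q hosts {*} to slots=1
}
```

Within one set only the first matching rule is applied, which is fine here as
each rule names a different queue. `qquota` should then list the rules as
"queuelimits/short" and so on instead of "shortlimit/1".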
>
>>> This will cause a parallel job across multiple queues to never schedule. If
>>> I get rid of the "nodelimit" and instead set the number of slots using
>>> the complex value in the host configuration, then everything works (except
>>> my debug queue).
>
>> Do you have many machine types? What happens if you don't use $num_proc
>> there but specify a hard-coded limit per hostgroup for each machine type?
>>
>> limit queues !debug.q hosts {@quadcore} to slots=4
>> limit queues !debug.q hosts {@hexacore} to slots=6
>
> I don't have many machine types, in fact I don't have many machines! I tried
> to replace the nodelimit RQS with:
>
> {
> name nodelimit
> description NONE
> enabled TRUE
> limit queues !debug.q hosts {animal.ohsu.edu,kermit.ohsu.edu} to slots=24
> limit queues !debug.q hosts {piggy.ohsu.edu} to slots=8
> }
>
> This gives the same result as the original nodelimit RQS that used $num_proc
> (the job never gets scheduled).
>
>>> Below I give an example of a hanging job (with the scheduler output
>>> enabled).
>>> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
>>> verylong.q. I request 40 slots as that will have to span multiple queues.
>
>> If I get you right, SGE could find different combinations for the slot
>> allocation, depending on the algorithm which is used as all the queues are
>> on the same machines?
>
> All the queues are on the same machines. I am not sure which "algorithm" you
> refer to.

I mean SGE's internal algorithm for collecting slots from various queues.

> As mentioned, the scheduler sorts by sequence number so the queues are
> checked in shortest to longest order.

Not for parallel jobs. Only the allocation_rule is used (except for $pe_slots).

http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least

Does your observation match what is described for parallel jobs at the end of
the page linked above?
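
For context, the allocation_rule is part of the PE definition, not of the
queue. As a sketch, a PE shown with `qconf -sp` might look like this. The PE
name "mpi" and all values are placeholders and not taken from your setup:

```
pe_name            mpi
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

Possible values for allocation_rule are a fixed integer, $pe_slots, $fill_up
or $round_robin.
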
> Thus my job that requests 40 slots with the given h_rt value should take 32
> slots from short.q and 8 slots from long.q (provided nothing else is running
> on the cluster, which is the case for my testing).

Interesting. Collecting slots from different queues has some implications
anyway:

- The name of $TMPDIR depends on the name of the queue, hence it's not the
  same on all nodes.
- `qrsh -inherit ...` can't distinguish between the granted queues:
  https://arc.liv.ac.uk/trac/SGE/ticket/813

-- Reuti