Hello,

>> {
>>   name         shortlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues short.q hosts * to slots=32

> I think you can leave the "hosts *" out here and the other RQS below. It 
> means "used slots across all machines" limited to 32 in this queue. The same 
> can be achieved by specifying only the queue.

Yes, I ended up making some things overly explicit while trying to debug the 
issue.

>> }
>> {
>>   name         longlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues long.q hosts * to slots=16
>> }
>> {
>>   name         verylonglimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        queues verylong.q hosts * to slots=4
>> }
>> {
>>   name         urgentlimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        users {*} queues urgent.q hosts * to slots=1
>> }
>> {
>>   name         debuglimit
>>   description  NONE
>>   enabled      TRUE
>>   limit        users {*} queues debug.q hosts {*} to slots=1
>> }

>As the above 5 limits are disjunct, they can also be put in one and the same 
>RQS. You can give each a name to get it listed instead of the number of the 
>rule, which is always 1 right now.

I originally had these as one RQS, but again tried to make things more explicit 
(or at least easier for me to understand) while debugging.

>> This will cause a parallel job across multiple queues to never schedule. If
>> I get rid of the "nodelimit" and instead set the number of slots using
>> the complex value in the host configuration, then everything works (except
>> my debug queue).

>Do you have many machinetypes? What happens, if you don't use $num_proc there 
>but specify a hard coded limit per hostgroup for a machinetype or so?
>
>limit        queues !debug.q hosts {@quadcore} to slots=4
>limit        queues !debug.q hosts {@hexacore} to slots=6

I don't have many machine types; in fact, I don't have many machines! I tried 
replacing the nodelimit RQS with:

{
   name         nodelimit
   description  NONE
   enabled      TRUE
   limit        queues !debug.q hosts {animal.ohsu.edu,kermit.ohsu.edu} to slots=24
   limit        queues !debug.q hosts {piggy.ohsu.edu} to slots=8
}

This gives the same result as the original nodelimit RQS that used $num_proc 
(the job never gets scheduled).

>> Below I give an example of a hanging job (with the scheduler output enabled).
>> I set h_rt to 3:50:00 as this will allow the queues short.q, long.q, and
>> verylong.q. I request 40 slots as that will have to span multiple queues.

>If I get you right, SGE could find different combinations for the slot 
>allocation, depending on the algorithm which is used as all the queues are on 
>the same machines?

All the queues are on the same machines. I am not sure which "algorithm" you 
are referring to. As mentioned, the scheduler sorts by sequence number, so the 
queues are checked in shortest-to-longest order. Thus my job, which requests 40 
slots with the given h_rt value, should take 32 slots from short.q and 8 slots 
from long.q (provided nothing else is running on the cluster, which is the case 
in my testing).
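
To spell out the arithmetic I expect, here is a minimal Python sketch. It only 
illustrates greedy filling in sequence-number order using the per-queue slot 
limits quoted above; it is not how the SGE scheduler is actually implemented, 
and the function name expected_allocation is made up for this example.

```python
# Illustration only: greedy slot filling in queue sequence order.
# This is NOT the SGE scheduler's real algorithm, just the arithmetic I expect.

def expected_allocation(requested, queue_limits):
    """Fill queues in the given order until the request is satisfied.

    queue_limits: list of (queue_name, slot_limit) pairs, already sorted
    by sequence number.
    Returns a dict of queue_name -> slots taken, or None if the listed
    queues cannot hold the whole request.
    """
    allocation = {}
    remaining = requested
    for name, limit in queue_limits:
        if remaining == 0:
            break
        take = min(limit, remaining)
        allocation[name] = take
        remaining -= take
    return allocation if remaining == 0 else None


# Slot limits from the RQS quoted above; with h_rt=3:50:00 the job is
# eligible for short.q, long.q, and verylong.q.
queues = [("short.q", 32), ("long.q", 16), ("verylong.q", 4)]

print(expected_allocation(40, queues))  # {'short.q': 32, 'long.q': 8}
```

Under these assumptions the 40-slot request fits within the queue limits, so 
the queue-level RQS alone should not prevent the job from being scheduled.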

Thanks,
Brendan
