On 12.01.2012 at 23:52, Brendan Moloney wrote:

>>> All the queues are on the same machines. I am not sure which "algorithm" 
>>> you refer to.
>> 
>> I refer to SGE's internal algorithm for collecting slots from various 
>> queues.
>> 
>>> As mentioned, the scheduler sorts by sequence number so the queues are 
>>> checked in shortest to longest order.
>> 
>> Not for parallel jobs. Only the allocation_rule is used (except for 
>> $pe_slots).
>> 
>> http://blogs.oracle.com/sgrell/entry/grid_engine_scheduler_hacks_least
>> 
>> Does your observation match the aspects of parallel jobs described at the end 
>> of the above link?
> 
> There is definitely still some interaction between the scheduler 
> configuration and the pe allocation rule. The allocation rule for the "mpi" 
> pe is $round_robin. When I run this example successfully (with the per-node 
> slot limits set through complex values), the grid engine does round-robin 
> allocation in short.q (animal and kermit get 12 slots, piggy gets 8) followed 
> by round-robin allocation in long.q (animal and kermit get 4 slots).
> 
>> Interesting. Collecting slots from different queues has some implications 
>> anyway:
>> 
>> - the name of the $TMPDIR depends on the name of the queue, hence it's not 
>> the same on all nodes
> 
> This should not be an issue for correctly written software, right?

This depends on what you define as "correctly":

Case 1: You have no queuing system; users are asked to create something like 
/scratch/reuti/foobar17 by hand on all nodes for a particular job. You set this 
value as an argument to `mpiexec` and are quite happy that the application 
forwards it internally to all nodes. Changing ~/.profile to set it at ssh login 
would mean changing it for each `mpiexec` run. Even if only /scratch/reuti has 
to be created as a one-time setup, it's the same on all nodes. No need to set 
any variable.

Case 2: You have a queuing system and want to use $TMPDIR - it must be the one 
on the node, not the one forwarded from the master node of the parallel job as 
in case 1. It depends on whether the software honors something like $TMP or 
$TMPDIR, or behaves as in case 1.

Case 3: The software just uses $PWD for its scratch data. Hence you do a 
`cd $TMPDIR` on the master node, and this path will also be used on all slave 
nodes. If the directory isn't there, you are out of luck, or you fall back to 
/tmp (or your home directory) and lose SGE's handling of $TMPDIR.
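
Just as an illustration of case 3, a minimal job script might look like this 
(the solver path and the slot count are made-up placeholders, not anything 
from this thread):

    #!/bin/sh
    #$ -pe mpi 16
    # Change into the node-local scratch directory SGE created on the master
    # node; an application that writes its scratch data to $PWD will then try
    # to use the very same path on every slave node.
    cd $TMPDIR
    mpiexec -np $NSLOTS /path/to/solver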

In fact, this was tricky with some applications under Codine 5.3 - no cluster 
queues yet, and although $TMPDIR was created on the slave nodes, it had a 
different name on each of them, since every queue had a unique name like 
node01.long.q or node02.long.q (with only one host per queue)... IIRC I made a 
loop across the involved nodes to create a symbolic link with a name of my 
choosing pointing to Codine's created $TMPDIR. Oh dear, long ago...
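
Roughly this kind of workaround (not the original script; the link name is made 
up, and relying on `qrsh -inherit` so that the node-local $TMPDIR is defined on 
the remote side is an assumption):

    # Give every node of the parallel job a uniformly named symlink that
    # points to whatever job-private $TMPDIR the local execd created.
    LINK=/scratch/$USER/job_$JOB_ID
    for node in $(cut -d" " -f1 $PE_HOSTFILE | sort -u); do
        # the remote task started under the execd has the node-local $TMPDIR set
        qrsh -inherit $node 'ln -s "$TMPDIR" '"$LINK"
    done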


>> - `qrsh -inherit ...` can't distinguish between the granted queues:
>> https://arc.liv.ac.uk/trac/SGE/ticket/813
> 
> I don't think this will affect us. We only run MPI programs with a tightly 
> integrated MPICH2 or SMP programs with the allocation rule set to $pe_slots.
> 
> So is it safe to say that I have found a bug?

I think so. The limit in the RQS should be handled as you expect, especially 
since it works, as you note, when the individual slot counts are set in the 
exechost definitions instead.
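
For reference, a per-host slot limit of this kind could look roughly like the 
following (hypothetical names and values, not the exact configuration from 
this thread):

    # Resource quota set (qconf -arqs / -mrqs) limiting slots per host:
    {
       name         max_slots_per_host
       description  "total slots across all queues on each host"
       enabled      TRUE
       limit        hosts {*} to slots=16
    }

    # The workaround that behaves as expected: pin the slot count per exec
    # host instead (qconf -me <hostname>):
    #   complex_values   slots=16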


> It seems like my original RQS should work.

Yes.


> Or at least doing qsub with '-w e' should fail immediately instead of 
> allowing the job to wait in 'qw' state forever.

This would correspond to "no suitable queue", but here the scheduler first 
finds a possible assignment and only fails to collect the slots later on.
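
For completeness, the verification mode in question is requested at submission 
time like this (the slot count and script name are just placeholders):

    # -w e: reject the job immediately if no suitable assignment can be found
    qsub -w e -pe mpi 32 myjob.sh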

-- Reuti