Hi,

On 13.09.2013 at 03:06, Brendan Moloney wrote:

> I use multiple queues to divide up available resources based on job run times.

So you are requesting "-l h_rt=..."?


> Large parallel jobs will typically span multiple queues and this has 
> generally been working fine thus far.  However I recently increased the 
> number of queues (from 4 to 9) so that the time limits can be more fine 
> grained. After this change I noticed that large parallel jobs will 
> consistently fail if more than 3-4 queues are being used on each host.  The 
> failed jobs will generate the following messages:
> 
> Execution daemon on host <hostname> didn't accept task
> 
> I see this problem using both "builtin" and SSH for job startup.

Yes, the problem is that you can't address a specific queue in `qrsh -inherit 
...`. If the job was granted slots in several queues on a machine, the slots 
of the queue that is selected first for the `qrsh -inherit ...` may already 
be used up.

https://arc.liv.ac.uk/trac/SGE/ticket/813
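
To illustrate the effect, here is a toy model in Python. It is only a sketch 
of the explanation above, not how the execution daemon is actually 
implemented. The helper `start_tasks`, the queue names and the slot counts 
are made up for the example, and it assumes that every `qrsh -inherit` task 
is charged to the queue that was selected first on that host.

```python
# Toy model only: this is NOT how sge_execd is implemented.
# Queue names and slot counts below are hypothetical.

def start_tasks(granted, n_tasks):
    """Try to start n_tasks `qrsh -inherit` tasks on one host.

    granted maps queue name -> slots granted to the job on this host,
    in the order the queues were selected. Assumption: every task is
    charged to the first-selected queue, because the caller cannot
    name a queue. Returns (accepted, rejected).
    """
    if not granted:
        return 0, n_tasks
    first_queue = next(iter(granted))  # dicts keep insertion order
    accepted = min(n_tasks, granted[first_queue])
    return accepted, n_tasks - accepted


if __name__ == "__main__":
    # Eight slots on one host, spread over three queues.
    spread = {"short.q": 2, "medium.q": 3, "long.q": 3}
    print(start_tasks(spread, sum(spread.values())))

    # The same eight slots, all granted in a single queue.
    single = {"short.q": 8}
    print(start_tasks(single, 8))
```

Under this assumption the job that is spread over three queues gets only as 
many tasks accepted as the first queue has slots, while the job confined to 
one queue per host starts all of its tasks.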

It should help to define one PE per queue, but then you end up with 9 PEs for 
each PE you have right now.

BUT: What type of parallel applications are you using? With a tight integration 
of MPICH2/3 or Open MPI there is only one `qrsh -inherit ...` call per 
exechost, and all other processes are forks. And since you get "Execution 
daemon on host <hostname> didn't accept task", you do have a tight integration.

-- Reuti


> While the error message is different, I think this may be related to a 
> problem I had previously 
> (http://gridengine.org/pipermail/users/2012-November/005164.html).  In that 
> case I was having problems starting large numbers of small parallel jobs at 
> the same time (which would in turn cause jobs to start on many different 
> queues at the same time).  I am thinking there must be some race condition 
> going on in this specific scenario (many parallel jobs starting at the same 
> time across multiple queues on the same host).
> 
> Thanks,
> Brendan
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

