Hi,

On 13.09.2013 at 03:06, Brendan Moloney wrote:
> I use multiple queues to divide up available resources based on job run times.

So you are requesting "-l h_rt=..."?

> Large parallel jobs will typically span multiple queues and this has
> generally been working fine thus far. However I recently increased the
> number of queues (from 4 to 9) so that the time limits can be more fine
> grained. After this change I noticed that large parallel jobs will
> consistently fail if more than 3-4 queues are being used on each host. The
> failed jobs will generate the following messages:
>
> Execution daemon on host <hostname> didn't accept task
>
> I see this problem using both "builtin" and SSH for job startup.

Yes, the problem is that you can't address a specific queue in `qrsh -inherit ...`, and if you get several queues on a machine you might have used up the slots of the queue that is selected first for the `qrsh -inherit ...`.

https://arc.liv.ac.uk/trac/SGE/ticket/813

It should help to have a PE for each queue, but you end up with 9 PEs for each PE you have right now.

BUT: What type of parallel applications are you using? With a tight integration of MPICH2/3 and Open MPI there is only one `qrsh -inherit ...` call per exechost and all other processes are forks. And since you get "Execution daemon on host <hostname> didn't accept task", you do have a tight integration.

-- Reuti

> While the error message is different, I think this may be related to a
> problem I had previously
> (http://gridengine.org/pipermail/users/2012-November/005164.html). In that
> case I was having problems starting large numbers of small parallel jobs at
> the same time (which would in turn cause jobs to start on many different
> queues at the same time). I am thinking there must be some race condition
> going on in this specific scenario (many parallel jobs starting at the same
> time across multiple queues on the same host).
>
> Thanks,
> Brendan
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
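
To make the slot exhaustion described in the reply concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Grid Engine's actual logic: the names `QueueSlots`, `spread` and `launch_tasks` are hypothetical, and the model simply assumes that every `qrsh -inherit ...` call on a host is charged to the first queue in the list, because the call cannot name a queue.

```python
from dataclasses import dataclass


@dataclass
class QueueSlots:
    """Slots granted to one job in one queue instance on a single host."""
    name: str
    granted: int
    used: int = 0


def spread(total_slots, queue_names):
    """Split total_slots as evenly as possible across the given queues."""
    base, extra = divmod(total_slots, len(queue_names))
    return [
        QueueSlots(name, base + (1 if i < extra else 0))
        for i, name in enumerate(queue_names)
    ]


def launch_tasks(queues, n_calls):
    """Count how many task-start calls the host accepts.

    Toy assumption: a call cannot address a queue, so every call is
    charged to the first queue in the list.
    """
    first = queues[0]
    accepted = 0
    for _ in range(n_calls):
        if first.used >= first.granted:
            break  # stands in for "didn't accept task"
        first.used += 1
        accepted += 1
    return accepted


# 8 slots on one host spread over 4 queues, one call per slot.
print(launch_tasks(spread(8, ["q1", "q2", "q3", "q4"]), 8))

# Same 8 slots confined to a single queue (the "one PE per queue" idea).
print(launch_tasks(spread(8, ["q1"]), 8))

# Same spread, but only one call per host (remaining processes are forks).
print(launch_tasks(spread(8, ["q1", "q2", "q3", "q4"]), 1))
```

In this model the first case stops after two accepted calls, while the other two cases succeed. That matches the two remedies mentioned in the reply: confine the job to one queue per host, or make only a single `qrsh -inherit ...` call per host.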
