Hello,

Well, I was wrong: it appears the temporary directory issue is preventing me 
from getting the full stdout/stderr output from all processes.  Also, just 
switching from $round_robin to $fill_up doesn't always prevent the 
jobs from failing (due to too many queues being used on a single host).

So I spent some time trying to find a way to keep the same functional 
policies without having multiple queues, and I think I found a solution.  
However, there are a couple of downsides to my plan, so I would appreciate some 
feedback on how to improve it.

The basic idea is that I create a specialized slot complex for each time limit 
(e.g. slots_short, slots_mid, slots_long, etc.). Then on my only queue (let's say 
all.q) I set the total number of slots available for each time limit (e.g. 90% 
of total slots for slots_short, then 80% for slots_mid, 70% for slots_long, 
etc.). Then I use a JSV to parse the requested number of slots and h_rt value, 
and add a request for each specialized slot complex with a time limit 
equal to or less than the requested h_rt.  So a job that would normally run on 
my mid.q and use 10 slots instead runs on all.q and requests 10 each of slots, 
slots_mid, and slots_short.
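
As a rough sketch of the decision the JSV would make (plain Python to show the 
logic only, not a working JSV): the time limits below are made-up values, and 
the function name is hypothetical. It reads the rule the way the mid.q example 
does, i.e. it finds the tier the job would have landed in under the multi-queue 
setup and requests that tier's complex plus every shorter tier's.

```python
# Hypothetical tiers: (complex name, maximum h_rt in seconds), shortest first.
# The names follow the example above; the limits are illustrative assumptions.
TIERS = [
    ("slots_short", 1 * 3600),
    ("slots_mid", 24 * 3600),
    ("slots_long", 7 * 24 * 3600),
]


def extra_slot_requests(slots, h_rt):
    """Return {complex: count} to add to a job asking for `slots` slots.

    `h_rt` is the requested run time in seconds. The job is placed in the
    first tier whose limit covers h_rt, and it requests that tier's complex
    along with the complexes of all shorter tiers.
    """
    for index, (_name, limit) in enumerate(TIERS):
        if h_rt <= limit:
            return {name: slots for name, _limit in TIERS[: index + 1]}
    raise ValueError("h_rt exceeds the longest configured tier")


# A 10-slot job asking for 12 hours falls in the (assumed) mid tier.
requests = extra_slot_requests(10, 12 * 3600)
```

The ordinary slots request from the parallel environment stays as it is; the 
dictionary only describes the additional per-tier requests.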

There are two main downsides to this approach that I can see:

  1) Requesting a slot range would no longer work as the JSV has no way of 
knowing how many slots are actually going to be used. 

  2) I have to manually update all of the complex values any time a node is 
added or removed from the cluster.
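
The second point could at least be reduced to rerunning a small script. The 
sketch below recomputes the per-tier totals from the cluster's slot count, 
using the example percentages above; the function names are hypothetical, the 
rounding down is an assumption, and how the result gets applied to all.q is 
left open.

```python
# Percentages taken from the example above; extend the list for further tiers.
TIER_PERCENT = [("slots_short", 90), ("slots_mid", 80), ("slots_long", 70)]


def tier_capacities(total_slots):
    """Return {complex: capacity}, rounding each share down to a whole slot."""
    return {name: total_slots * percent // 100 for name, percent in TIER_PERCENT}


def as_complex_values(capacities):
    """Format the capacities as a comma-separated name=value string."""
    return ",".join(f"{name}={value}" for name, value in capacities.items())


# Example: a cluster with 128 slots in total.
print(as_complex_values(tier_capacities(128)))
```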

Any thoughts or suggestions?

Thanks,
Brendan

________________________________________
From: Brendan Moloney
Sent: Monday, September 16, 2013 5:03 PM
To: Dave Love
Cc: users@gridengine.org
Subject: RE: [gridengine users] Problems with multi-queue parallel jobs

Hello,

I have heard of the temporary directory issue before, but we run a very small 
number of MPI applications and none of them have this problem.

I would move away from our current multi-queue setup if there were a viable 
alternative that meets our needs.  In particular, we need to limit available 
resources based on run time while still allowing very short jobs (including MPI 
jobs) to utilize all of the available resources.  If there are other (better 
supported) ways to achieve these goals, then I would appreciate some pointers.

Thanks,
Brendan
________________________________________
From: Dave Love [d.l...@liverpool.ac.uk]
Sent: Monday, September 16, 2013 3:01 PM
To: Brendan Moloney
Cc: users@gridengine.org
Subject: Re: [gridengine users] Problems with multi-queue parallel jobs

Brendan Moloney <molo...@ohsu.edu> writes:

> Hello,
>
> I use multiple queues to divide up available resources based on job
> run times. Large parallel jobs will typically span multiple queues and
> this has generally been working fine thus far.

I'd strongly recommend avoiding that.  Another reason is possible
trouble due to the temporary directory name being derived from the queue
name.  (I changed that but had some odd failures when I introduced it,
so it's not in the current version, and I haven't had a chance to go
back and figure out why.)

--
Community Grid Engine:  http://arc.liv.ac.uk/SGE/

