We're planning an outage on our cluster for the 12th of this month.
I've added advance reservations for each of the subclusters to ensure
that nothing is running at that time.  The command I use is something
like

  qrsub -l mem=4G,job=true -a 04120800 -d 24:0:0 -pe '*-j' 256

where mem is a consumable resource used to control memory usage, job
is an exclusive resource associated with each host, and the pe varies
depending on which subcluster I'm reserving.
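
For illustration, the per-subcluster reservations can be sketched as a
dry run that only prints the qrsub commands.  The PE patterns
'clusterA-*' and 'clusterB-*' are hypothetical placeholders, and the
slot count is assumed to be the same for every subcluster, which may
not hold in practice.

```shell
#!/bin/sh
# Dry run: print (do not execute) one qrsub command per subcluster.
# START, DURATION and the -l request are taken from the command above;
# the PE patterns in the loop are hypothetical placeholders.
START=04120800      # MMDDhhmm, start of the outage
DURATION=24:0:0     # length of the reservation
SLOTS=256           # assumed identical for every subcluster

reserve_cmd() {
    # $1 = PE name or wildcard pattern for one subcluster
    printf "qrsub -l mem=4G,job=true -a %s -d %s -pe '%s' %s\n" \
        "$START" "$DURATION" "$1" "$SLOTS"
}

for pe in 'clusterA-*' 'clusterB-*'; do
    reserve_cmd "$pe"
done
```

Printing the commands first makes it easy to check the start time and
PE pattern before anything is submitted.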

The reservations themselves appear to be fine, but the schedule file
shows that queued jobs now make their reservations after the outage
even though they have plenty of time to run before it.  (I'm making
the reservations this early because we have a few people submitting
7 day jobs.)
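
A rough way to pull the late reservations out of the schedule file is
sketched below.  It assumes the colon-separated record layout
job:task:state:start_epoch:duration:level:object:resource:utilization
and that the file was written with scheduler monitoring enabled; the
function name late_reservations and the cutoff value in the usage note
are made up for the example.

```shell
#!/bin/sh
# List RESERVING records whose start time is at or after a cutoff.
# Assumed record layout (colon-separated fields):
#   job:task:state:start_epoch:duration:level:object:resource:utilization
late_reservations() {
    cutoff="$1"   # cutoff as seconds since the epoch
    awk -F: -v cutoff="$cutoff" '
        # keep only reservation records that start at or after the cutoff
        $3 == "RESERVING" && $4 + 0 >= cutoff + 0 {
            print $1 "." $2, $4, $7   # job.task, start time, object
        }
    '
}
```

Usage would be something like
late_reservations 1000000000 < $SGE_ROOT/$SGE_CELL/common/schedule
with the cutoff replaced by the epoch time at which the outage begins.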

If I restart the scheduler then the jobs start reserving slots prior
to the outage, but the queues acquire a qtype of N according to qstat
-f and jobs don't actually start in them.  I can change the qtype
shown by qstat -f to B by using qconf to set the qtype attribute of
each queue to batch (which it already is according to qconf -sq).

I can change the qtype shown by qstat -f to BP by modifying pe_list on
each queue, but it won't let me do this with a reservation in place
(even though I'm just repeating what is already there).  If I delete
the reservation, modify the pe_list and recreate the reservation then
I'm back to my original problem.

The upshot of this is that the cluster is now dominated by low
priority small jobs while the high priority parallel jobs are making
reservations after the outage.

Also after a scheduler restart it takes a while for existing jobs to
start making reservations.  For a few hours thereafter only jobs
submitted after the restart make reservations.

Running SGE 6.2u3 at the moment.  Is an upgrade likely to fix this?