Hello,
Well I went ahead and set up something similar to what I described before and
it seems to be working well, so I figured I would share the setup in case
anyone else is interested. I was able to keep slot ranges functioning, but I do
still have to update a bunch of complex values when adding/removing nodes from
the cluster.
Just to refresh, the idea is to allow jobs with the shortest run time limit
(less than 30 minutes) to have access to 100% of the cluster resources, while
jobs with longer run time limits get access to a decreasing portion of the
cluster resources. Currently I have 9 classifications of jobs based on the run
time limit: <= 30 minutes (veryveryshort), <= 1 hour (veryshort), <= 2 hours
(short), <= 4 hours (midshort), <= 8 hours (mid), <= 12 hours (midlong), <= 24
hours (long), <= 48 hours (verylong), and > 48 hours (veryverylong).
For each h_rt cutoff (except the smallest) I create specialized complexes that
track the total number of slots and amount of mem_free available to jobs that
take that long (or longer). These complexes are set up in the same way as the
standard slots/mem_free complexes:
$ qconf -sc | grep mem_free
mem_free       mf             MEMORY   <=   YES   YES   0   -0.0000001
mem_free_l     mem_free_l     MEMORY   <=   YES   YES   0   0
mem_free_m     mem_free_m     MEMORY   <=   YES   YES   0   0
mem_free_ml    mem_free_ml    MEMORY   <=   YES   YES   0   0
mem_free_ms    mem_free_ms    MEMORY   <=   YES   YES   0   0
mem_free_s     mem_free_s     MEMORY   <=   YES   YES   0   0
mem_free_vl    mem_free_vl    MEMORY   <=   YES   YES   0   0
mem_free_vs    mem_free_vs    MEMORY   <=   YES   YES   0   0
mem_free_vvl   mem_free_vvl   MEMORY   <=   YES   YES   0   0
$ qconf -sc | grep slots
slots          s              INT      <=   YES   YES   1   0
slots_l        slots_l        INT      <=   YES   YES   0   0
slots_m        slots_m        INT      <=   YES   YES   0   0
slots_ml       slots_ml       INT      <=   YES   YES   0   0
slots_ms       slots_ms       INT      <=   YES   YES   0   0
slots_s        slots_s        INT      <=   YES   YES   0   0
slots_vl       slots_vl       INT      <=   YES   YES   0   0
slots_vs       slots_vs       INT      <=   YES   YES   0   0
slots_vvl      slots_vvl      INT      <=   YES   YES   0   0
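In case it is useful to anyone replicating this: the extra complexes are just
added as additional consumable lines in the complex configuration, via
"qconf -mc" interactively or "qconf -Mc <file>". A minimal sketch of the lines
to append, with the same columns as the qconf -sc output above:
#name          shortcut       type     relop  requestable  consumable  default  urgency
slots_vs       slots_vs       INT      <=     YES          YES         0        0
mem_free_vs    mem_free_vs    MEMORY   <=     YES          YES         0        0
...and likewise for the _s, _ms, _m, _ml, _l, _vl, and _vvl variants.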
I define the standard slots and mem_free values for each host (our cluster is
heterogeneous) to prevent over-subscription, for example:
$ qconf -se node01 | grep complex
complex_values slots=24,mem_free=32G
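These per-host values can be edited with "qconf -me <hostname>", or scripted
with something along these lines (treat the exact invocation as a sketch):
$ qconf -mattr exechost complex_values slots=24,mem_free=32G node01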
Then on the "global" pseudo host I set the specialized slots/mem_free
complexes. In our case the cluster has a total of 120 cores and 268GB of RAM
and I give each subsequent h_rt cutoff access to a smaller percentage of these
resources:
$ qconf -se global
hostname global
load_scaling NONE
complex_values slots_vs=108,slots_s=96,slots_ms=84,slots_m=72, \
slots_ml=60,slots_l=48,slots_vl=36,slots_vvl=24, \
mem_free_vs=241.2G,mem_free_s=214.4G,mem_free_ms=187.6G, \
mem_free_m=160.8G,mem_free_ml=147.4G,mem_free_l=134G, \
mem_free_vl=120.6G,mem_free_vvl=107.2G
load_values NONE
processors 0
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
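The only tedious part is that these numbers all have to be recomputed when the
cluster totals change. They are just fixed fractions of the totals (90% down to
20% in steps of 10 for slots, and a gentler 90% down to 40% taper for
mem_free), so a small helper could regenerate the line for pasting into
"qconf -me global". This is a hypothetical sketch, not part of my actual setup:
#!/bin/bash
# Sketch: print the complex_values line for the "global" pseudo host from the
# cluster totals, so the numbers do not have to be recalculated by hand.
TOTAL_SLOTS=120
TOTAL_MEM_GB=268
# One entry per h_rt class; longer-running classes get a smaller fraction.
CLASSES=(vs s ms m ml l vl vvl)
SLOT_FRAC=(0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20)
MEM_FRAC=(0.90 0.80 0.70 0.60 0.55 0.50 0.45 0.40)
VALS=""
for i in ${!CLASSES[@]}; do
    C=${CLASSES[$i]}
    S=$(echo "$TOTAL_SLOTS * ${SLOT_FRAC[$i]} / 1" | bc)
    M=$(echo "scale=1; $TOTAL_MEM_GB * ${MEM_FRAC[$i]} / 1" | bc)
    VALS="${VALS}slots_${C}=${S},mem_free_${C}=${M}G,"
done
echo "complex_values ${VALS%,}"
# The output can then be pasted into 'qconf -me global'.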
Instead of multiple queues with different h_rt limits, I now have a single
queue without any limits defined. Finally, I use a JSV to parse the h_rt value
and request any appropriate specialized slots/mem_free complexes (the JSV also
sets h_vmem to mem_free + 512M if it is not specified):
#!/bin/bash

jsv_on_start() {
    return
}

jsv_on_verify() {
    REQ_MEM_FREE=$(jsv_sub_get_param l_hard mem_free)
    REQ_MF=$(jsv_sub_get_param l_hard mf)
    REQ_H_VMEM=$(jsv_sub_get_param l_hard h_vmem)
    REQ_H_RT=$(jsv_sub_get_param l_hard h_rt)

    #If mf or mem_free is not set, set it here
    if [ "$REQ_MF" == "" ] && [ "$REQ_MEM_FREE" == "" ]; then
        REQ_MF="1G"
        jsv_sub_add_param l_hard mem_free $REQ_MF
    fi
    #Handle possibility of the short or long name being provided
    if [ "$REQ_MF" == "" ]; then
        REQ_MF=$REQ_MEM_FREE
    fi

    #Convert mem_free value to bytes
    REQ_MF_NDIGITS=`expr match "$REQ_MF" '[0-9.]*'`
    REQ_MF_DIGITS=${REQ_MF:0:$REQ_MF_NDIGITS}
    REQ_MF_SUFFIX=${REQ_MF:$REQ_MF_NDIGITS}
    REQ_MF_BYTES=$REQ_MF_DIGITS
    case "$REQ_MF_SUFFIX" in
        K | k )
            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1024 | bc)
            ;;
        M | m )
            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1048576 | bc)
            ;;
        G | g )
            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1073741824 | bc)
            ;;
    esac

    #If h_vmem is not specified, set it to mem_free plus 512M
    if [ "$REQ_H_VMEM" == "" ]; then
        REQ_H_VMEM_BYTES=$(echo $REQ_MF_BYTES+536870912 | bc)
        jsv_sub_add_param l_hard h_vmem $REQ_H_VMEM_BYTES
    fi

    #Parse h_rt into seconds
    CURR_STR=$REQ_H_RT
    N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
    if [ $N_DIGITS == 0 ] ; then
        TIME_VAL=0
    else
        TIME_VAL=${CURR_STR:0:$N_DIGITS}
    fi
    CURR_STR=${CURR_STR:`expr $N_DIGITS + 1`}
    if [ "$CURR_STR" != "" ] ; then
        HOUR_SECS=`expr $TIME_VAL \* 3600`
        N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
        if [ $N_DIGITS == 0 ] ; then
            TIME_VAL=0
        else
            TIME_VAL=${CURR_STR:0:$N_DIGITS}
        fi
        CURR_STR=${CURR_STR:`expr $N_DIGITS + 1`}
        MIN_SECS=`expr $TIME_VAL \* 60`
        N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
        if [ $N_DIGITS == 0 ] ; then
            TIME_VAL=0
        else
            TIME_VAL=${CURR_STR:0:$N_DIGITS}
        fi
        TIME_VAL=`expr $HOUR_SECS + $MIN_SECS + $TIME_VAL`
    fi

    #Set any specialized mem_free_* and slots_* complex values based on
    #h_rt. This limits the total resources available for jobs with
    #different run time limits
    if [ $TIME_VAL -gt `expr 30 \* 60` ] ; then
        jsv_sub_add_param l_hard mem_free_vs $REQ_MF
        jsv_sub_add_param l_hard slots_vs 1
    fi
    if [ $TIME_VAL -gt 3600 ] ; then
        jsv_sub_add_param l_hard mem_free_s $REQ_MF
        jsv_sub_add_param l_hard slots_s 1
    fi
    if [ $TIME_VAL -gt `expr 2 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_ms $REQ_MF
        jsv_sub_add_param l_hard slots_ms 1
    fi
    if [ $TIME_VAL -gt `expr 4 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_m $REQ_MF
        jsv_sub_add_param l_hard slots_m 1
    fi
    if [ $TIME_VAL -gt `expr 8 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_ml $REQ_MF
        jsv_sub_add_param l_hard slots_ml 1
    fi
    if [ $TIME_VAL -gt `expr 12 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_l $REQ_MF
        jsv_sub_add_param l_hard slots_l 1
    fi
    if [ $TIME_VAL -gt `expr 24 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_vl $REQ_MF
        jsv_sub_add_param l_hard slots_vl 1
    fi
    if [ $TIME_VAL -gt `expr 48 \* 3600` ] ; then
        jsv_sub_add_param l_hard mem_free_vvl $REQ_MF
        jsv_sub_add_param l_hard slots_vvl 1
    fi

    jsv_accept "Job OK"
    return
}

. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
jsv_main
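In case it saves anyone a lookup: for the script to run on every submission it
has to be registered as a server-side JSV, which is done through the jsv_url
parameter of the global configuration. Something along these lines (the path is
just an example, not my actual location):
$ qconf -mconf
...
jsv_url /opt/sge/cluster/jsv/limit_by_h_rt.sh
...
A client-side JSV (the -jsv switch to qsub, or an entry in sge_request) would
also work, but users could then bypass it, which defeats the purpose here.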
I hope someone else finds this helpful.
Thanks,
Brendan
________________________________________
From: [email protected] [[email protected]] on behalf of
Brendan Moloney [[email protected]]
Sent: Friday, September 20, 2013 4:34 PM
To: [email protected]
Subject: Re: [gridengine users] Problems with multi-queue parallel jobs
Hello,
Well, I was wrong: it appears the temporary directory issue is preventing me
from getting the full stdout/stderr results from all processes. Also, just
switching to using $fill_up instead of $round_robin doesn't always prevent the
jobs from failing (due to too many queues being used on a single host).
So I spent some time trying to determine a way to keep the same functional
policies without having multiple queues, and I think I found a solution.
However, there are a couple of downsides to my plan, so I would appreciate some
feedback on how to improve it.
The basic idea is that I create a specialized slot complex for each time limit
(e.g. slots_short, slots_mid, slots_long, etc.). Then on my only queue (let's say
all.q) I set the total number of slots available for each time limit (i.e. 90%
of total slots for slots_short, then 80% for slots_mid, 70% for slots_long,
etc). Then I use a JSV to parse the requested number of slots and h_rt value,
and then add a request for each specialized slot complex with a time limit
equal to or less than the requested h_rt. So a job that would normally run on
my mid.q and use 10 slots instead runs on all.q and requests 10 each of slots,
slots_mid, and slots_short.
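(Purely as an illustration of the rewrite, using the names from this draft: a
submission like "qsub -pe mpi 10 -l h_rt=8:00:00 job.sh" would come out of the
JSV also carrying "-l slots_mid=1,slots_short=1", and since those are per-slot
consumables the 10-slot job ends up drawing 10 from each pool.)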
There are two main downsides to this approach that I can see:
1) Requesting a slot range would no longer work as the JSV has no way of
knowing how many slots are actually going to be used.
2) I have to manually update all of the complex values any time a node is
added or removed from the cluster.
Any thoughts or suggestions?
Thanks,
Brendan
________________________________________
From: Brendan Moloney
Sent: Monday, September 16, 2013 5:03 PM
To: Dave Love
Cc: [email protected]
Subject: RE: [gridengine users] Problems with multi-queue parallel jobs
Hello,
I have heard of the temporary directory issue before, but we run a very small
number of MPI applications and none of them have this problem.
I would move away from our current multi-queue setup if there were a viable
alternative that meets our needs. In particular we need to limit available
resources based on run time while still allowing very short jobs (including MPI
jobs) to utilize all of the available resources. If there are other (better
supported) ways to achieve these goals then I would appreciate some pointers.
Thanks,
Brendan
________________________________________
From: Dave Love [[email protected]]
Sent: Monday, September 16, 2013 3:01 PM
To: Brendan Moloney
Cc: [email protected]
Subject: Re: [gridengine users] Problems with multi-queue parallel jobs
Brendan Moloney <[email protected]> writes:
> Hello,
>
> I use multiple queues to divide up available resources based on job
> run times. Large parallel jobs will typically span multiple queues and
> this has generally been working fine thus far.
I'd strongly recommend avoiding that. Another reason is possible
trouble due to the temporary directory name being derived from the queue
name. (I changed that but had some odd failures when I introduced it,
so it's not in the current version, and I haven't had a chance to go
back and figure out why.)
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users