Just to note: I applied my MPICH3 patch to MPICH2-1.4.1p1 as well, by changing the 
subroutine in question, and AFAICS it's working. Maybe this way you can avoid the 
complexes altogether.

Strange that no one on the MPICH list was interested in it.

-- Reuti

PS: I'm on vacation and will only be able to take a detailed look at your post in 2 weeks.


Am 25.09.2013 um 01:08 schrieb Brendan Moloney:

> Hello,
> 
> Well, I went ahead and set up something similar to what I described before 
> and it seems to be working well, so I figured I would share the setup in case 
> anyone else is interested. I was able to keep slot ranges functioning, but I 
> do still have to update a bunch of complex values when adding/removing nodes 
> from the cluster.
> 
> Just to refresh, the idea is to allow jobs with the shortest run time limit 
> (30 minutes or less) to have access to 100% of the cluster resources, while 
> jobs with longer run time limits get access to a decreasing portion of the 
> cluster resources. Currently I have 9 classifications of jobs based on the run 
> time limit: <= 30 minutes (veryveryshort), <= 1 hour (veryshort), <= 2 hours 
> (short), <= 4 hours (midshort), <= 8 hours (mid), <= 12 hours (midlong), <= 
> 24 hours (long), <= 48 hours (verylong), and > 48 hours (veryverylong). 
> 
> For each h_rt cutoff (except the smallest) I create a specialized complex for 
> keeping track of the total amount of slots and mem_free available to jobs 
> that take that long (or longer). The complex values are set up in the same way 
> as the standard slots/mem_free complexes:
> 
> $ qconf -sc | grep mem_free
> 
>  mem_free            mf              MEMORY      <=    YES         YES        0    -0.0000001
>  mem_free_l          mem_free_l      MEMORY      <=    YES         YES        0    0
>  mem_free_m          mem_free_m      MEMORY      <=    YES         YES        0    0
>  mem_free_ml         mem_free_ml     MEMORY      <=    YES         YES        0    0
>  mem_free_ms         mem_free_ms     MEMORY      <=    YES         YES        0    0
>  mem_free_s          mem_free_s      MEMORY      <=    YES         YES        0    0
>  mem_free_vl         mem_free_vl     MEMORY      <=    YES         YES        0    0
>  mem_free_vs         mem_free_vs     MEMORY      <=    YES         YES        0    0
>  mem_free_vvl        mem_free_vvl    MEMORY      <=    YES         YES        0    0
> 
> $ qconf -sc | grep slots
> 
>  slots               s               INT         <=    YES         YES        1    0
>  slots_l             slots_l         INT         <=    YES         YES        0    0
>  slots_m             slots_m         INT         <=    YES         YES        0    0
>  slots_ml            slots_ml        INT         <=    YES         YES        0    0
>  slots_ms            slots_ms        INT         <=    YES         YES        0    0
>  slots_s             slots_s         INT         <=    YES         YES        0    0
>  slots_vl            slots_vl        INT         <=    YES         YES        0    0
>  slots_vs            slots_vs        INT         <=    YES         YES        0    0
>  slots_vvl           slots_vvl       INT         <=    YES         YES        0    0
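> 
> (For reference, extra consumables like these can be created by dumping the 
> complex configuration, appending the new entries and loading the file back 
> in -- just a sketch, the file name is arbitrary and "qconf -mc" works 
> interactively as well:
> 
>   $ qconf -sc > /tmp/complexes.conf
>   $ echo "slots_vs     slots_vs     INT     <=  YES  YES  0  0" >> /tmp/complexes.conf
>   $ echo "mem_free_vs  mem_free_vs  MEMORY  <=  YES  YES  0  0" >> /tmp/complexes.conf
>   $ qconf -Mc /tmp/complexes.conf
> 
> and so on for the other tiers.)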
> 
> I define the standard slots and mem_free values for each host (our cluster is 
> heterogeneous) to prevent over-subscription, for example:
> 
> $ qconf -se node01 | grep complex
>  complex_values        slots=24,mem_free=32G
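> 
> These per-host values can be edited with "qconf -me node01" or set 
> non-interactively, e.g. (a sketch; host name and values as above):
> 
>   $ qconf -mattr exechost complex_values slots=24,mem_free=32G node01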
> 
> Then on the "global" pseudo host I set the specialized slots/mem_free 
> complexes.  In our case the cluster has a total of 120 cores and 268GB of RAM 
> and I give each subsequent h_rt cutoff access to a smaller percentage of 
> these resources:
> 
> $ qconf -se global
> 
>  hostname              global
>  load_scaling          NONE
>  complex_values        slots_vs=108,slots_s=96,slots_ms=84,slots_m=72, \
>                        slots_ml=60,slots_l=48,slots_vl=36,slots_vvl=24, \
>                        mem_free_vs=241.2G,mem_free_s=214.4G,mem_free_ms=187.6G, \
>                        mem_free_m=160.8G,mem_free_ml=147.4G,mem_free_l=134G, \
>                        mem_free_vl=120.6G,mem_free_vvl=107.2G
>  load_values           NONE
>  processors            0
>  user_lists            NONE
>  xuser_lists           NONE
>  projects              NONE
>  xprojects             NONE
>  usage_scaling         NONE
>  report_variables      NONE
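> 
> The global values are just fixed fractions of the cluster totals (120 slots, 
> 268G RAM), so they can be recomputed with a bit of shell whenever a node is 
> added or removed. A sketch of the arithmetic for the "vs" tier, which gets 
> 90% of everything (the later tiers use smaller fractions, and I taper memory 
> a little differently than slots):
> 
>   TOTAL_SLOTS=120
>   TOTAL_MEM_GB=268
>   echo "slots_vs=$((TOTAL_SLOTS * 90 / 100))"                           # -> 108
>   echo "mem_free_vs=$(echo "scale=1; $TOTAL_MEM_GB * 90 / 100" | bc)G"  # -> 241.2G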
> 
> Instead of multiple queues with different h_rt limits, I now have a single 
> queue without any limits defined. Finally, I use a JSV to parse the h_rt 
> value and set any appropriate specialized slots/mem_free complexes (the JSV 
> also sets h_vmem to be mem_free + 512M if it is not specified):
> 
> #!/bin/bash
> 
> jsv_on_start() {
>    return
> }
> 
> jsv_on_verify() {
>    REQ_MEM_FREE=$(jsv_sub_get_param l_hard mem_free)
>    REQ_MF=$(jsv_sub_get_param l_hard mf)
>    REQ_H_VMEM=$(jsv_sub_get_param l_hard h_vmem)
>    REQ_H_RT=$(jsv_sub_get_param l_hard h_rt)
> 
>    #If mf or mem_free is not set, set it here
>    if [ "$REQ_MF" == "" ] && [ "$REQ_MEM_FREE" == "" ]; then
>        REQ_MF="1G"
>        jsv_sub_add_param l_hard mem_free $REQ_MF
>    fi
> 
>    #Handle possibility of the short or long name being provided
>    if [ "$REQ_MF" == "" ]; then
>        REQ_MF=$REQ_MEM_FREE
>    fi
> 
>    #Convert mem_free value to bytes
>    REQ_MF_NDIGITS=`expr match "$REQ_MF" '[0-9.]*'`
>    REQ_MF_DIGITS=${REQ_MF:0:$REQ_MF_NDIGITS}
>    REQ_MF_SUFFIX=${REQ_MF:$REQ_MF_NDIGITS}
>    REQ_MF_BYTES=$REQ_MF_DIGITS
>    case "$REQ_MF_SUFFIX" in
>        K | k )
>            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1024 | bc)
>        ;;
>        M | m)
>            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1048576 | bc)
>        ;;
>        G | g)
>            REQ_MF_BYTES=$(echo $REQ_MF_BYTES*1073741824 | bc)
>        ;;
>    esac
> 
>    #If h_vmem is not specified, set it to mem_free plus 512M
>    if [ "$REQ_H_VMEM" == "" ]; then
>        REQ_H_VMEM_BYTES=$(echo $REQ_MF_BYTES+536870912 | bc)
>        jsv_sub_add_param l_hard h_vmem $REQ_H_VMEM_BYTES
>    fi
> 
>    #Parse h_rt into seconds
>    CURR_STR=$REQ_H_RT
>    N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
>    if [ $N_DIGITS == 0 ] ; then
>        TIME_VAL=0
>    else
>        TIME_VAL=${CURR_STR:0:$N_DIGITS}
>    fi
>    CURR_STR=${CURR_STR:`expr $N_DIGITS + 1`}
>    if [ "$CURR_STR" != "" ] ; then
>        HOUR_SECS=`expr $TIME_VAL \* 3600`
> 
>        N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
>        if [ $N_DIGITS == 0 ] ; then
>            TIME_VAL=0
>        else
>            TIME_VAL=${CURR_STR:0:$N_DIGITS}
>        fi
>        CURR_STR=${CURR_STR:`expr $N_DIGITS + 1`}
>        MIN_SECS=`expr $TIME_VAL \* 60`
> 
>        N_DIGITS=`expr match "$CURR_STR" '[0-9]*'`
>        if [ $N_DIGITS == 0 ] ; then
>            TIME_VAL=0
>        else
>            TIME_VAL=${CURR_STR:0:$N_DIGITS}
>        fi
>        TIME_VAL=`expr $HOUR_SECS + $MIN_SECS + $TIME_VAL`
>    fi
> 
>    #Set any specialized mem_free_* and slots_* complex values based on 
>    #h_rt. This limits the total resources available for jobs with 
>    #different run time limits
>    if [ $TIME_VAL -gt `expr 30 \* 60` ] ; then
>        jsv_sub_add_param l_hard mem_free_vs $REQ_MF
>        jsv_sub_add_param l_hard slots_vs 1
>    fi
>    if [ $TIME_VAL -gt 3600 ] ; then
>        jsv_sub_add_param l_hard mem_free_s $REQ_MF
>        jsv_sub_add_param l_hard slots_s 1
>    fi
>    if [ $TIME_VAL -gt `expr 2 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_ms $REQ_MF
>        jsv_sub_add_param l_hard slots_ms 1
>    fi
>    if [ $TIME_VAL -gt `expr 4 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_m $REQ_MF
>        jsv_sub_add_param l_hard slots_m 1
>    fi
>    if [ $TIME_VAL -gt `expr 8 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_ml $REQ_MF
>        jsv_sub_add_param l_hard slots_ml 1
>    fi
>    if [ $TIME_VAL -gt `expr 12 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_l $REQ_MF
>        jsv_sub_add_param l_hard slots_l 1
>    fi
>    if [ $TIME_VAL -gt `expr 24 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_vl $REQ_MF
>        jsv_sub_add_param l_hard slots_vl 1
>    fi
>    if [ $TIME_VAL -gt `expr 48 \* 3600` ] ; then
>        jsv_sub_add_param l_hard mem_free_vvl $REQ_MF
>        jsv_sub_add_param l_hard slots_vvl 1
>    fi
> 
>    jsv_accept "Job OK"
>    return
> }
> 
> . ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
> 
> jsv_main
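> 
> (A script like this can be activated cluster-wide as a server-side JSV by 
> pointing the jsv_url parameter of the global cluster configuration at it, 
> via "qconf -mconf", with a line something like the following -- the path is 
> only an example:
> 
>   jsv_url   /opt/sge/util/resources/jsv/rt_limits.jsv
> )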
> 
> 
> 
> I hope someone else finds this helpful.
> 
> Thanks,
> Brendan
> 
> ________________________________________
> From: users-boun...@gridengine.org [users-boun...@gridengine.org] on behalf 
> of Brendan Moloney [molo...@ohsu.edu]
> Sent: Friday, September 20, 2013 4:34 PM
> To: users@gridengine.org
> Subject: Re: [gridengine users] Problems with multi-queue parallel jobs
> 
> Hello,
> 
> Well, I was wrong: it appears the temporary directory issue is preventing me 
> from getting the full stdout/stderr results from all processes.  Also, just 
> switching to using $fill_up instead of $round_robin doesn't always prevent 
> the jobs from failing (due to too many queues being used on a single host).
> 
> So I spent some time trying to determine a way to keep the same functional 
> policies without having multiple queues, and I think I found a solution.  
> However, there are a couple of downsides to my plan, so I would appreciate 
> some feedback on how to improve it.
> 
> The basic idea is that I create a specialized slot complex for each time 
> limit (e.g. slots_short, slots_mid, slots_long, etc). Then on my only queue 
> (let's say all.q) I set the total number of slots available for each time 
> limit (e.g. 90% of total slots for slots_short, then 80% for slots_mid, 70% 
> for slots_long, etc). Then I use a JSV to parse the requested number of slots 
> and h_rt value, and then add a request for each specialized slot complex with 
> a time limit equal to or less than the requested h_rt.  So a job that would 
> normally run on my mid.q and use 10 slots instead runs on all.q and requests 
> 10 each of slots, slots_mid, and slots_short.
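> 
> As a concrete sketch of what that means for a submission (PE name and 
> numbers made up for illustration), a job that used to be submitted as
> 
>   qsub -q mid.q -pe mpi 10 job.sh
> 
> would now go to all.q, with the JSV adding something equivalent to
> 
>   -l slots_mid=10,slots_short=10
> 
> on top of the ordinary 10-slot request.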
> 
> There are two main downsides to this approach that I can see:
> 
>  1) Requesting a slot range would no longer work as the JSV has no way of 
> knowing how many slots are actually going to be used.
> 
>  2) I have to manually update all of the complex values any time a node is 
> added or removed from the cluster.
> 
> Any thoughts or suggestions?
> 
> Thanks,
> Brendan
> 
> ________________________________________
> From: Brendan Moloney
> Sent: Monday, September 16, 2013 5:03 PM
> To: Dave Love
> Cc: users@gridengine.org
> Subject: RE: [gridengine users] Problems with multi-queue parallel jobs
> 
> Hello,
> 
> I have heard of the temporary directory issue before, but we run a very small 
> number of MPI applications and none of them have this problem.
> 
> I would move away from our current multi-queue setup if there was a viable 
> alternative that meets our needs.  In particular we need to limit available 
> resources based on run time while still allowing very short jobs (including 
> MPI jobs) to utilize all of the available resources.  If there are other 
> (better supported) ways to achieve these goals then I would appreciate some 
> pointers.
> 
> Thanks,
> Brendan
> ________________________________________
> From: Dave Love [d.l...@liverpool.ac.uk]
> Sent: Monday, September 16, 2013 3:01 PM
> To: Brendan Moloney
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Problems with multi-queue parallel jobs
> 
> Brendan Moloney <molo...@ohsu.edu> writes:
> 
>> Hello,
>> 
>> I use multiple queues to divide up available resources based on job
>> run times. Large parallel jobs will typically span multiple queues and
>> this has generally been working fine thus far.
> 
> I'd strongly recommend avoiding that.  Another reason is possible
> trouble due to the temporary directory name being derived from the queue
> name.  (I changed that but had some odd failures when I introduced it,
> so it's not in the current version, and I haven't had a chance to go
> back and figure out why.)
> 
> --
> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 

