> On 12.12.2016 at 07:02, John_Tai <[email protected]> wrote:
>
> Thank you all for trying to work this out.
>
>
>
>>> allocation_rule $fill_up <--- works better for parallel jobs
>
> I do want my job to run on one machine only
>
>>> control_slaves TRUE <--- you want tight integration with SGE
>>> job_is_first_task <--- can go either way, unless you are sure your
>>> software will control job distro...
>
> And the job will be controlled by my software, not SGE. I only need SGE to
> keep track of the slots (i.e. CPU cores).
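>
> For reference, this maps to the PE settings I created (summary sketch; the
> full dump is further down):
>
> # qconf -sp cores
> allocation_rule    $pe_slots    # all requested slots on a single machine
> control_slaves     FALSE        # loose integration; my software manages itself
> job_is_first_task  TRUE         # the submitted process itself does the work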
>
> -------------------------------------------------
>
> There were no messages on qmaster or ibm038. The job I submitted is not in
> error, it's just waiting for free slots.
>
> -------------------------------------------------
>
> I changed queue slots setting and removed all other PE, but I got the same
> error.
>
>
> # qconf -sq all.q
> qname all.q
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
Unless you intend to oversubscribe, the above can be set to NONE. In fact, the
scheduler looks ahead at the load a new job is expected to add, and together with:
$ qconf -ssconf
...
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
this can lead to the effect that the job can't be scheduled. These can even be
adjusted to read:
job_load_adjustments NONE
load_adjustment_decay_time 0:0:0
In your current case, of course, where 8 slots are defined and you are testing
with 2, this shouldn't be a problem.
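If you want to try that, the scheduler configuration can be edited in place
with the standard command:

$ qconf -msconf

and the two entries changed as shown above.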
Did you set up and/or request any memory per machine?
OTOH: if you submit 2 single-CPU jobs to node ibm038, are they scheduled?
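E.g., reusing your xclock test (a sketch):

$ qsub -b y -cwd -now n -q all.q@ibm038 xclock
$ qsub -b y -cwd -now n -q all.q@ibm038 xclock
$ qstat -f -q all.q@ibm038

If both are dispatched, the plain slot accounting is fine and the problem is
specific to the PE request.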
-- Reuti
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list cores
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
> Your job 92 ("xclock") has been submitted
> # qstat
> job-ID  prior    name    user   state  submit/start at      queue  slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>      91 0.55500  xclock  johnt  qw     12/12/2016 13:54:02          2
>      92 0.00000  xclock  johnt  qw     12/12/2016 13:55:59          2
> # qalter -w p 92
> Job 92 cannot run in queue "pc.q" because it is not contained in its hard
> queue list (-q)
> Job 92 cannot run in queue "sim.q" because it is not contained in its hard
> queue list (-q)
> Job 92 cannot run in queue "all.q@ibm021" because it is not contained in its
> hard queue list (-q)
> Job 92 cannot run in queue "all.q@ibm037" because it is not contained in its
> hard queue list (-q)
> Job 92 cannot run in PE "cores" because it only offers 0 slots
> verification: no suitable queues
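
To narrow down why the PE offers 0 slots, a few standard checks that might
help (a sketch; see also the question about exec host limits and RQS further
down):

# qconf -se ibm038     # a "slots" entry under complex_values would cap the host
# qconf -srqsl         # list resource quota sets, then: qconf -srqs <name>
# qstat -g c           # per-cluster-queue summary of used/available slots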
>
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Coleman, Marcus [JRDUS Non-J&J]
> Sent: Monday, December 12, 2016 1:35
> To: [email protected]
> Subject: Re: [gridengine users] users Digest, Vol 72, Issue 13
>
> Hi
>
> I am sure this is your problem... You are submitting a job that requires 2
> cores to a queue that has only 1 slot available.
> If your hosts all have the same number of cores, there is no reason to separate
> them with commas. That is only needed if the hosts have different numbers of
> slots, or if you want to manage slots per host...
>
> slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
> slots 8
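>
> To switch from the per-host list to the flat value, something like this
> should work:
>
> qconf -mattr queue slots 8 all.q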
>
>
> I would list only the PE I am requesting... unless you plan to use each of
> those PEs:
> pe_list make mpi smp cores
> pe_list cores
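>
> E.g. (note that -mattr replaces the whole list, unlike the -aattr you used
> earlier to append):
>
> qconf -mattr queue pe_list cores all.q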
>
>
> Also, you mentioned the parallel environment. I WOULD change the allocation
> rule to $fill_up, unless your software (not SGE) controls job distribution...
>
> qconf -sp cores
> allocation_rule $pe_slots <--- (only use one machine)
> control_slaves FALSE <--- (I think you want tight integration)
> job_is_first_task TRUE <--- (true if the job submitted first only kicks
> off other tasks)
>
> allocation_rule $fill_up <--- works better for parallel jobs
> control_slaves TRUE <--- you want tight integration with SGE
> job_is_first_task <--- can go either way, unless you are sure your software
> will control job distro...
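>
> Either variant can then be applied by editing the PE in place:
>
> qconf -mp cores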
>
>
> Also, what do the qmaster messages and the SGE messages on the associated node say...
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of [email protected]
> Sent: Sunday, December 11, 2016 9:05 PM
> To: [email protected]
> Subject: [EXTERNAL] users Digest, Vol 72, Issue 13
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 12 Dec 2016 05:04:33 +0000
> From: John_Tai <[email protected]>
> To: Christopher Heiny <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] CPU complex
> Message-ID: <EB25FF8EBBD4BC478EF05F2F4C436479021D2BA5A2@shex-d02>
> Content-Type: text/plain; charset="utf-8"
>
> # qconf -sq all.q
> qname all.q
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list make mpi smp cores
> rerun FALSE
> slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
>
>
> From: Christopher Heiny [mailto:[email protected]]
> Sent: Monday, December 12, 2016 12:22
> To: John_Tai
> Cc: [email protected]; Reuti
> Subject: Re: [gridengine users] CPU complex
>
>
>
> On Dec 11, 2016 5:11 PM, "John_Tai"
> <[email protected]<mailto:[email protected]>> wrote:
> I associated the queue with the PE:
>
> qconf -aattr queue pe_list cores all.q
>
> The only slots defined were in the all.q queue, plus the total slots in the PE:
>
>>> # qconf -sp cores
>>> pe_name cores
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
> Do I need to define slots in another way for each exec host? Is there a way
> to check the current free slots for a host, other than the qstat -f below?
>
>> # qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>
> What is the output of the command
> qconf -sq all.q
> (I think that's the right one)?
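>
> Also, besides qstat -f, `qstat -g c` gives a per-cluster-queue slot summary,
> and `qhost -q` a per-host view. A sketch of what I'd expect here:
>
> # qstat -g c
> CLUSTER QUEUE   CQLOAD   USED   AVAIL  TOTAL  aoACDS  cdsuE
> all.q             0.01      0      24     24       0      0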
>
> Chris
>
>
>
>
>
>
> -----Original Message-----
> From: Reuti
> [mailto:[email protected]<mailto:[email protected]>]
> Sent: Saturday, December 10, 2016 5:40
> To: John_Tai
> Cc: [email protected]<mailto:[email protected]>
> Subject: Re: [gridengine users] CPU complex
>
> On 09.12.2016 at 10:36, John_Tai wrote:
>
>> 8 slots:
>>
>> # qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> pc.q@ibm021                    BIP   0/0/1          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> sim.q@ibm021                   BIP   0/0/1          0.02     lx-amd64
>
> Is there any limit of slots in the exechost defined, or in an RQS?
>
> -- Reuti
>
>
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>>
>>
>>
>> -----Original Message-----
>> From: Reuti
>> [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, December 09, 2016 3:46
>> To: John_Tai
>> Cc: [email protected]<mailto:[email protected]>
>> Subject: Re: [gridengine users] CPU complex
>>
>> Hi,
>>
>> On 09.12.2016 at 08:20, John_Tai wrote:
>>
>>> I've setup PE but I'm having problems submitting jobs.
>>>
>>> - Here's the PE I created:
>>>
>>> # qconf -sp cores
>>> pe_name cores
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $pe_slots
>>> control_slaves FALSE
>>> job_is_first_task TRUE
>>> urgency_slots min
>>> accounting_summary FALSE
>>> qsort_args NONE
>>>
>>> - I've then added this to all.q:
>>>
>>> qconf -aattr queue pe_list cores all.q
>>
>> How many "slots" were defined in the queue definition for all.q?
>>
>> -- Reuti
>>
>>
>>> - Now I submit a job:
>>>
>>> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
>>> Your job 89 ("xclock") has been submitted
>>> # qstat
>>> job-ID  prior    name    user   state  submit/start at      queue  slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>      89 0.00000  xclock  johnt  qw     12/09/2016 15:14:25          2
>>> # qalter -w p 89
>>> Job 89 cannot run in PE "cores" because it only offers 0 slots
>>> verification: no suitable queues
>>> # qstat -f
>>> queuename                      qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>>>
>>>
>>> ----------------------------------------------------
>>>
>>> It looks like all.q@ibm038 should have 8 free slots,
>>> so why is it only offering 0?
>>>
>>> Hope you can help me.
>>> Thanks
>>> John
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti
>>> [mailto:[email protected]<mailto:[email protected]>
>>> ]
>>> Sent: Monday, December 05, 2016 6:32
>>> To: John_Tai
>>> Cc: [email protected]<mailto:[email protected]>
>>> Subject: Re: [gridengine users] CPU complex
>>>
>>> Hi,
>>>
>>>> On 05.12.2016 at 09:36, John_Tai <[email protected]> wrote:
>>>>
>>>> Thank you so much for your reply!
>>>>
>>>>>> Will you use the consumable virtual_free here instead of mem?
>>>>
>>>> Yes I meant to write virtual_free, not mem. Apologies.
>>>>
>>>>>> For parallel jobs you need to configure a (or some) so called PE
>>>>>> (Parallel Environment).
>>>>
>>>> My jobs are actually just one process which uses multiple cores; for
>>>> example, in top one process "simv" is currently using 2 CPU cores (200%).
>>>
>>> Yes, then it's a parallel job for SGE. Although the entries for
>>> start_proc_args and stop_proc_args can be left at their defaults, a PE is
>>> SGE's paradigm for a parallel job.
>>>
>>>
>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>> 3017 kelly 20 0 3353m 3.0g 165m R 200.0 0.6 15645:46 simv
>>>>
>>>> So I'm not sure PE is suitable for my case, since it is not multiple
>>>> parallel processes running at the same time. Am I correct?
>>>>
>>>> If so, I am trying to find a way to get SGE to keep track of the number of
>>>> cores used, but I believe it only keeps track of the total CPU usage in %.
>>>> I guess I could use this and the <total num cores> to get the <num of
>>>> cores in use>, but how do I integrate it in SGE?
>>>
>>> You can specify the necessary number of cores for your job with the -pe
>>> parameter, which can also be a range. The allocation granted by SGE can be
>>> checked in the job script via $NHOSTS, $NSLOTS, and $PE_HOSTFILE.
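>>>
>>> For example, a minimal job script using them might read (a sketch; the
>>> -threads flag for simv is hypothetical, substitute whatever your software
>>> expects):
>>>
>>> #!/bin/sh
>>> #$ -pe cores 2-4            # request a range: SGE grants 2 to 4 slots
>>> echo "granted $NSLOTS slots on $NHOSTS host(s)"
>>> cat "$PE_HOSTFILE"          # per line: host, slots, queue, processor range
>>> ./simv -threads "$NSLOTS"   # hypothetical: pass the granted core count along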
>>>
>>> With this setup, SGE will track the number of used cores per machine. The
>>> available ones you define in the queue definition. In case you have more
>>> than one queue per exec host, you additionally need to set up an overall
>>> limit on the cores usable at the same time, to avoid oversubscription.
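>>>
>>> One common way to set such a limit is a consumable slots entry on the exec
>>> host itself, e.g.:
>>>
>>> qconf -aattr exechost complex_values slots=8 ibm038
>>>
>>> which caps the slots used on ibm038 across all queues (an RQS would work
>>> too).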
>>>
>>> -- Reuti
>>>
>>>> Thank you again for your help.
>>>>
>>>> John
>>>>
>>>> -----Original Message-----
>>>> From: Reuti
>>>> [mailto:[email protected]<mailto:[email protected]
>>>>> ]
>>>> Sent: Monday, December 05, 2016 4:21
>>>> To: John_Tai
>>>> Cc: [email protected]<mailto:[email protected]>
>>>> Subject: Re: [gridengine users] CPU complex
>>>>
>>>> Hi,
>>>>
>>>> On 05.12.2016 at 08:00, John_Tai wrote:
>>>>
>>>>> Newbie here, hope to understand SGE usage.
>>>>>
>>>>> I've successfully configured virtual_free as a complex for telling SGE
>>>>> how much memory is needed when submitting a job, as described here:
>>>>>
>>>>> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
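>>>>>
>>>>> (For the record, the consumable entry in qconf -mc ended up along these
>>>>> lines, as a sketch:
>>>>>
>>>>> virtual_free  vf  MEMORY  <=  YES  YES  0  0
>>>>> )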
>>>>>
>>>>> How do I do the same for telling SGE how many CPU cores a job needs? For
>>>>> example:
>>>>>
>>>>> qsub -l mem=24G,cpu=4 myjob
>>>>
>>>> Will you use the consumable virtual_free here instead of mem?
>>>>
>>>>
>>>>> Obviously I'd need SGE to keep track of the actual CPU utilization on the
>>>>> host, just as virtual_free is tracked independently of the SGE jobs.
>>>>
>>>> For parallel jobs you need to configure one (or several) so-called PEs
>>>> (Parallel Environments). Their purpose is to make preparations for the
>>>> parallel job, like rearranging the list of granted slots, preparing shared
>>>> directories between the nodes, ...
>>>>
>>>> These PEs were more important in former times, when parallel libraries
>>>> were not programmed to integrate automatically with SGE for a tight
>>>> integration. Your submissions could read:
>>>>
>>>> qsub -pe smp 4 myjob    # allocation_rule $pe_slots, control_slaves true
>>>> qsub -pe orte 16 myjob  # allocation_rule $round_robin, control_slaves true
>>>>
>>>> where smp and orte are the chosen parallel environments for OpenMP and
>>>> Open MPI, respectively. Their settings are explained in `man sge_pe`, and
>>>> the "-pe" parameter to the submission command in `man qsub`.
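>>>>
>>>> Combined with the memory request from your example, a submission might
>>>> then read (sketch):
>>>>
>>>> qsub -l virtual_free=24G -pe smp 4 myjob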
>>>>
>>>> -- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users