> On 12.12.2016 at 07:02, John_Tai <[email protected]> wrote:
>
> Thank you all for trying to work this out.
>
>
>
>>> allocation_rule $fill_up <--- works better for parallel jobs
>
> I do want my job to run on one machine only
>
>>> control_slaves TRUE <--- you want tight integration with SGE
>>> job_is_first_task <--- can go either way, unless you are sure your
>>> software will control job distro...
>
> And the job will be controlled by my software, not SGE. I only need SGE to
> keep track of the slots (i.e. CPU cores).
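>
> For reference, this maps to the PE settings I created (summary sketch; the
> full dump is further down):
>
> # qconf -sp cores
> allocation_rule    $pe_slots    # all requested slots on a single machine
> control_slaves     FALSE        # loose integration; my software manages itself
> job_is_first_task  TRUE         # the submitted process itself does the work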
>
> -------------------------------------------------
>
> There were no messages on qmaster or ibm038. The job I submitted is not in
> error, it's just waiting for free slots.
>
> -------------------------------------------------
>
> I changed queue slots setting and removed all other PE, but I got the same
> error.
>
>
> # qconf -sq all.q
> qname all.q
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
Unless you intend to oversubscribe, the above can be set to NONE. In fact, the
scheduler looks ahead at the load a new job is expected to add, and together with:
$ qconf -ssconf
...
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
this can lead to the effect that the job can't be scheduled. These can even be
adjusted to read:
job_load_adjustments NONE
load_adjustment_decay_time 0:0:0
In your current case, of course, where 8 slots are defined and you are testing
with 2, this shouldn't be a problem.
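If you want to try that, the scheduler configuration can be edited in place
with the standard command:

$ qconf -msconf

and the two entries changed as shown above.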
Did you set up and/or request any memory per machine?
OTOH: if you submit 2 single-CPU jobs to node ibm038, are they scheduled?
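E.g., reusing your xclock test (a sketch):

$ qsub -b y -cwd -now n -q all.q@ibm038 xclock
$ qsub -b y -cwd -now n -q all.q@ibm038 xclock
$ qstat -f -q all.q@ibm038

If both are dispatched, the plain slot accounting is fine and the problem is
specific to the PE request.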
-- Reuti
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list cores
> rerun FALSE
> slots 8
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
> Your job 92 ("xclock") has been submitted
> # qstat
> job-ID  prior    name    user   state  submit/start at      queue  slots ja-task-ID
> -----------------------------------------------------------------------------------------------------------------
>      91 0.55500  xclock  johnt  qw     12/12/2016 13:54:02          2
>      92 0.00000  xclock  johnt  qw     12/12/2016 13:55:59          2
> # qalter -w p 92
> Job 92 cannot run in queue "pc.q" because it is not contained in its hard
> queue list (-q)
> Job 92 cannot run in queue "sim.q" because it is not contained in its hard
> queue list (-q)
> Job 92 cannot run in queue "all.q@ibm021" because it is not contained in its
> hard queue list (-q)
> Job 92 cannot run in queue "all.q@ibm037" because it is not contained in its
> hard queue list (-q)
> Job 92 cannot run in PE "cores" because it only offers 0 slots
> verification: no suitable queues
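
To narrow down why the PE offers 0 slots, a few standard checks that might
help (a sketch; see also the question about exec host limits and RQS further
down):

# qconf -se ibm038     # a "slots" entry under complex_values would cap the host
# qconf -srqsl         # list resource quota sets, then: qconf -srqs <name>
# qstat -g c           # per-cluster-queue summary of used/available slots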
>
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Coleman, Marcus [JRDUS Non-J&J]
> Sent: Monday, December 12, 2016 1:35
> To: [email protected]
> Subject: Re: [gridengine users] users Digest, Vol 72, Issue 13
>
> Hi
>
> I am sure this is your problem... You are submitting a job that requires 2
> cores to a queue that has only 1 slot available.
> If your hosts all have the same number of cores, there is no reason to separate
> them with commas. That is only needed if the hosts have different numbers of
> slots, or if you want to manage slots per host...
>
> slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
> slots 8
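>
> To switch from the per-host list to the flat value, something like this
> should work:
>
> qconf -mattr queue slots 8 all.q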
>
>
> I would list only the PE I am requesting... unless you plan to use each of
> those PEs:
> pe_list make mpi smp cores
> pe_list cores
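>
> E.g. (note that -mattr replaces the whole list, unlike the -aattr you used
> earlier to append):
>
> qconf -mattr queue pe_list cores all.q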
>
>
> Also, you mentioned the parallel environment. I WOULD change the allocation
> rule to $fill_up, unless your software (not SGE) controls job distribution...
>
> qconf -sp cores
> allocation_rule $pe_slots <--- (only use one machine)
> control_slaves FALSE <--- (I think you want tight integration)
> job_is_first_task TRUE <--- (true if the job submitted first only kicks
> off other tasks)
>
> allocation_rule $fill_up <--- works better for parallel jobs
> control_slaves TRUE <--- you want tight integration with SGE
> job_is_first_task <--- can go either way, unless you are sure your software
> will control job distro...
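>
> Either variant can then be applied by editing the PE in place:
>
> qconf -mp cores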
>
>
> Also, what do the qmaster messages and the SGE messages on the associated node say...
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of [email protected]
> Sent: Sunday, December 11, 2016 9:05 PM
> To: [email protected]
> Subject: [EXTERNAL] users Digest, Vol 72, Issue 13
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 12 Dec 2016 05:04:33 +0000
> From: John_Tai <[email protected]>
> To: Christopher Heiny <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Subject: Re: [gridengine users] CPU complex
> Message-ID: <EB25FF8EBBD4BC478EF05F2F4C436479021D2BA5A2@shex-d02>
> Content-Type: text/plain; charset="utf-8"
>
> # qconf -sq all.q
> qname all.q
> hostlist @allhosts
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list make mpi smp cores
> rerun FALSE
> slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
> tmpdir /tmp
> shell /bin/sh
> prolog NONE
> epilog NONE
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists NONE
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
>
>
>
> From: Christopher Heiny [mailto:[email protected]]
> Sent: Monday, December 12, 2016 12:22
> To: John_Tai
> Cc: [email protected]; Reuti
> Subject: Re: [gridengine users] CPU complex
>
>
>
> On Dec 11, 2016 5:11 PM, "John_Tai"
> <[email protected]<mailto:[email protected]>> wrote:
> I associated the queue with the PE:
>
> qconf -aattr queue pe_list cores all.q
>
> The only slots defined were in the all.q queue, plus the total slots in the PE:
>
>>> # qconf -sp cores
>>> pe_name cores
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
> Do I need to define slots in another way for each exec host? Is there a way
> to check the current free slots for a host, other than the qstat -f below?
>
>> # qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>
> What is the output of the command
> qconf -sq all.q
> (I think that's the right one)?
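>
> Also, besides qstat -f, `qstat -g c` gives a per-cluster-queue slot summary,
> and `qhost -q` a per-host view. A sketch of what I'd expect here:
>
> # qstat -g c
> CLUSTER QUEUE   CQLOAD   USED   AVAIL  TOTAL  aoACDS  cdsuE
> all.q             0.01      0      24     24       0      0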
>
> Chris
>
>
>
>
>
>
> -----Original Message-----
> From: Reuti
> [mailto:[email protected]<mailto:[email protected]>]
> Sent: Saturday, December 10, 2016 5:40
> To: John_Tai
> Cc: [email protected]<mailto:[email protected]>
> Subject: Re: [gridengine users] CPU complex
>
> On 09.12.2016 at 10:36, John_Tai wrote:
>
>> 8 slots:
>>
>> # qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>> ---------------------------------------------------------------------------------
>> pc.q@ibm021                    BIP   0/0/1          0.02     lx-amd64
>> ---------------------------------------------------------------------------------
>> sim.q@ibm021                   BIP   0/0/1          0.02     lx-amd64
>
> Is there any limit of slots in the exechost defined, or in an RQS?
>
> -- Reuti
>
>
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>>
>>
>>
>> -----Original Message-----
>> From: Reuti
>> [mailto:[email protected]<mailto:[email protected]>]
>> Sent: Friday, December 09, 2016 3:46
>> To: John_Tai
>> Cc: [email protected]<mailto:[email protected]>
>> Subject: Re: [gridengine users] CPU complex
>>
>> Hi,
>>
>> On 09.12.2016 at 08:20, John_Tai wrote:
>>
>>> I've setup PE but I'm having problems submitting jobs.
>>>
>>> - Here's the PE I created:
>>>
>>> # qconf -sp cores
>>> pe_name cores
>>> slots 999
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $pe_slots
>>> control_slaves FALSE
>>> job_is_first_task TRUE
>>> urgency_slots min
>>> accounting_summary FALSE
>>> qsort_args NONE
>>>
>>> - I've then added this to all.q:
>>>
>>> qconf -aattr queue pe_list cores all.q
>>
>> How many "slots" were defined in the queue definition for all.q?
>>
>> -- Reuti
>>
>>
>>> - Now I submit a job:
>>>
>>> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
>>> Your job 89 ("xclock") has been submitted
>>> # qstat
>>> job-ID  prior    name    user   state  submit/start at      queue  slots ja-task-ID
>>> -----------------------------------------------------------------------------------------------------------------
>>>      89 0.00000  xclock  johnt  qw     12/09/2016 15:14:25          2
>>> # qalter -w p 89
>>> Job 89 cannot run in PE "cores" because it only offers 0 slots
>>> verification: no suitable queues
>>> # qstat -f
>>> queuename                      qtype resv/used/tot. load_avg arch          states
>>> ---------------------------------------------------------------------------------
>>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>>>
>>> ############################################################################
>>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>>> ############################################################################
>>>     89 0.55500 xclock     johnt        qw    12/09/2016 15:14:25     2
>>>
>>>
>>> ----------------------------------------------------
>>>
>>> It looks like all.q@ibm038 should have 8 free slots,
>>> so why is it only offering 0?
>>>
>>> Hope you can help me.
>>> Thanks
>>> John
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti
>>> [mailto:[email protected]<mailto:[email protected]>
>>> ]
>>> Sent: Monday, December 05, 2016 6:32
>>> To: John_Tai
>>> Cc: [email protected]<mailto:[email protected]>
>>> Subject: Re: [gridengine users] CPU complex
>>>
>>> Hi,
>>>
>>>> On 05.12.2016 at 09:36, John_Tai <[email protected]> wrote:
>>>>
>>>> Thank you so much for your reply!
>>>>
>>>>>> Will you use the consumable virtual_free here instead of mem?
>>>>
>>>> Yes I meant to write virtual_free, not mem. Apologies.
>>>>
>>>>>> For parallel jobs you need to configure a (or some) so called PE
>>>>>> (Parallel Environment).
>>>>
>>>> My jobs are actually just one process which uses multiple cores; for
>>>> example, in top one process "simv" is currently using 2 CPU cores (200%).
>>>
>>> Yes, then it's a parallel job for SGE. Although the entries for
>>> start_proc_args and stop_proc_args can be left at their defaults, a PE is
>>> SGE's paradigm for a parallel job.
>>>
>>>
>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>> 3017 kelly 20 0 3353m 3.0g 165m R 200.0 0.6 15645:46 simv
>>>>
>>>> So I'm not sure PE is suitable for my case, since it is not multiple
>>>> parallel processes running at the same time. Am I correct?
>>>>
>>>> If so, I am trying to find a way to get SGE to keep track of the number of
>>>> cores used, but I believe it only keeps track of the total CPU usage in %.
>>>> I guess I could use this and the <total num cores> to get the <num of
>>>> cores in use>, but how do I integrate it in SGE?
>>>
>>> You can specify the necessary number of cores for your job with the -pe
>>> parameter, which can also be a range. The allocation granted by SGE can be
>>> checked in the job script via $NHOSTS, $NSLOTS, and $PE_HOSTFILE.
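>>>
>>> For example, a minimal job script using them might read (a sketch; the
>>> -threads flag for simv is hypothetical, substitute whatever your software
>>> expects):
>>>
>>> #!/bin/sh
>>> #$ -pe cores 2-4            # request a range: SGE grants 2 to 4 slots
>>> echo "granted $NSLOTS slots on $NHOSTS host(s)"
>>> cat "$PE_HOSTFILE"          # per line: host, slots, queue, processor range
>>> ./simv -threads "$NSLOTS"   # hypothetical: pass the granted core count along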
>>>
>>> With this setup, SGE will track the number of used cores per machine. The
>>> available ones you define in the queue definition. In case you have more
>>> than one queue per exec host, you additionally need to set up an overall
>>> limit on the cores usable at the same time, to avoid oversubscription.
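>>>
>>> One common way to set such a limit is a consumable slots entry on the exec
>>> host itself, e.g.:
>>>
>>> qconf -aattr exechost complex_values slots=8 ibm038
>>>
>>> which caps the slots used on ibm038 across all queues (an RQS would work
>>> too).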
>>>
>>> -- Reuti
>>>
>>>> Thank you again for your help.
>>>>
>>>> John
>>>>
>>>> -----Original Message-----
>>>> From: Reuti
>>>> [mailto:[email protected]<mailto:[email protected]
>>>>> ]
>>>> Sent: Monday, December 05, 2016 4:21
>>>> To: John_Tai
>>>> Cc: [email protected]<mailto:[email protected]>
>>>> Subject: Re: [gridengine users] CPU complex
>>>>
>>>> Hi,
>>>>
>>>> On 05.12.2016 at 08:00, John_Tai wrote:
>>>>
>>>>> Newbie here, hope to understand SGE usage.
>>>>>
>>>>> I've successfully configured virtual_free as a complex for telling SGE
>>>>> how much memory is needed when submitting a job, as described here:
>>>>>
>>>>> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
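>>>>>
>>>>> (For the record, the consumable entry in qconf -mc ended up along these
>>>>> lines, as a sketch:
>>>>>
>>>>> virtual_free  vf  MEMORY  <=  YES  YES  0  0
>>>>> )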
>>>>>
>>>>> How do I do the same for telling SGE how many CPU cores a job needs? For
>>>>> example:
>>>>>
>>>>> qsub -l mem=24G,cpu=4 myjob
>>>>
>>>> Will you use the consumable virtual_free here instead of mem?
>>>>
>>>>
>>>>> Obviously I'd need SGE to keep track of the actual CPU utilization on the
>>>>> host, just as virtual_free is tracked independently of the SGE jobs.
>>>>
>>>> For parallel jobs you need to configure one (or several) so-called PEs
>>>> (Parallel Environments). Their purpose is to make preparations for the
>>>> parallel job, like rearranging the list of granted slots, preparing shared
>>>> directories between the nodes, ...
>>>>
>>>> These PEs were more important in former times, when parallel libraries
>>>> were not programmed to integrate automatically with SGE for a tight
>>>> integration. Your submissions could read:
>>>>
>>>> qsub -pe smp 4 myjob    # allocation_rule $pe_slots, control_slaves true
>>>> qsub -pe orte 16 myjob  # allocation_rule $round_robin, control_slaves true
>>>>
>>>> where smp and orte are the chosen parallel environments for OpenMP and
>>>> Open MPI, respectively. Their settings are explained in `man sge_pe`, and
>>>> the "-pe" parameter to the submission command in `man qsub`.
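>>>>
>>>> Combined with the memory request from your example, a submission might
>>>> then read (sketch):
>>>>
>>>> qsub -l virtual_free=24G -pe smp 4 myjob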
>>>>
>>>> -- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users