Thank you all for trying to work this out.
>> allocation_rule $fill_up <--- works better for parallel jobs
I do want my job to run on one machine only.
>> control_slaves TRUE <--- you want tight integration with SGE
>> job_is_first_task <--- can go either way, unless you are sure your software
>> will control job distribution...
The job will be controlled by my software, not SGE. I only need SGE to keep
track of the slots (i.e. CPU cores).
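Based on that, my understanding is that the "cores" PE shown further down this
thread should already be roughly right for my case, i.e. something like this
(just a sketch of how I read the settings, please correct me if I'm wrong):

# qconf -sp cores
pe_name            cores
slots              999
allocation_rule    $pe_slots    # all slots of a job stay on one machine
control_slaves     FALSE        # my software starts its own threads, not SGE
job_is_first_task  TRUE         # the submitted command itself does the work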
-------------------------------------------------
There were no messages on qmaster or ibm038. The job I submitted is not in
error, it's just waiting for free slots.
-------------------------------------------------
I changed the queue's slots setting and removed all other PEs, but I got the
same error.
# qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list cores
rerun FALSE
slots 8
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
# qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
Your job 92 ("xclock") has been submitted
# qstat
job-ID  prior    name    user   state  submit/start at       queue  slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    91  0.55500  xclock  johnt  qw     12/12/2016 13:54:02               2
    92  0.00000  xclock  johnt  qw     12/12/2016 13:55:59               2
# qalter -w p 92
Job 92 cannot run in queue "pc.q" because it is not contained in its hard queue list (-q)
Job 92 cannot run in queue "sim.q" because it is not contained in its hard queue list (-q)
Job 92 cannot run in queue "all.q@ibm021" because it is not contained in its hard queue list (-q)
Job 92 cannot run in queue "all.q@ibm037" because it is not contained in its hard queue list (-q)
Job 92 cannot run in PE "cores" because it only offers 0 slots
verification: no suitable queues
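Is there anything else I should check? The only commands I know of for looking
at slot limits are these (listing them here in case I am using the wrong ones):

# qconf -se ibm038   # exec host: is there a "slots" entry under complex_values?
# qconf -srqsl       # list resource quota sets
# qconf -srqs        # show resource quota sets, which can also cap slots
# qstat -g c         # cluster queue summary: used/available slots per queue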
-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Coleman, Marcus [JRDUS Non-J&J]
Sent: Monday, December 12, 2016 1:35
To: [email protected]
Subject: Re: [gridengine users] users Digest, Vol 72, Issue 13
Hi
I am sure this is your problem: you are submitting a job that requires 2
cores to a queue that has only 1 slot available.
If your hosts all have the same number of cores, there is no reason to
separate them with commas. That is only needed if the hosts have different
numbers of slots or you want to tune the slots per host:
slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
slots 8
I would only list the PE I am actually requesting, unless you plan to use
each of those PEs:
pe_list make mpi smp cores
pe_list cores
Also, since you mentioned a parallel environment, I WOULD change the
allocation rule to $fill_up unless your software (not SGE) controls job
distribution.
qconf -sp cores
allocation_rule $pe_slots <--- (ONLY USE ONE MACHINE)
control_slaves FALSE <--- (I think you want tight integration)
job_is_first_task TRUE <--- (this is true if the first job submitted only kicks off other jobs)
allocation_rule $fill_up <--- works better for parallel jobs
control_slaves TRUE <--- you want tight integration with SGE
job_is_first_task <--- can go either way, unless you are sure your software
will control job distribution...
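If you do change them, the settings above are edited in place with the usual
qconf commands (just a sketch, adjust the names to your setup):

qconf -mp cores   # edit the "cores" PE (allocation_rule, control_slaves, ...)
qconf -mq all.q   # edit the all.q cluster queue (slots, pe_list, ...)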
Also, what do the qmaster messages and the associated node's SGE messages say?
-----Original Message-----
Date: Mon, 12 Dec 2016 05:04:33 +0000
From: John_Tai <[email protected]>
To: Christopher Heiny <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [gridengine users] CPU complex
# qconf -sq all.q
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpi smp cores
rerun FALSE
slots 1,[ibm021=8],[ibm037=8],[ibm038=8]
tmpdir /tmp
shell /bin/sh
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
From: Christopher Heiny [mailto:[email protected]]
Sent: Monday, December 12, 2016 12:22
To: John_Tai
Cc: [email protected]; Reuti
Subject: Re: [gridengine users] CPU complex
On Dec 11, 2016 5:11 PM, "John_Tai" <[email protected]> wrote:
I associated the queue with the PE:
qconf -aattr queue pe_list cores all.q
The only slots were defined in the all.q queue, plus the total slots in the PE:
>> # qconf -sp cores
>> pe_name cores
>> slots 999
>> user_lists NONE
>> xuser_lists NONE
Do I need to define slots in another way for each exec host? Is there a way to
check the current free slots for a host, other than the qstat -f below?
> # qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
> ---------------------------------------------------------------------------------
> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
> ---------------------------------------------------------------------------------
> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
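> (For reference, the only other views of slot usage I know of are qhost -q and
> qstat -g c, e.g.:
>
> # qhost -q     # per-host list of queues with used/total slots
> # qstat -g c   # cluster-wide summary of used/available slots per cluster queue
>
> but maybe there is something better.)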
What is the output of the command
qconf -sq all.q
? (I think that's the right one)
Chris
-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Saturday, December 10, 2016 5:40
To: John_Tai
Cc: [email protected]
Subject: Re: [gridengine users] CPU complex
On 09.12.2016 at 10:36, John_Tai wrote:
> 8 slots:
>
> # qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@ibm021                   BIP   0/0/8          0.02     lx-amd64
> ---------------------------------------------------------------------------------
> all.q@ibm037                   BIP   0/0/8          0.00     lx-amd64
> ---------------------------------------------------------------------------------
> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
> ---------------------------------------------------------------------------------
> pc.q@ibm021                    BIP   0/0/1          0.02     lx-amd64
> ---------------------------------------------------------------------------------
> sim.q@ibm021                   BIP   0/0/1          0.02     lx-amd64
Is there any limit on slots defined in the exechost, or in an RQS?
-- Reuti
>
> ############################################################################
>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
> ############################################################################
> 89 0.55500 xclock johnt qw 12/09/2016 15:14:25 2
>
>
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, December 09, 2016 3:46
> To: John_Tai
> Cc: [email protected]
> Subject: Re: [gridengine users] CPU complex
>
> Hi,
>
> On 09.12.2016 at 08:20, John_Tai wrote:
>
>> I've setup PE but I'm having problems submitting jobs.
>>
>> - Here's the PE I created:
>>
>> # qconf -sp cores
>> pe_name cores
>> slots 999
>> user_lists NONE
>> xuser_lists NONE
>> start_proc_args /bin/true
>> stop_proc_args /bin/true
>> allocation_rule $pe_slots
>> control_slaves FALSE
>> job_is_first_task TRUE
>> urgency_slots min
>> accounting_summary FALSE
>> qsort_args NONE
>>
>> - I've then added this to all.q:
>>
>> qconf -aattr queue pe_list cores all.q
>
> How many "slots" were defined in the queue definition for all.q?
>
> -- Reuti
>
>
>> - Now I submit a job:
>>
>> # qsub -V -b y -cwd -now n -pe cores 2 -q all.q@ibm038 xclock
>> Your job 89 ("xclock") has been submitted
>> # qstat
>> job-ID  prior    name    user   state  submit/start at       queue  slots  ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>     89  0.00000  xclock  johnt  qw     12/09/2016 15:14:25               2
>> # qalter -w p 89
>> Job 89 cannot run in PE "cores" because it only offers 0 slots
>> verification: no suitable queues
>> # qstat -f
>> queuename                      qtype resv/used/tot. load_avg arch          states
>> ---------------------------------------------------------------------------------
>> all.q@ibm038                   BIP   0/0/8          0.00     lx-amd64
>>
>> ############################################################################
>>  - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
>> ############################################################################
>> 89 0.55500 xclock johnt qw 12/09/2016 15:14:25 2
>>
>>
>> ----------------------------------------------------
>>
>> It looks like all.q@ibm038 should have 8 free slots, so why is it only
>> offering 0?
>>
>> Hope you can help me.
>> Thanks
>> John
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:[email protected]]
>> Sent: Monday, December 05, 2016 6:32
>> To: John_Tai
>> Cc: [email protected]
>> Subject: Re: [gridengine users] CPU complex
>>
>> Hi,
>>
>>> On 05.12.2016 at 09:36, John_Tai <[email protected]> wrote:
>>>
>>> Thank you so much for your reply!
>>>
>>>>> Will you use the consumable virtual_free here instead of mem?
>>>
>>> Yes I meant to write virtual_free, not mem. Apologies.
>>>
>>>>> For parallel jobs you need to configure one (or more) so-called PEs
>>>>> (Parallel Environments).
>>>
>>> My jobs are actually just one process that uses multiple cores; for example,
>>> in top one process "simv" is currently using 2 CPU cores (200%).
>>
>> Yes, then it's a parallel job for SGE. Although the entries for
>> start_proc_args and stop_proc_args can be left at their defaults, a PE is the
>> paradigm in SGE for a parallel job.
>>
>>
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 3017 kelly 20 0 3353m 3.0g 165m R 200.0 0.6 15645:46 simv
>>>
>>> So I'm not sure PE is suitable for my case, since it is not multiple
>>> parallel processes running at the same time. Am I correct?
>>>
>>> If so, I am trying to find a way to get SGE to keep track of the number of
>>> cores used, but I believe it only keeps track of the total CPU usage in %.
>>> I guess I could use this and the <total num cores> to get the <num of cores
>>> in use>, but how would I integrate that into SGE?
>>
>> You can specify the number of cores your job needs with the -pe parameter,
>> which can also be a range. The allocation granted by SGE can be checked in
>> the job script via $NHOSTS, $NSLOTS and $PE_HOSTFILE.
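>>
>> For illustration, a minimal job script using these variables might look like
>> this (just a sketch; "simv" stands in for the actual multi-threaded binary):
>>
>> #!/bin/sh
>> #$ -pe cores 2
>> #$ -cwd
>> echo "Got $NSLOTS slots on $NHOSTS host(s)"
>> cat $PE_HOSTFILE    # one line per host: hostname, granted slots, queue, ...
>> ./simv              # the binary then uses up to $NSLOTS cores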
>>
>> With this setup, SGE will track the number of cores in use per machine. The
>> available cores are defined in the queue definition. In case you have more
>> than one queue per exechost, you additionally need to set an overall limit on
>> the cores that can be used at the same time, to avoid oversubscription.
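>>
>> Such an overall limit is usually set either on the exec host or in an RQS
>> (a sketch; the value 8 is just the core count from this thread):
>>
>> # per exec host, via qconf -me <hostname>:
>> complex_values        slots=8
>>
>> # or cluster-wide via a resource quota set (qconf -arqs):
>> {
>>    name         max_slots_per_host
>>    enabled      TRUE
>>    limit        hosts {*} to slots=8
>> }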
>>
>> -- Reuti
>>
>>> Thank you again for your help.
>>>
>>> John
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:[email protected]]
>>> Sent: Monday, December 05, 2016 4:21
>>> To: John_Tai
>>> Cc: [email protected]
>>> Subject: Re: [gridengine users] CPU complex
>>>
>>> Hi,
>>>
>>> On 05.12.2016 at 08:00, John_Tai wrote:
>>>
>>>> Newbie here, hope to understand SGE usage.
>>>>
>>>> I've successfully configured virtual_free as a complex for telling SGE how
>>>> much memory is needed when submitting a job, as described here:
>>>>
>>>> https://docs.oracle.com/cd/E19957-01/820-0698/6ncdvjclk/index.html#i1000029
>>>>
>>>> How do I do the same for telling SGE how many CPU cores a job needs? For
>>>> example:
>>>>
>>>> qsub -l mem=24G,cpu=4 myjob
>>>
>>> Will you use the consumable virtual_free here instead of mem?
>>>
>>>
>>>> Obviously I'd need SGE to keep track of the actual CPU utilization on
>>>> the host, just as virtual_free is being tracked independently of the SGE
>>>> jobs.
>>>
>>> For parallel jobs you need to configure one (or more) so-called PEs (Parallel
>>> Environments). Their purpose is to make preparations for parallel jobs, such
>>> as rearranging the list of granted slots or preparing shared directories
>>> between the nodes.
>>>
>>> These PEs were of higher importance in former times, when parallel libraries
>>> were not yet programmed to integrate automatically with SGE for a tight
>>> integration. Your submissions could read:
>>>
>>> qsub -pe smp 4 myjob     # allocation_rule $pe_slots, control_slaves true
>>> qsub -pe orte 16 myjob   # allocation_rule $round_robin, control_slaves true
>>>
>>> where smp and orte are the chosen parallel environments for OpenMP and Open
>>> MPI respectively. Their settings are explained in `man sge_pe`; the "-pe"
>>> parameter of the submission command is described in `man qsub`.
>>>
>>> -- Reuti