Hi,

On 21.08.2014 at 01:56, tmish...@jcity.maeda.co.jp wrote:

> Reuti,
> 
> Sorry for confusing you. Under a managed condition, the -np option is
> actually not necessary. So this command line also works for me
> with Torque.
> 
> $ qsub -l nodes=10:ppn=N
> $ mpirun -map-by slot:pe=N ./inverse.exe

Aha, yes. Works in SGE too.

To make the thread count notation generic, what about an extension like:

-map-by slot:pe=omp

where the literal "omp" would trigger the use of $OMP_NUM_THREADS instead?
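
Until then, the same effect can be had by expanding the variable on the command line, e.g. (a minimal sketch, assuming OMP_NUM_THREADS is already exported to the job environment):

mpirun -map-by slot:pe=$OMP_NUM_THREADS ./inverse.exe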

-- Reuti


> At least, Ralph confirmed it worked with Slurm and I confirmed
> with Torque as shown below:
> 
> [mishima@manage ~]$ qsub -I -l nodes=4:ppn=8
> qsub: waiting for job 8798.manage.cluster to start
> qsub: job 8798.manage.cluster ready
> 
> [mishima@node09 ~]$ cat $PBS_NODEFILE
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node09
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> node12
> [mishima@node09 ~]$ mpirun -map-by slot:pe=8 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8050,1] offset 0
> 
> ========================   JOB MAP   ========================
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [8050,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [8050,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [8050,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [8050,1] App: 0 Process rank: 3
> 
> =============================================================
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 3 of 4
> Hello world from process 1 of 4
> [mishima@node09 ~]$ mpirun -map-by slot:pe=4 -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [8056,1] offset 0
> 
> ========================   JOB MAP   ========================
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 2
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 0
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 1
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 2
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 2
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 3
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 2
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 4
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 5
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 2
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 6
>        Process OMPI jobid: [8056,1] App: 0 Process rank: 7
> 
> =============================================================
> Hello world from process 1 of 8
> Hello world from process 0 of 8
> Hello world from process 2 of 8
> Hello world from process 3 of 8
> Hello world from process 4 of 8
> Hello world from process 5 of 8
> Hello world from process 6 of 8
> Hello world from process 7 of 8
> 
> I don't know why it doesn't work with SGE. Could you show me
> your output with the -display-map and -mca rmaps_base_verbose 5 options added?
> 
> By the way, the option -map-by ppr:N:node or ppr:N:socket might be useful for your purpose.
> The ppr mapping can reduce the slot count given by the RM without binding and allocates N procs per specified resource.
> 
> [mishima@node09 ~]$ mpirun -map-by ppr:1:node -display-map
> ~/mis/openmpi/demos/myprog
> Data for JOB [7913,1] offset 0
> 
> ========================   JOB MAP   ========================
> 
> Data for node: node09  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [7913,1] App: 0 Process rank: 0
> 
> Data for node: node10  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [7913,1] App: 0 Process rank: 1
> 
> Data for node: node11  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [7913,1] App: 0 Process rank: 2
> 
> Data for node: node12  Num slots: 8    Max slots: 0    Num procs: 1
>        Process OMPI jobid: [7913,1] App: 0 Process rank: 3
> 
> =============================================================
> Hello world from process 0 of 4
> Hello world from process 2 of 4
> Hello world from process 1 of 4
> Hello world from process 3 of 4
> 
> Tetsuya
> 
> 
>> Hi,
>> 
>> On 20.08.2014 at 13:26, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> Reuti,
>>> 
>>> If you want to allocate 10 procs with N threads, the Torque
>>> script below should work for you:
>>> 
>>> qsub -l nodes=10:ppn=N
>>> mpirun -map-by slot:pe=N -np 10 -x OMP_NUM_THREADS=N ./inverse.exe
>> 
>> I played around with giving -np 10 in addition to a Tight Integration. The slot count is not really divided I think,
>> but only 10 out of the granted maximum are used (while on each of the listed machines an `orted` is started).
>> Due to the fixed allocation this is of course the result we want to achieve, as it subtracts bunches of 8 from the given list of machines resp. slots.
>> In SGE it's sufficient to use the following, and AFAICS it works (without touching the $PE_HOSTFILE any longer):
>> 
>> ===
>> export OMP_NUM_THREADS=8
>> mpirun -map-by slot:pe=$OMP_NUM_THREADS -np $(bc <<<"$NSLOTS / $OMP_NUM_THREADS") ./inverse.exe
>> ===
>> 
>> and submit with:
>> 
>> $ qsub -pe orte 80 job.sh
>> 
>> as the variables are distributed to the slave nodes by SGE already.
>> 
>> Nevertheless, using -np in addition to the Tight Integration gives a taste of a kind of half-tight integration in some way.
>> And it would not work for us, because "--bind-to none" can't be used in such a command (see below) and throws an error.
>> 
>> 
>>> Then Open MPI automatically reduces the logical slot count to 10
>>> by dividing the real slot count 10N by the binding width N.
>>> 
>>> I don't know why you want to use pe=N without binding, but unfortunately
>>> Open MPI so far allocates successive cores to each process when you
>>> use the pe option - it forcibly binds to core.
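>>>
>>> As a concrete illustration, taking N=8 (a sketch of the arithmetic only):
>>>
>>> $ qsub -l nodes=10:ppn=8                  # Torque grants 10 x 8 = 80 slots
>>> $ mpirun -map-by slot:pe=8 ./inverse.exe  # 80 / 8 = 10 ranks, each bound to 8 consecutive cores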
>> 
>> In a shared cluster with many users and different MPI libraries in use,
>> only the queuing system can know which cores were granted to which job.
>> This avoids oversubscribing some cores while others are idle.
>> 
>> -- Reuti
>> 
>> 
>>> Tetsuya
>>> 
>>> 
>>>> Hi,
>>>> 
>>>> On 20.08.2014 at 06:26, Tetsuya Mishima wrote:
>>>> 
>>>>> Reuti and Oscar,
>>>>> 
>>>>> I'm a Torque user and I have never used SGE myself, so I hesitated to join
>>>>> the discussion.
>>>>> 
>>>>> From my experience with Torque, the Open MPI 1.8 series has already
>>>>> resolved the issue you pointed out in combining MPI with OpenMP.
>>>>> 
>>>>> Please try adding the --map-by slot:pe=8 option if you want to use 8 threads.
>>>>> Then Open MPI 1.8 should allocate the processes properly without any
>>>>> modification of the hostfile provided by Torque.
>>>>> 
>>>>> In your case (8 threads and 10 procs):
>>>>> 
>>>>> # you have to request 80 slots using SGE command before mpirun
>>>>> mpirun --map-by slot:pe=8 -np 10 ./inverse.exe
>>>> 
>>>> Thx for pointing me to this option, for now I can't get it working though
>>>> (in fact, I essentially want to use it without binding). This allows telling
>>>> Open MPI to bind more cores to each of the MPI processes - ok, but does it
>>>> also lower the slot count granted by Torque? I mean, was your submission command like:
>>>> 
>>>> $ qsub -l nodes=10:ppn=8 ...
>>>> 
>>>> so that Torque knows that it should grant and remember this slot count
>>>> of a total of 80 for the correct accounting?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> where you can omit the --bind-to option, because --bind-to core is assumed
>>>>> as the default when pe=N is provided by the user.
>>>>> Regards,
>>>>> Tetsuya
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 19.08.2014 at 19:06, Oscar Mojica wrote:
>>>>>> 
>>>>>>> I discovered what the error was. I forgot to include '-fopenmp' when
>>>>>>> I compiled the objects in the Makefile, so the program worked but it didn't
>>>>>>> divide the job into threads. Now the program is working and I can use
>>>>>>> up to 15 cores per machine in the queue one.q.
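>>>>>>>
>>>>>>> For reference, the fix amounts to adding the flag to the per-file compile step as well, e.g. (a sketch for one of the objects, using the mpif90 path from the Makefile quoted below):
>>>>>>>
>>>>>>> /usr/bin/mpif90 -fopenmp -c funcpdf.f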
>>>>>>> 
>>>>>>> Anyway I would like to try to implement your advice. Well, I'm not alone
>>>>>>> in the cluster, so I must implement your second suggestion. The steps are:
>>>>>>> 
>>>>>>> a) Use '$ qconf -mp orte' to change the allocation rule to 8
>>>>>> 
>>>>>> Was the number of slots defined in the one.q you use also increased to 8
>>>>>> (`qconf -sq one.q`)?
>>>>>> 
>>>>>> 
>>>>>>> b) Set '#$ -pe orte 80' in the script
>>>>>> 
>>>>>> Fine.
>>>>>> 
>>>>>> 
>>>>>>> c) I'm not sure how to do this step. I'd appreciate your help here. I
>>>>>>> can add some lines to the script to determine the PE_HOSTFILE path and
>>>>>>> contents, but I don't know how to alter it.
>>>>>> 
>>>>>> For now you can put this in your jobscript (just after OMP_NUM_THREADS is
>>>>>> exported):
>>>>>>
>>>>>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines
>>>>>> export PE_HOSTFILE=$TMPDIR/machines
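>>>>>>
>>>>>> To illustrate what the awk line does (a sketch; the usual $PE_HOSTFILE layout is "host slots queue processor-range"): with OMP_NUM_THREADS=8, an entry like
>>>>>>
>>>>>> compute-1-2.local 8 one.q@compute-1-2.local UNDEFINED
>>>>>>
>>>>>> would become
>>>>>>
>>>>>> compute-1-2.local 1 one.q@compute-1-2.local UNDEFINED
>>>>>>
>>>>>> so that Open MPI starts only one process on that host and the remaining cores stay free for its threads.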
>>>>>> 
>>>>>> =============
>>>>>> 
>>>>>> Unfortunately no one stepped into this discussion, as in my opinion it's a
>>>>>> much broader issue which targets all users who want to combine MPI with OpenMP.
>>>>>> The queuing system should get a proper request for the overall amount of
>>>>>> slots the user needs. For now this is forwarded to Open MPI, which uses this
>>>>>> information to start the appropriate number of processes (which was an
>>>>>> achievement for the out-of-the-box Tight Integration of course) and ignores
>>>>>> any setting of OMP_NUM_THREADS. So, where should the generated list of
>>>>>> machines be adjusted? There are several options:
>>>>>> 
>>>>>> a) The PE of the queuingsystem should do it:
>>>>>> 
>>>>>> + a one-time setup for the admin
>>>>>> + in SGE the "start_proc_args" of the PE could alter the $PE_HOSTFILE
>>>>>> - the "start_proc_args" would need to know the number of threads, i.e.
>>>>>> OMP_NUM_THREADS must be defined by "qsub -v ..." outside of the jobscript
>>>>>> (tricky scanning of the submitted jobscript for OMP_NUM_THREADS would be too nasty)
>>>>>> - limits calls inside the jobscript to libraries behaving in the same way as Open MPI only
>>>>>> 
>>>>>> 
>>>>>> b) The particular queue should do it in a queue prolog:
>>>>>> 
>>>>>> same as a) I think
>>>>>> 
>>>>>> 
>>>>>> c) The user should do it
>>>>>> 
>>>>>> + no change in the SGE installation
>>>>>> - each and every user must include it in all the jobscripts to adjust
>>>>>> the list and export the pointer to the $PE_HOSTFILE, though he could
>>>>>> change it back and forth for different steps of the jobscript
>>>>>> 
>>>>>> 
>>>>>> d) Open MPI should do it
>>>>>> 
>>>>>> + no change in the SGE installation
>>>>>> + no change to the jobscript
>>>>>> + OMP_NUM_THREADS can be altered for different steps of the jobscript
>>>>>> while automatically staying inside the granted allocation
>>>>>> o should MKL_NUM_THREADS be covered too (does it use OMP_NUM_THREADS already)?
>>>>>> 
>>>>>> -- Reuti
>>>>>> 
>>>>>> 
>>>>>>> echo "PE_HOSTFILE:"
>>>>>>> echo $PE_HOSTFILE
>>>>>>> echo
>>>>>>> echo "cat PE_HOSTFILE:"
>>>>>>> cat $PE_HOSTFILE
>>>>>>> 
>>>>>>> Thanks for taking the time to answer these emails, your advice has been
>>>>>>> very useful.
>>>>>>> 
>>>>>>> PS: The version of SGE is   OGS/GE 2011.11p1
>>>>>>> 
>>>>>>> 
>>>>>>> Oscar Fabian Mojica Ladino
>>>>>>> Geologist M.S. in  Geophysics
>>>>>>> 
>>>>>>> 
>>>>>>>> From: re...@staff.uni-marburg.de
>>>>>>>> Date: Fri, 15 Aug 2014 20:38:12 +0200
>>>>>>>> To: us...@open-mpi.org
>>>>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 15.08.2014 at 19:56, Oscar Mojica wrote:
>>>>>>>> 
>>>>>>>>> Yes, my installation of Open MPI is SGE-aware. I got the following:
>>>>>>>>> 
>>>>>>>>> [oscar@compute-1-2 ~]$ ompi_info | grep grid
>>>>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.2)
>>>>>>>> 
>>>>>>>> Fine.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I'm a bit slow and I didn't understand the last part of your message,
>>>>>>>>> so I made a test to resolve my doubts.
>>>>>>>>> This is the cluster configuration (there are some machines turned off,
>>>>>>>>> but that is no problem):
>>>>>>>>> 
>>>>>>>>> [oscar@aguia free-noise]$ qhost
>>>>>>>>> HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
>>>>>>>>> 
>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>> global - - - - - - -
>>>>>>>>> compute-1-10 linux-x64 16 0.97 23.6G 558.6M 996.2M 0.0
>>>>>>>>> compute-1-11 linux-x64 16 - 23.6G - 996.2M -
>>>>>>>>> compute-1-12 linux-x64 16 0.97 23.6G 561.1M 996.2M 0.0
>>>>>>>>> compute-1-13 linux-x64 16 0.99 23.6G 558.7M 996.2M 0.0
>>>>>>>>> compute-1-14 linux-x64 16 1.00 23.6G 555.1M 996.2M 0.0
>>>>>>>>> compute-1-15 linux-x64 16 0.97 23.6G 555.5M 996.2M 0.0
>>>>>>>>> compute-1-16 linux-x64 8 0.00 15.7G 296.9M 1000.0M 0.0
>>>>>>>>> compute-1-17 linux-x64 8 0.00 15.7G 299.4M 1000.0M 0.0
>>>>>>>>> compute-1-18 linux-x64 8 - 15.7G - 1000.0M -
>>>>>>>>> compute-1-19 linux-x64 8 - 15.7G - 996.2M -
>>>>>>>>> compute-1-2 linux-x64 16 1.19 23.6G 468.1M 1000.0M 0.0
>>>>>>>>> compute-1-20 linux-x64 8 0.04 15.7G 297.2M 1000.0M 0.0
>>>>>>>>> compute-1-21 linux-x64 8 - 15.7G - 1000.0M -
>>>>>>>>> compute-1-22 linux-x64 8 0.00 15.7G 297.2M 1000.0M 0.0
>>>>>>>>> compute-1-23 linux-x64 8 0.16 15.7G 299.6M 1000.0M 0.0
>>>>>>>>> compute-1-24 linux-x64 8 0.00 15.7G 291.5M 996.2M 0.0
>>>>>>>>> compute-1-25 linux-x64 8 0.04 15.7G 293.4M 996.2M 0.0
>>>>>>>>> compute-1-26 linux-x64 8 - 15.7G - 1000.0M -
>>>>>>>>> compute-1-27 linux-x64 8 0.00 15.7G 297.0M 1000.0M 0.0
>>>>>>>>> compute-1-29 linux-x64 8 - 15.7G - 1000.0M -
>>>>>>>>> compute-1-3 linux-x64 16 - 23.6G - 996.2M -
>>>>>>>>> compute-1-30 linux-x64 16 - 23.6G - 996.2M -
>>>>>>>>> compute-1-4 linux-x64 16 0.97 23.6G 571.6M 996.2M 0.0
>>>>>>>>> compute-1-5 linux-x64 16 1.00 23.6G 559.6M 996.2M 0.0
>>>>>>>>> compute-1-6 linux-x64 16 0.66 23.6G 403.1M 996.2M 0.0
>>>>>>>>> compute-1-7 linux-x64 16 0.95 23.6G 402.7M 996.2M 0.0
>>>>>>>>> compute-1-8 linux-x64 16 0.97 23.6G 556.8M 996.2M 0.0
>>>>>>>>> compute-1-9 linux-x64 16 1.02 23.6G 566.0M 1000.0M 0.0
>>>>>>>>> 
>>>>>>>>> I ran my program using only MPI with 10 processors of the queue one.q,
>>>>>>>>> which has 14 machines (compute-1-2 to compute-1-15). With 'qstat -t' I got:
>>>>>>>>> 
>>>>>>>>> [oscar@aguia free-noise]$ qstat -t
>>>>>>>>> job-ID prior name user state submit/start at queue master ja-task-ID task-ID state cpu mem io stat failed
>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-2.local MASTER r 00:49:12 554.13753 0.09163
>>>>>>>>> one.q@compute-1-2.local SLAVE
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-5.local SLAVE 1.compute-1-5 r 00:48:53 551.49022 0.09410
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-9.local SLAVE 1.compute-1-9 r 00:50:00 564.22764 0.09409
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-12.local SLAVE 1.compute-1-12 r 00:47:30 535.30379 0.09379
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-13.local SLAVE 1.compute-1-13 r 00:49:51 561.69868 0.09379
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-14.local SLAVE 1.compute-1-14 r 00:49:14 554.60818 0.09379
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-10.local SLAVE 1.compute-1-10 r 00:49:59 562.95487 0.09349
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-15.local SLAVE 1.compute-1-15 r 00:50:01 563.27221 0.09361
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-8.local SLAVE 1.compute-1-8 r 00:49:26 556.68431 0.09349
>>>>>>>>> 2726 0.50500 job oscar r 08/15/2014 12:38:21 one.q@compute-1-4.local SLAVE 1.compute-1-4 r 00:49:27 556.87510 0.04967
>>>>>>>> 
>>>>>>>> Yes, here you got 10 slots (= cores) granted by SGE. So there is no
>>>>>>>> free core left inside the allocation of SGE to allow the use of additional
>>>>>>>> cores for your threads. If you use more cores than granted by SGE, it will
>>>>>>>> oversubscribe the machines.
>>>>>>>> 
>>>>>>>> The issue is now:
>>>>>>>> 
>>>>>>>> a) If you want 8 threads per MPI process, your job will use 80 cores in
>>>>>>>> total - for now SGE isn't aware of it.
>>>>>>>>
>>>>>>>> b) Although you specified $fill_up as allocation rule, it looks like
>>>>>>>> $round_robin. Is there more than one slot defined in the queue definition
>>>>>>>> of one.q to get exclusive access?
>>>>>>>>
>>>>>>>> c) What version of SGE are you using? Certain ones use cgroups or bind
>>>>>>>> processes directly to cores (although it usually needs to be requested by
>>>>>>>> the job: first line of `qconf -help`).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> In case you are alone in the cluster, you could bypass the allocation
>>>>>>>> with b) (unless you are hit by c)). But with a mixture of users and jobs,
>>>>>>>> a different handling would be necessary to handle this in a proper way IMO:
>>>>>>>> 
>>>>>>>> a) having a PE with a fixed allocation rule of 8
>>>>>>>> 
>>>>>>>> b) requesting this PE with an overall slot count of 80
>>>>>>>> 
>>>>>>>> c) copy and alter the $PE_HOSTFILE to show only (granted core count per
>>>>>>>> machine) divided by (OMP_NUM_THREADS) per entry, and change $PE_HOSTFILE
>>>>>>>> so that it points to the altered file
>>>>>>>>
>>>>>>>> d) Open MPI with a Tight Integration will now start only N processes per
>>>>>>>> machine according to the altered hostfile, in your case one
>>>>>>>>
>>>>>>>> e) Your application can start the desired threads and you stay inside
>>>>>>>> the granted allocation (see the sketch below)
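>>>>>>>>
>>>>>>>> Put together, a minimal jobscript sketch (assuming the PE from a) keeps the name "orte" and OMP_NUM_THREADS=8):
>>>>>>>>
>>>>>>>> #!/bin/bash
>>>>>>>> #$ -pe orte 80
>>>>>>>> export OMP_NUM_THREADS=8
>>>>>>>> awk -v omp_num_threads=$OMP_NUM_THREADS '{ $2/=omp_num_threads; print }' $PE_HOSTFILE > $TMPDIR/machines
>>>>>>>> export PE_HOSTFILE=$TMPDIR/machines
>>>>>>>> /opt/openmpi/bin/mpirun ./inverse.exe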
>>>>>>>> 
>>>>>>>> -- Reuti
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> I accessed the MASTER node with 'ssh compute-1-2.local', ran $ ps -e f,
>>>>>>>>> and got this (I'm showing only the last lines):
>>>>>>>>> 
>>>>>>>>> 2506 ? Ss 0:00 /usr/sbin/atd
>>>>>>>>> 2548 tty1 Ss+ 0:00 /sbin/mingetty /dev/tty1
>>>>>>>>> 2550 tty2 Ss+ 0:00 /sbin/mingetty /dev/tty2
>>>>>>>>> 2552 tty3 Ss+ 0:00 /sbin/mingetty /dev/tty3
>>>>>>>>> 2554 tty4 Ss+ 0:00 /sbin/mingetty /dev/tty4
>>>>>>>>> 2556 tty5 Ss+ 0:00 /sbin/mingetty /dev/tty5
>>>>>>>>> 2558 tty6 Ss+ 0:00 /sbin/mingetty /dev/tty6
>>>>>>>>> 3325 ? Sl 0:04 /opt/gridengine/bin/linux-x64/sge_execd
>>>>>>>>> 17688 ? S 0:00 \_ sge_shepherd-2726 -bg
>>>>>>>>> 17695 ? Ss 0:00 \_ -bash /opt/gridengine/default/spool/compute-1-2/job_scripts/2726
>>>>>>>>> 17797 ? S 0:00 \_ /usr/bin/time -f %E /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
>>>>>>>>> 17798 ? S 0:01 \_ /opt/openmpi/bin/mpirun -v -np 10 ./inverse.exe
>>>>>>>>> 17799 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-5.local PATH=/opt/openmpi/bin:$PATH ; expo
>>>>>>>>> 17800 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-9.local PATH=/opt/openmpi/bin:$PATH ; expo
>>>>>>>>> 17801 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-12.local PATH=/opt/openmpi/bin:$PATH ; exp
>>>>>>>>> 17802 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-13.local PATH=/opt/openmpi/bin:$PATH ; exp
>>>>>>>>> 17803 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-14.local PATH=/opt/openmpi/bin:$PATH ; exp
>>>>>>>>> 17804 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-10.local PATH=/opt/openmpi/bin:$PATH ; exp
>>>>>>>>> 17805 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-15.local PATH=/opt/openmpi/bin:$PATH ; exp
>>>>>>>>> 17806 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-8.local PATH=/opt/openmpi/bin:$PATH ; expo
>>>>>>>>> 17807 ? Sl 0:00 \_ /opt/gridengine/bin/linux-x64/qrsh -inherit -nostdin -V compute-1-4.local PATH=/opt/openmpi/bin:$PATH ; expo
>>>>>>>>> 17826 ? R 31:36 \_ ./inverse.exe
>>>>>>>>> 3429 ? Ssl 0:00 automount --pid-file /var/run/autofs.pid
>>>>>>>>> 
>>>>>>>>> So the job is using the 10 machines; up to here everything is all right.
>>>>>>>>> Do you think that changing the "allocation_rule" to a number instead of
>>>>>>>>> $fill_up would make the MPI processes divide the work into that number of threads?
>>>>>>>>> 
>>>>>>>>> Thanks a lot
>>>>>>>>> 
>>>>>>>>> Oscar Fabian Mojica Ladino
>>>>>>>>> Geologist M.S. in Geophysics
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> PS: I have another doubt: what is a slot? Is it a physical core?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> From: re...@staff.uni-marburg.de
>>>>>>>>>> Date: Thu, 14 Aug 2014 23:54:22 +0200
>>>>>>>>>> To: us...@open-mpi.org
>>>>>>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I think this is a broader issue whenever an MPI library is used in
>>>>>>>>>> conjunction with threads while running inside a queuing system. First:
>>>>>>>>>> you can check whether your actual installation of Open MPI is SGE-aware with:
>>>>>>>>>> 
>>>>>>>>>> $ ompi_info | grep grid
>>>>>>>>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.5)
>>>>>>>>>> 
>>>>>>>>>> Then we can look at the definition of your PE: "allocation_rule $fill_up".
>>>>>>>>>> This means that SGE will grant you 14 slots in total in any combination on
>>>>>>>>>> the available machines, meaning an 8+4+2 slot allocation is an allowed
>>>>>>>>>> combination, like 4+4+3+3 and so on. Depending on the SGE-awareness it's a
>>>>>>>>>> question: will your application just start processes on all nodes and
>>>>>>>>>> completely disregard the granted allocation, or, at the other extreme, does
>>>>>>>>>> it stay on one and the same machine for all started processes? On the master
>>>>>>>>>> node of the parallel job you can issue:
>>>>>>>>>> 
>>>>>>>>>> $ ps -e f
>>>>>>>>>> 
>>>>>>>>>> (f w/o -) to have a look at whether `ssh` or `qrsh -inherit ...` is used
>>>>>>>>>> to reach the other machines, and at their requested process count.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Now to the common problem in such a set up:
>>>>>>>>>> 
>>>>>>>>>> AFAICS: for now there is no way in the Open MPI + SGE combination to
>>>>>>>>>> specify the number of MPI processes and the intended number of threads such
>>>>>>>>>> that both are automatically read by Open MPI while staying inside the granted
>>>>>>>>>> slot count and allocation. So it seems necessary to have the intended number
>>>>>>>>>> of threads honored by Open MPI too.
>>>>>>>>>> 
>>>>>>>>>> Hence specifying e.g. "allocation_rule 8" in such a setup while requesting
>>>>>>>>>> 32 slots would for now already start 32 MPI processes, as Open MPI reads
>>>>>>>>>> the $PE_HOSTFILE and acts accordingly.
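>>>>>>>>>>
>>>>>>>>>> For instance (a sketch; "fixed8" is a made-up PE name with allocation_rule 8):
>>>>>>>>>>
>>>>>>>>>> qsub -pe fixed8 32 job.sh              # SGE grants e.g. 4 hosts with 8 slots each
>>>>>>>>>> # inside job.sh: mpirun ./inverse.exe  # starts 32 ranks, not 4 ranks x 8 threads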
>>>>>>>>>> 
>>>>>>>>>> Open MPI would have to read the generated machine file in a slightly
>>>>>>>>>> different way regarding threads: a) read the $PE_HOSTFILE, b) divide the
>>>>>>>>>> granted slots per machine by OMP_NUM_THREADS, c) throw an error in case it's
>>>>>>>>>> not divisible by OMP_NUM_THREADS. Then start one process per quotient.
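>>>>>>>>>>
>>>>>>>>>> Roughly, a user-side emulation of a)-c) could look like this (a sketch only, not what Open MPI does internally):
>>>>>>>>>>
>>>>>>>>>> awk -v t=$OMP_NUM_THREADS '$2 % t { print "slots on " $1 " not divisible by OMP_NUM_THREADS" > "/dev/stderr"; exit 1 } { $2 /= t; print }' $PE_HOSTFILE > $TMPDIR/machines
>>>>>>>>>> export PE_HOSTFILE=$TMPDIR/machines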
>>>>>>>>>> 
>>>>>>>>>> Would this work for you?
>>>>>>>>>> 
>>>>>>>>>> -- Reuti
>>>>>>>>>> 
>>>>>>>>>> PS: This would also mean having a couple of PEs in SGE with a fixed
>>>>>>>>>> "allocation_rule". While this works right now, an extension in SGE could be
>>>>>>>>>> "$fill_up_omp"/"$round_robin_omp", using OMP_NUM_THREADS there too; hence it
>>>>>>>>>> must not be specified as an `export` in the job script but either on the
>>>>>>>>>> command line or inside the job script in #$ lines as job requests. This would
>>>>>>>>>> mean collecting slots in bunches of OMP_NUM_THREADS on each machine to reach
>>>>>>>>>> the overall specified slot count. Whether OMP_NUM_THREADS or n times
>>>>>>>>>> OMP_NUM_THREADS is allowed per machine needs to be discussed.
>>>>>>>>>> 
>>>>>>>>>> PS2: As Univa SGE can also supply a list of granted cores in the
>>>>>>>>>> $PE_HOSTFILE, it would be an extension to feed this to Open MPI to allow
>>>>>>>>>> UGE-aware binding.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 14.08.2014 at 21:52, Oscar Mojica wrote:
>>>>>>>>>> 
>>>>>>>>>>> Guys
>>>>>>>>>>> 
>>>>>>>>>>> I changed the line that runs the program in the script, trying both options:
>>>>>>>>>>>
>>>>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-none -np $NSLOTS ./inverse.exe
>>>>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v --bind-to-socket -np $NSLOTS ./inverse.exe
>>>>>>>>>>> 
>>>>>>>>>>> but I got the same results. In man mpirun it says:
>>>>>>>>>>> 
>>>>>>>>>>> -bind-to-none, --bind-to-none
>>>>>>>>>>> Do not bind processes. (Default.)
>>>>>>>>>>> 
>>>>>>>>>>> and the output of 'qconf -sp orte' is
>>>>>>>>>>> 
>>>>>>>>>>> pe_name orte
>>>>>>>>>>> slots 9999
>>>>>>>>>>> user_lists NONE
>>>>>>>>>>> xuser_lists NONE
>>>>>>>>>>> start_proc_args /bin/true
>>>>>>>>>>> stop_proc_args /bin/true
>>>>>>>>>>> allocation_rule $fill_up
>>>>>>>>>>> control_slaves TRUE
>>>>>>>>>>> job_is_first_task FALSE
>>>>>>>>>>> urgency_slots min
>>>>>>>>>>> accounting_summary TRUE
>>>>>>>>>>> 
>>>>>>>>>>> I don't know if the installed Open MPI was compiled with '--with-sge'.
>>>>>>>>>>> How can I know that?
>>>>>>>>>>> Before thinking of a hybrid application I was using only MPI, and the
>>>>>>>>>>> program used few processors (14). The cluster has 28 machines, 15 with 16
>>>>>>>>>>> cores and 13 with 8 cores, totaling 344 processing units. When I submitted
>>>>>>>>>>> the job (MPI only), the MPI processes were spread over the cores directly;
>>>>>>>>>>> for that reason I created a new queue with 14 machines, trying to gain more
>>>>>>>>>>> time. The results were the same in both cases. In the last case I could
>>>>>>>>>>> verify that the processes were distributed to all machines correctly.
>>>>>>>>>>> 
>>>>>>>>>>> What must I do?
>>>>>>>>>>> Thanks
>>>>>>>>>>> 
>>>>>>>>>>> Oscar Fabian Mojica Ladino
>>>>>>>>>>> Geologist M.S. in Geophysics
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> Date: Thu, 14 Aug 2014 10:10:17 -0400
>>>>>>>>>>>> From: maxime.boissonnea...@calculquebec.ca
>>>>>>>>>>>> To: us...@open-mpi.org
>>>>>>>>>>>> Subject: Re: [OMPI users] Running a hybrid MPI+openMP program
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> You DEFINITELY need to disable Open MPI's new default binding. Otherwise,
>>>>>>>>>>>> your N threads will run on a single core. --bind-to socket would be my
>>>>>>>>>>>> recommendation for hybrid jobs.
>>>>>>>>>>>> 
>>>>>>>>>>>> Maxime
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 2014-08-14 10:04, Jeff Squyres (jsquyres) wrote:
>>>>>>>>>>>>> I don't know much about OpenMP, but do you need to disable Open MPI's
>>>>>>>>>>>>> default bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You can try "mpirun --bind-to none ...", which will have Open MPI not
>>>>>>>>>>>>> bind MPI processes to cores, which might allow OpenMP to think that it can
>>>>>>>>>>>>> use all the cores, and therefore it will spawn num_cores threads...?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Aug 14, 2014, at 9:50 AM, Oscar Mojica <o_moji...@hotmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hello everybody
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am trying to run a hybrid MPI + OpenMP program in a cluster. I created
>>>>>>>>>>>>>> a queue with 14 machines, each one with 16 cores. The program divides the
>>>>>>>>>>>>>> work among the 14 processors with MPI, and within each processor a loop is
>>>>>>>>>>>>>> also divided into 8 threads, for example, using OpenMP. The problem is that
>>>>>>>>>>>>>> when I submit the job to the queue, the MPI processes don't divide the work
>>>>>>>>>>>>>> into threads, and the program prints the number of threads that are working
>>>>>>>>>>>>>> within each process as one.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I made a simple test program that uses OpenMP and I logged in to one of
>>>>>>>>>>>>>> the fourteen machines. I compiled it using gfortran -fopenmp program.f -o exe,
>>>>>>>>>>>>>> set the OMP_NUM_THREADS environment variable equal to 8, and when I ran it
>>>>>>>>>>>>>> directly in the terminal the loop was effectively divided among the cores;
>>>>>>>>>>>>>> in this case, for example, the program printed the number of threads equal to 8.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is my Makefile
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> # Start of the makefile
>>>>>>>>>>>>>> # Defining variables
>>>>>>>>>>>>>> objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
>>>>>>>>>>>>>> #f90comp = /opt/openmpi/bin/mpif90
>>>>>>>>>>>>>> f90comp = /usr/bin/mpif90
>>>>>>>>>>>>>> #switch = -O3
>>>>>>>>>>>>>> executable = inverse.exe
>>>>>>>>>>>>>> # Makefile
>>>>>>>>>>>>>> all : $(executable)
>>>>>>>>>>>>>> $(executable) : $(objects)
>>>>>>>>>>>>>> $(f90comp) -fopenmp -g -O -o $(executable) $(objects)
>>>>>>>>>>>>>> rm $(objects)
>>>>>>>>>>>>>> %.o: %.f
>>>>>>>>>>>>>> $(f90comp) -c $<
>>>>>>>>>>>>>> # Cleaning everything
>>>>>>>>>>>>>> clean:
>>>>>>>>>>>>>> rm $(executable)
>>>>>>>>>>>>>> #        rm $(objects)
>>>>>>>>>>>>>> # End of the makefile
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and the script that I am using is:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> #!/bin/bash
>>>>>>>>>>>>>> #$ -cwd
>>>>>>>>>>>>>> #$ -j y
>>>>>>>>>>>>>> #$ -S /bin/bash
>>>>>>>>>>>>>> #$ -pe orte 14
>>>>>>>>>>>>>> #$ -N job
>>>>>>>>>>>>>> #$ -q new.q
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> export OMP_NUM_THREADS=8
>>>>>>>>>>>>>> /usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> am I forgetting something?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Oscar Fabian Mojica Ladino
>>>>>>>>>>>>>> Geologist M.S. in Geophysics
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> ---------------------------------
>>>>>>>>>>>> Maxime Boissonneault
>>>>>>>>>>>> Computing Analyst - Calcul Québec, Université Laval
>>>>>>>>>>>> Ph.D. in Physics
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> ----
>>>>> Tetsuya Mishima  tmish...@jcity.maeda.co.jp
>>>> 
>>> 
>> 
> 
