Re: [OMPI users] Allgather Implementation Details

2015-07-01 Thread Saliya Ekanayake
Thank you George. This is very informative.

Is it possible to pass the option at runtime rather than setting it up in the
config file?

Thank you
Saliya

On Tue, Jun 30, 2015 at 7:20 PM, George Bosilca  wrote:

> Saliya,
>
> On Tue, Jun 30, 2015 at 10:50 AM, Saliya Ekanayake 
> wrote:
>
>> Hi,
>>
>> I am experiencing some bottleneck with allgatherv routine in one of our
>> programs and wonder how it works internally. Could you please share some
>> details on this?
>>
>
> Open MPI has a tunable approach to all the collective algorithms. In case
> you have the tuned collective enabled (--mca coll tuned,inter,self,basic as
> an example) you do have access to the pipelined ring version you made
> reference to. However, in addition to that particular version you also have
> access to other, sometimes faster, algorithms such as Bruck.
>
> Do a quick "ompi_info --param coll tuned -l 9" to see all the tuned
> collective options. You can alter the selection of a particular AllgatherV
> algorithm in Open MPI by adding the following two lines to your
> ${HOME}/.openmpi/mca-params.conf file.
> coll_tuned_use_dynamic_rules = 1
> coll_tuned_allgatherv_algorithm = 3
> With the above 2 lines I force the Bruck algorithm (which has ID 3 in the
> output of ompi_info) for all allgatherv collectives.
>
> You can benchmark the MPI_Allgatherv for your particular case and then
> force the selection of the right algorithm.
>
>   George.
>
>
>
>
>>
>> I found this [1] paper from Gropp discussing an efficient implementation.
>> Is this similar to what we get in OpenMPI?
>>
>>
>>
>> [1]
>> http://www.researchgate.net/profile/William_Gropp/publication/221597354_A_Simple_Pipelined_Algorithm_for_Large_Irregular_All-gather_Problems/links/00b49525d291830c6700.pdf
>>
>>
>>
>> Thank you,
>> Saliya
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/06/27229.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27232.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Allgather Implementation Details

2015-07-01 Thread George Bosilca
Use --mca to pass the options directly through the mpirun.

  George.
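
A sketch of what such an invocation could look like, using the MCA parameters
quoted below in this thread (the process count and program name are
placeholders, not taken from the list):

  mpirun -np 48 \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_allgatherv_algorithm 3 \
      ./your_allgatherv_program

Parameters passed on the mpirun command line take precedence over those set in
${HOME}/.openmpi/mca-params.conf, so they affect only that run.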


On Wed, Jul 1, 2015 at 9:14 AM, Saliya Ekanayake  wrote:

> Thank you George. This is very informative.
>
> Is it possible to pass the option at runtime rather than setting it up in the
> config file?
>
> Thank you
> Saliya
>
> On Tue, Jun 30, 2015 at 7:20 PM, George Bosilca 
> wrote:
>
>> Saliya,
>>
>> On Tue, Jun 30, 2015 at 10:50 AM, Saliya Ekanayake 
>> wrote:
>>
>>> Hi,
>>>
>>> I am experiencing some bottleneck with allgatherv routine in one of our
>>> programs and wonder how it works internally. Could you please share some
>>> details on this?
>>>
>>
>> Open MPI has a tunable approach to all the collective algorithms. In case
>> you have the tuned collective enabled (--mca coll tuned,inter,self,basic as
>> an example) you do have access to the pipelined ring version you made
>> reference to. However, in addition to that particular version you also have
>> access to other, sometimes faster, algorithms such as Bruck.
>>
>> Do a quick "ompi_info --param coll tuned -l 9" to see all the tuned
>> collective options. You can alter the selection of a particular AllgatherV
>> algorithm in Open MPI by adding the following two lines to your
>> ${HOME}/.openmpi/mca-params.conf file.
>> coll_tuned_use_dynamic_rules = 1
>> coll_tuned_allgatherv_algorithm = 3
>> With the above 2 lines I force the Bruck algorithm (which has ID 3 in the
>> output of ompi_info) for all allgatherv collectives.
>>
>> You can benchmark the MPI_Allgatherv for your particular case and then
>> force the selection of the right algorithm.
>>
>>   George.
>>
>>
>>
>>
>>>
>>> I found this [1] paper from Gropp discussing an efficient
>>> implementation. Is this similar to what we get in OpenMPI?
>>>
>>>
>>>
>>> [1]
>>> http://www.researchgate.net/profile/William_Gropp/publication/221597354_A_Simple_Pipelined_Algorithm_for_Large_Irregular_All-gather_Problems/links/00b49525d291830c6700.pdf
>>>
>>>
>>>
>>> Thank you,
>>> Saliya
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/06/27229.php
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/06/27232.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Searchable archives:
> http://www.open-mpi.org/community/lists/users/2015/07/27236.php
>


Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Stefan Paquay
Hi all,

Hopefully this mail gets posted in the right thread...

I have noticed what I assume is the same leak using OpenMPI 1.8.6 with LAMMPS, a
molecular dynamics program, without any use of CUDA. I am not that familiar
with how the internal memory management of LAMMPS works, but the leak does not
appear to be CUDA-related.

The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak

Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone git://git.lammps.org/lammps-ro.git lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt

I would like to help find this bug but I am not sure what would help.
LAMMPS itself is pretty big so I can imagine you might not want to go
through all of the code...
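
Condensed into commands, the reproduction recipe above might look roughly like
this (the make target, binary name, and input file are assumptions based on a
standard LAMMPS source tree, not confirmed in this thread):

  git clone git://git.lammps.org/lammps-ro.git lammps
  cd lammps/src
  make mpi                      # with the Open MPI 1.8.6 compiler wrappers in PATH
  cd ../examples/melt
  mpirun -np 4 ../../src/lmp_mpi -in in.melt   # watch resident memory grow during the run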


[OMPI users] Running 1 proc per socket but no more

2015-07-01 Thread Saliya Ekanayake
Hi,

I am doing some benchmarks and would like to test the following two
scenarios. Each machine has 4 sockets each with 6 cores (lstopo image
attached).

Scenario 1
---
Run 12 procs per node each bound to 2 cores. I can do this by --map-by
socket:PE=2

Scenario 2
Run 12 procs per node each bound to just 1 core. This is what I don't know
how to do, because if I use --map-by socket:PE=1 then mpirun will put more
than 12 procs per node, since it can.

I'd appreciate any help on this.

Thank you,
Saliya

-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Running 1 proc per socket but no more

2015-07-01 Thread Ralph Castain
Scenario 2: --map-by ppr:12:node,span --bind-to core

will put 12 procs on each node, load balanced across the sockets, each proc
bound to 1 core

HTH
Ralph
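
The full command might look like the following sketch (the process count,
program name, and --report-bindings flag are added here purely for
illustration):

  mpirun -np 360 --map-by ppr:12:node,span --bind-to core \
      --report-bindings ./your_benchmark

--report-bindings prints each process's binding at startup, which makes it
easy to confirm the intended one-core-per-process layout.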


On Wed, Jul 1, 2015 at 2:42 PM, Saliya Ekanayake  wrote:

> Hi,
>
> I am doing some benchmarks and would like to test the following two
> scenarios. Each machine has 4 sockets each with 6 cores (lstopo image
> attached).
>
> Scenario 1
> ---
> Run 12 procs per node each bound to 2 cores. I can do this by --map-by
> socket:PE=2
>
> Scenario 2
> Run 12 procs per node each bound to just 1 core. This is what I don't know
> how to do, because if I do --map-by socket:PE=1 then mpirun will put more
> than 12 procs per node as it can do so.
>
> I'd appreciate any help on this.
>
> Thank you,
> Saliya
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/07/27239.php
>


Re: [OMPI users] Running 1 proc per socket but no more

2015-07-01 Thread Saliya Ekanayake
Thank you Ralph

Saliya

On Wed, Jul 1, 2015 at 4:01 PM, Ralph Castain  wrote:

> Scenario 2: --map-by ppr:12:node,span --bind-to core
>
> will put 12 procs on each node, load balanced across the sockets, each
> proc bound to 1 core
>
> HTH
> Ralph
>
>
> On Wed, Jul 1, 2015 at 2:42 PM, Saliya Ekanayake 
> wrote:
>
>> Hi,
>>
>> I am doing some benchmarks and would like to test the following two
>> scenarios. Each machine has 4 sockets each with 6 cores (lstopo image
>> attached).
>>
>> Scenario 1
>> ---
>> Run 12 procs per node each bound to 2 cores. I can do this by --map-by
>> socket:PE=2
>>
>> Scenario 2
>> Run 12 procs per node each bound to just 1 core. This is what I don't
>> know how to do, because if I do --map-by socket:PE=1 then mpirun will put
>> more than 12 procs per node as it can do so.
>>
>> I'd appreciate any help on this.
>>
>> Thank you,
>> Saliya
>>
>> --
>> Saliya Ekanayake
>> Ph.D. Candidate | Research Assistant
>> School of Informatics and Computing | Digital Science Center
>> Indiana University, Bloomington
>> Cell 812-391-4914
>> http://saliya.org
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/07/27239.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/07/27240.php
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] Running 1 proc per socket but no more

2015-07-01 Thread Saliya Ekanayake
I tried this, but I get an error:

---
An invalid value was given for the number of processes
per resource (ppr) to be mapped on each node:

  PPR:  12:node,span

The specification must be a comma-separated list containing
combinations of number, followed by a colon, followed
by the resource type. For example, a value of "1:socket" indicates that
one process is to be mapped onto each socket. Values are supported
for hwthread, core, L1-3 caches, socket, numa, and node. Note that
enough characters must be provided to clearly specify the desired
resource (e.g., "nu" for "numa").
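
One untested sketch of a workaround, using the "number:resource" form that the
error message itself describes: with 4 sockets per node, 3 processes per
socket gives the same 12 processes per node, each bound to a single core:

  mpirun --map-by ppr:3:socket --bind-to core ./your_benchmark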

On Wed, Jul 1, 2015 at 4:04 PM, Saliya Ekanayake  wrote:

> Thank you Ralph
>
> Saliya
>
> On Wed, Jul 1, 2015 at 4:01 PM, Ralph Castain  wrote:
>
>> Scenario 2: --map-by ppr:12:node,span --bind-to core
>>
>> will put 12 procs on each node, load balanced across the sockets, each
>> proc bound to 1 core
>>
>> HTH
>> Ralph
>>
>>
>> On Wed, Jul 1, 2015 at 2:42 PM, Saliya Ekanayake 
>> wrote:
>>
>>> Hi,
>>>
>>> I am doing some benchmarks and would like to test the following two
>>> scenarios. Each machine has 4 sockets each with 6 cores (lstopo image
>>> attached).
>>>
>>> Scenario 1
>>> ---
>>> Run 12 procs per node each bound to 2 cores. I can do this by --map-by
>>> socket:PE=2
>>>
>>> Scenario 2
>>> Run 12 procs per node each bound to just 1 core. This is what I don't
>>> know how to do, because if I do --map-by socket:PE=1 then mpirun will put
>>> more than 12 procs per node as it can do so.
>>>
>>> I'd appreciate any help on this.
>>>
>>> Thank you,
>>> Saliya
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/07/27239.php
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/07/27240.php
>>
>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>



-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org


Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Rolf vandeVaart
Hi Stefan (and Steven, who reported this earlier with a CUDA-aware program),



I have managed to observe the leak when running LAMMPS as well.  Note that 
this has nothing to do with CUDA-aware features.  I am going to move this 
discussion to the Open MPI developer’s list to dig deeper into this issue.  
Thanks for reporting.



Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi all,
Hopefully this mail gets posted in the right thread...
I have noticed the (I guess same) leak using OpenMPI 1.8.6 with LAMMPS, a 
molecular dynamics program, without any use of CUDA. I am not that familiar 
with how the internal memory management of LAMMPS works, but it does not appear 
CUDA-related.
The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak
Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone 
git://git.lammps.org/lammps-ro.git lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt
I would like to help find this bug but I am not sure what would help. LAMMPS 
itself is pretty big so I can imagine you might not want to go through all of 
the code...




Re: [OMPI users] IB to some nodes but TCP for others

2015-07-01 Thread Tim Miller
Hi All,

Sorry for the late reply on this. I've been digging through the OpenMPI
FAQ. I've never explicitly set the subnet IDs for my IB subnets, so I
suspect I'm using the factory defaults. Probably, if I change this, it will
"just work". I'll see if the end user is still interested in testing this
and, if so, try it out.

Thanks,
Tim
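
For reference, if OpenSM is the subnet manager on each fabric, the subnet ID
is typically set via the subnet_prefix option in its configuration file; a
rough sketch (the file path and prefix values are assumptions, not taken from
this thread):

  # on the subnet manager serving switch A (e.g., /etc/opensm/opensm.conf)
  subnet_prefix 0xfe80000000000001

  # on the subnet manager serving switch B
  subnet_prefix 0xfe80000000000002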

On Tue, Jun 16, 2015 at 7:15 PM, Jeff Squyres (jsquyres)  wrote:

> Do you have different IB subnet IDs?  That would be the only way for Open
> MPI to tell the two IB subnets apart.
>
>
>
> > On Jun 16, 2015, at 1:25 PM, Tim Miller  wrote:
> >
> > Hi All,
> >
> > We have a set of nodes which are all connected via InfiniBand, but not all
> are mutually connected. For example, nodes 1-32 are connected to IB switch
> A and 33-64 are connected to switch B, but there is no IB connection
> between switches A and B. However, all nodes are mutually routable over TCP.
> >
> > What we'd like to do is tell OpenMPI to use IB when communicating
> amongst nodes 1-32 or 33-64, but to use TCP whenever a node in the set 1-32
> needs to talk to another node in the set 33-64 or vice-versa. We've written
> an application in such a way that we can confine most of the bandwidth and
> latency sensitive operations to within groups of 32 nodes, but members of
> the two groups do have to communicate infrequently via MPI.
> >
> > Is there any way to tell OpenMPI to use IB within an IB-connected group
> and TCP for inter-group communications? Obviously, we would need to tell
> OpenMPI the membership of the two groups. If there's no such functionality,
> would it be a difficult thing to hack in (I'd be glad to give it a try
> myself, but I'm not that familiar with the codebase, so a couple of
> pointers would be helpful, or a note saying I'm crazy for trying).
> >
> > Thanks,
> > Tim
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27141.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27144.php
>


[OMPI users] Binding width affects allgatherv performance?

2015-07-01 Thread Saliya Ekanayake
Hi,

I am getting strange performance results for the allgatherv operation with the
same number of procs and data, but with varying binding width. For example,
here are two cases with about a 180x difference in performance.

Each machine has 4 sockets each with 6 cores totaling 24 cores per node
(topology attached).

Case 1

12 procs per node, each bound to 1 core, on 30 nodes --> 1929 ms

Case 2

12 procs per node, each bound to 2 cores, on 30 nodes --> 357209 ms


Another set of variations for 2 procs per node and 4 procs per node is
given below in the chart. Is such variation expected with binding width? I
am a bit puzzled and would appreciate any help to understand this.

[image: Inline image 1]
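
For anyone wanting to reproduce this kind of measurement, here is a minimal
MPI_Allgatherv timing sketch (the per-rank count and iteration count are
arbitrary placeholders, not the benchmark that produced the numbers above):

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, size, i, it;
      const int per_rank = 100000;   /* doubles contributed by each rank */
      const int iters = 100;
      double *sendbuf, *recvbuf, t0, t1;
      int *counts, *displs;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* every rank contributes the same count; displacements are contiguous */
      sendbuf = malloc(per_rank * sizeof(double));
      recvbuf = malloc((size_t)per_rank * size * sizeof(double));
      counts  = malloc(size * sizeof(int));
      displs  = malloc(size * sizeof(int));
      for (i = 0; i < size; i++) { counts[i] = per_rank; displs[i] = i * per_rank; }
      for (i = 0; i < per_rank; i++) sendbuf[i] = (double)rank;

      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (it = 0; it < iters; it++)
          MPI_Allgatherv(sendbuf, per_rank, MPI_DOUBLE,
                         recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);
      t1 = MPI_Wtime();

      if (rank == 0)
          printf("average MPI_Allgatherv time: %.3f ms\n",
                 (t1 - t0) / iters * 1000.0);

      free(sendbuf); free(recvbuf); free(counts); free(displs);
      MPI_Finalize();
      return 0;
  }

Running the same binary under the two binding layouts should show whether the
gap comes from the collective itself or from the process placement.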

Thank you,
Saliya

-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org