[OMPI users] Bandwidth efficiency advice

2017-05-26 Thread marcin.krotkiewski

Dear All,

I would appreciate some general advice on how to efficiently implement 
the following scenario.


I am looking into how to send a large amount of data over IB _once_, to
multiple receivers. The trick is, of course, that while the ping-pong
benchmark delivers great bandwidth, it does so by re-using
already-registered memory buffers. Since I send the data only once, the
memory registration penalty is not easily avoided. I've been looking
into the following approaches:


1. have multiple ranks send different parts of the data to different 
receivers, in the hope that the memory registration cost will be hidden
2. pre-register two smaller buffers, into which the data is copied
before sending


The first approach is the best I've managed so far, but the achieved
bandwidth is still lower than what I observe with the ping-pong
benchmark. Also, performance depends on the number of sending ranks and
drops if there are too many.


In the second approach one pays for a data copy. My thinking was that 
since the effective memory bandwidth available on a single modern CPU is 
larger than the IB bandwidth, I could squeeze out some performance by 
combining double buffering and multithreading, e.g.,


Step 1. Thread A sends the data in the current buffer. Behind the
scenes, thread B copies the next chunk of data into the other buffer.

Step 2. The buffers are switched.
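
In code, the pipeline would look roughly like the sketch below (a
single-threaded variant of the same idea that overlaps a nonblocking
send from one pre-registered buffer with the copy into the other; the
helper and all names in it are made up):

#include <mpi.h>
#include <string.h>

/* Send 'total' bytes from 'src' to rank 'dest', staged through two
 * pre-registered buffers of 'chunk' bytes each. While one buffer is
 * being sent, the next chunk is copied into the other one. */
static void send_double_buffered(const char *src, size_t total,
                                 char *staging[2], size_t chunk,
                                 int dest, MPI_Comm comm)
{
    MPI_Request req = MPI_REQUEST_NULL;
    int cur = 0;

    for (size_t off = 0; off < total; off += chunk) {
        size_t len = (total - off < chunk) ? (total - off) : chunk;

        /* copy into the idle buffer while the previous send is in flight */
        memcpy(staging[cur], src + off, len);

        /* wait for the previous chunk to leave the other buffer */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Isend(staging[cur], (int)len, MPI_BYTE, dest, 0, comm, &req);
        cur = 1 - cur;                        /* switch buffers */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* drain the last send */
}

The receiver side would post matching MPI_Recv calls with the same
chunk size.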

A similar idea would be to use MPI_Get on the remote rank. The sender
would copy the data from memory into the second buffer while the RMA
window with the first buffer is exposed. In theory, I would expect those
two operations to be executed simultaneously, with the memory copy
hopefully hidden behind the IB transfer.
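
Roughly, the sender side of that RMA variant could look like the sketch
below (again just my own illustration with made-up names: two windows,
one per staging buffer, with fence synchronization, so that the copy
into the idle buffer overlaps the exposure of the other one):

#include <mpi.h>
#include <string.h>

/* Expose 'total' bytes of 'src' through two staging buffers of 'chunk'
 * bytes each. While one window is exposed and the receivers MPI_Get
 * from it, the sender copies the next chunk into the other buffer. */
static void expose_double_buffered(const char *src, size_t total,
                                   char *buf[2], size_t chunk,
                                   MPI_Comm comm)
{
    MPI_Win win[2];
    MPI_Win_create(buf[0], chunk, 1, MPI_INFO_NULL, comm, &win[0]);
    MPI_Win_create(buf[1], chunk, 1, MPI_INFO_NULL, comm, &win[1]);

    int cur = 0;
    memcpy(buf[0], src, (total < chunk) ? total : chunk);  /* prime buffer 0 */

    for (size_t off = 0; off < total; off += chunk) {
        MPI_Win_fence(0, win[cur]);    /* open: receivers Get from buf[cur] */

        size_t next = off + chunk;
        if (next < total) {            /* overlap: fill the idle buffer */
            size_t len = (total - next < chunk) ? (total - next) : chunk;
            memcpy(buf[1 - cur], src + next, len);
        }

        MPI_Win_fence(0, win[cur]);    /* close: all Gets have completed */
        cur = 1 - cur;
    }
    MPI_Win_destroy(&win[0]);
    MPI_Win_destroy(&win[1]);
}

The receivers would create the same two windows (with a zero-size local
base), call the matching fences, and issue MPI_Get on whichever buffer
is currently exposed.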


Of course, the experiments didn't really work. While the first
(multi-rank) approach is OK and shows some improvement, the bandwidth
still falls short. None of my double-buffering approaches helped at all,
possibly because of memory bandwidth contention.


So I was wondering: have any of you had experience with similar
approaches? What, in your experience, would work best?


Thanks a lot!

Marcin



Re: [OMPI users] Bandwidth efficiency advice

2017-05-26 Thread George Bosilca
If you have multiple receivers, use MPI_Bcast; it does all the
necessary optimizations so that MPI users do not have to struggle to
adapt/optimize their application for a specific architecture/network.
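
For a large payload that boils down to something like the minimal
sketch below (my own illustration; it assumes the root and all
receivers share a communicator and that the element count fits into a
single call):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* hypothetical payload: 1 GiB of doubles */
    const long n = 1L << 27;
    double *data = malloc(n * sizeof(double));

    if (rank == 0) {
        /* the root fills the buffer once */
        for (long i = 0; i < n; i++)
            data[i] = (double)i;
    }

    /* one collective call: the library picks the algorithm (tree,
       pipelining, ...) and handles buffer registration internally */
    MPI_Bcast(data, (int)n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(data);
    MPI_Finalize();
    return 0;
}

If the element count ever exceeds INT_MAX, the broadcast can simply be
issued in chunks or with a larger derived datatype.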

  George.




[OMPI users] pmix, lxc, hpcx

2017-05-26 Thread John Marshall

Hi,

I have built Open MPI 2.1.1 with hpcx-1.8 and tried to run some MPI code
under Ubuntu 14.04 and LXC (1.x), but I get the following:

[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/dstore/pmix_esh.c at line 1651
[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/dstore/pmix_esh.c at line 1751
[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/dstore/pmix_esh.c at line 1114
[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/common/pmix_jobdata.c at line 93
[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/common/pmix_jobdata.c at line 333
[ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
src/server/pmix_server.c at line 606

I do not get these errors outside of the LXC container, where my code
runs fine.

I've looked for more info on these messages but could not find anything
helpful. Are these messages indicative of something missing in, or some
incompatibility with, the container?

When I build using 2.0.2, I do not have a problem running inside or outside of
the container.

Thanks,
John

Re: [OMPI users] pmix, lxc, hpcx

2017-05-26 Thread Howard Pritchard
Hi John,

In the 2.1.x release stream a shared memory capability was introduced into
the PMIx component.

I know nothing about LXC containers, but it looks to me like there's some
issue when PMIx tries
to create these shared memory segments.  I'd check to see if there's
something about your
container configuration that is preventing the creation of shared memory
segments.
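
I'm not sure which mechanism the PMIx dstore uses under the hood, but a
quick sanity check of whether a process inside the container can create
and map a plain POSIX shared memory segment could look like the
throwaway probe below (my own sketch, not an Open MPI/PMIx tool; build
with "cc probe.c -lrt" if your glibc still needs librt):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/shm_probe";    /* arbitrary segment name */
    const size_t len = 1 << 20;         /* 1 MiB */

    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }

    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("shared memory segment created and mapped OK\n");

    munmap(p, len);
    close(fd);
    shm_unlink(name);
    return 0;
}

If that fails, or if /dev/shm (or the temporary/session directory) is
missing or mounted read-only inside the container, that would point at
the container setup rather than at Open MPI itself.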

Howard



Re: [OMPI users] pmix, lxc, hpcx

2017-05-26 Thread r...@open-mpi.org
You can also get around it by configuring OMPI with “--disable-pmix-dstore”



[OMPI users] Problems with IPoIB and Openib

2017-05-26 Thread Allan Overstreet
I have been having some issues using Open MPI with TCP over IPoIB and
with openib. The problems arise when I run a program that uses basic
collective communication. The two programs that I have been using are
attached.


*** IPoIB ***

The mpirun command I am using to run MPI over IPoIB is:
mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_include 
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes 
-np 8 ./avg 8000


This program appears to run on the nodes, but it sits at 100% CPU and
uses no memory. On the host node the following error is printed:


[sm1][[58411,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.3 failed: No route to host (113)


Running another program,

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 
10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes 
-np 8 ./congrad 800
produces the following result. This program also sits at 100% CPU on
nodes sm1, sm2, sm3, and sm4 and uses no memory:
[sm3][[61383,1],4][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],6][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],3][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.2 failed: No route to host (113)
[sm3][[61383,1],5][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.2 failed: No route to host (113)
[sm1][[61383,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.3 failed: No route to host (113)
[sm1][[61383,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] 
connect() to 10.1.0.3 failed: No route to host (113)


*** openib ***

Running the congrad program over openib produces the following result:
mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 
10.1.0.0/24 -hostfile nodes -np 8 ./avg 800

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.
Host:  sm2.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
  mca_bml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() 
to socket 29 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in 
file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],1] usock_peer_accept: 
usock_peer_send_connect_ack failed
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() 
to socket 27 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in 
file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],0] usock_peer_accept: 
usock_peer_send_connect_ack failed