Re: [OMPI users] Order of ranks in mpirun

2019-05-17 Thread Adam Sylvester via users
Thanks - "--map-by numa:span" did exactly what I wanted!

On Wed, May 15, 2019 at 10:34 PM Ralph Castain via users <
users@lists.open-mpi.org> wrote:

>
>
> > On May 15, 2019, at 7:18 PM, Adam Sylvester via users <
> users@lists.open-mpi.org> wrote:
> >
> > Up to this point, I've been running a single MPI rank per physical host
> (using multithreading within my application to use all available cores).  I
> use this command:
> > mpirun -N 1 --bind-to none --hostfile hosts.txt
> > Where hosts.txt has an IP address on each line
> >
> > I've started running on machines with significant NUMA effects... on a
> single one of these machines, I've started running a separate rank per NUMA
> node.  On a machine with 64 CPUs and 4 NUMA nodes, I do this:
> > mpirun -N 1 --bind-to numa
> > I've convinced myself by watching the processors that are active on
> 'top' that this is behaving like I want it to.
> >
> > I now want to combine these two - running on, say, 10 physical hosts
> with 4 NUMA nodes - a total of 40 ranks.  But, the order of the ranks is
> important (for efficiency, due to how the application divides up work
> across ranks).  So, I want ranks 0-3 to be on host 0 across its NUMA nodes,
> then ranks 4-7 on host 1 across its NUMA nodes, etc.
> >
> > Some guesses:
> > mpirun -n 40 --map-by numa --rank-by numa --hostfile hosts.txt
>^^
> This is the one you want. If you want it “load balanced” (i.e., you want
> to round-robin across all the numas before adding a second proc to one of
> them), then change the map-by option to be “--map-by numa:span” so it
> treats all the numa regions as if they were on one gigantic node and
> round-robins across them. Then you won’t need any “slots” argument
> regardless of how many procs total you execute (even if you want to put
> some extras on the first numa nodes). Note that the above cmd line will
> default to “--bind-to numa” to match the mapping policy unless you tell it
> otherwise.
>
>
> > or
> > mpirun --map-by ppr:4:node --rank-by numa --hostfile hosts.txt
> > Where hosts.txt still has a single IP address per line (and doesn't need
> a 'slots=4')
> >
> > I'd like to make sure I get the syntax right in general and not just
> empirically try guesses until one looks like it works... and find
> inevitably it doesn't work like I thought when I change the # of machines
> or run on machines with a different # of NUMA nodes.
> >
> > Thanks.
> > -Adam

[OMPI users] Order of ranks in mpirun

2019-05-15 Thread Adam Sylvester via users
Up to this point, I've been running a single MPI rank per physical host
(using multithreading within my application to use all available cores).  I
use this command:
mpirun -N 1 --bind-to none --hostfile hosts.txt
Where hosts.txt has an IP address on each line

I've started running on machines with significant NUMA effects... on a
single one of these machines, I've started running a separate rank per NUMA
node.  On a machine with 64 CPUs and 4 NUMA nodes, I do this:
mpirun -N 1 --bind-to numa
I've convinced myself by watching the processors that are active on 'top'
that this is behaving like I want it to.

I now want to combine these two - running on, say, 10 physical hosts with 4
NUMA nodes - a total of 40 ranks.  But, the order of the ranks is important
(for efficiency, due to how the application divides up work across ranks).
So, I want ranks 0-3 to be on host 0 across its NUMA nodes, then ranks 4-7
on host 1 across its NUMA nodes, etc.

Some guesses:
mpirun -n 40 --map-by numa --rank-by numa --hostfile hosts.txt
or
mpirun --map-by ppr:4:node --rank-by numa --hostfile hosts.txt
Where hosts.txt still has a single IP address per line (and doesn't need a
'slots=4')

I'd like to make sure I get the syntax right in general and not just
empirically try guesses until one looks like it works... and find
inevitably it doesn't work like I thought when I change the # of machines
or run on machines with a different # of NUMA nodes.

Thanks.
-Adam

Re: [OMPI users] Network performance over TCP

2019-03-23 Thread Adam Sylvester
Thanks Gilles.  Unfortunately, my understanding is that EFA is only
available on C5n instances, not 'regular' C5 instances (
https://aws.amazon.com/about-aws/whats-new/2018/11/introducing-elastic-fabric-adapter/).
I will be using C5n instances in the future but not at this time, so I'm
hoping to get btl_tcp_links or equivalent to work...

Adam

On Sat, Mar 23, 2019, 8:59 PM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> FWIW, EFA adapter is available on this AWS instance, and Open MPI can use
> it via libfabric (aka OFI)
> Here is a link to Brian’s video
> https://insidehpc.com/2018/04/amazon-libfabric-case-study-flexible-hpc-infrastructure/
>
> Cheers,
>
> Gilles
>
> On Sunday, March 24, 2019, Adam Sylvester  wrote:
>
>> Digging up this old thread as it appears there's still an issue with
>> btl_tcp_links.
>>
>> I'm now using c5.18xlarge instances in AWS which have 25 Gbps
>> connectivity; using iperf3 with the -P option to drive multiple ports, I
>> achieve over 24 Gbps when communicating between two instances.
>>
>> When I originally asked this question, Gilles suggested I could do the
>> equivalent with OpenMPI via the --mca btl_tcp_links flag but then Brian
>> reported that this flag doesn't work in the 2.x and 3.x series.  I just
>> updated to OpenMPI 4.0.0, hoping that this was fixed; according to the FAQ
>> at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should
>> be working.  However, I see no difference in performance; on a simple
>> benchmark which passes 10 GB between two ranks (one rank per host) via
>> MPI_Send() and MPI_Recv(), I see around 9 Gb / s with or without this flag.
>>
>> In particular, I am running with:
>> mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt
>> /path/to/my/application
>>
>> Trying a btl_tcp_links value of 2 or 3 also makes no difference.  Is
>> there another flag I need to be using or is something still broken?
>>
>> Thanks.
>> -Adam
>>
>> On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester  wrote:
>>
>>> Bummer - thanks for the info Brian.
>>>
>>> As an FYI, I do have a real world use case for this faster connectivity
>>> (i.e. beyond just a benchmark).  While my application will happily gobble
>>> up and run on however many machines it's given, there's a resource manager
>>> that lives on top of everything that doles out machines to applications.
>>> So there will be cases where my application will only get two machines to
>>> run and so I'd still like the big data transfers to happen as quickly as
>>> possible.  I agree that when there are many ranks all talking to each
>>> other, I should hopefully get closer to the full 20 Gbps.
>>>
>>> I appreciate that you have a number of other higher priorities, but
>>> wanted to make you aware that I do have a use case for it... look forward
>>> to using it when it's in place. :o)
>>>
>>> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <
>>> users@lists.open-mpi.org> wrote:
>>>
>>>> Adam -
>>>>
>>>> The btl_tcp_links flag does not currently work (for various reasons) in
>>>> the 2.x and 3.x series.  It’s on my todo list to fix, but I’m not sure it
>>>> will get done before the 3.0.0 release.  Part of the reason that it hasn’t
>>>> been a priority is that most applications (outside of benchmarks) don’t
>>>> benefit from the 20 Gbps between rank pairs, as they are generally talking
>>>> to multiple peers at once (and therefore can drive the full 20 Gbps).  It’s
>>>> definitely on our roadmap, but can’t promise a release just yet.
>>>>
>>>> Brian
>>>>
>>>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester  wrote:
>>>>
>>>> I switched over to X1 instances in AWS which have 20 Gbps
>>>> connectivity.  Using iperf3, I'm seeing 11.1 Gbps between them with just
>>>> one port.  iperf3 supports a -P option which will connect using multiple
>>>> ports...  Setting this to use in the range of 5-20 ports (there's some
>>>> variability from run to run), I can get in the range of 18 Gbps aggregate
>>>> which for a real world speed seems pretty good.
>>>>
>>>> Using mpirun with the previously-suggested btl_tcp_sndbuf and
>>>> btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps.  So, pretty close to
>>>> iperf with just one port (makes sense there'd be some overhead with MPI).
>>>> My understanding of 

Re: [OMPI users] Network performance over TCP

2019-03-23 Thread Adam Sylvester
Digging up this old thread as it appears there's still an issue with
btl_tcp_links.

I'm now using c5.18xlarge instances in AWS which have 25 Gbps connectivity;
using iperf3 with the -P option to drive multiple ports, I achieve over 24
Gbps when communicating between two instances.

When I originally asked this question, Gilles suggested I could do the
equivalent with OpenMPI via the --mca btl_tcp_links flag but then Brian
reported that this flag doesn't work in the 2.x and 3.x series.  I just
updated to OpenMPI 4.0.0, hoping that this was fixed; according to the FAQ
at https://www.open-mpi.org/faq/?category=tcp#tcp-multi-links, it should be
working.  However, I see no difference in performance; on a simple
benchmark which passes 10 GB between two ranks (one rank per host) via
MPI_Send() and MPI_Recv(), I see around 9 Gb / s with or without this flag.
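For reference, the benchmark is essentially the following (a simplified sketch, not the exact code; the 10 GB payload is sent in 1 GB chunks to stay under the int count limit of MPI_Send):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int chunkBytes = 1 << 30;        // 1 GB per message
    const int numChunks = 10;              // 10 GB total
    std::vector<char> buffer(chunkBytes);

    MPI_Barrier(MPI_COMM_WORLD);
    const double start = MPI_Wtime();
    for (int ii = 0; ii < numChunks; ++ii)
    {
        if (rank == 0)
        {
            MPI_Send(buffer.data(), chunkBytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        }
        else if (rank == 1)
        {
            MPI_Recv(buffer.data(), chunkBytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
    const double elapsed = MPI_Wtime() - start;
    if (rank == 1)
    {
        std::printf("%.2f Gb/s\n", 8.0 * numChunks * chunkBytes / elapsed / 1e9);
    }

    MPI_Finalize();
    return 0;
}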

In particular, I am running with:
mpirun --mca btl_tcp_links 4 -N 1 --bind-to none --hostfile hosts.txt
/path/to/my/application

Trying a btl_tcp_links value of 2 or 3 also makes no difference.  Is there
another flag I need to be using or is something still broken?

Thanks.
-Adam

On Thu, Jul 13, 2017 at 12:05 PM Adam Sylvester  wrote:

> Bummer - thanks for the info Brian.
>
> As an FYI, I do have a real world use case for this faster connectivity
> (i.e. beyond just a benchmark).  While my application will happily gobble
> up and run on however many machines it's given, there's a resource manager
> that lives on top of everything that doles out machines to applications.
> So there will be cases where my application will only get two machines to
> run and so I'd still like the big data transfers to happen as quickly as
> possible.  I agree that when there are many ranks all talking to each
> other, I should hopefully get closer to the full 20 Gbps.
>
> I appreciate that you have a number of other higher priorities, but wanted
> to make you aware that I do have a use case for it... look forward to using
> it when it's in place. :o)
>
> On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <
> users@lists.open-mpi.org> wrote:
>
>> Adam -
>>
>> The btl_tcp_links flag does not currently work (for various reasons) in
>> the 2.x and 3.x series.  It’s on my todo list to fix, but I’m not sure it
>> will get done before the 3.0.0 release.  Part of the reason that it hasn’t
>> been a priority is that most applications (outside of benchmarks) don’t
>> benefit from the 20 Gbps between rank pairs, as they are generally talking
>> to multiple peers at once (and therefore can drive the full 20 Gbps).  It’s
>> definitely on our roadmap, but can’t promise a release just yet.
>>
>> Brian
>>
>> On Jul 12, 2017, at 11:44 AM, Adam Sylvester  wrote:
>>
>> I switched over to X1 instances in AWS which have 20 Gbps connectivity.
>> Using iperf3, I'm seeing 11.1 Gbps between them with just one port.  iperf3
>> supports a -P option which will connect using multiple ports...  Setting
>> this to use in the range of 5-20 ports (there's some variability from run
>> to run), I can get in the range of 18 Gbps aggregate which for a real world
>> speed seems pretty good.
>>
>> Using mpirun with the previously-suggested btl_tcp_sndbuf and
>> btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps.  So, pretty close to
>> iperf with just one port (makes sense there'd be some overhead with MPI).
>> My understanding of the btl_tcp_links flag that Gilles mentioned is that it
>> should be analogous to iperf's -P flag - it should connect with multiple
>> ports in the hopes of improving the aggregate bandwidth.
>>
>> If that's what this flag is supposed to do, it does not appear to be
>> working properly for me.  With lsof, I can see the expected number of ports
>> show up when I run iperf.  However, with MPI I only ever see three
>> connections between the two machines - sshd, orted, and my actual
>> application.  No matter what I set btl_tcp_links to, I don't see any
>> additional ports show up (or any change in performance).
>>
>> Am I misunderstanding what this flag does or is there a bug here?  If I
>> am misunderstanding the flag's intent, is there a different flag that would
>> allow Open MPI to use multiple ports similar to what iperf is doing?
>>
>> Thanks.
>> -Adam
>>
>> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester  wrote:
>>
>>> Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the
>>> config file way to set these parameters... it'll be easy to bake this into
>>> my AMI so that I don't have to set them each time while waiting for the
>>> next Open MPI release.
>>>
>>> Out of mostly lazines

Re: [OMPI users] OpenMPI behavior with Ialltoall and GPUs

2019-03-14 Thread Adam Sylvester
FYI for others that have run into the same problem, see
https://github.com/openucx/ucx/issues/3359.  In short:
1. Use UCX 1.5 rather than 1.4 (I recommend updating
https://www.open-mpi.org/faq/?category=buildcuda)
2. Dynamically link in the cudart library (by default nvcc will statically
link it).  Future UCX versions will fix a lingering bug that makes this
required currently.
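For item 2, the build change amounts to something like this (a hedged example; file names are placeholders, and --cudart=shared is the nvcc option that selects the shared cudart):

nvcc --cudart=shared -o my_app my_app.cu

or, when the final link is done with mpicxx, linking against libcudart.so explicitly instead of the static cudart that nvcc pulls in by default.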

With these changes, I was able to successfully run my application.

On Sun, Mar 3, 2019 at 9:49 AM Adam Sylvester  wrote:

> I'm running OpenMPI 4.0.0 built with gdrcopy 1.3 and UCX 1.4 per the
> instructions at https://www.open-mpi.org/faq/?category=buildcuda, built
> against CUDA 10.0 on RHEL 7.  I'm running on a p2.xlarge instance in AWS
> (single NVIDIA K80 GPU).  OpenMPI reports CUDA support:
> $ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
> mca:mpi:base:param:mpi_built_with_cuda_support:value:true
>
> I'm attempting to use MPI_Ialltoall() to overlap a block of GPU
> computations with network transfers, using MPI_Test() to nudge async
> transfers along.  Based on the table 5 I see in
> https://www.open-mpi.org/faq/?category=runcuda, MPI_Ialltoall() should be
> supported (though I don't see MPI_Test() called out as supported or not
> supported... though my example crashes with or without it).  The behavior
> I'm seeing is that when running with a small number of elements, everything
> runs without issue.  However, for a larger number of elements (where
> "large" is just a few hundred), I start to get errors like this
> "cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 16000,
> error message Bad address".  Changing to synchronous MPI_alltoall() results
> in the program running successfully.
>
> I tried boiling my issue down to the simplest problem I could that
> recreates the crash.  Note that this needs to be compiled with
> "--std=c++11".  Running "mpirun -np 2 mpi_test_ialltoall 200 256 10" runs
> successfully; changing the 200 to a 400 results in a crash after a few
> blocks.  Thanks for any thoughts.
>
> Code sample:
> https://gist.github.com/asylvest/7c9d5c15a3a044a0a2338cf9c828d2c3
>
> -Adam

[OMPI users] OpenMPI behavior with Ialltoall and GPUs

2019-03-03 Thread Adam Sylvester
I'm running OpenMPI 4.0.0 built with gdrcopy 1.3 and UCX 1.4 per the
instructions at https://www.open-mpi.org/faq/?category=buildcuda, built
against CUDA 10.0 on RHEL 7.  I'm running on a p2.xlarge instance in AWS
(single NVIDIA K80 GPU).  OpenMPI reports CUDA support:
$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true

I'm attempting to use MPI_Ialltoall() to overlap a block of GPU
computations with network transfers, using MPI_Test() to nudge async
transfers along.  Based on the table 5 I see in
https://www.open-mpi.org/faq/?category=runcuda, MPI_Ialltoall() should be
supported (though I don't see MPI_Test() called out as supported or not
supported... though my example crashes with or without it).  The behavior
I'm seeing is that when running with a small number of elements, everything
runs without issue.  However, for a larger number of elements (where
"large" is just a few hundred), I start to get errors like this
"cma_ep.c:113  UCX  ERROR process_vm_readv delivered 0 instead of 16000,
error message Bad address".  Changing to synchronous MPI_alltoall() results
in the program running successfully.
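The overlap pattern being described is roughly the following (a minimal sketch with host buffers just to show the MPI_Ialltoall/MPI_Test structure - in the failing case the buffers are CUDA device allocations, and the GPU work is a placeholder comment):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int numRanks = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);

    const int elementsPerRank = 256;   // placeholder block size
    std::vector<float> sendBuf(numRanks * elementsPerRank, 1.0f);
    std::vector<float> recvBuf(numRanks * elementsPerRank);

    MPI_Request request;
    MPI_Ialltoall(sendBuf.data(), elementsPerRank, MPI_FLOAT,
                  recvBuf.data(), elementsPerRank, MPI_FLOAT,
                  MPI_COMM_WORLD, &request);

    int done = 0;
    while (!done)
    {
        // ... launch / wait on a block of GPU computation here ...
        MPI_Test(&request, &done, MPI_STATUS_IGNORE);  // nudge the transfer along
    }

    MPI_Finalize();
    return 0;
}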

I tried boiling my issue down to the simplest problem I could that
recreates the crash.  Note that this needs to be compiled with
"--std=c++11".  Running "mpirun -np 2 mpi_test_ialltoall 200 256 10" runs
successfully; changing the 200 to a 400 results in a crash after a few
blocks.  Thanks for any thoughts.

Code sample:
https://gist.github.com/asylvest/7c9d5c15a3a044a0a2338cf9c828d2c3

-Adam

Re: [OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-21 Thread Adam Sylvester
I did some additional profiling of my code.  While the application uses 10
ranks, this particular image breaks into two totally independent pieces,
and we split the world communicator, so really this section of code is
using 5 ranks.

There was a ~16 GB allocation buried way inside several layers of classes
that I was not tracking in my total memory calculations that was part of
the issue... obviously nothing to do with OpenMPI.

For the MPI_Allgatherv() stage, there is ~13 GB of data spread roughly
evenly across 5 ranks that we're gathering via MPI_Allgatherv().  During
that function call, I see 6-7 GB extra allocated which must be due to the
underlying buffers used for transfer.  I tried PMPI_Allgatherv() followed
by MPI_Barrier() but I saw the same 6-7 GB spike.  Examining the code more
closely, there is a way I can rearchitect this to send less data across the
ranks (each rank really just needs several rows above and below itself, not
the entire global data).

So, I think I'm set for now - thanks for the help.
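For reference, the wrapper Gilles describes below amounts to something like this (a minimal sketch using the standard MPI profiling interface; not the actual application code):

#include <mpi.h>

// Intercept MPI_Allgatherv: run the real (PMPI) call surrounded by barriers so
// that unexpected messages get drained before the application moves on.
int MPI_Allgatherv(const void* sendbuf, int sendcount, MPI_Datatype sendtype,
                   void* recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm)
{
    MPI_Barrier(comm);
    const int rc = PMPI_Allgatherv(sendbuf, sendcount, sendtype,
                                   recvbuf, recvcounts, displs, recvtype, comm);
    MPI_Barrier(comm);
    return rc;
}

Defining this in the application overrides the library's MPI_Allgatherv symbol (assuming Open MPI was built with the profiling layer enabled, which is the default), so every existing call site picks up the barriers without other changes.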

On Thu, Dec 20, 2018 at 7:49 PM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> you can rewrite MPI_Allgatherv() in your app. It should simply invoke
> PMPI_Allgatherv() (note the leading 'P') with the same arguments
> followed by MPI_Barrier() in the same communicator (feel free to also
> MPI_Barrier() before PMPI_Allgatherv()).
> That can make your code slower, but it will force the unexpected
> messages related to allgatherv to be received.
> If it helps with respect to memory consumption, that means we have a lead
>
> Cheers,
>
> Gilles
>
> On Fri, Dec 21, 2018 at 5:00 AM Jeff Hammond 
> wrote:
> >
> > You might try replacing MPI_Allgatherv with the equivalent Send+Recv
> followed by Broadcast.  I don't think MPI_Allgatherv is particularly
> optimized (since it is hard to do and not a very popular function) and it
> might improve your memory utilization.
> >
> > Jeff
> >
> > On Thu, Dec 20, 2018 at 7:08 AM Adam Sylvester  wrote:
> >>
> >> Gilles,
> >>
> >> It is btl/tcp (we'll be upgrading to newer EC2 types next year to take
> advantage of libfabric).  I need to write a script to log and timestamp the
> memory usage of the process as reported by /proc/<pid>/stat and sync that
> up with the application's log of what it's doing to say this definitively,
> but based on what I've watched on 'top' so far, I think where these big
> allocations are happening are two areas where I'm doing MPI_Allgatherv() -
> every rank has roughly 1/numRanks of the data (but not divided exactly
> evenly so need to use MPI_Allgatherv)... the ranks are reusing that
> pre-allocated buffer to store their local results and then pass that same
> pre-allocated buffer into MPI_Allgatherv() to bring results in from all
> ranks.  So, there is a lot of communication across all ranks at these
> points.  So, does your comment about using the coll/sync module apply in
> this case?  I'm not familiar with this module - is this something I specify
> at OpenMPI compile time or a runtime option that
>   I enable?
> >>
> >> Thanks for the detailed help.
> >> -Adam
> >>
> >> On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >>>
> >>> Adam,
> >>>
> >>> Are you using btl/tcp (e.g. plain TCP/IP) for internode communications
> >>> ? Or are you using libfabric on top of the latest EC2 drivers ?
> >>>
> >>> There is no control flow in btl/tcp, which means for example if all
> >>> your nodes send messages to rank 0, that can create a lot of
> >>> unexpected messages on that rank.
> >>> In the case of btl/tcp, this means a lot of malloc() on rank 0, until
> >>> these messages are received by the app.
> >>> If rank 0 is overflowed, then that will likely end up in the node
> >>> swapping to death (or killing your app if you have little or no swap).
> >>>
> >>> If you are using collective operations, make sure the coll/sync module
> >>> is selected.
> >>> This module inserts MPI_Barrier() every n collectives on a given
> >>> communicator. This forces your processes to synchronize and can force
> >>> messages to be received. (Think of the previous example if you run
> >>> MPI_Scatter(root=0) in a loop)
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester 
> wrote:
> >>> >
> >>> > This case is actually quite small - 10 physical machines with 18
> physical cores each, 1

Re: [OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-20 Thread Adam Sylvester
Gilles,

It is btl/tcp (we'll be upgrading to newer EC2 types next year to take
advantage of libfabric).  I need to write a script to log and timestamp the
memory usage of the process as reported by /proc/<pid>/stat and sync that
up with the application's log of what it's doing to say this definitively,
but based on what I've watched on 'top' so far, I think where these big
allocations are happening are two areas where I'm doing MPI_Allgatherv() -
every rank has roughly 1/numRanks of the data (but not divided exactly
evenly so need to use MPI_Allgatherv)... the ranks are reusing that
pre-allocated buffer to store their local results and then pass that same
pre-allocated buffer into MPI_Allgatherv() to bring results in from all
ranks.  So, there is a lot of communication across all ranks at these
points.  So, does your comment about using the coll/sync module apply in
this case?  I'm not familiar with this module - is this something I specify
at OpenMPI compile time or a runtime option that I enable?

Thanks for the detailed help.
-Adam

On Thu, Dec 20, 2018 at 9:41 AM Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> Are you using btl/tcp (e.g. plain TCP/IP) for internode communications
> ? Or are you using libfabric on top of the latest EC2 drivers ?
>
> There is no control flow in btl/tcp, which means for example if all
> your nodes send messages to rank 0, that can create a lot of
> unexpected messages on that rank.
> In the case of btl/tcp, this means a lot of malloc() on rank 0, until
> these messages are received by the app.
> If rank 0 is overflowed, then that will likely end up in the node
> swapping to death (or killing your app if you have little or no swap).
>
> If you are using collective operations, make sure the coll/sync module
> is selected.
> This module inserts MPI_Barrier() every n collectives on a given
> communicator. This forces your processes to synchronize and can force
> messages to be received. (Think of the previous example if you run
> MPI_Scatter(root=0) in a loop)
>
> Cheers,
>
> Gilles
>
> On Thu, Dec 20, 2018 at 11:06 PM Adam Sylvester  wrote:
> >
> > This case is actually quite small - 10 physical machines with 18
> physical cores each, 1 rank per machine.  These are AWS R4 instances (Intel
> Xeon E5 Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).
> >
> > I calculate the memory needs of my application upfront (in this case
> ~225 GB per machine), allocate one buffer upfront, and reuse this buffer
> for valid and scratch throughout processing.  This is running on RHEL 7 -
> I'm measuring memory usage via top where I see it go up to 248 GB in an
> MPI-intensive portion of processing.
> >
> > I thought I was being quite careful with my memory allocations and there
> weren't any other stray allocations going on, but of course it's possible
> there's a large temp buffer somewhere that I've missed... based on what
> you're saying, this is way more memory than should be attributed to OpenMPI
> - is there a way I can query OpenMPI to confirm that?  If the OS is unable
> to keep up with the network traffic, is it possible there's some low-level
> system buffer that gets allocated to gradually work off the TCP traffic?
> >
> > Thanks.
> >
> > On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> How many nodes are you using? How many processes per node? What kind of
> processor? Open MPI version? 25 GB is several orders of magnitude more
> memory than should be used except at extreme scale (1M+ processes). Also,
> how are you calculating memory usage?
> >>
> >> -Nathan
> >>
> >> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester  wrote:
> >> >
> >> > Is there a way at runtime to query OpenMPI to ask it how much memory
> it's using for internal buffers?  Is there a way at runtime to set a max
> amount of memory OpenMPI will use for these buffers?  I have an application
> where for certain inputs OpenMPI appears to be allocating ~25 GB and I'm
> not accounting for this in my memory calculations (and thus bricking the
> machine).
> >> >
> >> > Thanks.
> >> > -Adam

Re: [OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-20 Thread Adam Sylvester
This case is actually quite small - 10 physical machines with 18 physical
cores each, 1 rank per machine.  These are AWS R4 instances (Intel Xeon E5
Broadwell processors).  OpenMPI version 2.1.0, using TCP (10 Gbps).

I calculate the memory needs of my application upfront (in this case ~225
GB per machine), allocate one buffer upfront, and reuse this buffer for
valid and scratch throughout processing.  This is running on RHEL 7 - I'm
measuring memory usage via top where I see it go up to 248 GB in an
MPI-intensive portion of processing.

I thought I was being quite careful with my memory allocations and there
weren't any other stray allocations going on, but of course it's possible
there's a large temp buffer somewhere that I've missed... based on what
you're saying, this is way more memory than should be attributed to OpenMPI
- is there a way I can query OpenMPI to confirm that?  If the OS is unable
to keep up with the network traffic, is it possible there's some low-level
system buffer that gets allocated to gradually work off the TCP traffic?

Thanks.

On Thu, Dec 20, 2018 at 8:32 AM Nathan Hjelm via users <
users@lists.open-mpi.org> wrote:

> How many nodes are you using? How many processes per node? What kind of
> processor? Open MPI version? 25 GB is several orders of magnitude more
> memory than should be used except at extreme scale (1M+ processes). Also,
> how are you calculating memory usage?
>
> -Nathan
>
> > On Dec 20, 2018, at 4:49 AM, Adam Sylvester  wrote:
> >
> > Is there a way at runtime to query OpenMPI to ask it how much memory
> it's using for internal buffers?  Is there a way at runtime to set a max
> amount of memory OpenMPI will use for these buffers?  I have an application
> where for certain inputs OpenMPI appears to be allocating ~25 GB and I'm
> not accounting for this in my memory calculations (and thus bricking the
> machine).
> >
> > Thanks.
> > -Adam

[OMPI users] Querying/limiting OpenMPI memory allocations

2018-12-20 Thread Adam Sylvester
Is there a way at runtime to query OpenMPI to ask it how much memory it's
using for internal buffers?  Is there a way at runtime to set a max amount
of memory OpenMPI will use for these buffers?  I have an application where
for certain inputs OpenMPI appears to be allocating ~25 GB and I'm not
accounting for this in my memory calculations (and thus bricking the
machine).

Thanks.
-Adam

[OMPI users] Limit to number of asynchronous sends/receives?

2018-12-16 Thread Adam Sylvester
I'm running OpenMPI 2.1.0 on RHEL 7 using TCP communication.  For the
specific run that's crashing on me, I'm running with 17 ranks (on 17
different physical machines).  I've got a stage in my application where
ranks need to transfer chunks of data where the size of each chunk is
trivial (on the order of 100 MB) compared to the overall imagery.  However,
the chunks are spread out across many buffers in a way that makes the
indexing complicated (and the memory is not all within a single buffer)...
the simplest way to express the data movement in code is by a large number
of MPI_Isend() and MPI_Irecv() calls followed of course by an eventual
MPI_Waitall().  This works fine for many cases, but I've run into a case
now where the chunks are imbalanced such that a few ranks have a total of
~450 MPI_Request objects (I do a single MPI_Waitall() with all requests at
once) and the remaining ranks have < 10 MPI_Requests.  In this scenario, I
get a seg fault inside PMPI_Waitall().
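For context, the pattern in question is roughly the following (a heavily simplified sketch; Chunk is a placeholder for the application's real bookkeeping, and the complicated indexing is omitted):

#include <mpi.h>
#include <vector>
#include <cstddef>

struct Chunk   // placeholder: where each piece lives and who it goes to / comes from
{
    void* data;
    int numBytes;
    int peerRank;
    int tag;
};

// Post every transfer up front, then wait on all of them at once.
void exchangeChunks(const std::vector<Chunk>& sendChunks,
                    const std::vector<Chunk>& recvChunks)
{
    std::vector<MPI_Request> requests(sendChunks.size() + recvChunks.size());
    std::size_t idx = 0;

    for (const Chunk& chunk : recvChunks)
    {
        MPI_Irecv(chunk.data, chunk.numBytes, MPI_BYTE, chunk.peerRank,
                  chunk.tag, MPI_COMM_WORLD, &requests[idx++]);
    }
    for (const Chunk& chunk : sendChunks)
    {
        MPI_Isend(chunk.data, chunk.numBytes, MPI_BYTE, chunk.peerRank,
                  chunk.tag, MPI_COMM_WORLD, &requests[idx++]);
    }

    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
}

With ~450 outstanding requests on a few of the ranks, this is the MPI_Waitall() that seg faults.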

Is there an implementation limit as to how many asynchronous requests are
allowed?  Is there a way this can be queried either via a #define value or
runtime call?  I probably won't go this route, but when initially compiling
OpenMPI, is there a configure option to increase it?

I've done a fair amount of debugging and am pretty confident this is where
the error is occurring as opposed to indexing out of bounds somewhere, but
if there is no such limit in OpenMPI, that would be useful to know too.

Thanks.
-Adam

Re: [OMPI users] Locking down TCP ports used

2018-07-07 Thread Adam Sylvester
Aha - thanks Ralph!  I'll give that a shot.

-Adam
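For reference, combining the OOB and BTL parameters Ralph lists below would look something like this (the port numbers are made up - pick whatever range the firewall allows):

mpirun --mca oob_tcp_static_ports 10000-10100 --mca btl_tcp_port_min_v4 10200 --mca btl_tcp_port_range_v4 100 -N 1 --hostfile hosts.txt /path/to/my/application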

On Sat, Jul 7, 2018 at 10:36 AM, r...@open-mpi.org  wrote:

> I suspect the OOB is working just fine and you are seeing the TCP/btl
> opening the other ports. There are two TCP elements at work here: the OOB
> (which sends management messages between daemons) and the BTL (which
> handles the MPI traffic). In addition to what you provided, you also need
> to provide the following params:
>
> btl_tcp_port_range_v4: The number of ports where the TCP BTL will try to bind.
>   This parameter, together with the port min, defines a range of ports
>   where Open MPI will open sockets.
>
> btl_tcp_port_min_v4: starting port to use
>
> I can’t answer the question about #ports to open - will have to leave that
> to someone else
> Ralph
>
> > On Jul 7, 2018, at 6:34 AM, Adam Sylvester  wrote:
> >
> > I'm using OpenMPI 2.1.0 on RHEL 7, communicating between ranks via TCP
> >
> > I have a new cluster to install my application on with
> tightly-controlled firewalls.  I can have them open up a range of TCP ports
> which MPI can communicate over.  I thought I could force MPI to stick to a
> range of ports via "--mca oob_tcp_static_ports startPort-endPort" but this
> doesn't seem to be working; I still see MPI opening up TCP ports outside
> of this range to communicate.  I've also seen "--mca oob_tcp_dynamic_ports"
> on message boards; I'm not sure what the difference is between these two
> but this flag doesn't seem to do what I want either.
> >
> > Is there a way to lock the TCP port range down?  As a general rule of
> thumb, if I'm communicating between up to 50 instances on a 10 Gbps network
> moving at several painful spots in the chain hundreds of GBs of data
> around, how large should I make this port range (i.e. if Open MPI would
> normally open a bunch of ports on each machine to improve the network
> transfer speed, I don't want to slow it down by allowing it too narrow of a
> port range).  Just need a rough order of magnitude - 10 ports, 100 ports,
> 1000 ports?
> >
> > Thanks!
> > -Adam

[OMPI users] Locking down TCP ports used

2018-07-07 Thread Adam Sylvester
I'm using OpenMPI 2.1.0 on RHEL 7, communicating between ranks via TCP

I have a new cluster to install my application on with tightly-controlled
firewalls.  I can have them open up a range of TCP ports which MPI can
communicate over.  I thought I could force MPI to stick to a range of ports
via "--mca oob_tcp_static_ports startPort-endPort" but this doesn't seem to
be working; I still see MPI opening up TCP ports outside of this range to
communicate.  I've also seen "--mca oob_tcp_dynamic_ports" on message
boards; I'm not sure what the difference is between these two but this flag
doesn't seem to do what I want either.

Is there a way to lock the TCP port range down?  As a general rule of
thumb, if I'm communicating between up to 50 instances on a 10 Gbps network
moving at several painful spots in the chain hundreds of GBs of data
around, how large should I make this port range (i.e. if Open MPI would
normally open a bunch of ports on each machine to improve the network
transfer speed, I don't want to slow it down by allowing it too narrow of a
port range).  Just need a rough order of magnitude - 10 ports, 100 ports,
1000 ports?

Thanks!
-Adam

Re: [OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
Ah... thanks Gilles.  That makes sense.  I was stuck thinking there was
an ssh problem on rank 0; it never occurred to me mpirun was doing
something clever there and that those ssh errors were from a different
instance altogether.

It's no problem to put my private key on all instances - I'll go that route.

-Adam

On Mon, Feb 12, 2018 at 7:12 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> by default, when more than 64 hosts are involved, mpirun uses a tree
> spawn in order to remote launch the orted daemons.
>
> That means you have two options here :
>  - allow all compute nodes to ssh each other (e.g. the ssh private key
> of *all* the nodes should be in *all* the authorized_keys
>  - do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true
> ...)
>
> I recommend the first option, otherwise mpirun would fork a large
> number of ssh processes and  hence use quite a lot of
> resources on the node running mpirun.
>
> Cheers,
>
> Gilles
>
> On Tue, Feb 13, 2018 at 8:23 AM, Adam Sylvester <op8...@gmail.com> wrote:
> > I'm running OpenMPI 2.1.0, built from source, on RHEL 7.  I'm using the
> > default ssh-based launcher, where I have my private ssh key on rank 0 and
> > the associated public key on all ranks.  I create a hosts file with a
> list
> > of unique IPs, with the host that I'm running mpirun from on the first
> line,
> > and run this command:
> >
> > mpirun -N 1 --bind-to none --hostfile hosts.txt hostname
> >
> > This works fine up to 64 machines.  At 65 or greater, I get ssh errors.
> > Frequently
> >
> > Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
> >
> > though today another user got
> >
> > Host key verification failed.
> >
> > I have confirmed I can successfully manually ssh into these instances.
> I've
> > also written a loop in bash which will background an ssh sleep command
> to >
> > 64 instances and this succeeds.
> >
> > From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
> > connections have to do with inbound, not outbound limits, and I can
> prove by
> > running straight ssh commands that I'm not hitting a limit.
> >
> > Is there something wrong with my mpirun syntax (I've run this way
> thousands
> > of times without issues with fewer than 64 hosts, and I know MPI is
> > frequently used on orders of magnitudes more hosts than this)?  Or is
> this a
> > known bug that's addressed in a later MPI release?
> >
> > Thanks for the help.
> > -Adam

[OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
I'm running OpenMPI 2.1.0, built from source, on RHEL 7.  I'm using the
default ssh-based launcher, where I have my private ssh key on rank 0 and
the associated public key on all ranks.  I create a hosts file with a list
of unique IPs, with the host that I'm running mpirun from on the first
line, and run this command:

mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines.  At 65 or greater, I get ssh errors.
Frequently

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

Host key verification failed.

I have confirmed I can successfully manually ssh into these instances.
I've also written a loop in bash which will background an ssh sleep command
to > 64 instances and this succeeds.

From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
connections have to do with inbound, not outbound limits, and I can prove
by running straight ssh commands that I'm not hitting a limit.

Is there something wrong with my mpirun syntax (I've run this way thousands
of times without issues with fewer than 64 hosts, and I know MPI is
frequently used on orders of magnitudes more hosts than this)?  Or is this
a known bug that's addressed in a later MPI release?

Thanks for the help.
-Adam

[OMPI users] Tracking Open MPI memory usage

2017-11-26 Thread Adam Sylvester
I have an application running across 20 machines where each machine has 60
GB RAM.  For some large inputs, some ranks require 45-50 GB RAM.  The
behavior I'm seeing is that for some of these large cases, my application
will run for 10-15 minutes and then one rank will be killed; based on
watching top in the past, the application's memory usage gradually
increases until it eventually hits 60 GB and is killed (presumably by the
OOM killer).

There are a few possibilities that come to mind...
1. While I compute all memory requirements upfront and allocate one large
ping/pong buffer to reuse throughout the application, there are some other
(believed to be small) allocations here and there.  For large inputs, some
of these may not be quite as small as I think.
2. There's a memory leak.
3. Open MPI is allocating very large buffers for transferring data,
potentially because throughout the application I am *not* using synchronous
sends.

I can track down 1 and 2, but I'm wondering if there's some kind of
debug/logging mode I can run in to see Open MPI's buffer management.  All I
really care about is the total amount of memory it allocates, but if I need
to parse a list of buffers and sizes to infer the total allocation size,
that's fine.
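Not a view into Open MPI's internal buffers specifically, but for logging the total, a minimal sketch of sampling the process's resident set from /proc/self/status (Linux-only; this measures the whole process, MPI buffers included):

#include <fstream>
#include <string>

// Return the current resident set size (VmRSS) in kB, or -1 if it can't be read.
long currentRssKb()
{
    std::ifstream status("/proc/self/status");
    std::string line;
    while (std::getline(status, line))
    {
        if (line.compare(0, 6, "VmRSS:") == 0)
        {
            return std::stol(line.substr(6));   // value is reported in kB
        }
    }
    return -1;
}

Logging this before and after the suspect MPI calls at least narrows down where the growth happens.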

Thanks for the help.
-Adam

Re: [OMPI users] Forcing MPI processes to end

2017-11-17 Thread Adam Sylvester
Thanks - that's exactly what I needed!  Works as advertised. :o)
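For reference, the fix Aurelien describes below amounts to something like this on the error path (a minimal sketch; the logging is a placeholder for the application's real error reporting):

#include <mpi.h>
#include <cstdio>

// On a legitimate per-rank error, tear down the whole job rather than
// finalizing just this rank (which can leave the others stuck in MPI_Gather).
void failHard(const char* message)
{
    std::fprintf(stderr, "Error: %s\n", message);
    MPI_Abort(MPI_COMM_WORLD, 1);   // terminates all ranks with exit code 1
}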

On Thu, Nov 16, 2017 at 1:27 PM, Aurelien Bouteiller <boute...@icl.utk.edu>
wrote:

> Adam, your MPI program is incorrect. You need to replace the MPI_Finalize() on
> the process that found the error with MPI_Abort().
>
> On Nov 16, 2017 10:38, "Adam Sylvester" <op8...@gmail.com> wrote:
>
>> I'm using Open MPI 2.1.0 for this but I'm not sure if this is more of an
>> Open MPI-specific implementation question or what the MPI standard
>> guarantees.
>>
>> I have an application which runs across multiple ranks, eventually
>> reaching an MPI_Gather() call.  Along the way, if one of the ranks
>> encounters an error, it will report the error to a log, call
>> MPI_Finalize(), and exit with a non-zero return code.  If this happens
>> prior to the other ranks making it to the gather, it seems like mpirun
>> notices this and the process ends on all ranks.  This is what I want to
>> happen - it's a legitimate error, so all processes should be freed up so
>> the next job can run.  It seems like if the other ranks make it into the
>> MPI_Gather() before the one rank reports an error, the other ranks wait in
>> the MPI_Gather() forever.
>>
>> Is there something simple I can do to guarantee that if any process calls
>> MPI_Finalize(), all my ranks terminate?
>>
>> Thanks.
>> -Adam

[OMPI users] NUMA interaction with Open MPI

2017-07-16 Thread Adam Sylvester
I'll start with my question upfront: Is there a way to do the equivalent of
telling mpirun to do 'numactl --interleave=all' on the processes that it
runs?  Or if I want to control the memory placement of my applications run
through MPI will I need to use libnuma for this?  I tried doing "mpirun
<mpirun args> numactl --interleave=all <application>".  I
don't know how to explicitly verify if this ran the numactl command on each
host or not but based on the performance I'm seeing, it doesn't seem like
it did (or something else is causing my poor performance).
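For the libnuma route mentioned above, the call at startup would look roughly like this (a sketch only - it assumes libnuma v2 is installed and the application is linked with -lnuma):

#include <numa.h>
#include <cstdio>

// Ask the kernel to interleave this process's future allocations across all
// NUMA nodes - roughly what running under 'numactl --interleave=all' does.
void interleaveAllNodes()
{
    if (numa_available() < 0)
    {
        std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
        return;
    }
    numa_set_interleave_mask(numa_all_nodes_ptr);
}

Calling this early in main(), before the big ping/pong buffer is allocated, only affects memory allocated afterwards.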

More details: For the particular image I'm benchmarking with, I have a
multi-threaded application which requires 60 GB of RAM to run if it's run
on one machine.  It allocates one large ping/pong buffer upfront and uses
this to avoid copies when updating the image at each step.  I'm running in
AWS and comparing performance on an r3.8xlarge (16 CPUs, 244 GB RAM, 10
Gbps) vs. an x1.32xlarge (64 CPUs, 2 TB RAM, 20 Gbps).  Running on a single
X1, my application runs ~3x faster than the R3; using numactl
--interleave=all has a significant positive effect on its performance - I
assume because the various threads that are running are accessing memory
spread out across the nodes rather than most of them having slow access to
it.  So far so good.

My application also supports distributing across machines via MPI.  When
doing this, the memory requirement scales linearly with the number of
machines; there are three pinch points that involve large (GBs of data)
all-to-all communication.  For the slowest of these three, I've pipelined
this step and use MPI_Ialltoallv() to hide as much of the latency as I
can.  When run on R3 instances, overall runtime scales very well as
machines are added.  Still so far so good.

My problems start with the X1 instances.  I do get scaling as I add more
machines, but it is significantly worse than with the R3s.  This isn't just
a matter of there being more CPUs and the MPI communication time
dominating.  The actual time spent in the MPI all-to-all communication is
significantly longer than on the R3s for the same number of machines,
despite the network bandwidth being twice as high (in a post from a few
days ago some folks helped me with MPI settings to improve the network
communication speed - from toy benchmark MPI tests I know I'm getting
faster communication on the X1s than on the R3s), so this feels likely to be
an issue with NUMA, though I'd be interested in any other thoughts.

I looked at https://www.open-mpi.org/doc/current/man1/mpirun.1.php but this
didn't seem to have what I was looking for.  I want MPI to let my
application use all CPUs on the system (I'm the only one running on it)...
I just want to control the memory placement.

Thanks for the help.
-Adam

Re: [OMPI users] Network performance over TCP

2017-07-13 Thread Adam Sylvester
Bummer - thanks for the info Brian.

As an FYI, I do have a real world use case for this faster connectivity
(i.e. beyond just a benchmark).  While my application will happily gobble
up and run on however many machines it's given, there's a resource manager
that lives on top of everything that doles out machines to applications.
So there will be cases where my application will only get two machines to
run and so I'd still like the big data transfers to happen as quickly as
possible.  I agree that when there are many ranks all talking to each
other, I should hopefully get closer to the full 20 Gbps.

I appreciate that you have a number of other higher priorities, but wanted
to make you aware that I do have a use case for it... look forward to using
it when it's in place. :o)

On Wed, Jul 12, 2017 at 2:18 PM, Barrett, Brian via users <
users@lists.open-mpi.org> wrote:

> Adam -
>
> The btl_tcp_links flag does not currently work (for various reasons) in
> the 2.x and 3.x series.  It’s on my todo list to fix, but I’m not sure it
> will get done before the 3.0.0 release.  Part of the reason that it hasn’t
> been a priority is that most applications (outside of benchmarks) don’t
> benefit from the 20 Gbps between rank pairs, as they are generally talking
> to multiple peers at once (and therefore can drive the full 20 Gbps).  It’s
> definitely on our roadmap, but can’t promise a release just yet.
>
> Brian
>
> On Jul 12, 2017, at 11:44 AM, Adam Sylvester <op8...@gmail.com> wrote:
>
> I switched over to X1 instances in AWS which have 20 Gbps connectivity.
> Using iperf3, I'm seeing 11.1 Gbps between them with just one port.  iperf3
> supports a -P option which will connect using multiple ports...  Setting
> this to use in the range of 5-20 ports (there's some variability from run
> to run), I can get in the range of 18 Gbps aggregate which for a real world
> speed seems pretty good.
>
> Using mpirun with the previously-suggested btl_tcp_sndbuf and
> btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps.  So, pretty close to
> iperf with just one port (makes sense there'd be some overhead with MPI).
> My understanding of the btl_tcp_links flag that Gilles mentioned is that it
> should be analogous to iperf's -P flag - it should connect with multiple
> ports in the hopes of improving the aggregate bandwidth.
>
> If that's what this flag is supposed to do, it does not appear to be
> working properly for me.  With lsof, I can see the expected number of ports
> show up when I run iperf.  However, with MPI I only ever see three
> connections between the two machines - sshd, orted, and my actual
> application.  No matter what I set btl_tcp_links to, I don't see any
> additional ports show up (or any change in performance).
>
> Am I misunderstanding what this flag does or is there a bug here?  If I am
> misunderstanding the flag's intent, is there a different flag that would
> allow Open MPI to use multiple ports similar to what iperf is doing?
>
> Thanks.
> -Adam
>
> On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:
>
>> Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the config
>> file way to set these parameters... it'll be easy to bake this into my AMI
>> so that I don't have to set them each time while waiting for the next Open
>> MPI release.
>>
>> Out of mostly laziness I try to keep to the formal releases rather than
>> applying patches myself, but thanks for the link to it (the commit comments
>> were useful to understand why this improved performance).
>>
>> -Adam
>>
>> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp>
>> wrote:
>>
>>> Adam,
>>>
>>>
>>> Thanks for letting us know your performance issue has been resolved.
>>>
>>>
>>> yes, https://www.open-mpi.org/faq/?category=tcp is the best place to
>>> look for this kind of information.
>>>
>>> i will add a reference to these parameters. i will also ask folks at AWS
>>> if they have additional/other recommendations.
>>>
>>>
>>> note you have a few options before 2.1.2 (or 3.0.0) is released :
>>>
>>>
>>> - update your system wide config file (/.../etc/openmpi-mca-params.conf)
>>> or user config file
>>>
>>>   ($HOME/.openmpi/mca-params.conf) and add the following lines
>>>
>>> btl_tcp_sndbuf = 0
>>>
>>> btl_tcp_rcvbuf = 0
>>>
>>>
>>> - add the following environment variable to your environment
>>>
>>> export OMPI_MCA_btl_tcp_sndbuf=0
>>>
>>> export OMPI_MCA_btl_tcp_rcvbuf=0
>>&

Re: [OMPI users] Network performance over TCP

2017-07-12 Thread Adam Sylvester
I switched over to X1 instances in AWS which have 20 Gbps connectivity.
Using iperf3, I'm seeing 11.1 Gbps between them with just one port.  iperf3
supports a -P option which will connect using multiple ports...  Setting
this to use in the range of 5-20 ports (there's some variability from run
to run), I can get in the range of 18 Gbps aggregate which for a real world
speed seems pretty good.

Using mpirun with the previously-suggested btl_tcp_sndbuf and
btl_tcp_rcvbuf settings, I'm getting around 10.7 Gbps.  So, pretty close to
iperf with just one port (makes sense there'd be some overhead with MPI).
My understanding of the btl_tcp_links flag that Gilles mentioned is that it
should be analogous to iperf's -P flag - it should connect with multiple
ports in the hopes of improving the aggregate bandwidth.

If that's what this flag is supposed to do, it does not appear to be
working properly for me.  With lsof, I can see the expected number of ports
show up when I run iperf.  However, with MPI I only ever see three
connections between the two machines - sshd, orted, and my actual
application.  No matter what I set btl_tcp_links to, I don't see any
additional ports show up (or any change in performance).

Am I misunderstanding what this flag does or is there a bug here?  If I am
misunderstanding the flag's intent, is there a different flag that would
allow Open MPI to use multiple ports similar to what iperf is doing?

Thanks.
-Adam

On Mon, Jul 10, 2017 at 9:31 PM, Adam Sylvester <op8...@gmail.com> wrote:

> Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the config
> file way to set these parameters... it'll be easy to bake this into my AMI
> so that I don't have to set them each time while waiting for the next Open
> MPI release.
>
> Out of mostly laziness I try to keep to the formal releases rather than
> applying patches myself, but thanks for the link to it (the commit comments
> were useful to understand why this improved performance).
>
> -Adam
>
> On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
>
>> Adam,
>>
>>
>> Thanks for letting us know your performance issue has been resolved.
>>
>>
>> yes, https://www.open-mpi.org/faq/?category=tcp is the best place to
>> look for this kind of information.
>>
>> i will add a reference to these parameters. i will also ask folks at AWS
>> if they have additional/other recommendations.
>>
>>
>> note you have a few options before 2.1.2 (or 3.0.0) is released :
>>
>>
>> - update your system wide config file (/.../etc/openmpi-mca-params.conf)
>> or user config file
>>
>>   ($HOME/.openmpi/mca-params.conf) and add the following lines
>>
>> btl_tcp_sndbuf = 0
>>
>> btl_tcp_rcvbuf = 0
>>
>>
>> - add the following environment variable to your environment
>>
>> export OMPI_MCA_btl_tcp_sndbuf=0
>>
>> export OMPI_MCA_btl_tcp_rcvbuf=0
>>
>>
>> - use Open MPI 2.0.3
>>
>>
>> - last but not least, you can manually download and apply the patch
>> available at
>>
>> https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc
>> 7c4693f9c1ef01dfb69f.patch
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>>
>>> Gilles,
>>>
>>> Thanks for the fast response!
>>>
>>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended
>>> made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of
>>> these flags... with a little Googling, is https://www.open-mpi.org/faq/?
>>> category=tcp the best place to look for this kind of information and
>>> any other tweaks I may want to try (or if there's a better FAQ out there,
>>> please let me know)?
>>> There is only eth0 on my machines so nothing to tweak there (though good
>>> to know for the future). I also didn't see any improvement by specifying
>>> more sockets per instance. But, your initial suggestion had a major impact.
>>> In general I try to stay relatively up to date with my Open MPI version;
>>> I'll be extra motivated to upgrade to 2.1.2 so that I don't have to
>>> remember to set these --mca flags on the command line. :o)
>>> -Adam
>>>
>>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com <mailto:gilles.gouaillar...@gmail.com>>
>>> wrote:
>>>
>>> Adam,
>>>
>>> at first, you need to change the default send and receive socket
>>> buffers :
>>> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_

Re: [OMPI users] Network performance over TCP

2017-07-11 Thread Adam Sylvester
Thanks again Gilles.  Ahh, better yet - I wasn't familiar with the config
file way to set these parameters... it'll be easy to bake this into my AMI
so that I don't have to set them each time while waiting for the next Open
MPI release.

Out of mostly laziness I try to keep to the formal releases rather than
applying patches myself, but thanks for the link to it (the commit comments
were useful to understand why this improved performance).

-Adam

On Mon, Jul 10, 2017 at 12:04 AM, Gilles Gouaillardet <gil...@rist.or.jp>
wrote:

> Adam,
>
>
> Thanks for letting us know your performance issue has been resolved.
>
>
> yes, https://www.open-mpi.org/faq/?category=tcp is the best place to look
> for this kind of information.
>
> i will add a reference to these parameters. i will also ask folks at AWS
> if they have additional/other recommendations.
>
>
> note you have a few options before 2.1.2 (or 3.0.0) is released :
>
>
> - update your system wide config file (/.../etc/openmpi-mca-params.conf)
> or user config file
>
>   ($HOME/.openmpi/mca-params.conf) and add the following lines
>
> btl_tcp_sndbuf = 0
>
> btl_tcp_rcvbuf = 0
>
>
> - add the following environment variable to your environment
>
> export OMPI_MCA_btl_tcp_sndbuf=0
>
> export OMPI_MCA_btl_tcp_rcvbuf=0
>
>
> - use Open MPI 2.0.3
>
>
> - last but not least, you can manually download and apply the patch
> available at
>
> https://github.com/open-mpi/ompi/commit/b64fedf4f652cadc9bfc
> 7c4693f9c1ef01dfb69f.patch
>
>
> Cheers,
>
> Gilles
>
> On 7/9/2017 11:04 PM, Adam Sylvester wrote:
>
>> Gilles,
>>
>> Thanks for the fast response!
>>
>> The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended
>> made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of
>> these flags... with a little Googling, is https://www.open-mpi.org/faq/?
>> category=tcp the best place to look for this kind of information and any
>> other tweaks I may want to try (or if there's a better FAQ out there,
>> please let me know)?
>> There is only eth0 on my machines so nothing to tweak there (though good
>> to know for the future). I also didn't see any improvement by specifying
>> more sockets per instance. But, your initial suggestion had a major impact.
>> In general I try to stay relatively up to date with my Open MPI version;
>> I'll be extra motivated to upgrade to 2.1.2 so that I don't have to
>> remember to set these --mca flags on the command line. :o)
>> -Adam
>>
>> On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com <mailto:gilles.gouaillar...@gmail.com>>
>> wrote:
>>
>> Adam,
>>
>> at first, you need to change the default send and receive socket
>> buffers :
>> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
>> /* note this will be the default from Open MPI 2.1.2 */
>>
>> hopefully, that will be enough to greatly improve the bandwidth for
>> large messages.
>>
>>
>> generally speaking, i recommend you use the latest (e.g. Open MPI
>> 2.1.1) available version
>>
>> how many interfaces can be used to communicate between hosts ?
>> if there is more than one (for example a slow and a fast one), you'd
>> rather only use the fast one.
>> for example, if eth0 is the fast interface, that can be achieved with
>> mpirun --mca btl_tcp_if_include eth0 ...
>>
>> also, you might be able to achieve better results by using more than
>> one socket on the fast interface.
>> for example, if you want to use 4 sockets per interface
>> mpirun --mca btl_tcp_links 4 ...
>>
>>
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com
>> <mailto:op8...@gmail.com>> wrote:
>> > I am using Open MPI 2.1.0 on RHEL 7.  My application has one
>> unavoidable
>> > pinch point where a large amount of data needs to be transferred
>> (about 8 GB
>> > of data needs to be both sent to and received from all other ranks),
>> and I'm
>> > seeing worse performance than I would expect; this step has a
>> major impact
>> > on my overall runtime.  In the real application, I am using
>> MPI_Alltoall()
>> > for this step, but for the purpose of a simple benchmark, I
>> simplified to
>> simply do a single MPI_Send() / MPI_Recv() between two ranks of a 2 GB buffer.
>> 

Re: [OMPI users] Network performance over TCP

2017-07-09 Thread Adam Sylvester
Gilles,

Thanks for the fast response!

The --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 flags you recommended
made a huge difference - this got me up to 5.7 Gb/s! I wasn't aware of
these flags... with a little Googling, is
https://www.open-mpi.org/faq/?category=tcp the best place to look for this
kind of information and any other tweaks I may want to try (or if there's a
better FAQ out there, please let me know)?

There is only eth0 on my machines so nothing to tweak there (though good to
know for the future). I also didn't see any improvement by specifying more
sockets per instance. But, your initial suggestion had a major impact.
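
So, for anyone who finds this thread later, the full command I'm running now
is roughly this (hostfile and application name are just the placeholders from
my earlier mail):

mpirun -N 1 --bind-to none --hostfile hosts.txt \
    --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 my_app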

In general I try to stay relatively up to date with my Open MPI version;
I'll be extra motivated to upgrade to 2.1.2 so that I don't have to
remember to set these --mca flags on the command line. :o)

-Adam

On Sun, Jul 9, 2017 at 9:26 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> at first, you need to change the default send and receive socket buffers :
> mpirun --mca btl_tcp_sndbuf 0 --mca btl_tcp_rcvbuf 0 ...
> /* note this will be the default from Open MPI 2.1.2 */
>
> hopefully, that will be enough to greatly improve the bandwidth for
> large messages.
>
>
> generally speaking, i recommend you use the latest (e.g. Open MPI
> 2.1.1) available version
>
> how many interfaces can be used to communicate between hosts ?
> if there is more than one (for example a slow and a fast one), you'd
> rather only use the fast one.
> for example, if eth0 is the fast interface, that can be achieved with
> mpirun --mca btl_tcp_if_include eth0 ...
>
> also, you might be able to achieve better results by using more than
> one socket on the fast interface.
> for example, if you want to use 4 sockets per interface
> mpirun --mca btl_tcp_links 4 ...
>
>
>
> Cheers,
>
> Gilles
>
> On Sun, Jul 9, 2017 at 10:10 PM, Adam Sylvester <op8...@gmail.com> wrote:
> > I am using Open MPI 2.1.0 on RHEL 7.  My application has one unavoidable
> > pinch point where a large amount of data needs to be transferred (about
> 8 GB
> > of data needs to be both sent to and received from all other ranks), and I'm
> > seeing worse performance than I would expect; this step has a major
> impact
> > on my overall runtime.  In the real application, I am using
> MPI_Alltoall()
> > for this step, but for the purpose of a simple benchmark, I simplified to
> > simply do a single MPI_Send() / MPI_Recv() between two ranks of a 2 GB
> > buffer.
> >
> > I'm running this in AWS with instances that have 10 Gbps connectivity in
> the
> > same availability zone (according to tracepath, there are no hops between
> > them) and MTU set to 8801 bytes.  Doing a non-MPI benchmark of sending
> data
> > directly over TCP between these two instances, I reliably get around 4
> Gbps.
> > Between these same two instances with MPI_Send() / MPI_Recv(), I reliably
> > get around 2.4 Gbps.  This seems like a major performance degradation
> for a
> > single MPI operation.
> >
> > I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings.  I'm
> > connecting between instances via ssh and using I assume TCP for the
> actual
> > network transfer (I'm not setting any special command-line or
> programmatic
> > settings).  The actual command I'm running is:
> > mpirun -N 1 --bind-to none --hostfile hosts.txt my_app
> >
> > Any advice on other things to test or compilation and/or runtime flags to
> > set would be much appreciated!
> > -Adam
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Network performance over TCP

2017-07-09 Thread Adam Sylvester
I am using Open MPI 2.1.0 on RHEL 7.  My application has one unavoidable
pinch point where a large amount of data needs to be transferred (about 8
GB of data needs to be both sent to and received from all other ranks), and I'm
seeing worse performance than I would expect; this step has a major impact
on my overall runtime.  In the real application, I am using MPI_Alltoall()
for this step, but for the purpose of a simple benchmark, I simplified it to
a single MPI_Send() / MPI_Recv() of a 2 GB buffer between two ranks.
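
For concreteness, the kind of benchmark I mean looks roughly like this (the
chunking into two 1 GB sends and the variable names are just for illustration;
a single MPI_CHAR send of the full 2 GB would overflow the int count argument):

#include <mpi.h>
#include <iostream>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Two 1 GB chunks so each MPI_Send/MPI_Recv count stays below INT_MAX
    const int chunkBytes = 1 << 30;
    std::vector<char> buffer(chunkBytes);

    const double start = MPI_Wtime();
    for (int chunk = 0; chunk < 2; ++chunk)
    {
        if (rank == 0)
        {
            MPI_Send(buffer.data(), chunkBytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        }
        else if (rank == 1)
        {
            MPI_Recv(buffer.data(), chunkBytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    if (rank == 1)
    {
        std::cout << "2 GB received in " << MPI_Wtime() - start << " s"
                  << std::endl;
    }

    MPI_Finalize();
    return 0;
}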

I'm running this in AWS with instances that have 10 Gbps connectivity in
the same availability zone (according to tracepath, there are no hops
between them) and MTU set to 8801 bytes.  Doing a non-MPI benchmark of
sending data directly over TCP between these two instances, I reliably get
around 4 Gbps.  Between these same two instances with MPI_Send() /
MPI_Recv(), I reliably get around 2.4 Gbps.  This seems like a major
performance degradation for a single MPI operation.

I compiled Open MPI 2.1.0 with gcc 4.9.1 and default settings.  I'm
connecting between instances via ssh and, I assume, using TCP for the actual
network transfer (I'm not setting any special command-line or programmatic
settings).  The actual command I'm running is:
mpirun -N 1 --bind-to none --hostfile hosts.txt my_app

Any advice on other things to test or compilation and/or runtime flags to
set would be much appreciated!
-Adam
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] How to launch ompi-server?

2017-05-28 Thread Adam Sylvester
Thanks!  Similar to the MPI_Comm_accept() thread, I've been working around
this but looking forward to using it in 3.0 to clean up my applications.

On Sat, May 27, 2017 at 4:02 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> This is now fixed in the master and will make it for v3.0, which is
> planned for release in the near future
>
> On Mar 19, 2017, at 1:40 PM, Adam Sylvester <op8...@gmail.com> wrote:
>
> I did a little more testing in case this helps... if I run ompi-server on
> the same host as the one I call MPI_Publish_name() on, it does successfully
> connect.  But when I run it on a separate machine (which is on the same
> network and accessible via TCP), I get the issue above where it hangs.
>
> Thanks for taking a look - if you'd like me to open a bug report for this
> one somewhere, just let me know.
>
> -Adam
>
> On Sun, Mar 19, 2017 at 2:46 PM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
>
>> Well, your initial usage looks correct - you don’t launch ompi-server via
>> mpirun. However, it sounds like there is probably a bug somewhere if it
>> hangs as you describe.
>>
>> Scratching my head, I can only recall less than a handful of people ever
>> using these MPI functions to cross-connect jobs, so it does tend to fall
>> into disrepair. As I said, I’ll try to repair it, at least for 3.0.
>>
>>
>> On Mar 19, 2017, at 4:37 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>> I am trying to use ompi-server with Open MPI 1.10.6.  I'm wondering if I
>> should run this with or without the mpirun command.  If I run this:
>>
>> ompi-server --no-daemonize -r +
>>
>> It prints something such as 959315968.0;tcp://172.31.3.57:45743 to
>> stdout but I have thus far been unable to connect to it.  That is, in
>> another application on another machine which is on the same network as the
>> ompi-server machine, I try
>>
>> MPI_Info info;
>> MPI_Info_create();
>> MPI_Info_set(info, "ompi_global_scope", "true");
>>
>> char myport[MPI_MAX_PORT_NAME];
>> MPI_Open_port(MPI_INFO_NULL, myport);
>> MPI_Publish_name("adam-server", info, myport);
>>
>> But the MPI_Publish_name() function hangs forever when I run it like
>>
>> mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" server
>>
>> Blog posts are inconsistent as to if you should run ompi-server with
>> mpirun or not so I tried using it but this seg faults:
>>
>> mpirun -np 1 ompi-server --no-daemonize -r +
>> [ip-172-31-5-39:14785] *** Process received signal ***
>> [ip-172-31-5-39:14785] Signal: Segmentation fault (11)
>> [ip-172-31-5-39:14785] Signal code: Address not mapped (1)
>> [ip-172-31-5-39:14785] Failing at address: 0x6e0
>> [ip-172-31-5-39:14785] [ 0] /lib64/libpthread.so.0(+0xf370
>> )[0x7f895d7a5370]
>> [ip-172-31-5-39:14785] [ 1] /usr/local/lib/libopen-pal.so.
>> 13(opal_hwloc191_hwloc_get_cpubind+0x9)[0x7f895e336839]
>> [ip-172-31-5-39:14785] [ 2] /usr/local/lib/libopen-rte.so.
>> 12(orte_ess_base_proc_binding+0x17a)[0x7f895e5d8fca]
>> [ip-172-31-5-39:14785] [ 3] /usr/local/lib/openmpi/mca_ess
>> _env.so(+0x15dd)[0x7f895cdcd5dd]
>> [ip-172-31-5-39:14785] [ 4] /usr/local/lib/libopen-rte.so.
>> 12(orte_init+0x168)[0x7f895e5b5368]
>> [ip-172-31-5-39:14785] [ 5] ompi-server[0x4014d4]
>> [ip-172-31-5-39:14785] [ 6] /lib64/libc.so.6(__libc_start_
>> main+0xf5)[0x7f895d3f6b35]
>> [ip-172-31-5-39:14785] [ 7] ompi-server[0x40176b]
>> [ip-172-31-5-39:14785] *** End of error message ***
>>
>> Am I doing something wrong?
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_accept()

2017-05-28 Thread Adam Sylvester
Thanks!  I've been working around this in the meantime but will look
forward to using it in 3.0.

On Sat, May 27, 2017 at 4:02 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Hardly the hoped-for quick turnaround, but it has been fixed in master and
> will go into v3.0, which is planned for release in the near future
>
> On Mar 14, 2017, at 6:26 PM, Adam Sylvester <op8...@gmail.com> wrote:
>
> Excellent - I appreciate the quick turnaround.
>
> On Tue, Mar 14, 2017 at 10:24 AM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
>
>> I don’t see an issue right away, though I know it has been brought up
>> before. I hope to resolve it either this week or next - will reply to this
>> thread with the PR link when ready.
>>
>>
>> On Mar 13, 2017, at 6:16 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>> Bummer - thanks for the update.  I will revert back to 1.10.x for now
>> then.  Should I file a bug report for this on GitHub or elsewhere?  Or if
>> there's an issue for this already open, can you point me to it so I can
>> keep track of when it's fixed?  Any best guess calendar-wise as to when you
>> expect this to be fixed?
>>
>> Thanks.
>>
>> On Mon, Mar 13, 2017 at 10:45 AM, r...@open-mpi.org <r...@open-mpi.org>
>> wrote:
>>
>>> You should consider it a bug for now - it won’t work in the 2.0 series,
>>> and I don’t think it will work in the upcoming 2.1.0 release. Probably will
>>> be fixed after that.
>>>
>>>
>>> On Mar 13, 2017, at 5:17 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>>
>>> As a follow-up, I tried this with Open MPI 1.10.4 and this worked as
>>> expected (the port formatting looks really different):
>>>
>>> $ mpirun -np 1 ./server
>>> Port name is 1286733824.0;tcp://10.102.16.1
>>> 35:43074+1286733825.0;tcp://10.102.16.135::300
>>> Accepted!
>>>
>>> $ mpirun -np 1 ./client "1286733824.0;tcp://10.102.16.
>>> 135:43074+1286733825.0;tcp://10.102.16.135::300"
>>> Trying with '1286733824.0;tcp://10.102.16.135:43074+1286733825.0;tcp://1
>>> 0.102.16.135::300'
>>> Connected!
>>>
>>> I've found some other posts of users asking about similar things
>>> regarding the 2.x release - is this a bug?
>>>
>>> On Sun, Mar 12, 2017 at 9:38 PM, Adam Sylvester <op8...@gmail.com>
>>> wrote:
>>>
>>>> I'm using Open MPI 2.0.2 on RHEL 7.  I'm trying to use MPI_Open_port()
>>>> / MPI_Comm_accept() / MPI_Conn_connect().  My use case is that I'll have
>>>> two processes running on two machines that don't initially know about each
>>>> other (i.e. I can't do the typical mpirun with a list of IPs); eventually I
>>>> think I may need to use ompi-server to accomplish what I want but for now
>>>> I'm trying to test this out running two processes on the same machine with
>>>> some toy programs.
>>>>
>>>> server.cpp creates the port, prints it, and waits for a client to
>>>> accept using it:
>>>>
>>>> #include 
>>>> #include 
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>> MPI_Init(NULL, NULL);
>>>>
>>>> char myport[MPI_MAX_PORT_NAME];
>>>> MPI_Comm intercomm;
>>>>
>>>> MPI_Open_port(MPI_INFO_NULL, myport);
>>>> std::cout << "Port name is " << myport << std::endl;
>>>>
>>>> MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>> );
>>>>
>>>> std::cout << "Accepted!" << std::endl;
>>>>
>>>> MPI_Finalize();
>>>> return 0;
>>>> }
>>>>
>>>> client.cpp takes in this port on the command line and tries to connect
>>>> to it:
>>>>
>>>> #include 
>>>> #include 
>>>>
>>>> int main(int argc, char** argv)
>>>> {
>>>> MPI_Init(NULL, NULL);
>>>>
>>>> MPI_Comm intercomm;
>>>>
>>>> const std::string name(argv[1]);
>>>> std::cout << "Trying with '" << name << "'" << std::endl;
>>>> MPI_Comm_connect(name.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>> );
>>>>
>>>> std::cout << "Connected!" << std::endl;
>>>>
>>>> MPI_Finalize();
>>>> return 0;
>>>> }

Re: [OMPI users] How to launch ompi-server?

2017-03-19 Thread Adam Sylvester
I did a little more testing in case this helps... if I run ompi-server on
the same host as the one I call MPI_Publish_name() on, it does successfully
connect.  But when I run it on a separate machine (which is on the same
network and accessible via TCP), I get the issue above where it hangs.

Thanks for taking a look - if you'd like me to open a bug report for this
one somewhere, just let me know.

-Adam

On Sun, Mar 19, 2017 at 2:46 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> Well, your initial usage looks correct - you don’t launch ompi-server via
> mpirun. However, it sounds like there is probably a bug somewhere if it
> hangs as you describe.
>
> Scratching my head, I can only recall less than a handful of people ever
> using these MPI functions to cross-connect jobs, so it does tend to fall
> into disrepair. As I said, I’ll try to repair it, at least for 3.0.
>
>
> On Mar 19, 2017, at 4:37 AM, Adam Sylvester <op8...@gmail.com> wrote:
>
> I am trying to use ompi-server with Open MPI 1.10.6.  I'm wondering if I
> should run this with or without the mpirun command.  If I run this:
>
> ompi-server --no-daemonize -r +
>
> It prints something such as 959315968.0;tcp://172.31.3.57:45743 to stdout
> but I have thus far been unable to connect to it.  That is, in another
> application on another machine which is on the same network as the
> ompi-server machine, I try
>
> MPI_Info info;
> MPI_Info_create();
> MPI_Info_set(info, "ompi_global_scope", "true");
>
> char myport[MPI_MAX_PORT_NAME];
> MPI_Open_port(MPI_INFO_NULL, myport);
> MPI_Publish_name("adam-server", info, myport);
>
> But the MPI_Publish_name() function hangs forever when I run it like
>
> mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" server
>
> Blog posts are inconsistent as to if you should run ompi-server with
> mpirun or not so I tried using it but this seg faults:
>
> mpirun -np 1 ompi-server --no-daemonize -r +
> [ip-172-31-5-39:14785] *** Process received signal ***
> [ip-172-31-5-39:14785] Signal: Segmentation fault (11)
> [ip-172-31-5-39:14785] Signal code: Address not mapped (1)
> [ip-172-31-5-39:14785] Failing at address: 0x6e0
> [ip-172-31-5-39:14785] [ 0] /lib64/libpthread.so.0(+
> 0xf370)[0x7f895d7a5370]
> [ip-172-31-5-39:14785] [ 1] /usr/local/lib/libopen-pal.so.
> 13(opal_hwloc191_hwloc_get_cpubind+0x9)[0x7f895e336839]
> [ip-172-31-5-39:14785] [ 2] /usr/local/lib/libopen-rte.so.
> 12(orte_ess_base_proc_binding+0x17a)[0x7f895e5d8fca]
> [ip-172-31-5-39:14785] [ 3] /usr/local/lib/openmpi/mca_
> ess_env.so(+0x15dd)[0x7f895cdcd5dd]
> [ip-172-31-5-39:14785] [ 4] /usr/local/lib/libopen-rte.so.
> 12(orte_init+0x168)[0x7f895e5b5368]
> [ip-172-31-5-39:14785] [ 5] ompi-server[0x4014d4]
> [ip-172-31-5-39:14785] [ 6] /lib64/libc.so.6(__libc_start_
> main+0xf5)[0x7f895d3f6b35]
> [ip-172-31-5-39:14785] [ 7] ompi-server[0x40176b]
> [ip-172-31-5-39:14785] *** End of error message ***
>
> Am I doing something wrong?
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] How to launch ompi-server?

2017-03-19 Thread Adam Sylvester
I am trying to use ompi-server with Open MPI 1.10.6.  I'm wondering if I
should run this with or without the mpirun command.  If I run this:

ompi-server --no-daemonize -r +

It prints something such as 959315968.0;tcp://172.31.3.57:45743 to stdout
but I have thus far been unable to connect to it.  That is, in another
application on another machine which is on the same network as the
ompi-server machine, I try

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "ompi_global_scope", "true");

char myport[MPI_MAX_PORT_NAME];
MPI_Open_port(MPI_INFO_NULL, myport);
MPI_Publish_name("adam-server", info, myport);

But the MPI_Publish_name() function hangs forever when I run it like

mpirun -np 1 --ompi-server "959315968.0;tcp://172.31.3.57:45743" server
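
For context, once the publish works, the lookup side I have in mind would be
something along these lines (standard MPI_Lookup_name() / MPI_Comm_connect()
usage; I obviously haven't gotten far enough to exercise it yet):

char port[MPI_MAX_PORT_NAME];
MPI_Lookup_name("adam-server", info, port);

MPI_Comm intercomm;
MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);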

Blog posts are inconsistent as to whether you should run ompi-server with mpirun
or not so I tried using it but this seg faults:

mpirun -np 1 ompi-server --no-daemonize -r +
[ip-172-31-5-39:14785] *** Process received signal ***
[ip-172-31-5-39:14785] Signal: Segmentation fault (11)
[ip-172-31-5-39:14785] Signal code: Address not mapped (1)
[ip-172-31-5-39:14785] Failing at address: 0x6e0
[ip-172-31-5-39:14785] [ 0] /lib64/libpthread.so.0(+0xf370)[0x7f895d7a5370]
[ip-172-31-5-39:14785] [ 1]
/usr/local/lib/libopen-pal.so.13(opal_hwloc191_hwloc_get_cpubind+0x9)[0x7f895e336839]
[ip-172-31-5-39:14785] [ 2]
/usr/local/lib/libopen-rte.so.12(orte_ess_base_proc_binding+0x17a)[0x7f895e5d8fca]
[ip-172-31-5-39:14785] [ 3]
/usr/local/lib/openmpi/mca_ess_env.so(+0x15dd)[0x7f895cdcd5dd]
[ip-172-31-5-39:14785] [ 4]
/usr/local/lib/libopen-rte.so.12(orte_init+0x168)[0x7f895e5b5368]
[ip-172-31-5-39:14785] [ 5] ompi-server[0x4014d4]
[ip-172-31-5-39:14785] [ 6]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f895d3f6b35]
[ip-172-31-5-39:14785] [ 7] ompi-server[0x40176b]
[ip-172-31-5-39:14785] *** End of error message ***

Am I doing something wrong?
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_accept()

2017-03-14 Thread Adam Sylvester
Excellent - I appreciate the quick turnaround.

On Tue, Mar 14, 2017 at 10:24 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> I don’t see an issue right away, though I know it has been brought up
> before. I hope to resolve it either this week or next - will reply to this
> thread with the PR link when ready.
>
>
> On Mar 13, 2017, at 6:16 PM, Adam Sylvester <op8...@gmail.com> wrote:
>
> Bummer - thanks for the update.  I will revert back to 1.10.x for now
> then.  Should I file a bug report for this on GitHub or elsewhere?  Or if
> there's an issue for this already open, can you point me to it so I can
> keep track of when it's fixed?  Any best guess calendar-wise as to when you
> expect this to be fixed?
>
> Thanks.
>
> On Mon, Mar 13, 2017 at 10:45 AM, r...@open-mpi.org <r...@open-mpi.org>
> wrote:
>
>> You should consider it a bug for now - it won’t work in the 2.0 series,
>> and I don’t think it will work in the upcoming 2.1.0 release. Probably will
>> be fixed after that.
>>
>>
>> On Mar 13, 2017, at 5:17 AM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>> As a follow-up, I tried this with Open MPI 1.10.4 and this worked as
>> expected (the port formatting looks really different):
>>
>> $ mpirun -np 1 ./server
>> Port name is 1286733824.0;tcp://10.102.16.135:43074+1286733825.0;tcp://10
>> .102.16.135::300
>> Accepted!
>>
>> $ mpirun -np 1 ./client "1286733824.0;tcp://10.102.16.
>> 135:43074+1286733825.0;tcp://10.102.16.135::300"
>> Trying with '1286733824.0;tcp://10.102.16.135:43074+1286733825.0;tcp://1
>> 0.102.16.135::300'
>> Connected!
>>
>> I've found some other posts of users asking about similar things
>> regarding the 2.x release - is this a bug?
>>
>> On Sun, Mar 12, 2017 at 9:38 PM, Adam Sylvester <op8...@gmail.com> wrote:
>>
>>> I'm using Open MPI 2.0.2 on RHEL 7.  I'm trying to use MPI_Open_port() /
>>> MPI_Comm_accept() / MPI_Conn_connect().  My use case is that I'll have two
>>> processes running on two machines that don't initially know about each
>>> other (i.e. I can't do the typical mpirun with a list of IPs); eventually I
>>> think I may need to use ompi-server to accomplish what I want but for now
>>> I'm trying to test this out running two processes on the same machine with
>>> some toy programs.
>>>
>>> server.cpp creates the port, prints it, and waits for a client to accept
>>> using it:
>>>
>>> #include 
>>> #include 
>>>
>>> int main(int argc, char** argv)
>>> {
>>> MPI_Init(NULL, NULL);
>>>
>>> char myport[MPI_MAX_PORT_NAME];
>>> MPI_Comm intercomm;
>>>
>>> MPI_Open_port(MPI_INFO_NULL, myport);
>>> std::cout << "Port name is " << myport << std::endl;
>>>
>>> MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, );
>>>
>>> std::cout << "Accepted!" << std::endl;
>>>
>>> MPI_Finalize();
>>> return 0;
>>> }
>>>
>>> client.cpp takes in this port on the command line and tries to connect
>>> to it:
>>>
>>> #include 
>>> #include 
>>>
>>> int main(int argc, char** argv)
>>> {
>>> MPI_Init(NULL, NULL);
>>>
>>> MPI_Comm intercomm;
>>>
>>> const std::string name(argv[1]);
>>> std::cout << "Trying with '" << name << "'" << std::endl;
>>> MPI_Comm_connect(name.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>> );
>>>
>>> std::cout << "Connected!" << std::endl;
>>>
>>> MPI_Finalize();
>>> return 0;
>>> }
>>>
>>> I run the server first:
>>> $ mpirun ./server
>>> Port name is 2720137217.0:595361386
>>>
>>> Then a second later I run the client:
>>> $ mpirun ./client 2720137217.0:595361386
>>> Trying with '2720137217.0:595361386'
>>>
>>> Both programs hang for awhile and then eventually time out.  I have a
>>> feeling I'm misunderstanding something and doing something dumb but from
>>> all the examples I've seen online it seems like this should work.
>>>
>>> Thanks for the help.
>>> -Adam
>>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_accept()

2017-03-13 Thread Adam Sylvester
Bummer - thanks for the update.  I will revert back to 1.10.x for now
then.  Should I file a bug report for this on GitHub or elsewhere?  Or if
there's an issue for this already open, can you point me to it so I can
keep track of when it's fixed?  Any best guess calendar-wise as to when you
expect this to be fixed?

Thanks.

On Mon, Mar 13, 2017 at 10:45 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:

> You should consider it a bug for now - it won’t work in the 2.0 series,
> and I don’t think it will work in the upcoming 2.1.0 release. Probably will
> be fixed after that.
>
>
> On Mar 13, 2017, at 5:17 AM, Adam Sylvester <op8...@gmail.com> wrote:
>
> As a follow-up, I tried this with Open MPI 1.10.4 and this worked as
> expected (the port formatting looks really different):
>
> $ mpirun -np 1 ./server
> Port name is 1286733824.0;tcp://10.102.16.135:43074+1286733825.0;tcp://
> 10.102.16.135::300
> Accepted!
>
> $ mpirun -np 1 ./client "1286733824.0;tcp://10.102.16.
> 135:43074+1286733825.0;tcp://10.102.16.135::300"
> Trying with '1286733824.0;tcp://10.102.16.135:43074+1286733825.0;tcp://
> 10.102.16.135::300'
> Connected!
>
> I've found some other posts of users asking about similar things regarding
> the 2.x release - is this a bug?
>
> On Sun, Mar 12, 2017 at 9:38 PM, Adam Sylvester <op8...@gmail.com> wrote:
>
>> I'm using Open MPI 2.0.2 on RHEL 7.  I'm trying to use MPI_Open_port() /
>> MPI_Comm_accept() / MPI_Conn_connect().  My use case is that I'll have two
>> processes running on two machines that don't initially know about each
>> other (i.e. I can't do the typical mpirun with a list of IPs); eventually I
>> think I may need to use ompi-server to accomplish what I want but for now
>> I'm trying to test this out running two processes on the same machine with
>> some toy programs.
>>
>> server.cpp creates the port, prints it, and waits for a client to accept
>> using it:
>>
>> #include 
>> #include 
>>
>> int main(int argc, char** argv)
>> {
>> MPI_Init(NULL, NULL);
>>
>> char myport[MPI_MAX_PORT_NAME];
>> MPI_Comm intercomm;
>>
>> MPI_Open_port(MPI_INFO_NULL, myport);
>> std::cout << "Port name is " << myport << std::endl;
>>
>> MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, );
>>
>> std::cout << "Accepted!" << std::endl;
>>
>> MPI_Finalize();
>> return 0;
>> }
>>
>> client.cpp takes in this port on the command line and tries to connect to
>> it:
>>
>> #include 
>> #include 
>>
>> int main(int argc, char** argv)
>> {
>> MPI_Init(NULL, NULL);
>>
>> MPI_Comm intercomm;
>>
>> const std::string name(argv[1]);
>> std::cout << "Trying with '" << name << "'" << std::endl;
>> MPI_Comm_connect(name.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF,
>> );
>>
>> std::cout << "Connected!" << std::endl;
>>
>> MPI_Finalize();
>> return 0;
>> }
>>
>> I run the server first:
>> $ mpirun ./server
>> Port name is 2720137217.0:595361386
>>
>> Then a second later I run the client:
>> $ mpirun ./client 2720137217.0:595361386
>> Trying with '2720137217.0:595361386'
>>
>> Both programs hang for awhile and then eventually time out.  I have a
>> feeling I'm misunderstanding something and doing something dumb but from
>> all the examples I've seen online it seems like this should work.
>>
>> Thanks for the help.
>> -Adam
>>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Comm_accept()

2017-03-13 Thread Adam Sylvester
As a follow-up, I tried this with Open MPI 1.10.4 and this worked as
expected (the port formatting looks really different):

$ mpirun -np 1 ./server
Port name is 1286733824.0;tcp://10.102.16.135:43074
+1286733825.0;tcp://10.102.16.135::300
Accepted!

$ mpirun -np 1 ./client "1286733824.0;tcp://10.102.16.135:43074
+1286733825.0;tcp://10.102.16.135::300"
Trying with '1286733824.0;tcp://10.102.16.135:43074
+1286733825.0;tcp://10.102.16.135::300'
Connected!

I've found some other posts of users asking about similar things regarding
the 2.x release - is this a bug?

On Sun, Mar 12, 2017 at 9:38 PM, Adam Sylvester <op8...@gmail.com> wrote:

> I'm using Open MPI 2.0.2 on RHEL 7.  I'm trying to use MPI_Open_port() /
> MPI_Comm_accept() / MPI_Conn_connect().  My use case is that I'll have two
> processes running on two machines that don't initially know about each
> other (i.e. I can't do the typical mpirun with a list of IPs); eventually I
> think I may need to use ompi-server to accomplish what I want but for now
> I'm trying to test this out running two processes on the same machine with
> some toy programs.
>
> server.cpp creates the port, prints it, and waits for a client to accept
> using it:
>
> #include 
> #include 
>
> int main(int argc, char** argv)
> {
> MPI_Init(NULL, NULL);
>
> char myport[MPI_MAX_PORT_NAME];
> MPI_Comm intercomm;
>
> MPI_Open_port(MPI_INFO_NULL, myport);
> std::cout << "Port name is " << myport << std::endl;
>
> MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, );
>
> std::cout << "Accepted!" << std::endl;
>
> MPI_Finalize();
> return 0;
> }
>
> client.cpp takes in this port on the command line and tries to connect to
> it:
>
> #include 
> #include 
>
> int main(int argc, char** argv)
> {
> MPI_Init(NULL, NULL);
>
> MPI_Comm intercomm;
>
> const std::string name(argv[1]);
> std::cout << "Trying with '" << name << "'" << std::endl;
> MPI_Comm_connect(name.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF,
> );
>
> std::cout << "Connected!" << std::endl;
>
> MPI_Finalize();
> return 0;
> }
>
> I run the server first:
> $ mpirun ./server
> Port name is 2720137217.0:595361386
>
> Then a second later I run the client:
> $ mpirun ./client 2720137217.0:595361386
> Trying with '2720137217.0:595361386'
>
> Both programs hang for awhile and then eventually time out.  I have a
> feeling I'm misunderstanding something and doing something dumb but from
> all the examples I've seen online it seems like this should work.
>
> Thanks for the help.
> -Adam
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] MPI_Comm_accept()

2017-03-12 Thread Adam Sylvester
I'm using Open MPI 2.0.2 on RHEL 7.  I'm trying to use MPI_Open_port() /
MPI_Comm_accept() / MPI_Comm_connect().  My use case is that I'll have two
processes running on two machines that don't initially know about each
other (i.e. I can't do the typical mpirun with a list of IPs); eventually I
think I may need to use ompi-server to accomplish what I want but for now
I'm trying to test this out running two processes on the same machine with
some toy programs.

server.cpp creates the port, prints it, and waits for a client to accept
using it:

#include <mpi.h>
#include <iostream>

int main(int argc, char** argv)
{
    MPI_Init(NULL, NULL);

    char myport[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    MPI_Open_port(MPI_INFO_NULL, myport);
    std::cout << "Port name is " << myport << std::endl;

    MPI_Comm_accept(myport, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    std::cout << "Accepted!" << std::endl;

    MPI_Finalize();
    return 0;
}

client.cpp takes in this port on the command line and tries to connect to
it:

#include <mpi.h>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
    MPI_Init(NULL, NULL);

    MPI_Comm intercomm;

    const std::string name(argv[1]);
    std::cout << "Trying with '" << name << "'" << std::endl;
    MPI_Comm_connect(name.c_str(), MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    std::cout << "Connected!" << std::endl;

    MPI_Finalize();
    return 0;
}
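
(For completeness, these are built with nothing special - just the wrapper
compiler, e.g. something like "mpic++ server.cpp -o server" and
"mpic++ client.cpp -o client".)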

I run the server first:
$ mpirun ./server
Port name is 2720137217.0:595361386

Then a second later I run the client:
$ mpirun ./client 2720137217.0:595361386
Trying with '2720137217.0:595361386'

Both programs hang for a while and then eventually time out.  I have a
feeling I'm misunderstanding something and doing something dumb but from
all the examples I've seen online it seems like this should work.

Thanks for the help.
-Adam
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] mpirun with ssh tunneling

2017-01-01 Thread Adam Sylvester
 these ports
>   and port 22 (ssh).
>
> you can also refer to https://github.com/open-mpi/ompi/issues/1511
> yet an other way to use docker was discussed here.
>
> last but not least, if you want to use containers but you are not tied to
> docker, you can consider http://singularity.lbl.gov/
> (as far as Open MPI is concerned,native support is expected for Open MPI
> 2.1)
>
>
> Cheers,
>
> Gilles
>
>
> On 12/26/2016 6:11 AM, Adam Sylvester wrote:
>
> I'm trying to use OpenMPI 1.10.4 to communicate between two Docker
> containers running on two different physical machines.  Docker doesn't have
> much to do with my question (unless someone has a suggestion for a better
> way to do what I'm trying to :o) )... each Docker container is running an
> OpenSSH server which shows up as 172.17.0.1 on the physical hosts:
>
> $ ifconfig docker0
> docker0   Link encap:Ethernet  HWaddr 02:42:8E:07:05:A0
>   inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
>   inet6 addr: fe80::42:8eff:fe07:5a0/64 Scope:Link
>
> The Docker container's ssh port is published on the physical host as port
> 32768.
>
> The Docker container has a user 'mpirun' which I have public/private ssh
> keys set up for.
>
> Let's call the physical hosts host1 and host2; each host is running a
> Docker container I'll refer to as docker1 and docker2 respectively.  So,
> this means I can...
> 1. ssh From host1 into docker1:
> ssh mpirun@172.17.0.1 -i ssh/id_rsa -p 32768
>
> 2. Set up an ssh tunnel from inside docker1, through host2, into docker2,
> on local port 4334 (ec2-user is the login to host2)
> ssh -f -N -q -o "TCPKeepAlive yes" -o "ServerAliveInterval 60" -L 4334:
> 172.17.0.1:32768 -l ec2-user host2
>
> 3. Update my ~/.ssh/config file to name this host 'docker2':
> StrictHostKeyChecking no
> Host docker2
>   HostName 127.0.0.1
>   Port 4334
>   User mpirun
>
> 4. I can now do 'ssh docker2' and ssh into it without issues.
>
> Here's where I get stuck.  I'd read that OpenMPI's mpirun didn't support
> ssh'ing on a non-standard port, so I thought I could just do step 3 above
> and then list the hosts when I run mpirun from docker1:
>
> mpirun --prefix /usr/local -n 2 -H localhost,docker2
> /home/mpirun/mpi_hello_world
>
> However, I get:
> [3524ae84a26b:00197] [[55635,0],1] tcp_peer_send_blocking: send() to
> socket 9 failed: Broken pipe (32)
> --
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp
> (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to
> use.
>
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --
>
> I'm guessing that something's going wrong when docker2 tries to
> communicate back to docker1.  However, I'm not sure what additional
> tunneling to set up to support this.  My understanding of ssh tunnels is
> relatively basic... I can of course create a tunnel on docker2 back to
> docker1 but I don't know how ssh/mpi will "find" it.  I've read a bit about
> reverse ssh tunneling but it's not clear enough to me what this is doing to
> apply it here.
>
> Any help is much appreciated!
> -Adam
>
>
> ___
> users mailing 
> listus...@lists.open-mpi.orghttps://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] mpirun with ssh tunneling

2016-12-25 Thread Adam Sylvester
I'm trying to use OpenMPI 1.10.4 to communicate between two Docker
containers running on two different physical machines.  Docker doesn't have
much to do with my question (unless someone has a suggestion for a better
way to do what I'm trying to :o) )... each Docker container is running an
OpenSSH server which shows up as 172.17.0.1 on the physical hosts:

$ ifconfig docker0
docker0   Link encap:Ethernet  HWaddr 02:42:8E:07:05:A0
  inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
  inet6 addr: fe80::42:8eff:fe07:5a0/64 Scope:Link

The Docker container's ssh port is published on the physical host as port
32768.

The Docker container has a user 'mpirun' which I have public/private ssh
keys set up for.

Let's call the physical hosts host1 and host2; each host is running a
Docker container I'll refer to as docker1 and docker2 respectively.  So,
this means I can...
1. ssh from host1 into docker1:
ssh mpirun@172.17.0.1 -i ssh/id_rsa -p 32768

2. Set up an ssh tunnel from inside docker1, through host2, into docker2,
on local port 4334 (ec2-user is the login to host2)
ssh -f -N -q -o "TCPKeepAlive yes" -o "ServerAliveInterval 60" -L 4334:
172.17.0.1:32768 -l ec2-user host2

3. Update my ~/.ssh/config file to name this host 'docker2':
StrictHostKeyChecking no
Host docker2
  HostName 127.0.0.1
  Port 4334
  User mpirun

4. I can now do 'ssh docker2' and ssh into it without issues.

Here's where I get stuck.  I'd read that OpenMPI's mpirun didn't support
ssh'ing on a non-standard port, so I thought I could just do step 3 above
and then list the hosts when I run mpirun from docker1:

mpirun --prefix /usr/local -n 2 -H localhost,docker2
/home/mpirun/mpi_hello_world

However, I get:
[3524ae84a26b:00197] [[55635,0],1] tcp_peer_send_blocking: send() to socket
9 failed: Broken pipe (32)
--
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp
(--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--

I'm guessing that something's going wrong when docker2 tries to communicate
back to docker1.  However, I'm not sure what additional tunneling to set up
to support this.  My understanding of ssh tunnels is relatively basic... I
can of course create a tunnel on docker2 back to docker1 but I don't know
how ssh/mpi will "find" it.  I've read a bit about reverse ssh tunneling
but it's not clear enough to me what this is doing to apply it here.
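
My rough understanding of the syntax is that from inside docker1 I could run
something like the line below (4335 is just a made-up port), which would make
host2 listen on 4335 and forward anything that connects to it back through
docker1 to docker1's published ssh port - but I don't see how mpirun's
dynamically chosen TCP ports would ever be routed through a tunnel like that:

ssh -f -N -R 4335:172.17.0.1:32768 -l ec2-user host2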

Any help is much appreciated!
-Adam
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users