Ralph --

Quick question: ORTE should be using local named sockets for connections to the 
orted, right?

I guess what I'm asking is: if there's a 
single-server-only-and-it-happens-to-be-the-local-server job, shouldn't it only 
be using local named sockets, not IP sockets?
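
(For context, an illustrative sketch only, not Open MPI code: by "local named sockets" I mean AF_UNIX sockets addressed by a filesystem path. They never touch the IP stack, so they keep working no matter what the network interfaces are doing. The socket path below is hypothetical. Roughly:)

```python
import os
import socket
import tempfile
import threading

# Hypothetical path -- not where orted actually puts its socket.
path = os.path.join(tempfile.mkdtemp(), "orted.sock")

# "Server" side: bind a named (AF_UNIX) socket to a filesystem path.
srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(path)
srv.listen(1)

def serve():
    conn, _ = srv.accept()
    conn.sendall(b"hello from orted")
    conn.close()

t = threading.Thread(target=serve)
t.start()

# "Client" side: connect by path, not by IP address/port.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(path)
msg = cli.recv(64).decode()
print(msg)

cli.close()
t.join()
srv.close()
os.unlink(path)
```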


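(A quick way to test the "loopback blocked" theory discussed below: a minimal self-connect check, a hypothetical diagnostic and not part of Open MPI, that mimics what the TCP wireup does during MPI_Init -- listen on an address, then connect back to it from the same host:)

```python
import socket

def can_self_connect(host):
    """Listen on an ephemeral port on `host`, then try to connect
    back to it from this same machine."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))          # port 0 = pick an ephemeral port
    srv.listen(1)
    port = srv.getsockname()[1]
    try:
        cli = socket.create_connection((host, port), timeout=2)
        cli.close()
        return True
    except OSError:
        return False
    finally:
        srv.close()

ok = can_self_connect("127.0.0.1")
print(ok)
```

If this returns True for 127.0.0.1 but False for the address of the active ethernet/wifi interface, that would match the behavior Karl is seeing.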

On Dec 3, 2013, at 8:16 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Best guess I can offer is that they are blocking loopback on those networks - 
> i.e., they are configured such that you can use them to connect to a remote 
> machine, but not to a process on your local machine. I'll take a look at the 
> connection logic and see if I can get it to fail over to the loopback device 
> in that case. I believe we disable use of the loopback if an active TCP 
> network is available, as we expect it to include loopback capability.
> 
> Meantime, you might want to talk to your IT folks and see if that is correct 
> and intentional - and if so, why.
> 
> 
> 
> On Tue, Dec 3, 2013 at 5:04 AM, Meredith, Karl <karl.mered...@fmglobal.com> 
> wrote:
> I disconnected from our corporate network (ethernet connection) and tried 
> running again:  same result, it stalls.
> 
> Then, I also disconnected from our local wifi network and tried running 
> again:  it worked!
> 
> bash-4.2$ mpirun -np 2 --mca btl sm,self hello_c
> Hello, world, I am 0 of 2, (Open MPI v1.7.4a1, package: Open MPI 
> meredi...@meredithk-mac.corp.fmglobal.com Distribution, ident: 1.7.4a1r29784, 
> repo rev: r29784, Dec 02, 2013 (nightly snapshot tarball), 173)
> Hello, world, I am 1 of 2, (Open MPI v1.7.4a1, package: Open MPI 
> meredi...@meredithk-mac.corp.fmglobal.com Distribution, ident: 1.7.4a1r29784, 
> repo rev: r29784, Dec 02, 2013 (nightly snapshot tarball), 173)
> bash-4.2$ mpirun -np 2 hello_c
> Hello, world, I am 0 of 2, (Open MPI v1.7.4a1, package: Open MPI 
> meredi...@meredithk-mac.corp.fmglobal.com Distribution, ident: 1.7.4a1r29784, 
> repo rev: r29784, Dec 02, 2013 (nightly snapshot tarball), 173)
> Hello, world, I am 1 of 2, (Open MPI v1.7.4a1, package: Open MPI 
> meredi...@meredithk-mac.corp.fmglobal.com Distribution, ident: 1.7.4a1r29784, 
> repo rev: r29784, Dec 02, 2013 (nightly snapshot tarball), 173)
> 
> Why?  What would cause the network to interfere with mpirun?  Do you have 
> any insight?
> 
> Karl
> 
> 
> 
> On Dec 3, 2013, at 7:54 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Hmmm...are you connected to a network, or at least have a network active, 
> when you do this? It looks a little like the system is trying to open a port 
> between the process and mpirun, but is failing to do so.
> 
> 
> 
> On Tue, Dec 3, 2013 at 4:51 AM, Meredith, Karl 
> <karl.mered...@fmglobal.com> wrote:
> Using openmpi-1.7.4, no macports, only apple compilers/tools:
> 
> mpirun -np 2 --mca btl sm,self hello_c
> 
> This hangs, also in MPI_Init().
> 
> Here’s the backtrace from the debugger:
> 
> bash-4.2$ lldb -p 4517
> Attaching to process with:
>     process attach -p 4517
> Process 4517 stopped
> Executable module set to 
> "/Users/meredithk/tools/openmpi-1.7.4a1r29784/examples/hello_c".
> Architecture set to: x86_64-apple-macosx.
> (lldb) bt
> * thread #1: tid = 0x57efb, 0x00007fff8c991a3a 
> libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', 
> stop reason = signal SIGSTOP
>     frame #0: 0x00007fff8c991a3a libsystem_kernel.dylib`__semwait_signal + 10
>     frame #1: 0x00007fff8ade4e60 libsystem_c.dylib`nanosleep + 200
>     frame #2: 0x0000000108d668e3 
> libopen-rte.6.dylib`orte_routed_base_register_sync(setup=true) + 2435 at 
> routed_base_fns.c:344
>     frame #3: 0x000000010904e3a7 
> mca_routed_binomial.so`init_routes(job=1294401537, ndat=0x0000000000000000) + 
> 2759 at routed_binomial.c:708
>     frame #4: 0x0000000108d1b84d 
> libopen-rte.6.dylib`orte_ess_base_app_setup(db_restrict_local=true) + 2109 at 
> ess_base_std_app.c:233
>     frame #5: 0x0000000108fbc442 mca_ess_env.so`rte_init + 418 at 
> ess_env_module.c:146
>     frame #6: 0x0000000108cd6cfe 
> libopen-rte.6.dylib`orte_init(pargc=0x0000000000000000, 
> pargv=0x0000000000000000, flags=32) + 718 at orte_init.c:158
>     frame #7: 0x0000000108a3b3c8 libmpi.1.dylib`ompi_mpi_init(argc=1, 
> argv=0x00007fff57200508, requested=0, provided=0x00007fff57200360) + 616 at 
> ompi_mpi_init.c:451
>     frame #8: 0x0000000108a895a0 
> libmpi.1.dylib`MPI_Init(argc=0x00007fff572004d0, argv=0x00007fff572004c8) + 
> 480 at init.c:84
>     frame #9: 0x00000001089ffe4a hello_c`main(argc=1, 
> argv=0x00007fff57200508) + 58 at hello_c.c:18
>     frame #10: 0x00007fff8d5df5fd libdyld.dylib`start + 1
>     frame #11: 0x00007fff8d5df5fd libdyld.dylib`start + 1
> 
> 
> On Dec 2, 2013, at 2:11 PM, Jeff Squyres (jsquyres) 
> <jsquy...@cisco.com> wrote:
> 
> > Karl --
> >
> > Can you force the use of just the shared memory transport -- i.e., disable 
> > the TCP BTL?  For example:
> >
> >    mpirun -np 2 --mca btl sm,self hello_c
> >
> > If that also hangs, can you attach a debugger and see *where* it is hanging 
> > inside MPI_Init()?  (In OMPI, MPI::Init() simply invokes MPI_Init())
> >
> >
> > On Nov 27, 2013, at 2:56 PM, "Meredith, Karl" 
> > <karl.mered...@fmglobal.com> wrote:
> >
> >> /opt/trunk/apple-only/bin/ompi_info --param oob tcp --level 9
> >>                MCA oob: parameter "oob_tcp_verbose" (current value: "0", 
> >> data source: default, level: 9 dev/all, type: int)
> >>                         Verbose level for the OOB tcp component
> >>                MCA oob: parameter "oob_tcp_peer_limit" (current value: 
> >> "-1", data source: default, level: 9 dev/all, type: int)
> >>                         Maximum number of peer connections to 
> >> simultaneously maintain (-1 = infinite)
> >>                MCA oob: parameter "oob_tcp_peer_retries" (current value: 
> >> "60", data source: default, level: 9 dev/all, type: int)
> >>                         Number of times to try shutting down a connection 
> >> before giving up
> >>                MCA oob: parameter "oob_tcp_debug" (current value: "0", 
> >> data source: default, level: 9 dev/all, type: int)
> >>                         Enable (1) / disable (0) debugging output for this 
> >> component
> >>                MCA oob: parameter "oob_tcp_sndbuf" (current value: 
> >> "131072", data source: default, level: 9 dev/all, type: int)
> >>                         TCP socket send buffering size (in bytes)
> >>                MCA oob: parameter "oob_tcp_rcvbuf" (current value: 
> >> "131072", data source: default, level: 9 dev/all, type: int)
> >>                         TCP socket receive buffering size (in bytes)
> >>                MCA oob: parameter "oob_tcp_if_include" (current value: "", 
> >> data source: default, level: 9 dev/all, type: string, synonyms: 
> >> oob_tcp_include)
> >>                         Comma-delimited list of devices and/or CIDR 
> >> notation of networks to use for Open MPI bootstrap communication (e.g., 
> >> "eth0,192.168.0.0/16").  Mutually exclusive with oob_tcp_if_exclude.
> >>                MCA oob: parameter "oob_tcp_if_exclude" (current value: "", 
> >> data source: default, level: 9 dev/all, type: string, synonyms: 
> >> oob_tcp_exclude)
> >>                         Comma-delimited list of devices and/or CIDR 
> >> notation of networks to NOT use for Open MPI bootstrap communication -- 
> >> all devices not matching these specifications will be used (e.g., 
> >> "eth0,192.168.0.0/16").  If set to a non-default value, it is mutually 
> >> exclusive with oob_tcp_if_include.
> >>                MCA oob: parameter "oob_tcp_connect_sleep" (current value: 
> >> "1", data source: default, level: 9 dev/all, type: int)
> >>                         Enable (1) / disable (0) random sleep for 
> >> connection wireup.
> >>                MCA oob: parameter "oob_tcp_listen_mode" (current value: 
> >> "event", data source: default, level: 9 dev/all, type: int)
> >>                         Mode for HNP to accept incoming connections: 
> >> event, listen_thread.
> >>                         Valid values: 0:"event", 1:"listen_thread"
> >>                MCA oob: parameter "oob_tcp_listen_thread_max_queue" 
> >> (current value: "10", data source: default, level: 9 dev/all, type: int)
> >>                         High water mark for queued accepted socket list 
> >> size.  Used only when listen_mode is listen_thread.
> >>                MCA oob: parameter "oob_tcp_listen_thread_wait_time" 
> >> (current value: "10", data source: default, level: 9 dev/all, type: int)
> >>                         Time in milliseconds to wait before actively 
> >> checking for new connections when listen_mode is listen_thread.
> >>                MCA oob: parameter "oob_tcp_static_ports" (current value: 
> >> "", data source: default, level: 9 dev/all, type: string)
> >>                         Static ports for daemons and procs (IPv4)
> >>                MCA oob: parameter "oob_tcp_dynamic_ports" (current value: 
> >> "", data source: default, level: 9 dev/all, type: string)
> >>                         Range of ports to be dynamically used by daemons 
> >> and procs (IPv4)
> >>                MCA oob: parameter "oob_tcp_disable_family" (current value: 
> >> "none", data source: default, level: 9 dev/all, type: int)
> >>                         Disable IPv4 (4) or IPv6 (6)
> >>                         Valid values: 0:"none", 4:"IPv4", 6:"IPv6"
> >>
> >> /opt/trunk/apple-only/bin/ompi_info --param btl tcp --level 9
> >>                MCA btl: parameter "btl_tcp_links" (current value: "1", 
> >> data source: default, level: 4 tuner/basic, type: unsigned)
> >>                MCA btl: parameter "btl_tcp_if_include" (current value: "", 
> >> data source: default, level: 1 user/basic, type: string)
> >>                         Comma-delimited list of devices and/or CIDR 
> >> notation of networks to use for MPI communication (e.g., 
> >> "eth0,192.168.0.0/16").  Mutually exclusive with btl_tcp_if_exclude.
> >>                MCA btl: parameter "btl_tcp_if_exclude" (current value: 
> >> "127.0.0.1/8,sppp", data source: default, level: 1 user/basic, type: 
> >> string)
> >>                         Comma-delimited list of devices and/or CIDR 
> >> notation of networks to NOT use for MPI communication -- all devices not 
> >> matching these specifications will be used (e.g., 
> >> "eth0,192.168.0.0/16").  If set to a non-default value, it is mutually 
> >> exclusive with btl_tcp_if_include.
> >>                MCA btl: parameter "btl_tcp_free_list_num" (current value: 
> >> "8", data source: default, level: 5 tuner/detail, type: int)
> >>                MCA btl: parameter "btl_tcp_free_list_max" (current value: 
> >> "-1", data source: default, level: 5 tuner/detail, type: int)
> >>                MCA btl: parameter "btl_tcp_free_list_inc" (current value: 
> >> "32", data source: default, level: 5 tuner/detail, type: int)
> >>                MCA btl: parameter "btl_tcp_sndbuf" (current value: 
> >> "131072", data source: default, level: 4 tuner/basic, type: int)
> >>                MCA btl: parameter "btl_tcp_rcvbuf" (current value: 
> >> "131072", data source: default, level: 4 tuner/basic, type: int)
> >>                MCA btl: parameter "btl_tcp_endpoint_cache" (current value: 
> >> "30720", data source: default, level: 4 tuner/basic, type: int)
> >>                         The size of the internal cache for each TCP 
> >> connection. This cache is used to reduce the number of syscalls, by 
> >> replacing them with memcpy. Every read will read the expected data plus 
> >> the amount of the endpoint_cache
> >>                MCA btl: parameter "btl_tcp_use_nagle" (current value: "0", 
> >> data source: default, level: 4 tuner/basic, type: int)
> >>                         Whether to use Nagle's algorithm or not (using 
> >> Nagle's algorithm may increase short message latency)
> >>                MCA btl: parameter "btl_tcp_port_min_v4" (current value: 
> >> "1024", data source: default, level: 2 user/detail, type: int)
> >>                         The minimum port where the TCP BTL will try to 
> >> bind (default 1024)
> >>                MCA btl: parameter "btl_tcp_port_range_v4" (current value: 
> >> "64511", data source: default, level: 2 user/detail, type: int)
> >>                         The number of ports where the TCP BTL will try to 
> >> bind (default 64511). This parameter together with the port min, define a 
> >> range of ports where Open MPI will open sockets.
> >>                MCA btl: parameter "btl_tcp_exclusivity" (current value: 
> >> "100", data source: default, level: 7 dev/basic, type: unsigned)
> >>                         BTL exclusivity (must be >= 0)
> >>                MCA btl: parameter "btl_tcp_flags" (current value: "314", 
> >> data source: default, level: 5 tuner/detail, type: unsigned)
> >>                         BTL bit flags (general flags: SEND=1, PUT=2, 
> >> GET=4, SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only 
> >> used by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32, 
> >> RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by others): 
> >> FAILOVER_SUPPORT=512)
> >>                MCA btl: parameter "btl_tcp_rndv_eager_limit" (current 
> >> value: "65536", data source: default, level: 4 tuner/basic, type: size_t)
> >>                         Size (in bytes, including header) of "phase 1" 
> >> fragment sent for all large messages (must be >= 0 and <= eager_limit)
> >>                MCA btl: parameter "btl_tcp_eager_limit" (current value: 
> >> "65536", data source: default, level: 4 tuner/basic, type: size_t)
> >>                         Maximum size (in bytes, including header) of 
> >> "short" messages (must be >= 1).
> >>                MCA btl: parameter "btl_tcp_max_send_size" (current value: 
> >> "131072", data source: default, level: 4 tuner/basic, type: size_t)
> >>                         Maximum size (in bytes) of a single "phase 2" 
> >> fragment of a long message when using the pipeline protocol (must be >= 1)
> >>                MCA btl: parameter "btl_tcp_rdma_pipeline_send_length" 
> >> (current value: "131072", data source: default, level: 4 tuner/basic, 
> >> type: size_t)
> >>                         Length of the "phase 2" portion of a large message 
> >> (in bytes) when using the pipeline protocol.  This part of the message 
> >> will be split into fragments of size max_send_size and sent using 
> >> send/receive semantics (must be >= 0; only relevant when the PUT flag is 
> >> set)
> >>                MCA btl: parameter "btl_tcp_rdma_pipeline_frag_size" 
> >> (current value: "2147483647", data source: default, level: 4 tuner/basic, 
> >> type: size_t)
> >>                         Maximum size (in bytes) of a single "phase 3" 
> >> fragment from a long message when using the pipeline protocol.  These 
> >> fragments will be sent using RDMA semantics (must be >= 1; only relevant 
> >> when the PUT flag is set)
> >>                MCA btl: parameter "btl_tcp_min_rdma_pipeline_size" 
> >> (current value: "196608", data source: default, level: 4 tuner/basic, 
> >> type: size_t)
> >>                         Messages smaller than this size (in bytes) will 
> >> not use the RDMA pipeline protocol.  Instead, they will be split into 
> >> fragments of max_send_size and sent using send/receive semantics (must be 
> >> >=0, and is automatically adjusted up to at least 
> >> (eager_limit+btl_rdma_pipeline_send_length); only relevant when the PUT 
> >> flag is set)
> >>                MCA btl: parameter "btl_tcp_bandwidth" (current value: 
> >> "100", data source: default, level: 5 tuner/detail, type: unsigned)
> >>                         Approximate maximum bandwidth of interconnect (0 = 
> >> auto-detect value at run-time [not supported in all BTL modules], >= 1 = 
> >> bandwidth in Mbps)
> >>                MCA btl: parameter "btl_tcp_disable_family" (current value: 
> >> "0", data source: default, level: 2 user/detail, type: int)
> >>                MCA btl: parameter "btl_tcp_if_seq" (current value: "", 
> >> data source: default, level: 9 dev/all, type: string)
> >>                         If specified, a comma-delimited list of TCP 
> >> interfaces.  Interfaces will be assigned, one to each MPI process, in a 
> >> round-robin fashion on each server.  For example, if the list is 
> >> "eth0,eth1" and four MPI processes are run on a single server, then local 
> >> ranks 0 and 2 will use eth0 and local ranks 1 and 3 will use eth1.
> >>
> >>
> >> On Nov 27, 2013, at 2:41 PM, George Bosilca 
> >> <bosi...@icl.utk.edu> wrote:
> >>
> >> ompi_info --param oob tcp --level 9
> >> ompi_info --param btl tcp --level 9
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
