Ralph, here is my understanding of what happens on Linux:
lo:   127.0.0.1/8
eth0: 192.168.122.101/24

mpirun --mca orte_oob_tcp_if_include eth0 ...

so the MPI task tries to contact orted/mpirun on 192.168.122.101/24. That works
just fine if the loopback interface is active, and it hangs if there is no
loopback interface. IMHO that is a Linux oddity, and OMPI has nothing to do
with it.

Cheers,

Gilles

[root@slurm1 ~]# ping -c 3 192.168.122.101
PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms

--- 192.168.122.101 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms

[root@slurm1 ~]# ifdown lo
[root@slurm1 ~]# ping -c 3 192.168.122.101
PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.

--- 192.168.122.101 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 11999ms

On 2014/12/12 13:54, Ralph Castain wrote:
> I honestly think it has to be a selected interface, Gilles, else we will fail
> to connect.
>
>> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>> Paul,
>>
>> About the five warnings:
>> can you confirm you are running mpirun *not* on n15 nor n16?
>> If my guess is correct, then you can get up to 5 warnings: mpirun + 2 orted
>> + 2 MPI tasks.
>>
>> Do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your
>> openmpi-mca-params.conf?
>>
>> Attached is a patch to fix this issue.
>> What we really want is to test that there is a loopback interface, period.
>> The current code (my bad for not having reviewed it in a timely manner)
>> seems to check that there is a *selected* loopback interface.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/12/12 13:15, Paul Hargrove wrote:
>>> Ralph,
>>>
>>> Sorry to be the bearer of more bad news.
>>> The "good" news is I've seen the new warning regarding the lack of a
>>> loopback interface.
>>> The BAD news is that it is occurring on a Linux cluster that I've verified
>>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>>> RUNNING according to ifconfig).
>>>
>>> Though run with "-np 2", the warning appears FIVE times.
>>> ADDITIONALLY, there is a SEGV at exit!
>>>
>>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>>> numbers from the core (and there was no backtrace printed).
>>>
>>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>>>
>>> Let me know what tracing flags to apply to gather the info needed to debug
>>> this.
>>>
>>> -Paul
>>>
>>>
>>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>>> --------------------------------------------------------------------------
>>> WARNING: No loopback interface was found. This can cause problems
>>> when we spawn processes as they are likely to be unable to connect
>>> back to their host daemon. Sadly, it may take awhile for the connect
>>> attempt to fail, so you may experience a significant hang time.
>>>
>>> You may wish to ctrl-c out of your job and activate loopback support
>>> on at least one interface before trying again.
>>> --------------------------------------------------------------------------
>>> [... above message FOUR more times ...]
>>> Process 1 exiting
>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>> Process 0 sent to 1
>>> Process 0 decremented value: 9
>>> Process 0 decremented value: 8
>>> Process 0 decremented value: 7
>>> Process 0 decremented value: 6
>>> Process 0 decremented value: 5
>>> Process 0 decremented value: 4
>>> Process 0 decremented value: 3
>>> Process 0 decremented value: 2
>>> Process 0 decremented value: 1
>>> Process 0 decremented value: 0
>>> Process 0 exiting
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
>>> 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> $ /sbin/ifconfig lo
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>>>
>>> $ ssh n15 /sbin/ifconfig lo
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>>>
>>> $ ssh n16 /sbin/ifconfig lo
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>>>
>>> $ gdb examples/ring_c core.29728
>>> [...]
>>> (gdb) where
>>> #0  0x0000002a97a19980 in ?? ()
>>> #1  <signal handler called>
>>> #2  0x0000003a6d40607c in _Unwind_FindEnclosingFunction () from /lib64/libgcc_s.so.1
>>> #3  0x0000003a6d406b57 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
>>> #4  0x0000003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
>>> #5  0x0000003a6c30ac50 in __pthread_unwind () from /lib64/tls/libpthread.so.0
>>> #6  0x0000003a6c305202 in sigcancel_handler () from /lib64/tls/libpthread.so.0
>>> #7  <signal handler called>
>>> #8  0x0000003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
>>> #9  0x0000002a978f8f7d in ?? ()
>>> #10 0x002000010000000e in ?? ()
>>> #11 0x0000000000000000 in ?? ()
>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16525.php
>>
>> <loopback.diff>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16526.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16527.php
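For readers following along: the oob_tcp_if_include/oob_tcp_if_exclude settings Gilles asks Paul about can be set persistently in openmpi-mca-params.conf. A minimal sketch of what such an entry might look like (the interface name eth0 is purely illustrative; whether any such line exists in Paul's file is exactly what Gilles is asking):

```
# $prefix/etc/openmpi-mca-params.conf  (or ~/.openmpi/mca-params.conf)
# Restrict the out-of-band (OOB) TCP channel to one interface.
# Filtering like this is what can leave the loopback interface
# present on the node but *not selected* by OMPI.
oob_tcp_if_include = eth0
```

Note that include and exclude are mutually exclusive for a given framework; an entry like this silently narrows which interfaces the interface-scanning code considers "selected", which is the distinction at the heart of this thread.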