Kewl - thanks to both of you for the explanation. I’ll make the adjustment.
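A minimal sketch of the adjustment discussed in the thread below: test whether the host has a loopback interface at all (selected or not), and only report its absence on Linux. This is not the actual oob_tcp code or the attached loopback.diff. The names have_loopback(), on_linux() and should_warn_no_loopback() are hypothetical, and the sketch assumes a platform that provides getifaddrs().

```c
#define _DEFAULT_SOURCE   /* expose IFF_* flags under strict -std= modes on glibc */
#include <ifaddrs.h>
#include <net/if.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if ANY interface on this host is a loopback
 * that is UP, regardless of if_include/if_exclude selection. */
bool have_loopback(void)
{
    struct ifaddrs *list = NULL;
    bool found = false;

    if (getifaddrs(&list) != 0) {
        return false;   /* could not enumerate interfaces */
    }
    for (struct ifaddrs *ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
        if ((ifa->ifa_flags & IFF_LOOPBACK) && (ifa->ifa_flags & IFF_UP)) {
            found = true;
            break;
        }
    }
    freeifaddrs(list);
    return found;
}

/* Hypothetical helper: compile-time check for a Linux build. */
bool on_linux(void)
{
#ifdef __linux__
    return true;
#else
    return false;
#endif
}

/* Hypothetical policy from the thread: a missing loopback only matters
 * on Linux, so only warn there. */
bool should_warn_no_loopback(bool is_linux, bool loopback_present)
{
    return is_linux && !loopback_present;
}
```

A caller would print the warning only when should_warn_no_loopback(on_linux(), have_loopback()) is true.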
> On Dec 11, 2014, at 9:10 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Ralph,
>
> The "understanding" Gilles just expressed matches my own.
>
> The issue that the OP observed on an ARM/Linux system (and I was able to
> reproduce on Linux w/ any arch) is that when the LO interface is missing
> Linux is unable to pass loopback messages sent on ANY interface. The oob_tcp
> code was trying to connect to a 172.18.0.x address when I reproduced it.
>
> In summary:
>
> For LINUX the lack of a loopback interface (selected or not) prevents local
> connection.
> For NON-LINUX the lack of a loopback interface MAKES NO DIFFERENCE.
>
> So, I think Gilles's version is correct, but that making the logic (at least
> the reporting) conditional on Linux might be an improvement.
>
> Since this is a warning, it might be better to remove from 1.8 until we have
> more certainty about where/when it matters. I don't think users will
> appreciate a "cry wolf" release.
>
> -Paul
>
> On Thu, Dec 11, 2014 at 9:01 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> here is my understanding of what happens on Linux :
>
> lo: 127.0.0.1/8
> eth0: 192.168.122.101/24
>
> mpirun --mca orte_oob_tcp_if_include eth0 ...
>
> so the mpi task tries to contact orted/mpirun on 192.168.0.1/24
>
> that works just fine if the loopback interface is active,
> and that hangs if there is no loopback interface.
>
> imho that is a linux oddity, and OMPI has nothing to do with it
>
> Cheers,
>
> Gilles
>
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
> 64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
> 64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms
>
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms
>
> [root@slurm1 ~]# ifdown lo
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
>
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 11999ms
>
> On 2014/12/12 13:54, Ralph Castain wrote:
>> I honestly think it has to be a selected interface, Gilles, else we will
>> fail to connect.
>>
>>> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>> Paul,
>>>
>>> about the five warnings :
>>> can you confirm you are running mpirun *not* on n15 nor n16 ?
>>> if my guess is correct, then you can get up to 5 warnings : mpirun + 2
>>> orted + 2 mpi tasks
>>>
>>> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your
>>> openmpi-mca-params.conf ?
>>>
>>> here is attached a patch to fix this issue.
>>> what we really want is test there is a loopback interface, period.
>>> the current code (my bad for not having reviewed in a timely manner) seems
>>> to check
>>> there is a *selected* loopback interface.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/12/12 13:15, Paul Hargrove wrote:
>>>> Ralph,
>>>>
>>>> Sorry to be the bearer of more bad news.
>>>> The "good" news is I've seen the new warning regarding the lack of a
>>>> loopback interface.
>>>> The BAD news is that it is occurring on a Linux cluster that I've verified
>>>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>>>> RUNNING according to ifconfig).
>>>>
>>>> Though run with "-np 2" the warning appears FIVE times.
>>>> ADDITIONALLY, there is a SEGV at exit!
>>>>
>>>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>>>> numbers from the core (and there was no backtrace printed).
>>>>
>>>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>>>>
>>>> Let me know what tracing flags to apply to gather the info needed to debug
>>>> this.
>>>>
>>>> -Paul
>>>>
>>>>
>>>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>>>> --------------------------------------------------------------------------
>>>> WARNING: No loopback interface was found. This can cause problems
>>>> when we spawn processes as they are likely to be unable to connect
>>>> back to their host daemon. Sadly, it may take awhile for the connect
>>>> attempt to fail, so you may experience a significant hang time.
>>>>
>>>> You may wish to ctrl-c out of your job and activate loopback support
>>>> on at least one interface before trying again.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [... above message FOUR more times ...]
>>>> Process 1 exiting
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> Process 0 decremented value: 8
>>>> Process 0 decremented value: 7
>>>> Process 0 decremented value: 6
>>>> Process 0 decremented value: 5
>>>> Process 0 decremented value: 4
>>>> Process 0 decremented value: 3
>>>> Process 0 decremented value: 2
>>>> Process 0 decremented value: 1
>>>> Process 0 decremented value: 0
>>>> Process 0 exiting
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
>>>> 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>>
>>>> $ /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>>>>
>>>> $ ssh n15 /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>>>>
>>>> $ ssh n16 /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>>>>
>>>> $ gdb examples/ring_c core.29728
>>>> [...]
>>>> (gdb) where
>>>> #0  0x0000002a97a19980 in ?? ()
>>>> #1  <signal handler called>
>>>> #2  0x0000003a6d40607c in _Unwind_FindEnclosingFunction () from /lib64/libgcc_s.so.1
>>>> #3  0x0000003a6d406b57 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
>>>> #4  0x0000003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
>>>> #5  0x0000003a6c30ac50 in __pthread_unwind () from /lib64/tls/libpthread.so.0
>>>> #6  0x0000003a6c305202 in sigcancel_handler () from /lib64/tls/libpthread.so.0
>>>> #7  <signal handler called>
>>>> #8  0x0000003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
>>>> #9  0x0000002a978f8f7d in ?? ()
>>>> #10 0x002000010000000e in ?? ()
>>>> #11 0x0000000000000000 in ?? ()
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16525.php
>>> <loopback.diff>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16526.php
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16527.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16529.php
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16531.php