Kewl - thanks to both of you for the explanation. I’ll make the adjustment.

> On Dec 11, 2014, at 9:10 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
> Ralph,
> 
> The "understanding" Gilles just expressed matches my own.
> 
> The issue that the OP observed on an ARM/Linux system (and I was able to 
> reproduce on Linux w/ any arch) is that when the LO interface is missing 
> Linux is unable to pass loopback messages sent on ANY interface.  The oob_tcp 
> code was trying to connect to a 172.18.0.x address when I reproduced it.
> 
> In summary:
> 
> For LINUX the lack of a loopback interface (selected or not) prevents local 
> connection.
> For NON-LINUX the lack of a loopback interface MAKES NO DIFFERENCE.
> 
> So, I think Gilles's version is correct, but that making the logic (at least 
> the reporting) conditional on Linux might be an improvement.
> 
> Since this is a warning, it might be better to remove it from 1.8 until we have 
> more certainty about where/when it matters.  I don't think users will 
> appreciate a "cry wolf" release.
> 
> -Paul
> 
> On Thu, Dec 11, 2014 at 9:01 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
> 
> here is my understanding of what happens on Linux :
> 
> lo: 127.0.0.1/8
> eth0: 192.168.122.101/24
> 
> mpirun --mca orte_oob_tcp_if_include eth0 ...
> 
> so the mpi task tries to contact orted/mpirun on 192.168.0.1/24
> 
> that works just fine if the loopback interface is active,
> and that hangs if there is no loopback interface.
> 
> 
> imho that is a linux oddity, and OMPI has nothing to do with it
> 
> Cheers,
> 
> Gilles
> 
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 64 bytes from 192.168.122.101: icmp_seq=1 ttl=64 time=0.013 ms
> 64 bytes from 192.168.122.101: icmp_seq=2 ttl=64 time=0.009 ms
> 64 bytes from 192.168.122.101: icmp_seq=3 ttl=64 time=0.011 ms
> 
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.009/0.011/0.013/0.001 ms
> 
> 
> 
> [root@slurm1 ~]# ifdown lo
> [root@slurm1 ~]# ping -c 3 192.168.122.101
> PING 192.168.122.101 (192.168.122.101) 56(84) bytes of data.
> 
> --- 192.168.122.101 ping statistics ---
> 3 packets transmitted, 0 received, 100% packet loss, time 11999ms
> 
> 
> 
> On 2014/12/12 13:54, Ralph Castain wrote:
>> I honestly think it has to be a selected interface, Gilles, else we will 
>> fail to connect.
>> 
>>> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>> 
>>> Paul,
>>> 
>>> about the five warnings :
>>> can you confirm you are running mpirun *not* on n15 nor n16 ?
>>> if my guess is correct, then you can get up to 5 warnings : mpirun + 2 
>>> orted + 2 mpi tasks
>>> 
>>> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your 
>>> openmpi-mca-params.conf ?
>>> 
>>> attached is a patch to fix this issue.
>>> what we really want is to test whether there is a loopback interface, period.
>>> the current code (my bad for not having reviewed it in a timely manner) seems 
>>> to check
>>> that there is a *selected* loopback interface.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/12/12 13:15, Paul Hargrove wrote:
>>>> Ralph,
>>>> 
>>>> Sorry to be the bearer of more bad news.
>>>> The "good" news is I've seen the new warning regarding the lack of a
>>>> loopback interface.
>>>> The BAD news is that it is occurring on a Linux cluster that I've verified
>>>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>>>> RUNNING according to ifconfig).
>>>> 
>>>> Though run with "-np 2" the warning appears FIVE times.
>>>> ADDITIONALLY, there is a SEGV at exit!
>>>> 
>>>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>>>> numbers from the core (and there was no backtrace printed).
>>>> 
>>>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>>>> 
>>>> Let me know what tracing flags to apply to gather the info needed to debug
>>>> this.
>>>> 
>>>> -Paul
>>>> 
>>>> 
>>>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>>>> --------------------------------------------------------------------------
>>>> WARNING: No loopback interface was found. This can cause problems
>>>> when we spawn processes as they are likely to be unable to connect
>>>> back to their host daemon. Sadly, it may take awhile for the connect
>>>> attempt to fail, so you may experience a significant hang time.
>>>> 
>>>> You may wish to ctrl-c out of your job and activate loopback support
>>>> on at least one interface before trying again.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> [... above message FOUR more times ...]
>>>> Process 1 exiting
>>>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> Process 0 decremented value: 8
>>>> Process 0 decremented value: 7
>>>> Process 0 decremented value: 6
>>>> Process 0 decremented value: 5
>>>> Process 0 decremented value: 4
>>>> Process 0 decremented value: 3
>>>> Process 0 decremented value: 2
>>>> Process 0 decremented value: 1
>>>> Process 0 decremented value: 0
>>>> Process 0 exiting
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
>>>> 11 (Segmentation fault).
>>>> --------------------------------------------------------------------------
>>>> 
>>>> $ /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>>>> 
>>>> $ ssh n15 /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>>>> 
>>>> $ ssh n16 /sbin/ifconfig lo
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>>>> 
>>>> $ gdb examples/ring_c core.29728
>>>> [...]
>>>> (gdb) where
>>>> #0  0x0000002a97a19980 in ?? ()
>>>> #1  <signal handler called>
>>>> #2  0x0000003a6d40607c in _Unwind_FindEnclosingFunction () from /lib64/libgcc_s.so.1
>>>> #3  0x0000003a6d406b57 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
>>>> #4  0x0000003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
>>>> #5  0x0000003a6c30ac50 in __pthread_unwind () from /lib64/tls/libpthread.so.0
>>>> #6  0x0000003a6c305202 in sigcancel_handler () from /lib64/tls/libpthread.so.0
>>>> #7  <signal handler called>
>>>> #8  0x0000003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
>>>> #9  0x0000002a978f8f7d in ?? ()
>>>> #10 0x002000010000000e in ?? ()
>>>> #11 0x0000000000000000 in ?? ()
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/12/16525.php
>>>  <loopback.diff>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16526.php
>> 
>> 
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16527.php
> 
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16529.php
> 
> 
> 
> -- 
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16531.php
