Paul,
About the five warnings:
can you confirm that you are running mpirun on neither n15 nor n16?
If my guess is correct, you can get up to five warnings: one from mpirun,
two from the orteds, and two from the MPI tasks.
Do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in
your openmpi-mca-params.conf?
Attached is a patch to fix this issue.
What we really want is to test whether there is a loopback interface, period.
The current code (my bad for not having reviewed it in a timely manner)
seems to check whether there is a *selected* loopback interface.
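
To illustrate the ordering problem, here is a minimal standalone sketch. It is not Open MPI code: struct iface, is_selected() and both function names are hypothetical, and the single "include" name only stands in for whatever filtering makes the real loop skip an interface. The point is just that recording the loopback flag after the filter can miss a loopback interface that exists but was not selected.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one network interface; not an Open MPI type. */
struct iface {
    const char *name;
    bool is_loopback;
};

/* Hypothetical filter: with no include name, every interface is selected. */
static bool is_selected(const struct iface *itf, const char *include)
{
    return include == NULL || strcmp(itf->name, include) == 0;
}

/* Ordering before the patch: only selected interfaces can set the flag. */
static bool loopback_checked_after_filter(const struct iface *ifs, size_t n,
                                          const char *include)
{
    bool loopback = false;
    for (size_t i = 0; i < n; i++) {
        if (!is_selected(&ifs[i], include)) {
            continue;               /* a skipped "lo" is never seen below */
        }
        if (ifs[i].is_loopback) {
            loopback = true;
        }
    }
    return loopback;
}

/* Ordering after the patch: record loopback first, then apply the filter. */
static bool loopback_checked_before_filter(const struct iface *ifs, size_t n,
                                           const char *include)
{
    bool loopback = false;
    for (size_t i = 0; i < n; i++) {
        if (ifs[i].is_loopback) {
            loopback = true;        /* counted whether or not it is selected */
        }
        if (!is_selected(&ifs[i], include)) {
            continue;
        }
        /* per-interface setup for selected interfaces would go here */
    }
    return loopback;
}
```

With two interfaces, "lo" (loopback) and "eth0", and an include name of "eth0", the first function returns false and the second returns true. With no include name, both return true.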
Cheers,
Gilles
On 2014/12/12 13:15, Paul Hargrove wrote:
> Ralph,
>
> Sorry to be the bearer of more bad news.
> The "good" news is I've seen the new warning regarding the lack of a
> loopback interface.
> The BAD news is that it is occurring on a Linux cluster that I've verified
> DOES have 'lo' configured on the front-end and compute nodes (UP and
> RUNNING according to ifconfig).
>
> Though run with "-np 2" the warning appears FIVE times.
> ADDITIONALLY, there is a SEGV at exit!
>
> Unfortunately, despite configuring with --enable-debug, I didn't get line
> numbers from the core (and there was no backtrace printed).
>
> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>
> Let me know what tracing flags to apply to gather the info needed to debug
> this.
>
> -Paul
>
>
> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
> --------------------------------------------------------------------------
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --------------------------------------------------------------------------
> [... above message FOUR more times ...]
> Process 1 exiting
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
> 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> $ /sbin/ifconfig lo
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
> TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:81039065 (77.2 MiB) TX bytes:81039065 (77.2 MiB)
>
> $ ssh n15 /sbin/ifconfig lo
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
> TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:1509940 (1.4 MiB) TX bytes:1509940 (1.4 MiB)
>
> $ ssh n16 /sbin/ifconfig lo
> lo Link encap:Local Loopback
> inet addr:127.0.0.1 Mask:255.0.0.0
> inet6 addr: ::1/128 Scope:Host
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
> TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:1543408 (1.4 MiB) TX bytes:1543408 (1.4 MiB)
>
> $ gdb examples/ring_c core.29728
> [...]
> (gdb) where
> #0 0x0000002a97a19980 in ?? ()
> #1 <signal handler called>
> #2 0x0000003a6d40607c in _Unwind_FindEnclosingFunction () from /lib64/libgcc_s.so.1
> #3 0x0000003a6d406b57 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
> #4 0x0000003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
> #5 0x0000003a6c30ac50 in __pthread_unwind () from /lib64/tls/libpthread.so.0
> #6 0x0000003a6c305202 in sigcancel_handler () from /lib64/tls/libpthread.so.0
> #7 <signal handler called>
> #8 0x0000003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
> #9 0x0000002a978f8f7d in ?? ()
> #10 0x002000010000000e in ?? ()
> #11 0x0000000000000000 in ?? ()
>
diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
index 3c42269..ecb6c28 100644
--- a/orte/mca/oob/tcp/oob_tcp_component.c
+++ b/orte/mca/oob/tcp/oob_tcp_component.c
@@ -460,6 +460,11 @@ static bool component_available(void)
/* look at all available interfaces */
for (i = opal_ifbegin(); i >= 0; i = opal_ifnext(i)) {
+ /* if this interface has loopback support, record that fact */
+ if (opal_ifisloopback(i)) {
+ loopback = true;
+ }
+
if (OPAL_SUCCESS != opal_ifindextoaddr(i, (struct sockaddr*) &my_ss,
sizeof (my_ss))) {
opal_output (0, "oob_tcp: problems getting address for index %i (kernel index %i)\n",
@@ -527,11 +532,6 @@ static bool component_available(void)
continue;
}
}
- /* if this interface has loopback support, record that fact */
- if (opal_ifisloopback(i)) {
- loopback = true;
- }
-
/* Refs ticket #3019
* it would probably be worthwhile to print out a warning if OMPI detects multiple
* IP interfaces that are "up" on the same subnet (because that's a Bad Idea). Note