Paul,

about the five warnings: can you confirm that you are running mpirun on a node other than n15 or n16? If my guess is correct, you can get up to five warnings (mpirun + 2 orted + 2 MPI tasks).

Do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your openmpi-mca-params.conf?

Attached is a patch to fix this issue. What we really want is to test whether there is a loopback interface, period. The current code (my bad for not having reviewed it in a timely manner) seems to check whether there is a *selected* loopback interface.

Cheers,

Gilles

On 2014/12/12 13:15, Paul Hargrove wrote:
> Ralph,
>
> Sorry to be the bearer of more bad news.
> The "good" news is I've seen the new warning regarding the lack of a
> loopback interface.
> The BAD news is that it is occurring on a Linux cluster that I've verified
> DOES have 'lo' configured on the front-end and compute nodes (UP and
> RUNNING according to ifconfig).
>
> Though run with "-np 2" the warning appears FIVE times.
> ADDITIONALLY, there is a SEGV at exit!
>
> Unfortunately, despite configuring with --enable-debug, I didn't get line
> numbers from the core (and there was no backtrace printed).
>
> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>
> Let me know what tracing flags to apply to gather the info needed to debug
> this.
>
> -Paul
>
>
> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
> --------------------------------------------------------------------------
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --------------------------------------------------------------------------
> [... above message FOUR more times ...]
> Process 1 exiting
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
> 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> $ /sbin/ifconfig lo
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>
> $ ssh n15 /sbin/ifconfig lo
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>
> $ ssh n16 /sbin/ifconfig lo
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>
> $ gdb examples/ring_c core.29728
> [...]
> (gdb) where
> #0  0x0000002a97a19980 in ?? ()
> #1  <signal handler called>
> #2  0x0000003a6d40607c in _Unwind_FindEnclosingFunction () from /lib64/libgcc_s.so.1
> #3  0x0000003a6d406b57 in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
> #4  0x0000003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
> #5  0x0000003a6c30ac50 in __pthread_unwind () from /lib64/tls/libpthread.so.0
> #6  0x0000003a6c305202 in sigcancel_handler () from /lib64/tls/libpthread.so.0
> #7  <signal handler called>
> #8  0x0000003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
> #9  0x0000002a978f8f7d in ?? ()
> #10 0x002000010000000e in ?? ()
> #11 0x0000000000000000 in ?? ()
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16525.php

diff --git a/orte/mca/oob/tcp/oob_tcp_component.c b/orte/mca/oob/tcp/oob_tcp_component.c
index 3c42269..ecb6c28 100644
--- a/orte/mca/oob/tcp/oob_tcp_component.c
+++ b/orte/mca/oob/tcp/oob_tcp_component.c
@@ -460,6 +460,11 @@ static bool component_available(void)
 
     /* look at all available interfaces */
     for (i = opal_ifbegin(); i >= 0; i = opal_ifnext(i)) {
+        /* if this interface has loopback support, record that fact */
+        if (opal_ifisloopback(i)) {
+            loopback = true;
+        }
+
         if (OPAL_SUCCESS != opal_ifindextoaddr(i, (struct sockaddr*) &my_ss,
                                                sizeof (my_ss))) {
             opal_output (0, "oob_tcp: problems getting address for index %i (kernel index %i)\n",
@@ -527,11 +532,6 @@ static bool component_available(void)
                 continue;
             }
         }
-        /* if this interface has loopback support, record that fact */
-        if (opal_ifisloopback(i)) {
-            loopback = true;
-        }
-
         /* Refs ticket #3019
          * it would probably be worthwhile to print out a warning if OMPI detects multiple
          * IP interfaces that are "up" on the same subnet (because that's a Bad Idea).  Note
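
To illustrate why the position of the loopback test inside the loop matters, here is a minimal standalone sketch. It does not use the Open MPI code: struct iface, is_selected() and both loopback_seen_* functions are hypothetical names made up for this illustration, and the include list is only a stand-in for whatever filtering the real component_available() applies before it reaches the loopback test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one network interface; not an Open MPI type. */
struct iface {
    const char *name;
    bool is_loopback;
};

/* Hypothetical filter: is this interface named in the include list? */
static bool is_selected(const struct iface *ifc,
                        const char *const *include, size_t n_include)
{
    for (size_t j = 0; j < n_include; j++) {
        if (0 == strcmp(ifc->name, include[j])) {
            return true;
        }
    }
    return false;
}

/* Old ordering: the loopback test sits after the filter, so a loopback
 * interface that is not selected is skipped before it can be recorded. */
static bool loopback_seen_after_filter(const struct iface *ifs, size_t n,
                                       const char *const *include,
                                       size_t n_include)
{
    bool loopback = false;
    for (size_t i = 0; i < n; i++) {
        if (!is_selected(&ifs[i], include, n_include)) {
            continue;
        }
        if (ifs[i].is_loopback) {
            loopback = true;
        }
    }
    return loopback;
}

/* Patched ordering: record loopback support first, then filter. */
static bool loopback_seen_before_filter(const struct iface *ifs, size_t n,
                                        const char *const *include,
                                        size_t n_include)
{
    bool loopback = false;
    for (size_t i = 0; i < n; i++) {
        if (ifs[i].is_loopback) {
            loopback = true;
        }
        if (!is_selected(&ifs[i], include, n_include)) {
            continue;
        }
        /* ... per-interface setup would go here ... */
    }
    return loopback;
}
```

With two interfaces, lo (loopback) and eth0, and only eth0 in the include list, loopback_seen_after_filter() returns false, which corresponds to the spurious warning, while loopback_seen_before_filter() returns true.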