On Fri, Feb 15, 2008 at 09:02:10AM -0500, Tim Prins wrote:

> >> 3. If the exclude list does not contain 'lo', or the include list
> >> contains 'lo', the job hangs when using multiple nodes:

> > That's weird. Loopback interfaces should automatically be excluded right
> > from the beginning. See opal/util/if.c.

> I took a quick glance at this file, and I'd be lying if I said I
> understood what was going on in it. One thing I did notice is that the
> parameter btl_tcp_if_exclude defaults to 'lo', but the user can of
> course overwrite it.

I was wrong. To be more precise, there are conflicting comments in if.c:

    #if 0
            if ((ifr->ifr_flags & IFF_LOOPBACK) != 0)
                continue;
    #endif

And:

    /* skip interface if it is a loopback device (IFF_LOOPBACK set) */
    /* or if it is a point-to-point interface */
    /* TODO: do we really skip p2p? */
    if (0 != (cur_ifaddrs->ifa_flags & IFF_LOOPBACK)
        || 0 != (cur_ifaddrs->ifa_flags & IFF_POINTOPOINT)) {
        continue;
    }

and:

    if ((!IN6_IS_ADDR_LOOPBACK (&my_addr->sin6_addr))
        && (!IN6_IS_ADDR_LINKLOCAL (&my_addr->sin6_addr))) {
        /* create interface for newly found address */

and:

    /* generate the interface name on your own ....
       loopback: lo
       Rest:     eth0, eth1, ..... */

    if (if_list[i].iiFlags & IFF_LOOPBACK) {
        sprintf (intf.if_name, "lo");
    } else {
        sprintf (intf.if_name, "eth%u", interface_counter++);
    }

To be honest: When porting to IPv6, I've excluded lo, because I see no
use in using it. That is what the code reflects: 127.0.0.1 is included
(IPv4-lo), but ::1 is excluded (IPv6-lo).

> It might be worth looking into this further. If the user got an error or
> the job aborted if they did something wrong with 'lo' I would not worry
> about it at all. But the fact that it causes a hang is worrisome to me.

It could be treated as the user's fault. I see three approaches:

a) remove lo globally (in if.c). I expect objections. ;)

b) print a warning from BTL/TCP if the interfaces in use contain lo.
   Like "Warning: You've included the loopback for communication.
         This may cause hanging processes due to unreachable peers."

c) Throw away 127.0.0.1 on the remote side. But when doing so, what's
   the use for including it at all?

So as mentioned earlier: It could be the user's fault. ;)

If he includes lo, this means he wants to announce 127.0.0.1 to remote
peers. And this sounds useless (unless you want local communication
without SM).

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
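
For illustration only, approach (b) might look roughly like the following
C sketch. It is not code from Open MPI: the helper names name_in_list()
and warn_if_loopback_listed(), and the idea of passing the include list
as a comma-separated string, are assumptions made for this example. It
checks the same IFF_LOOPBACK bit on ifa_flags that the if.c fragment
quoted earlier tests, and the warning text is the one proposed in (b).

```c
/* Expose IFF_LOOPBACK and getifaddrs() under strict -std= modes on glibc. */
#define _DEFAULT_SOURCE

#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <ifaddrs.h>

/* Hypothetical helper: return 1 if `name` is an exact entry in the
 * comma-separated `list` (e.g. "eth0,lo"), 0 otherwise. */
int name_in_list(const char *list, const char *name)
{
    assert(list != NULL && name != NULL);

    size_t name_len = strlen(name);
    const char *p = list;

    while (*p != '\0') {
        size_t tok_len = strcspn(p, ",");   /* length of current entry */
        if (tok_len == name_len && strncmp(p, name, name_len) == 0) {
            return 1;
        }
        p += tok_len;
        if (*p == ',') {
            p++;                            /* skip the separator */
        }
    }
    return 0;
}

/* Hypothetical helper: warn once per IPv4/IPv6 address that sits on a
 * loopback interface named in `include_list`. Returns the number of
 * warnings printed, or -1 if getifaddrs() fails. */
int warn_if_loopback_listed(const char *include_list, FILE *out)
{
    struct ifaddrs *all = NULL;
    int warnings = 0;

    if (getifaddrs(&all) != 0) {
        return -1;
    }

    for (struct ifaddrs *cur = all; cur != NULL; cur = cur->ifa_next) {
        if (cur->ifa_addr == NULL) {
            continue;                       /* entry carries no address */
        }
        int family = cur->ifa_addr->sa_family;
        if (family != AF_INET && family != AF_INET6) {
            continue;                       /* only IP addresses matter here */
        }
        if ((cur->ifa_flags & IFF_LOOPBACK) == 0) {
            continue;                       /* not a loopback device */
        }
        if (!name_in_list(include_list, cur->ifa_name)) {
            continue;                       /* user did not ask for it */
        }
        fprintf(out,
                "Warning: You've included the loopback (%s) for communication.\n"
                "         This may cause hanging processes due to unreachable peers.\n",
                cur->ifa_name);
        warnings++;
    }

    freeifaddrs(all);
    return warnings;
}
```

A caller would pass the user's include list, for example
warn_if_loopback_listed("eth0,lo", stderr). The sketch only warns and
never drops the interface, so it leaves the "local communication
without SM" case usable.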