Hi, IIRC there were some bug fixes between 1.8.1 and 1.8.2 so that all the published interfaces are really used.
By any chance, are you running a firewall on your head node? One possible explanation is that the compute node tries to access the public interface of the head node, and the packets get dropped by the firewall. If you are running a firewall, can you make a test without it? /* if you do need NAT, then just remove the DROP and REJECT rules */

Another possible explanation is that the compute node is doing (reverse) DNS requests with the public name and/or IP of the head node, and that takes some time to complete (success or failure, this does not really matter here). /* a simple test is to make sure all the hosts/IPs of the head node are in the /etc/hosts of the compute node */

Could you check your network config (firewall and DNS)? Can you reproduce the delay when running mpirun on the head node with one MPI task on the compute node? If yes, then the hard way to trace the delay issue would be to strace -ttt both orted and the MPI task launched on the compute node and see where the time is lost. /* at this stage, i would suspect orted ... */

Cheers,

Gilles

On Mon, Nov 10, 2014 at 5:56 PM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> On 10.11.2014 at 16:39, Ralph Castain wrote:
>
> > That is indeed bizarre - we haven’t heard of anything similar from other
> > users. What is your network configuration? If you use oob_tcp_if_include or
> > exclude, can you resolve the problem?
>
> Thx - this option helped to get it working.
>
> These tests were made for the sake of simplicity between the head node of the
> cluster and one (idle) compute node. I then tried between the (identical)
> compute nodes, and this worked fine. The head node of the cluster and the
> compute node are slightly different though (i.e. number of cores), and use
> eth1 resp. eth0 for the internal network of the cluster.
>
> I tried --hetero-nodes with no change.
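[Editor's note: Gilles's DNS hypothesis above can be checked directly. A minimal sketch follows; "localhost" is a stand-in, and on a real cluster you would pass the head node's public and internal hostnames. A multi-second entry in the output would match the kind of startup delay discussed in this thread.]

```python
import socket
import time

def time_lookups(names):
    """Time a forward + reverse DNS lookup for each name; a slow result
    here is the kind of delay Gilles suspects above."""
    results = {}
    for name in names:
        start = time.monotonic()
        try:
            addr = socket.gethostbyname(name)  # forward lookup
            socket.gethostbyaddr(addr)         # reverse lookup
            status = "ok"
        except OSError:                        # gaierror/herror both subclass OSError
            status = "failed"
        results[name] = (round(time.monotonic() - start, 3), status)
    return results

# "localhost" normally resolves from /etc/hosts, so it should be near-instant;
# a slow entry for the head node's public name would point at DNS.
print(time_lookups(["localhost"]))
```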
>
> Then I turned to:
>
> reuti@annemarie:~> date; mpiexec -mca btl self,tcp --mca oob_tcp_if_include 192.168.154.0/26 -n 4 --hetero-nodes --hostfile machines ./mpihello; date
>
> and the application started instantly. On another cluster, where the head node is identical to the compute nodes but with the same network setup as above, I observed a delay of "only" 30 seconds. Nevertheless, on this cluster too the addition that solved the issue was the correct "oob_tcp_if_include".
>
> The questions which remain: a) is this intended behavior, b) what changed in this scope between 1.8.1 and 1.8.2?
>
> -- Reuti
>
>
>
> >> On Nov 10, 2014, at 4:50 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>
> >> On 10.11.2014 at 12:50, Jeff Squyres (jsquyres) wrote:
> >>
> >>> Wow, that's pretty terrible! :(
> >>>
> >>> Is the behavior BTL-specific, perchance? E.g., if you only use certain BTLs, does the delay disappear?
> >>
> >> You mean something like:
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:44:34 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 3.
> >> Hello World from Node 2.
> >> Mon Nov 10 13:46:42 CET 2014
> >>
> >> (the above was even the latest v1.8.3-186-g978f61d)
> >>
> >> Falling back to 1.8.1 gives (as expected):
> >>
> >> reuti@annemarie:~> date; mpiexec -mca btl self,tcp -n 4 --hostfile machines ./mpihello; date
> >> Mon Nov 10 13:49:51 CET 2014
> >> Hello World from Node 1.
> >> Total: 4
> >> Universe: 4
> >> Hello World from Node 0.
> >> Hello World from Node 2.
> >> Hello World from Node 3.
> >> Mon Nov 10 13:49:53 CET 2014
> >>
> >>
> >> -- Reuti
> >>
> >>> FWIW: the use-all-IP-interfaces approach has been in OMPI forever.
> >>>
> >>> Sent from my phone. No type good.
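[Editor's note: the oob_tcp_if_include value used above is a CIDR subnet; only interfaces whose address falls inside it are used. A quick illustration of which addresses the /26 actually admits - the interface names and addresses here are made up for the example:]

```python
import ipaddress

# oob_tcp_if_include 192.168.154.0/26 selects interfaces by subnet:
# a /26 covers only 192.168.154.0 - 192.168.154.63.
cluster_net = ipaddress.ip_network("192.168.154.0/26")

# Hypothetical head-node addresses (public eth0, internal eth1, ib0).
candidates = {
    "eth0": "141.2.3.4",        # public address: outside the /26, excluded
    "eth1": "192.168.154.10",   # inside the /26: included
    "ib0":  "192.168.154.100",  # .64-.255 fall outside the /26: excluded
}

for iface, addr in candidates.items():
    print(iface, addr, ipaddress.ip_address(addr) in cluster_net)
```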
> >>>
> >>>> On Nov 10, 2014, at 6:42 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> >>>>
> >>>>> On 10.11.2014 at 12:24, Reuti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>>> On 09.11.2014 at 05:38, Ralph Castain wrote:
> >>>>>>
> >>>>>> FWIW: during MPI_Init, each process “publishes” all of its interfaces. Each process receives a complete map of that info for every process in the job. So when the TCP btl sets itself up, it attempts to connect across -all- the interfaces published by the other end.
> >>>>>>
> >>>>>> So it doesn’t matter what hostname is provided by the RM. We discover and “share” all of the interface info for every node, and then use them for load balancing.
> >>>>>
> >>>>> Does this lead to any time delay when starting up? I stayed with Open MPI 1.6.5 for some time and tried to use Open MPI 1.8.3 now. As there was a delay when the application started in my first compilation of 1.8.3, I disregarded even all my extra options and ran it outside of any queuing system - the delay remains - on two different clusters.
> >>>>
> >>>> I forgot to mention: the delay is more or less exactly 2 minutes from the time I issued `mpiexec` until the `mpihello` starts up (there is no delay for the initial `ssh` to reach the other node though).
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>> I tracked it down: up to 1.8.1 it works fine, but 1.8.2 already creates this delay when starting up a simple mpihello. I assume it may lie in the way other machines are reached, as with one single machine there is no delay. But using one (and only one - no tree spawn involved) additional machine already triggers this delay.
> >>>>>
> >>>>> Did anyone else notice it?
> >>>>>
> >>>>> -- Reuti
> >>>>>
> >>>>>
> >>>>>> HTH
> >>>>>> Ralph
> >>>>>>
> >>>>>>
> >>>>>>> On Nov 8, 2014, at 8:13 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>
> >>>>>>> OK, I figured; I'm going to have to read some more for my own curiosity.
> >>>>>>> The reason I mention the Resource Manager we use, and that the hostnames given by PBS/Torque match the 1gig-e interfaces, is that I'm curious what path it would take to get to a peer node when the node list given all matches the 1gig interfaces, yet data is being sent out the 10gig eoib0/ib0 interfaces.
> >>>>>>>
> >>>>>>> I'll go do some measurements and see.
> >>>>>>>
> >>>>>>> Brock Palen
> >>>>>>> www.umich.edu/~brockp
> >>>>>>> CAEN Advanced Computing
> >>>>>>> XSEDE Campus Champion
> >>>>>>> bro...@umich.edu
> >>>>>>> (734)936-1985
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Nov 8, 2014, at 8:30 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>>>>>>
> >>>>>>>> Ralph is right: OMPI aggressively uses all Ethernet interfaces by default.
> >>>>>>>>
> >>>>>>>> This short FAQ has links to 2 other FAQs that provide detailed information about reachability:
> >>>>>>>>
> >>>>>>>> http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
> >>>>>>>>
> >>>>>>>> The usNIC BTL uses UDP for its wire transport and actually does a much more standards-conformant peer reachability determination (i.e., it actually checks routing tables to see if it can reach a given peer), which has all kinds of caching benefits, kernel controls if you want them, etc. We haven't back-ported this to the TCP BTL because a) most people who use TCP for MPI still use a single L2 address space, and b) no one has asked for it. :-)
> >>>>>>>>
> >>>>>>>> As for the round-robin scheduling: there's no indication from the Linux TCP stack what the bandwidth is on a given IP interface. So unless you use the btl_tcp_bandwidth_<IP_INTERFACE_NAME> (e.g., btl_tcp_bandwidth_eth0) MCA params, OMPI will round-robin across them equally.
> >>>>>>>>
> >>>>>>>> If you have multiple IP interfaces sharing a single physical link, there will likely be no benefit from having Open MPI use more than one of them.
> >>>>>>>> You should probably use btl_tcp_if_include / btl_tcp_if_exclude to select just one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Nov 7, 2014, at 2:53 PM, Brock Palen <bro...@umich.edu> wrote:
> >>>>>>>>>
> >>>>>>>>> I was doing a test on our IB-based cluster, where I was disabling IB:
> >>>>>>>>>
> >>>>>>>>> --mca btl ^openib --mca mtl ^mxm
> >>>>>>>>>
> >>>>>>>>> I was sending very large messages (>1GB) and I was surprised by the speed.
> >>>>>>>>>
> >>>>>>>>> I noticed then that of all our Ethernet interfaces:
> >>>>>>>>>
> >>>>>>>>> eth0 (1gig-e)
> >>>>>>>>> ib0 (IP over IB, for Lustre configuration at vendor request)
> >>>>>>>>> eoib0 (Ethernet-over-IB interface for an IB -> Ethernet gateway, for some external storage support at >1Gig speed)
> >>>>>>>>>
> >>>>>>>>> I saw all three were getting traffic.
> >>>>>>>>>
> >>>>>>>>> We use Torque for our Resource Manager and use TM support; the hostnames given by Torque match the eth0 interfaces.
> >>>>>>>>>
> >>>>>>>>> How does OMPI figure out that it can also talk over the others? How does it choose to load-balance?
> >>>>>>>>>
> >>>>>>>>> BTW that is fine, but we will use if_exclude on one of the IB ones, as ib0 and eoib0 are the same physical device and may screw with load balancing if anyone ever falls back to TCP.
> >>>>>>>>>
> >>>>>>>>> Brock Palen
> >>>>>>>>> www.umich.edu/~brockp
> >>>>>>>>> CAEN Advanced Computing
> >>>>>>>>> XSEDE Campus Champion
> >>>>>>>>> bro...@umich.edu
> >>>>>>>>> (734)936-1985
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> users mailing list
> >>>>>>>>> us...@open-mpi.org
> >>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/11/25709.php
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Jeff Squyres
> >>>>>>>> jsquy...@cisco.com
> >>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
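[Editor's note: the striping behavior Jeff describes - equal round-robin across interfaces unless btl_tcp_bandwidth_* weights are set - can be modeled with a toy scheduler. This is only an illustration of the scheduling idea, not Open MPI's actual implementation; the interface names and weights are made up:]

```python
def schedule_fragments(n_fragments, bandwidths):
    """Toy model of bandwidth-weighted striping: each fragment goes to the
    interface currently furthest below its bandwidth share. With all-equal
    weights this reproduces the plain round-robin default."""
    total = sum(bandwidths.values())
    shares = {ifc: bw / total for ifc, bw in bandwidths.items()}
    counts = {ifc: 0 for ifc in bandwidths}
    for _ in range(n_fragments):
        # pick the interface with the largest deficit relative to its share
        ifc = min(counts, key=lambda i: counts[i] / n_fragments - shares[i])
        counts[ifc] += 1
    return counts

# Equal weights -> even split, the default described in the thread.
print(schedule_fragments(90, {"eth0": 1, "ib0": 1, "eoib0": 1}))
# Weights in the spirit of btl_tcp_bandwidth_eth0=1000, btl_tcp_bandwidth_ib0=10000
# -> ib0 carries ten times the fragments.
print(schedule_fragments(110, {"eth0": 1000, "ib0": 10000}))
```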