Aha! You are the first to fall through the timeout. How interesting. Can you please try adding “-mca oob_tcp_connect_timeout 5:0”?
On Dec 12, 2014, at 8:53 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> First, I want to ask what became of the issue discussed in this thread?
> http://www.open-mpi.org/community/lists/devel/2014/11/16160.php
> I thought we had concluded that one just needed -D_REENTRANT.
> I mention that only for completeness, because I think my current problem is
> different.
>
> The following works fine with 1.8.3, making the current behavior a regression.
>
> I am still on the same system as that previous report, and still/again see a
> message like the following:
>
> ------------------------------------------------------------
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:    pcp-j-19
>   Remote host:   172.18.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> ------------------------------------------------------------
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> [...etc...]
>
> It may be worth noting that the hostname pcp-j-19 (172.16.0.119) and the
> address 172.18.0.120 are on different subnets.
>
> I CANNOT resolve the issue this time by adding -D_REENTRANT to CFLAGS at
> configure time (I didn't bother to check whether it is there by default now).
>
> NOR can I resolve it by using "-mca oob_tcp_if_include bge0" to allow only
> the 172.16.0.120 subnet.
> IN FACT, the message is the same with that option, other than "172.18"
> changing to "172.16".
>
> I've attached the output generated by "-mca oob_base_verbose 20" both with
> and without the oob_tcp_if_include.
>
> I should also note that the following is my full mpirun command, which
> excludes the tcp BTL.
>
> pcp-j-20$ mpirun -mca oob_tcp_if_include bge0 -mca oob_base_verbose 20 -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
>
> -Paul
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> <stdout-inc.txt><stderr-2if.txt>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/12/16551.php
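The quoted report notes that pcp-j-19 (172.16.0.119) and 172.18.0.120 are on different subnets. That observation can be checked mechanically with Python's standard ipaddress module. This is a minimal sketch: the /16 prefix length is an assumption, since the message does not state the actual netmasks, and the helper name same_subnet is hypothetical. Under a wider mask such as /12, both addresses would fall in the same network.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int = 16) -> bool:
    """Return True if addr_b lies in addr_a's network for the given prefix length.

    Hypothetical helper; the default /16 prefix is an assumption, because the
    real netmasks of the pcp-j nodes are not given in the report.
    """
    # strict=False allows building the network from a host address (host bits set)
    network = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in network


if __name__ == "__main__":
    local = "172.16.0.119"  # pcp-j-19, per the quoted report
    # 172.18.0.120 is the remote address from the error message;
    # 172.16.0.120 is the bge0 subnet address mentioned in the report.
    for remote in ("172.18.0.120", "172.16.0.120"):
        print(remote, same_subnet(local, remote))
```

With the assumed /16 mask, this prints False for 172.18.0.120 and True for 172.16.0.120, matching the report's observation that the failing connection targets an address outside pcp-j-19's subnet.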